In classification, training on a skewed dataset without proper calibration can lead to models that are biased towards the majority labels. For example, if one class is present in 95% of a classification dataset, the learned model could simply predict the majority class irrespective of the features and still reach 95% accuracy (a misclassification rate of only 5%)! Similar problems can occur in regression.
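The 95% example above can be checked with a few lines of Python. This is a minimal sketch on made-up labels (the 95/5 split and the helper names `accuracy` and `recall` are illustrative assumptions, not part of any library): a predictor that always outputs the majority class looks accurate while never finding a single minority instance.

```python
# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) instances.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores the features and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the predictions recover."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return hits / actual if actual else 0.0


print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

High accuracy with zero recall on the minority class is the reason accuracy alone is a poor guide on skewed data.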
To mitigate this problem, you could try a few strategies:
- Sampling: up-sample the minority classes or down-sample the majority class so that all classes are equally represented.
- Weighting: some training algorithms, such as tree-based algorithms, can take instance weights as parameters. Attach higher weights to instances with less frequent labels.
- Avoid aggressive feature selection without first balancing the dataset using sampling or weighting.
- Calibration: for binary classification, calibrating the model (using Platt scaling or isotonic regression) after training could also help.
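The sampling and weighting strategies above can be sketched in plain Python. The function names, the toy labels and the placeholder rows below are assumptions made for illustration. The weighting formula, n_samples / (n_classes * class_count), is one common heuristic (it is the same one scikit-learn uses for `class_weight="balanced"`), not the only reasonable choice.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def upsample_to_largest(rows, labels, seed=0):
    """Keep all original rows and add random copies (drawn with replacement)
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical skewed dataset: 95 "neg" and 5 "pos" instances.
labels = ["neg"] * 95 + ["pos"] * 5
rows = list(range(len(labels)))  # placeholder feature rows

weights = balanced_class_weights(labels)
instance_weights = [weights[y] for y in labels]  # one weight per training instance
new_rows, new_labels = upsample_to_largest(rows, labels)

print(weights["pos"])       # 10.0
print(Counter(new_labels))  # both classes now have 95 instances
```

The per-instance weights would then be passed to whatever instance-weight parameter the training algorithm exposes. Up-sampling should be applied to the training split only, so that duplicated rows do not leak into evaluation data.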
- In long-tailed distributions, a high-frequency population is followed by a low-frequency population, which gradually tails off asymptotically.
- Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution.
- The least frequently occurring 80% of items carry more weight, as a proportion of the total population, than they would in a distribution without a long tail.
- Zipf’s law, Pareto distribution, power laws
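The 80/20 rule of thumb can be tied to the Pareto distribution with a short calculation. The sketch below assumes an ideal Pareto distribution with shape parameter alpha > 1, for which the share of the total held by the top fraction p of items is p ** (1 - 1/alpha). The function name `top_share` is illustrative.

```python
import math


def top_share(p, alpha):
    """Share of the total accounted for by the top fraction p of items,
    assuming an ideal Pareto distribution with shape alpha > 1."""
    return p ** (1.0 - 1.0 / alpha)


# Shape parameter at which the top 20% account for exactly 80%.
alpha_80_20 = math.log(5) / math.log(4)  # about 1.16

print(round(top_share(0.20, alpha_80_20), 3))  # 0.8
print(round(top_share(0.20, 2.0), 3))          # 0.447, a lighter tail concentrates less
```

Real data rarely follows the ideal distribution exactly, so the 80/20 split is best read as an order of magnitude rather than a precise figure.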
1) Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
– The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent…
– In the Brown Corpus of American English, “the” accounts for nearly 7% of all word occurrences (about 70,000 out of slightly over 1 million)
– “of” accounts for about 3.5%, followed by “and”…
– Only 135 vocabulary items are needed to account for half of that corpus!
2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
3) File size distribution of Internet traffic
Additional examples: hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
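For the natural language example, the rank-frequency relationship can be inspected directly. This is a minimal sketch on a made-up token list, built so that the counts follow an ideal Zipf law with exponent 1. The helper names are illustrative, and a real corpus will only follow the law approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    ordered = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_expected(top_count, rank):
    """Count predicted for a rank by an ideal Zipf law with exponent 1."""
    return top_count / rank


# Hypothetical toy corpus: counts chosen as 60, 60/2, 60/3 and 60/4.
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15

table = rank_frequency(tokens)
top_count = table[0][2]
for rank, word, count in table:
    # Observed count next to the count Zipf's law predicts for that rank.
    print(rank, word, count, zipf_expected(top_count, rank))
```

On real text, plotting count against rank on log-log axes and checking for a roughly straight line is a common way to judge how closely the data follows a power law.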
Importance in classification and regression problems:
– Skewed distribution
– Which metrics to use? Accuracy can be misleading (the accuracy paradox in classification); consider the F-score or AUC instead
– Issue when using models that assume linearity (such as linear regression): you may need to apply a monotone transformation to the data (logarithm, square root, sigmoid function…)
– Issue when sampling: your data can become even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla) or an anomaly detection approach
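The stratified sampling point can be illustrated with a small split routine. This is a sketch under stated assumptions: the function name `stratified_split`, the toy labels and the 20% test fraction are all illustrative. The idea is to split each class separately so that the minority class keeps the same share in both parts.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical skewed labels: 95 majority (0) and 5 minority (1) instances.
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(Counter(labels[i] for i in test_idx))   # 19 of class 0, 1 of class 1
print(Counter(labels[i] for i in train_idx))  # 76 of class 0, 4 of class 1
```

With plain random sampling, a 20-instance test set drawn from this data could easily contain no minority instances at all. Note that with very small classes the rounding above can still leave a class out of the test part, so a real implementation may want a minimum per class.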