# Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?


In classification, training on a skewed dataset without proper calibration can produce a model that is biased towards the majority labels. For example, if one class makes up 95% of the data, the learned model could simply predict the majority class regardless of the features and still reach 95% accuracy (a misclassification rate of only 5%)! Similar problems can occur in regression.
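To make the 95% example concrete, here is a minimal sketch using a hypothetical toy label vector (the data and helper names are made up for illustration). It shows that a classifier which always predicts the majority class scores well on accuracy while never finding a single minority instance.

```python
# Hypothetical toy labels: 95 majority-class rows (0) and 5 minority-class rows (1).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores the features and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

This is why accuracy alone is a poor guide on skewed data: the minority-class recall exposes what the headline number hides.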

To mitigate this problem, you could try a few strategies:

• Sampling — Up/down sampling the dataset to ensure equal representation for all the classes.
• Weighting — Some training algorithms, such as tree-based algorithms, can take instance weights as parameters. Attach higher weights to instances with less frequent labels.
• Avoid aggressive feature selection without balancing the dataset using sampling or weighting.
• Calibration — For binary classification, calibration of the models (using Platt’s scaling or isotonic regression) after training could also help.
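The sampling and weighting strategies above can be sketched in plain Python. This is only an illustration under assumptions: the helper names are hypothetical, the weight formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)), and up-sampling is done by drawing minority rows with replacement.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def upsample(rows, labels, seed=0):
    """Resample every smaller class with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: row ids 0..99, with 95 majority and 5 minority labels.
labels = [0] * 95 + [1] * 5
rows = list(range(100))

weights = balanced_class_weights(labels)          # minority class gets the larger weight
new_rows, new_labels = upsample(rows, labels)     # both classes end up with 95 rows
print(weights)
print(Counter(new_labels))
```

If resampling is used, it should be applied to the training split only, so that evaluation data keeps the original class proportions.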

• In long-tailed distributions, a high-frequency population is followed by a low-frequency population, which gradually tails off asymptotically.
• Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution.
• The least frequently occurring 80% of items are, taken together, still an important proportion of the total population, so the tail cannot simply be ignored.
• Zipf’s law, Pareto distribution, power laws
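The 80/20 rule of thumb can be checked numerically. The sketch below is an assumption-laden illustration: it draws samples from a Pareto distribution whose shape parameter is chosen so that the theoretical split is 80/20, and compares the share held by the top 20% of items with that of a uniform sample. The helper `top_share` and the sample sizes are made up for the example, and the sampled share will only approximate the theoretical value because heavy-tailed sums converge slowly.

```python
import math
import random


def top_share(values, top_fraction=0.2):
    """Fraction of the total accounted for by the largest `top_fraction` of items."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


rng = random.Random(42)

# Shape for which a Pareto distribution gives a theoretical 80/20 split.
alpha = math.log(5) / math.log(4)

heavy = [rng.paretovariate(alpha) for _ in range(100_000)]  # long-tailed sample
light = [rng.uniform(0.0, 1.0) for _ in range(100_000)]     # no long tail

print(f"Pareto  top-20% share: {top_share(heavy):.2f}")
print(f"Uniform top-20% share: {top_share(light):.2f}")
```

In the long-tailed sample the top 20% of items hold well over half of the total, while in the uniform sample they hold roughly a third.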

Examples:
1) Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table (Zipf’s law)
– The most frequent word occurs about twice as often as the second most frequent, three times as often as the third most frequent…
– In the Brown Corpus, “the” accounts for nearly 7% of all word occurrences (about 70,000 out of roughly 1 million)
– “of” accounts for about 3.5%, followed by “and”…
– Only 135 vocabulary items are needed to account for half of that corpus!

2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
3) File size distribution of Internet traffic
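For the natural language example, the rank/frequency relationship can be sketched with an idealised Zipf distribution, where frequency is exactly proportional to 1 / rank. The vocabulary size below is an arbitrary assumption and the function names are hypothetical; a real corpus only follows this law approximately, so the numbers will not match corpus statistics exactly.

```python
def zipf_frequencies(vocab_size):
    """Relative frequencies proportional to 1 / rank, normalised to sum to 1."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def ranks_needed(freqs, coverage=0.5):
    """Smallest number of top-ranked items whose frequencies reach `coverage`."""
    cumulative = 0.0
    for rank, f in enumerate(freqs, start=1):
        cumulative += f
        if cumulative >= coverage:
            return rank
    return len(freqs)


freqs = zipf_frequencies(50_000)  # vocabulary size is an assumption

print(freqs[0] / freqs[1])        # rank 1 is twice as frequent as rank 2
print(freqs[0] / freqs[2])        # and about three times as frequent as rank 3
print(ranks_needed(freqs, 0.5))   # how many top words cover half of all occurrences
```

Even in this idealised setting, a tiny fraction of the vocabulary covers half of all word occurrences, while the remaining words form the long tail.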

Additional: hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites

Importance in classification and regression problems:
– Skewed distribution
– Which metrics to use? Accuracy can be misleading (the accuracy paradox in classification); consider F-score or AUC instead
– Issue when using models that assume linearity (linear regression): you may need to apply a monotone transformation to the data (logarithm, square root, sigmoid function…)
– Issue when sampling: random sampling can leave your data even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla) or an anomaly detection approach
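As an illustration of the monotone transformation mentioned in the list above, the sketch below applies a logarithm to a hypothetical long-tailed regression target and compares the skewness before and after. The log-normal data, the `skewness` helper and the sample size are assumptions made for the example; the logarithm requires strictly positive values.

```python
import math
import random


def skewness(xs):
    """Third standardised moment (population form); 0 for a symmetric sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 3 for x in xs) / n / var ** 1.5


rng = random.Random(0)

# Hypothetical long-tailed target (e.g. prices or file sizes), drawn log-normally.
y = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

# Monotone transformation: fit the model on log(y) instead of y.
y_log = [math.log(v) for v in y]

print(f"skewness before log: {skewness(y):.2f}")
print(f"skewness after log:  {skewness(y_log):.2f}")

# Predictions made on the log scale are mapped back with the inverse transform.
y_back = [math.exp(v) for v in y_log]
```

After the transformation the target is close to symmetric, which suits models such as linear regression better than the raw, heavily skewed values.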