Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

MockInterview Staff asked 3 months ago
2 Answers
Best Answer
Sepideh Hashemzadeh answered 2 months ago

In classification, training on a skewed dataset without proper calibration can lead to models that are biased towards the majority labels. For example, if one class accounts for 95% of a classification dataset, the learned model could simply predict the majority class irrespective of the features, since doing so already yields 95% accuracy (a misclassification rate of only 5%)! Similar problems can occur in regression.
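
To make the 95% example concrete, here is a minimal sketch using made-up toy labels (the 95/5 split and the variable names are assumptions for illustration only). It shows that a predictor which ignores the features scores high on accuracy while never finding a single minority example.

```python
from collections import Counter

# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores the features and always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The accuracy looks good, but the minority recall of zero is what reveals that the model has learned nothing about the rare class.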

To mitigate this problem, you could try a few strategies:

  • Sampling: Upsample or downsample the dataset to ensure equal representation for all the classes.
  • Weighting: Some training algorithms, such as tree-based algorithms, can take instance weights as parameters. Attach higher weights to instances with less frequent labels.
  • Avoid aggressive feature selection without first balancing the dataset using sampling or weighting.
  • Calibration: For binary classification, calibrating the model after training (using Platt scaling or isotonic regression) could also help.
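
The sampling and weighting strategies above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `balanced_class_weights` and `random_oversample` are hypothetical, the weights use the common inverse-frequency heuristic n_samples / (n_classes * class_count), and the feature rows are stand-ins.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

labels = [0] * 95 + [1] * 5
rows = list(range(len(labels)))  # stand-in feature rows (assumption)

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)
print(Counter(new_labels))
```

The per-class weights can be expanded into per-instance weights for any training routine that accepts them. If you oversample, do it on the training split only, so that duplicated rows do not leak into validation or test data.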

 

MockInterview Staff answered 3 months ago
  • In a long-tailed distribution, a high-frequency population is followed by a low-frequency population that tails off gradually and asymptotically
  • Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution
  • The least frequently occurring 80% of items account for a larger proportion of the total population than they would under a light-tailed (e.g. Gaussian) distribution
  • Related concepts: Zipf’s law, Pareto distribution, power laws

Examples:
1) Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table (Zipf’s law)
– The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent…
– “The” accounts for about 7% of all word occurrences (roughly 70,000 out of 1 million)
– “of” accounts for about 3.5%, followed by “and”…
– Only 135 vocabulary items are needed to account for half the English corpus!

2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
3) File size distribution of Internet traffic

Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
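
The rank-frequency rule from the natural language example can be sketched as follows. This assumes an idealized 1/rank law anchored at the 7% share quoted for the most frequent word; the function names `zipf_share` and `cumulative_share` are hypothetical.

```python
def zipf_share(rank, top_share=0.07):
    """Share of all word occurrences for the word at a given rank (ideal 1/rank law)."""
    return top_share / rank

def cumulative_share(n_words, top_share=0.07):
    """Combined share of the n_words most frequent words under the same ideal law."""
    return sum(zipf_share(r, top_share) for r in range(1, n_words + 1))

for rank in (1, 2, 3):
    print(rank, f"{zipf_share(rank):.2%}")
```

Real corpora follow this curve only approximately, so cumulative figures from `cumulative_share` need not match empirical counts such as the ones quoted above.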
Importance in classification and regression problems:
– Skewed distribution
– Which metrics to use? Accuracy can mislead on imbalanced data (the accuracy paradox in classification); consider F-score or AUC instead
– Issue when using models that assume linearity (e.g. linear regression): a monotone transformation of the data may be needed (logarithm, square root, sigmoid function…)
– Issue when sampling: random sampling can make your data even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla), or an anomaly detection approach
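
As a rough illustration of the monotone transformation point, the sketch below draws a hypothetical long-tailed target from a Pareto distribution (the shape parameter 1.5 and the sample size are arbitrary assumptions) and compares sample skewness before and after a log transform.

```python
import math
import random

def skewness(values):
    """Sample skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

rng = random.Random(0)
# Hypothetical long-tailed target: Pareto-distributed values (shape 1.5 is arbitrary).
raw = [rng.paretovariate(1.5) for _ in range(10_000)]
logged = [math.log(v) for v in raw]  # valid because Pareto samples are >= 1

print(f"skewness before: {skewness(raw):.2f}, after log: {skewness(logged):.2f}")
```

If a regression model is fitted on the log-transformed target, its predictions are on the log scale and need to be mapped back with `math.exp` before they are compared with the original values.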
