What is selection bias, why is it important and how can you avoid it?

2 Answers
MockInterview Staff answered 4 years ago

Answer by Matthew Mayo.

Selection bias, in general, is a problematic situation in which error is introduced because the sample is not drawn at random from the population. For example, if a sample of 100 cases had a 60/20/15/5 split across 4 classes that actually occur in roughly equal numbers in the population, a model trained on it may wrongly assume that class frequency is a determining predictive factor and favor the overrepresented class. Avoiding non-random samples is the best way to deal with this bias; when that is impractical, techniques such as resampling, boosting, and weighting can help mitigate it.
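
As a rough illustration of the weighting idea, here is a minimal sketch for the 60/20/15/5 example. The class names and the assumption that all four classes are equally common in the population are hypothetical, and this is only one simple way to weight, not a complete fix for selection bias.

```python
from collections import Counter

# Hypothetical sample of 100 cases with the 60/20/15/5 split from the example.
sample_labels = ["A"] * 60 + ["B"] * 20 + ["C"] * 15 + ["D"] * 5

# Assumption: the four classes are equally common in the population.
population_share = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

counts = Counter(sample_labels)
n = len(sample_labels)

# Weight for each class = population share / sample share, so that
# underrepresented classes count for more and overrepresented ones for less.
weights = {cls: population_share[cls] / (counts[cls] / n) for cls in counts}

# After weighting, each class contributes its population share to the total.
weighted_share = {cls: counts[cls] * weights[cls] / n for cls in counts}

for cls in sorted(counts):
    print(cls, round(weights[cls], 3), round(weighted_share[cls], 3))
```

Weights like these could then be passed to a learner that accepts per-sample or per-class weights. They only correct the class mix; they cannot repair bias in which cases were selected within each class.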



MockInterview Staff answered 4 years ago

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

  • Selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved

Types:
– Sampling bias: systematic error due to a non-random sample of a population, causing some members to be less likely to be included than others
– Time interval: a trial may be terminated early at an extreme value (for ethical reasons), but the extreme value is most likely to be reached by the variable with the largest variance, even if all the variables have similar means
– Data: “cherry picking”, when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common example of flights that complete safely)
– Studies: performing experiments and reporting only the most favorable results
Why is it important?
– Can lead to inaccurate or even erroneous conclusions
– Statistical methods generally cannot overcome it
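
The time-interval effect listed above can be illustrated with a small simulation. This is a sketch under made-up assumptions: three variables with the same mean (0) but hypothetical standard deviations of 0.5, 1.0 and 2.0, and a trial that stops as soon as any observation exceeds a hypothetical threshold of 3.0 in absolute value.

```python
import random
from collections import Counter


def first_to_extreme(stds, threshold, rng, max_steps=1000):
    """Return the index of the variable that first crosses the threshold.

    Every variable has mean 0; only the standard deviations differ.
    Returns None if no variable crosses within max_steps.
    """
    for _ in range(max_steps):
        draws = [rng.gauss(0.0, sd) for sd in stds]
        hits = [i for i, x in enumerate(draws) if abs(x) > threshold]
        if hits:
            # If several cross in the same step, report the most extreme one.
            return max(hits, key=lambda i: abs(draws[i]))
    return None


rng = random.Random(0)          # fixed seed so the run is repeatable
stds = [0.5, 1.0, 2.0]          # hypothetical standard deviations
trials = 2000

stop_counts = Counter(first_to_extreme(stds, 3.0, rng) for _ in range(trials))
print(stop_counts)
```

In runs like this the highest-variance variable triggers the early stop far more often than the others, even though all three have the same mean.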
Why does data handling make it worse?
– Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
– Missing data handling amplifies this effect, because the imputed values are based on respondents who are mostly HIV negative
– Prevalence estimates will be inaccurate
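
The survey example can be sketched as a simulation. All numbers here are hypothetical (a 10% true prevalence and response rates of 40% for positive and 80% for negative individuals), and the imputation shown is a deliberately naive one that fills each nonrespondent with the respondents' average.

```python
import random

rng = random.Random(42)         # fixed seed so the run is repeatable
N = 100_000

# Hypothetical parameters, chosen only for illustration.
TRUE_PREVALENCE = 0.10
P_RESPOND_POSITIVE = 0.40       # positive individuals respond less often
P_RESPOND_NEGATIVE = 0.80

# True status of everyone in the population (True = positive).
population = [rng.random() < TRUE_PREVALENCE for _ in range(N)]

# Whether each person takes part depends on their status.
responded = [
    rng.random() < (P_RESPOND_POSITIVE if positive else P_RESPOND_NEGATIVE)
    for positive in population
]

respondents = [pos for pos, took_part in zip(population, responded) if took_part]
actual_prevalence = sum(population) / N
observed_prevalence = sum(respondents) / len(respondents)

# Naive imputation: treat each nonrespondent as having the respondents' rate.
n_missing = N - len(respondents)
imputed_prevalence = (sum(respondents) + n_missing * observed_prevalence) / N

print(f"actual:   {actual_prevalence:.3f}")
print(f"observed: {observed_prevalence:.3f}")
print(f"imputed:  {imputed_prevalence:.3f}")
```

Under these assumptions the respondents understate the true prevalence, and the naive imputation simply extends that understated figure to the whole population while making the dataset look complete.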


