How do you handle missing data? What techniques do you recommend?

3 Answers
MockInterview Staff answered 5 years ago

More reading: Handling missing data (O’Reilly)
You can look for missing or corrupted values in a dataset and then either drop the affected rows or columns or replace the values with something else.
In Pandas, two methods are especially useful: isnull() flags missing values, and dropna() drops the rows or columns that contain them. Corrupted entries must first be converted to NaN before these methods will catch them. If you want to fill the missing values with a placeholder (for example, 0), use the fillna() method.
Source: Springboard
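
As a rough illustration of the three Pandas methods mentioned above, here is a minimal sketch. The DataFrame and its column names are made up for the example and do not come from the linked sources.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, np.nan, 31, 47],
    "income": [50000, 62000, np.nan, 58000],
})

# isnull() returns a True/False mask; summing it counts missing values per column.
missing_per_column = df.isnull().sum()

# dropna() drops every row that contains at least one missing value.
rows_dropped = df.dropna()

# dropna(axis=1) drops every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# fillna() replaces missing values with a placeholder (0 here).
filled_zero = df.fillna(0)

print(missing_per_column)
print(rows_dropped)
print(cols_dropped)
print(filled_zero)
```

Note that a placeholder such as 0 can distort statistics like the mean, so it is worth checking whether 0 is a sensible value for the column before using it.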

Palak Shah answered 5 years ago

Here are three ways:

  • Remove rows with missing values – This works well if 1) the values are missing at random (see Vinay Prabhu’s answer for more details on this) and 2) you don’t lose too much of the dataset by doing so.
  • Build another predictive model to predict the missing values – This could be a whole project in itself, so simple techniques are usually used here.
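  • Use a model that can handle missing data – Some tree-based methods, such as certain random forest or gradient-boosted tree implementations, accept missing values directly. Support depends on the library, so check before relying on it.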
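
The first two options above can be sketched as follows. This is only an illustration under assumptions: the data is invented, and a straight-line fit with NumPy's polyfit stands in for "another predictive model"; a real project might use something richer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "weight" is missing in some rows, "height" is complete.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, np.nan, 80.0, np.nan],
})

# Option 1: remove rows with missing values and check how much data is lost.
complete = df.dropna()
fraction_lost = 1 - len(complete) / len(df)

# Option 2: fit a simple model (a straight line) on the complete rows,
# then use it to predict the missing values.
slope, intercept = np.polyfit(complete["height"], complete["weight"], deg=1)
imputed = df.copy()
missing = imputed["weight"].isna()
imputed.loc[missing, "weight"] = slope * imputed.loc[missing, "height"] + intercept

print(f"Fraction of rows lost by dropping: {fraction_lost:.0%}")
print(imputed)
```

Checking fraction_lost before dropping rows is one way to test the "don't lose too much of the dataset" condition.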


MockInterview Staff answered 5 years ago

Answer from Analytics Vidhya:
Question: You are given a data set in which some variables have more than 30% missing values. Say that, out of 50 variables, 8 have more than 30% of their values missing. How will you deal with them?
Answer: We can deal with them in the following ways:

  1. Assign a unique category to the missing values; the fact that a value is missing may itself reveal a trend.
  2. Remove those variables outright.
  3. Or check how the missing values are distributed against the target variable. If a pattern shows up, keep those variables and assign their missing values a new category, and remove the others.
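
The three ways above could look roughly like this in Pandas. The data, the column names, and the choice of the target mean as the "distribution" check are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; "target" is the variable being predicted.
df = pd.DataFrame({
    "plan": ["basic", None, "pro", None, "basic", None],
    "notes": [None, None, None, "x", None, "y"],
    "target": [0, 1, 0, 1, 0, 1],
})

# Share of missing values per column, and the columns above the 30% threshold.
missing_share = df.isna().mean()
high_missing = missing_share[missing_share > 0.30].index.tolist()

# Way 3: compare the target's mean for rows where "plan" is missing vs. present.
# A large gap suggests that missingness itself carries information.
target_by_missing = df.groupby(df["plan"].isna())["target"].mean()

# Way 1: keep the informative column and give missing values their own category.
df["plan"] = df["plan"].fillna("Missing")

# Way 2: drop a column that is not worth keeping.
df = df.drop(columns=["notes"])

print(high_missing)
print(target_by_missing)
print(df)
```

In this toy data the target is 1 whenever "plan" is missing, which is the kind of pattern that would justify keeping the column with a "Missing" category.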
