More reading: Handling missing data (O’Reilly)
You can find missing or corrupted data in a dataset and then either drop the affected rows or columns or replace the bad values with another value.
In pandas, two very useful methods are isnull() and dropna(): isnull() flags missing values so you can find the columns that contain them, and dropna() drops the rows or columns that contain them. Corrupted values that are not already stored as NaN have to be converted to NaN first for these methods to catch them. If you want to fill the missing values with a placeholder (for example, 0), you can use the fillna() method.
Source: Springboard
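As a minimal sketch of the three methods mentioned above, assuming a small made-up DataFrame (the column names and values are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# isnull() returns a boolean mask; summing it counts missing values per column.
missing_per_column = df.isnull().sum()

# dropna() drops every row that contains at least one missing value.
complete_rows = df.dropna()

# dropna(axis=1) drops every column that contains a missing value instead.
complete_columns = df.dropna(axis=1)

# fillna(0) replaces each missing value with the placeholder 0.
filled = df.fillna(0)

print(missing_per_column)
print(complete_rows)
print(complete_columns)
print(filled)
```

Whether 0 is a sensible placeholder depends on the column; for a field like income, a median or a separate "missing" indicator may distort the data less.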
Here are three ways:
- Remove rows with missing values – This works well if 1) the values are missing randomly (see Vinay Prabhu’s answer for more details on this) and 2) you don’t lose too much of the dataset after doing so.
- Build another predictive model to predict the missing values – This could be a whole project in itself, so simple techniques are usually used here.
- Use a model that can incorporate missing data – For example, a random forest or another tree-based method, although whether missing values are accepted directly depends on the implementation.
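The first two options can be sketched roughly as follows. This assumes a hypothetical two-column dataset in which "height" is fully observed and "weight" has gaps, and it uses a plain linear fit (NumPy's polyfit) as a stand-in for the predictive model; a real project might use something more elaborate.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "height" is fully observed, "weight" has gaps.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, np.nan, 70.0, np.nan, 90.0],
})

# Option 1: drop incomplete rows, then check how much data was lost.
dropped = df.dropna()
fraction_lost = 1 - len(dropped) / len(df)

# Option 2: fit a simple linear model on the complete rows and use it
# to predict the missing "weight" values from "height".
slope, intercept = np.polyfit(dropped["height"], dropped["weight"], deg=1)
imputed = df.copy()
mask = imputed["weight"].isnull()
imputed.loc[mask, "weight"] = slope * imputed.loc[mask, "height"] + intercept

print(f"fraction of rows lost by dropping: {fraction_lost:.0%}")
print(imputed)
```

Checking the fraction of rows lost before committing to a drop is a cheap way to test the second condition on the first option.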
Answer from Analytics Vidhya:
You are given a data set in which some variables have more than 30% missing values. Let’s say that, out of 50 variables, 8 have more than 30% of their values missing. How will you deal with them?
Answer: We can deal with them in the following ways:
- Assign a unique category to the missing values; who knows, the missing values might reveal some trend.
- Remove them outright.
- Or, more sensibly, check their distribution against the target variable; if we find a pattern, we keep those missing values and assign them a new category, and remove the others.
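One way the third approach might look in pandas is sketched below. The data, the column names, and the 0.2 cutoff for deciding that missingness "shows a pattern" are all hypothetical; only the 30% threshold comes from the question.

```python
import pandas as pd

# Hypothetical data; "segment" and "referrer" are both more than 30% missing.
df = pd.DataFrame({
    "segment": ["a", None, None, "b", None, "a"],
    "referrer": [None, None, "ad", None, None, "ad"],
    "target": [1, 0, 0, 1, 0, 1],
})

# Fraction of missing values per column, and the columns above the 30% threshold.
missing_fraction = df.isnull().mean()
sparse_columns = missing_fraction[missing_fraction > 0.30].index.tolist()

for column in sparse_columns:
    is_missing = df[column].isnull()
    # Compare the mean target for rows where the value is missing vs. present.
    gap = abs(df.loc[is_missing, "target"].mean()
              - df.loc[~is_missing, "target"].mean())
    if gap > 0.2:  # hypothetical cutoff for "missingness relates to the target"
        # Keep the column and treat missing values as their own category.
        df[column] = df[column].fillna("Missing")
    else:
        # No visible pattern: remove the column.
        df = df.drop(columns=[column])

print(sparse_columns)
print(df)
```

In this toy data, rows with a missing "segment" all have target 0, so the column is kept with a "Missing" category, while "referrer" shows no difference and is dropped. On real data a comparison of means on a handful of rows would not be enough; a proper statistical test or a larger sample would be needed.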