In any of these projects, have you had to deal with dirty data? What kinds of tests do you run to validate the quality of your data? What is the most common problem you find?

Data Science Interview QuestionsCategory: Data ScienceIn any of these projects, have you had to deal with dirty data? What kinds of tests do you run to validate the quality of your data? What is the most common problem you find?
MockInterview Staff asked 6 years ago
2 Answers
MockInterview Staff answered 6 years ago

How to clean data?

  1. First: detect anomalies and contradictions
    Common issues:
  • Tidy data: (Hadley Wickam paper)
    column names are values, not names, e.g. <15-25, >26-45…
    multiple variables are stored in one column, e.g. m1534 (male of 15-34 years’ old age)
    variables are stored in both rows and columns, e.g. tmax, tmin in the same column
    multiple types of observational units are stored in the same table. e.g, song dataset and rank dataset in the same table
    *a single observational unit is stored in multiple tables (can be combined)
  • Data-Type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
  • Range constraints: number or dates fall within a certain range. They have minimum/maximum permissible values
  • Mandatory constraints: certain columns can’t be empty
  • Unique constraints: a field must be unique across a dataset: a same person must have a unique SS number
  • Set-membership constraints: the values for a columns must come from a set of discrete values or codes: a gender must be female, male
  • Regular expression patterns: for example, phone number may be required to have the pattern: (999)999-9999
  • Misspellings
  • Missing values
  • Outliers
  • Cross-field validation: certain conditions that utilize multiple fields must hold. For instance, in laboratory medicine: the sum of the different white blood cell must equal to zero (they are all percentages). In hospital database, a patient’s date or discharge can’t be earlier than the admission date
  1. Clean the data using:
  • Regular expressions: misspellings, regular expression patterns
  • KNN-impute and other missing values imputing methods
  • Coercing: data-type constraints
  • Melting: tidy data issues
  • Date/time parsing
  • Removing observations

Source

Your Answer

5 + 18 =