Five tests to validate data: http://insightextractor.com/2017/03/07/5-tests-validate-quality-data/
How to clean data?
- First: detect anomalies and contradictions.
Common issues:
- Tidy-data violations (Hadley Wickham's "Tidy Data" paper):
  - Column names are values, not variable names, e.g. <15-25, >26-45, …
  - Multiple variables are stored in one column, e.g. m1534 (male, aged 15-34)
  - Variables are stored in both rows and columns, e.g. tmax and tmin in the same column
  - Multiple types of observational units are stored in the same table, e.g. a song dataset and a rank dataset in one table
  - A single observational unit is stored in multiple tables (these can be combined)
- Data-type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
- Range constraints: numbers or dates must fall within a certain range, with minimum/maximum permissible values
- Mandatory constraints: certain columns can't be empty
- Unique constraints: a field must be unique across a dataset, e.g. each person must have a unique Social Security number
- Set-membership constraints: the values for a column must come from a set of discrete values or codes, e.g. gender must be female or male
- Regular expression patterns: for example, phone numbers may be required to have the pattern (999)999-9999
- Misspellings
- Missing values
- Outliers
- Cross-field validation: certain conditions spanning multiple fields must hold. For instance, in laboratory medicine, the different white blood cell counts must sum to 100 (they are all percentages). In a hospital database, a patient's date of discharge can't be earlier than the admission date
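Several of the detection checks above can be expressed as boolean filters in pandas. This is a minimal sketch with an invented patient table; all column names and values are hypothetical, chosen only to trip each constraint once:

```python
import pandas as pd

# Hypothetical patient data; each row is crafted to violate one constraint
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],                      # 2 is duplicated
    "gender": ["female", "male", "unknown", "male"], # "unknown" is out of set
    "age": [34, 151, 28, 45],                        # 151 is out of range
    "phone": ["(555)123-4567", "555-1234",           # "555-1234" breaks pattern
              "(555)987-6543", "(555)000-1111"],
    "admitted":   pd.to_datetime(["2020-01-01", "2020-02-01",
                                  "2020-03-01", "2020-04-05"]),
    "discharged": pd.to_datetime(["2020-01-05", "2020-01-25",
                                  "2020-03-02", "2020-04-06"]),
})

# Range constraint: age must fall between permissible min/max values
bad_range = df[~df["age"].between(0, 120)]

# Unique constraint: patient_id must identify exactly one row
duplicated_ids = df[df["patient_id"].duplicated(keep=False)]

# Set-membership constraint: gender must come from a discrete code set
bad_codes = df[~df["gender"].isin({"female", "male"})]

# Regular-expression pattern: phone must match (999)999-9999
bad_phone = df[~df["phone"].str.fullmatch(r"\(\d{3}\)\d{3}-\d{4}")]

# Cross-field validation: discharge can't precede admission
bad_dates = df[df["discharged"] < df["admitted"]]
```

Each filter returns the offending rows, which you can then report, drop, or correct depending on the constraint.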
- Clean the data using:
  - Regular expressions: fix misspellings and enforce pattern constraints
  - KNN imputation and other missing-value imputation methods
  - Coercing: enforce data-type constraints
  - Melting: fix tidy-data issues
  - Date/time parsing
  - Removing observations
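Several of these cleaning steps (melting, type coercion, KNN imputation) can be chained in a few lines of pandas and scikit-learn. A minimal sketch, assuming an invented wide table where year columns are values posing as names:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical wide table: the year columns are values, not variable names
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "2019": ["3.1", "19.0"],
    "2020": [None, "19.4"],
})

# Melting: gather the year columns into tidy (year, temp) rows
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")

# Coercing: enforce data-type constraints; unparseable strings become NaN
tidy["temp"] = pd.to_numeric(tidy["temp"], errors="coerce")
tidy["year"] = tidy["year"].astype(int)

# KNN imputation: fill remaining missing numeric values from the
# nearest rows (here purely illustrative, with n_neighbors=1)
imputer = KNNImputer(n_neighbors=1)
tidy[["year", "temp"]] = imputer.fit_transform(tidy[["year", "temp"]])
```

After this pipeline the table is tidy, numerically typed, and free of missing values; in practice you would choose the imputer and its neighbour count based on the data.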