What is the life cycle of a data science project?
- Data acquisition
Acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, are acquired following the established processes.
- Data preparation
Also called data wrangling: cleaning the data and shaping it into a form suitable for later analyses. Involves exploratory data analysis and feature extraction.
- Hypothesis & modelling
Similar to data mining, but applied to all the data rather than to samples: machine learning techniques are run on the full data set. A key sub-step is model selection. This involves preparing a training set for the candidate models, and validation and test sets for comparing model performance, selecting the best-performing model, gauging model accuracy and preventing overfitting.
- Evaluation & interpretation
Steps 2 to 4 are repeated as many times as needed; as the understanding of the data and the business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are made. These iterations may sometimes include step 5 (deployment) and be performed in a pre-production environment.
- Deployment
- Operations
Regular maintenance and operations. Includes performance tests to measure model performance, and can raise an alert when performance moves outside an acceptable threshold.
- Optimization
Can be triggered by failing performance, by the need to add new data sources and retrain the model, or by the need to deploy new versions of an improved model.
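The model selection sub-step described under Hypothesis & modelling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic data, the 60/20/20 split, the two candidate models (`fit_mean`, `fit_linear`) and the helper names are hypothetical, and mean squared error is just one possible metric. The candidates are fitted on the training set, compared on the validation set, and only the winner is scored once on the held-out test set.

```python
import random
from statistics import mean


def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train/validation/test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_mean(train):
    """Baseline candidate: always predict the mean target of the training set."""
    m = mean(y for _, y in train)
    return lambda x: m


def fit_linear(train):
    """Candidate: one-variable least-squares line fitted on the training set."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared error of a fitted model on the given rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)


def select_model(candidates, train, val):
    """Fit every candidate on train, return the one with the lowest validation error."""
    fitted = {name: fit(train) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: mse(fitted[name], val))
    return best, fitted[best]


# Synthetic data (an assumption): y = 3x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(100)]

train, val, test = split(data)
best_name, best_model = select_model(
    {"mean": fit_mean, "linear": fit_linear}, train, val
)

# The test set is used only once, to estimate how the chosen model generalizes.
test_mse = mse(best_model, test)
print(best_name, test_mse)
```

Keeping the test set out of the comparison is what guards against overfitting to the validation data: the validation score picks the model, and the test score gives an estimate that the selection itself did not influence.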
Note: as the team matures and project goals become well defined, pre-defined performance targets can help evaluate the feasibility of the data science project early in the life cycle. This early comparison helps the team refine its hypotheses, change approach, or discard the project if it is not viable.
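Both the performance tests under Operations and the pre-defined targets in the note come down to comparing a measured metric against an agreed threshold. The sketch below is one possible way to express that check; it is not taken from the source. The accuracy metric, the `min_accuracy` default of 0.9, the rule of two consecutive failures and all names (`PerformanceCheck`, `check_accuracy`, `needs_optimization`) are hypothetical choices.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PerformanceCheck:
    metric: str
    value: float
    threshold: float
    degraded: bool  # True when the metric is below the acceptable threshold


def check_accuracy(predictions, labels, min_accuracy=0.9):
    """Compare the accuracy on a labelled batch against a pre-defined threshold."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    accuracy = mean(1.0 if p == y else 0.0 for p, y in zip(predictions, labels))
    return PerformanceCheck("accuracy", accuracy, min_accuracy, accuracy < min_accuracy)


def needs_optimization(history, max_consecutive_failures=2):
    """Signal the Optimization step once the last N checks have all failed."""
    recent = history[-max_consecutive_failures:]
    return len(recent) == max_consecutive_failures and all(c.degraded for c in recent)


# Hypothetical batches: the first matches the labels, the second has drifted.
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
good = check_accuracy([1, 0, 1, 1, 0, 1, 0, 1, 1, 0], labels)
bad = check_accuracy([1, 1, 0, 1, 0, 0, 0, 1, 0, 0], labels)

print(good.degraded, bad.degraded)
print(needs_optimization([good, bad, bad]))
```

Requiring several consecutive failures before triggering retraining is a design choice that avoids reacting to a single noisy batch; a production setup might prefer a rolling average or a statistical drift test instead.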
Source: IT Shared
All Machine Learning, Data Mining and Data Science projects should follow some process, so you may be asked questions about it:
- Can you outline the steps in a data science project?
- Have you heard of CRISP-DM (Cross Industry Standard Process for Data Mining)?
CRISP-DM defines the following steps:
- Business Understanding (problem definition)
- Data Understanding (or Data Exploration)
- Data Preparation
- Modeling
- Evaluation
- Deployment (to production)
Next, you may be asked to discuss each of these steps in detail:
- What is the goal of each step?
- What are possible activities at each step?