Most Popular Data Science Methodology

The Data Science life-cycle

Sangeet Aggarwal
7 min read · Jun 13, 2022
Photo by Patrick McManaman on Unsplash

As a Data Scientist, you will need to understand many business problems and solve them in a systematic way. From understanding the business problem to solving it, you will go through various stages. This systematic process, from understanding the problem to finally delivering a tangible solution, is called Data Science Methodology.

Data Science Methodology is generally an iterative process that undergoes a lot of reviews and changes. We’ll go through every stage of one of the most popular Data Science methodologies in detail. But before that, let’s have a look at all the stages at once.

Image by GeeksforGeeks

Business Understanding

What is the problem you are trying to solve?

Business understanding is the first step in Data Science Methodology. It’s important to understand the problem properly before trying to solve it. In some organizations it’s so crucial that it takes days just to understand the business goal or the problem. It’s the foundation of your project and thus should be clearly defined.

Once the business goal is defined, it’s easy to define the various objectives of the whole project. You can then prioritize these objectives depending on their importance. This also allows you to ask the stakeholders more questions and, in return, get more clarity on these broken-down objectives.

Analytic Understanding

How can data be used to solve the problem or answer the question?

Once you have defined your business goal, it’s now time to understand the analytic approach that’s needed to solve the given problem. Analytic understanding is the second step of any Data Science project methodology.

Based on the nature of the business goal, there are four analytic approaches you can take:

  1. Descriptive Approach — used when the requirement is simply to report on the current state of affairs. This approach describes what is happening and surfaces insights into the status quo.
  2. Diagnostic Approach — also referred to as Statistical Analysis, the Diagnostic Approach answers questions like “what happened?” and “why did it happen?”. Think of it like a medical diagnosis: determining the underlying condition and identifying its cause.
  3. Predictive Approach — also referred to as forecasting, as it answers questions like “what if this trend continues?” and “what will happen in the near future?”. As the name suggests, it is used to predict future outcomes, with a certain confidence level, if the current trend continues.
  4. Prescriptive Approach — used to prescribe a solution for the given problem. It answers the question “how do we solve it?”. This is important when the business wants a concrete, data-backed solution to the problem at hand.

Data Requirements

If you want great results at the end of your project, you need to have great beginnings. And defining data requirements is one of the most important beginning stages of all data science methodologies.

Based on the analytic approach you have decided on, you would gather the data requirements. This mainly includes the sources, formats and content of the data that’s needed. Below are some of the questions you need to keep in mind while gathering data requirements.

  1. What data is required? Or what kind of data could be helpful?
  2. How to source this data? Is it accessible to the business or do we need to acquire it externally?
  3. How to understand the data? Or how to work with the data?
  4. How to prepare the data to get the desired result?

Data Collection

In this phase of the Data Science methodology, the data requirements are revisited to see if we have the needed data or if more data is needed for our final outcome. Here, data scientists get a clearer picture of what they will be working with.

As the required data is collected, data scientists can prepare basic reports and visualizations around it. They can also establish the current status, or the descriptive aspect, of the data during this phase.
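Such a first-pass report can be as simple as a few lines of pandas. This is only a sketch with a made-up DataFrame standing in for whatever data the project actually collects:

```python
import pandas as pd

# Hypothetical collected data (stand-in for the project's real sources)
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "revenue": [1200.0, 950.0, 1430.0, 800.0],
})

# A first-pass descriptive report: size, types, and summary statistics
print(df.shape)       # how many rows and columns we have so far
print(df.dtypes)      # what kind of data each column holds
print(df.describe())  # count, mean, std, min/max for numeric columns

# A quick status-quo view: revenue by region
print(df.groupby("region")["revenue"].sum())
```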

Data Understanding

This phase of the data science methodology is essential for determining whether the collected data is truly representative of the business problem at hand. Various checks are performed to understand the data and to ensure its quality.

Sometimes, based on what you infer here, you may have to go back from Data Understanding to Data Collection. If your understanding of the data isn’t solid enough to solve the problem, it’s imperative to revisit the data.
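A few typical quality checks can be sketched in pandas. The DataFrame and thresholds below are hypothetical; the point is the kind of checks, not the specific data:

```python
import numpy as np
import pandas as pd

# Hypothetical collected data with a few quality problems baked in
df = pd.DataFrame({
    "age": [25, 31, np.nan, 31],
    "income": [50000, 62000, 58000, 62000],
})

# Check 1: missing values per column
missing = df.isna().sum()

# Check 2: fully duplicated rows
n_duplicates = df.duplicated().sum()

# Check 3: basic range sanity (e.g. ages should not be negative)
out_of_range = (df["age"] < 0).sum()

print(missing)
print("duplicates:", n_duplicates, "| out-of-range ages:", out_of_range)
```

If any of these counts are unexpectedly high, that is the signal to loop back to Data Collection.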

Data Preparation

Data Preparation is the most important phase of all the data science methodology stages. It’s also the most time-consuming one. This phase involves cleaning the data, exploring it and making it suitable for further processing.

Some of the data preparation steps are:

  1. Handling missing values by methods like imputation
  2. Removing duplicates
  3. Detecting and handling outliers
  4. Feature engineering

Data Preparation is crucial to the whole data science methodology. If the data is not prepared well, we may end up with bad output: the quality of the data you feed into your modeling stage determines the quality of the results you get.

Modeling

Modeling is the phase in Data Science methodology where you use Statistics and Machine Learning to build and develop models that either make some predictions or simply draw insights. Modeling is basically using a bunch of algorithms, tweaking them and getting them to answer the business problem we have been trying to solve.

Modeling generally begins by training your statistical or machine learning model on a portion of the existing data, called the training data. The model is then fine-tuned by adjusting its hyperparameters, based on its performance on held-out data, which also comes from the existing data but was kept aside for testing purposes.
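A minimal sketch of that loop with scikit-learn, on synthetic data (the dataset, the decision tree, and the depths tried are all illustrative assumptions; in practice you would tune on a separate validation set and keep the test set for the final evaluation):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: two features, binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep part of the existing data aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Try a few values of one hyperparameter and keep the best model
best_model, best_score = None, -1.0
for depth in [2, 4, 8]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_model, best_score = model, score

print(f"best depth={best_model.max_depth}, accuracy={best_score:.2f}")
```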

There can be two types of modeling based on the nature of the business problem that’s being solved. One is called descriptive modeling, and the other one is called predictive modeling.

Descriptive modeling is done to tackle scenarios like building recommendation systems or defining clusters out of data points. Predictive modeling is done when you predict a future value based on the current data.
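To make the contrast concrete, here is a hedged sketch of both kinds of modeling on synthetic data: clustering unlabeled points (descriptive) versus fitting a trend and extrapolating a future value (predictive). All data and model choices here are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Descriptive modeling: define clusters out of unlabeled data points
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(points)

# Predictive modeling: fit a trend on past data, predict a future value
months = np.arange(12).reshape(-1, 1)
sales = 100 + 10 * months.ravel() + rng.normal(0, 2, 12)
forecast = LinearRegression().fit(months, sales).predict([[12]])

print(f"clusters found: {len(set(clusters))}, "
      f"month-12 forecast: {forecast[0]:.0f}")
```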

Modeling is the phase where you start getting answers to your problems, and that makes it very important. It can therefore happen that you have to loop back to the Data Preparation phase to re-calibrate your model.

Evaluation

Once your model has been built, it’s time to evaluate its performance. This is the evaluation phase of the Data Science methodology. Evaluation not only helps you determine the quality of the model but also helps you answer the question “Can we answer our objective question using the current model, or do we need to alter it?”.

Model evaluation generally has two main phases:

  1. Diagnostic Measure Phase — This is used to measure whether the model is doing what it is intended to do. If the model is a predictive model, you can use a decision tree to evaluate the model’s output. If it does not align with the expected results, that means there’s a scope for improvement. On the other hand, if the model is a descriptive one, then simply running the model with test data having known outcomes is sufficient. You can measure how different your model’s outcome is from your actual response.
  2. Statistical Significance Testing — this test is designed to check that the model is not just guessing the answer and is actually interpreting the data as intended.
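One simple way to run such a significance check is a permutation test: shuffle the true labels many times and see how often a model that was merely guessing would score as well as yours. The labels and predictions below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test-set labels and model predictions (~80% correct)
y_true = rng.integers(0, 2, size=100)
y_pred = np.where(rng.random(100) < 0.8, y_true, 1 - y_true)

observed_acc = np.mean(y_pred == y_true)

# Permutation test: shuffle the labels many times and record how often
# chance alone scores at least as well as the model did
perm_accs = np.array([
    np.mean(y_pred == rng.permutation(y_true)) for _ in range(1000)
])
p_value = np.mean(perm_accs >= observed_acc)

print(f"accuracy={observed_acc:.2f}, p-value={p_value:.3f}")
```

A small p-value suggests the model is genuinely interpreting the data rather than second-guessing it.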

Finally, after the above checks, you also assess the interpretability, or explainability, of the model. Anyone would prefer a model that can explain its outcome over a black box whose outcome is unexplained. You should be able to tell how the model made a particular decision, what error or performance metrics were used to produce the result, and which features it considered most important.

Deployment

Once the model is built and evaluated, and the data science team is confident enough to release it, it’s time to deploy the model for the real deal. The deployment phase is where you keep your fingers crossed and hope your model performs well out there in production.

Deployment is generally done with a small set of people based on whose feedback it would be decided whether to release the model for public use or not.

Feedback

After the model is deployed, you gather feedback from its users. This feedback allows you to spot gaps and scope for improvement in your model. You can then revamp the model to fix those gaps and deploy it again. This becomes a cyclic, ever-evolving process where you keep adjusting your model based on the feedback you get.

So with this methodology, you get an in-depth view of your data science journey. It’s an agile methodology for data science projects: with each step, you gain a deeper understanding and awareness of the model, which lets you achieve great results for businesses.


Sangeet Aggarwal

Data Enthusiast | I try to simplify Data Science and other concepts through my blogs