Can CRISP-DM and Scrum Coexist in Agile Data Science Projects?

December 11, 2024

Over the past two years, artificial intelligence has leaped from the confines of research and development labs, where data science experts crafted powerful yet often unheralded solutions, to the forefront of every product conversation. To truly excel at building intelligent products, we must critically evaluate the methodologies we use to get there.

The debate already rages between AI-minded data scientists and Scrum purists in traditional product development environments. As agile practices permeate AI projects, many wonder whether Scrum’s structured sprints can harmonize with data science’s exploratory nature. Enter CRISP-DM (the Cross-Industry Standard Process for Data Mining), a well-established approach that is seeing renewed interest as a potential bridge. Could a blend of CRISP-DM and Scrum be the answer?

1. Business Understanding

This initial phase emphasizes a clear grasp of the project’s goals and requirements from a business standpoint. Before diving into data sets and algorithms, data scientists need to understand the larger objectives they are aiming to achieve. This phase sets the stage by aligning data science efforts with organizational goals, ensuring that projects are not only technically sound but also valuable from a business perspective. Without this alignment, data science projects may end up solving the wrong problem or delivering solutions that lack business relevance.

During this phase, the team collaborates with business stakeholders to gather requirements, define the scope, and establish clear objectives. Understanding the project’s strategic importance helps in setting realistic expectations and lays a solid groundwork for the subsequent phases. This foundational step ensures that everyone is on the same page and that the project’s goals are clearly articulated and understood.

2. Data Understanding

Here, the focus shifts to gathering data and familiarizing the team with its intricacies. This involves exploratory data analysis (EDA) to uncover initial insights, assess data quality, and identify underlying patterns or anomalies. Data understanding is critical because the quality and relevance of the data directly affect the success of the entire project. Without a thorough grasp of the data, models may be built on flawed assumptions or incomplete information.

Working closely with data sources, the team collects and examines various data sets. EDA techniques such as visualization and summary statistics help to reveal patterns, correlations, and potential issues within the data. This phase enables teams to grasp the nuances of the data, identify any inconsistencies or gaps, and understand the data’s structure and behavior. By gaining a deep understanding of the data, teams can make informed decisions in the following phases.
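To make this concrete, here is a minimal EDA sketch in Python with pandas. The file name customers.csv and its contents are hypothetical placeholders, not taken from any particular project:

```python
import pandas as pd

# Load the raw data (file and columns are illustrative placeholders).
df = pd.read_csv("customers.csv")

# Summary statistics expose ranges, central tendencies, and obvious outliers.
print(df.describe(include="all"))

# Missing-value counts per column flag data-quality gaps early.
print(df.isna().sum())

# Pairwise correlations hint at relationships worth modeling later.
print(df.select_dtypes("number").corr())

# Histograms per numeric column surface skew and anomalies (uses matplotlib).
df.hist(figsize=(10, 8))
```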

3. Data Preparation

Likely the most time-consuming step, data preparation involves cleaning and transforming raw data into a suitable format for modeling. This phase addresses issues like missing values, outliers, and data normalization, which are critical for the success of subsequent modeling efforts. Proper data preparation ensures that the data is accurate, consistent, and ready for analysis, ultimately leading to more reliable and robust models.

During this phase, data scientists clean the data, engineer features, and transform variables. Cleaning removes or imputes missing values, handles outliers, and corrects errors. Feature engineering creates new variables from the existing data to enhance the model’s predictive power. Transformation normalizes or scales variables so they are on a comparable scale. This meticulous preparation is essential to ensure that the data is in the best possible state for modeling.
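As a rough sketch of these steps with pandas and scikit-learn; the columns total_spend and visits are assumed purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")  # illustrative input from the previous phase
num_cols = df.select_dtypes("number").columns

# Cleaning: impute missing numeric values with the median, then clip
# extreme outliers to the 1st/99th percentiles.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[num_cols] = df[num_cols].clip(
    df[num_cols].quantile(0.01), df[num_cols].quantile(0.99), axis=1)

# Feature engineering: derive a new variable from existing ones.
df["spend_per_visit"] = df["total_spend"] / df["visits"].replace(0, 1)

# Transformation: scale numeric features to zero mean and unit variance.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```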

4. Modeling

The modeling phase involves selecting and applying various modeling techniques with the prepared data. This experimental phase may include trying multiple algorithms, tuning parameters, and iteratively refining models to improve performance. The goal is to build a predictive model that accurately captures the patterns and relationships within the data and can generalize well to new data.

Data scientists experiment with different algorithms such as decision trees, neural networks, or support vector machines, assessing their performance using metrics like accuracy, precision, and recall. Hyperparameter tuning is performed to optimize the model’s parameters and enhance its performance. This phase often involves multiple iterations as the team tests and refines the models to achieve the best possible results. These efforts are essential to ensure that the model is robust, reliable, and ready for deployment.
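A minimal sketch of this experiment-and-tune loop using scikit-learn; synthetic data stands in for the prepared dataset, and the grid of tree depths is just one example of a hyperparameter search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic data stands in for the prepared dataset from the previous phase.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning: grid-search tree depth with 5-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X_train, y_train)

# Report accuracy, precision, and recall on data the model has never seen.
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```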

5. Evaluation

Before deployment, rigorous evaluation is necessary to confirm that the models meet the business objectives established in the first phase. During the evaluation phase, we validate model performance, assess whether all critical business issues have been sufficiently addressed, and determine the next steps. This phase ensures that the model is not only technically accurate but also aligned with the project’s business goals and requirements.

Various evaluation metrics and techniques are used to assess the model’s performance, including cross-validation, hold-out validation, and A/B testing. The model’s predictions are compared against actual outcomes to measure its effectiveness. This phase also involves assessing the model’s interpretability and ensuring that it can provide actionable insights. If the model meets the established criteria, we proceed to deployment; otherwise, further refinements are made.
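A short scikit-learn sketch combining hold-out validation with cross-validation; the synthetic data and the plain accuracy metric are stand-ins for real project data and business-aligned evaluation criteria:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Hold-out validation: keep a test set untouched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)

# Cross-validation estimates how stable performance is across data splits.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final check against the hold-out set before any deployment decision.
model.fit(X_train, y_train)
print(f"Hold-out accuracy: {model.score(X_test, y_test):.3f}")
```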

6. Deployment

In this final phase, the validated model moves out of the lab and into production. Deployment can be as simple as delivering a report of findings or as involved as embedding a real-time scoring service into a product. It also covers planning for monitoring and maintenance: models degrade as data drifts, so the team must track production performance, schedule retraining, and document the solution so it can be operated and handed off reliably.
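As one simple deployment pattern, model persistence, the sketch below trains a model and saves it as a versioned artifact that a serving application can reload. The joblib-based approach and the file name model_v1.joblib are assumptions for illustration, not a prescribed stack:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train the approved model (synthetic data as a stand-in) and persist it
# as a versioned artifact alongside its evaluation results.
X, y = make_classification(n_samples=1000, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model_v1.joblib")

# At serving time, a downstream application reloads the artifact and
# scores new records without retraining.
served = joblib.load("model_v1.joblib")
print(served.predict(X[:5]))
```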

So, could a hybrid of CRISP-DM and Scrum be the answer? By blending CRISP-DM’s structured phases, from business understanding through deployment, with Scrum’s agile framework, teams may find a practical way to reconcile these two essential processes.
