Thursday, September 29, 2022

Crisp-DM: the 6 steps of the methodology of the future

In an increasingly volatile world, where a small event can have disastrous consequences for companies across the planet, data-driven strategies are becoming ever more necessary. The Crisp-DM methodology can therefore be an essential tool in today's business world, making it possible to anticipate problems and create solutions from existing data.

The name Crisp-DM is an acronym for Cross-Industry Standard Process for Data Mining. The objective of this methodology is to develop models, based on the analysis of a business's information and data, that help predict future failures and point to solutions.

To better understand how this methodology works, we talked to Helder Prado, a professor in the MBA USP/Esalq in Data Science and Analytics. He explains all the steps of Crisp-DM and how we can use the methodology to our benefit. Keep reading!

The data

Crisp-DM can be used in any type of business.

According to the professor, Crisp-DM is divided into six fundamental steps, each with its own particularities. “The first three steps aim to collect and organize the data to be analyzed. They are business understanding, data understanding and data preparation”, he explains.

Business understanding: The first step is possibly the most important of the entire process. If it is not done properly, the rest of the project can be invalidated later on. In this step, the objective of the project and the needs of the company or operation being analyzed are defined. It is therefore necessary for everyone involved to be well informed and completely aligned.

Data understanding: After the first step, we can start to think about the data that will be used in the process. To do so, we can ask several questions, such as: “Does the company have a database? How will the data be accessed? How many data sources will be used? What will the data formats be? Is the data structured?”. Based on the answers, the data is collected, taking care that no important information is left out.

Data preparation: With the data already collected, it is necessary to organize it so that we can see what it tells us. This step can also be guided by some questions: “How should null values be treated? Are the attributes in the correct formats? Is it necessary to merge with other data? What variables will be used in modeling?”. This is usually the most time-consuming and laborious step of all, but good work here means less rework later.
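As a minimal sketch of what the data preparation questions can look like in code, the Python snippet below fixes formats, treats a null value and merges two sources. The field names (machine_id, temperature, site), the sample values and the choice of filling nulls with the mean are all assumptions made for illustration; a real project would pick its own treatment for each attribute.

```python
from statistics import mean

# Hypothetical raw readings: numbers stored as text, one value missing.
raw_readings = [
    {"machine_id": "M1", "temperature": "71.5"},
    {"machine_id": "M2", "temperature": None},
    {"machine_id": "M3", "temperature": "68.5"},
]

# Hypothetical second source to merge with the readings.
machine_sites = {"M1": "north", "M2": "south", "M3": "north"}


def to_float(value):
    """Correct the format: text -> float, keeping None for missing values."""
    return float(value) if value is not None else None


def prepare(readings, sites):
    """Fix formats, treat null values and merge with another source."""
    temperatures = [to_float(row["temperature"]) for row in readings]

    # Treat null values: here (an assumption) they are replaced by the
    # mean of the values that are known.
    fill_value = mean(t for t in temperatures if t is not None)

    prepared_rows = []
    for row, temperature in zip(readings, temperatures):
        prepared_rows.append({
            "machine_id": row["machine_id"],
            "temperature": fill_value if temperature is None else temperature,
            # Merge: attach the site of each machine from the second source.
            "site": sites[row["machine_id"]],
        })
    return prepared_rows


prepared = prepare(raw_readings, machine_sites)
for row in prepared:
    print(row)
```

Filling with the mean is only one possible answer to “How should null values be treated?”; dropping the row or using a domain-specific default are equally valid choices, depending on the business objective defined in the first step.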

Creating a model

According to Prado, the last three steps aim to create the model, based on the previous steps, and put it into practice. This is where all the previous work will be tested and, if necessary, redone. They are modeling, evaluation and implementation (or deployment).

Modeling: In this step the model starts to take shape and we can see the first results. The type of modeling to be used is usually defined according to the needs of the business and the type of variable to be analyzed. Once the type of model has been chosen, the attributes that will serve as its variables should be defined. “Here it can be very useful to return to the first step to review the objectives and find new possibilities”, Prado advises.

Evaluation: With the model in hand, we can evaluate whether the result meets the expectations of the project. If it does not, or if the team considers that there is room for improvement, all efforts should be directed at making the necessary changes. These changes can take several forms, such as removing statistically insignificant attributes, correcting the input data, or correcting how attributes are treated.

Implementation (deployment): If the process has been done correctly, this will be the last step. Here, the model should be put into production so that it adds value to the business. The way this is done varies a lot, depending on the type of model and project. The model should be exposed and accessible, usually hosted in the cloud or on the company's own servers.

Crisp-DM in practice

Crisp-DM is a versatile methodology that combines many attributes in a single project.

To show how this methodology works in practice, the professor uses the example of a specific machine that is crucial for an operation and therefore should always be running. In addition, the machine has an internal sensor that records several of its own parameters every hour.

“This way, in the business understanding step, we define that the objective of the model to be created using Crisp-DM will be to predict when the machine is close to failing, using the parameters of the machine itself. In the second step, we will consider which of the data recorded by the machine will be used in the construction of the model, since not all of it is useful to us at the moment”, he says.

Once we have defined which data we want, we can move on to the data preparation step, when the data will be organized. This involves changing formats and joining tables, among other measures. With all the data ready, we can start modeling by choosing the algorithm that best fits the needs of the project, in this case predicting whether the machine will keep working or not.
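To make the modeling step concrete, here is a deliberately simple sketch in Python. It is not the algorithm the professor has in mind, which the article does not name; it is a single-threshold classifier (sometimes called a decision stump) that learns, from hypothetical hourly temperature readings, the cut-off that best separates "close to failing" from "healthy". The variable names and sample values are assumptions.

```python
def predict(value, threshold):
    """Return 1 (close to failing) if the reading reaches the threshold, else 0."""
    return 1 if value >= threshold else 0


def fit_threshold(values, labels):
    """Pick the cut-off that classifies the most training rows correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(values)):
        correct = sum(
            1 for value, label in zip(values, labels)
            if predict(value, candidate) == label
        )
        # Keep the first candidate that improves on the best score so far.
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical prepared data: one temperature reading per row, with a
# label saying whether the machine failed shortly afterwards (1) or not (0).
temperatures = [60.0, 62.5, 65.0, 80.0, 83.5, 90.0]
near_failure = [0, 0, 0, 1, 1, 1]

threshold = fit_threshold(temperatures, near_failure)
print("learned threshold:", threshold)
print("prediction for 85.0:", predict(85.0, threshold))
```

In a real project the choice would more likely fall on an established classification algorithm fed with several sensor parameters at once, but the shape of the step is the same: fit on prepared data, then predict on new readings.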

If the model does not perform well enough, it will be necessary to go back to the previous steps and see what can be changed. “One thing that can be done is to identify other explanatory variables that can help to estimate the model better; another is to use an algorithm to remove attributes that are not statistically significant, among several other strategies”, explains Prado.
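Deciding whether performance is "good enough" requires comparing the model's predictions with what actually happened. The Python sketch below computes three common measures (accuracy, precision and recall) for a hypothetical set of outcomes. The sample labels and the minimum recall of 0.9 are assumptions for illustration; the acceptable level is something the team would define in the business understanding step.

```python
def evaluate(actual, predicted):
    """Compare predictions with known outcomes (1 = close to failing, 0 = healthy)."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("nothing to evaluate")

    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)

    return {
        # Share of all predictions that were right.
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Of the alarms raised, how many were real failures.
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the real failures, how many the model caught.
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Hypothetical outcomes and model predictions for eight hourly readings.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

MINIMUM_RECALL = 0.9  # assumed business requirement: miss very few failures

metrics = evaluate(actual, predicted)
needs_rework = metrics["recall"] < MINIMUM_RECALL
print(metrics, "-> back to earlier steps" if needs_rework else "-> ready to deploy")
```

For a machine that must never stop, recall is a natural measure to watch, since a missed failure is usually more costly than a false alarm; other projects may weigh the measures differently.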

Once all the necessary revisions have been made and a working model has been obtained, it is time to put the model into production. “In this specific case, an Application Programming Interface (API) can be built that receives a request whenever the sensor captures a new row of data; the model then identifies whether or not the machine is close to failing and changes the machine's parameters to reduce the chance of failure”, the professor says.
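The core of such an API can be sketched as a single request handler. The Python example below is only an illustration under stated assumptions: the JSON field temperature, the fixed threshold standing in for the trained model, and the reduce_load action are all hypothetical, and the function is shown without any web framework so that it stays self-contained.

```python
import json

# Hypothetical stand-in for the trained model: a fixed temperature cut-off.
FAILURE_THRESHOLD = 80.0


def handle_request(body):
    """Handle one sensor reading sent as a JSON string.

    Returns a (status_code, json_response) pair, as a web framework would
    expect from an endpoint that is called for every new row of sensor data.
    """
    try:
        reading = json.loads(body)
        temperature = float(reading["temperature"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing field, or non-numeric value.
        error = {"error": "expected JSON with a numeric 'temperature' field"}
        return 400, json.dumps(error)

    near_failure = temperature >= FAILURE_THRESHOLD
    # Hypothetical corrective action sent back to the machine's controller.
    action = "reduce_load" if near_failure else "none"
    return 200, json.dumps({"near_failure": near_failure, "action": action})


status, response = handle_request('{"temperature": 85.2}')
print(status, response)
```

In production this handler would sit behind an HTTP endpoint and load the real model instead of a constant, but the flow is the one described: a reading comes in, the model scores it, and a corrective action goes out.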

The advantages

Prado highlights that the great benefit of the Crisp-DM methodology compared with other data mining approaches is that it integrates the creation of a model with business understanding. This is why it is so widely used in companies, in addition to being applicable to any type of business.

Even so, he makes it clear that the work does not end with deployment: “Throughout the life of this operation, this model will probably need to be estimated again with new sensor data, and so the cycle is renewed”, he concludes.

Did you like learning about Crisp-DM? Do you want to understand how to put this methodology and several others into practice? Then register for the MBA USP/Esalq in Data Science and Analytics! Check it out!


Caio Roberto
Journalist and lover of history, foreign languages, cinema, literature and video games. I use my natural curiosity and my ease in communicating to discover more about the world and try to pass it on. I believe I was born to tell stories, regardless of the story, the medium in which it is told, or my role in it.