What is Data Preparation and Why Does it Matter?

by Pranav Ramesh
April 30, 2021

A 2012 article in the Harvard Business Review called ‘data scientist’ the “sexiest job of the 21st century”, but nearly a decade later, many data scientists are not too happy with their job. Their main gripe: data preparation.

Research suggests that data scientists spend between 50% and 80% of their time preparing data for analysis, and they are not happy about it. Although it is a large part of the job, many scientists find data preparation to be tedious, “janitorial” work which, according to Jawbone VP Monica Rogati, “at times, feels like everything we do”.

In this article, we look at the data preparation process, why it matters, who needs it, and where it’s headed.

Topics covered:

  • What is data preparation?
  • Why does data preparation matter?
  • How does data preparation work?
  • Where is data preparation used?
  • What is the future of data preparation?

What is Data Preparation?

Data preparation is the process of cleaning and transforming large amounts of raw data for analysis. It is the first step in the data science process and comes before data exploration and analytics. It takes a lot of time and effort, but data scientists need to prepare their data before they can extract any useful insights from it. Through data preparation, data from different sources and of differing quality can be merged into a clean, uniform format. Most artificial intelligence and machine learning (AI/ML) systems depend on big data to function, and big data is only useful once it has been prepared.
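
To make the idea of merging differently formatted sources concrete, here is a minimal Python sketch. The two sources (`crm_rows`, `web_rows`), their field names, and the `normalize` helper are hypothetical and chosen only for illustration; a real pipeline would usually rely on a dedicated data preparation tool or library.

```python
from datetime import datetime

# Hypothetical raw records from two sources that follow different conventions.
crm_rows = [
    {"Name": "  Ada Lovelace ", "signup": "2021-04-30", "spend": "120.50"},
    {"Name": "Alan Turing", "signup": "2021-03-15", "spend": ""},
]
web_rows = [
    {"full_name": "grace hopper", "joined": "04/30/2021", "total_spend": 75},
]


def normalize(name, date_text, date_format, spend):
    """Return one record in the shared, cleaned format."""
    return {
        # Collapse stray whitespace and use consistent capitalization.
        "name": " ".join(name.split()).title(),
        # Store every date as an ISO 8601 string (YYYY-MM-DD).
        "signup_date": datetime.strptime(date_text, date_format).date().isoformat(),
        # A missing value stays None instead of being guessed.
        "spend": float(spend) if spend not in ("", None) else None,
    }


# Map each source's own field names and date format onto the shared format.
prepared = [
    normalize(r["Name"], r["signup"], "%Y-%m-%d", r["spend"]) for r in crm_rows
] + [
    normalize(r["full_name"], r["joined"], "%m/%d/%Y", r["total_spend"])
    for r in web_rows
]

for record in prepared:
    print(record)
```

Every record ends up with the same three fields in the same format, so later analysis does not need to know which source a row came from.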

For example, the e-commerce site Groupon depends heavily on data preparation services to connect its subscribers with activities, travel opportunities, retailers, and other services. Groupon collects up to one terabyte of data per day, which is cleaned, transformed, and analyzed using a big data management platform, and shared with the sales and marketing team for insight.

Why does data preparation matter?

Achieving a clean and consistent format is crucial when data needs to be mined for insights. Without preparation, machine learning programs can miss important patterns within the data and leave them out of the analysis. Data prep can help:

  • Fix errors: Data prepping helps catch errors at the source. Once the data has moved to the analysis stage, errors are much harder to catch.
  • Improve quality: Data cleansing and transformation help make the data accurate and consistent.
  • Increase efficiency: High-quality data can be processed and analyzed more quickly, leading to more efficient decision-making.
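
As a rough illustration of catching errors at the source, the Python sketch below checks each raw row before it reaches analysis and sets aside the ones that fail. The field names (`email`, `amount`) and the two rules in `validate` are assumptions made for this example, not a standard rule set.

```python
def validate(row):
    """Return a list of problems found in one raw row (empty if the row is clean)."""
    problems = []
    # Very loose email check: exactly one "@" must be present.
    if str(row.get("email", "")).count("@") != 1:
        problems.append("malformed email")
    # The amount must parse as a number and must not be negative.
    try:
        if float(row.get("amount", "")) < 0:
            problems.append("negative amount")
    except (TypeError, ValueError):
        problems.append("amount is not a number")
    return problems


# Hypothetical raw input rows.
raw_rows = [
    {"email": "ada@example.com", "amount": "19.99"},
    {"email": "not-an-email", "amount": "5"},
    {"email": "alan@example.com", "amount": "abc"},
]

clean, rejected = [], []
for row in raw_rows:
    problems = validate(row)
    if problems:
        # Keep the reasons with the row so the issue can be fixed upstream.
        rejected.append((row, problems))
    else:
        clean.append(row)

print(len(clean), "clean rows,", len(rejected), "rejected rows")
```

Keeping the rejection reasons next to each row makes it easier to trace a problem back to its source instead of discovering it during analysis.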

How does data preparation work?

The type of data varies with the business it is being prepared for, but the basic framework of data preparation remains the same. The following five steps are common to all forms of data preparation:

  1. Accumulation: The first step