Clean and Organize Data
Episode #5 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc
Yesterday, we walked through the most regularly used data files and untangled the reasoning behind why there’s more than one. We also revealed common data types and exemplified their importance through real-world experiences.
Today, we’ll change gears and move from data theory to practical data knowledge and techniques. We believe that there are three critical areas, and the first is to clean and organize data. Here’s what we mean.
In a perfect data world, you’d have the luxury to spend the bulk of your energy concentrated on identifying patterns, building beautiful charts, further enhancing a predictive model, and sharing the most exciting of insights. Basically, you’d be doing actual analysis (the fun part of working with data).
However, in the vast majority of projects, data practitioners spend the vast majority of their time simply preparing (cleaning and organizing) data for analysis. In fact, working with data is commonly associated with the Pareto Principle (80:20 Rule), in that you’ll spend 80% of your time preparing data and 20% performing analysis—it’s unevenly distributed.
But it’s for good reason: garbage in = garbage out. An analysis won’t be effective if you’re working with inaccurate data in the first place. You’ll have a hard time getting clarity on typical customer behavior if your customer database is missing 50% of its records.
The point is, data quality is directly correlated to analysis quality. There’s a motto in the data profession that “quality data beats fancy algorithms,” implying that sophistication and fancy math won’t fix low-quality data. Here are a few specific examples of “garbage in” (many times known as “dirty data”) that we’ve personally experienced:
• Global finance data was supposed to be converted to a single currency, but upon further digging, we discovered that many entries were still being stored in their native currency (e.g., Mexican pesos weren’t being converted to USD).
• Customer support data was supposed to be unique (e.g., we should receive one entry for every customer phone call). However, we discovered that duplicate entries were finding their way into our systems.
• Product data was supposed to be up to date in our analysis of an experiment, yet an issue in our database was causing many records to never get created in the first place.
In all scenarios, we would have shared an inaccurate analysis—not because our analysis was wrong, but simply due to the use of “dirty data.” Hopefully, it’s becoming clear as to just how important this is.
Below is a general framework/checklist that you can use in your daily data extravaganzas to verify whether you’re working with clean data:
Valid. Is the data accessible, up to date, and properly stored?
If you need access to time-series data with timestamps and timezones, but your data is being rounded to the day, this would be invalid data.
Accurate. Is the data actually correct?
If you realize that your sales team has been “fudging the numbers” on their pipeline forecast, you’d be working with inaccurate data.
Complete. Is there any data missing and/or duplicated?
If you search for customer invoice data with a specific tag, but invoices have been improperly tagged, you’d be working with incomplete data.
Uniform. Is the data inputted the same way every time?
If you notice that customer phone numbers are stored in various ways (e.g., some with dashes and some without, some with country codes and some without, etc.), you’d be working with inconsistent data.
In summary, cleaning and organizing data is where analytic professionals spend most of their time. This is due to data being so easily prone to error, in combination with the importance of analyzing clean data (i.e., garbage in = garbage out).
Tomorrow, we’ll discuss our second practical area, which is the joining of multiple datasets!
Before you make any changes to a dataset, make a copy first. Then, while you work through the framework above, track all your changes. This way, you can clearly articulate to others exactly what’s changed and show them the “before vs. after.”
—Colby and Serge
How to Lie with Statistics by Darrell Huff
Share with friends