Joining Data Together
Episode #6 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc
Welcome back!
Yesterday, we shared the first critical area for practically working with and understanding data. Analytic professionals spend most of their time preparing data for analysis, because as the saying goes, “garbage in, garbage out.” Making sure data is valid, accurate, complete, and uniform is one of the most important things you can do with data.
Today, we’ll uncover the second critical area for practically working with and understanding data: joining data together. The proper way to address this topic is relational data modeling. Let’s work through an analogy to help bring this concept home: Have you ever heard the “Dem Bones” melody?
If not, it goes something like this: “Toe bone connected to the foot bone. Foot bone connected to the heel bone. Heel bone connected to the ankle bone. Ankle bone connected to the shin bone …”
The point is, each bone is connected (i.e., relational) to another bone, thus enabling the entire skeletal system to properly connect and work together. There’s an established way to build a data model in this same way, and it’s known as a relational data model.
In a relational data model, each dataset is built so it can also connect to another or multiple other datasets. Ultimately, they can all join together! Before we dive into examples, let’s first emphasize the reasoning behind why datasets get separated in the first place.
Imagine yourself building a web application where users can:
• Register on your website.
• Purchase a product on your website.
• Fill out and submit a support form on your website for help.
Now, imagine you’re forced to store all the data that’s produced from these actions into a spreadsheet. How would you store all the data? Does it make sense to try and store everything in one spreadsheet tab?
If so, maybe it would look something like this for a single user:
User ID | Register Date | Purchase Date | Support Date |
C072X | 2019-01-01 | 2019-01-15 | 2019-01-16 |
Seems simple, right? But what if our user makes another purchase? Do you add another row and just leave the other cells blank, like this?
User ID | Register Date | Purchase Date | Support Date |
C072X | 2019-01-01 | 2019-01-15 | 2019-01-16 |
C072X | 2019-01-20 |
Or, is it better to just override our original purchase date with the most recent?
What if you decide to start selling more than one product, allow users to do more actions on your website, and want to see more data than just the date for help requests. Do you just keep adding new columns?
Hopefully, you’re starting to see the lack of flexibility inherently created with a single “master” table. Besides the headaches you’d get from trying to clean the data, any errors can also find their way in (and trust us, they always do).
It’s best to separate the data and architect a link between them. Each dataset lives in its own domain and can thus be analyzed individually, though the architected link provides the power to analyze holistically when necessary. For instance, we could have three tabs in our spreadsheet, and each tab/dataset can join together via the User ID (the architected link):
Registrations
|
Payments
|
Support Inquiries
|
This way, you can simply append more data and/or expand the number of fields (i.e., columns) without having to worry about leaving cells empty or deleting historical data. You also have the ability to create a “master” table if you so choose by joining the data together via the link (i.e., User ID)—this is a Relational Data model.
To close out, have you ever used the VLOOKUP formula in a spreadsheet? If so, congratulations, you’ve joined data together and leveraged the concept of a relational data model! In SQL, joining data together is performed via a JOIN statement. And BI software leverages SQL and is thus doing the same thing behind the scenes (i.e., BI software writes SQL to a database). Tomorrow, we’ll address our final area for practically working with data: sort, filter, and segment!
—Colby and Serge
Recommended book
Lean Analytics: Use Data to Build a Better Startup Faster by Alistair Croll, Benjamin Yoskovitz
Share with friends