Doing Exploratory Data Analysis
Episode #4 of the course An introduction to data science by Roger Peng
In this lesson we will run through an informal “checklist” of things to do when embarking on an exploratory data analysis. The elements of the checklist are:
1. Formulate your question. Formulating a question can be a useful way to guide the exploratory data analysis process and limit the very large number of paths that can be taken with any sizeable dataset. In particular, a sharp question or hypothesis can serve as a dimension reduction tool, eliminating variables that are not immediately relevant to the question.
2. Read in your data. Sometimes the data will come in a very messy format and you’ll need to do some cleaning. Other times, someone else will have cleaned up that data for you.
3. Check the packaging. Assuming you don’t get any warnings or errors when reading in the dataset, it’s usually a good idea to poke the data a little bit before you break open the wrapping paper. For example, you should check the number of rows and columns. Often, with just a few simple maneuvers that perhaps don’t qualify as real data analysis, you can nevertheless identify potential problems with the data before plunging head first into a complicated data analysis.
4. Look at the top and bottom of your data. It’s often useful to look at the “beginning” and “end” of a dataset right after you check the packaging. This lets you know whether the data were read in properly, whether things are properly formatted, and whether everything is there.
5. Check your “n”s. In general, counting things is a good way to figure out whether anything is wrong. In the simplest case, if you’re expecting 1,000 observations and it turns out there are only 20, you know something must have gone wrong somewhere. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people there are in your study and check that the dataset contains that many.
6. Validate with at least one external data source. Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be, and it serves as a check on what other things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number.
7. Make a plot. Making a plot to visualize your data is a good way to further your understanding of your question and data. Plots can help you create expectations and check deviations from expectations. At the early stages of analysis, you may be equipped with a question/hypothesis, but you may have little sense of what is going on in the data. Making some sort of plot, which serves as a summary, will be a useful tool for setting expectations for what the data should look like. Making a plot can also be a useful tool to see how well the data match your expectations. Plots are particularly good at letting you see deviations from what you might expect. Tables typically are good at summarizing data by presenting things like means, medians, or other statistics. Plots, however, can show you those things, as well as show you things that are far from the mean or median, so you can check to see if something is supposed to be that far away. Often, what is obvious in a plot can be hidden away in a table.
8. Try the easy solution first. What’s the simplest answer you could provide to your question? For the moment, don’t worry about whether the answer is 100% correct; the point is to see how you could provide prima facie evidence for your hypothesis or question. You may refute that evidence later, but if you do not find evidence of a signal in the data using just a simple plot or analysis, then it is often unlikely that you will find something using a more sophisticated analysis.
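Several of the checklist items above (steps 2 through 6 and step 8) can be sketched in code. What follows is a minimal Python sketch, not the course's own method: the tiny ozone dataset, the column names, the plausible range of 0 to 0.2 ppm, and every function name are hypothetical and chosen only for illustration. It uses only the standard library so that it runs without extra packages; in practice a data-frame library would usually do this work.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical data standing in for a real file: ozone readings in ppm.
RAW_CSV = """site,hour,ozone_ppm
A,0,0.021
A,1,0.019
A,2,0.024
B,0,0.031
B,1,0.035
B,2,0.029
"""


def read_rows(text):
    """Step 2: read the data into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def check_packaging(rows):
    """Step 3: return (number of rows, number of columns)."""
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols


def top_and_bottom(rows, k=2):
    """Step 4: return the first k rows and the last k rows."""
    return rows[:k], rows[-k:]


def count_by(rows, column):
    """Step 5: count how many rows fall under each value of a column."""
    return Counter(row[column] for row in rows)


def within_reference(values, low, high):
    """Step 6: True if every value lies inside an externally sourced range."""
    return all(low <= v <= high for v in values)


def mean_by(rows, group_col, value_col):
    """Step 8: the simplest summary, the mean of value_col per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {group: statistics.mean(vals) for group, vals in groups.items()}


rows = read_rows(RAW_CSV)
n_rows, n_cols = check_packaging(rows)
head, tail = top_and_bottom(rows)
site_counts = count_by(rows, "site")

ozone = [float(row["ozone_ppm"]) for row in rows]
# The 0 to 0.2 ppm range is an assumed stand-in for a real external reference.
plausible = within_reference(ozone, 0.0, 0.2)
site_means = mean_by(rows, "site", "ozone_ppm")

print("rows, columns:", n_rows, n_cols)
print("top:", head)
print("bottom:", tail)
print("rows per site:", dict(site_counts))
print("all readings in plausible range:", plausible)
print("mean ozone per site:", site_means)
```

If the row count, the per-site counts, or the range check disagreed with what you expected, that would be the cue to go back to the source of the data before attempting anything more sophisticated.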
In this lesson, we’ve presented some simple steps to take when starting off on an exploratory analysis. Exploratory analysis will help get you thinking about the data and the question of interest, while also giving you a number of things to follow up on if you continue to be interested in this question.
Tomorrow, you’ll learn about using data to make an inference about a larger population.