# The Data Analysis Epicycle

06.02.2017

Episode #2 of the course An introduction to data science by Roger Peng

In this lesson, you’ll learn about the data analysis epicycle, which will help guide the process of your analyses.

To the uninitiated, a data analysis may appear to follow a linear, one-step-after-the-other process that arrives at a nicely packaged and coherent result. In reality, data analysis is a highly iterative and non-linear process, better reflected by a series of epicycles (see figure). At each step you learn something, and what you learn informs whether (and how) to refine and redo the step you just performed, or whether (and how) to proceed to the next step.

An epicycle is a small circle whose center moves around the circumference of a larger circle. In data analysis, the iterative process that is applied to all steps of the data analysis can be conceived of as an epicycle that is repeated for each step along the circumference of the entire data analysis process.

There are five core activities of data analysis:

1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results

These five activities can occur at different time scales: you might go through all five in the course of a day, or, for a large project, work on each one over the course of many months. Although there are many different types of activities that you might engage in while doing data analysis, every aspect of the process can be approached through an iterative process that we call the epicycle of data analysis. More specifically, for each of the five core activities, it is critical that you engage in the following steps:

1. Setting expectations,

2. Collecting information (data), comparing the data to your expectations, and if the expectations don’t match,

3. Revising your expectations or fixing the data so your data and your expectations match.

As you go through every stage of an analysis, you will need to go through the epicycle to continuously refine your question, your exploratory data analysis, your formal models, your interpretation, and your communication.

Repeatedly cycling through these five core activities until the data analysis is complete forms the larger circle of data analysis.
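To make the three steps concrete, here is a minimal Python sketch of a single epicycle. It is an illustration, not part of the course material: the function `epicycle` and its parameters (`set_expectation`, `collect`, `matches`, `revise`, `max_rounds`) are hypothetical names chosen for this example, and the sketch only models the "revise your expectations" branch. Fixing the data would happen inside whatever `collect` does.

```python
from typing import Callable, List, Tuple, TypeVar

E = TypeVar("E")  # an expectation
D = TypeVar("D")  # collected information (data)

# The five core activities that make up the larger circle.
ACTIVITIES: List[str] = [
    "Stating and refining the question",
    "Exploring the data",
    "Building formal statistical models",
    "Interpreting the results",
    "Communicating the results",
]


def epicycle(
    set_expectation: Callable[[], E],
    collect: Callable[[], D],
    matches: Callable[[E, D], bool],
    revise: Callable[[E, D], E],
    max_rounds: int = 5,
) -> Tuple[E, D, int]:
    """Run one epicycle: set expectations, collect, compare, revise.

    Returns the final expectation, the observed data, and the number of
    rounds it took for the two to agree.
    """
    expectation = set_expectation()          # step 1: set expectations
    for round_number in range(1, max_rounds + 1):
        observed = collect()                 # step 2: collect information
        if matches(expectation, observed):   # ...and compare
            return expectation, observed, round_number
        # Step 3 (hypothetical simplification): revise the expectation.
        expectation = revise(expectation, observed)
    raise RuntimeError("expectations and data did not match in time")


# Toy usage: we expect 5 records, but the (made-up) data source has 7.
final_expectation, observed, rounds = epicycle(
    set_expectation=lambda: 5,
    collect=lambda: 7,
    matches=lambda expected, seen: expected == seen,
    revise=lambda expected, seen: seen,
)
```

In a real analysis you would run something like this loop once for each entry in `ACTIVITIES`, and the "revise" step would be a judgment call rather than a one-line function.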

Developing expectations is the process of deliberately thinking about what you expect before you do anything, such as inspect your data, perform a procedure, or enter a command. For experienced data analysts, in some circumstances, developing expectations may be an automatic, almost subconscious process, but it’s an important activity to cultivate and be deliberate about.

Collecting information means gathering what you need to know about your question or your data. For your question, you collect information by performing a literature search or asking experts in order to ensure that your question is a good one. For your data, once you have some expectations about what the result will be when you inspect your data or perform the analysis procedure, you then perform the operation. The results of that operation are the data you need to collect, and then you determine whether the data you collected match your expectations.

When you have data in hand, the next step is to compare your expectations to the data. There are two possible outcomes: either your expectations match the data or they do not. If your expectations and the data match, terrific: you can move on to the next activity. If they do not match, there are two possible explanations for the discordance: either your expectations were wrong and need to be revised, or the data were wrong and contain an error.
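As a small illustration of this comparison, the Python sketch below checks a list of values against an expected range. The data (`ages`), the expected range, and the helper `find_surprises` are all made up for this example. The code can only flag a mismatch; deciding whether the expectation or the data is wrong remains the analyst's job.

```python
from typing import List, Tuple


def find_surprises(
    expected_range: Tuple[float, float], values: List[float]
) -> List[float]:
    """Return the values that fall outside the expected range."""
    low, high = expected_range
    return [v for v in values if not (low <= v <= high)]


# Hypothetical data: ages of survey respondents, in years.
ages = [34, 29, 41, 230, 38]

# Expectation set before looking at the data: ages lie between 0 and 120.
expected_range = (0, 120)

surprises = find_surprises(expected_range, ages)

if not surprises:
    print("Expectations match the data; move on to the next activity.")
else:
    # Two possible explanations: the expectation was wrong (revise it),
    # or the data contain an error (fix it).
    print(f"Mismatch on {surprises}: revise the expectation or fix the data.")
```

Here an age of 230 is far more likely to be a data entry error than a reason to widen the expected range, but that conclusion comes from what you know about the subject, not from the code.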

One key indicator of how well your data analysis is going is how easy or difficult it is to match the data you collected to your original expectations. You want to set up your expectations and your data so that matching the two up is easy.

Tomorrow, you’ll learn about applying the data analysis epicycle when exploring your data.

Recommended book

“The Signal and the Noise: Why So Many Predictions Fail” by Nate Silver
