In the Beginning There Was the Question

06.02.2017

Episode #1 of the course An introduction to data science by Roger Peng

Welcome to the course! In this lesson, you’ll learn about the six types of data analysis questions.

Many of the “fatal” pitfalls of a data analysis can be avoided by expending the mental energy to get your question right. Understanding the type of question you are asking may be the most fundamental step you can take to ensure that, in the end, your interpretation of the results is correct. The six types of questions are:

1. Descriptive
2. Exploratory
3. Inferential
4. Predictive
5. Causal
6. Mechanistic

A descriptive question is one that seeks to summarize a characteristic of a set of data. Examples include determining the proportion of males, the mean number of servings of fresh fruits and vegetables per day, or the frequency of viral illnesses in a set of data collected from a group of individuals. There is no interpretation of the result itself, as the result is a fact, an attribute of the set of data that you are working with.
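As a minimal sketch of what answering a descriptive question can look like in code, the Python snippet below summarizes a tiny data set. The records, the field names, and the `describe` helper are all made up for illustration; they are assumptions, not part of the course material.

```python
from statistics import mean

# Hypothetical toy records, invented for illustration only.
# "illnesses" is the number of viral illnesses in the past year.
records = [
    {"sex": "M", "servings": 2, "illnesses": 3},
    {"sex": "F", "servings": 5, "illnesses": 1},
    {"sex": "F", "servings": 6, "illnesses": 0},
    {"sex": "M", "servings": 3, "illnesses": 2},
]


def describe(rows):
    """Summarize the data set itself; no claim is made about a wider population."""
    n = len(rows)
    return {
        "proportion_male": sum(r["sex"] == "M" for r in rows) / n,
        "mean_servings": mean(r["servings"] for r in rows),
        "mean_illnesses": mean(r["illnesses"] for r in rows),
    }


summary = describe(records)
print(summary)
```

Each number in `summary` is simply a fact about these four records, which is what makes the question descriptive.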

An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables. If you have a general idea that diet is somehow linked to viral illnesses, you might explore it by examining relationships between a range of dietary factors and viral illnesses. Suppose you find in your exploratory analysis that individuals who ate a diet high in certain foods had fewer viral illnesses than those whose diet was not enriched with these foods. You might then propose the hypothesis that among adults, eating at least five servings a day of fresh fruit and vegetables is associated with fewer viral illnesses per year.
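One way to picture this kind of exploration is to scan several dietary factors for their correlation with illness counts. The sketch below assumes made-up numbers and hypothetical factor names (`fruit_veg_servings`, `sugary_drinks`), and uses a hand-written Pearson correlation as just one of many possible exploratory summaries.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy data, one entry per person (invented for illustration).
diet = {
    "fruit_veg_servings": [1, 2, 3, 5, 6, 7],
    "sugary_drinks": [3, 0, 2, 1, 3, 0],
}
illnesses = [4, 3, 3, 2, 1, 1]  # viral illnesses per year

# Scan every dietary factor for a relationship with illness counts.
correlations = {name: pearson(values, illnesses) for name, values in diet.items()}

# The most negative correlation marks the factor most associated with fewer illnesses.
most_negative = min(correlations, key=correlations.get)
print(correlations, most_negative)
```

A pattern found this way is only a lead. It suggests a hypothesis to state precisely and then check against other data.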

An inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data, which in this example is a representative sample of adults in the US. By analyzing this different set of data, you are determining both whether the association you observed in your exploratory analysis holds in a different sample and whether it holds in a sample that is representative of the adult US population, which would suggest that the association is applicable to all adults in the US.

A predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year. In this type of question, you are less interested in what causes someone to eat a certain diet than in what predicts whether someone will eat it. For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables. What is most important is that income is a factor that predicts this behavior. Although an inferential question might tell us that people who eat a certain type of food tend to have fewer viral illnesses, the answer to this question does not tell us whether eating these foods causes a reduction in the number of viral illnesses. Answering that requires a causal question.

A causal question asks whether changing one factor will change another factor, on average, in a population. Sometimes the underlying design of the data collection, by default, allows for the question that you ask to be causal. An example of this would be data collected in the context of a randomized trial in which people were randomly assigned to eat a diet high in fresh fruits and vegetables or one that was low in fresh fruits and vegetables. In other instances, even if your data are not from a randomized trial, you can take an analytic approach designed to answer a causal question.
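The randomized-trial design can be sketched in a few lines of Python. Everything here is an illustrative assumption: the `randomize` and `average_effect` helpers, the eight participant IDs, the seed, and the outcome values are all invented. A real trial analysis would also need to quantify uncertainty, which this sketch does not attempt.

```python
import random
from statistics import mean


def randomize(participant_ids, seed=0):
    """Randomly split participants into two equal-sized diet arms."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"high": shuffled[:half], "low": shuffled[half:]}


def average_effect(outcomes, arms):
    """Difference in mean outcome between the arms (high diet minus low diet)."""
    high = mean(outcomes[p] for p in arms["high"])
    low = mean(outcomes[p] for p in arms["low"])
    return high - low


arms = randomize(range(8), seed=42)

# Hypothetical outcomes (viral illnesses per year) recorded after the trial.
# The values are made up purely to show the calculation.
outcomes = dict(zip(arms["high"], [1, 0, 2, 1]))
outcomes.update(zip(arms["low"], [3, 2, 3, 4]))

effect = average_effect(outcomes, arms)
print(effect)
```

Because assignment to an arm is random, a difference in average outcomes between the arms can be read as the average effect of the diet rather than of some other characteristic of the people who chose it.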

Finally, even if the diet does indeed cause a reduction in the number of viral illnesses, none of the questions described so far will tell us how it does so. A question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question.

Tomorrow, you’ll learn about the data analysis epicycle, which will help guide the process of your analyses.

Recommended book

“The Art of Data Science” by Roger Peng and Elizabeth Matsui
