# Genie in a Model

**Episode #6 of the course An introduction to data science by Roger Peng**

Today, you’ll learn about using models to find associations and make predictions.

This is typically the part of a statistics textbook or course where people hit a wall, often because there is a lot of math. Math is good, but gratuitous math is not, and we are not in favor of it. Still, it is often useful to represent a model in mathematical notation: the notation is compact and can be easy to interpret once you get used to it. Writing a statistical model down mathematically, rather than in natural language alone, also forces you to be precise in describing the model and in stating what you are trying to accomplish, such as estimating a parameter.
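As a small illustration of how compact the notation can be, here is one common way to write a simple linear model. This is only a sketch, not a model used in this lesson, and the symbols are assumptions chosen for the example: *y* is an outcome, *x* is a predictor, and the Greek letters are unknown quantities.

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here $\beta_1$ would be the parameter we are trying to estimate (how much *y* changes, on average, per unit change in *x*), $\beta_0$ is an intercept, and $\varepsilon$ stands for variation in *y* that the model does not explain.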

**Associational Analyses**

Associational analyses examine the association between two or more features in the presence of other, potentially confounding, factors. There are three classes of variables that are important to think about in an associational analysis.

1. **Outcome**. The outcome is the feature of your dataset that is thought to change along with your **key predictor**. Even if you are not asking a causal or mechanistic question, so you don’t necessarily believe that the outcome *responds* to changes in the key predictor, an outcome still needs to be defined for most formal modeling approaches.

2. **Key predictor**. Often for associational analyses there is one key predictor of interest (there may be a few of them). We want to know how the outcome changes with this key predictor. However, our understanding of that relationship may be challenged by the presence of **potential confounders**.

3. **Potential confounders**. This is a large class of predictors that are related to both the key predictor and the outcome. It’s important to have a good understanding of what these are and whether they are available in your dataset. If a key confounder is not available in the dataset, there will sometimes be a proxy related to that confounder that can be substituted instead.

Once you have identified these three classes of variables in your dataset, you can start to think about formal modeling in an associational setting.
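To make the three classes of variables concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are invented, the names `confounder`, `key_predictor`, and `outcome` are hypothetical, and a straight-line least-squares fit stands in for whatever formal model you would actually choose. The adjustment is done by first removing the confounder's linear contribution from both the key predictor and the outcome, then relating what is left over.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least-squares (intercept, slope) for a straight line y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """What is left of y after removing its straight-line fit on x."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented data: the confounder drives both the key predictor and the outcome.
confounder = [0, 0, 1, 1, 2, 2, 3, 3]
wiggle = [-0.5, 0.5] * 4  # variation in the key predictor unrelated to the confounder
key_predictor = [z + w for z, w in zip(confounder, wiggle)]
outcome = [2.0 * z + 0.5 * x for z, x in zip(confounder, key_predictor)]

# Ignoring the confounder overstates how the outcome changes with the key predictor.
_, unadjusted = fit_line(key_predictor, outcome)

# Adjusting: strip the confounder out of both variables, then fit again.
x_resid = residuals(confounder, key_predictor)
y_resid = residuals(confounder, outcome)
_, adjusted = fit_line(x_resid, y_resid)

print(f"unadjusted slope: {unadjusted:.3f}")
print(f"adjusted slope:   {adjusted:.3f}")
```

In this made-up dataset the outcome was built to change by 0.5 per unit of the key predictor. The unadjusted slope is much larger because it also absorbs the confounder's effect, while the adjusted slope recovers the 0.5.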

**Prediction Analyses**

In the previous section we described associational analyses, where the goal is to see if a key predictor *x* and an outcome *y* are associated. But sometimes the goal is to use all of the information available to you to predict *y*. In that case, it doesn’t matter whether a variable is causally related to the outcome you want to predict, because the objective is prediction, not developing an understanding of the relationships between features.

With prediction models, we have outcome variables—features about which we would like to make predictions—but we typically do not make a distinction between “key predictors” and other predictors. In most cases, any predictor that might be of use in predicting the outcome would be considered in an analysis and might, *a priori*, be given equal weight in terms of its importance in predicting the outcome. Prediction analyses will often leave it to the prediction algorithm to determine the importance of each predictor and the functional form of the model.

For many prediction analyses, it is not possible to literally write down the model that is being used to predict because it cannot be represented using standard mathematical notation. Many modern prediction routines are structured as algorithms or procedures that take inputs and transform them into outputs. The path that the inputs take to be transformed into outputs may be highly nonlinear, and predictors may interact with other predictors on the way. Typically, there are no parameters of interest that we try to estimate; in fact, many algorithmic procedures do not have any estimable parameters at all.
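One example of such a procedure is a nearest-neighbor rule, sketched below in plain Python. It is offered only as an illustration, not as a method this lesson prescribes: the function name `knn_predict`, the training points, and the labels are all made up for the example. Notice that there is no equation with coefficients to estimate; the "model" is just the stored data plus a voting procedure (the number of neighbors `k` is a setting you choose, not a parameter you estimate).

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance to the query
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented training data: two features per observation, binary outcome.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))  # prints 0
print(knn_predict(train_points, train_labels, (6.4, 6.2)))  # prints 1
```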

The key thing to remember with prediction analyses is that we usually do not care about the specific details of the model. In most cases, as long as the method “works,” is reproducible, and produces good predictions with minimal error, then we have achieved our goals.

With prediction analyses, as with all analyses, the precise type of analysis you do depends on the nature of the outcome. Prediction problems often take the form of a **classification problem**, where the outcome is binary. In some cases the outcome can take more than two levels, but the binary case is by far the most common.
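For a binary outcome, one simple way to judge whether a method "works" is the share of held-out cases it gets wrong. The sketch below assumes a hypothetical prediction method that produces a score between 0 and 1 for each case; the scores, the true outcomes, and the 0.5 cutoff are all invented for illustration.

```python
def predict_class(score, threshold=0.5):
    """Turn a score into a binary prediction (1 if at or above the threshold)."""
    return 1 if score >= threshold else 0


def error_rate(predicted, actual):
    """Fraction of cases where the prediction differs from the true outcome."""
    mistakes = sum(p != a for p, a in zip(predicted, actual))
    return mistakes / len(actual)


# Invented held-out data: scores from some prediction method, and the true outcomes.
held_out_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55]
held_out_outcomes = [0, 0, 1, 1, 1, 1, 0, 0]

predictions = [predict_class(s) for s in held_out_scores]
test_error = error_rate(predictions, held_out_outcomes)

print(f"misclassification rate: {test_error:.2f}")  # prints 0.25
```

Two of the eight held-out cases are misclassified here, giving an error rate of 0.25. Other error measures exist; the misclassification rate is used only because it is the simplest to read.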

In the next lesson, you’ll learn about the differences between making an inference and a prediction.
