# Logistic Regression

**Episode #7 of the course Business analysis fundamentals by Polina Durneva**

Good morning! Yesterday, we discussed linear regression, a predictive tool commonly used in data mining. Today, we will talk about logistic regression, which is used primarily for classification. Logistic regression resembles decision tree analysis to some extent but is less prone to overfitting (if you remember, overfitting occurs when a model fits the noise in your data, which distorts your analysis).

**What Is Logistic Regression?**

To some extent, logistic regression is similar to linear regression. The main difference between these two is that logistic regression uses a categorical variable as its dependent variable (remember, in the linear regression y = k*x + b, y is the outcome, or dependent variable, and x is the factor, or independent variable).

Logistic regression estimates the probability that a new record belongs to each of the classes of the dependent variable, and the record is then classified based on that probability.

The most popular type of logistic regression has a binary dependent variable (basically, it means you can assign 1 (yes) or 0 (no) to each outcome in your dataset). For instance, if you want to predict the probability of surviving on the *Titanic* based on demographic factors of its passengers, you would assign 1 to those who survived and 0 to those who didn’t.

Of course, there can be more than two distinct outcomes in logistic regression, but each outcome will always belong to a separate category. (This is the main difference between logistic and linear regression; in a linear regression, you normally predict a continuous outcome.)

**How Is Logistic Regression Estimated?**

There are two main steps in estimating logistic regression:

**Step 1: Estimation of probability of belonging to a certain class.**

If, for instance, we have only two classes, Class 1 and Class 0, the probability of belonging to Class 1 is estimated (p = P(Y=1)), and the probability of belonging to Class 0 is derived from that as 1 - p. The main formula for probability is **Probability(class = 1 | x) = 1/(1 + e^{-(k*x+b)})**, in which the probability of belonging to Class 1 is estimated using the independent variable x. Because e^{-(k*x+b)} is always positive, the result always stays between 0 and 1, which avoids unacceptable probability values (higher than 1 or lower than 0).
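The formula above can be tried out directly. The following is a minimal sketch; the function name `class1_probability` and the coefficient values `k = 0.8` and `b = -2.0` are assumptions made up for illustration and are not estimated from any data.

```python
import math

def class1_probability(x, k, b):
    """Probability(class = 1 | x) = 1 / (1 + e^{-(k*x + b)})."""
    return 1.0 / (1.0 + math.exp(-(k * x + b)))

# Hypothetical coefficients, chosen only for illustration.
k, b = 0.8, -2.0

for x in (-10, 0, 2.5, 10):
    p1 = class1_probability(x, k, b)
    p0 = 1.0 - p1  # the probability of Class 0 is derived from p1
    print(f"x={x:>5}: P(class=1)={p1:.4f}, P(class=0)={p0:.4f}")
```

However large or small x gets, the printed probabilities stay strictly between 0 and 1.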

**Step 2: Estimation of a cut-off (threshold) value and classification of records.**

Once you decide on the cut-off value, you can classify your records. For instance, if the cut-off value is 0.2, we classify any record whose estimated probability of belonging to Class 1 is greater than or equal to 0.2 as Class 1, and the rest as Class 0.
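Here is a small sketch of the classification step. It assumes the usual convention that a record is assigned to Class 1 when its estimated probability of belonging to Class 1 is at or above the cut-off. The helper `classify` and the list of probabilities are hypothetical and serve only as an illustration.

```python
def classify(probabilities, cutoff=0.5):
    """Assign Class 1 when the estimated probability reaches the cut-off, else Class 0."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Hypothetical estimated probabilities of belonging to Class 1.
estimated = [0.05, 0.20, 0.35, 0.90]

low_cutoff = classify(estimated, cutoff=0.2)   # a lower cut-off puts more records in Class 1
high_cutoff = classify(estimated, cutoff=0.5)

print(low_cutoff)
print(high_cutoff)
```

Changing the cut-off does not change the estimated probabilities, only how they are turned into classes.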

**How Does Logistic Regression Work?**

To better understand how logistic regression works, let’s proceed to an example. Let’s assume we want to estimate the probability of survival on the *Titanic* using passengers’ income. For the sake of simplicity, we will only use one independent variable, x = person’s income (measured in dollars), against y = person’s survival on the ship (1 = yes, 0 = no).

If we were to use simple linear regression, we would have survival = k*income + b. However, we risk getting values below 0 and above 1, which is unacceptable for estimating the probability (the range should be between 0 and 1 inclusive).
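To see the problem in numbers, here is a quick sketch. The coefficients `k = 0.0001` and `b = -0.1` and the income values are invented for illustration only and do not come from real *Titanic* data.

```python
def linear_prediction(income, k, b):
    """Simple linear regression: survival = k*income + b."""
    return k * income + b

# Hypothetical coefficients, not estimated from real data.
k, b = 0.0001, -0.1

for income in (0, 5000, 20000):
    value = linear_prediction(income, k, b)
    is_valid = 0.0 <= value <= 1.0  # a probability must lie in [0, 1]
    print(f"income={income:>6}: prediction={value:.2f}, valid probability: {is_valid}")
```

With these assumed numbers, the lowest income gives a value below 0 and the highest gives a value above 1, so the straight line cannot be read as a probability.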

To fix this, we will use a nonlinear function that takes the following form:

**Probability(survival = 1 | income = x) = 1/(1 + e^{-(k*income+b)})**. Then, the odds (the probability of belonging to Class 1 divided by the probability of belonging to Class 0) are computed: **Odds(survival = 1) = Probability(survival = 1 | income = x) / (1 - Probability(survival = 1 | income = x))**.

After some algebraic manipulation of the equations above, we can derive **Odds(survival = 1) = e^{k*income+b}**. Finally, the logit is obtained by taking the natural log of both sides. The final form of logistic regression is

**log(Odds(survival = 1)) = k*income + b**.
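The algebraic manipulation mentioned above can be written out in a few lines. The shorthand symbols p (for Probability(survival = 1 | income = x)) and z (for k*income + b) are introduced here only for convenience:

```latex
\begin{align*}
p &= \frac{1}{1 + e^{-z}}, \qquad z = k \cdot \text{income} + b \\
1 - p &= \frac{e^{-z}}{1 + e^{-z}} \\
\text{Odds} = \frac{p}{1 - p} &= \frac{1}{e^{-z}} = e^{z} \\
\log(\text{Odds}) &= z = k \cdot \text{income} + b
\end{align*}
```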

The logit is a linear function of income and can take any real value, whereas the probability follows an S-shaped sigmoid curve and stays between 0 and 1. The logit is used mainly for easier understanding of estimated values because its form is similar to that of linear regression. So, if we had a real data set on survival on the *Titanic*, we would expect k > 0 because it is known that people with higher income had a better chance of survival than those with low income. We could also calculate the odds and probability using the equations above.
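As a rough illustration of going from the logit back to odds and probability, here is a short sketch. The coefficients `k = 0.0002` and `b = -1.0` are hypothetical (chosen with k > 0) and are not estimated from real *Titanic* data.

```python
import math

# Hypothetical coefficients with k > 0; not fitted to real data.
k, b = 0.0002, -1.0

def logit(income):
    """log(Odds) = k*income + b"""
    return k * income + b

def odds(income):
    """Odds(survival = 1) = e^{k*income + b}"""
    return math.exp(logit(income))

def probability(income):
    """Probability recovered from the odds: odds / (1 + odds)."""
    o = odds(income)
    return o / (1.0 + o)

for income in (1000, 5000, 20000):
    print(f"income={income:>6}: logit={logit(income):+.2f}, "
          f"odds={odds(income):.3f}, probability={probability(income):.3f}")
```

Because k is positive in this sketch, the logit, the odds, and the probability all increase as income increases.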

**Most Popular Applications of Logistic Regression**

Here are the most popular examples of the application of logistic regression:

• **Classification in healthcare.** Logistic regression can be used in healthcare to determine a patient’s diagnosis or predisposition to certain medical conditions.

• **Financial predictions.** Logistic regression can be used to forecast financial conditions on macro- and microeconomic levels.

• **Image categorization.** In certain cases, logistic regression is used to classify images based on a variety of factors.

That’s it. Tomorrow, we’ll talk about optimization.

See you,

Polina
