Logistic Regression
Binary Logistic Regression
What Is Logistic Regression?
Binary logistic regression predicts the probability of a dichotomous outcome (e.g., pass/fail, admitted/rejected, disease/no disease) from one or more predictor variables. Unlike simple linear regression, which predicts a continuous value, logistic regression predicts the log-odds (logit) of the outcome occurring.
The core idea is that we cannot model a binary outcome with a straight line, because predicted probabilities must stay between 0 and 1. Logistic regression achieves this by applying the logistic (sigmoid) function:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

The model works on the logit scale:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$

where $\frac{p}{1 - p}$ is the odds of the outcome occurring, and the left-hand side is the natural log of the odds (the log-odds, or logit).
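The logit and its inverse, the sigmoid, can be sketched in a few lines of Python (the helper names are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """Inverse logit: maps any real log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Maps a probability p in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))

# Probabilities stay strictly between 0 and 1, even for extreme log-odds
print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1

# The two functions are inverses of each other
print(logit(sigmoid(0.7)))  # recovers 0.7
```

This is why the fitted curve flattens near 0 and 1 instead of crossing those bounds the way a straight line would.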
When to Use It
Use binary logistic regression when:
- Your dependent variable is dichotomous (exactly two categories, coded 0 and 1).
- You want to know which predictors increase or decrease the probability of the outcome.
- You want to quantify the effect of each predictor as an odds ratio.
- Your predictors may be continuous, categorical, or a mix of both.
Examples:
- Predicting whether a student passes or fails based on study hours and attendance
- Predicting whether a patient develops a disease based on risk factors
- Predicting whether a customer will purchase a product based on demographics
If your outcome has three or more categories, use multinomial logistic regression. If your outcome is continuous, use linear regression.
Assumptions
- **Binary dependent variable.** The outcome must have exactly two mutually exclusive categories (coded 0 and 1).
- **Independence of observations.** Each case is independent. Repeated measures or clustered data require extensions such as mixed-effects logistic regression.
- **No multicollinearity.** Predictors should not be too highly correlated. Check VIF values as in multiple regression.
- **Linearity of the logit.** Continuous predictors must have a linear relationship with the log-odds of the outcome, not with the outcome itself. Test with the Box-Tidwell procedure: include an interaction between each predictor and its natural log ($X \times \ln X$); if the interaction is significant, the assumption is violated.
- **Adequate sample size.** A common guideline is at least 10-20 events per predictor variable. With 2 predictors and a 30% event rate, you need at least $2 \times 10 / 0.30 \approx 67$ observations.
Note: Logistic regression does not assume normality of residuals or homoscedasticity — those are linear regression assumptions.
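The events-per-variable arithmetic behind the sample-size guideline can be checked directly (a sketch; the 10-events-per-predictor figure is the guideline quoted above):

```python
import math

def min_sample_size(n_predictors, event_rate, events_per_predictor=10):
    """Smallest n whose expected event count meets the events-per-variable guideline."""
    required_events = events_per_predictor * n_predictors
    return math.ceil(required_events / event_rate)

# 2 predictors, 30% event rate, 10 events per predictor
print(min_sample_size(2, 0.30))  # -> 67
```

With a rarer outcome (say a 5% event rate), the same two predictors would require 400 observations, which is why rare events demand much larger samples.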
Formula
The Logit Function

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$

Converting to Probability

$$p = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

Odds Ratio (OR)

The odds ratio for predictor $X_j$ is:

$$OR_j = e^{\beta_j}$$

An OR of 1.0 means no effect. An OR greater than 1 means higher odds of the outcome; an OR less than 1 means lower odds. For example, $OR = 2.5$ means that a one-unit increase in $X_j$ multiplies the odds of $Y = 1$ by 2.5.
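The relationship between a coefficient and its odds ratio is a single exponential, which is easy to verify numerically (the coefficient value here is made up for illustration):

```python
import math

beta = 0.9163  # coefficient on the log-odds scale (illustrative)
odds_ratio = math.exp(beta)
print(round(odds_ratio, 2))  # -> 2.5

# A one-unit increase in X multiplies the current odds by the OR
baseline_odds = 0.50
new_odds = baseline_odds * odds_ratio
print(round(new_odds, 3))  # -> 1.25
```

Note that the multiplication applies to odds, not to probabilities; the same OR shifts the probability by different amounts depending on the baseline.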
Wald Test
Each coefficient is tested individually using the Wald statistic:

$$z = \frac{\beta_j}{SE(\beta_j)}$$

The Wald $z$ statistic follows a standard normal distribution (equivalently, $z^2$ follows a chi-square distribution with $df = 1$). A significant Wald test indicates that the predictor contributes to the model beyond chance.
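A hand computation of the Wald test needs only the standard library (the coefficient and standard error below are made-up illustrative values):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta = 0.45  # estimated coefficient (illustrative)
se = 0.20    # its standard error (illustrative)

z = beta / se
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))  # two-sided test

print(round(z, 2))        # -> 2.25
print(round(p_value, 4))  # about 0.024
```

Statistical software reports the same quantity, sometimes as $z$ and sometimes as the squared chi-square version.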
Model Fit: Log-Likelihood and Deviance
Logistic regression uses maximum likelihood estimation (not least squares). Model fit is assessed by:
- -2 Log-Likelihood (-2LL): Lower values indicate better fit.
- Likelihood Ratio Test: Compares the fitted model to a null model (intercept only):

$$\chi^2 = (-2LL_{\text{null}}) - (-2LL_{\text{model}})$$

with $df = k$ (the number of predictors).
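The likelihood ratio statistic is just a difference of $-2LL$ values; a quick standard-library sketch (the log-likelihoods are made up for illustration, with $k = 2$ predictors):

```python
import math

# Log-likelihoods (illustrative values for a model with k = 2 predictors)
ll_null = -6.931   # intercept-only model
ll_model = -3.500  # fitted model

chi_sq = (-2 * ll_null) - (-2 * ll_model)
print(round(chi_sq, 3))  # -> 6.862

# For df = 2, the chi-square survival function has a closed form: exp(-x / 2)
p_value = math.exp(-chi_sq / 2)
print(round(p_value, 4))  # about 0.032
```

For other degrees of freedom you would use `scipy.stats.chi2.sf` instead of the df = 2 closed form.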
Pseudo-R² Measures

Logistic regression does not have a true $R^2$. Common pseudo-$R^2$ approximations include:

- Cox & Snell $R^2$: Cannot reach 1.0, which limits interpretation.
- Nagelkerke $R^2$: Adjusts Cox & Snell to range from 0 to 1. Most commonly reported.
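Both measures can be computed directly from the two log-likelihoods using the standard Cox & Snell and Nagelkerke formulas (the log-likelihood values are illustrative):

```python
import math

n = 10                       # sample size (illustrative)
ll_null = n * math.log(0.5)  # intercept-only log-likelihood for a 50/50 outcome
ll_model = -3.5              # fitted-model log-likelihood (illustrative)

# Cox & Snell R^2 = 1 - exp(2 * (ll_null - ll_model) / n)
cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)

# Its maximum attainable value is below 1.0 ...
max_cox_snell = 1 - math.exp(2 * ll_null / n)
print(round(max_cox_snell, 2))  # -> 0.75

# ... so Nagelkerke rescales it to the full 0-1 range
nagelkerke = cox_snell / max_cox_snell
print(round(cox_snell, 3), round(nagelkerke, 3))
```

This makes it clear why Nagelkerke is always at least as large as Cox & Snell for the same model.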
Worked Example
Scenario: An education researcher wants to predict whether students pass ($Y = 1$) or fail ($Y = 0$) a certification exam based on weekly study hours ($X_1$) and attendance rate as a percentage ($X_2$). Data from $n = 10$ students:

| Student | Study Hours ($X_1$) | Attendance % ($X_2$) | Pass ($Y$) |
|---|---|---|---|
| 1 | 5 | 60 | 0 |
| 2 | 10 | 80 | 1 |
| 3 | 3 | 50 | 0 |
| 4 | 12 | 90 | 1 |
| 5 | 8 | 70 | 0 |
| 6 | 15 | 85 | 1 |
| 7 | 6 | 65 | 0 |
| 8 | 14 | 95 | 1 |
| 9 | 9 | 75 | 1 |
| 10 | 4 | 55 | 0 |
Step 1: Fit the logistic regression model.
Using maximum likelihood estimation (via software), suppose the model yields:

$$\ln\left(\frac{p}{1 - p}\right) = -11.83 + 0.451 X_1 + 0.077 X_2$$
Step 2: Interpret the coefficients as odds ratios.
- Study hours: $OR = e^{0.451} = 1.57$. Each additional weekly study hour multiplies the odds of passing by 1.57 (a 57% increase in odds), holding attendance constant.
- Attendance: $OR = e^{0.077} = 1.08$. Each additional percentage point of attendance multiplies the odds of passing by 1.08 (an 8% increase in odds), holding study hours constant.
Step 3: Predict a specific case.
For a student with 10 study hours and 75% attendance:

$$\ln\left(\frac{p}{1 - p}\right) = -11.83 + 0.451(10) + 0.077(75) = -1.545$$

$$p = \frac{1}{1 + e^{1.545}} \approx 0.176$$

The predicted probability of passing is about 17.6%.
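The arithmetic can be checked in a few lines of Python. The coefficients below are illustrative values chosen to be consistent with the odds ratios reported in Step 2 ($e^{0.451} \approx 1.57$, $e^{0.077} \approx 1.08$):

```python
import math

# Illustrative coefficients consistent with the odds ratios in the text
b0, b1, b2 = -11.83, 0.451, 0.077

study_hours = 10
attendance = 75

log_odds = b0 + b1 * study_hours + b2 * attendance
prob = 1.0 / (1.0 + math.exp(-log_odds))

print(round(log_odds, 3))  # about -1.545
print(round(prob, 3))      # about 0.176
```

Repeating this for each row of the data gives the per-student predicted probabilities used in the classification table.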
Step 4: Evaluate model fit.
- Likelihood ratio test: significant ($p < .05$), so the model fits significantly better than the null model.
- Nagelkerke $R^2$: large, indicating the predictors explain a substantial portion of the variation in the outcome.
- Classification accuracy: The model correctly classifies 90% of cases (using a 0.50 probability cutoff).
Step 5: Test individual predictors.
- Study hours: Wald test significant ($p < .05$).
- Attendance: Wald test not significant at $\alpha = .05$.
Study hours is a significant unique predictor; attendance does not add significant predictive value beyond study hours in this small sample.
Interpretation
The model indicates that study hours is the stronger predictor of exam success. Each additional weekly study hour increases the odds of passing by 57% ($OR = 1.57$). Although attendance shows a positive trend, it does not reach statistical significance, possibly due to the small sample size ($n = 10$).
Understanding Odds Ratios
- $OR = 1$: No effect
- $OR > 1$: Higher odds of the outcome (the predictor increases the probability)
- $OR < 1$: Lower odds of the outcome (the predictor decreases the probability)
- The 95% confidence interval for the OR should not include 1.0 for the effect to be statistically significant
Predicted Probabilities vs. Odds Ratios
Odds ratios describe multiplicative changes in odds, which are not intuitive for everyone. Converting predictions to probabilities (as in Step 3) often helps with interpretation and communication to non-statistical audiences.
Common Mistakes
- **Using linear regression for a binary outcome.** Linear regression can produce predicted values below 0 or above 1, which are nonsensical for probabilities. Always use logistic regression for binary outcomes.
- **Interpreting coefficients as changes in probability.** The logistic regression coefficient represents a change in log-odds, not probability. Probability changes depend on the baseline probability and are nonlinear.
- **Ignoring the events-per-variable ratio.** With too few events relative to the number of predictors, the model will overfit and produce unstable coefficient estimates. Aim for at least 10 events per predictor.
- **Confusing odds with probability.** An OR of 2.0 does not mean "twice as likely." It means twice the odds. If the baseline probability is 0.10 (odds = 0.111), doubling the odds gives odds of 0.222, or a probability of 0.182, not 0.20.
- **Reporting only classification accuracy.** A model predicting a rare event (5% prevalence) can achieve 95% accuracy by always predicting "no." Report sensitivity, specificity, and the area under the ROC curve (AUC) alongside accuracy.
- **Forgetting to check the linearity of the logit.** Continuous predictors must be linearly related to the log-odds. Violations can be addressed with polynomial terms, splines, or categorization.
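Two of the pitfalls above, confusing odds with probability and leaning on raw accuracy, can be demonstrated numerically:

```python
# Pitfall: an odds ratio of 2.0 is not "twice as likely"
p_baseline = 0.10
odds = p_baseline / (1 - p_baseline)  # baseline odds = 0.111...
new_odds = odds * 2.0                 # OR = 2.0 doubles the odds
p_new = new_odds / (1 + new_odds)     # convert odds back to probability
print(round(p_new, 3))  # -> 0.182, not 0.20

# Pitfall: 95% accuracy on a 5%-prevalence event by always predicting "no"
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
sensitivity = sum(a == p == 1 for a, p in zip(actual, predicted)) / 5
print(accuracy)     # -> 0.95
print(sensitivity)  # -> 0.0 (the model never detects the event)
```

The second example is why sensitivity, specificity, and AUC belong alongside accuracy in any report.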
How to Run It
R

First fit the model with `glm()`. The `model` object is then reused by the snippets that follow (this assumes a data frame `mydata` with columns `pass`, `study_hours`, and `attendance`, matching the column names in the Python example):

```r
# Fit the model (family = binomial gives logistic regression)
model <- glm(pass ~ study_hours + attendance, data = mydata, family = binomial)
summary(model)

# Odds ratios and confidence intervals
exp(cbind(OR = coef(model), confint(model)))

# Model fit: likelihood ratio test
library(lmtest)
lrtest(model)

# Classification accuracy (0.50 probability cutoff)
predicted <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
table(Predicted = predicted, Actual = mydata$pass)
```
Python

```python
import statsmodels.api as sm
import numpy as np
# Fit the model
X = df[['study_hours', 'attendance']]
X = sm.add_constant(X)
model = sm.Logit(df['pass'], X).fit()
print(model.summary())
# Odds ratios
print("Odds Ratios:")
print(np.exp(model.params))
# Predicted probabilities
df['pred_prob'] = model.predict(X)
# Using sklearn for classification metrics
from sklearn.metrics import classification_report, roc_auc_score
y_pred = (df['pred_prob'] > 0.5).astype(int)
print(classification_report(df['pass'], y_pred))
print("AUC:", roc_auc_score(df['pass'], df['pred_prob']))
```
SPSS

1. Go to Analyze > Regression > Binary Logistic.
2. Move your dependent variable (e.g., Pass/Fail) into the Dependent box.
3. Move your predictor variables (e.g., Study Hours, Attendance) into the Covariates box.
4. Click Options and check Classification plots, Hosmer-Lemeshow goodness-of-fit, CI for exp(B), and Iteration history.
5. Click OK.
SPSS outputs the Omnibus Test of Model Coefficients (likelihood ratio chi-square), the Model Summary (Cox & Snell R² and Nagelkerke R²), the Hosmer-Lemeshow test, the Classification Table, and the Variables in the Equation table (B, SE, Wald, df, p, Exp(B) with 95% CI).
Excel

Excel does not have a built-in logistic regression tool. You have two main options:

- Solver approach: Set up the log-likelihood function manually using formulas, then use Data > Solver to maximize it by adjusting the coefficient cells. This is tedious but possible for simple models.
- Add-ins: Install a statistics add-in such as the Real Statistics Resource Pack (free), which adds a logistic regression function. After installing, go to Real Statistics > Regression > Logistic Regression, select your input ranges, and click OK.
For most applications, R, Python, or SPSS are more practical choices for logistic regression than Excel.
Ready to calculate?
Now that you understand the concept, use the free Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Chi-Square Test of Independence
Learn how to perform a chi-square test of independence to analyze associations between categorical variables, with formulas, examples, and Cramer's V.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.