Stats for Scholars

Logistic Regression

Advanced · Inferential Statistics

Binary Logistic Regression

Purpose
Predicts the probability of a binary outcome (e.g., pass/fail, yes/no) from one or more predictor variables.
When to Use
When your dependent variable is dichotomous (two categories) and you want to model how one or more predictors influence the probability of one outcome.
Data Type
One binary outcome variable; one or more continuous or categorical predictors
Key Assumptions
Binary dependent variable, independence of observations, no multicollinearity among predictors, linearity between continuous predictors and the log-odds of the outcome, adequate sample size.
Tools
Research Tools on Subthesis →

What Is Logistic Regression?

Binary logistic regression predicts the probability of a dichotomous outcome (e.g., pass/fail, admitted/rejected, disease/no disease) from one or more predictor variables. Unlike simple linear regression, which predicts a continuous value, logistic regression predicts the log-odds (logit) of the outcome occurring.

The core idea is that we cannot model a binary outcome with a straight line — predicted probabilities must stay between 0 and 1. Logistic regression achieves this by applying the logistic (sigmoid) function:

$$P(Y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p)}}$$

The model works on the logit scale:

$$\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$$

where $\frac{P}{1-P}$ is the odds of the outcome occurring, and the left-hand side is the natural log of the odds (the log-odds or logit).
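The logit-to-probability mapping and its inverse can be sketched in a few lines of Python (a minimal, dataset-free illustration; the function names are ours):

```python
import math

def logit_to_prob(logit):
    """Apply the logistic (sigmoid) function: maps any log-odds to (0, 1)."""
    return 1 / (1 + math.exp(-logit))

def prob_to_logit(p):
    """Inverse transform: natural log of the odds P / (1 - P)."""
    return math.log(p / (1 - p))

print(logit_to_prob(0.0))            # 0.5 — a logit of 0 means even odds
print(round(prob_to_logit(0.5), 4))  # 0.0
```

Because the sigmoid is bounded by 0 and 1, no combination of predictor values can produce an impossible probability.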

When to Use It

Use binary logistic regression when:

  • Your dependent variable is dichotomous (exactly two categories, coded 0 and 1).
  • You want to know which predictors increase or decrease the probability of the outcome.
  • You want to quantify the effect of each predictor as an odds ratio.
  • Your predictors may be continuous, categorical, or a mix of both.

Examples:

  • Predicting whether a student passes or fails based on study hours and attendance
  • Predicting whether a patient develops a disease based on risk factors
  • Predicting whether a customer will purchase a product based on demographics

If your outcome has three or more categories, use multinomial logistic regression. If your outcome is continuous, use linear regression.

Assumptions

  1. Binary dependent variable. The outcome must have exactly two mutually exclusive categories (coded 0 and 1).

  2. Independence of observations. Each case is independent. Repeated measures or clustered data require extensions such as mixed-effects logistic regression.

  3. No multicollinearity. Predictors should not be too highly correlated. Check VIF values as in multiple regression.

  4. Linearity of the logit. Continuous predictors must have a linear relationship with the log-odds of the outcome — not with the outcome itself. Test by including an interaction between the predictor and its natural log ($X \times \ln X$, the Box-Tidwell test); if significant, the assumption is violated.

  5. Adequate sample size. A common guideline is at least 10-20 events per predictor variable. With 2 predictors and a 30% event rate, you need at least $\frac{2 \times 10}{0.30} \approx 67$ observations.

Note: Logistic regression does not assume normality of residuals or homoscedasticity — those are linear regression assumptions.
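The events-per-variable guideline in assumption 5 is simple arithmetic; a small helper (hypothetical name, standard library only) makes the calculation explicit:

```python
import math

def min_sample_size(n_predictors, event_rate, events_per_predictor=10):
    """Smallest n such that expected events meet the per-predictor guideline."""
    return math.ceil(n_predictors * events_per_predictor / event_rate)

# 2 predictors, 30% event rate, 10 events per predictor:
print(min_sample_size(2, 0.30))  # 67
```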

Formula

The Logit Function

$$\text{logit}(P) = \ln\left(\frac{P}{1-P}\right) = b_0 + b_1 X_1 + \cdots + b_p X_p$$

Converting to Probability

$$P(Y = 1) = \frac{e^{b_0 + b_1 X_1 + \cdots + b_p X_p}}{1 + e^{b_0 + b_1 X_1 + \cdots + b_p X_p}}$$

Odds Ratio (OR)

The odds ratio for predictor $X_j$ is:

$$OR_j = e^{b_j}$$

An OR of 1.0 means no effect. An OR greater than 1 means higher odds of the outcome; an OR less than 1 means lower odds. For example, $OR = 2.5$ means that a one-unit increase in $X_j$ multiplies the odds of $Y = 1$ by 2.5.
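As a quick numeric check (the coefficient values below are illustrative, not from any real model):

```python
import math

def odds_ratio(b):
    """Odds ratio for a one-unit increase in predictor X_j."""
    return math.exp(b)

# A coefficient near 0.916 corresponds to an OR of about 2.5:
print(round(odds_ratio(0.9163), 2))  # 2.5
# A negative coefficient gives OR < 1 (lower odds of the outcome):
print(round(odds_ratio(-0.5), 2))    # 0.61
```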

Wald Test

Each coefficient is tested individually using the Wald statistic:

$$z = \frac{b_j}{SE_{b_j}}$$

The Wald statistic follows a standard normal distribution (or $z^2$ follows a chi-square with $df = 1$). A significant Wald test indicates that the predictor contributes to the model beyond chance.
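The Wald test is easy to compute by hand. The sketch below uses the study-hours coefficient and standard error reported in the worked example on this page ($b = 0.45$, $SE = 0.21$), and only the standard library:

```python
from statistics import NormalDist

def wald_test(b, se):
    """Wald z statistic and two-sided p-value for a single coefficient."""
    z = b / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = wald_test(0.45, 0.21)
print(round(z, 2), round(p, 3))  # 2.14 0.032
```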

Model Fit: Log-Likelihood and Deviance

Logistic regression uses maximum likelihood estimation (not least squares). Model fit is assessed by:

  • -2 Log-Likelihood (-2LL): Lower values indicate better fit.
  • Likelihood Ratio Test: Compares the fitted model to a null model (intercept only):

$$\chi^2 = -2LL_{null} - (-2LL_{model})$$

with $df = p$ (the number of predictors).
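The likelihood ratio statistic is just the difference of the two $-2LL$ values, and its p-value comes from the chi-square distribution. A sketch using scipy (assuming it is installed; the $-2LL$ values are illustrative, chosen so the difference matches the worked example's $\chi^2 = 11.36$):

```python
from scipy.stats import chi2

neg2ll_null, neg2ll_model = 13.86, 2.50   # illustrative -2LL values
lr_chi2 = neg2ll_null - neg2ll_model      # 11.36
p_value = chi2.sf(lr_chi2, df=2)          # 2 predictors -> df = 2
print(round(lr_chi2, 2), round(p_value, 3))  # 11.36 0.003
```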

Pseudo-$R^2$ Measures

Logistic regression does not have a true $R^2$. Common pseudo-$R^2$ approximations include:

  • Cox & Snell $R^2$: Cannot reach 1.0, which limits interpretation.
  • Nagelkerke $R^2$: Adjusts Cox & Snell to range from 0 to 1. Most commonly reported.
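Both measures can be computed directly from the null and fitted log-likelihoods. The formulas below are the standard ones; the log-likelihood values are made up for illustration:

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Cox & Snell and Nagelkerke pseudo-R-squared from log-likelihoods."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - math.exp(2 * ll_null / n)  # upper bound of Cox & Snell
    nagelkerke = cox_snell / max_cox_snell         # rescaled to reach 1.0
    return cox_snell, nagelkerke

cs, nk = pseudo_r2(ll_null=-6.93, ll_model=-2.0, n=10)
print(round(cs, 2), round(nk, 2))  # Nagelkerke is always >= Cox & Snell
```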

Worked Example

Scenario: An education researcher wants to predict whether students pass ($Y = 1$) or fail ($Y = 0$) a certification exam based on weekly study hours ($X_1$) and attendance rate as a percentage ($X_2$). Data from $n = 10$ students:

| Student | Study Hours ($X_1$) | Attendance % ($X_2$) | Pass ($Y$) |
|---------|---------------------|----------------------|------------|
| 1       | 5                   | 60                   | 0          |
| 2       | 10                  | 80                   | 1          |
| 3       | 3                   | 50                   | 0          |
| 4       | 12                  | 90                   | 1          |
| 5       | 8                   | 70                   | 0          |
| 6       | 15                  | 85                   | 1          |
| 7       | 6                   | 65                   | 0          |
| 8       | 14                  | 95                   | 1          |
| 9       | 9                   | 75                   | 1          |
| 10      | 4                   | 55                   | 0          |

Step 1: Fit the logistic regression model.

Using maximum likelihood estimation (via software), suppose the model yields:

$$\text{logit}(\hat{P}) = -12.04 + 0.45 X_1 + 0.08 X_2$$

Step 2: Interpret the coefficients as odds ratios.

  • $OR_{X_1} = e^{0.45} = 1.57$: Each additional weekly study hour multiplies the odds of passing by 1.57 (a 57% increase in odds), holding attendance constant.
  • $OR_{X_2} = e^{0.08} = 1.08$: Each additional percentage point of attendance multiplies the odds of passing by 1.08 (an 8% increase in odds), holding study hours constant.

Step 3: Predict a specific case.

For a student with 10 study hours and 75% attendance:

$$\text{logit}(\hat{P}) = -12.04 + 0.45(10) + 0.08(75) = -12.04 + 4.50 + 6.00 = -1.54$$

$$\hat{P} = \frac{e^{-1.54}}{1 + e^{-1.54}} = \frac{0.214}{1.214} = 0.176$$

The predicted probability of passing is about 17.6%.
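Step 3 can be verified numerically with the fitted coefficients from Step 1:

```python
import math

b0, b1, b2 = -12.04, 0.45, 0.08   # intercept, study hours, attendance
hours, attendance = 10, 75

logit = b0 + b1 * hours + b2 * attendance
prob = math.exp(logit) / (1 + math.exp(logit))
print(round(logit, 2))  # -1.54
print(round(prob, 2))   # 0.18 (about 17.7% before rounding)
```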

Step 4: Evaluate model fit.

  • Likelihood ratio test: $\chi^2(2) = 11.36$, $p = .003$ — the model is significantly better than the null.
  • Nagelkerke $R^2 = .82$ — the predictors explain a substantial portion of the variability in the outcome.
  • Classification accuracy: The model correctly classifies 90% of cases (using a 0.50 probability cutoff).

Step 5: Test individual predictors.

  • Study hours: Wald $z = 2.14$, $p = .032$ — significant.
  • Attendance: Wald $z = 1.58$, $p = .114$ — not significant at $\alpha = .05$.

Study hours is a significant unique predictor; attendance does not add significant predictive value beyond study hours in this small sample.

Interpretation

The model indicates that study hours is the stronger predictor of exam success. Each additional weekly study hour increases the odds of passing by 57% ($OR = 1.57$). Although attendance shows a positive trend, it does not reach statistical significance, possibly due to the small sample size ($n = 10$).

Understanding Odds Ratios

  • $OR = 1$: No effect
  • $OR > 1$: Higher odds of the outcome (the predictor increases the probability)
  • $OR < 1$: Lower odds of the outcome (the predictor decreases the probability)
  • The 95% confidence interval for the OR should not include 1.0 for the effect to be statistically significant

Predicted Probabilities vs. Odds Ratios

Odds ratios describe multiplicative changes in odds, which are not intuitive for everyone. Converting predictions to probabilities (as in Step 3) often helps with interpretation and communication to non-statistical audiences.
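The conversion between odds and probability is one line in each direction (a minimal sketch). It also shows why a doubled odds does not mean a doubled probability:

```python
def prob_to_odds(p):
    """Odds = P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# Doubling the odds from a 10% baseline does NOT give 20%:
doubled = 2.0 * prob_to_odds(0.10)      # 0.111... -> 0.222...
print(round(odds_to_prob(doubled), 3))  # 0.182
```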

Common Mistakes

  1. Using linear regression for a binary outcome. Linear regression can produce predicted values below 0 or above 1, which are nonsensical for probabilities. Always use logistic regression for binary outcomes.

  2. Interpreting coefficients as changes in probability. The logistic regression coefficient bbb represents a change in log-odds, not probability. Probability changes depend on the baseline probability and are nonlinear.

  3. Ignoring the events-per-variable ratio. With too few events relative to the number of predictors, the model will overfit and produce unstable coefficient estimates. Aim for at least 10 events per predictor.

  4. Confusing odds with probability. An OR of 2.0 does not mean "twice as likely." It means twice the odds. If the baseline probability is 0.10 (odds = 0.111), doubling the odds gives 0.222, or a probability of 0.182 — not 0.20.

  5. Reporting only classification accuracy. A model predicting a rare event (5% prevalence) can achieve 95% accuracy by always predicting "no." Report sensitivity, specificity, and the area under the ROC curve (AUC) alongside accuracy.

  6. Forgetting to check the linearity of the logit. Continuous predictors must be linearly related to the log-odds. Violations can be addressed with polynomial terms or categorization.
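Mistake 5 above is easy to demonstrate. The sketch below (hypothetical data, helper name is ours) scores a model that always predicts "no" on a 5%-prevalence outcome:

```python
def classification_metrics(actual, predicted):
    """Accuracy, sensitivity, and specificity from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity

# 5 events out of 100 cases; the model never predicts the event
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
print(classification_metrics(actual, predicted))  # (0.95, 0.0, 1.0)
```

95% accuracy, yet the model detects none of the events — exactly why sensitivity and AUC must be reported alongside accuracy.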

How to Run It

R

```r
# Binary logistic regression in R
model <- glm(pass ~ study_hours + attendance,
             data = mydata, family = binomial)
summary(model)

# Odds ratios and confidence intervals
exp(cbind(OR = coef(model), confint(model)))

# Model fit: likelihood ratio test
library(lmtest)
lrtest(model)

# Classification accuracy
predicted <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
table(Predicted = predicted, Actual = mydata$pass)
```

Python

```python
import numpy as np
import statsmodels.api as sm

# Fit the model
X = df[['study_hours', 'attendance']]
X = sm.add_constant(X)
model = sm.Logit(df['pass'], X).fit()
print(model.summary())

# Odds ratios
print("Odds Ratios:")
print(np.exp(model.params))

# Predicted probabilities
df['pred_prob'] = model.predict(X)

# Using sklearn for classification metrics
from sklearn.metrics import classification_report, roc_auc_score
y_pred = (df['pred_prob'] > 0.5).astype(int)
print(classification_report(df['pass'], y_pred))
```
SPSS

  1. Go to Analyze > Regression > Binary Logistic
  2. Move your dependent variable (e.g., Pass/Fail) into the Dependent box
  3. Move your predictor variables (e.g., Study Hours, Attendance) into the Covariates box
  4. Click Options and check Classification plots, Hosmer-Lemeshow goodness-of-fit, CI for exp(B), and Iteration history
  5. Click OK

SPSS outputs the Omnibus Test of Model Coefficients (likelihood ratio chi-square), the Model Summary (Cox & Snell R² and Nagelkerke R²), the Hosmer-Lemeshow test, the Classification Table, and the Variables in the Equation table (B, SE, Wald, df, p, Exp(B) with 95% CI).

Excel

Excel does not have a built-in logistic regression tool. You have two main options:

  1. Solver approach: Set up the log-likelihood function manually using formulas, then use Data > Solver to maximize it by adjusting the coefficient cells. This is tedious but possible for simple models.
  2. Add-ins: Install a statistics add-in such as Real Statistics Resource Pack (free), which adds a logistic regression function:
    • After installing, go to Real Statistics > Regression > Logistic Regression
    • Select your input ranges and click OK

For most applications, R, Python, or SPSS are more practical choices for logistic regression than Excel.

How to Report in APA Format

> A binary logistic regression was conducted to predict exam pass/fail status from weekly study hours and attendance rate. The overall model was statistically significant, $\chi^2(2) = 11.36$, $p = .003$, Nagelkerke $R^2 = .82$, and correctly classified 90% of cases. Study hours was a significant predictor ($b = 0.45$, $SE = 0.21$, Wald $= 4.59$, $p = .032$, $OR = 1.57$, 95% CI [1.04, 2.37]); each additional study hour multiplied the odds of passing by 1.57. Attendance was not a significant predictor ($b = 0.08$, $SE = 0.05$, Wald $= 2.50$, $p = .114$, $OR = 1.08$, 95% CI [0.98, 1.20]).

Key elements to include:

  • The likelihood ratio chi-square test with degrees of freedom
  • A pseudo-$R^2$ measure (specify which: Nagelkerke, Cox & Snell, or McFadden)
  • Classification accuracy (and sensitivity/specificity if relevant)
  • For each predictor: $b$, $SE$, Wald statistic, $p$-value, odds ratio with 95% CI

Ready to calculate?

Now that you understand the concept, use the free Research Tools on Subthesis to run your own analysis.

Explore Research Tools on Subthesis

Related Concepts

Simple Linear Regression

Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.

Chi-Square Test of Independence

Learn how to perform a chi-square test of independence to analyze associations between categorical variables, with formulas, examples, and Cramer's V.

Effect Size

Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.

© 2026 Angel Reyes / Subthesis. All rights reserved.