Logistic Regression
Binary Logistic Regression
What Is Logistic Regression?
Binary logistic regression predicts the probability of a dichotomous outcome (e.g., pass/fail, admitted/rejected, disease/no disease) from one or more predictor variables. Unlike simple linear regression, which predicts a continuous value, logistic regression predicts the log-odds (logit) of the outcome occurring.
The core idea is that we cannot model a binary outcome with a straight line, because predicted probabilities must stay between 0 and 1. Logistic regression achieves this by applying the logistic (sigmoid) function:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

The model works on the logit scale:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$

where $\frac{p}{1 - p}$ is the odds of the outcome occurring, and the left-hand side is the natural log of the odds (the log-odds, or logit).
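The logit and its inverse, the sigmoid, can be sketched in a few lines of Python (the helper names are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """Inverse logit: maps any real log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Maps a probability p in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))

# Probabilities stay strictly between 0 and 1, even for extreme log-odds
print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1

# The two functions are inverses of each other
print(logit(sigmoid(0.7)))  # recovers 0.7
```

This is why the fitted curve flattens near 0 and 1 instead of crossing those bounds the way a straight line would.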
When to Use It
Use binary logistic regression when:
- Your dependent variable is dichotomous (exactly two categories, coded 0 and 1).
- You want to know which predictors increase or decrease the probability of the outcome.
- You want to quantify the effect of each predictor as an odds ratio.
- Your predictors may be continuous, categorical, or a mix of both.
Examples:
- Predicting whether a student passes or fails based on study hours and attendance
- Predicting whether a patient develops a disease based on risk factors
- Predicting whether a customer will purchase a product based on demographics
If your outcome has three or more categories, use multinomial logistic regression. If your outcome is continuous, use linear regression.
Assumptions
- **Binary dependent variable.** The outcome must have exactly two mutually exclusive categories (coded 0 and 1).
- **Independence of observations.** Each case is independent. Repeated measures or clustered data require extensions such as mixed-effects logistic regression.
- **No multicollinearity.** Predictors should not be too highly correlated. Check VIF values as in multiple regression.
- **Linearity of the logit.** Continuous predictors must have a linear relationship with the log-odds of the outcome, not with the outcome itself. Test with the Box-Tidwell procedure: include an interaction between each predictor and its natural log ($X \times \ln X$); if the interaction is significant, the assumption is violated.
- **Adequate sample size.** A common guideline is at least 10-20 events per predictor variable. With 2 predictors and a 30% event rate, you need at least $2 \times 10 / 0.30 \approx 67$ observations.
Note: Logistic regression does not assume normality of residuals or homoscedasticity — those are linear regression assumptions.
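The events-per-variable arithmetic behind the sample-size guideline can be checked directly (a sketch; the 10-events-per-predictor figure is the guideline quoted above):

```python
import math

def min_sample_size(n_predictors, event_rate, events_per_predictor=10):
    """Smallest n whose expected event count meets the events-per-variable guideline."""
    required_events = events_per_predictor * n_predictors
    return math.ceil(required_events / event_rate)

# 2 predictors, 30% event rate, 10 events per predictor
print(min_sample_size(2, 0.30))  # -> 67
```

With a rarer outcome (say a 5% event rate), the same two predictors would require 400 observations, which is why rare events demand much larger samples.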
Formula
The Logit Function

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$

Converting to Probability

$$p = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

Odds Ratio (OR)

The odds ratio for predictor $X_j$ is:

$$OR_j = e^{\beta_j}$$

An OR of 1.0 means no effect. An OR greater than 1 means higher odds of the outcome; an OR less than 1 means lower odds. For example, $OR = 2.5$ means that a one-unit increase in $X_j$ multiplies the odds of $Y = 1$ by 2.5.
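The relationship between a coefficient and its odds ratio is a single exponential, which is easy to verify numerically (the coefficient value here is made up for illustration):

```python
import math

beta = 0.9163  # coefficient on the log-odds scale (illustrative)
odds_ratio = math.exp(beta)
print(round(odds_ratio, 2))  # -> 2.5

# A one-unit increase in X multiplies the current odds by the OR
baseline_odds = 0.50
new_odds = baseline_odds * odds_ratio
print(round(new_odds, 3))  # -> 1.25
```

Note that the multiplication applies to odds, not to probabilities; the same OR shifts the probability by different amounts depending on the baseline.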
Wald Test
Each coefficient is tested individually using the Wald statistic:

$$z = \frac{\beta_j}{SE(\beta_j)}$$

The Wald $z$ statistic follows a standard normal distribution (equivalently, $z^2$ follows a chi-square distribution with $df = 1$). A significant Wald test indicates that the predictor contributes to the model beyond chance.
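A hand computation of the Wald test needs only the standard library (the coefficient and standard error below are made-up illustrative values):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta = 0.45  # estimated coefficient (illustrative)
se = 0.20    # its standard error (illustrative)

z = beta / se
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))  # two-sided test

print(round(z, 2))        # -> 2.25
print(round(p_value, 4))  # about 0.024
```

Statistical software reports the same quantity, sometimes as $z$ and sometimes as the squared chi-square version.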
Model Fit: Log-Likelihood and Deviance
Logistic regression uses maximum likelihood estimation (not least squares). Model fit is assessed by:
- -2 Log-Likelihood (-2LL): Lower values indicate better fit.
- Likelihood Ratio Test: Compares the fitted model to a null model (intercept only):

$$\chi^2 = (-2LL_{\text{null}}) - (-2LL_{\text{model}})$$

with $df = k$ (the number of predictors).
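The likelihood ratio statistic is just a difference of $-2LL$ values; a quick standard-library sketch (the log-likelihoods are made up for illustration, with $k = 2$ predictors):

```python
import math

# Log-likelihoods (illustrative values for a model with k = 2 predictors)
ll_null = -6.931   # intercept-only model
ll_model = -3.500  # fitted model

chi_sq = (-2 * ll_null) - (-2 * ll_model)
print(round(chi_sq, 3))  # -> 6.862

# For df = 2, the chi-square survival function has a closed form: exp(-x / 2)
p_value = math.exp(-chi_sq / 2)
print(round(p_value, 4))  # about 0.032
```

For other degrees of freedom you would use `scipy.stats.chi2.sf` instead of the df = 2 closed form.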
Pseudo-R² Measures

Logistic regression does not have a true $R^2$. Common pseudo-$R^2$ approximations include:

- Cox & Snell $R^2$: Cannot reach 1.0, which limits interpretation.
- Nagelkerke $R^2$: Adjusts Cox & Snell to range from 0 to 1. Most commonly reported.
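Both measures can be computed directly from the two log-likelihoods using the standard Cox & Snell and Nagelkerke formulas (the log-likelihood values are illustrative):

```python
import math

n = 10                       # sample size (illustrative)
ll_null = n * math.log(0.5)  # intercept-only log-likelihood for a 50/50 outcome
ll_model = -3.5              # fitted-model log-likelihood (illustrative)

# Cox & Snell R^2 = 1 - exp(2 * (ll_null - ll_model) / n)
cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)

# Its maximum attainable value is below 1.0 ...
max_cox_snell = 1 - math.exp(2 * ll_null / n)
print(round(max_cox_snell, 2))  # -> 0.75

# ... so Nagelkerke rescales it to the full 0-1 range
nagelkerke = cox_snell / max_cox_snell
print(round(cox_snell, 3), round(nagelkerke, 3))
```

This makes it clear why Nagelkerke is always at least as large as Cox & Snell for the same model.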
Worked Example
Scenario: An education researcher wants to predict whether students pass ($Y = 1$) or fail ($Y = 0$) a certification exam based on weekly study hours ($X_1$) and attendance rate as a percentage ($X_2$). Data from $n = 10$ students:

| Student | Study Hours ($X_1$) | Attendance % ($X_2$) | Pass ($Y$) |
|---|---|---|---|
| 1 | 5 | 60 | 0 |
| 2 | 10 | 80 | 1 |
| 3 | 3 | 50 | 0 |
| 4 | 12 | 90 | 1 |
| 5 | 8 | 70 | 0 |
| 6 | 15 | 85 | 1 |
| 7 | 6 | 65 | 0 |
| 8 | 14 | 95 | 1 |
| 9 | 9 | 75 | 1 |
| 10 | 4 | 55 | 0 |
Step 1: Fit the logistic regression model.
Using maximum likelihood estimation (via software), suppose the model yields:

$$\ln\left(\frac{p}{1 - p}\right) = -11.83 + 0.451 X_1 + 0.077 X_2$$
Step 2: Interpret the coefficients as odds ratios.
- Study hours: $OR = e^{0.451} = 1.57$. Each additional weekly study hour multiplies the odds of passing by 1.57 (a 57% increase in odds), holding attendance constant.
- Attendance: $OR = e^{0.077} = 1.08$. Each additional percentage point of attendance multiplies the odds of passing by 1.08 (an 8% increase in odds), holding study hours constant.
Step 3: Predict a specific case.
For a student with 10 study hours and 75% attendance:

$$\ln\left(\frac{p}{1 - p}\right) = -11.83 + 0.451(10) + 0.077(75) = -1.545$$

$$p = \frac{1}{1 + e^{1.545}} \approx 0.176$$

The predicted probability of passing is about 17.6%.
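The arithmetic can be checked in a few lines of Python. The coefficients below are illustrative values chosen to be consistent with the odds ratios reported in Step 2 ($e^{0.451} \approx 1.57$, $e^{0.077} \approx 1.08$):

```python
import math

# Illustrative coefficients consistent with the odds ratios in the text
b0, b1, b2 = -11.83, 0.451, 0.077

study_hours = 10
attendance = 75

log_odds = b0 + b1 * study_hours + b2 * attendance
prob = 1.0 / (1.0 + math.exp(-log_odds))

print(round(log_odds, 3))  # about -1.545
print(round(prob, 3))      # about 0.176
```

Repeating this for each row of the data gives the per-student predicted probabilities used in the classification table.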
Step 4: Evaluate model fit.
- Likelihood ratio test: significant ($p < .05$), so the model fits significantly better than the null model.
- Nagelkerke $R^2$: large, indicating the predictors explain a substantial portion of the variation in the outcome.
- Classification accuracy: The model correctly classifies 90% of cases (using a 0.50 probability cutoff).
Step 5: Test individual predictors.
- Study hours: Wald test significant ($p < .05$).
- Attendance: Wald test not significant at $\alpha = .05$.
Study hours is a significant unique predictor; attendance does not add significant predictive value beyond study hours in this small sample.
Interpretation
The model indicates that study hours is the stronger predictor of exam success. Each additional weekly study hour increases the odds of passing by 57% ($OR = 1.57$). Although attendance shows a positive trend, it does not reach statistical significance, possibly due to the small sample size ($n = 10$).
Understanding Odds Ratios
- $OR = 1$: No effect
- $OR > 1$: Higher odds of the outcome (the predictor increases the probability)
- $OR < 1$: Lower odds of the outcome (the predictor decreases the probability)
- The 95% confidence interval for the OR should not include 1.0 for the effect to be statistically significant
Predicted Probabilities vs. Odds Ratios
Odds ratios describe multiplicative changes in odds, which are not intuitive for everyone. Converting predictions to probabilities (as in Step 3) often helps with interpretation and communication to non-statistical audiences.
Common Mistakes
- **Using linear regression for a binary outcome.** Linear regression can produce predicted values below 0 or above 1, which are nonsensical for probabilities. Always use logistic regression for binary outcomes.
- **Interpreting coefficients as changes in probability.** The logistic regression coefficient represents a change in log-odds, not probability. Probability changes depend on the baseline probability and are nonlinear.
- **Ignoring the events-per-variable ratio.** With too few events relative to the number of predictors, the model will overfit and produce unstable coefficient estimates. Aim for at least 10 events per predictor.
- **Confusing odds with probability.** An OR of 2.0 does not mean "twice as likely." It means twice the odds. If the baseline probability is 0.10 (odds = 0.111), doubling the odds gives odds of 0.222, or a probability of 0.182, not 0.20.
- **Reporting only classification accuracy.** A model predicting a rare event (5% prevalence) can achieve 95% accuracy by always predicting "no." Report sensitivity, specificity, and the area under the ROC curve (AUC) alongside accuracy.
- **Forgetting to check the linearity of the logit.** Continuous predictors must be linearly related to the log-odds. Violations can be addressed with polynomial terms, splines, or categorization.
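Two of the pitfalls above, confusing odds with probability and leaning on raw accuracy, can be demonstrated numerically:

```python
# Pitfall: an odds ratio of 2.0 is not "twice as likely"
p_baseline = 0.10
odds = p_baseline / (1 - p_baseline)  # baseline odds = 0.111...
new_odds = odds * 2.0                 # OR = 2.0 doubles the odds
p_new = new_odds / (1 + new_odds)     # convert odds back to probability
print(round(p_new, 3))  # -> 0.182, not 0.20

# Pitfall: 95% accuracy on a 5%-prevalence event by always predicting "no"
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
sensitivity = sum(a == p == 1 for a, p in zip(actual, predicted)) / 5
print(accuracy)     # -> 0.95
print(sensitivity)  # -> 0.0 (the model never detects the event)
```

The second example is why sensitivity, specificity, and AUC belong alongside accuracy in any report.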
How to Run It
R

First fit the model with `glm()`. The `model` object is then reused by the snippets that follow (this assumes a data frame `mydata` with columns `pass`, `study_hours`, and `attendance`, matching the column names in the Python example):

```r
# Fit the model (family = binomial gives logistic regression)
model <- glm(pass ~ study_hours + attendance, data = mydata, family = binomial)
summary(model)

# Odds ratios and confidence intervals
exp(cbind(OR = coef(model), confint(model)))

# Model fit: likelihood ratio test
library(lmtest)
lrtest(model)

# Classification accuracy (0.50 probability cutoff)
predicted <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
table(Predicted = predicted, Actual = mydata$pass)
```
Python

```python
import statsmodels.api as sm
import numpy as np
# Fit the model
X = df[['study_hours', 'attendance']]
X = sm.add_constant(X)
model = sm.Logit(df['pass'], X).fit()
print(model.summary())
# Odds ratios
print("Odds Ratios:")
print(np.exp(model.params))
# Predicted probabilities
df['pred_prob'] = model.predict(X)
# Using sklearn for classification metrics
from sklearn.metrics import classification_report, roc_auc_score
y_pred = (df['pred_prob'] > 0.5).astype(int)
print(classification_report(df['pass'], y_pred))
print("AUC:", roc_auc_score(df['pass'], df['pred_prob']))
```
SPSS

1. Go to Analyze > Regression > Binary Logistic.
2. Move your dependent variable (e.g., Pass/Fail) into the Dependent box.
3. Move your predictor variables (e.g., Study Hours, Attendance) into the Covariates box.
4. Click Options and check Classification plots, Hosmer-Lemeshow goodness-of-fit, CI for exp(B), and Iteration history.
5. Click OK.
SPSS outputs the Omnibus Test of Model Coefficients (likelihood ratio chi-square), the Model Summary (Cox & Snell R² and Nagelkerke R²), the Hosmer-Lemeshow test, the Classification Table, and the Variables in the Equation table (B, SE, Wald, df, p, Exp(B) with 95% CI).
Excel

Excel does not have a built-in logistic regression tool. You have two main options:

- Solver approach: Set up the log-likelihood function manually using formulas, then use Data > Solver to maximize it by adjusting the coefficient cells. This is tedious but possible for simple models.
- Add-ins: Install a statistics add-in such as the Real Statistics Resource Pack (free), which adds a logistic regression function. After installing, go to Real Statistics > Regression > Logistic Regression, select your input ranges, and click OK.
For most applications, R, Python, or SPSS are more practical choices for logistic regression than Excel.
Ready to calculate?
Now that you understand the concept, use the free Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Chi-Square Test of Independence
Learn how to perform a chi-square test of independence to analyze associations between categorical variables, with formulas, examples, and Cramer's V.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.