Getting Started with Python for Research

Python is a free, general-purpose programming language that has become one of the most popular tools for data analysis and research. Its ecosystem of scientific libraries — scipy, pandas, statsmodels, and pingouin — gives you everything from basic descriptive statistics to advanced machine learning. If your field is moving toward data science, or your advisor uses Python, learning it is a strong investment.

This guide will get you from zero to running your first analysis.

Installing Python

You need two things:

  1. Python — Download it from python.org or, better yet, install the Anaconda distribution, which bundles Python with all the scientific packages you will need. Anaconda is the easiest option for researchers.
  2. An editor — Anaconda comes with Jupyter Notebook, which lets you write code and see results in the same document — ideal for analysis and reporting. Alternatively, VS Code with the Python extension is a lightweight, full-featured editor.

If you install Anaconda, scipy, pandas, and numpy are already included. You are ready to go.

Jupyter Notebook Basics

When you open Jupyter Notebook (launch it from Anaconda Navigator or type jupyter notebook in your terminal), you work in cells:

  • Code cells — Where you type and run Python code. Press Shift + Enter to execute a cell.
  • Markdown cells — Where you type notes, headings, and explanations. Use these to document your analysis.

Think of a Jupyter Notebook as a lab notebook: code, output, and your reasoning all live together. This makes your analysis inherently reproducible.

Basic Python Syntax

Python is known for clean, readable syntax. Here are the essentials:

# Assign a value to a variable
x = 5

# Create a list of numbers
scores = [85, 90, 78, 92, 88, 76, 95, 83]

# Import numpy for basic statistics
import numpy as np

# Calculate the mean and standard deviation
np.mean(scores)
np.std(scores, ddof=1)  # ddof=1 for sample SD

Important: Use ddof=1 when calculating the standard deviation of a sample (which is almost always the case in research). By default, numpy uses ddof=0, which gives the population SD.
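
To see the difference, run both versions on the same scores. The exact numbers below come from the eight-item list used earlier in this guide:

```python
import numpy as np

scores = [85, 90, 78, 92, 88, 76, 95, 83]

pop_sd = np.std(scores)             # ddof=0: population SD (divides by n)
sample_sd = np.std(scores, ddof=1)  # ddof=1: sample SD (divides by n - 1)

print(f"population SD = {pop_sd:.3f}")  # slightly smaller
print(f"sample SD     = {sample_sd:.3f}")
```

The sample SD is always a bit larger because dividing by n - 1 instead of n corrects for the fact that a sample underestimates the population's spread.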

Loading Data

Most researchers work with CSV or Excel files. Pandas is the standard library for data manipulation:

import pandas as pd

# Load a CSV file
df = pd.read_csv("path/to/your/datafile.csv")

# View the first few rows
df.head()

# Get a quick summary of all variables
df.describe()

# Check data types and missing values
df.info()

For Excel files:

# Read an Excel file
df = pd.read_excel("path/to/your/datafile.xlsx")

Pandas DataFrames work like spreadsheets in code — each column is a variable, each row is an observation. You access columns with df['column_name'].
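
A quick sketch of column access and row filtering, using a small made-up DataFrame in place of a loaded CSV (the column names group and score are just examples):

```python
import pandas as pd

# A small example DataFrame (stand-in for your loaded CSV)
df = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control"],
    "score": [88, 92, 75, 80],
})

# Select one column (returns a Series)
scores = df["score"]

# Filter rows: keep only the treatment group
treatment = df[df["group"] == "treatment"]

# Group-wise means in one line
print(df.groupby("group")["score"].mean())
```

The filtering pattern df[df["group"] == "treatment"] — a boolean condition inside square brackets — is the same one used in the t-test example below.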

Running a t-Test

Suppose you have two groups and want to compare their means. Use scipy:

from scipy import stats

# Separate the groups
treatment = df[df['group'] == 'treatment']['score']
control = df[df['group'] == 'control']['score']

# Independent samples t-test (Student's by default;
# pass equal_var=False for Welch's test, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
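
Scipy reports only the t-statistic and p-value, so an effect size such as Cohen's d has to be computed by hand. A minimal sketch with made-up scores, using the pooled-SD formula for independent groups:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two independent groups
treatment = np.array([88, 92, 85, 90, 87, 91])
control = np.array([75, 80, 78, 82, 77, 79])

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d with a pooled standard deviation
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1)
                     + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}")
```

If this bookkeeping feels tedious, that is exactly the gap pingouin fills, as shown in the next example.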

For a paired samples t-test (pre-test and post-test on the same participants):

# Paired samples t-test
t_stat, p_value = stats.ttest_rel(df['pretest'], df['posttest'])

For richer output including effect sizes, use pingouin:

import pingouin as pg

# Independent t-test with effect size included
# (correction=True applies Welch's correction for unequal variances)
result = pg.ttest(treatment, control, correction=True)
print(result)

Pingouin returns the t-statistic, degrees of freedom, p-value, Cohen's d, confidence interval, and statistical power all in one clean table.

Reading Python Output

When you run a t-test with pingouin, you get a DataFrame like this:

            T   dof  alternative  p-val         CI95%  cohen-d  power
T-test   2.45  38.7    two-sided  0.019  [0.85, 9.15]     0.78   0.87

Here is what matters:

  • T = 2.45 — the test statistic
  • dof = 38.7 — degrees of freedom
  • p-val = 0.019 — below .05, so the difference is statistically significant
  • cohen-d = 0.78 — a medium-to-large effect size
  • CI95% — 95% confidence interval for the mean difference
  • power = 0.87 — the statistical power of this test
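
Because pingouin returns an ordinary pandas DataFrame, you can pull individual numbers out for reporting rather than retyping them. A sketch using a stand-in DataFrame with the same column names (so it runs without pingouin installed):

```python
import pandas as pd

# Stand-in for the DataFrame that pg.ttest returns
result = pd.DataFrame(
    {"T": [2.45], "dof": [38.7], "p-val": [0.019], "cohen-d": [0.78]},
    index=["T-test"],
)

# Pull out single values for a results sentence
p = result["p-val"].iloc[0]
d = result["cohen-d"].iloc[0]
print(f"p = {p:.3f}, d = {d:.2f}")
```

This is handy when generating many results programmatically, since the numbers in your write-up always match the analysis.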

Recommended Packages

Install these research-focused packages early:

pip install pandas scipy statsmodels pingouin matplotlib seaborn

  • pandas — Data manipulation and cleaning. The DataFrame is the foundation of almost every analysis workflow in Python.
  • scipy — Core scientific library. The stats module has t-tests, ANOVA, chi-square, correlation, and non-parametric tests.
  • statsmodels — Regression, ANOVA, and time series analysis with detailed statistical output (R-squared, p-values, confidence intervals). Closer to R's output style than scipy.
  • pingouin — Built specifically for researchers. Provides clean, APA-ready output with effect sizes and confidence intervals. Excellent for t-tests, ANOVA, correlation, and reliability analysis.
  • matplotlib and seaborn — Visualization. Seaborn builds on matplotlib and creates publication-quality statistical plots with minimal code.
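
As a first taste of visualization, here is a minimal matplotlib sketch (made-up scores; the Agg backend and the file name group_scores.png are just for this example) that draws side-by-side boxplots of two groups:

```python
import matplotlib
matplotlib.use("Agg")  # draw without a display, e.g. on a server
import matplotlib.pyplot as plt

# Hypothetical scores for two groups
treatment = [88, 92, 85, 90, 87]
control = [75, 80, 78, 82, 77]

# Side-by-side boxplots: a quick first look at a group difference
fig, ax = plt.subplots()
ax.boxplot([treatment, control])
ax.set_xticks([1, 2])
ax.set_xticklabels(["treatment", "control"])
ax.set_ylabel("score")
fig.savefig("group_scores.png", dpi=150)
```

Seaborn can produce the same plot in one call (sns.boxplot) with nicer defaults once your data is in a DataFrame.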

Tips for Beginners

  • Use Jupyter Notebooks for exploration, scripts for final analyses. Notebooks are great for figuring things out, but save your final, clean analysis as a .py script for reproducibility.
  • Learn pandas first. Before running any statistical test, you need to load, clean, and reshape your data. Pandas is the gateway to everything else.
  • Use pingouin for standard tests. Scipy is powerful but bare-bones — it gives you a t-statistic and p-value, and you calculate everything else yourself. Pingouin gives you effect sizes, confidence intervals, and power in one call.
  • Google your errors. Copy error messages into a search engine. Stack Overflow has answers to almost every Python error you will encounter.
  • Do not fight the ecosystem. If your advisor uses R, learn R. If your lab uses Python, learn Python. Both are excellent for research. See our comparison guide for help deciding.

When You Need a Quick Calculation

Sometimes you need a fast effect size or power analysis result without writing code — especially during a meeting or proposal defense. The free calculators on Subthesis let you compute effect sizes, power analyses, and reliability coefficients in your browser. They pair well with Python for when you want to double-check a result or get a quick estimate.

Python has a gentler learning curve than many researchers expect. Write your first notebook, run your first t-test, and you will see why data science teams around the world have standardized on it.