Topics¶

Statistics basics 📝¶
- Descriptive statistics and inferential statistics
- Continuous data and discrete data
Visualization 📊¶
- Histograms, bar charts, pie charts, scatter plots.
Central tendency metrics and dispersion metrics 📈¶
- Mean, median, mode.
- Variance, Standard deviation, Range, Interquartile Range
Probabilities 🎲¶
- Probability Distributions (Normal Distribution)
- Random Variables
- Probability Rules
- Conditional Probability and Bayes' Theorem
- Independence and Dependence
- Correlation and Covariance
- Central Limit Theorem
Statistical hypothesis testing ❓¶
- p-value, confidence interval
- Type I error and Type II error
- Z-test and t-test
- ANOVA

Introduction¶

Probability and statistics underpin machine learning. Understanding these concepts lets you make sense of data, build predictive models, and make informed decisions. This guide covers the fundamentals, applications, and more advanced topics in probability and statistics, with Python code examples throughout.

Statistics¶

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. Its tools and techniques let us uncover insights, identify patterns, and make informed decisions in disciplines ranging from science and business to healthcare and the social sciences.

Statistics can be broadly divided into two main categories: descriptive statistics and inferential statistics, each serving a distinct but interconnected purpose.

Descriptive Statistics¶

Descriptive statistics involve methods used to summarize, simplify, and describe the key features of a dataset. This branch aims to provide a concise and meaningful overview of data through measures like:

Central tendency (mean, median, mode)
Dispersion (range, variance, standard deviation)
Graphical representations such as histograms, bar charts, and scatter plots.

Descriptive statistics help us understand the fundamental characteristics of a dataset and make it more manageable for analysis.

Inferential Statistics¶

Inferential statistics go beyond description and seek to draw conclusions or make predictions about a population based on sample data. They involve techniques such as:

Hypothesis testing
Confidence intervals
Regression analysis

Inferential statistics allow us to make inferences about a larger population based on a representative sample, providing insights into patterns, relationships, and probabilities of outcomes.
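As a rough illustration of inferring something about a population from a sample, the sketch below builds an approximate 95% confidence interval for a mean. The sample values are hypothetical, and the interval relies on a normal approximation; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample drawn from a larger population (illustrative values only)
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_std / math.sqrt(n)   # estimated standard deviation of the sample mean

# Critical value for 95% confidence under a normal approximation (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean = {sample_mean:.3f}")
print(f"Approximate 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval is a statement about the unknown population mean, not about the individual sample values.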

Descriptive Statistics Concepts¶

Measures of Central Tendency¶

Central tendency refers to the average or “typical” value of a dataset.

Mean – The arithmetic average.
Median – The middle value when data is sorted.
Mode – The most frequent value.
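The three measures above can be sketched with Python's standard-library `statistics` module. The dataset is hypothetical, chosen so that the mean, median, and mode all differ.

```python
import statistics

# Hypothetical dataset for illustration
values = [1, 2, 2, 3, 4, 7, 9]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print("Mean   =", mean_value)
print("Median =", median_value)
print("Mode   =", mode_value)
```

Here the two large values (7 and 9) pull the mean above the median, which is why the median is often preferred for skewed data.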

Measures of Dispersion¶

Dispersion refers to how spread out the data values are.

Range – Difference between the maximum and minimum values.
Variance – Average of squared deviations from the mean.
Standard Deviation – Square root of variance (measures spread).
Interquartile Range (IQR) – Width of the middle 50% of the data (Q3 − Q1).
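A minimal sketch of the four dispersion measures, again using the standard-library `statistics` module on a hypothetical dataset. Note that quartiles can be computed under several conventions; `statistics.quantiles` uses its "exclusive" method by default, so other tools such as NumPy may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Hypothetical dataset for illustration
values = [1, 2, 2, 3, 4, 7, 9]

value_range = max(values) - min(values)      # maximum minus minimum
pop_variance = statistics.pvariance(values)  # mean squared deviation (divides by N)
pop_std = statistics.pstdev(values)          # square root of the population variance

# Three cut points splitting the data into four equal-probability groups
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                                # width of the middle 50% of the data

print("Range              =", value_range)
print("Variance           =", pop_variance)
print("Standard deviation =", pop_std)
print("IQR                =", iqr)
```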

Variance¶

Variance quantifies the average of the squared differences between each data point and the mean. A larger variance means the data points lie farther from the mean on average; a variance of zero means every data point equals the mean.

Population Variance¶

The population variance (σ²) is defined as:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

Where:

\( x_i \): an individual data point
\( \mu \): the population mean
\( N \): total number of data points in the population

Sample Variance¶

The sample variance (s²) — an unbiased estimator of the population variance when calculated from a sample — is defined as:

\[ s^2 = \frac{\sum_{i=1}^{N} (x_i - \overline{x})^2}{N - 1} \]

Where:

\( x_i \): an individual data point in the sample
\( \overline{x} \): the sample mean
\( N \): the number of observations in the sample

When to Use Which¶

Use population variance when you have data for the entire population.
Use sample variance when you only have a subset of the population.
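In Python's standard library this choice maps onto two functions: `statistics.pvariance` divides by N and `statistics.variance` divides by N − 1. The scores below are hypothetical and serve only to show how the two results differ on the same numbers.

```python
import statistics

# Hypothetical scores for illustration
scores = [70, 80, 90]

# Treat the scores as the entire population: divide by N
population_variance = statistics.pvariance(scores)

# Treat the scores as a sample from a larger population: divide by N - 1
sample_variance = statistics.variance(scores)

print("Population variance =", population_variance)
print("Sample variance     =", sample_variance)
```

The sample variance is always the larger of the two (for non-constant data), and the gap shrinks as the number of observations grows.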

Example Calculation¶

Given the dataset: 2, 4, 6, 8

Mean (x̄) = (2 + 4 + 6 + 8) / 4 = 5
Squared deviations:
(2−5)² = 9
(4−5)² = 1
(6−5)² = 1
(8−5)² = 9
Sum of squared deviations = 20
Population variance (σ²) = 20 / 4 = 5
Sample variance (s²) = 20 / (4−1) ≈ 6.67

Python Example¶

# Variance Calculation Examples

# Example dataset
data = [2, 4, 6, 8]

# --- Manual Calculation ---
N = len(data)
mean = sum(data) / N

# Calculate squared differences
squared_diffs = [(x - mean)**2 for x in data]

# Population variance (divide by N)
pop_variance = sum(squared_diffs) / N

# Sample variance (divide by N - 1)
sample_variance = sum(squared_diffs) / (N - 1)

print('Mean =', mean)
print('Population Variance =', pop_variance)
print('Sample Variance =', sample_variance)

# --- Using NumPy ---
import numpy as np
arr = np.array(data)

print('NumPy Population Variance =', arr.var())        # ddof=0 (default)
print('NumPy Sample Variance =', arr.var(ddof=1))      # ddof=1 for sample

Notes¶

Dividing by N−1 in the sample variance formula (Bessel’s correction) removes the bias that arises from measuring deviations around the sample mean rather than the true population mean.
Variance units are squared; take the square root to obtain the standard deviation.
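The effect of Bessel's correction can be checked with a small simulation. This is a sketch under stated assumptions: the normal population, its standard deviation, the sample size, the trial count, and the helper `sum_squared_deviations` are all hypothetical choices made for illustration. Averaged over many samples, dividing by N should underestimate the true variance by a factor of about (N − 1) / N, while dividing by N − 1 should land close to it.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_STD = 2.0     # population standard deviation, so the true variance is 4.0
SAMPLE_SIZE = 5    # small samples make the bias easy to see
TRIALS = 20_000


def sum_squared_deviations(sample):
    """Sum of squared deviations of the sample from its own mean."""
    sample_mean = sum(sample) / len(sample)
    return sum((x - sample_mean) ** 2 for x in sample)


biased_estimates = []
unbiased_estimates = []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, TRUE_STD) for _ in range(SAMPLE_SIZE)]
    ss = sum_squared_deviations(sample)
    biased_estimates.append(ss / SAMPLE_SIZE)          # divide by N
    unbiased_estimates.append(ss / (SAMPLE_SIZE - 1))  # divide by N - 1

mean_biased = sum(biased_estimates) / TRIALS
mean_unbiased = sum(unbiased_estimates) / TRIALS

print("True variance              =", TRUE_STD ** 2)
print("Average of N estimates     =", round(mean_biased, 3))
print("Average of N - 1 estimates =", round(mean_unbiased, 3))
```

With a sample size of 5, the N-based average should sit near 4.0 × 4/5 = 3.2, while the N − 1 average should sit near 4.0.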