# Normal Distribution
The Normal Distribution (also called Gaussian Distribution) is the most important continuous probability distribution in statistics and machine learning, characterized by its symmetric bell-shaped curve and defined by two parameters: mean and standard deviation.
Resources: SciPy Stats Documentation | Khan Academy Statistics | Elements of Statistical Learning
## Summary
The Normal Distribution is a continuous probability distribution that is symmetric around its mean, with the shape determined by its standard deviation. It's fundamental to statistics and machine learning due to the Central Limit Theorem and its mathematical properties.
Key Characteristics:
- Bell-shaped curve: Symmetric around the mean
- Unimodal: Single peak at the mean
- Asymptotic: Tails approach zero but never reach it
- Defined by two parameters: Mean (μ) and standard deviation (σ)
- 68-95-99.7 rule: Empirical rule for data spread

Standard Normal Distribution:
- Mean (μ) = 0
- Standard deviation (σ) = 1
- Used for standardization and z-scores

Applications in Machine Learning:
- Assumption in algorithms: Linear regression, Naive Bayes, LDA
- Initialization: Weight initialization in neural networks
- Regularization: Gaussian priors in Bayesian methods
- Noise modeling: Gaussian noise assumptions
- Feature engineering: Normalization and standardization
- Hypothesis testing: Statistical significance testing
- Confidence intervals: Uncertainty quantification

Real-world Examples:
- Heights and weights of populations
- Measurement errors in experiments
- Financial returns (approximately)
- IQ scores
- Blood pressure measurements
- Test scores and grades
## Intuition
### Mathematical Foundation
#### Probability Density Function (PDF)
The Normal Distribution is defined by its PDF:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Where:
- \(\mu\) is the mean (location parameter)
- \(\sigma\) is the standard deviation (scale parameter)
- \(\sigma^2\) is the variance
- \(e \approx 2.718\) (Euler's number)
- \(\pi \approx 3.14159\)
#### Standard Normal Distribution
When \(\mu = 0\) and \(\sigma = 1\):

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$$

#### Cumulative Distribution Function (CDF)
For standard normal: \(\Phi(z) = \int_{-\infty}^{z} \phi(t) dt\)
#### Z-Score Transformation
Convert any normal distribution to standard normal:

$$Z = \frac{X - \mu}{\sigma}$$

Where \(Z \sim N(0, 1)\)
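As a quick illustration, here is a minimal sketch of the z-score transformation on a few made-up values (the numbers are arbitrary, not from a real dataset):

```python
import numpy as np

# Hypothetical exam scores assumed to come from N(75, 10^2)
scores = np.array([62.0, 71.5, 80.0, 93.0])
mu, sigma = 75.0, 10.0

z_scores = (scores - mu) / sigma   # Z = (X - mu) / sigma
print(z_scores)                    # e.g. the score 93 maps to z = 1.8
```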
### Key Properties
#### Moment Properties
Mean (First Moment): \(E[X] = \mu\)

Variance (Second Central Moment): \(\text{Var}(X) = E[(X-\mu)^2] = \sigma^2\)

Skewness (Third Standardized Moment): \(\text{Skewness} = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = 0\)

Kurtosis (Fourth Standardized Moment): \(\text{Kurtosis} = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = 3\)

#### The 68-95-99.7 Rule (Empirical Rule)
For any normal distribution:
- 68% of data falls within 1 standard deviation: \(P(\mu - \sigma \leq X \leq \mu + \sigma) = 0.68\)
- 95% of data falls within 2 standard deviations: \(P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) = 0.95\)
- 99.7% of data falls within 3 standard deviations: \(P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) = 0.997\)
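These percentages follow directly from the standard normal CDF; a minimal check using scipy.stats (the same library used in the implementation section below):

```python
from scipy import stats

# Probability within k standard deviations of the mean, for any normal distribution
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} sigma: {p:.4f}")
# Prints roughly 0.6827, 0.9545, 0.9973
```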
#### Linear Combinations
If \(X \sim N(\mu_1, \sigma_1^2)\) and \(Y \sim N(\mu_2, \sigma_2^2)\) are independent:

$$aX + bY \sim N\left(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + b^2\sigma_2^2\right)$$

#### Central Limit Theorem
For any population with mean \(\mu\) and finite variance \(\sigma^2\), the sampling distribution of the sample mean approaches normal as the sample size increases:

$$\bar{X}_n \approx N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \quad \text{for large } n$$

### Maximum Likelihood Estimation
Given observations \(x_1, x_2, ..., x_n\) from \(N(\mu, \sigma^2)\):

Log-likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

MLE Estimators:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
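A short sketch of computing these estimators on simulated data (the true parameter values 5 and 2 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # simulated observations

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of sigma^2: divides by n, not n-1

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {np.sqrt(sigma2_hat):.3f}")
```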
## Implementation using Libraries
### Using NumPy and SciPy
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Set style for better plots
plt.style.use('seaborn-v0_8')
np.random.seed(42)
# Generate normal distribution samples
def generate_normal_samples(mu=0, sigma=1, size=1000):
"""
Generate samples from normal distribution
Args:
mu: Mean parameter
sigma: Standard deviation parameter
size: Number of samples
Returns:
Array of samples
"""
return np.random.normal(mu, sigma, size)
# Basic normal distribution operations
mu, sigma = 5, 2
samples = generate_normal_samples(mu, sigma, 10000)
print(f"True parameters: μ={mu}, σ={sigma}")
print(f"Sample statistics: μ={np.mean(samples):.3f}, σ={np.std(samples, ddof=1):.3f}")
print(f"Sample size: {len(samples)}")
# Using scipy.stats for the normal distribution
normal_dist = stats.norm(loc=mu, scale=sigma)
# Calculate probabilities and quantiles
x_values = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
pdf_values = normal_dist.pdf(x_values)
cdf_values = normal_dist.cdf(x_values)
print(f"\nProbability calculations:")
print(f"P(X d 7) = {normal_dist.cdf(7):.4f}")
print(f"P(X e 3) = {1 - normal_dist.cdf(3):.4f}")
print(f"P(3 d X d 7) = {normal_dist.cdf(7) - normal_dist.cdf(3):.4f}")
π§#π§ π§Qπ§uπ§aπ§nπ§tπ§iπ§lπ§eπ§sπ§ π§(π§iπ§nπ§vπ§eπ§rπ§sπ§eπ§ π§Cπ§Dπ§Fπ§)π§
print(f"\nQuantiles:")
print(f"25th percentile: {normal_dist.ppf(0.25):.3f}")
print(f"50th percentile (median): {normal_dist.ppf(0.5):.3f}")
print(f"75th percentile: {normal_dist.ppf(0.75):.3f}")
print(f"95th percentile: {normal_dist.ppf(0.95):.3f}")
# Empirical rule verification
within_1_sigma = np.sum(np.abs(samples - mu) <= sigma) / len(samples)
within_2_sigma = np.sum(np.abs(samples - mu) <= 2*sigma) / len(samples)
within_3_sigma = np.sum(np.abs(samples - mu) <= 3*sigma) / len(samples)
print(f"\nEmpirical Rule Verification:")
print(f"Within 1Γ: {within_1_sigma:.3f} (expected: 0.683)")
print(f"Within 2Γ: {within_2_sigma:.3f} (expected: 0.954)")
print(f"Within 3Γ: {within_3_sigma:.3f} (expected: 0.997)")
π§#π§ π§Vπ§iπ§sπ§uπ§aπ§lπ§iπ§zπ§aπ§tπ§iπ§oπ§nπ§
plt.figure(figsize=(15, 12))
π§#π§ π§Pπ§Dπ§Fπ§ π§aπ§nπ§dπ§ π§hπ§iπ§sπ§tπ§oπ§gπ§rπ§aπ§mπ§
plt.subplot(3, 2, 1)
plt.hist(samples, bins=50, density=True, alpha=0.7, color='skyblue', edgecolor='black')
plt.plot(x_values, pdf_values, 'r-', linewidth=2, label=f'PDF: N({mu}, {sigma}Β²)')
plt.axvline(mu, color='red', linestyle='--', alpha=0.8, label=f'Mean = {mu}')
plt.axvline(mu + sigma, color='orange', linestyle='--', alpha=0.8, label='μ + σ')
plt.axvline(mu - sigma, color='orange', linestyle='--', alpha=0.8, label='μ - σ')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Normal Distribution PDF with Histogram')
plt.legend()
plt.grid(True, alpha=0.3)
# CDF
plt.subplot(3, 2, 2)
plt.plot(x_values, cdf_values, 'b-', linewidth=2, label='CDF')
plt.axhline(0.5, color='red', linestyle='--', alpha=0.8, label='P = 0.5')
plt.axvline(mu, color='red', linestyle='--', alpha=0.8, label=f'Median = {mu}')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Normal Distribution CDF')
plt.legend()
plt.grid(True, alpha=0.3)
# Q-Q plot
plt.subplot(3, 2, 3)
stats.probplot(samples, dist="norm", plot=plt)
plt.title('Q-Q Plot: Testing Normality')
plt.grid(True, alpha=0.3)
# Empirical rule visualization
plt.subplot(3, 2, 4)
x_emp = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y_emp = stats.norm.pdf(x_emp, mu, sigma)
plt.plot(x_emp, y_emp, 'k-', linewidth=2, label='PDF')
# Shade regions
plt.fill_between(x_emp, 0, y_emp, where=((x_emp >= mu-sigma) & (x_emp <= mu+sigma)),
                 alpha=0.3, color='blue', label='68% (1σ)')
plt.fill_between(x_emp, 0, y_emp, where=((x_emp >= mu-2*sigma) & (x_emp <= mu+2*sigma)),
                 alpha=0.2, color='green', label='95% (2σ)')
plt.fill_between(x_emp, 0, y_emp, where=((x_emp >= mu-3*sigma) & (x_emp <= mu+3*sigma)),
                 alpha=0.1, color='red', label='99.7% (3σ)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Empirical Rule (68-95-99.7)')
plt.legend()
plt.grid(True, alpha=0.3)
# Standard normal comparison
plt.subplot(3, 2, 5)
z_scores = (samples - mu) / sigma
plt.hist(z_scores, bins=50, density=True, alpha=0.7, color='lightgreen', edgecolor='black')
x_std = np.linspace(-4, 4, 1000)
y_std = stats.norm.pdf(x_std, 0, 1)
plt.plot(x_std, y_std, 'r-', linewidth=2, label='Standard Normal N(0,1)')
plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('Standardized Data vs Standard Normal')
plt.legend()
plt.grid(True, alpha=0.3)
# Multiple normal distributions comparison
plt.subplot(3, 2, 6)
params = [(0, 1), (0, 2), (2, 1), (-1, 0.5)]
colors = ['blue', 'red', 'green', 'orange']
x_comp = np.linspace(-6, 6, 1000)
for (m, s), color in zip(params, colors):
y_comp = stats.norm.pdf(x_comp, m, s)
plt.plot(x_comp, y_comp, color=color, linewidth=2,
label=f'N({m}, {s}Β²)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Comparison of Different Normal Distributions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
### Statistical Tests for Normality
from scipy.stats import shapiro, normaltest, anderson, kstest
from sklearn.datasets import make_regression
def test_normality(data, alpha=0.05):
"""
Perform multiple tests for normality
Args:
data: Array of data to test
alpha: Significance level
Returns:
Dictionary with test results
"""
results = {}
# Shapiro-Wilk test
shapiro_stat, shapiro_p = shapiro(data)
results['Shapiro-Wilk'] = {
'statistic': shapiro_stat,
'p_value': shapiro_p,
'is_normal': shapiro_p > alpha,
'interpretation': 'Normal' if shapiro_p > alpha else 'Not Normal'
}
# D'Agostino-Pearson test
dp_stat, dp_p = normaltest(data)
results["D'Agostino-Pearson"] = {
'statistic': dp_stat,
'p_value': dp_p,
'is_normal': dp_p > alpha,
'interpretation': 'Normal' if dp_p > alpha else 'Not Normal'
}
# Anderson-Darling test
ad_result = anderson(data, dist='norm')
# Use critical value for 5% significance level
critical_value = ad_result.critical_values[2] # 5% level
results['Anderson-Darling'] = {
'statistic': ad_result.statistic,
'critical_value': critical_value,
'is_normal': ad_result.statistic < critical_value,
'interpretation': 'Normal' if ad_result.statistic < critical_value else 'Not Normal'
}
# Kolmogorov-Smirnov test
# First estimate parameters
mu_est, sigma_est = np.mean(data), np.std(data, ddof=1)
ks_stat, ks_p = kstest(data, lambda x: stats.norm.cdf(x, mu_est, sigma_est))
results['Kolmogorov-Smirnov'] = {
'statistic': ks_stat,
'p_value': ks_p,
'is_normal': ks_p > alpha,
'interpretation': 'Normal' if ks_p > alpha else 'Not Normal'
}
return results
# Test with different types of data
datasets = {
'Normal Data': np.random.normal(0, 1, 1000),
'Uniform Data': np.random.uniform(-2, 2, 1000),
'Exponential Data': np.random.exponential(1, 1000),
'Mixed Normal': np.concatenate([np.random.normal(-2, 1, 500),
np.random.normal(2, 1, 500)])
}
print("Normality Test Results:")
print("=" * 80)
for name, data in datasets.items():
print(f"\n{name}:")
print(f"Mean: {np.mean(data):.3f}, Std: {np.std(data, ddof=1):.3f}")
results = test_normality(data)
for test_name, result in results.items():
if 'p_value' in result:
print(f"{test_name:20}: p={result['p_value']:.4f}, {result['interpretation']}")
else:
print(f"{test_name:20}: stat={result['statistic']:.4f}, {result['interpretation']}")
# Visualization of different data types
plt.figure(figsize=(16, 10))
for i, (name, data) in enumerate(datasets.items()):
# Histogram
plt.subplot(2, 4, i+1)
plt.hist(data, bins=50, density=True, alpha=0.7, color=f'C{i}', edgecolor='black')
# Fit normal distribution
mu_fit, sigma_fit = stats.norm.fit(data)
x_fit = np.linspace(data.min(), data.max(), 100)
y_fit = stats.norm.pdf(x_fit, mu_fit, sigma_fit)
plt.plot(x_fit, y_fit, 'r-', linewidth=2, label=f'Fitted Normal')
plt.title(f'{name}')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Q-Q plot
plt.subplot(2, 4, i+5)
stats.probplot(data, dist="norm", plot=plt)
plt.title(f'{name} - Q-Q Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
### Normal Distribution in Machine Learning Context
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
# Demonstrate normal distribution assumptions in ML
def demonstrate_ml_normality():
"""
Show how normal distribution is used in machine learning
"""
# Generate dataset
X, y = make_classification(n_samples=1000, n_features=4, n_redundant=0,
n_informative=4, n_clusters_per_class=1,
random_state=42)
feature_names = [f'Feature_{i+1}' for i in range(X.shape[1])]
print("Dataset Analysis:")
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
# Analyze feature distributions
plt.figure(figsize=(16, 12))
for i in range(X.shape[1]):
# Histogram with normal overlay
plt.subplot(3, 4, i+1)
plt.hist(X[:, i], bins=30, density=True, alpha=0.7, color=f'C{i}')
# Fit and plot normal distribution
mu, sigma = stats.norm.fit(X[:, i])
x_range = np.linspace(X[:, i].min(), X[:, i].max(), 100)
plt.plot(x_range, stats.norm.pdf(x_range, mu, sigma), 'r-', linewidth=2)
plt.title(f'{feature_names[i]} Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
# Q-Q plot
plt.subplot(3, 4, i+5)
stats.probplot(X[:, i], dist="norm", plot=plt)
plt.title(f'{feature_names[i]} Q-Q Plot')
# Test normality
_, p_value = shapiro(X[:, i])
print(f"{feature_names[i]} Shapiro-Wilk p-value: {p_value:.4f}")
# Before and after standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Plot comparison
for i in range(2): # Just first 2 features for space
plt.subplot(3, 4, i+9)
plt.hist(X_scaled[:, i], bins=30, density=True, alpha=0.7, color='green')
x_std = np.linspace(-4, 4, 100)
plt.plot(x_std, stats.norm.pdf(x_std, 0, 1), 'r-', linewidth=2)
plt.title(f'{feature_names[i]} After Standardization')
plt.xlabel('Standardized Value')
plt.ylabel('Density')
plt.tight_layout()
plt.show()
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled, X_test_scaled = train_test_split(X_scaled, test_size=0.2, random_state=42)
# Compare models with and without standardization
models = {
'Logistic Regression (Original)': LogisticRegression(random_state=42),
'Logistic Regression (Scaled)': LogisticRegression(random_state=42),
'Gaussian Naive Bayes (Original)': GaussianNB(),
'Gaussian Naive Bayes (Scaled)': GaussianNB()
}
results = {}
# Train and evaluate models
for i, (name, model) in enumerate(models.items()):
if 'Original' in name:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
else:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# Calculate accuracy
accuracy = np.mean(y_pred == y_test)
results[name] = accuracy
print(f"\n{name}:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
return results
# Run demonstration
ml_results = demonstrate_ml_normality()
# Compare results
plt.figure(figsize=(10, 6))
names = list(ml_results.keys())
accuracies = list(ml_results.values())
colors = ['blue', 'lightblue', 'red', 'lightcoral']
bars = plt.bar(range(len(names)), accuracies, color=colors, alpha=0.7, edgecolor='black')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Performance: Original vs Standardized Features')
plt.xticks(range(len(names)), [name.replace(' (Original)', '\n(Original)').replace(' (Scaled)', '\n(Scaled)')
for name in names], rotation=0)
plt.ylim(0, 1)
# Add value labels on bars
for bar, acc in zip(bars, accuracies):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{acc:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
## From Scratch Implementation
### Complete Normal Distribution Implementation
import numpy as np
import matplotlib.pyplot as plt
from typing import Union, List, Tuple
class NormalDistribution:
"""
Complete implementation of Normal Distribution from scratch
"""
def __init__(self, mu: float = 0, sigma: float = 1):
"""
Initialize Normal Distribution
Args:
mu: Mean parameter
sigma: Standard deviation parameter (must be positive)
"""
if sigma <= 0:
raise ValueError("Standard deviation must be positive")
self.mu = mu
self.sigma = sigma
self.variance = sigma ** 2
def pdf(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
"""
Probability Density Function
Args:
x: Value(s) to evaluate
Returns:
PDF value(s)
"""
x = np.asarray(x)
coefficient = 1 / (self.sigma * np.sqrt(2 * np.pi))
exponent = -0.5 * ((x - self.mu) / self.sigma) ** 2
return coefficient * np.exp(exponent)
def cdf(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
"""
Cumulative Distribution Function using error function approximation
Args:
x: Value(s) to evaluate
Returns:
CDF value(s)
"""
x = np.asarray(x)
z = (x - self.mu) / (self.sigma * np.sqrt(2))
return 0.5 * (1 + self._erf(z))
def _erf(self, z: np.ndarray) -> np.ndarray:
"""
Error function approximation using Abramowitz and Stegun formula
Args:
z: Input values
Returns:
Error function values
"""
# Constants for approximation
a1, a2, a3, a4, a5 = 0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429
p = 0.3275911
# Save the sign of z
sign = np.sign(z)
z = np.abs(z)
# A&S formula 7.1.26
t = 1 / (1 + p * z)
y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * np.exp(-z * z)
return sign * y
def ppf(self, p: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
"""
Percent Point Function (inverse CDF) using Beasley-Springer-Moro algorithm
Args:
p: Probability values (0 < p < 1)
Returns:
Quantile values
"""
p = np.asarray(p)
if np.any(p <= 0) or np.any(p >= 1):
raise ValueError("Probabilities must be between 0 and 1")
# Convert to standard normal quantiles first
z = self._standard_normal_ppf(p)
# Transform to desired distribution
return self.mu + self.sigma * z
def _standard_normal_ppf(self, p: np.ndarray) -> np.ndarray:
"""
Standard normal PPF using Beasley-Springer-Moro algorithm
"""
# Constants
a = [-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02,
1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00]
b = [-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02,
6.680131188771972e+01, -1.328068155288572e+01]
c = [-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00,
-2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00]
d = [7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00,
3.754408661907416e+00]
p_low = 0.02425
p_high = 1 - p_low
result = np.zeros_like(p)
# Low region
mask_low = p < p_low
if np.any(mask_low):
q = np.sqrt(-2 * np.log(p[mask_low]))
result[mask_low] = (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) / \
((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1)
# Central region
mask_central = (p >= p_low) & (p <= p_high)
if np.any(mask_central):
q = p[mask_central] - 0.5
r = q * q
result[mask_central] = (((((a[0] * r + a[1]) * r + a[2]) * r + a[3]) * r + a[4]) * r + a[5]) * q / \
(((((b[0] * r + b[1]) * r + b[2]) * r + b[3]) * r + b[4]) * r + 1)
# High region
mask_high = p > p_high
if np.any(mask_high):
q = np.sqrt(-2 * np.log(1 - p[mask_high]))
result[mask_high] = -(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) / \
((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1)
return result
def sample(self, size: int = 1) -> Union[float, np.ndarray]:
"""
Generate random samples using Box-Muller transformation
Args:
size: Number of samples to generate
Returns:
Random samples
"""
# Box-Muller transformation
if size % 2 == 1:
size += 1
trim = True
else:
trim = False
# Generate uniform random numbers
u1 = np.random.uniform(0, 1, size // 2)
u2 = np.random.uniform(0, 1, size // 2)
# Box-Muller transformation
r = np.sqrt(-2 * np.log(u1))
theta = 2 * np.pi * u2
z1 = r * np.cos(theta)
z2 = r * np.sin(theta)
# Combine and transform to desired distribution
samples = np.concatenate([z1, z2])
samples = self.mu + self.sigma * samples
if trim:
samples = samples[:-1]
return samples[0] if len(samples) == 1 else samples
def fit(self, data: np.ndarray) -> 'NormalDistribution':
"""
Fit normal distribution to data using Maximum Likelihood Estimation
Args:
data: Sample data
Returns:
Fitted NormalDistribution object
"""
data = np.asarray(data)
mu_mle = np.mean(data)
sigma_mle = np.std(data, ddof=0) # MLE uses population std
return NormalDistribution(mu_mle, sigma_mle)
def log_likelihood(self, data: np.ndarray) -> float:
"""
Calculate log-likelihood of data
Args:
data: Sample data
Returns:
Log-likelihood value
"""
data = np.asarray(data)
n = len(data)
ll = (-n/2) * np.log(2 * np.pi) - n * np.log(self.sigma) - \
np.sum((data - self.mu)**2) / (2 * self.sigma**2)
return ll
def __str__(self) -> str:
return f"Normal(ΒΌ={self.mu:.3f}, Γ={self.sigma:.3f})"
def __repr__(self) -> str:
return f"NormalDistribution(mu={self.mu}, sigma={self.sigma})"
# Demonstration of custom implementation
def demo_custom_normal():
"""Demonstrate custom normal distribution implementation"""
print("Custom Normal Distribution Implementation Demo")
print("=" * 50)
# Create distribution
norm = NormalDistribution(mu=2, sigma=1.5)
print(f"Distribution: {norm}")
# Generate samples
samples = norm.sample(10000)
print(f"\nGenerated {len(samples)} samples")
print(f"Sample mean: {np.mean(samples):.3f} (expected: {norm.mu})")
print(f"Sample std: {np.std(samples, ddof=1):.3f} (expected: {norm.sigma})")
# Test PDF, CDF, PPF
test_values = np.array([-1, 0, 1, 2, 3, 4, 5])
print(f"\nFunction evaluations:")
print("Value\tPDF\t\tCDF\t\tPPF(CDF)")
for val in test_values:
pdf_val = norm.pdf(val)
cdf_val = norm.cdf(val)
ppf_val = norm.ppf(cdf_val) if 0 < cdf_val < 1 else np.nan
print(f"{val:.1f}\t{pdf_val:.6f}\t{cdf_val:.6f}\t{ppf_val:.3f}")
# Compare with scipy
import scipy.stats as stats
scipy_norm = stats.norm(norm.mu, norm.sigma)
print(f"\nComparison with SciPy (first 5 test values):")
print("Value\tCustom PDF\tSciPy PDF\tDiff PDF\tCustom CDF\tSciPy CDF\tDiff CDF")
for val in test_values[:5]:
custom_pdf = norm.pdf(val)
scipy_pdf = scipy_norm.pdf(val)
custom_cdf = norm.cdf(val)
scipy_cdf = scipy_norm.cdf(val)
print(f"{val:.1f}\t{custom_pdf:.6f}\t{scipy_pdf:.6f}\t{abs(custom_pdf-scipy_pdf):.2e}\t"
f"{custom_cdf:.6f}\t{scipy_cdf:.6f}\t{abs(custom_cdf-scipy_cdf):.2e}")
# Fit to data
fitted_norm = NormalDistribution().fit(samples)
print(f"\nFitted distribution: {fitted_norm}")
print(f"Log-likelihood: {fitted_norm.log_likelihood(samples):.2f}")
# Visualization
plt.figure(figsize=(15, 10))
# PDF comparison
x = np.linspace(norm.mu - 4*norm.sigma, norm.mu + 4*norm.sigma, 1000)
custom_pdf = norm.pdf(x)
scipy_pdf = scipy_norm.pdf(x)
plt.subplot(2, 3, 1)
plt.plot(x, custom_pdf, 'b-', linewidth=2, label='Custom Implementation')
plt.plot(x, scipy_pdf, 'r--', linewidth=2, label='SciPy', alpha=0.7)
plt.xlabel('Value')
plt.ylabel('PDF')
plt.title('PDF Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
# CDF comparison
custom_cdf = norm.cdf(x)
scipy_cdf = scipy_norm.cdf(x)
plt.subplot(2, 3, 2)
plt.plot(x, custom_cdf, 'b-', linewidth=2, label='Custom Implementation')
plt.plot(x, scipy_cdf, 'r--', linewidth=2, label='SciPy', alpha=0.7)
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
# Sample histogram with fitted PDF
plt.subplot(2, 3, 3)
plt.hist(samples, bins=50, density=True, alpha=0.7, color='skyblue', edgecolor='black')
plt.plot(x, norm.pdf(x), 'r-', linewidth=2, label='Original PDF')
plt.plot(x, fitted_norm.pdf(x), 'g--', linewidth=2, label='Fitted PDF')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Samples with Original and Fitted PDF')
plt.legend()
plt.grid(True, alpha=0.3)
# PPF comparison
p_values = np.linspace(0.01, 0.99, 100)
custom_ppf = norm.ppf(p_values)
scipy_ppf = scipy_norm.ppf(p_values)
plt.subplot(2, 3, 4)
plt.plot(p_values, custom_ppf, 'b-', linewidth=2, label='Custom Implementation')
plt.plot(p_values, scipy_ppf, 'r--', linewidth=2, label='SciPy', alpha=0.7)
plt.xlabel('Probability')
plt.ylabel('Quantile')
plt.title('PPF (Quantile Function) Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
# Error analysis
pdf_errors = np.abs(custom_pdf - scipy_pdf)
cdf_errors = np.abs(custom_cdf - scipy_cdf)
plt.subplot(2, 3, 5)
plt.semilogy(x, pdf_errors, 'b-', linewidth=2, label='PDF Error')
plt.semilogy(x, cdf_errors, 'r-', linewidth=2, label='CDF Error')
plt.xlabel('Value')
plt.ylabel('Absolute Error (log scale)')
plt.title('Implementation Error Analysis')
plt.legend()
plt.grid(True, alpha=0.3)
# Box-Muller samples vs normal samples
plt.subplot(2, 3, 6)
box_muller_samples = norm.sample(1000)
scipy_samples = scipy_norm.rvs(1000, random_state=42)
plt.hist(box_muller_samples, bins=30, alpha=0.5, label='Box-Muller', density=True)
plt.hist(scipy_samples, bins=30, alpha=0.5, label='SciPy', density=True)
plt.plot(x, norm.pdf(x), 'k-', linewidth=2, label='True PDF')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Sample Generation Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return norm, samples
# Run demonstration
custom_norm, custom_samples = demo_custom_normal()
### Advanced Statistical Methods
class AdvancedNormalAnalysis:
"""
Advanced methods for normal distribution analysis
"""
@staticmethod
def confidence_interval(data: np.ndarray, confidence: float = 0.95) -> Tuple[float, float]:
"""
Calculate confidence interval for the mean
Args:
data: Sample data
confidence: Confidence level (0 < confidence < 1)
Returns:
Tuple of (lower_bound, upper_bound)
"""
n = len(data)
mean = np.mean(data)
std_err = np.std(data, ddof=1) / np.sqrt(n)
# For large n, use normal distribution
if n >= 30:
from scipy import stats
z_score = stats.norm.ppf((1 + confidence) / 2)
margin_error = z_score * std_err
else:
# For small n, use t-distribution
from scipy import stats
t_score = stats.t.ppf((1 + confidence) / 2, n-1)
margin_error = t_score * std_err
return (mean - margin_error, mean + margin_error)
@staticmethod
def hypothesis_test_mean(data: np.ndarray, null_mean: float = 0,
alternative: str = 'two-sided', alpha: float = 0.05) -> dict:
"""
One-sample t-test for the mean
Args:
data: Sample data
null_mean: Hypothesized population mean
alternative: 'two-sided', 'greater', or 'less'
alpha: Significance level
Returns:
Dictionary with test results
"""
from scipy import stats
n = len(data)
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
# Test statistic
t_stat = (sample_mean - null_mean) / (sample_std / np.sqrt(n))
# P-value calculation
if alternative == 'two-sided':
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n-1))
elif alternative == 'greater':
p_value = 1 - stats.t.cdf(t_stat, n-1)
elif alternative == 'less':
p_value = stats.t.cdf(t_stat, n-1)
else:
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
# Critical value
if alternative == 'two-sided':
critical_value = stats.t.ppf(1 - alpha/2, n-1)
else:
critical_value = stats.t.ppf(1 - alpha, n-1)
reject_null = p_value < alpha
return {
'sample_mean': sample_mean,
'null_mean': null_mean,
't_statistic': t_stat,
'p_value': p_value,
'critical_value': critical_value,
'reject_null': reject_null,
'conclusion': f"{'Reject' if reject_null else 'Fail to reject'} the null hypothesis",
'alpha': alpha,
'alternative': alternative
}
@staticmethod
def power_analysis(effect_size: float, alpha: float = 0.05,
power: float = 0.8, alternative: str = 'two-sided') -> int:
"""
Calculate required sample size for given power
Args:
effect_size: Cohen's d (standardized effect size)
alpha: Significance level
power: Desired statistical power
alternative: Type of test
Returns:
Required sample size
"""
from scipy import stats
# Z-scores for alpha and power
if alternative == 'two-sided':
z_alpha = stats.norm.ppf(1 - alpha/2)
else:
z_alpha = stats.norm.ppf(1 - alpha)
z_power = stats.norm.ppf(power)
# Sample size calculation
if alternative == 'two-sided':
n = ((z_alpha + z_power) / effect_size) ** 2
else:
n = ((z_alpha + z_power) / effect_size) ** 2
return int(np.ceil(n))
@staticmethod
def transformation_analysis(data: np.ndarray) -> dict:
"""
Analyze data and suggest transformations to achieve normality
Args:
data: Sample data
Returns:
Dictionary with transformation analysis
"""
from scipy import stats
results = {
'original': {
'data': data,
'shapiro_p': stats.shapiro(data)[1],
'skewness': stats.skew(data),
'kurtosis': stats.kurtosis(data)
}
}
# Log transformation (for positive data)
if np.all(data > 0):
log_data = np.log(data)
results['log'] = {
'data': log_data,
'shapiro_p': stats.shapiro(log_data)[1],
'skewness': stats.skew(log_data),
'kurtosis': stats.kurtosis(log_data)
}
# Square root transformation (for non-negative data)
if np.all(data >= 0):
sqrt_data = np.sqrt(data)
results['sqrt'] = {
'data': sqrt_data,
'shapiro_p': stats.shapiro(sqrt_data)[1],
'skewness': stats.skew(sqrt_data),
'kurtosis': stats.kurtosis(sqrt_data)
}
# Box-Cox transformation
if np.all(data > 0):
try:
boxcox_data, lambda_param = stats.boxcox(data)
results['boxcox'] = {
'data': boxcox_data,
'lambda': lambda_param,
'shapiro_p': stats.shapiro(boxcox_data)[1],
'skewness': stats.skew(boxcox_data),
'kurtosis': stats.kurtosis(boxcox_data)
}
except:
pass
# Yeo-Johnson transformation (can handle negative values)
try:
yeojohnson_data, lambda_param = stats.yeojohnson(data)
results['yeojohnson'] = {
'data': yeojohnson_data,
'lambda': lambda_param,
'shapiro_p': stats.shapiro(yeojohnson_data)[1],
'skewness': stats.skew(yeojohnson_data),
'kurtosis': stats.kurtosis(yeojohnson_data)
}
except:
pass
# Find best transformation
best_transform = max(results.keys(),
key=lambda k: results[k]['shapiro_p'])
results['best_transformation'] = best_transform
return results
# Demonstration of advanced methods
def demo_advanced_analysis():
"""Demonstrate advanced normal distribution analysis"""
# Generate different types of data
np.random.seed(42)
datasets = {
'Normal Data': np.random.normal(10, 2, 100),
'Skewed Data': np.random.exponential(1, 100),
'Heavy-tailed Data': np.random.standard_t(3, 100),
'Bimodal Data': np.concatenate([np.random.normal(5, 1, 50),
np.random.normal(15, 1, 50)])
}
analyzer = AdvancedNormalAnalysis()
print("Advanced Normal Distribution Analysis")
print("=" * 60)
for name, data in datasets.items():
print(f"\n{name}:")
print(f"Mean: {np.mean(data):.3f}, Std: {np.std(data, ddof=1):.3f}")
# Confidence interval
ci = analyzer.confidence_interval(data, 0.95)
print(f"95% CI for mean: ({ci[0]:.3f}, {ci[1]:.3f})")
# Hypothesis test (test if mean = 10)
test_result = analyzer.hypothesis_test_mean(data, null_mean=10)
print(f"T-test (H0: ΒΌ=10): t={test_result['t_statistic']:.3f}, "
f"p={test_result['p_value']:.4f}, {test_result['conclusion']}")
# Power analysis
effect_size = abs(np.mean(data) - 10) / np.std(data, ddof=1)
required_n = analyzer.power_analysis(effect_size, power=0.8)
print(f"Required sample size for 80% power: {required_n}")
# Transformation analysis
transforms = analyzer.transformation_analysis(data)
print(f"Best transformation: {transforms['best_transformation']} "
f"(Shapiro p-value: {transforms[transforms['best_transformation']]['shapiro_p']:.4f})")
# Visualization of transformations
plt.figure(figsize=(16, 12))
for i, (name, data) in enumerate(datasets.items()):
transforms = analyzer.transformation_analysis(data)
# Original data
plt.subplot(4, 4, i*4 + 1)
plt.hist(data, bins=20, density=True, alpha=0.7, color=f'C{i}')
plt.title(f'{name}\n(Original)')
plt.xlabel('Value')
plt.ylabel('Density')
# Best transformation
best_transform = transforms['best_transformation']
if best_transform != 'original':
best_data = transforms[best_transform]['data']
plt.subplot(4, 4, i*4 + 2)
plt.hist(best_data, bins=20, density=True, alpha=0.7, color=f'C{i}')
plt.title(f'{name}\n({best_transform.title()})')
plt.xlabel('Transformed Value')
plt.ylabel('Density')
# Q-Q plots
plt.subplot(4, 4, i*4 + 3)
stats.probplot(data, dist="norm", plot=plt)
plt.title('Original Q-Q Plot')
plt.subplot(4, 4, i*4 + 4)
stats.probplot(best_data, dist="norm", plot=plt)
plt.title(f'{best_transform.title()} Q-Q Plot')
else:
plt.subplot(4, 4, i*4 + 2)
plt.text(0.5, 0.5, 'No transformation\nneeded',
ha='center', va='center', transform=plt.gca().transAxes)
plt.axis('off')
plt.subplot(4, 4, i*4 + 3)
stats.probplot(data, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.subplot(4, 4, i*4 + 4)
plt.axis('off')
plt.tight_layout()
plt.show()
# Run advanced analysis demonstration
demo_advanced_analysis()
## Assumptions and Limitations
### Assumptions
Mathematical Assumptions:
- Continuous data: Variables are measured on a continuous scale
- Independence: Observations are independent of each other
- Infinite support: Theoretically, values can range from −∞ to +∞
- Symmetry: Distribution is perfectly symmetric around the mean
- Single mode: Only one peak in the distribution

Statistical Assumptions in ML:
- IID samples: Data points are independently and identically distributed
- Stationarity: Distribution parameters don't change over time
- Linearity: Linear relationships between variables (in linear models)
- Homoscedasticity: Constant variance across all levels of independent variables
- No outliers: Extreme values don't significantly affect the distribution
### Limitations
Theoretical Limitations:
- Infinite tails: Assigns non-zero probability to extreme values (can be unrealistic)
- Symmetry assumption: Real-world data often shows skewness
- Single modality: Cannot model multimodal distributions
- Parameter sensitivity: Small changes in μ or σ can significantly affect probabilities
- Curse of dimensionality: In high dimensions, most data lies far from the center

Practical Limitations:
- Finite data: Real datasets have finite ranges, unlike the theoretical normal distribution
- Measurement precision: Discrete measurements approximate continuous distributions
- Outlier sensitivity: Sample statistics are heavily influenced by extreme values
- Model assumptions: Many statistical tests assume normality, but real data may not follow it
- Transformation needs: Data often requires preprocessing to achieve normality
### When Normality Assumptions Fail
Common Violations:

| Issue | Description | Detection | Solutions |
|---|---|---|---|
| Skewness | Asymmetric distribution | Histogram, skewness statistic | Log transform, Box-Cox |
| Heavy tails | More extreme values than expected | Kurtosis, Q-Q plots | Robust methods, t-distribution |
| Multimodality | Multiple peaks | Histogram, density plots | Mixture models, clustering |
| Discrete data | Integer or categorical values | Data inspection | Poisson, binomial models |
| Bounded data | Limited range (e.g., percentages) | Domain knowledge | Beta distribution, logit transform |

Diagnostic Tools:
- Visual: Histograms, Q-Q plots, box plots
- Statistical: Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov tests
- Descriptive: Skewness, kurtosis, range checks
### Robust Alternatives
When to Use Alternatives:
- t-Distribution: For heavy-tailed data or small samples
- Log-normal: For positively skewed data
- Gamma/Exponential: For non-negative, skewed data
- Beta: For bounded data (0,1)
- Mixture Models: For multimodal data
- Non-parametric Methods: When no distributional assumptions can be made
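As a hedged sketch of the first alternative, the snippet below fits both a normal and a Student-t distribution to simulated heavy-tailed data and compares their log-likelihoods (the data, sample size, and comparison criterion are illustrative assumptions, not part of the text above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heavy = rng.standard_t(df=3, size=2000)        # heavy-tailed sample

mu_n, sigma_n = stats.norm.fit(heavy)          # ML fit of a normal
df_t, loc_t, scale_t = stats.t.fit(heavy)      # ML fit of a Student-t

ll_norm = stats.norm.logpdf(heavy, mu_n, sigma_n).sum()
ll_t = stats.t.logpdf(heavy, df_t, loc_t, scale_t).sum()
print(f"Normal log-likelihood:    {ll_norm:.1f}")
print(f"Student-t log-likelihood: {ll_t:.1f}  (usually higher for heavy tails)")
```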
## Interview Questions
Q1: Explain the difference between normal distribution and standard normal distribution.
Answer:
Normal Distribution:
- General form: \(N(\mu, \sigma^2)\)
- Can have any mean \(\mu\) and standard deviation \(\sigma > 0\)
- PDF: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)

Standard Normal Distribution:
- Special case: \(N(0, 1)\)
- Mean = 0, standard deviation = 1
- PDF: \(\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}\)
- Also called the z-distribution

Relationship: Any normal distribution can be converted to standard normal using the z-score transformation: \(Z = \frac{X - \mu}{\sigma}\)

Why it's important:
- Standardizes comparisons across different scales
- Simplifies probability calculations
- Used in hypothesis testing and confidence intervals
- Tables and software often reference the standard normal
Q2: What is the Central Limit Theorem and why is it important for the normal distribution?
Answer:
Central Limit Theorem (CLT) states: For any population with mean \(\mu\) and finite variance \(\sigma^2\), the distribution of sample means approaches normal as the sample size increases:

$$\bar{X}_n \approx N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \quad \text{for large } n$$

Key Points:
- The original population can have ANY distribution
- Sample size typically needs to be ≥ 30 for a good approximation
- Larger samples → better normal approximation
- The standard error decreases as \(\frac{\sigma}{\sqrt{n}}\)

Importance:
1. Statistical Inference: Enables confidence intervals and hypothesis tests
2. Machine Learning: Justifies normality assumptions in many algorithms
3. Quality Control: Control charts based on sample means
4. A/B Testing: Comparison of group means
Example: Even if individual heights are not perfectly normal, the average height of groups of 30+ people will be approximately normal.
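A minimal simulation of the same idea, using an exponential population in place of heights (the population, sample size, and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 10,000 samples of size 30 from a clearly non-normal (exponential) population
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

print(f"Mean of sample means: {sample_means.mean():.3f}  (population mean = 1)")
print(f"Std of sample means:  {sample_means.std(ddof=1):.3f}  (sigma/sqrt(n) = {1/np.sqrt(30):.3f})")
# A histogram of sample_means is approximately bell-shaped despite the skewed population
```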
Q3: How do you test if data follows a normal distribution?
Answer:
Visual Methods:
1. Histogram: Should show a bell-shaped curve
2. Q-Q Plot: Points should lie on a straight line
3. Box Plot: Should be symmetric with few outliers

Statistical Tests:
1. Shapiro-Wilk Test (best for n < 50):
   - H₀: Data is normally distributed
   - Most powerful test for normality
   - `scipy.stats.shapiro(data)`
2. Anderson-Darling Test:
   - More sensitive to tail deviations
   - `scipy.stats.anderson(data, dist='norm')`
3. Kolmogorov-Smirnov Test:
   - Tests against a fitted normal distribution
   - Less powerful than the others
4. D'Agostino-Pearson Test:
   - Based on skewness and kurtosis
   - `scipy.stats.normaltest(data)`

Rule of Thumb:
- Use multiple methods together
- Visual inspection is crucial
- Tests may reject normality for large samples due to minor deviations
- Consider practical significance, not just statistical significance
Q4: What are the parameters of normal distribution and how do they affect the shape?
Answer:
Parameters:
1. Mean (μ) - Location parameter:
   - Determines the center of the distribution
   - Peak of the bell curve occurs at μ
   - Shifting μ moves the entire curve left/right
   - Range: \(-\infty < \mu < \infty\)
2. Standard Deviation (σ) - Scale parameter:
   - Determines the spread/width of the distribution
   - Controls how dispersed values are around the mean
   - Larger σ → wider, flatter curve
   - Smaller σ → narrower, taller curve
   - Range: \(\sigma > 0\)

Shape Effects:
- μ = 0, σ = 1: Standard normal (tall, narrow)
- μ = 0, σ = 2: Same center, wider spread
- μ = 5, σ = 1: Shifted right, same spread
- μ = 5, σ = 2: Shifted right, wider spread

Mathematical Properties:
- Mode = Median = Mean = μ
- Inflection points at μ ± σ
- 68% of data within μ ± σ
- 95% of data within μ ± 2σ
- 99.7% of data within μ ± 3σ
Q5: Explain the 68-95-99.7 rule (Empirical Rule) and its applications.
Answer:
The Empirical Rule states: For any normal distribution:
- 68% of values lie within 1 standard deviation: \(P(\mu - \sigma \leq X \leq \mu + \sigma) = 0.6827\)
- 95% of values lie within 2 standard deviations: \(P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) = 0.9545\)
- 99.7% of values lie within 3 standard deviations: \(P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) = 0.9973\)

Applications:
1. Quality Control:
   - Products outside 3σ limits are considered defective
   - Six Sigma methodology aims for 6σ quality
2. Outlier Detection:
   - Values beyond 2σ or 3σ are flagged as outliers
   - Z-scores > 3 are rare (0.3% probability)
3. Risk Assessment:
   - Financial returns: VaR calculations
   - Insurance: Claim amount predictions
4. Educational Testing:
   - Standardized test scores (SAT, GRE)
   - Grade curving and percentile ranks
5. Medical Diagnostics:
   - Normal ranges for lab values
   - Growth charts for children

Example: If IQ scores are N(100, 15):
- 68% of people have an IQ between 85 and 115
- 95% have an IQ between 70 and 130
- 99.7% have an IQ between 55 and 145
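The IQ figures above can be reproduced directly with scipy.stats.norm:

```python
from scipy import stats

iq = stats.norm(loc=100, scale=15)   # IQ ~ N(100, 15^2)
print(f"P(85 <= IQ <= 115) = {iq.cdf(115) - iq.cdf(85):.4f}")   # ~0.68
print(f"P(70 <= IQ <= 130) = {iq.cdf(130) - iq.cdf(70):.4f}")   # ~0.95
print(f"P(55 <= IQ <= 145) = {iq.cdf(145) - iq.cdf(55):.4f}")   # ~0.997
```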
Q6: How is normal distribution used in machine learning algorithms?
Answer:
Direct Usage:
- Naive Bayes Classifier:
- Assumes features follow normal distribution
-
Uses Gaussian likelihood: \(P(x_i|y) = N(\mu_{i,y}, \sigma_{i,y}^2)\)
-
Linear Regression:
- Assumes residuals are normally distributed
-
Enables confidence intervals and hypothesis tests
-
Discriminant Analysis (LDA/QDA):
- Assumes classes have multivariate normal distributions
- Decision boundaries based on Gaussian densities
Indirect Usage:
- Weight Initialization:
- Neural networks: Xavier/He initialization
-
Random weights from normal distribution
-
Regularization:
- Gaussian priors in Bayesian methods
-
L2 regularization equivalent to normal prior
-
Feature Engineering:
- Box-Cox transformation to achieve normality
-
StandardScaler assumes normal-like distribution
-
Uncertainty Quantification:
- Bayesian neural networks
-
Gaussian processes
-
Generative Models:
- VAE latent space often assumed normal
- Normalizing flows
Why Normal Distribution is Preferred: - Mathematical tractability - Central Limit Theorem justification - Maximum entropy for given mean and variance - Conjugate priors in Bayesian inference
Q7: What is the relationship between normal distribution and maximum likelihood estimation?
Answer:
MLE for Normal Distribution:
Given samples \(x_1, x_2, ..., x_n\) from \(N(\mu, \sigma^2)\):
Likelihood Function:
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Log-Likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

MLE Estimators: Taking derivatives and setting them to zero:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Key Properties:
- The sample mean is unbiased: \(E[\hat{\mu}] = \mu\)
- The MLE variance estimator is biased: \(E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2\)
- The unbiased estimator uses \(n-1\) in the denominator
- Both estimators are consistent and efficient

Connection to Machine Learning:
- Linear regression with a Gaussian noise assumption
- Foundation for many statistical tests
- Basis for confidence intervals
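A quick numerical check of the bias claim, averaging both variance estimators over many small simulated samples (sample size and true variance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n = 4.0, 10

samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))
var_mle = samples.var(axis=1, ddof=0).mean()       # divides by n (biased)
var_unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n-1 (unbiased)

print(f"MLE estimator average:      {var_mle:.3f}  (theory: (n-1)/n * sigma^2 = {true_var*(n-1)/n:.3f})")
print(f"Unbiased estimator average: {var_unbiased:.3f}  (theory: sigma^2 = {true_var:.3f})")
```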
Q8: How do you handle non-normal data in machine learning?
Answer:
Detection Methods:
- Visual inspection (histograms, Q-Q plots)
- Statistical tests (Shapiro-Wilk, Anderson-Darling)
- Skewness and kurtosis analysis
Transformation Techniques:
- Log Transformation: For right-skewed data
- \(y = \log(x)\) (requires \(x > 0\))
-
Reduces positive skewness
-
Square Root: For count data or mild skewness
-
\(y = \sqrt{x}\) (requires \(x \geq 0\))
-
Box-Cox Transformation: For positive data
- \(y = \frac{x^{\lambda} - 1}{\lambda}\) (if \(\lambda \neq 0\))
- \(y = \log(x)\) (if \(\lambda = 0\))
-
Automatically finds optimal Β»
-
Yeo-Johnson: Handles negative values
- Extension of Box-Cox for all real numbers
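A brief sketch of Box-Cox and Yeo-Johnson via scipy.stats on simulated skewed data (the lognormal sample and the shift of 1.0 are illustrative assumptions; the shift only demonstrates that Yeo-Johnson tolerates negative values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # positive, right-skewed

bc, lam_bc = stats.boxcox(skewed)            # Box-Cox: requires strictly positive data
yj, lam_yj = stats.yeojohnson(skewed - 1.0)  # Yeo-Johnson: handles negative values

print(f"Box-Cox lambda:     {lam_bc:.3f},  skewness after: {stats.skew(bc):.3f}")
print(f"Yeo-Johnson lambda: {lam_yj:.3f},  skewness after: {stats.skew(yj):.3f}")
```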
Alternative Approaches:
1. Robust Methods:
   - Use the median instead of the mean
   - Robust regression (Huber loss)
   - Trimmed statistics
2. Non-parametric Methods:
   - Random Forest, SVM
   - k-NN, Decision Trees
   - No distributional assumptions
3. Different Distributions:
   - Poisson for count data
   - Binomial for binary outcomes
   - Exponential for survival times
4. Ensemble Methods:
   - Bootstrap aggregating
   - Less sensitive to individual distributions
Q9: What is standardization and why is it important when features follow normal distributions?
Answer:
Standardization (Z-score normalization): Transform features to have mean 0 and standard deviation 1:

$$z = \frac{x - \mu}{\sigma}$$
Why It Is Important for Normal Data:
1. Scale Independence:
   - Features with different units become comparable
   - Example: Age (0-100) vs Income ($0-$100,000)
2. Algorithm Performance:
   - Distance-based algorithms (k-NN, SVM, k-means)
   - Gradient descent convergence
   - Neural network training stability
3. Mathematical Properties:
   - Preserves the shape of a normal distribution
   - The standardized normal has known properties
   - Enables use of z-tables and standard formulas

When the Normal Assumption Helps:
- The 68-95-99.7 rule applies after standardization
- Outlier detection using z-scores
- Statistical tests and confidence intervals
- Feature importance comparison
Implementation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Alternative: Min-Max Normalization:
- Scales to the [0, 1] range
- Better when the distribution is not normal
- Preserves the original distribution shape
Q10: Explain the concept of multivariate normal distribution and its applications.
Answer:
Multivariate Normal Distribution: Extension of normal distribution to multiple variables:
PDF:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Where:
- \(\mathbf{x} = [x_1, x_2, ..., x_k]^T\) is a k-dimensional vector
- \(\boldsymbol{\mu}\) is the mean vector
- \(\boldsymbol{\Sigma}\) is the covariance matrix

Key Properties:
- Marginal distributions are normal
- Linear combinations are normal
- Conditional distributions are normal
- Zero correlation implies independence
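A minimal sketch of these definitions using scipy.stats.multivariate_normal (the mean vector and covariance matrix are arbitrary illustrative values):

```python
import numpy as np
from scipy import stats

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

mvn = stats.multivariate_normal(mean=mu, cov=Sigma)
print(f"Density at [0.5, 1.5]: {mvn.pdf([0.5, 1.5]):.4f}")

# Marginals of a multivariate normal are themselves normal
samples = mvn.rvs(size=10_000, random_state=42)
print(f"Marginal of x1: mean {samples[:, 0].mean():.3f}, var {samples[:, 0].var(ddof=1):.3f}")
```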
Applications in ML:
1. Gaussian Mixture Models (GMM):
   - Clustering with probabilistic assignments
   - Each cluster is a multivariate normal
2. Principal Component Analysis (PCA):
   - Assumes data follows a multivariate normal
   - Finds principal directions of variation
3. Linear Discriminant Analysis (LDA):
   - Classes assumed multivariate normal
   - Same covariance matrix across classes
4. Gaussian Processes:
   - Function values follow a multivariate normal
   - Used in Bayesian optimization
5. Kalman Filters:
   - State estimation in time series
   - System and observation noise assumed normal

Covariance Matrix Interpretation:
- Diagonal: Individual variable variances
- Off-diagonal: Correlations between variables
- Eigenvalues: Principal component variances
- Eigenvectors: Principal directions
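The eigenvalue/eigenvector interpretation can be checked with a couple of lines of NumPy (reusing the illustrative covariance matrix from the sketch above):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # symmetric eigendecomposition
print("Principal component variances (eigenvalues):", eigenvalues)
print("Principal directions (eigenvectors, as columns):")
print(eigenvectors)
```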
## Examples
### Real-world Example: Quality Control in Manufacturing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler
import seaborn as sns
# Manufacturing Quality Control Example
np.random.seed(42)
def simulate_manufacturing_data():
"""
Simulate manufacturing data with normal distribution assumptions
"""
# Simulate production data over 30 days
n_days = 30
samples_per_day = 50
# Target specifications
target_length = 100.0 # mm
    tolerance = 2.0  # mm (so the acceptable range is 98-102 mm)
process_std = 0.8 # mm (process standard deviation)
# Simulate daily production
production_data = []
for day in range(1, n_days + 1):
# Daily mean might drift slightly (process variation)
daily_mean = target_length + np.random.normal(0, 0.2)
# Generate samples for the day
daily_samples = np.random.normal(daily_mean, process_std, samples_per_day)
for sample in daily_samples:
production_data.append({
'day': day,
'length': sample,
'within_spec': 98 <= sample <= 102,
'defect_type': 'none' if 98 <= sample <= 102 else ('short' if sample < 98 else 'long')
})
return pd.DataFrame(production_data)
# Generate manufacturing data
manufacturing_df = simulate_manufacturing_data()
print("Manufacturing Quality Control Analysis")
print("=" * 50)
print(f"Total samples: {len(manufacturing_df)}")
print(f"Target length: 100.0 mm Β± 2.0 mm")
print(f"Overall mean: {manufacturing_df['length'].mean():.3f} mm")
print(f"Overall std: {manufacturing_df['length'].std():.3f} mm")
print(f"Defect rate: {(~manufacturing_df['within_spec']).mean():.1%}")
# Daily statistics
daily_stats = manufacturing_df.groupby('day')['length'].agg(['mean', 'std', 'count'])
daily_defect_rate = manufacturing_df.groupby('day')['within_spec'].apply(lambda x: (~x).mean())
# Control chart analysis using the normal distribution
overall_mean = manufacturing_df['length'].mean()
overall_std = manufacturing_df['length'].std()
# Control limits (3-sigma rule)
ucl = overall_mean + 3 * overall_std # Upper Control Limit
lcl = overall_mean - 3 * overall_std # Lower Control Limit
usl = 102 # Upper Specification Limit
lsl = 98 # Lower Specification Limit
print(f"\nControl Chart Limits:")
print(f"Upper Control Limit (UCL): {ucl:.3f} mm")
print(f"Lower Control Limit (LCL): {lcl:.3f} mm")
print(f"Upper Specification Limit (USL): {usl:.1f} mm")
print(f"Lower Specification Limit (LSL): {lsl:.1f} mm")
# Process capability analysis
cp = (usl - lsl) / (6 * overall_std) # Process capability
cpk = min((usl - overall_mean)/(3 * overall_std),
(overall_mean - lsl)/(3 * overall_std)) # Process capability index
print(f"\nProcess Capability:")
print(f"Cp (potential capability): {cp:.3f}")
print(f"Cpk (actual capability): {cpk:.3f}")
print(f"Process capability interpretation:")
if cpk >= 1.33:
print(" Excellent process (< 63 PPM defects)")
elif cpk >= 1.0:
print(" Adequate process (< 2,700 PPM defects)")
else:
print(" Poor process (> 2,700 PPM defects)")
# Calculate theoretical defect rates using the normal distribution
z_usl = (usl - overall_mean) / overall_std
z_lsl = (lsl - overall_mean) / overall_std
prob_exceed_usl = 1 - stats.norm.cdf(z_usl)
prob_below_lsl = stats.norm.cdf(z_lsl)
theoretical_defect_rate = prob_exceed_usl + prob_below_lsl
print(f"\nDefect Rate Analysis:")
print(f"Observed defect rate: {(~manufacturing_df['within_spec']).mean():.1%}")
print(f"Theoretical defect rate (based on normal): {theoretical_defect_rate:.1%}")
print(f"Theoretical PPM: {theoretical_defect_rate * 1e6:.0f}")
# Visualization
plt.figure(figsize=(20, 15))
# 1. Overall distribution with specifications
plt.subplot(3, 4, 1)
plt.hist(manufacturing_df['length'], bins=50, density=True, alpha=0.7,
color='lightblue', edgecolor='black', label='Observed Data')
# Fit normal distribution
mu_fit, sigma_fit = stats.norm.fit(manufacturing_df['length'])
x = np.linspace(manufacturing_df['length'].min() - 1, manufacturing_df['length'].max() + 1, 1000)
plt.plot(x, stats.norm.pdf(x, mu_fit, sigma_fit), 'r-', linewidth=2, label='Fitted Normal')
# Add specification limits
plt.axvline(lsl, color='red', linestyle='--', linewidth=2, label='Spec Limits')
plt.axvline(usl, color='red', linestyle='--', linewidth=2)
plt.axvline(overall_mean, color='green', linestyle='-', linewidth=2, label='Process Mean')
plt.xlabel('Length (mm)')
plt.ylabel('Density')
plt.title('Overall Distribution vs Specifications')
plt.legend()
plt.grid(True, alpha=0.3)
# 2. Control chart for daily means
plt.subplot(3, 4, 2)
plt.plot(daily_stats.index, daily_stats['mean'], 'bo-', markersize=4)
plt.axhline(overall_mean, color='green', linestyle='-', label='Grand Mean')
plt.axhline(overall_mean + 3*overall_std/np.sqrt(50), color='red', linestyle='--', label='±3σ limits')
plt.axhline(overall_mean - 3*overall_std/np.sqrt(50), color='red', linestyle='--')
plt.xlabel('Day')
plt.ylabel('Daily Mean Length (mm)')
plt.title('X-bar Control Chart')
plt.legend()
plt.grid(True, alpha=0.3)
# 3. Control chart for daily standard deviations
plt.subplot(3, 4, 3)
plt.plot(daily_stats.index, daily_stats['std'], 'go-', markersize=4)
plt.axhline(overall_std, color='blue', linestyle='-', label='Grand Std')
plt.xlabel('Day')
plt.ylabel('Daily Std (mm)')
plt.title('S Control Chart')
plt.legend()
plt.grid(True, alpha=0.3)
# 4. Daily defect rates
plt.subplot(3, 4, 4)
plt.bar(daily_defect_rate.index, daily_defect_rate.values * 100, alpha=0.7, color='orange')
plt.axhline(theoretical_defect_rate * 100, color='red', linestyle='--',
label=f'Expected: {theoretical_defect_rate:.1%}')
plt.xlabel('Day')
plt.ylabel('Defect Rate (%)')
plt.title('Daily Defect Rates')
plt.legend()
plt.grid(True, alpha=0.3)
# 5. Q-Q plot to verify normality
plt.subplot(3, 4, 5)
stats.probplot(manufacturing_df['length'], dist="norm", plot=plt)
plt.title('Q-Q Plot: Normality Check')
plt.grid(True, alpha=0.3)
# 6. Individual measurements control chart
plt.subplot(3, 4, 6)
sample_subset = manufacturing_df.head(100) # First 100 samples
plt.plot(range(len(sample_subset)), sample_subset['length'], 'b-', alpha=0.6)
plt.axhline(overall_mean, color='green', linestyle='-', label='Mean')
plt.axhline(ucl, color='red', linestyle='--', label='Control Limits')
plt.axhline(lcl, color='red', linestyle='--')
plt.axhline(usl, color='orange', linestyle=':', label='Spec Limits')
plt.axhline(lsl, color='orange', linestyle=':')
plt.xlabel('Sample Number')
plt.ylabel('Length (mm)')
plt.title('Individual Measurements (First 100)')
plt.legend()
plt.grid(True, alpha=0.3)
# 7. Process capability visualization
plt.subplot(3, 4, 7)
x_cap = np.linspace(94, 106, 1000)
y_cap = stats.norm.pdf(x_cap, overall_mean, overall_std)
plt.plot(x_cap, y_cap, 'b-', linewidth=2, label='Process Distribution')
plt.fill_between(x_cap, 0, y_cap, where=((x_cap >= lsl) & (x_cap <= usl)),
alpha=0.3, color='green', label='Within Spec')
plt.fill_between(x_cap, 0, y_cap, where=(x_cap < lsl),
alpha=0.3, color='red', label='Below LSL')
plt.fill_between(x_cap, 0, y_cap, where=(x_cap > usl),
alpha=0.3, color='red', label='Above USL')
plt.axvline(lsl, color='red', linestyle='--', linewidth=2)
plt.axvline(usl, color='red', linestyle='--', linewidth=2)
plt.xlabel('Length (mm)')
plt.ylabel('Density')
plt.title(f'Process Capability (Cpk = {cpk:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
# 8. Histogram by defect type
plt.subplot(3, 4, 8)
defect_counts = manufacturing_df['defect_type'].value_counts()
colors = {'none': 'green', 'short': 'red', 'long': 'orange'}
bars = plt.bar(defect_counts.index, defect_counts.values,
color=[colors[x] for x in defect_counts.index])
plt.xlabel('Defect Type')
plt.ylabel('Count')
plt.title('Distribution by Defect Type')
for bar, count in zip(bars, defect_counts.values):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
str(count), ha='center', va='bottom')
plt.grid(True, alpha=0.3)
# 9. Moving average trend
plt.subplot(3, 4, 9)
manufacturing_df['moving_avg'] = manufacturing_df['length'].rolling(window=20, center=True).mean()
plt.plot(range(len(manufacturing_df)), manufacturing_df['length'], 'b-', alpha=0.3, label='Individual')
plt.plot(range(len(manufacturing_df)), manufacturing_df['moving_avg'], 'r-', linewidth=2, label='Moving Avg (20)')
plt.axhline(target_length, color='green', linestyle='--', label='Target')
plt.xlabel('Sample Number')
plt.ylabel('Length (mm)')
plt.title('Process Trend Analysis')
plt.legend()
plt.grid(True, alpha=0.3)
# 10. Probability calculations
plt.subplot(3, 4, 10)
# Create probability density areas
x_prob = np.linspace(96, 104, 1000)
y_prob = stats.norm.pdf(x_prob, overall_mean, overall_std)
plt.plot(x_prob, y_prob, 'b-', linewidth=2)
# Shade different probability regions
plt.fill_between(x_prob, 0, y_prob, where=(x_prob <= overall_mean - overall_std),
                 alpha=0.3, color='lightcoral', label='< μ-σ (16%)')
plt.fill_between(x_prob, 0, y_prob, where=((x_prob > overall_mean - overall_std) &
                                           (x_prob <= overall_mean + overall_std)),
                 alpha=0.3, color='lightgreen', label='μ±σ (68%)')
plt.fill_between(x_prob, 0, y_prob, where=(x_prob > overall_mean + overall_std),
                 alpha=0.3, color='lightcoral', label='> μ+σ (16%)')
plt.xlabel('Length (mm)')
plt.ylabel('Density')
plt.title('Probability Regions (Empirical Rule)')
plt.legend()
plt.grid(True, alpha=0.3)
# 11-12: Additional analyses
plt.subplot(3, 4, 11)
# Box plot by day (sample of days)
sample_days = [1, 10, 20, 30]
box_data = [manufacturing_df[manufacturing_df['day'] == day]['length'].values
for day in sample_days]
plt.boxplot(box_data, labels=[f'Day {d}' for d in sample_days])
plt.ylabel('Length (mm)')
plt.title('Distribution by Selected Days')
plt.grid(True, alpha=0.3)
plt.subplot(3, 4, 12)
# Cumulative defect rate
manufacturing_df['cumulative_defects'] = (~manufacturing_df['within_spec']).cumsum()
manufacturing_df['cumulative_rate'] = manufacturing_df['cumulative_defects'] / np.arange(1, len(manufacturing_df) + 1)
plt.plot(range(len(manufacturing_df)), manufacturing_df['cumulative_rate'] * 100, 'b-')
plt.axhline(theoretical_defect_rate * 100, color='red', linestyle='--', label='Expected Rate')
plt.xlabel('Sample Number')
plt.ylabel('Cumulative Defect Rate (%)')
plt.title('Cumulative Defect Rate Trend')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical summary
print(f"\nStatistical Analysis Summary:")
print(f"Shapiro-Wilk normality test: p = {stats.shapiro(manufacturing_df['length'])[1]:.6f}")
if stats.shapiro(manufacturing_df['length'])[1] > 0.05:
print(" Β Data appears to follow normal distribution (p > 0.05)")
else:
print(" Β Data may not be perfectly normal (p d 0.05)")
# Recommendations
print(f"\nRecommendations:")
if cpk >= 1.33:
    print("  Process is operating excellently")
elif cpk >= 1.0:
    print("  Process is adequate but could be improved")
    print("   - Consider reducing process variation")
    print("   - Check for process centering")
else:
    print("  Process needs immediate attention")
    print("   - High defect rate detected")
    print("   - Consider process adjustment or tighter control")
return manufacturing_df
# Run the manufacturing analysis
manufacturing_data = simulate_manufacturing_data()
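The capability index (Cpk) and the expected defect rate annotated in the plots above follow directly from the fitted normal model. A minimal, self-contained sketch of those calculations, using placeholder specification limits and process parameters rather than the simulated values:

# Process capability sketch (illustrative placeholder values, not the simulated data)
from scipy import stats

lsl_demo, usl_demo = 98.0, 102.0   # hypothetical lower/upper specification limits (mm)
mu_hat, sigma_hat = 100.1, 0.8     # hypothetical fitted process mean and standard deviation

# Cp ignores centering; Cpk penalizes a mean that drifts toward either spec limit
cp_demo = (usl_demo - lsl_demo) / (6 * sigma_hat)
cpk_demo = min(usl_demo - mu_hat, mu_hat - lsl_demo) / (3 * sigma_hat)

# Expected out-of-spec fraction under the fitted normal model
defect_demo = stats.norm.cdf(lsl_demo, mu_hat, sigma_hat) + stats.norm.sf(usl_demo, mu_hat, sigma_hat)

print(f"Cp = {cp_demo:.3f}, Cpk = {cpk_demo:.3f}, expected out-of-spec rate = {defect_demo:.4%}")

A Cpk of roughly 1.33 or higher is the usual rule of thumb for a capable process, which is the threshold used in the recommendation logic above.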
### Financial Risk Analysis Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from datetime import datetime, timedelta
# Financial Risk Analysis using Normal Distribution
def financial_risk_analysis():
"""
Analyze financial portfolio using normal distribution assumptions
"""
np.random.seed(42)
# Simulate daily returns for different assets
n_days = 252 # One trading year
assets = ['Stock_A', 'Stock_B', 'Bond', 'Commodity']
# Define asset characteristics (annual returns and volatility)
asset_params = {
'Stock_A': {'annual_return': 0.12, 'volatility': 0.20},
'Stock_B': {'annual_return': 0.08, 'volatility': 0.15},
'Bond': {'annual_return': 0.04, 'volatility': 0.05},
'Commodity': {'annual_return': 0.06, 'volatility': 0.25}
}
# Convert to daily parameters
daily_returns = {}
for asset, params in asset_params.items():
daily_mean = params['annual_return'] / 252
daily_std = params['volatility'] / np.sqrt(252)
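# Annualized parameters convert to daily ones by dividing the mean by 252 and the volatility by √252 (i.i.d. normal assumption)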
daily_returns[asset] = np.random.normal(daily_mean, daily_std, n_days)
# Create DataFrame
returns_df = pd.DataFrame(daily_returns)
# Portfolio weights
weights = np.array([0.4, 0.3, 0.2, 0.1]) # 40% Stock_A, 30% Stock_B, 20% Bond, 10% Commodity
# Calculate portfolio returns
returns_df['Portfolio'] = np.dot(returns_df[assets], weights)
print("Financial Portfolio Risk Analysis")
print("=" * 50)
print("Asset Allocation:")
for asset, weight in zip(assets, weights):
print(f" {asset}: {weight:.1%}")
print(f"\nPortfolio Statistics (Annualized):")
portfolio_annual_return = returns_df['Portfolio'].mean() * 252
portfolio_annual_volatility = returns_df['Portfolio'].std() * np.sqrt(252)
sharpe_ratio = portfolio_annual_return / portfolio_annual_volatility
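# Note: this Sharpe ratio implicitly assumes a zero risk-free rate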
print(f"Expected Return: {portfolio_annual_return:.2%}")
print(f"Volatility (Risk): {portfolio_annual_volatility:.2%}")
print(f"Sharpe Ratio: {sharpe_ratio:.3f}")
# Value at Risk (VaR) calculations using normal distribution
confidence_levels = [0.95, 0.99, 0.999]
print(f"\nValue at Risk (VaR) Analysis:")
print("Confidence Level | 1-Day VaR | 10-Day VaR | Monthly VaR")
print("-" * 55)
portfolio_mean = returns_df['Portfolio'].mean()
portfolio_std = returns_df['Portfolio'].std()
for conf in confidence_levels:
# Z-score for confidence level
z_score = stats.norm.ppf(1 - conf)
# VaR calculations (negative because it's a loss)
var_1day = -(portfolio_mean + z_score * portfolio_std)
var_10day = -(portfolio_mean * 10 + z_score * portfolio_std * np.sqrt(10))
var_monthly = -(portfolio_mean * 21 + z_score * portfolio_std * np.sqrt(21))
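# Multi-horizon VaR uses square-root-of-time scaling, which relies on i.i.d. normal daily returns (21 trading days ≈ one month)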
print(f"{conf:.1%} | {var_1day:.2%} | {var_10day:.2%} | {var_monthly:.2%}")
# Monte Carlo simulation for portfolio value
initial_value = 1000000 # $1M initial portfolio
# Simulate portfolio path
cumulative_returns = (1 + returns_df['Portfolio']).cumprod()
portfolio_values = initial_value * cumulative_returns
# Calculate maximum drawdown
running_max = portfolio_values.cummax()
drawdown = (portfolio_values - running_max) / running_max
max_drawdown = drawdown.min()
print(f"\nPortfolio Performance:")
print(f"Initial Value: ${initial_value:,.0f}")
print(f"Final Value: ${portfolio_values.iloc[-1]:,.0f}")
print(f"Total Return: {(portfolio_values.iloc[-1]/initial_value - 1):.2%}")
print(f"Maximum Drawdown: {max_drawdown:.2%}")
# Risk metrics
downside_returns = returns_df['Portfolio'][returns_df['Portfolio'] < 0]
downside_deviation = downside_returns.std() * np.sqrt(252)
print(f"Downside Deviation: {downside_deviation:.2%}")
print(f"Probability of Loss (daily): {(returns_df['Portfolio'] < 0).mean():.1%}")
# Normal distribution tests for each asset
print(f"\nNormality Tests (Shapiro-Wilk p-values):")
for asset in assets + ['Portfolio']:
_, p_value = stats.shapiro(returns_df[asset])
status = "Normal" if p_value > 0.05 else "Non-normal"
print(f" {asset}: p = {p_value:.4f} ({status})")
# Visualization
plt.figure(figsize=(20, 16))
# 1. Portfolio value over time
plt.subplot(4, 3, 1)
plt.plot(portfolio_values, linewidth=2, color='blue')
plt.title('Portfolio Value Over Time')
plt.xlabel('Trading Day')
plt.ylabel('Portfolio Value ($)')
plt.grid(True, alpha=0.3)
# 2. Daily returns distribution
plt.subplot(4, 3, 2)
plt.hist(returns_df['Portfolio'], bins=50, density=True, alpha=0.7, color='lightblue', edgecolor='black')
# Fit normal distribution
mu, sigma = stats.norm.fit(returns_df['Portfolio'])
x = np.linspace(returns_df['Portfolio'].min(), returns_df['Portfolio'].max(), 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Fitted Normal')
# Mark VaR levels
var_95 = stats.norm.ppf(0.05, mu, sigma)
var_99 = stats.norm.ppf(0.01, mu, sigma)
plt.axvline(var_95, color='orange', linestyle='--', label='95% VaR')
plt.axvline(var_99, color='red', linestyle='--', label='99% VaR')
plt.title('Portfolio Daily Returns Distribution')
plt.xlabel('Daily Return')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
# 3. Q-Q plot for portfolio returns
plt.subplot(4, 3, 3)
stats.probplot(returns_df['Portfolio'], dist="norm", plot=plt)
plt.title('Portfolio Returns Q-Q Plot')
plt.grid(True, alpha=0.3)
# 4. Individual asset returns
plt.subplot(4, 3, 4)
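# Note: np.cumsum adds simple returns, which only approximates compounded (cumprod) growth for small daily returns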
for i, asset in enumerate(assets):
plt.plot(np.cumsum(returns_df[asset]), label=asset, linewidth=2)
plt.title('Cumulative Returns by Asset')
plt.xlabel('Trading Day')
plt.ylabel('Cumulative Return')
plt.legend()
plt.grid(True, alpha=0.3)
# 5. Correlation matrix
plt.subplot(4, 3, 5)
corr_matrix = returns_df[assets].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5, cbar_kws={"shrink": .5})
plt.title('Asset Correlation Matrix')
# 6. Rolling volatility
plt.subplot(4, 3, 6)
rolling_vol = returns_df['Portfolio'].rolling(window=21).std() * np.sqrt(252)
plt.plot(rolling_vol, linewidth=2, color='purple')
plt.axhline(portfolio_annual_volatility, color='red', linestyle='--',
label=f'Average: {portfolio_annual_volatility:.1%}')
plt.title('Rolling 21-Day Volatility (Annualized)')
plt.xlabel('Trading Day')
plt.ylabel('Volatility')
plt.legend()
plt.grid(True, alpha=0.3)
# 7. Drawdown chart
plt.subplot(4, 3, 7)
plt.fill_between(range(len(drawdown)), drawdown, 0, color='red', alpha=0.3)
plt.plot(drawdown, color='red', linewidth=2)
plt.title('Portfolio Drawdown')
plt.xlabel('Trading Day')
plt.ylabel('Drawdown')
plt.grid(True, alpha=0.3)
# 8. VaR backtest
plt.subplot(4, 3, 8)
var_95_threshold = -(portfolio_mean + stats.norm.ppf(0.05) * portfolio_std)
breaches = returns_df['Portfolio'] < -var_95_threshold
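# A breach is a day whose return falls below the negative 95% VaR; roughly 5% of days should breach if the normal model holds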
plt.plot(returns_df['Portfolio'], alpha=0.7, color='blue', linewidth=1)
plt.axhline(-var_95_threshold, color='red', linestyle='--', label='95% VaR Threshold')
plt.scatter(range(len(returns_df)), returns_df['Portfolio'],
c=breaches, cmap='RdYlGn', s=10, alpha=0.7)
breach_rate = breaches.mean()
plt.title(f'VaR Backtesting (Breach Rate: {breach_rate:.1%})')
plt.xlabel('Trading Day')
plt.ylabel('Daily Return')
plt.legend()
plt.grid(True, alpha=0.3)
# 9. Risk-return scatter
plt.subplot(4, 3, 9)
annual_returns = returns_df[assets].mean() * 252
annual_volatilities = returns_df[assets].std() * np.sqrt(252)
plt.scatter(annual_volatilities, annual_returns, s=100, alpha=0.7)
for asset in assets:
    plt.annotate(asset, (annual_volatilities[asset], annual_returns[asset]),
                 xytext=(5, 5), textcoords='offset points')
# Add portfolio point
plt.scatter(portfolio_annual_volatility, portfolio_annual_return,
s=200, color='red', marker='*', label='Portfolio')
plt.title('Risk-Return Profile')
plt.xlabel('Volatility (Risk)')
plt.ylabel('Expected Return')
plt.legend()
plt.grid(True, alpha=0.3)
# 10. Monte Carlo simulation of future paths
plt.subplot(4, 3, 10)
n_simulations = 100
days_forward = 252 # 1 year forward
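# Each path draws i.i.d. normal daily returns with the historical portfolio mean and std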
future_paths = []
current_value = portfolio_values.iloc[-1]
for _ in range(n_simulations):
future_returns = np.random.normal(portfolio_mean, portfolio_std, days_forward)
future_values = current_value * np.cumprod(1 + future_returns)
future_paths.append(future_values)
plt.plot(range(days_forward), future_values, alpha=0.1, color='blue')
# Plot percentiles
future_paths = np.array(future_paths)
percentiles = [5, 25, 50, 75, 95]
colors = ['red', 'orange', 'green', 'orange', 'red']
for p, color in zip(percentiles, colors):
plt.plot(range(days_forward), np.percentile(future_paths, p, axis=0),
color=color, linewidth=2, label=f'{p}th percentile')
plt.title('Monte Carlo Future Price Simulation')
plt.xlabel('Days Forward')
plt.ylabel('Portfolio Value ($)')
plt.legend()
plt.grid(True, alpha=0.3)
# 11. Stress testing
plt.subplot(4, 3, 11)
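# Each stress scenario shifts the daily mean downward and inflates volatility before re-estimating the 99% VaR from simulated returns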
stress_scenarios = {
'Normal': (0, 1),
'Mild Stress': (-0.02, 1.5),
'Severe Stress': (-0.05, 2.0),
'Extreme Stress': (-0.10, 3.0)
}
scenario_results = []
scenario_names = []
for name, (shock_mean, vol_multiplier) in stress_scenarios.items():
stressed_returns = np.random.normal(
portfolio_mean + shock_mean,
portfolio_std * vol_multiplier,
1000
)
var_99_stressed = -np.percentile(stressed_returns, 1)
scenario_results.append(var_99_stressed)
scenario_names.append(name)
bars = plt.bar(scenario_names, scenario_results, color=['green', 'yellow', 'orange', 'red'])
plt.title('Stress Testing - 99% VaR')
plt.ylabel('VaR (1%)')
plt.xticks(rotation=45)
# Add value labels
for bar, val in zip(bars, scenario_results):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
f'{val:.2%}', ha='center', va='bottom')
plt.grid(True, alpha=0.3)
# 12. Expected shortfall (Conditional VaR)
plt.subplot(4, 3, 12)
confidence_levels_es = np.linspace(0.90, 0.999, 50)
var_values = []
es_values = []
for conf in confidence_levels_es:
var_threshold = -stats.norm.ppf(1-conf, portfolio_mean, portfolio_std)
var_values.append(var_threshold)
# Expected Shortfall (average of losses beyond VaR)
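# Note: with ~252 observations the empirical tail average is noisy; a closed-form normal ES is sketched after this example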
tail_returns = returns_df['Portfolio'][returns_df['Portfolio'] <= -var_threshold]
es = -tail_returns.mean() if len(tail_returns) > 0 else var_threshold
es_values.append(es)
plt.plot(confidence_levels_es, var_values, label='VaR', linewidth=2)
plt.plot(confidence_levels_es, es_values, label='Expected Shortfall', linewidth=2)
plt.title('VaR vs Expected Shortfall')
plt.xlabel('Confidence Level')
plt.ylabel('Risk Measure')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return returns_df, portfolio_values
# Run the financial analysis
returns_data, portfolio_data = financial_risk_analysis()
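Because the model for portfolio returns is normal, both VaR and Expected Shortfall also have closed forms, which avoid the sampling noise of averaging a small empirical tail. A minimal sketch with illustrative daily parameters (not the values produced by the simulation above):

# Closed-form normal VaR and Expected Shortfall (illustrative parameters)
from scipy import stats

mu_d, sigma_d = 0.0004, 0.009   # hypothetical daily mean and std of portfolio returns
alpha = 0.05                    # tail probability (95% confidence)

z = stats.norm.ppf(alpha)
var_95 = -(mu_d + z * sigma_d)                          # VaR reported as a positive loss magnitude
es_95 = -(mu_d - sigma_d * stats.norm.pdf(z) / alpha)   # ES = average loss beyond the VaR threshold

print(f"95% 1-day VaR: {var_95:.2%}   Expected Shortfall: {es_95:.2%}")

Expected Shortfall is always at least as large as VaR at the same confidence level, since it averages the losses beyond the VaR threshold rather than reporting the threshold itself.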
## References
Foundational Texts:
- Statistical Inference, Casella & Berger
- Introduction to Mathematical Statistics, Hogg, McKean & Craig
- Probability and Statistics, Walpole, Myers, Myers & Ye

Machine Learning Applications:
- The Elements of Statistical Learning, Hastie, Tibshirani & Friedman
- Pattern Recognition and Machine Learning, Christopher Bishop
- Machine Learning: A Probabilistic Perspective, Kevin Murphy

Classical Papers:
- Central Limit Theorem, Strassen (1964)
- Maximum Likelihood Estimation, Fisher (1922)
- Box-Cox Transformations, Box & Cox (1964)

Statistical Computing:
- SciPy Statistics Documentation
- Statsmodels Documentation
- NumPy Random Documentation

Quality Control Applications:
- Introduction to Statistical Quality Control, Douglas Montgomery
- Statistical Process Control, John Oakland

Financial Applications:
- Risk Management and Financial Institutions, John Hull
- Options, Futures, and Other Derivatives, John Hull
- Value at Risk: The New Benchmark for Managing Financial Risk, Philippe Jorion

Online Resources:
- Khan Academy Statistics
- StatQuest YouTube Channel
- MIT OpenCourseWare: Probability and Statistics
- Stanford CS229 Machine Learning Notes

Specialized Topics:
- Multivariate Normal Distribution
- Gaussian Processes
- Normalizing Flows
- Central Limit Theorem Variations