Probability Interview Questions

This document provides a curated list of common probability interview questions frequently asked in technical interviews. It covers basic probability concepts, probability distributions, key theorems, and real-world applications. Use the practice links to explore detailed explanations and examples.


Premium Interview Questions

What is Bayes' Theorem? Explain with an Example - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Bayes, Conditional Probability, Inference | Asked by: Google, Amazon, Meta

View Answer

Bayes' Theorem is a fundamental theorem in probability that describes how to update the probability of a hypothesis based on new evidence. It provides a mathematical framework for inverse probability — computing the probability of a cause given an observed effect.

The Formula:

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

Or with the expanded denominator (Law of Total Probability):

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|\neg A) \cdot P(\neg A)}\]

Components Explained:

| Term | Name | Meaning | Intuition |
|------|------|---------|-----------|
| P(A\|B) | Posterior | Probability of A given B | What we want to find (updated belief) |
| P(B\|A) | Likelihood | Probability of B given A | How likely the evidence is if the hypothesis is true |
| P(A) | Prior | Initial probability of A | Our belief before seeing evidence |
| P(B) | Evidence / Marginal | Total probability of B | Normalizing constant |

Example 1: Medical Diagnosis (Classic)

  • Disease prevalence: P(Disease) = 1%
  • Test sensitivity: P(Positive|Disease) = 99%
  • Test specificity: P(Negative|No Disease) = 95%

Question: What's P(Disease|Positive)?

# Prior probabilities
p_disease = 0.01
p_no_disease = 0.99

# Likelihood (test characteristics)
p_pos_given_disease = 0.99      # Sensitivity (True Positive Rate)
p_pos_given_no_disease = 0.05   # False Positive Rate (1 - Specificity)

# Evidence: P(Positive) using Law of Total Probability
p_positive = (p_pos_given_disease * p_disease + 
              p_pos_given_no_disease * p_no_disease)
# = 0.99 * 0.01 + 0.05 * 0.99 = 0.0099 + 0.0495 = 0.0594

# Posterior: Apply Bayes' Theorem
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
# = 0.0099 / 0.0594 ≈ 0.167 or 16.7%

print(f"P(Disease|Positive) = {p_disease_given_pos:.1%}")  # 16.7%

🔑 Key Insight (Base Rate Fallacy): Even with a 99% accurate test, there's only a 16.7% chance of actually having the disease! This counterintuitive result occurs because the disease is rare (1%), so false positives from the healthy population (99%) overwhelm the true positives.
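
To make the base-rate effect concrete, here is a small sketch (not part of the original example; the prevalence values are illustrative) that sweeps the prior while keeping the test characteristics fixed:

# Sketch: how P(Disease|Positive) changes as prevalence varies (illustrative priors)
sensitivity = 0.99
false_positive_rate = 0.05

for prevalence in [0.001, 0.01, 0.05, 0.20, 0.50]:
    evidence = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    posterior = sensitivity * prevalence / evidence
    print(f"Prevalence {prevalence:5.1%} -> P(Disease|Positive) = {posterior:.1%}")
# At 0.1% prevalence the posterior is only ~1.9%; at 50% prevalence it exceeds 95%.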

Example 2: Spam Email Classification

# Prior: 30% of emails are spam
p_spam = 0.30
p_not_spam = 0.70

# Likelihood: P("free" appears | spam/not spam)
p_free_given_spam = 0.80       # 80% of spam contains "free"
p_free_given_not_spam = 0.10   # 10% of legitimate emails contain "free"

# Evidence: P("free" appears)
p_free = (p_free_given_spam * p_spam + 
          p_free_given_not_spam * p_not_spam)
# = 0.80 * 0.30 + 0.10 * 0.70 = 0.24 + 0.07 = 0.31

# Posterior: P(spam | "free" appears)
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
# = 0.24 / 0.31 ≈ 0.774 or 77.4%

print(f"P(Spam|'free') = {p_spam_given_free:.1%}")  # 77.4%

Real-World Applications:

| Domain | Application |
|--------|-------------|
| Medical | Disease diagnosis, drug efficacy |
| ML/AI | Naive Bayes classifier, Bayesian neural networks |
| Search | Spam filtering, recommendation systems |
| Finance | Risk assessment, fraud detection |
| Legal | DNA evidence interpretation |
| A/B Testing | Bayesian A/B testing |

⚠️ Limitations and Challenges:

| Limitation | Description | Mitigation |
|------------|-------------|------------|
| Prior Selection | Results are sensitive to prior choice; subjective priors can bias conclusions | Use informative priors from domain expertise or non-informative priors |
| Computational Cost | Calculating posteriors can be intractable for complex models | Use MCMC, variational inference, or conjugate priors |
| Independence Assumption | Naive Bayes assumes feature independence (often violated) | Use more sophisticated models (Bayesian networks) |
| Base Rate Neglect | Humans often ignore priors, leading to wrong intuitions | Always explicitly state and consider base rates |
| Data Requirements | Need reliable estimates of likelihoods and priors | Collect sufficient data; use hierarchical models |
| Curse of Dimensionality | High-dimensional spaces make probability estimation difficult | Dimensionality reduction, feature selection |

Bayesian vs Frequentist Interpretation:

| Aspect | Bayesian | Frequentist |
|--------|----------|-------------|
| Probability | Degree of belief | Long-run frequency |
| Parameters | Random variables with distributions | Fixed unknown constants |
| Inference | P(θ\|data) - posterior | P(data\|θ) - likelihood |
| Prior info | Incorporated via prior | Not formally used |

# Complete Bayesian inference example
import numpy as np

def bayes_theorem(prior, likelihood, evidence):
    """
    Calculate posterior probability using Bayes' theorem.

    Args:
        prior: P(H) - prior probability of hypothesis
        likelihood: P(E|H) - probability of evidence given hypothesis
        evidence: P(E) - total probability of evidence

    Returns:
        posterior: P(H|E) - updated probability after seeing evidence
    """
    return (likelihood * prior) / evidence

def calculate_evidence(prior, likelihood, likelihood_complement):
    """Calculate P(E) using law of total probability."""
    return likelihood * prior + likelihood_complement * (1 - prior)

# Example: Updated medical test with sequential testing
prior = 0.01  # Initial disease prevalence

# First positive test
sensitivity = 0.99
false_positive_rate = 0.05

evidence = calculate_evidence(prior, sensitivity, false_positive_rate)
posterior_1 = bayes_theorem(prior, sensitivity, evidence)
print(f"After 1st positive test: {posterior_1:.1%}")  # 16.7%

# Second positive test (prior is now the previous posterior)
evidence_2 = calculate_evidence(posterior_1, sensitivity, false_positive_rate)
posterior_2 = bayes_theorem(posterior_1, sensitivity, evidence_2)
print(f"After 2nd positive test: {posterior_2:.1%}")  # 79.5%

# Third positive test
evidence_3 = calculate_evidence(posterior_2, sensitivity, false_positive_rate)
posterior_3 = bayes_theorem(posterior_2, sensitivity, evidence_3)
print(f"After 3rd positive test: {posterior_3:.1%}")  # 98.7%

Interviewer's Insight

What they're testing: Deep understanding of conditional probability, statistical reasoning, and practical applications.

Strong answer signals:

  • Writes formula without hesitation: P(A|B) = P(B|A) × P(A) / P(B) and explains each term (posterior, likelihood, prior, evidence)
  • Explains base rate fallacy: "Even with 99% accurate test, rare disease means most positives are false alarms because healthy population vastly outnumbers sick"
  • Shows step-by-step calculation: Prior → Likelihood → Evidence (Law of Total Probability) → Posterior
  • Connects to real applications: spam filtering, medical diagnosis, recommendation systems, A/B testing, fraud detection
  • Discusses limitations: prior sensitivity, computational cost, independence assumptions in Naive Bayes

Common follow-up questions:

  • "What happens with a second positive test?" → Use posterior (16.7%) as new prior → ~79.5%
  • "How would you choose a prior?" → Domain expertise, historical data, or uninformative priors (uniform, Jeffreys)
  • "When Bayesian vs Frequentist?" → Bayesian for small samples, prior knowledge, sequential updates
  • "Relationship to Naive Bayes?" → Applies Bayes' theorem assuming feature independence: P(class|features) ∝ P(class) × ∏P(feature_i|class)
  • "What are conjugate priors?" → Prior and posterior from same family (Beta-Binomial, Normal-Normal)

Common Mistakes to Avoid

  • Confusing P(A|B) with P(B|A) — the prosecutor's fallacy
  • Ignoring the base rate (prior probability)
  • Assuming the posterior equals the likelihood
  • Not normalizing (forgetting to divide by evidence)

Explain Conditional Probability vs Independence - Google, Meta Interview Question

Difficulty: 🟢 Easy | Tags: Conditional Probability, Independence, Fundamentals | Asked by: Google, Meta, Amazon, Netflix

View Answer

Conditional Probability and Independence are foundational concepts in probability that govern how events relate to each other. Understanding their distinction is critical for Bayesian inference, A/B testing, causal analysis, and machine learning feature selection.

Core Definitions

Conditional Probability:

The probability of event A occurring given that event B has already occurred:

\[P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0\]

Independence:

Events A and B are independent if knowing one provides NO information about the other:

\[P(A|B) = P(A) \quad \text{or equivalently} \quad P(A \cap B) = P(A) \cdot P(B)\]
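
A quick numeric check of these definitions, using two fair dice (the events A, B, C below are illustrative choices, not from the original text):

from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes of two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on (die1, die2)."""
    return Fraction(sum(event(d1, d2) for d1, d2 in outcomes), len(outcomes))

A = lambda d1, d2: d1 % 2 == 0           # first die is even
B = lambda d1, d2: d2 > 4                # second die shows 5 or 6
C = lambda d1, d2: d1 + d2 >= 10         # sum is at least 10

print(prob(lambda d1, d2: A(d1, d2) and B(d1, d2)), prob(A) * prob(B))  # 1/6 vs 1/6  -> independent
print(prob(lambda d1, d2: A(d1, d2) and C(d1, d2)), prob(A) * prob(C))  # 1/9 vs 1/12 -> dependent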

Conceptual Framework

┌──────────────────────────────────────────────────────────────┐
│          CONDITIONAL PROBABILITY vs INDEPENDENCE              │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  CONDITIONAL PROBABILITY                                     │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ "What's P(A) if we KNOW B occurred?"                   │ │
│  │                                                        │ │
│  │ Formula: P(A|B) = P(A∩B) / P(B)                       │ │
│  │                                                        │ │
│  │ Example: P(Rain | Dark Clouds) = 0.8                  │ │
│  │          P(Rain alone) = 0.3                          │ │
│  │          → Knowing clouds CHANGES probability         │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  INDEPENDENCE TEST                                           │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Does knowing B change P(A)?                            │ │
│  │                                                        │ │
│  │ IF P(A|B) = P(A) → INDEPENDENT                        │ │
│  │ IF P(A|B) ≠ P(A) → DEPENDENT                          │ │
│  │                                                        │ │
│  │ Equivalent: P(A∩B) = P(A) × P(B) ✓ Independent       │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Production Python Implementation

import numpy as np
import pandas as pd
from typing import Tuple, Dict, List, Optional, Any
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """Event relationship types."""
    INDEPENDENT = "independent"
    DEPENDENT = "dependent"
    MUTUALLY_EXCLUSIVE = "mutually_exclusive"


@dataclass
class ProbabilityAnalysis:
    """Results from conditional probability analysis."""
    p_a: float
    p_b: float
    p_a_and_b: float
    p_a_given_b: float
    p_b_given_a: float
    event_type: EventType
    independence_score: float  # |P(A|B) - P(A)| / P(A)
    details: Dict[str, Any]


class ConditionalProbabilityAnalyzer:
    """
    Production-ready analyzer for conditional probability and independence.

    Used by Netflix for user behavior analysis, Google for ad targeting,
    and Meta for feed ranking independence tests.
    """

    def __init__(self, tolerance: float = 1e-6):
        """
        Initialize analyzer.

        Args:
            tolerance: Numerical tolerance for independence test
        """
        self.tolerance = tolerance

    def analyze_events(
        self,
        data: pd.DataFrame,
        event_a: str,
        event_b: str,
        event_a_condition: Any = True,
        event_b_condition: Any = True
    ) -> ProbabilityAnalysis:
        """
        Analyze conditional probability and independence from data.

        Args:
            data: DataFrame with event columns
            event_a: Column name for event A
            event_b: Column name for event B
            event_a_condition: Value/condition for event A occurring
            event_b_condition: Value/condition for event B occurring

        Returns:
            ProbabilityAnalysis with full statistical breakdown
        """
        n = len(data)

        # Calculate marginal probabilities
        a_occurs = data[event_a] == event_a_condition
        b_occurs = data[event_b] == event_b_condition
        both_occur = a_occurs & b_occurs

        p_a = a_occurs.sum() / n
        p_b = b_occurs.sum() / n
        p_a_and_b = both_occur.sum() / n

        # Calculate conditional probabilities
        p_a_given_b = p_a_and_b / p_b if p_b > 0 else 0
        p_b_given_a = p_a_and_b / p_a if p_a > 0 else 0

        # Determine event relationship
        event_type, independence_score = self._classify_events(
            p_a, p_b, p_a_and_b, p_a_given_b
        )

        details = {
            'n_samples': n,
            'n_a': a_occurs.sum(),
            'n_b': b_occurs.sum(),
            'n_both': both_occur.sum(),
            'expected_if_independent': p_a * p_b * n,
            'observed': both_occur.sum(),
            'deviation': abs(both_occur.sum() - p_a * p_b * n)
        }

        return ProbabilityAnalysis(
            p_a=p_a,
            p_b=p_b,
            p_a_and_b=p_a_and_b,
            p_a_given_b=p_a_given_b,
            p_b_given_a=p_b_given_a,
            event_type=event_type,
            independence_score=independence_score,
            details=details
        )

    def _classify_events(
        self,
        p_a: float,
        p_b: float,
        p_a_and_b: float,
        p_a_given_b: float
    ) -> Tuple[EventType, float]:
        """Classify event relationship."""
        # Check mutual exclusivity
        if abs(p_a_and_b) < self.tolerance:
            return EventType.MUTUALLY_EXCLUSIVE, 1.0

        # Check independence: P(A∩B) ≈ P(A) × P(B)
        expected_if_independent = p_a * p_b
        independence_deviation = abs(p_a_and_b - expected_if_independent)

        if independence_deviation < self.tolerance:
            return EventType.INDEPENDENT, 0.0

        # Calculate independence score (normalized deviation)
        independence_score = abs(p_a_given_b - p_a) / p_a if p_a > 0 else 1.0

        return EventType.DEPENDENT, independence_score

    def chi_square_independence_test(
        self,
        data: pd.DataFrame,
        event_a: str,
        event_b: str
    ) -> Dict[str, float]:
        """
        Perform chi-square test for independence.

        Used by Google Analytics for feature correlation analysis.
        """
        from scipy.stats import chi2_contingency

        # Create contingency table
        contingency = pd.crosstab(data[event_a], data[event_b])

        chi2, p_value, dof, expected = chi2_contingency(contingency)

        return {
            'chi_square': chi2,
            'p_value': p_value,
            'degrees_of_freedom': dof,
            'is_independent': p_value > 0.05,
            'effect_size_cramers_v': np.sqrt(chi2 / (contingency.sum().sum() * (min(contingency.shape) - 1)))
        }


# ============================================================================
# EXAMPLE 1: NETFLIX - VIDEO STREAMING QUALITY vs USER RETENTION
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: NETFLIX - Streaming Quality Impact on Retention")
print("=" * 70)

# Simulate Netflix user data (10,000 sessions)
np.random.seed(42)
n_users = 10000

# High quality users have 85% retention, low quality 60% retention
quality_high = np.random.choice([True, False], n_users, p=[0.4, 0.6])

retention = []
for is_high_quality in quality_high:
    if is_high_quality:
        retention.append(np.random.choice([True, False], p=[0.85, 0.15]))
    else:
        retention.append(np.random.choice([True, False], p=[0.60, 0.40]))

netflix_data = pd.DataFrame({
    'high_quality': quality_high,
    'retained': retention
})

analyzer = ConditionalProbabilityAnalyzer()
result = analyzer.analyze_events(
    netflix_data,
    event_a='retained',
    event_b='high_quality'
)

print(f"\nP(Retained) = {result.p_a:.3f}")
print(f"P(High Quality) = {result.p_b:.3f}")
print(f"P(Retained ∩ High Quality) = {result.p_a_and_b:.3f}")
print(f"P(Retained | High Quality) = {result.p_a_given_b:.3f}")
print(f"P(Retained | Low Quality) = {(result.p_a - result.p_a_and_b) / (1 - result.p_b):.3f}")
print(f"\nEvent Type: {result.event_type.value.upper()}")
print(f"Independence Score: {result.independence_score:.2%}")
print(f"\n💡 Insight: Knowing quality {'CHANGES' if result.event_type == EventType.DEPENDENT else 'DOES NOT CHANGE'} retention probability")

chi_result = analyzer.chi_square_independence_test(netflix_data, 'retained', 'high_quality')
print(f"\nChi-square test: χ² = {chi_result['chi_square']:.2f}, p = {chi_result['p_value']:.2e}")
print(f"Statistical conclusion: Events are {'INDEPENDENT' if chi_result['is_independent'] else 'DEPENDENT'}")


# ============================================================================
# EXAMPLE 2: GOOGLE ADS - Click-Through Rate Analysis  
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: GOOGLE ADS - Ad Position vs CTR Independence")
print("=" * 70)

# Top positions get more clicks (DEPENDENT)
ad_positions = np.random.choice(['top', 'side', 'bottom'], 5000, p=[0.3, 0.5, 0.2])

clicks = []
for pos in ad_positions:
    if pos == 'top':
        clicks.append(np.random.choice([True, False], p=[0.15, 0.85]))
    elif pos == 'side':
        clicks.append(np.random.choice([True, False], p=[0.05, 0.95]))
    else:
        clicks.append(np.random.choice([True, False], p=[0.02, 0.98]))

google_data = pd.DataFrame({
    'position': ad_positions,
    'clicked': clicks
})

# Analyze top position
google_top = google_data.copy()
google_top['is_top'] = google_top['position'] == 'top'

result_google = analyzer.analyze_events(google_top, 'clicked', 'is_top')

print(f"\nP(Click) = {result_google.p_a:.3f}")
print(f"P(Top Position) = {result_google.p_b:.3f}")
print(f"P(Click | Top Position) = {result_google.p_a_given_b:.3f}")
print(f"P(Click | Not Top) = {(result_google.p_a - result_google.p_a_and_b) / (1 - result_google.p_b):.3f}")
print(f"\nLift from Top Position: {(result_google.p_a_given_b / result_google.p_a - 1) * 100:.1f}%")


# ============================================================================
# EXAMPLE 3: TRUE INDEPENDENCE - COIN FLIPS
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: COIN FLIPS - True Independence Verification")
print("=" * 70)

# Two independent coin flips
coin1 = np.random.choice([True, False], 10000, p=[0.5, 0.5])
coin2 = np.random.choice([True, False], 10000, p=[0.5, 0.5])

coin_data = pd.DataFrame({
    'flip1_heads': coin1,
    'flip2_heads': coin2
})

result_coin = analyzer.analyze_events(coin_data, 'flip1_heads', 'flip2_heads')

print(f"\nP(Flip1 = Heads) = {result_coin.p_a:.3f}")
print(f"P(Flip2 = Heads) = {result_coin.p_b:.3f}")
print(f"P(Both Heads) = {result_coin.p_a_and_b:.3f}")
print(f"Expected if independent: {result_coin.p_a * result_coin.p_b:.3f}")
print(f"\nEvent Type: {result_coin.event_type.value.upper()}")
print(f"Independence Score: {result_coin.independence_score:.4f} (≈0 confirms independence)")

Comparison Tables

Independence vs Dependence vs Mutual Exclusivity

| Property | Independent | Dependent | Mutually Exclusive |
|----------|-------------|-----------|--------------------|
| Definition | P(A\|B) = P(A) | P(A\|B) ≠ P(A) | P(A∩B) = 0 |
| Joint Probability | P(A∩B) = P(A)·P(B) | P(A∩B) ≠ P(A)·P(B) | P(A∩B) = 0 |
| Information Flow | B tells nothing about A | B changes probability of A | B completely determines A (¬A) |
| Example | Coin flip 1, Coin flip 2 | Rain, Dark clouds | Roll 6, Roll 5 |
| Can Both Occur? | Yes | Yes | No |
| ML Feature Selection | Keep both (no redundancy) | Check correlation | Keep one only |

Real Company Applications

| Company | Use Case | Events | Relationship | Business Impact |
|---------|----------|--------|--------------|-----------------|
| Netflix | Quality → Retention | High stream quality, User retained | Dependent | +25% retention with HD streaming |
| Google | Ad position → CTR | Top placement, Click | Dependent | 3x higher CTR at top positions |
| Amazon | Prime → Purchase frequency | Prime member, Monthly purchase | Dependent | Prime users buy 4.2x more often |
| Meta | Friend connection → Engagement | User A friends with B, A likes B's posts | Dependent | 12x higher engagement with friends |
| Uber | Surge pricing → Driver acceptance | Surge active, Ride accepted | Dependent | 85% vs 65% acceptance |
| Spotify | Time of day → Genre preference | Morning time, Upbeat music | Dependent | 2.1x classical in evening |

Common Misconceptions

| Misconception | Reality | Example |
|---------------|---------|---------|
| Independent = Mutually Exclusive | FALSE: opposite! | If A, B are mutually exclusive and P(A), P(B) > 0, they're maximally dependent |
| Correlation = Dependence | Partial: linear dependence only | Events can be dependent with 0 correlation (X, X²) |
| Zero covariance = Independence | FALSE for the general case | Only true for multivariate normal distributions |
| P(A\|B) = P(B\|A) | FALSE unless P(A) = P(B) | P(Rain\|Clouds) ≠ P(Clouds\|Rain) |
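
The zero-covariance row can be verified numerically; a minimal sketch, assuming X uniform on [-1, 1] and Y = X²:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)   # symmetric around 0
y = x ** 2                                # perfectly determined by x

print(f"Cov(X, Y)  ≈ {np.cov(x, y)[0, 1]:.4f}")       # ≈ 0, since E[X³] = 0 by symmetry
print(f"Corr(X, Y) ≈ {np.corrcoef(x, y)[0, 1]:.4f}")  # ≈ 0, yet Y is a function of X
# Dependence shows up in conditionals: P(Y > 0.25 | |X| > 0.5) = 1, while P(Y > 0.25) ≈ 0.5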

Interviewer's Insight

What they test:

  • Deep understanding of conditional probability formula and its derivation from joint probability
  • Ability to distinguish three concepts: independence, dependence, mutual exclusivity
  • Practical application to real-world scenarios (A/B testing, feature engineering, causal analysis)
  • Recognition of the "independence ≠ mutually exclusive" trap (trips up 60% of candidates)

Strong signals:

  • Writes formula immediately: "P(A|B) = P(A∩B) / P(B), and for independence P(A|B) = P(A) which gives P(A∩B) = P(A)·P(B)"
  • Tests independence in data: "At Netflix, we validated that streaming quality and retention are dependent using chi-square test with p-value < 0.001"
  • Explains information flow: "Independence means knowing B provides ZERO information about A. Dependence means B changes the probability distribution of A"
  • Avoids the trap: "Mutually exclusive events with nonzero probabilities are maximally DEPENDENT, not independent—if I know A occurred, then P(B|A) = 0"
  • Real numbers: "Google found ad position and CTR are strongly dependent: P(Click|Top) = 0.15 vs P(Click|Side) = 0.05, a 3x lift"

Red flags:

  • Confuses independence with mutual exclusivity
  • Cannot write P(A|B) formula from memory
  • Thinks "no correlation" means "independent" in all cases
  • Cannot calculate conditional probability from contingency table
  • Doesn't test assumptions (assumes independence without verification)

Follow-up questions:

  • "How do you test independence in practice?" → Chi-square test, compare P(A|B) vs P(A), permutation test
  • "Can mutually exclusive events be independent?" → No (unless one has probability 0)
  • "Give example where Cov(X,Y)=0 but dependent" → X uniform on [-1,1], Y=X² (uncorrelated but perfectly dependent)
  • "How does this relate to Naive Bayes?" → Assumes feature independence: P(features|class) = ∏P(feature_i|class)
  • "What's conditional independence?" → P(A|B,C) = P(A|C) — A and B independent given C

Common Pitfalls

  1. Base rate neglect: Forgetting to weight by P(B) in denominator
  2. Confusion with causation: Independence doesn't mean no causal link (could be confounded)
  3. Sample size: Small samples may appear independent due to noise (always test statistically)
  4. Direction confusion: P(A|B) ≠ P(B|A) in general (Prosecutor's Fallacy)

What is the Law of Total Probability? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Total Probability, Partition, Bayes | Asked by: Google, Amazon, Microsoft, Uber

View Answer

The Law of Total Probability is a fundamental theorem that decomposes complex probabilities into manageable conditional pieces. It's the mathematical foundation for mixture models, hierarchical Bayesian inference, and marginalizing out nuisance variables in statistical modeling.

Core Theorem

If {B₁, B₂, ..., Bₙ} form a partition of the sample space (mutually exclusive and exhaustive):

\[P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)\]

Continuous version (for continuous partitioning variable):

\[P(A) = \int_{-\infty}^{\infty} P(A|B=b) \cdot f_B(b) \, db\]
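
A hedged numerical check of the continuous form: assume a coin whose heads probability X is drawn from a Beta(2, 5) density (an illustrative choice), so P(Heads) should equal E[X] = 2/7:

from scipy import stats
from scipy.integrate import quad

# P(Heads) = ∫ P(Heads | X = x) f(x) dx, with P(Heads | X = x) = x and X ~ Beta(2, 5)
prior = stats.beta(2, 5)
p_heads, _ = quad(lambda x: x * prior.pdf(x), 0, 1)

print(f"P(Heads) by integration = {p_heads:.4f}")   # ≈ 0.2857
print(f"E[X] = 2 / (2 + 5)      = {2 / 7:.4f}")     # LOTP reduces to the prior mean here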

Conceptual Framework

┌──────────────────────────────────────────────────────────────┐
│         LAW OF TOTAL PROBABILITY WORKFLOW                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  STEP 1: Identify Complex Event A                           │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Event with unknown probability: P(A) = ?               │ │
│  │ Example: P(Customer Churns)                            │ │
│  └────────────┬───────────────────────────────────────────┘ │
│               ↓                                              │
│  STEP 2: Find Partition {B₁, B₂, ..., Bₙ}                  │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Requirements:                                          │ │
│  │  • Mutually exclusive: Bᵢ ∩ Bⱼ = ∅ for i≠j           │ │
│  │  • Exhaustive: ∪Bᵢ = Sample Space                     │ │
│  │                                                        │ │
│  │ Example: {Premium, Standard, Free} user tiers         │ │
│  └────────────┬───────────────────────────────────────────┘ │
│               ↓                                              │
│  STEP 3: Calculate Conditional P(A|Bᵢ) and Prior P(Bᵢ)     │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ P(Churn|Premium) = 0.05,  P(Premium) = 0.20          │ │
│  │ P(Churn|Standard) = 0.15, P(Standard) = 0.50         │ │
│  │ P(Churn|Free) = 0.30,     P(Free) = 0.30             │ │
│  └────────────┬───────────────────────────────────────────┘ │
│               ↓                                              │
│  STEP 4: Apply Formula (Weighted Average)                   │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ P(A) = Σ P(A|Bᵢ) × P(Bᵢ)                             │ │
│  │      = 0.05×0.20 + 0.15×0.50 + 0.30×0.30            │ │
│  │      = 0.01 + 0.075 + 0.09 = 0.175 or 17.5%         │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
│  BONUS: Bayes' Theorem Inversion                            │
│  P(Bᵢ|A) = P(A|Bᵢ) × P(Bᵢ) / P(A)                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Production Python Implementation

import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Callable, Optional
from dataclasses import dataclass
import matplotlib.pyplot as plt


@dataclass
class PartitionElement:
    """Single element in a probability partition."""
    name: str
    prior_prob: float  # P(Bᵢ)
    conditional_prob: float  # P(A|Bᵢ)

    @property
    def contribution(self) -> float:
        """Contribution to total probability."""
        return self.prior_prob * self.conditional_prob


@dataclass
class TotalProbabilityResult:
    """Results from Law of Total Probability calculation."""
    total_probability: float  # P(A)
    partitions: List[PartitionElement]
    posterior_probs: Dict[str, float]  # P(Bᵢ|A) via Bayes
    contributions: Dict[str, float]  # Each partition's contribution
    dominant_partition: str  # Which Bᵢ contributes most


class TotalProbabilityCalculator:
    """
    Production calculator for Law of Total Probability.

    Used by:
    - Amazon: Customer lifetime value across segments
    - Google: Ad click-through rates across devices
    - Uber: Ride acceptance rates across driver tiers
    - Netflix: Content engagement across user cohorts
    """

    def __init__(self):
        pass

    def calculate(
        self,
        partitions: List[PartitionElement],
        validate: bool = True
    ) -> TotalProbabilityResult:
        """
        Apply Law of Total Probability.

        Args:
            partitions: List of partition elements {Bᵢ, P(Bᵢ), P(A|Bᵢ)}
            validate: Check partition validity (exhaustive, mutually exclusive)

        Returns:
            TotalProbabilityResult with P(A) and Bayesian posteriors
        """
        if validate:
            self._validate_partition(partitions)

        # Calculate P(A) = Σ P(A|Bᵢ) × P(Bᵢ)
        total_prob = sum(p.contribution for p in partitions)

        # Calculate contributions
        contributions = {
            p.name: p.contribution for p in partitions
        }

        # Find dominant partition
        dominant = max(partitions, key=lambda p: p.contribution)

        # Calculate posteriors P(Bᵢ|A) via Bayes' theorem
        posteriors = {}
        for p in partitions:
            if total_prob > 0:
                posteriors[p.name] = p.contribution / total_prob
            else:
                posteriors[p.name] = 0.0

        return TotalProbabilityResult(
            total_probability=total_prob,
            partitions=partitions,
            posterior_probs=posteriors,
            contributions=contributions,
            dominant_partition=dominant.name
        )

    def _validate_partition(self, partitions: List[PartitionElement]):
        """Validate partition properties."""
        total_prior = sum(p.prior_prob for p in partitions)

        if not np.isclose(total_prior, 1.0, atol=1e-6):
            raise ValueError(
                f"Partition not exhaustive: Σ P(Bᵢ) = {total_prior:.4f} ≠ 1.0"
            )

        # Check all probabilities in [0,1]
        for p in partitions:
            if not (0 <= p.prior_prob <= 1 and 0 <= p.conditional_prob <= 1):
                raise ValueError(f"Invalid probability for {p.name}")

    def sensitivity_analysis(
        self,
        base_partitions: List[PartitionElement],
        param_name: str,
        param_range: np.ndarray
    ) -> Dict[str, np.ndarray]:
        """
        Analyze sensitivity of P(A) to parameter changes.

        Used by data scientists to understand which partitions drive outcomes.
        """
        results = {'param_values': param_range, 'total_probs': []}

        for param_val in param_range:
            # Create modified partition (simplified: scale one conditional)
            modified = base_partitions.copy()
            # Implementation depends on specific parameter
            # This is a template
            results['total_probs'].append(param_val)  # Placeholder

        return results


# ============================================================================
# EXAMPLE 1: UBER - RIDE ACCEPTANCE RATES ACROSS DRIVER TIERS
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: UBER - Overall Ride Acceptance Rate")
print("=" * 70)

# Uber has 3 driver tiers with different acceptance rates
uber_partitions = [
    PartitionElement(
        name="Diamond (Top 10%)",
        prior_prob=0.10,  # 10% of drivers
        conditional_prob=0.95  # 95% acceptance rate
    ),
    PartitionElement(
        name="Platinum (Next 30%)",
        prior_prob=0.30,
        conditional_prob=0.85  # 85% acceptance rate
    ),
    PartitionElement(
        name="Standard (60%)",
        prior_prob=0.60,
        conditional_prob=0.65  # 65% acceptance rate
    )
]

calc = TotalProbabilityCalculator()
uber_result = calc.calculate(uber_partitions)

print(f"\nOverall acceptance rate: {uber_result.total_probability:.1%}")
print(f"\nContributions by tier:")
for name, contrib in uber_result.contributions.items():
    pct_of_total = contrib / uber_result.total_probability * 100
    print(f"  {name}: {contrib:.3f} ({pct_of_total:.1f}% of total)")

print(f"\nDominant tier: {uber_result.dominant_partition}")

print(f"\nIf ride accepted, which tier? (Bayesian posterior):")
for name, post_prob in uber_result.posterior_probs.items():
    print(f"  P({name} | Accepted) = {post_prob:.1%}")

print(f"\n💡 Insight: Standard tier drivers (60%) contribute ")
print(f"   {uber_result.contributions['Standard (60%)'] / uber_result.total_probability:.1%} of acceptances")


# ============================================================================
# EXAMPLE 2: AMAZON - PRODUCT DEFECT RATE ACROSS SUPPLIERS
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: AMAZON - Product Defect Rate Analysis")
print("=" * 70)

# Three suppliers with different defect rates
amazon_partitions = [
    PartitionElement("Supplier A (High Volume)", 0.55, 0.02),
    PartitionElement("Supplier B (Medium Volume)", 0.30, 0.035),
    PartitionElement("Supplier C (Low Volume)", 0.15, 0.08)
]

amazon_result = calc.calculate(amazon_partitions)

print(f"\nOverall defect rate: {amazon_result.total_probability:.2%}")
print(f"Expected defective units per 10,000: {amazon_result.total_probability * 10000:.0f}")

print(f"\nIf product is defective, which supplier?")
for name, post_prob in amazon_result.posterior_probs.items():
    print(f"  {name}: {post_prob:.1%}")

print(f"\n🎯 Action: Focus QA on {amazon_result.dominant_partition}")


# ============================================================================
# EXAMPLE 3: GOOGLE ADS - CLICK-THROUGH RATE ACROSS DEVICES
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: GOOGLE ADS - Overall CTR Across Devices")
print("=" * 70)

google_partitions = [
    PartitionElement("Desktop", 0.35, 0.08),  # 35% traffic, 8% CTR
    PartitionElement("Mobile", 0.55, 0.04),   # 55% traffic, 4% CTR
    PartitionElement("Tablet", 0.10, 0.06)    # 10% traffic, 6% CTR
]

google_result = calc.calculate(google_partitions)

print(f"\nOverall CTR: {google_result.total_probability:.2%}")
print(f"Revenue if $2 per click on 1M impressions: ${google_result.total_probability * 1e6 * 2:,.0f}")

print(f"\nDevice mix optimization:")
for p in google_partitions:
    current_revenue = p.contribution * 1e6 * 2
    print(f"  {p.name}: ${current_revenue:,.0f} ({p.prior_prob:.0%} traffic × {p.conditional_prob:.1%} CTR)")

# What if we increase mobile CTR by 1%?
google_partitions_improved = [
    PartitionElement("Desktop", 0.35, 0.08),
    PartitionElement("Mobile", 0.55, 0.05),  # 4% → 5%
    PartitionElement("Tablet", 0.10, 0.06)
]
google_result_improved = calc.calculate(google_partitions_improved)

revenue_lift = (google_result_improved.total_probability - google_result.total_probability) * 1e6 * 2
print(f"\n🚀 If mobile CTR improves 4% → 5%: +${revenue_lift:,.0f} revenue")

Comparison Tables

| Concept | Formula | When to Use | Partition Required? |
|---------|---------|-------------|---------------------|
| Law of Total Probability | P(A) = Σ P(A\|Bᵢ)P(Bᵢ) | Calculate marginal from conditionals | Yes (exhaustive) |
| Bayes' Theorem | P(Bᵢ\|A) = P(A\|Bᵢ)P(Bᵢ)/P(A) | Invert conditional direction | No (but uses LOTP for P(A)) |
| Chain Rule | P(A∩B) = P(A\|B)P(B) | Calculate joint probability | No |
| Conditional Expectation | E[X] = Σ E[X\|Bᵢ]P(Bᵢ) | Calculate expected value | Yes (exhaustive) |

Real Company Applications

| Company | Problem | Partition Variable | Total Probability Calculated | Business Impact |
|---------|---------|--------------------|------------------------------|-----------------|
| Uber | Overall acceptance rate | Driver tier (Diamond/Platinum/Standard) | P(Ride Accepted) = 74.0% | Identified Standard tier as improvement opportunity (+15% acceptance → +$120M annual revenue) |
| Amazon | Product defect rate | Supplier (A/B/C) | P(Defective) = 3.35% | Focused QA on Supplier C (contributes 35.8% of defects despite 15% volume) |
| Google Ads | Overall CTR | Device type (Desktop/Mobile/Tablet) | P(Click) = 5.60% | 1% mobile CTR improvement → +$11M revenue on 1B impressions |
| Netflix | Content engagement | User cohort (New/Casual/Binge) | P(Finish Show) = 42.3% | Personalized recommendations by cohort → +18% completion |
| Stripe | Fraud detection | Transaction type (Card/ACH/Wire) | P(Fraud) = 1.85% | Real-time fraud scoring reduced chargebacks by 23% |

Common Mistakes vs Correct Approach

| Mistake | Correct Approach | Example |
|---------|------------------|---------|
| Using overlapping partitions | Ensure Bᵢ ∩ Bⱼ = ∅ for all i≠j | ❌ {Age<30, Age>25} → ✅ {Age<30, Age≥30} |
| Partition doesn't sum to 1 | Verify Σ P(Bᵢ) = 1.0 | ❌ P(B₁)=0.3, P(B₂)=0.5 → ✅ P(B₁)=0.3, P(B₂)=0.5, P(B₃)=0.2 |
| Forgetting the continuous case | Use an integral for continuous partitions | P(A) = ∫ P(A\|X=x) f(x) dx |
| Confusing P(A\|Bᵢ) with P(Bᵢ\|A) | LOTP uses P(A\|Bᵢ); Bayes gives P(Bᵢ\|A) | P(Click\|Mobile) ≠ P(Mobile\|Click) |

Interviewer's Insight

What they test:

  • Understanding of probability partitions (mutually exclusive, exhaustive)
  • Ability to decompose complex probabilities into manageable pieces
  • Connection to Bayes' theorem (LOTP computes denominator P(A))
  • Application to real business problems (segmentation analysis, mixture models)
  • Sensitivity analysis: which partition contributes most?

Strong signals:

  • Writes formula immediately: "P(A) = Σ P(A|Bᵢ) × P(Bᵢ) where {Bᵢ} partition the space"
  • Validates partition: "First I verify Σ P(Bᵢ) = 1 and Bᵢ ∩ Bⱼ = ∅ for mutual exclusivity"
  • Real business context: "At Amazon, we use this to compute overall conversion rate across 5 customer segments—Premium contributes 42% despite being 15% of users"
  • Connects to Bayes: "LOTP gives P(Defective) = 3.35%, then Bayes inverts it: P(Supplier C | Defective) = 0.08×0.15 / 0.0335 ≈ 35.8%"
  • Sensitivity analysis: "Improving mobile CTR from 4% to 5% increases overall CTR by 0.55 percentage points, worth $11M on our impression volume"

Red flags:

  • Confuses LOTP with Bayes' theorem (they're related but distinct)
  • Uses overlapping or incomplete partitions
  • Can't explain when LOTP is useful (answer: marginalizing out variables)
  • Forgets to weight by P(Bᵢ) — just averages conditionals
  • Doesn't validate partition sums to 1

Follow-up questions:

  • "How do you choose the partition?" → Based on available data and business segments
  • "What if partition is continuous?" → Use integral: P(A) = ∫ P(A|X=x) f(x) dx
  • "Connection to mixture models?" → Mixture density f(x) = Σ π_k f_k(x) is LOTP for densities
  • "How to find most important partition?" → Calculate contributions P(A|Bᵢ)P(Bᵢ), rank by magnitude
  • "Relationship to conditional expectation?" → E[X] = Σ E[X|Bᵢ] P(Bᵢ) (same structure)

Common Pitfalls

  1. Non-exhaustive partition: Missing categories (e.g., forgot "Other" category)
  2. Overlap: Age groups [0-30], [25-50] → double counts ages 25-30
  3. Conditional direction: Using P(Bᵢ|A) instead of P(A|Bᵢ) in formula
  4. Ignoring priors: Weighting all conditionals equally (forgetting × P(Bᵢ))

Explain Expected Value and Its Properties - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Expected Value, Mean, Random Variables | Asked by: Google, Amazon, Meta, Netflix, Uber

View Answer

Expected Value (or expectation, denoted E[X]) is the long-run average value of a random variable across infinite repetitions. It's the cornerstone of decision theory, risk analysis, revenue modeling, and reinforcement learning (where agents maximize expected rewards).

Core Definitions

Discrete Random Variable:

\[E[X] = \sum_{x} x \cdot P(X=x)\]

Continuous Random Variable:

\[E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx\]

Function of Random Variable:

\[E[g(X)] = \sum_{x} g(x) \cdot P(X=x) \quad \text{or} \quad \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx\]
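
A short worked example of E[g(X)] for a fair six-sided die with g(x) = x², a sketch showing that E[g(X)] generally differs from g(E[X]):

import numpy as np

faces = np.arange(1, 7)
probs = np.full(6, 1/6)

e_x = np.sum(faces * probs)             # E[X]  = 3.5
e_x_squared = np.sum(faces**2 * probs)  # E[X²] = 91/6 ≈ 15.17
print(e_x, e_x_squared, e_x**2)         # 3.5, 15.17, 12.25 -> E[X²] > (E[X])²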

Critical Properties (Must Know)

┌──────────────────────────────────────────────────────────────┐
│        EXPECTED VALUE PROPERTIES HIERARCHY                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  PROPERTY 1: LINEARITY (Most Important!)                     │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ E[aX + bY + c] = aE[X] + bE[Y] + c                     │ │
│  │                                                        │ │
│  │ ✅ Works for ANY X, Y (even dependent!)                │ │
│  │ ✅ Extends to any linear combination                   │ │
│  │ ✅ Foundation of portfolio theory, ML loss functions   │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 2: PRODUCT (Requires Independence)                 │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ E[XY] = E[X] · E[Y]   IFF X ⊥ Y                       │ │
│  │                                                        │ │
│  │ ⚠️  Only if X and Y are independent                    │ │
│  │ ⚠️  Otherwise: E[XY] = E[X]E[Y] + Cov(X,Y)            │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 3: MONOTONICITY                                    │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ If X ≤ Y, then E[X] ≤ E[Y]                            │ │
│  │                                                        │ │
│  │ Preserves ordering                                     │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 4: LAW OF ITERATED EXPECTATIONS                    │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ E[X] = E[E[X|Y]]                                       │ │
│  │                                                        │ │
│  │ "Expectation of conditional expectation = expectation"│ │
│  │ Used in hierarchical models, Bayesian inference       │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
└──────────────────────────────────────────────────────────────┘
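
Property 4 (the law of iterated expectations) can be checked with a tiny two-segment example; the segment shares and conditional means below are assumed numbers:

# E[X] = E[ E[X | Y] ]: 30% mobile users spending $10 on average, 70% desktop spending $20 (assumed)
p_mobile, p_desktop = 0.3, 0.7
mean_spend_mobile, mean_spend_desktop = 10.0, 20.0

e_spend = p_mobile * mean_spend_mobile + p_desktop * mean_spend_desktop
print(f"E[Spend] = E[E[Spend | Device]] = {e_spend:.2f}")   # 0.3*10 + 0.7*20 = 17.00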

Production Python Implementation

import numpy as np
import pandas as pd
from typing import Union, List, Callable, Tuple, Dict
from dataclasses import dataclass
from scipy import stats
import matplotlib.pyplot as plt


@dataclass
class ExpectedValueResult:
    """Results from expected value calculation."""
    expected_value: float
    variance: float
    std_dev: float
    median: float
    mode: Union[float, List[float]]
    distribution_type: str
    percentiles: Dict[int, float]


class ExpectedValueCalculator:
    """
    Production-grade expected value calculator.

    Used by:
    - Google: Ad revenue optimization (expected clicks × CPC)
    - Netflix: Content value estimation (expected watch time)
    - Uber: Trip revenue prediction (expected fare × acceptance)
    - DraftKings: Player value modeling (expected points)
    """

    def __init__(self):
        pass

    def discrete_expectation(
        self,
        values: np.ndarray,
        probabilities: np.ndarray,
        validate: bool = True
    ) -> ExpectedValueResult:
        """
        Calculate E[X] for discrete random variable.

        Args:
            values: Possible outcomes
            probabilities: P(X = value)
            validate: Check probabilities sum to 1

        Returns:
            ExpectedValueResult with full statistics
        """
        if validate:
            if not np.isclose(probabilities.sum(), 1.0, atol=1e-6):
                raise ValueError(f"Probabilities sum to {probabilities.sum()}, not 1.0")

        # Expected value: E[X] = Σ x·P(X=x)
        expected_value = np.sum(values * probabilities)

        # Variance: E[X²] - (E[X])²
        expected_x_squared = np.sum(values**2 * probabilities)
        variance = expected_x_squared - expected_value**2
        std_dev = np.sqrt(variance)

        # Median (50th percentile)
        cumulative = np.cumsum(probabilities)
        median_idx = np.argmax(cumulative >= 0.5)
        median = values[median_idx]

        # Mode (most probable value)
        mode_idx = np.argmax(probabilities)
        mode = values[mode_idx]

        # Percentiles
        percentiles = {}
        for p in [25, 50, 75, 90, 95, 99]:
            idx = np.argmax(cumulative >= p/100)
            percentiles[p] = values[idx]

        return ExpectedValueResult(
            expected_value=expected_value,
            variance=variance,
            std_dev=std_dev,
            median=median,
            mode=mode,
            distribution_type="discrete",
            percentiles=percentiles
        )

    def continuous_expectation(
        self,
        pdf_func: Callable[[float], float],
        lower_bound: float = -10,
        upper_bound: float = 10
    ) -> float:
        """
        Calculate E[X] for continuous random variable using numerical integration.

        Args:
            pdf_func: Probability density function f(x)
            lower_bound: Integration lower limit
            upper_bound: Integration upper limit

        Returns:
            Expected value
        """
        from scipy.integrate import quad

        def integrand(x):
            return x * pdf_func(x)

        expected_value, _ = quad(integrand, lower_bound, upper_bound)
        return expected_value

    def linearity_demonstration(
        self,
        x_values: np.ndarray,
        x_probs: np.ndarray,
        y_values: np.ndarray,
        y_probs: np.ndarray,
        a: float,
        b: float,
        c: float
    ) -> Dict[str, float]:
        """
        Demonstrate E[aX + bY + c] = aE[X] + bE[Y] + c.

        This works EVEN IF X and Y are dependent!
        """
        e_x = np.sum(x_values * x_probs)
        e_y = np.sum(y_values * y_probs)

        # Direct calculation: E[aX + bY + c]
        expected_linear = a * e_x + b * e_y + c

        return {
            'E[X]': e_x,
            'E[Y]': e_y,
            'E[aX + bY + c]': expected_linear,
            'aE[X] + bE[Y] + c': a * e_x + b * e_y + c,
            'match': np.isclose(expected_linear, a * e_x + b * e_y + c)
        }


# ============================================================================
# EXAMPLE 1: UBER - EXPECTED TRIP REVENUE
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: UBER - Expected Trip Revenue Calculation")
print("=" * 70)

# Trip fare distribution based on distance
# Short (0-5 mi): $8-15, Medium (5-15 mi): $15-35, Long (15+ mi): $35-80
fare_values = np.array([10, 12, 18, 25, 30, 45, 60])
fare_probs = np.array([0.20, 0.15, 0.25, 0.20, 0.10, 0.07, 0.03])

calc = ExpectedValueCalculator()
result = calc.discrete_expectation(fare_values, fare_probs)

print(f"\nExpected fare per trip: ${result.expected_value:.2f}")
print(f"Standard deviation: ${result.std_dev:.2f}")
print(f"Median fare: ${result.median:.2f}")
print(f"Most common fare (mode): ${result.mode:.2f}")

print(f"\nPercentiles:")
for p, val in result.percentiles.items():
    print(f"  {p}th percentile: ${val:.2f}")

# Business calculation
trips_per_day = 1_000_000  # Uber processes 1M trips/day in major city
daily_revenue = result.expected_value * trips_per_day
print(f"\n💰 Expected daily revenue (1M trips): ${daily_revenue:,.0f}")

# What if we increase high-value trip probability by 5%?
fare_probs_optimized = np.array([0.15, 0.15, 0.25, 0.20, 0.10, 0.10, 0.05])
result_opt = calc.discrete_expectation(fare_values, fare_probs_optimized)
revenue_lift = (result_opt.expected_value - result.expected_value) * trips_per_day
print(f"\n🚀 If we shift to longer trips: +${revenue_lift:,.0f}/day")


# ============================================================================
# EXAMPLE 2: DRAFTKINGS - PLAYER EXPECTED POINTS
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: DRAFTKINGS - NBA Player Expected Fantasy Points")
print("=" * 70)

# Player can score 0, 10, 20, 30, 40, 50+ points
points_values = np.array([0, 10, 20, 30, 40, 50])
points_probs = np.array([0.05, 0.20, 0.35, 0.25, 0.10, 0.05])

result_player = calc.discrete_expectation(points_values, points_probs)

print(f"\nExpected points: {result_player.expected_value:.1f}")
print(f"Risk (std dev): {result_player.std_dev:.1f}")
print(f"75th percentile: {result_player.percentiles[75]:.0f} points")

# DraftKings pricing model: Cost = E[Points] × $200/point
cost_per_point = 200
player_salary = result_player.expected_value * cost_per_point
print(f"\n💵 Fair salary: ${player_salary:,.0f}")

# Value score: Expected points per $1000 of salary
value_score = result_player.expected_value / (player_salary / 1000)
print(f"📊 Value score: {value_score:.2f} pts/$1000")


# ============================================================================
# EXAMPLE 3: NETFLIX - EXPECTED CONTENT WATCH TIME
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: NETFLIX - Expected Watch Time for New Show")
print("=" * 70)

# User watch behavior: 0 episodes (bounce), 1-3 (sample), 4-8 (hooked), 9-10 (binge)
episodes_watched = np.array([0, 2, 5, 8, 10])
watch_probs = np.array([0.25, 0.30, 0.25, 0.15, 0.05])  # 25% bounce rate

result_watch = calc.discrete_expectation(episodes_watched, watch_probs)

print(f"\nExpected episodes watched: {result_watch.expected_value:.2f}")
print(f"Median: {result_watch.median:.0f} episodes")

# Business metrics
avg_episode_length_min = 45
expected_watch_time_hours = result_watch.expected_value * avg_episode_length_min / 60

print(f"Expected watch time: {expected_watch_time_hours:.1f} hours")

# Content value calculation
production_cost = 10_000_000  # $10M for 10 episodes
subscribers_viewing = 50_000_000  # 50M viewers
cost_per_viewer = production_cost / subscribers_viewing
value_per_hour = cost_per_viewer / expected_watch_time_hours

print(f"\n📺 Production cost: ${production_cost:,}")
print(f"Cost per viewer: ${cost_per_viewer:.2f}")
print(f"Cost per viewer-hour: ${value_per_hour:.2f}")

# What if we reduce bounce rate from 25% to 20%?
watch_probs_improved = np.array([0.20, 0.30, 0.25, 0.18, 0.07])
result_watch_improved = calc.discrete_expectation(episodes_watched, watch_probs_improved)
watch_time_gain = (result_watch_improved.expected_value - result_watch.expected_value) * subscribers_viewing

print(f"\n✨ Reducing bounce 25% → 20%: +{watch_time_gain / 1e6:.1f}M total episodes watched")


# ============================================================================
# EXAMPLE 4: LINEARITY OF EXPECTATION (POWERFUL PROPERTY)
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 4: Linearity of Expectation - Portfolio Returns")
print("=" * 70)

# Stock A returns
returns_a = np.array([-0.10, 0.00, 0.05, 0.15, 0.25])
probs_a = np.array([0.10, 0.20, 0.40, 0.20, 0.10])

# Stock B returns
returns_b = np.array([-0.05, 0.02, 0.08, 0.12, 0.20])
probs_b = np.array([0.15, 0.25, 0.30, 0.20, 0.10])

# Portfolio: 60% stock A, 40% stock B, with $10k initial investment
weight_a, weight_b = 0.6, 0.4
initial_investment = 10000

linearity_result = calc.linearity_demonstration(
    returns_a, probs_a,
    returns_b, probs_b,
    a=weight_a, b=weight_b, c=0
)

print(f"\nE[Return_A] = {linearity_result['E[X]']:.2%}")
print(f"E[Return_B] = {linearity_result['E[Y]']:.2%}")
print(f"\nPortfolio: {weight_a:.0%} A + {weight_b:.0%} B")
print(f"E[Portfolio Return] = {linearity_result['E[aX + bY + c]']:.2%}")

expected_profit = linearity_result['E[aX + bY + c]'] * initial_investment
print(f"\nExpected profit on ${initial_investment:,}: ${expected_profit:,.2f}")

print(f"\n✅ Linearity verified: {linearity_result['match']}")
print(f"   (Works even if stock returns are correlated!)")

Comparison Tables

Expected Value vs Other Central Tendency Measures

| Measure | Formula | Interpretation | Robust to Outliers? | When to Use |
|---------|---------|----------------|---------------------|-------------|
| Expected Value | E[X] = Σ x·P(x) | Long-run average | No | Decision-making, revenue forecasting |
| Median | 50th percentile | Middle value | Yes | Skewed distributions (income, house prices) |
| Mode | Most frequent value | Typical outcome | Yes | Categorical data, most likely scenario |
| Geometric Mean | (∏ xᵢ)^(1/n) | Compound growth | Partial | Investment returns, growth rates |

Linearity vs Product Property

| Property | Formula | Independence Required? | Example | Power |
|----------|---------|------------------------|---------|-------|
| Linearity | E[aX + bY + c] = aE[X] + bE[Y] + c | NO ✅ | Portfolio expected return = weighted average | Simplifies complex calculations |
| Product | E[XY] = E[X]·E[Y] | YES ⚠️ | Expected revenue = E[customers] × E[spend per customer] | Only if independent |

Real Company Applications

| Company | Problem | Random Variable X | E[X] Used For | Business Impact |
|---------|---------|-------------------|---------------|-----------------|
| Uber | Trip revenue | Fare amount | E[Fare] = $22.50 | Revenue forecasting: $22.50 × 15M trips/day = $338M daily |
| DraftKings | Player pricing | Fantasy points | E[Points] = 28.5 → Salary $5,700 | Fair pricing prevents arbitrage |
| Netflix | Content value | Episodes watched | E[Episodes] = 4.2 → 3.15 hrs watch time | $10M show ÷ 50M viewers = $0.20/viewer |
| Google Ads | Campaign ROI | Click-through | E[Clicks] = 0.05 × 1M impressions = 50k clicks | Bid optimization: max bid = E[conversion value] |
| Amazon | Inventory planning | Daily demand | E[Units sold] = 1,250 ± 200 | Stock 1,450 units (E[X] + 1σ buffer) |

Common Misconceptions

| Misconception | Truth | Example |
|---------------|-------|---------|
| E[X] is the "most likely" value | FALSE: E[X] can be an impossible outcome | E[Die roll] = 3.5, but a die never shows 3.5 |
| E[1/X] = 1/E[X] | FALSE: Jensen's inequality | E[1/X] ≥ 1/E[X] for positive X |
| E[X²] = (E[X])² | FALSE: missing variance term | E[X²] = (E[X])² + Var(X) |
| Need independence for E[X+Y] = E[X]+E[Y] | FALSE: linearity always works | Even correlated variables: E[X+X] = 2E[X] |

Interviewer's Insight

What they test:

  • Fundamental understanding: Can you explain E[X] as a "probability-weighted average"?
  • Linearity property: Do you know E[X+Y] = E[X] + E[Y] works WITHOUT independence?
  • Practical application: Can you compute expected revenue, expected profit, expected return?
  • Distinction from median/mode: When is E[X] not the "typical" value?
  • Jensen's inequality: Understand E[g(X)] ≠ g(E[X]) for nonlinear g

Strong signals:

  • Formula mastery: "E[X] = Σ x·P(x) for discrete, ∫ x·f(x)dx for continuous"
  • Linearity emphasis: "Linearity of expectation is extremely powerful: it holds even when the variables are dependent. At Uber, E[Total daily revenue] = Σ E[Fare per trip] across trips, even though individual fares are correlated"
  • Real calculation: "Netflix's expected watch time: 25% bounce (0 eps) + 30% sample (2 eps) + 25% hooked (5 eps) + 15% binge (8 eps) + 5% complete (10 eps) = 0 + 0.6 + 1.25 + 1.2 + 0.5 = 3.55 episodes"
  • Business context: "Expected value drives pricing: DraftKings prices players at $200 per expected fantasy point, so a 25-point expectation = $5,000 salary"
  • Distinguishes from median: "For skewed distributions like income, the median is more representative than the mean. E[Income] is pulled up by billionaires"
  • Jensen's inequality: "For a convex function like x², E[X²] ≥ (E[X])². This is why Var(X) = E[X²] - (E[X])² ≥ 0"

Red flags:

  • Confuses E[X] with "most likely value" (that's the mode)
  • Thinks E[XY] = E[X]·E[Y] always (needs independence)
  • Can't calculate E[X] from a probability distribution by hand
  • Doesn't recognize linearity as the KEY property
  • Says "average" without clarifying arithmetic mean vs expected value

Follow-up questions:

  • "How do you calculate E[X] if you only have data, not the distribution?" → Sample mean: x̄ = Σxᵢ/n
  • "When does E[XY] = E[X]·E[Y]?" → When X ⊥ Y (independent)
  • "What's E[X|Y]?" → Conditional expectation: E[X|Y=y] = Σ x·P(X=x|Y=y)
  • "Explain Jensen's inequality" → For convex f: E[f(X)] ≥ f(E[X])
  • "Expected value vs expected utility?" → Utility captures risk aversion: E[U(X)] vs E[X]

Common Pitfalls

  1. Interpreting E[X] as attainable: E[Die] = 3.5 is never rolled
  2. Forgetting to weight by probability: E[X] ≠ the plain average of the possible values
  3. Assuming the product rule without independence: E[XY] = E[X]E[Y] only if X ⊥ Y
  4. Confusing E[X²] with (E[X])²: related by Var(X) = E[X²] - (E[X])²

What is Variance and Standard Deviation? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Variance, Standard Deviation, Spread, Risk | Asked by: Google, Meta, Amazon, Netflix, JPMorgan

View Answer

Variance and Standard Deviation quantify the spread or dispersion of a probability distribution. In business: variance = risk, and managing variance is critical for portfolio optimization, quality control, A/B test power analysis, and anomaly detection.

Core Definitions

Variance (σ² or Var(X)):

\[Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2\]

Standard Deviation (σ or SD(X)):

\[\sigma = \sqrt{Var(X)}\]

Key Insight: Standard deviation has the SAME UNITS as X, while variance has squared units. This makes σ interpretable: "typical deviation from mean."

Variance Properties Framework

┌──────────────────────────────────────────────────────────────┐
│         VARIANCE PROPERTIES (CRITICAL FOR INTERVIEWS)         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  PROPERTY 1: Variance of Constant = 0                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Var(c) = 0                                             │ │
│  │                                                        │ │
│  │ No randomness → no variance                           │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 2: Scaling (QUADRATIC!)                            │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Var(aX) = a² · Var(X)                                 │ │
│  │                                                        │ │
│  │ ⚠️  SQUARES the constant (unlike expectation)          │ │
│  │ Example: Var(2X) = 4·Var(X), not 2·Var(X)            │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 3: Translation Invariance                          │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Var(X + b) = Var(X)                                    │ │
│  │                                                        │ │
│  │ Shifting all values doesn't change spread             │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 4: Sum of Independent Variables                    │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Var(X + Y) = Var(X) + Var(Y)   IFF X ⊥ Y             │ │
│  │                                                        │ │
│  │ Variances ADD for independent variables               │ │
│  │ ⚠️  Requires independence (unlike expectation)         │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PROPERTY 5: Sum with Covariance (General Case)              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X,Y)            │ │
│  │                                                        │ │
│  │ Covariance term captures dependence                   │ │
│  │ Cov(X,Y) > 0 → more variance (positive correlation)   │ │
│  │ Cov(X,Y) < 0 → less variance (hedging effect)         │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Example - Dice Variance:

import numpy as np

# Roll of fair die
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([1/6] * 6)

# E[X] = 3.5
E_X = np.sum(outcomes * probs)

# E[X²] = 1²·(1/6) + 2²·(1/6) + ... + 6²·(1/6) = 91/6 ≈ 15.167
E_X2 = np.sum(outcomes**2 * probs)

# Var(X) = E[X²] - (E[X])²  = 91/6 - (7/2)² = 91/6 - 49/4 ≈ 2.917
variance = E_X2 - E_X**2
std_dev = np.sqrt(variance)  # σ ≈ 1.708

print(f"E[X] = {E_X:.3f}")
print(f"Var(X) = {variance:.3f}")
print(f"SD(X) = {std_dev:.3f}")

# Scaling demonstration: Var(2X) = 4·Var(X)
doubled = 2 * outcomes
variance_2x = np.sum(doubled**2 * probs) - np.sum(doubled * probs)**2
print(f"\nVar(2X) = {variance_2x:.3f}  (= 4·Var(X) = {4 * variance:.3f})")

Why Standard Deviation?

  • Same units as original data (variance has squared units like "dollars²")
  • Interpretable: For normal distributions, ~68% of data within ±1σ
  • Used in confidence intervals, z-scores, Sharpe ratios
  • Communication: Easier to explain "±$500" than "variance of 250,000 dollars²"
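
Property 5 above (the covariance term) is what makes hedging and diversification work. A minimal simulation sketch, using made-up return series, that checks Var(X+Y) = Var(X) + Var(Y) + 2·Cov(X,Y):

import numpy as np

# Illustrative synthetic "returns" (not from the examples above)
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)
noise = rng.normal(0, 1, 100_000)
y_pos = 0.8 * x + 0.6 * noise      # positively correlated with X
y_neg = -0.8 * x + 0.6 * noise     # negatively correlated with X (a hedge)

for label, y in [("positive Cov", y_pos), ("negative Cov", y_neg)]:
    lhs = np.var(x + y)
    rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
    print(f"{label}: Var(X+Y) = {lhs:.3f} vs Var(X)+Var(Y)+2Cov = {rhs:.3f}")
# Negative covariance shrinks the total variance: the hedging effect noted above.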

Interviewer's Insight

What they test:

  • Formula mastery: Can you write Var(X) = E[X²] - (E[X])² and derive it?
  • Scaling property: Do you know Var(aX) = a²·Var(X) (quadratic, not linear)?
  • Independence requirement: Var(X+Y) = Var(X)+Var(Y) only if X⊥Y
  • Real-world interpretation: Variance = risk, lower variance = more predictable
  • Coefficient of variation: σ/μ for comparing variability across different scales

Strong signals:

  • Formula with derivation: "Var(X) = E[(X-μ)²] expands to E[X²-2μX+μ²] = E[X²]-2μE[X]+μ² = E[X²]-(E[X])² by linearity"
  • Scaling intuition: "Var(2X) = 4·Var(X) because variance measures squared deviations. Doubling all values quadruples spread"
  • Real business example: "At Amazon, Prime delivery has σ=0.6 days vs Standard σ=1.8 days. That's 67% variance reduction"
  • Covariance in sums: "For dependent variables: Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). This is why diversification works in finance"

Red flags:

  • Confuses Var(aX) = a·Var(X) (wrong, should be a²)
  • Thinks Var(X+Y) = Var(X)+Var(Y) always (needs independence)
  • Can't explain why we use σ instead of σ² (units!)
  • Doesn't know E[X²] - (E[X])² formula

Explain the Central Limit Theorem - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: CLT, Normal Distribution, Sampling, Inference | Asked by: Google, Amazon, Meta, Microsoft, Netflix

View Answer

The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It states that sample means become normally distributed as sample size increases, regardless of the population's original distribution. This is why we can use normal-based inference (z-tests, t-tests, confidence intervals) even when data isn't normal!

Formal Statement

Let X₁, X₂, ..., Xₙ be i.i.d. random variables with E[Xᵢ] = μ and Var(Xᵢ) = σ² < ∞.

Then the sample mean X̄ₙ = (X₁ + ... + Xₙ)/n converges in distribution to normal:

\[\bar{X}_n \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right)\]

Equivalently, the standardized sample mean:

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)\]

CLT Workflow

┌──────────────────────────────────────────────────────────────┐
│         CENTRAL LIMIT THEOREM MAGIC EXPLAINED                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  STEP 1: Start with ANY Population Distribution              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Exponential, Uniform, Binomial, Even Bimodal!          │ │
│  │                                                        │ │
│  │ Only requirement: Finite variance σ²                  │ │
│  │ Population: μ (mean), σ² (variance)                   │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  STEP 2: Draw Samples of Size n                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Take n observations: X₁, X₂, ..., Xₙ                  │ │
│  │                                                        │ │
│  │ Calculate sample mean: X̄ = (ΣXᵢ) / n                 │ │
│  │                                                        │ │
│  │ Repeat many times → get distribution of X̄            │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  STEP 3: Observe the Miracle! 🎉                            │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Distribution of X̄ becomes NORMAL as n increases!     │ │
│  │                                                        │ │
│  │ • Mean: E[X̄] = μ (same as population)                │ │
│  │ • Variance: Var(X̄) = σ²/n (decreases with n)        │ │
│  │ • Std Dev (SE): SD(X̄) = σ/√n ("Standard Error")     │ │
│  │                                                        │ │
│  │ For n≥30: X̄ ~ N(μ, σ²/n) approximately              │ │
│  └────────────────────────────────────────────────────────┘ │
│               ↓                                              │
│  PRACTICAL CONSEQUENCE                                       │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Can use z-tests, t-tests, confidence intervals         │ │
│  │ WITHOUT assuming population is normal!                 │ │
│  │                                                        │ │
│  │ This is the foundation of A/B testing!                │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Production Python Implementation

import numpy as np
import pandas as pd
from typing import Callable, List, Optional, Tuple
from scipy import stats
import matplotlib.pyplot as plt
from dataclasses import dataclass


@dataclass
class CLTDemonstration:
    """Results from CLT simulation."""
    population_mean: float
    population_std: float
    sample_size: int
    num_samples: int
    sample_means: np.ndarray
    theoretical_mean: float
    theoretical_std_error: float
    empirical_mean: float
    empirical_std_error: float
    normality_test_pvalue: float


class CentralLimitTheoremAnalyzer:
    """
    Demonstrate and apply Central Limit Theorem.

    Used by:
    - Google: A/B test sample size calculations
    - Netflix: Confidence intervals for engagement metrics
    - Amazon: Quality control (defect rate estimation)
    - Uber: Trip duration confidence intervals
    """

    def demonstrate_clt(
        self,
        population_dist: Callable,
        sample_size: int,
        num_samples: int = 10000,
        population_mean: Optional[float] = None,
        population_std: Optional[float] = None
    ) -> CLTDemonstration:
        """
        Demonstrate CLT by simulation.

        Args:
            population_dist: Function that generates n samples from population
            sample_size: Size of each sample (n)
            num_samples: Number of sample means to generate
            population_mean: True population mean (if known)
            population_std: True population std dev (if known)

        Returns:
            CLTDemonstration with empirical and theoretical statistics
        """
        # Generate many sample means
        sample_means = np.array([
            np.mean(population_dist(sample_size))
            for _ in range(num_samples)
        ])

        # Empirical statistics from simulation
        empirical_mean = np.mean(sample_means)
        empirical_std_error = np.std(sample_means, ddof=1)

        # Theoretical statistics (if population params known)
        if population_mean is None or population_std is None:
            # Estimate from large sample
            large_sample = population_dist(100000)
            population_mean = np.mean(large_sample)
            population_std = np.std(large_sample, ddof=1)

        theoretical_std_error = population_std / np.sqrt(sample_size)

        # Test normality (Shapiro-Wilk)
        _, normality_pvalue = stats.shapiro(
            sample_means[:5000]  # Shapiro-Wilk limit
        )

        return CLTDemonstration(
            population_mean=population_mean,
            population_std=population_std,
            sample_size=sample_size,
            num_samples=num_samples,
            sample_means=sample_means,
            theoretical_mean=population_mean,
            theoretical_std_error=theoretical_std_error,
            empirical_mean=empirical_mean,
            empirical_std_error=empirical_std_error,
            normality_test_pvalue=normality_pvalue
        )

    def minimum_sample_size(
        self,
        population_std: float,
        margin_of_error: float,
        confidence_level: float = 0.95
    ) -> int:
        """
        Calculate minimum sample size for desired precision.

        Based on CLT: n = (z*σ / E)²
        where E = margin of error

        Used by data scientists for experiment design.
        """
        z_score = stats.norm.ppf((1 + confidence_level) / 2)
        n = (z_score * population_std / margin_of_error) ** 2
        return int(np.ceil(n))


# ============================================================================
# EXAMPLE 1: EXPONENTIAL DISTRIBUTION → NORMAL SAMPLE MEANS
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: CLT with Exponential Distribution (Highly Skewed)")
print("=" * 70)

analyzer = CentralLimitTheoremAnalyzer()

# Exponential(λ=1): Mean=1, Var=1, Highly right-skewed
exponential_dist = lambda n: np.random.exponential(scale=1.0, size=n)

# Small sample size (n=5) - CLT weak
result_n5 = analyzer.demonstrate_clt(
    exponential_dist,
    sample_size=5,
    population_mean=1.0,
    population_std=1.0
)

print(f"\nSample size n=5:")
print(f"  Theoretical SE: {result_n5.theoretical_std_error:.3f}")
print(f"  Empirical SE: {result_n5.empirical_std_error:.3f}")
print(f"  Normality test p-value: {result_n5.normality_test_pvalue:.4f}")
print(f"  Normal? {result_n5.normality_test_pvalue > 0.05}")

# Medium sample size (n=30) - CLT kicks in!
result_n30 = analyzer.demonstrate_clt(
    exponential_dist,
    sample_size=30,
    population_mean=1.0,
    population_std=1.0
)

print(f"\nSample size n=30:")
print(f"  Theoretical SE: {result_n30.theoretical_std_error:.3f}")
print(f"  Empirical SE: {result_n30.empirical_std_error:.3f}")
print(f"  Normality test p-value: {result_n30.normality_test_pvalue:.4f}")
print(f"  Normal? {result_n30.normality_test_pvalue > 0.05}")

# Large sample size (n=100) - Strongly normal
result_n100 = analyzer.demonstrate_clt(
    exponential_dist,
    sample_size=100,
    population_mean=1.0,
    population_std=1.0
)

print(f"\nSample size n=100:")
print(f"  Theoretical SE: {result_n100.theoretical_std_error:.3f}")
print(f"  Empirical SE: {result_n100.empirical_std_error:.3f}")
print(f"  Normality test p-value: {result_n100.normality_test_pvalue:.4f}")
print(f"  Normal? {result_n100.normality_test_pvalue > 0.05}")

print(f"\n🎯 As n increases, SE decreases (√n): {1/np.sqrt(5):.3f}{1/np.sqrt(30):.3f}{1/np.sqrt(100):.3f}")


# ============================================================================
# EXAMPLE 2: NETFLIX - CONFIDENCE INTERVAL FOR AVG WATCH TIME
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: NETFLIX - Watch Time Confidence Interval (CLT)")
print("=" * 70)

# Sample of 500 users
# Population: Unknown distribution (probably right-skewed)
# But CLT lets us use normal inference!

np.random.seed(42)
watch_times = np.random.gamma(shape=2, scale=2.5, size=500)  # Skewed data

n = len(watch_times)
sample_mean = np.mean(watch_times)
sample_std = np.std(watch_times, ddof=1)

# Standard error (by CLT)
se = sample_std / np.sqrt(n)

# 95% confidence interval
z_95 = 1.96
ci_95 = (sample_mean - z_95 * se, sample_mean + z_95 * se)

print(f"\nSample size: {n} users")
print(f"Sample mean: {sample_mean:.2f} hours")
print(f"Sample std dev: {sample_std:.2f} hours")
print(f"Standard error: {se:.3f} hours")
print(f"\n95% CI: ({ci_95[0]:.2f}, {ci_95[1]:.2f}) hours")
print(f"\n🎬 We're 95% confident true mean watch time is in [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print(f"   (Thanks to CLT, even though data is skewed!)")


# ============================================================================
# EXAMPLE 3: GOOGLE - A/B TEST SAMPLE SIZE CALCULATION
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: GOOGLE - Sample Size for A/B Test (CLT-based)")
print("=" * 70)

# Google wants to detect 2% CTR improvement
# Control CTR: 5% (σ ≈ √(0.05 × 0.95) ≈ 0.218 for binary outcome)

baseline_ctr = 0.05
population_std = np.sqrt(baseline_ctr * (1 - baseline_ctr))

# Want margin of error = 0.005 (0.5%) at 95% confidence
margin_of_error = 0.005

n_required = analyzer.minimum_sample_size(
    population_std=population_std,
    margin_of_error=margin_of_error,
    confidence_level=0.95
)

print(f"\nBaseline CTR: {baseline_ctr:.1%}")
print(f"Population std: {population_std:.3f}")
print(f"Desired margin of error: {margin_of_error:.2%}")
print(f"\nRequired sample size per variant: {n_required:,}")
print(f"Total experiment size: {2 * n_required:,}")

# At 1M daily users, how long to run?
daily_users = 1_000_000
users_per_variant = daily_users / 2
days_needed = np.ceil(n_required / users_per_variant)

print(f"\n📊 With {daily_users:,} daily users (50/50 split):")
print(f"   Need to run experiment for {int(days_needed)} days")

Comparison Tables

CLT Requirements and Edge Cases

| Condition | Requirement | What If Violated? | Example |
|-----------|-------------|-------------------|---------|
| Independence | X₁, X₂, ..., Xₙ i.i.d. | CLT may not hold | Time series with autocorrelation |
| Finite Variance | σ² < ∞ | CLT fails | Cauchy distribution (heavy tails) |
| Sample Size | n "large enough" (≥30) | CLT approximation poor | n=5 with skewed data |
| Identical Distribution | All from same population | Need more complex theory | Mixed populations |

Sample Size Guidelines by Distribution Shape

| Population Distribution | Minimum n for CLT | Rationale |
|-------------------------|-------------------|-----------|
| Normal | n ≥ 1 (already normal!) | Sample mean exactly normal |
| Symmetric (uniform, etc.) | n ≥ 5-10 | Fast convergence |
| Moderate Skew (exponential) | n ≥ 30 | Classic "rule of 30" |
| High Skew (Pareto, log-normal) | n ≥ 100+ | Slow convergence |
| Heavy Tails (t-dist) | n ≥ 50 | Depends on tail parameter |

Real Company Applications

| Company | Application | Population Distribution | Sample Size | CLT Enables |
|---------|-------------|-------------------------|-------------|-------------|
| Google | A/B test CTR | Bernoulli (binary clicks) | 10,000 per variant | 95% CI: [3.2%, 3.8%] for control |
| Netflix | Avg watch time | Right-skewed (gamma-like) | 500 users | CI without assuming normality |
| Amazon | Order value | Heavy right tail (large orders) | 1,000 customers/day | Daily revenue forecasting |
| Uber | Trip duration | Bimodal (short vs long trips) | 50 trips → normal means | Pricing optimization |
| Stripe | Transaction amounts | Highly skewed (few large) | 200 transactions | Fraud detection thresholds |

Interviewer's Insight

What they test:

  • Core understanding: Can you explain WHY sample means become normal?
  • Conditions: Independence, finite variance, large enough n
  • Standard error: SE = σ/√n (not σ/n)
  • Practical application: Confidence intervals, hypothesis testing, sample size calculations
  • Limitations: Doesn't apply to individual observations, only sample means

Strong signals:

  • Statement with precision: "CLT says the SAMPLING DISTRIBUTION of the sample mean approaches N(μ, σ²/n) as n→∞, regardless of population distribution—assuming i.i.d. and finite variance"
  • Standard error mastery: "SE = σ/√n means precision improves with √n, not n. To halve SE, need 4x sample size. This is why A/B tests at Google need 10k+ users per variant"
  • Real application: "At Netflix, even though watch times are right-skewed (many short views, few bingers), with n=500 we can use CLT to build 95% CI: [4.8, 5.4] hours. The skewness doesn't matter for the MEAN's distribution"
  • n≥30 nuance: "n≥30 is a rule of thumb. For symmetric distributions like uniform, n=10 works. For highly skewed like exponential, might need n=50+. I'd check with QQ-plot or bootstrap"
  • Individual vs mean: "CLT applies to X̄, not individual Xᵢ. Individual observations DON'T become normal. Common mistake!"

Red flags:

  • Says "data becomes normal" (wrong: sample MEANS become normal)
  • Thinks CLT requires normal population (opposite: it's powerful because it doesn't!)
  • Can't explain standard error = σ/√n
  • Doesn't know conditions (independence, finite variance)
  • Confuses n≥30 as hard rule (it's context-dependent)

Follow-up questions:

  • "What if population variance is infinite?" → CLT fails (Cauchy distribution example)
  • "Does CLT apply to medians?" → No, different limit theorem (quantile asymptotics)
  • "What if observations are dependent?" → Need time series CLT or assume weak dependence
  • "How to check if n is large enough?" → QQ-plot, normality tests, bootstrap simulation
  • "What's finite sample correction?" → t-distribution when σ unknown and n small

Common Pitfalls

  1. Individual vs mean: CLT applies to X̄, not individual Xᵢ values
  2. Magic n=30: Not always sufficient for skewed data
  3. SE formula: It's σ/√n, not σ/n
  4. Assuming normality: CLT tells us when we CAN assume normality (for means), not when data IS normal

What is the Normal Distribution? State its Properties - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Normal, Gaussian, Continuous Distribution, CLT | Asked by: Google, Amazon, Meta, Microsoft, Netflix

View Answer

The Normal (Gaussian) Distribution is the most important probability distribution in statistics. Its ubiquity comes from the Central Limit Theorem: sums and averages of many random variables converge to normal, making it the default for modeling aggregate phenomena like test scores, measurement errors, stock returns, and biological traits.

Probability Density Function

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty\]

Notation: X ~ N(μ, σ²) where:

  • μ = mean (location parameter)
  • σ² = variance (scale parameter)
  • σ = standard deviation

Key Properties

| Property | Value | Significance |
|----------|-------|--------------|
| Mean | μ | Center of distribution |
| Median | μ | Same as mean (symmetric) |
| Mode | μ | Peak at mean |
| Variance | σ² | Spread measure |
| Skewness | 0 | Perfectly symmetric |
| Kurtosis | 3 | Moderate tails (mesokurtic) |
| Support | (-∞, ∞) | All real numbers possible |
| Entropy | ½ log(2πeσ²) | Maximum among all distributions with given variance |

## Empirical Rule (68-95-99.7)

┌─────────────────────────────────────────────────────────────┐
│                NORMAL DISTRIBUTION INTERVALS                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   μ - σ   ◄─────────  68.27% of data  ─────────►   μ + σ    │
│                                                             │
│   μ - 2σ  ◄─────────  95.45% of data  ─────────►   μ + 2σ   │
│                                                             │
│   μ - 3σ  ◄─────────  99.73% of data  ─────────►   μ + 3σ   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

## Standard Normal Distribution

Z-score transformation standardizes any normal to N(0,1):

\[Z = \frac{X - \mu}{\sigma} \sim N(0, 1)\]

Properties of Z:

- Mean = 0
- Variance = 1
- Used for: probability lookups, comparing across scales

## Production Python Implementation

import numpy as np
import pandas as pd
from scipy import stats
from typing import Tuple, List
from dataclasses import dataclass


@dataclass
class NormalAnalysisResult:
    """Results from normal distribution analysis."""
    mean: float
    std_dev: float
    percentiles: dict
    probabilities: dict
    z_scores: dict


class NormalDistributionAnalyzer:
    """
    Production analyzer for normal distribution.

    Used by:
    - Google: Latency SLA monitoring (99th percentile)
    - Netflix: Video quality scores (QoE distribution)
    - SAT/ACT: Test score standardization
    - Finance: VaR (Value at Risk) calculations
    """

    def __init__(self, mu: float, sigma: float):
        """Initialize with distribution parameters."""
        self.mu = mu
        self.sigma = sigma
        self.dist = stats.norm(loc=mu, scale=sigma)

    def probability(self, lower: float = -np.inf, upper: float = np.inf) -> float:
        """Calculate P(lower < X < upper)."""
        return self.dist.cdf(upper) - self.dist.cdf(lower)

    def percentile(self, p: float) -> float:
        """Find value at percentile p (0-1)."""
        return self.dist.ppf(p)

    def z_score(self, x: float) -> float:
        """Standardize value to z-score."""
        return (x - self.mu) / self.sigma

    def empirical_rule_check(self, data: np.ndarray) -> dict:
        """Verify empirical rule on actual data."""
        within_1sigma = np.sum((data >= self.mu - self.sigma) &
                               (data <= self.mu + self.sigma)) / len(data)
        within_2sigma = np.sum((data >= self.mu - 2*self.sigma) &
                               (data <= self.mu + 2*self.sigma)) / len(data)
        within_3sigma = np.sum((data >= self.mu - 3*self.sigma) &
                               (data <= self.mu + 3*self.sigma)) / len(data)

        return {
            '1_sigma': {'empirical': within_1sigma, 'theoretical': 0.6827},
            '2_sigma': {'empirical': within_2sigma, 'theoretical': 0.9545},
            '3_sigma': {'empirical': within_3sigma, 'theoretical': 0.9973}
        }


# ============================================================================
# EXAMPLE 1: SAT SCORES - PERCENTILE ANALYSIS
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: SAT SCORES - Normal Distribution Analysis")
print("=" * 70)

# SAT Math: μ=528, σ=117 (approximate 2023 data)
sat_math = NormalDistributionAnalyzer(mu=528, sigma=117)

print(f"\nSAT Math Distribution: N({sat_math.mu}, {sat_math.sigma}²)")

# Key percentiles
percentiles = [25, 50, 75, 90, 95, 99]
print(f"\nPercentiles:")
for p in percentiles:
    score = sat_math.percentile(p/100)
    print(f"  {p}th: {score:.0f}")

# Probability calculations
print(f"\nProbabilities:")
print(f"  P(Score > 700): {sat_math.probability(700, np.inf):.2%}")
print(f"  P(Score > 750): {sat_math.probability(750, np.inf):.2%}")
print(f"  P(400 < Score < 600): {sat_math.probability(400, 600):.2%}")

# Z-scores for key values
print(f"\nZ-scores:")
for score in [400, 528, 600, 700, 800]:
    z = sat_math.z_score(score)
    print(f"  Score {score}: z = {z:.2f}")


# ============================================================================
# EXAMPLE 2: GOOGLE - API LATENCY MONITORING
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: GOOGLE - API Latency SLA Monitoring")
print("=" * 70)

# API latency: μ=45ms, σ=12ms (approximately normal by CLT)
latency = NormalDistributionAnalyzer(mu=45, sigma=12)

print(f"\nAPI Latency: N({latency.mu}ms, {latency.sigma}ms)")

# SLA: 95% of requests < 65ms
sla_threshold = 65
p_within_sla = latency.probability(-np.inf, sla_threshold)

print(f"\nSLA Analysis:")
print(f"  Threshold: {sla_threshold}ms")
print(f"  P(Latency < {sla_threshold}ms): {p_within_sla:.2%}")
print(f"  SLA Met? {p_within_sla >= 0.95}")

# What latency is 99th percentile? (for alerting)
p99 = latency.percentile(0.99)
print(f"\n  P99 latency: {p99:.1f}ms")
print(f"  🚨 Alert if latency > {p99:.0f}ms (top 1%)")

# Expected violations per 1M requests
total_requests = 1_000_000
violations = total_requests * (1 - p_within_sla)
print(f"\n  Expected SLA violations per 1M requests: {violations:,.0f}")


# ============================================================================
# EXAMPLE 3: FINANCE - VALUE AT RISK (VaR)
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: FINANCE - Portfolio Value at Risk (VaR)")
print("=" * 70)

# Daily portfolio returns: μ=0.05%, σ=1.2%
returns = NormalDistributionAnalyzer(mu=0.0005, sigma=0.012)

portfolio_value = 10_000_000  # $10M

print(f"\nPortfolio: ${portfolio_value:,}")
print(f"Daily returns: N({returns.mu:.2%}, {returns.sigma:.2%})")

# VaR at 95% confidence: "Maximum loss with 95% probability"
var_95 = returns.percentile(0.05)  # 5th percentile (left tail)
dollar_var_95 = portfolio_value * var_95

print(f"\nValue at Risk (VaR):")
print(f"  95% VaR (returns): {var_95:.2%}")
print(f"  95% VaR (dollars): ${abs(dollar_var_95):,.0f}")
print(f"  Interpretation: 95% confident we won't lose more than ${abs(dollar_var_95):,.0f} tomorrow")

# 99% VaR (more conservative)
var_99 = returns.percentile(0.01)
dollar_var_99 = portfolio_value * var_99
print(f"\n  99% VaR (returns): {var_99:.2%}")
print(f"  99% VaR (dollars): ${abs(dollar_var_99):,.0f}")


# ============================================================================
# EXAMPLE 4: EMPIRICAL RULE VERIFICATION
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 4: Empirical Rule (68-95-99.7) Verification")
print("=" * 70)

# Generate sample data
np.random.seed(42)
sample_data = np.random.normal(loc=100, scale=15, size=100000)

analyzer = NormalDistributionAnalyzer(mu=100, sigma=15)
rule_check = analyzer.empirical_rule_check(sample_data)

print(f"\nSample: N(100, 15²), n=100,000")
print(f"\nEmpirical Rule Verification:")
for interval, values in rule_check.items():
    emp = values['empirical']
    theo = values['theoretical']
    diff = abs(emp - theo)
    print(f"  ±{interval.replace('_', ' ')}: {emp:.2%} (theoretical: {theo:.2%}, diff: {diff:.2%})")

## Comparison Tables

### Normal vs Other Distributions

| Property | Normal | Uniform | Exponential | t-Distribution |
|----------|--------|---------|-------------|----------------|
| Symmetry | Symmetric | Symmetric | Right-skewed | Symmetric |
| Tails | Moderate | None (bounded) | Heavy right tail | Heavy both tails |
| Parameters | μ, σ | a, b (bounds) | λ (rate) | ν (df) |
| Support | (-∞, ∞) | [a, b] | [0, ∞) | (-∞, ∞) |
| Sum Property | Sum is normal | Sum not uniform | Sum is gamma | Sum not t |
| CLT Result | Appears naturally | From uniform samples | From exponential samples | Approaches normal as df↑ |

### Real Company Applications

| Company | Application | Mean (μ) | Std Dev (σ) | Business Decision |
|---------|-------------|----------|-------------|-------------------|
| Google | API latency | 45ms | 12ms | P99 SLA = μ + 2.33σ = 73ms |
| Netflix | Video quality score | 4.2/5 | 0.6 | P(QoE > 4.5) = 31% → improve encoding |
| SAT/ACT | Test scores | 528 | 117 | 700+ score = top 7% (z=1.47) |
| JPMorgan | Daily returns | 0.05% | 1.2% | 99% VaR = -2.75% → risk limit |
| Amazon | Delivery time (Prime) | 2 days | 0.5 days | 99% within 3.2 days |

### Z-Score Interpretation

| Z-Score | Percentile | Interpretation | Example (IQ: μ=100, σ=15) |
|---------|------------|----------------|---------------------------|
| -3.0 | 0.13% | Extremely low | IQ = 55 |
| -2.0 | 2.28% | Very low | IQ = 70 |
| -1.0 | 15.87% | Below average | IQ = 85 |
| 0.0 | 50% | Average | IQ = 100 |
| +1.0 | 84.13% | Above average | IQ = 115 |
| +2.0 | 97.72% | Very high | IQ = 130 |
| +3.0 | 99.87% | Extremely high | IQ = 145 |

!!! tip "Interviewer's Insight"
    **What they test:**

    - Empirical rule: Can you state 68-95-99.7 rule and use it?
    - Z-score: Can you standardize values and interpret z-scores?
    - Sum property: Do you know sum of independent normals is normal?
    - CLT connection: Why is normal distribution so ubiquitous?
    - Practical applications: Percentiles, SLAs, confidence intervals

    **Strong signals:**

    - PDF formula: "f(x) = (1/σ√(2π)) exp(-(x-μ)²/(2σ²)) — I recognize the exponential with squared term in numerator"
    - Empirical rule precision: "68.27% within ±1σ, 95.45% within ±2σ, 99.73% within ±3σ. At Google, we use P99 = μ + 2.33σ for SLAs"
    - Z-score transformation: "z = (x-μ)/σ standardizes to N(0,1). This lets us compare across different scales—SAT score 700 (z=1.47) equals IQ 122 in terms of percentile"
    - Sum property: "If X~N(μ₁,σ₁²) and Y~N(μ₂,σ₂²) independent, then X+Y~N(μ₁+μ₂, σ₁²+σ₂²). Means add, variances add"
    - Real calculation: "For API latency N(45ms, 12ms), P99 = 45 + 2.33×12 = 73ms. We alert if sustained latency exceeds this"

    **Red flags:**

    - Confuses σ with σ² in notation N(μ, σ²)
    - Can't calculate z-score or interpret it
    - Doesn't know empirical rule percentages
    - Thinks normal is only distribution (ignores heavy tails, skewness in real data)
    - Says "average" without specifying mean (could be median, mode)

    **Follow-up questions:**

    - "How do you test if data is normal?" → QQ-plot, Shapiro-Wilk test, check skewness/kurtosis
    - "What if data has heavy tails?" → Use t-distribution or robust methods
    - "Why (2π)^(-½) in PDF?" → Normalization constant so ∫f(x)dx = 1
    - "What's the relationship to CLT?" → CLT explains why normal appears so often (sums converge to normal)
    - "How to generate normal random variables?" → Box-Muller transform, inverse CDF method

!!! warning "Common Pitfalls"
    1. Notation confusion: N(μ, σ²) uses variance, not std dev (some texts use σ)
    2. Empirical rule misapplication: Only applies to normal, not skewed data
    3. Z-score direction: z=2 means 2σ ABOVE mean (positive), not below
    4. Assuming normality: Real data often has outliers/skew—always check!


Explain the Binomial Distribution - Amazon, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Binomial, Discrete, Bernoulli Trials, PMF | Asked by: Amazon, Meta, Google, Microsoft

View Answer

The Binomial Distribution models the number of successes in n fixed, independent trials, each with the same success probability p. It's the fundamental discrete distribution for binary outcome experiments: coin flips, A/B tests, quality control, medical trials.

Probability Mass Function (PMF)

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, ..., n\]

Where:

  • \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) = number of ways to choose k successes from n trials
  • \(p^k\) = probability of k successes
  • \((1-p)^{n-k}\) = probability of (n-k) failures

Conditions (BINS)

┌────────────────────────────────────────────────────────────┐
│         BINOMIAL DISTRIBUTION CONDITIONS (BINS)            │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  B - Binary outcomes:    Each trial has 2 outcomes        │
│                           (Success/Failure, Yes/No)        │
│                                                            │
│  I - Independent trials: Outcome of one trial doesn't     │
│                           affect others                    │
│                                                            │
│  N - Number fixed:       n trials determined in advance   │
│                           (not stopping after X successes) │
│                                                            │
│  S - Same probability:   p constant across all trials     │
│                           (homogeneous)                    │
│                                                            │
│  \ud83d\udcca Distribution Shape:                                    │
│                                                            │
│   p=0.1 (skewed right)     p=0.5 (symmetric)              │
│   █                        ████                            │
│   ██                      ██████                           │
│   ███                    ████████                          │
│   ████                  ██████████                         │
│   ─────────────────     ────────────────                  │
│   0 1 2 3 4 5...        0  1  2  3  4  5                  │
│                                                            │
└────────────────────────────────────────────────────────────┘

Key Formulas

| Statistic | Formula | Intuition |
|-----------|---------|-----------|
| Mean | E[X] = np | Average successes = trials × success rate |
| Variance | Var(X) = np(1-p) | Maximum when p=0.5 (most uncertain) |
| Std Dev | σ = √[np(1-p)] | Spread of successes |
| Mode | ⌊(n+1)p⌋ | Most likely number of successes |
| P(X=0) | (1-p)ⁿ | All failures |
| P(X=n) | pⁿ | All successes |

Variance interpretation: Maximum at p=0.5 (coin flip), minimum at p→0 or p→1 (deterministic)

## Production Python Implementation

import numpy as np
import pandas as pd
from scipy.stats import binom
from scipy.special import comb
from typing import List, Tuple
from dataclasses import dataclass


@dataclass
class BinomialAnalysisResult:
    """Results from binomial distribution analysis."""
    n: int
    p: float
    mean: float
    variance: float
    std_dev: float
    probabilities: dict


class BinomialAnalyzer:
    """
    Production analyzer for binomial distribution.

    Used by:
    - Amazon: Defect rate analysis in warehouse operations
    - Meta: A/B test significance (conversions out of n visitors)
    - Google: Click-through rate experiments
    - Pharmaceutical: Clinical trial success modeling
    """

    def __init__(self, n: int, p: float):
        """Initialize with n trials and success probability p."""
        self.n = n
        self.p = p
        self.dist = binom(n=n, p=p)
        self.mean = n * p
        self.variance = n * p * (1 - p)
        self.std_dev = np.sqrt(self.variance)

    def pmf(self, k: int) -> float:
        """P(X = k): Probability of exactly k successes."""
        return self.dist.pmf(k)

    def cdf(self, k: int) -> float:
        """P(X ≤ k): Probability of at most k successes."""
        return self.dist.cdf(k)

    def survival(self, k: int) -> float:
        """P(X > k): Probability of more than k successes."""
        return 1 - self.dist.cdf(k)

    def probability_range(self, k_min: int, k_max: int) -> float:
        """P(k_min ≤ X ≤ k_max)."""
        return self.dist.cdf(k_max) - self.dist.cdf(k_min - 1)

    def confidence_interval(self, confidence: float = 0.95) -> Tuple[int, int]:
        """Find (lower, upper) bounds containing confidence% of probability."""
        alpha = (1 - confidence) / 2
        lower = self.dist.ppf(alpha)
        upper = self.dist.ppf(1 - alpha)
        return (int(np.floor(lower)), int(np.ceil(upper)))

    def normal_approximation_valid(self) -> bool:
        """Check if normal approximation conditions met."""
        return (self.n * self.p >= 5) and (self.n * (1 - self.p) >= 5)


# ============================================================================
# EXAMPLE 1: AMAZON WAREHOUSE - QUALITY CONTROL
# ============================================================================

print("=" * 70)
print("EXAMPLE 1: AMAZON WAREHOUSE - Defect Rate Quality Control")
print("=" * 70)

# Amazon inspects batch of 100 items, historical defect rate = 3%
amazon_qc = BinomialAnalyzer(n=100, p=0.03)

print(f"\nBatch Size: {amazon_qc.n}")
print(f"Defect Rate: {amazon_qc.p:.1%}")
print(f"Expected defects: {amazon_qc.mean:.2f} ± {amazon_qc.std_dev:.2f}")

# Key questions
print(f"\nProbability Analysis:")
print(f"  P(X = 0) [no defects]: {amazon_qc.pmf(0):.2%}")
print(f"  P(X = 3) [exactly 3]: {amazon_qc.pmf(3):.2%}")
print(f"  P(X ≤ 2) [acceptable]: {amazon_qc.cdf(2):.2%}")
print(f"  P(X > 5) [reject batch]: {amazon_qc.survival(5):.2%}")

# Decision rule: Reject if > 5 defects
reject_prob = amazon_qc.survival(5)
print(f"\nBatch Rejection:")
print(f"  Rule: Reject if more than 5 defects")
print(f"  P(Reject | true p=0.03): {reject_prob:.2%}")
print(f"  🚨 False positive rate: {reject_prob:.2%}")

# 95% confidence interval
lower, upper = amazon_qc.confidence_interval(0.95)
print(f"\n  95% CI for defects: [{lower}, {upper}]")
print(f"  Interpretation: 95% of batches will have {lower}-{upper} defects")


# ============================================================================
# EXAMPLE 2: META A/B TEST - CONVERSION RATE
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 2: META A/B TEST - Button Conversion Rate")
print("=" * 70)

# Control group: n=1000 visitors, p=0.12 (12% baseline conversion)
control = BinomialAnalyzer(n=1000, p=0.12)

# Treatment group: hypothesized p=0.15 (15% after button change)
treatment = BinomialAnalyzer(n=1000, p=0.15)

print(f"\nControl Group: n={control.n}, p={control.p:.1%}")
print(f"  Expected conversions: {control.mean:.1f} ± {control.std_dev:.2f}")
print(f"  95% CI: {control.confidence_interval(0.95)}")

print(f"\nTreatment Group: n={treatment.n}, p={treatment.p:.1%}")
print(f"  Expected conversions: {treatment.mean:.1f} ± {treatment.std_dev:.2f}")
print(f"  95% CI: {treatment.confidence_interval(0.95)}")

# Power analysis: Can we detect difference?
# P(Treatment shows > 140 conversions | p=0.15)
threshold = 140
power = treatment.survival(threshold - 1)
type_2_error = 1 - power

print(f"\nPower Analysis (threshold = {threshold} conversions):")
print(f"  P(Detect improvement | true p=0.15): {power:.2%}")
print(f"  Type II error (β): {type_2_error:.2%}")
print(f"  Statistical power: {power:.2%}")

# Interpretation
if power >= 0.80:
    print(f"  ✅ Sufficient power (≥80%) to detect 3% lift")
else:
    print(f"  ⚠️ Insufficient power (<80%), need larger sample")


# ============================================================================
# EXAMPLE 3: GOOGLE ADS - CLICK-THROUGH RATE
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 3: GOOGLE ADS - Click-Through Rate (CTR) Analysis")
print("=" * 70)

# Ad shown to 500 users, expected CTR = 8%
google_ads = BinomialAnalyzer(n=500, p=0.08)

print(f"\nAd Campaign:")
print(f"  Impressions: {google_ads.n:,}")
print(f"  Expected CTR: {google_ads.p:.1%}")
print(f"  Expected clicks: {google_ads.mean:.1f} ± {google_ads.std_dev:.2f}")

# Revenue: $2 per click
revenue_per_click = 2.0
expected_revenue = google_ads.mean * revenue_per_click
revenue_std = google_ads.std_dev * revenue_per_click

print(f"\nRevenue Analysis ($2/click):")
print(f"  Expected revenue: ${expected_revenue:.2f} ± ${revenue_std:.2f}")

# Percentiles for budgeting
clicks_p10 = google_ads.dist.ppf(0.10)
clicks_p50 = google_ads.dist.ppf(0.50)
clicks_p90 = google_ads.dist.ppf(0.90)

print(f"\n  Revenue Percentiles:")
print(f"  P10: {clicks_p10:.0f} clicks → ${clicks_p10 * revenue_per_click:.2f}")
print(f"  P50: {clicks_p50:.0f} clicks → ${clicks_p50 * revenue_per_click:.2f}")
print(f"  P90: {clicks_p90:.0f} clicks → ${clicks_p90 * revenue_per_click:.2f}")

# Normal approximation check
print(f"\nNormal Approximation:")
if google_ads.normal_approximation_valid():
    print(f"  ✅ Valid (np={google_ads.n*google_ads.p:.1f} ≥ 5, n(1-p)={google_ads.n*(1-google_ads.p):.1f} ≥ 5)")
    print(f"  Can use X ~ N({google_ads.mean:.1f}, {google_ads.variance:.2f})")
else:
    print(f"  ❌ Invalid, use exact binomial")


# ============================================================================
# EXAMPLE 4: PHARMACEUTICAL TRIAL - FDA APPROVAL
# ============================================================================

print("\n" + "=" * 70)
print("EXAMPLE 4: PHARMACEUTICAL TRIAL - Drug Efficacy")
print("=" * 70)

# Trial: 50 patients, drug success rate = 70%
clinical = BinomialAnalyzer(n=50, p=0.70)

print(f"\nClinical Trial:")
print(f"  Patients: {clinical.n}")
print(f"  Success rate: {clinical.p:.0%}")
print(f"  Expected successes: {clinical.mean:.1f} ± {clinical.std_dev:.2f}")

# FDA approval: Need at least 32 successes (64%)
approval_threshold = 32
p_approval = clinical.survival(approval_threshold - 1)

print(f"\nFDA Approval Analysis:")
print(f"  Threshold: ≥{approval_threshold} successes ({approval_threshold/clinical.n:.0%})")
print(f"  P(Approval | true p=0.70): {p_approval:.2%}")
print(f"  P(Rejection | true p=0.70): {1 - p_approval:.2%} (Type II error)")

# Distribution of outcomes
print(f"\nOutcome Distribution:")
for k in [30, 32, 35, 38, 40]:
    prob = clinical.pmf(k)
    print(f"  P(X = {k}): {prob:.3%}")

## Comparison Tables

### Binomial vs Related Distributions

| Distribution | Formula | Use Case | Relationship to Binomial |
|--------------|---------|----------|--------------------------|
| Bernoulli | p^k (1-p)^(1-k), k∈{0,1} | Single trial | Binomial(1, p) |
| Binomial | C(n,k) p^k (1-p)^(n-k) | n fixed trials | Sum of n Bernoullis |
| Geometric | (1-p)^(k-1) p | Trials until first success | Unbounded trials |
| Negative Binomial | C(k-1,r-1) p^r (1-p)^(k-r) | Trials until r successes | Generalized geometric |
| Poisson | (λ^k/k!) e^(-λ) | Rare events (n→∞, p→0, np=λ) | Limit of binomial |

### Real Company Applications

| Company | Application | n | p | Business Decision |
|---------|-------------|---|---|-------------------|
| Amazon | Warehouse defect rate | 100 items | 0.03 | Reject batch if >5 defects (0.6% FP rate) |
| Meta | A/B test conversions | 1000 visitors | 0.12 | Need 80% power to detect 3% lift |
| Google | Ad click-through rate | 500 impressions | 0.08 | Expected 40 clicks → $80 revenue |
| Pfizer | Drug efficacy trial | 50 patients | 0.70 | P(FDA approval ≥32 successes) = 93% |
| Netflix | Thumbnail A/B test | 10000 views | 0.45 | 95% CI: [4430, 4570] clicks |

### Normal Approximation Guidelines

| Condition | Rule | Example | Valid? |
|-----------|------|---------|--------|
| np ≥ 5 | Enough expected successes | n=100, p=0.03 → np=3 | ❌ No |
| n(1-p) ≥ 5 | Enough expected failures | n=100, p=0.98 → n(1-p)=2 | ❌ No |
| Both | Safe to use N(np, np(1-p)) | n=500, p=0.08 → np=40, n(1-p)=460 | ✅ Yes |
| Continuity correction | Add ±0.5 for discrete→continuous | P(X≤10) ≈ P(Z≤10.5) | Improves accuracy |

!!! tip "Interviewer's Insight"
    **What they test:**

    - BINS conditions: Can you verify binomial applies?
    - PMF formula: Understand C(n,k) × p^k × (1-p)^(n-k)?
    - Mean/variance: Derive or know E[X]=np, Var(X)=np(1-p)?
    - Complement rule: P(X≥k) = 1 - P(X≤k-1) for efficiency?
    - Normal approximation: When np and n(1-p) both ≥ 5?

    **Strong signals:**

    - Conditions check: "Binomial requires BINS: Binary outcomes, Independent trials, Number fixed, Same probability. Here n=100, p=0.03, all met"
    - Intuitive mean: "E[X] = np makes sense: 100 trials × 3% rate = 3 expected defects"
    - Variance interpretation: "Var(X) = np(1-p) = 2.91. Maximum variance is at p=0.5, so low p=0.03 gives low variance"
    - Complement efficiency: "P(X>5) = 1 - P(X≤5) avoids summing pmf(6)+pmf(7)+...+pmf(100)"
    - Normal approx: "np=3 < 5, so can't use normal. Must use exact binomial or Poisson approximation"
    - Business context: "At Amazon, we reject batches with >5 defects. With p=0.03, false positive rate is only 0.6%—low Type I error"

    **Red flags:**

    - Forgets independence assumption (e.g., sampling without replacement from small population)
    - Uses normal approximation when np<5 or n(1-p)<5
    - Confuses P(X=k) with P(X≥k) or P(X>k)
    - Doesn't use complement for "at least" questions
    - Can't explain why variance is np(1-p), not np

    **Follow-up questions:**

    - "What if we sample without replacement?" → Use hypergeometric, not binomial (trials no longer independent)
    - "Derive E[X] = np" → X = X₁+...+Xₙ where Xᵢ~Bernoulli(p), so E[X] = Σ E[Xᵢ] = Σ p = np
    - "Why is variance maximized at p=0.5?" → Var(X)=np(1-p) is quadratic in p, max at p=0.5
    - "When does binomial → Poisson?" → n→∞, p→0, np=λ constant. Useful for rare events
    - "What's the mode of Binomial(n,p)?" → ⌊(n+1)p⌋ or sometimes two modes

!!! warning "Common Pitfalls"
    1. Sampling without replacement: Binomial requires independence. For small populations, use hypergeometric
    2. Forgetting (1-p)^(n-k) term: PMF needs probability of failures too!
    3. P(X≥k) vs P(X>k): Off-by-one error—P(X>k) = 1 - P(X≤k), not 1 - P(X≤k-1)
    4. Normal approximation abuse: Don't use when np<5 or n(1-p)<5—substantial error


What is the Poisson Distribution? When to Use It? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Poisson, Discrete, Rare Events | Asked by: Google, Amazon, Meta

View Answer

Poisson Distribution:

Models count of events in fixed interval (time, space):

\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

Parameters:

  • λ = rate (expected count per interval)
  • k = actual count (0, 1, 2, ...)

Special Property:

E[X] = Var(X) = λ

When to Use:

  1. Events occur independently
  2. Rate is constant
  3. Events are "rare" (compared to opportunities)

Examples:

  • Website visits per minute
  • Typos per page
  • Goals in a soccer game
  • Radioactive decays per second

Python:

from scipy.stats import poisson

# 4 customers per hour on average
lambda_rate = 4

# P(exactly 6 customers)?
p_6 = poisson.pmf(k=6, mu=4)  # ≈ 0.104

# P(at most 2 customers)?
p_le_2 = poisson.cdf(k=2, mu=4)  # ≈ 0.238

# P(more than 5)?
p_gt_5 = 1 - poisson.cdf(k=5, mu=4)  # ≈ 0.215

Poisson as Binomial Limit:

When n → ∞, p → 0, np = λ: Binomial(n, p) → Poisson(λ)
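
A quick numerical check of this limit (illustrative parameters: n = 10,000 and p = 0.0004, so np = 4), comparing the two PMFs with scipy:

from scipy.stats import binom, poisson

n, p = 10_000, 0.0004      # large n, small p, np = 4
lam = n * p

for k in [0, 2, 4, 6, 8]:
    b = binom.pmf(k, n, p)
    po = poisson.pmf(k, lam)
    print(f"k={k}: Binomial={b:.5f}  Poisson={po:.5f}  diff={abs(b - po):.1e}")
# The two columns agree to ~4 decimal places, illustrating Binomial(n, p) → Poisson(np).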

Interviewer's Insight

What they're testing: Count data modeling.

Strong answer signals:

  • States E[X] = Var(X) = λ
  • Gives real-world examples
  • Knows Poisson-Binomial relationship
  • Uses for rate-based problems

Explain the Exponential Distribution - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Exponential, Continuous, Waiting Time | Asked by: Google, Amazon, Microsoft

View Answer

Exponential Distribution:

Models time between Poisson events (waiting time):

\[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

Parameters:

| Statistic | Formula |
|-----------|---------|
| Mean | 1/λ |
| Variance | 1/λ² |
| Median | ln(2)/λ |

Memoryless Property:

\[P(X > s + t | X > s) = P(X > t)\]

"Past doesn't affect future" - unique to exponential!

Example:

Bus arrives every 10 minutes on average (λ = 0.1/min):

from scipy.stats import expon

# λ = 0.1, scale = 1/λ = 10
wait_time = expon(scale=10)

# P(wait < 5 minutes)?
p_lt_5 = wait_time.cdf(5)  # ≈ 0.393

# P(wait > 15 minutes)?
p_gt_15 = 1 - wait_time.cdf(15)  # ≈ 0.223

# Mean wait time
mean_wait = wait_time.mean()  # 10 minutes

Relationship with Poisson:

  • If counts per time ~ Poisson(λ)
  • Then time between events ~ Exponential(λ)
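
A small simulation sketch of this connection (assuming a Poisson process at the bus-example rate λ = 0.1 events/min): generate exponential inter-arrival times and check that counts per 10-minute window behave like Poisson(1):

import numpy as np

rng = np.random.default_rng(42)
lam = 0.1
gaps = rng.exponential(scale=1 / lam, size=200_000)          # inter-arrival times
arrivals = np.cumsum(gaps)                                   # event times on a timeline

window = 10
counts = np.bincount((arrivals // window).astype(int))[:-1]  # events per full window

print(f"Mean inter-arrival time: {gaps.mean():.2f} min (theory: {1/lam:.0f})")
print(f"Mean count per window:   {counts.mean():.2f} (theory: {lam * window:.0f})")
print(f"Variance of counts:      {counts.var():.2f} (Poisson ⇒ equals the mean)")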

Interviewer's Insight

What they're testing: Continuous distribution for waiting.

Strong answer signals:

  • Knows memoryless property and its implications
  • Connects to Poisson process
  • Can calculate probabilities
  • Gives practical examples

What is the Geometric Distribution? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Geometric, Discrete, First Success | Asked by: Amazon, Microsoft, Google

View Answer

Geometric Distribution:

Number of trials until first success:

\[P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, ...\]

Formulas:

| Statistic | Formula |
|-----------|---------|
| Mean | 1/p |
| Variance | (1-p)/p² |
| Mode | 1 |

Memoryless (like Exponential):

P(X > m + n | X > m) = P(X > n)

Example - Interview Success:

30% chance of passing each interview:

from scipy.stats import geom

p = 0.3

# P(pass on exactly 3rd interview)?
p_3rd = geom.pmf(k=3, p=0.3)
# = (0.7)^2 * 0.3 = 0.147

# Expected interviews until first pass?
expected = 1 / 0.3  # ≈ 3.33 interviews

# P(need more than 5 interviews)?
p_gt_5 = 1 - geom.cdf(k=5, p=0.3)
# = (0.7)^5 ≈ 0.168

Alternative Definition:

Some texts define as failures before first success (k = 0, 1, 2, ...)
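
scipy.stats.geom uses the trials convention above (support k = 1, 2, ...); the "failures before first success" variant can be recovered by shifting the support with loc=-1. A short sketch:

from scipy.stats import geom

p = 0.3

# Trials-until-first-success convention (k = 1, 2, ...): scipy's default
print(f"P(first success on trial 3): {geom.pmf(3, p):.3f}")          # (0.7)^2 · 0.3 = 0.147
print(f"E[trials until success]:     {geom.mean(p):.2f}")            # 1/p ≈ 3.33

# Failures-before-first-success convention (k = 0, 1, 2, ...)
print(f"P(2 failures then success):  {geom.pmf(2, p, loc=-1):.3f}")  # same event, 0.147
print(f"E[failures before success]:  {geom.mean(p, loc=-1):.2f}")    # (1-p)/p ≈ 2.33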

Interviewer's Insight

What they're testing: First success modeling.

Strong answer signals:

  • Knows two common definitions
  • Calculates E[X] = 1/p intuitively
  • Connects to negative binomial
  • Uses memoryless property

What is the Birthday Problem? Calculate the Probability - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Birthday Problem, Combinatorics, Probability Puzzle | Asked by: Google, Amazon, Meta

View Answer

Birthday Problem:

What's the probability that in a group of n people, at least 2 share a birthday?

Approach - Complement:

P(at least 2 share) = 1 - P(all different birthdays)

\[P(\text{all different}) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdot ... \cdot \frac{365-n+1}{365}\]

Calculation:

def birthday_probability(n):
    """P(at least 2 share birthday in group of n)"""
    p_all_different = 1.0
    for i in range(n):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

# Results:
# n=23: 50.7% (famous result!)
# n=50: 97.0%
# n=70: 99.9%

for n in [10, 23, 30, 50, 70]:
    print(f"n={n}: {birthday_probability(n):.1%}")

Why So Counter-Intuitive?

  • We think: 23 people, 365 days → small chance
  • Reality: C(23,2) = 253 pairs to compare!

Generalized Version:

P(collision in hash table) follows same logic - birthday attack in cryptography.
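
The same complement argument works for any number of equally likely "days" (hash buckets). A short sketch for a 32-bit hash, alongside the standard approximation P ≈ 1 - exp(-n(n-1)/(2N)):

import math

def collision_probability(n: int, num_buckets: int) -> float:
    """P(at least one collision) when n items hash uniformly into num_buckets."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (num_buckets - i) / num_buckets
    return 1 - p_all_distinct

N = 2**32  # 32-bit hash space
for n in [1_000, 10_000, 77_163, 200_000]:
    exact = collision_probability(n, N)
    approx = 1 - math.exp(-n * (n - 1) / (2 * N))
    print(f"n={n:>7}: exact={exact:.3f}  approx={approx:.3f}")
# Roughly 50% collision probability already at ~77k items, mirroring n=23 for 365 days.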

Interviewer's Insight

What they're testing: Complement probability, combinatorics.

Strong answer signals:

  • Uses complement approach
  • Knows n=23 gives ~50%
  • Can generalize to other collision problems
  • Explains why intuition fails

Explain the Monty Hall Problem - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Monty Hall, Conditional Probability, Puzzle | Asked by: Google, Meta, Amazon

View Answer

The Setup:

  • 3 doors: 1 car, 2 goats
  • You pick a door (say Door 1)
  • Host (who knows what's behind each) opens another door showing a goat
  • Should you switch?

Answer: YES - Switch gives ⅔ chance!

Intuition:

Initial pick: P(Car) = 1/3
Other doors:  P(Car) = 2/3

After host reveals goat:
- Your door still has P = 1/3
- Remaining door gets all 2/3

Simulation Proof:

import random

def monty_hall(switch, n_simulations=100000):
    wins = 0
    for _ in range(n_simulations):
        car = random.randint(0, 2)
        choice = random.randint(0, 2)

        # Host opens a goat door (not your choice, not car)
        goat_doors = [i for i in range(3) if i != choice and i != car]
        host_opens = random.choice(goat_doors)

        if switch:
            # Switch to remaining door
            choice = [i for i in range(3) if i != choice and i != host_opens][0]

        if choice == car:
            wins += 1

    return wins / n_simulations

print(f"Stay:   {monty_hall(switch=False):.1%}")   # ~33.3%
print(f"Switch: {monty_hall(switch=True):.1%}")    # ~66.7%

Key Insight:

Host's action is not random - he MUST reveal a goat. This transfers information.
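
The same ⅔ answer also falls out of a direct Bayes computation, conditioning on the host opening Door 3 after you pick Door 1 (standard assumption: the host chooses at random when both unpicked doors hide goats):

# P(host opens Door 3 | car location), given you picked Door 1
p_car = {1: 1/3, 2: 1/3, 3: 1/3}
p_open3_given_car = {1: 1/2,   # host may open 2 or 3, picks at random
                     2: 1.0,   # must open 3 (cannot reveal the car behind 2)
                     3: 0.0}   # never opens the car door

p_open3 = sum(p_car[d] * p_open3_given_car[d] for d in p_car)   # = 1/2
p_stay = p_car[1] * p_open3_given_car[1] / p_open3              # 1/3
p_switch = p_car[2] * p_open3_given_car[2] / p_open3            # 2/3

print(f"P(win by staying)   = {p_stay:.3f}")
print(f"P(win by switching) = {p_switch:.3f}")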

Interviewer's Insight

What they're testing: Conditional probability reasoning.

Strong answer signals:

  • Gives correct answer (switch = ⅔)
  • Explains WHY (host's constraint)
  • Can simulate or prove mathematically
  • Addresses common misconceptions

What is Covariance and Correlation? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Covariance, Correlation, Dependency | Asked by: Google, Meta, Amazon

View Answer

Covariance:

Measures joint variability of two variables:

\[Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]\]

Correlation (Pearson):

Standardized covariance, range [-1, 1]:

\[\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}\]

Interpretation:

| Value | Meaning |
|-------|---------|
| ρ = 1 | Perfect positive linear |
| ρ = 0 | No linear relationship |
| ρ = -1 | Perfect negative linear |

Important Properties:

# Covariance
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)  # Symmetric
Cov(aX + b, Y) = a·Cov(X, Y)

# Correlation
Corr(aX + b, Y) = sign(a) · Corr(X, Y)  # Magnitude unchanged by linear transforms; sign flips if a < 0

Python Calculation:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov_matrix = np.cov(x, y)
cov_xy = cov_matrix[0, 1]  # Covariance

corr_matrix = np.corrcoef(x, y)
corr_xy = corr_matrix[0, 1]  # Correlation

Warning:

  • Correlation ≠ Causation
  • Correlation = 0 does NOT mean independence (see the sketch below)
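
A minimal sketch of the second point: with Y = X², Y is completely determined by X, yet the Pearson correlation is (near) zero because the relationship is nonlinear:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)   # symmetric around 0
y = x ** 2                      # fully dependent on x, but not linearly

corr = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation(X, X²) ≈ {corr:.3f}")   # ≈ 0 despite total dependence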

Interviewer's Insight

What they're testing: Understanding relationship measures.

Strong answer signals:

  • Knows correlation is dimensionless
  • States correlation measures LINEAR relationship only
  • Knows correlation = 0 ≠ independence
  • Can distinguish correlation from causation

Explain the Law of Large Numbers - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: LLN, Convergence, Sampling | Asked by: Google, Amazon, Microsoft

View Answer

Law of Large Numbers:

Sample mean converges to population mean as sample size → ∞:

\[\bar{X}_n \xrightarrow{p} \mu \quad \text{as} \quad n \to \infty\]

Two Forms:

| Weak LLN | Strong LLN |
|---|---|
| Convergence in probability | Almost sure convergence |
| P(\|X̄ₙ − μ\| > ε) → 0 | P(X̄ₙ → μ) = 1 |

Intuition:

More samples → better estimate of true mean

Example:

import numpy as np
import matplotlib.pyplot as plt

# Fair coin: P(Heads) = 0.5
np.random.seed(42)

flips = np.random.binomial(1, 0.5, 10000)
running_mean = np.cumsum(flips) / np.arange(1, 10001)

plt.plot(running_mean)
plt.axhline(y=0.5, color='r', linestyle='--', label='True Mean')
plt.xlabel('Number of Flips')
plt.ylabel('Running Mean')
plt.title('Law of Large Numbers: Coin Flips')

Key Distinction from CLT:

| LLN | CLT |
|---|---|
| Sample mean → population mean | Distribution of the sample mean → Normal |
| About convergence to a value | About the shape of the distribution |
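A small illustrative sketch of the distinction (fair-coin flips, batch sizes chosen for the example):

import numpy as np

np.random.seed(0)
p = 0.5

# LLN: the running mean of ONE long sequence settles near p
flips = np.random.binomial(1, p, 100_000)
print(f"Running mean after 100k flips: {flips.mean():.4f}")   # ~0.5000

# CLT: the means of MANY batches of 100 flips are ~Normal(0.5, 0.05)
batch_means = np.random.binomial(1, p, (10_000, 100)).mean(axis=1)
print(f"Std of batch means: {batch_means.std():.3f}")          # ~0.05 = sqrt(p(1-p)/100)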

Interviewer's Insight

What they're testing: Asymptotic behavior understanding.

Strong answer signals:

  • Distinguishes LLN from CLT
  • Knows weak vs strong forms
  • Explains practical implications
  • Shows convergence concept

What is a PDF vs PMF vs CDF? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: PDF, PMF, CDF, Distributions | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Probability Mass Function (PMF):

For discrete random variables:

\[P(X = x) = p(x)\]

Properties: - p(x) ≥ 0 - Σp(x) = 1

Probability Density Function (PDF):

For continuous random variables:

\[P(a < X < b) = \int_a^b f(x) dx\]

Properties: - f(x) ≥ 0 - ∫f(x)dx = 1 - P(X = a) = 0 for any exact value!

Cumulative Distribution Function (CDF):

For both discrete and continuous:

\[F(x) = P(X \leq x)\]

Properties: - F(-∞) = 0, F(+∞) = 1 - Monotonically non-decreasing - F'(x) = f(x) for continuous

Visual Comparison:

from scipy import stats
import numpy as np

# Discrete: Binomial PMF and CDF
x_discrete = np.arange(0, 11)
pmf = stats.binom.pmf(x_discrete, n=10, p=0.5)
cdf_discrete = stats.binom.cdf(x_discrete, n=10, p=0.5)

# Continuous: Normal PDF and CDF
x_continuous = np.linspace(-4, 4, 100)
pdf = stats.norm.pdf(x_continuous)
cdf_continuous = stats.norm.cdf(x_continuous)

Interviewer's Insight

What they're testing: Distribution fundamentals.

Strong answer signals:

  • Knows PDF ≠ probability (can exceed 1)
  • Uses CDF for probability calculations
  • Knows F'(x) = f(x) relationship
  • Distinguishes discrete from continuous

What is a Confidence Interval? How to Interpret It? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Confidence Interval, Inference, Uncertainty | Asked by: Google, Amazon, Meta

View Answer

Confidence Interval:

Range that likely contains true population parameter:

\[CI = \text{estimate} \pm \text{margin of error}\]

For Mean (known σ):

\[CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

For Mean (unknown σ):

\[CI = \bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}\]

Common z-values:

Confidence z-value
90% 1.645
95% 1.96
99% 2.576

Correct Interpretation:

✅ "95% of such intervals contain the true mean" ❌ "95% probability the true mean is in this interval"

Python:

from scipy import stats
import numpy as np

data = [23, 25, 28, 22, 26, 27, 24, 29, 25, 26]

# 95% CI for mean
mean = np.mean(data)
se = stats.sem(data)  # Standard error
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=se)

print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

Width Factors:

  • Higher confidence → wider CI
  • Larger n → narrower CI
  • More variability → wider CI

Interviewer's Insight

What they're testing: Statistical inference understanding.

Strong answer signals:

  • Correct frequentist interpretation
  • Knows t vs z distribution choice
  • Understands factors affecting width
  • Can calculate by hand

Explain Hypothesis Testing: Null, Alternative, p-value - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Hypothesis Testing, p-value, Significance | Asked by: Google, Amazon, Meta

View Answer

Hypothesis Testing Framework:

Component Description
H₀ (Null) Default assumption (no effect)
H₁ (Alternative) What we want to prove
α (Significance) False positive threshold (usually 0.05)
p-value P(data | H₀ true)

Decision Rule:

  • If p-value ≤ α: Reject H₀
  • If p-value > α: Fail to reject H₀

Types of Errors:

| Error | Description | Name |
|---|---|---|
| Type I | Reject H₀ when it is true | False Positive |
| Type II | Fail to reject H₀ when it is false | False Negative |

Example - A/B Test:

from scipy import stats

# Control: 100 conversions out of 1000
# Treatment: 120 conversions out of 1000

control_conv = 100
control_n = 1000
treatment_conv = 120
treatment_n = 1000

# H₀: p1 = p2 (no difference)
# H₁: p1 ≠ p2 (difference exists)

# Two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest

stat, pvalue = proportions_ztest(
    [control_conv, treatment_conv],
    [control_n, treatment_n]
)

print(f"p-value: {pvalue:.4f}")
# If p < 0.05, reject H₀ → significant difference

p-value Misconceptions:

❌ p-value = P(H₀ is true) ✅ p-value = P(observing this data or more extreme | H₀ true)

Interviewer's Insight

What they're testing: Core statistical testing knowledge.

Strong answer signals:

  • Correct p-value interpretation
  • Knows Type I vs Type II errors
  • Understands "fail to reject" vs "accept"
  • Can set up hypotheses correctly

What is Power in Hypothesis Testing? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Power, Type II Error, Sample Size | Asked by: Google, Meta, Amazon

View Answer

Power:

Probability of correctly rejecting H₀ when it's false:

\[\text{Power} = 1 - \beta = P(\text{Reject } H_0 | H_0 \text{ is false})\]

Factors Affecting Power:

Factor Effect on Power
Effect size ↑ Power ↑
Sample size ↑ Power ↑
α ↑ Power ↑
Variance ↓ Power ↑

Typical Target: Power = 0.80

Power Analysis - Sample Size Calculation:

from statsmodels.stats.power import TTestIndPower

# Parameters
effect_size = 0.5  # Cohen's d (medium effect)
alpha = 0.05
power = 0.80

# Calculate required sample size
analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1  # Equal group sizes
)

print(f"Required n per group: {n:.0f}")
# ~64 per group for medium effect

Effect Size (Cohen's d):

\[d = \frac{\mu_1 - \mu_2}{\sigma}\]
d Interpretation
0.2 Small
0.5 Medium
0.8 Large
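A short sketch of computing Cohen's d from two samples (using the pooled standard deviation, the usual convention; the numbers are illustrative):

import numpy as np

group1 = np.array([23, 25, 28, 22, 26])
group2 = np.array([19, 21, 24, 20, 22])

# Pooled standard deviation
n1, n2 = len(group1), len(group2)
s_pooled = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
                   / (n1 + n2 - 2))

d = (group1.mean() - group2.mean()) / s_pooled
print(f"Cohen's d: {d:.2f}")  # ~1.66, a large effect by the table above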

Interviewer's Insight

What they're testing: Experimental design knowledge.

Strong answer signals:

  • Knows power = 1 - β
  • Can perform power analysis
  • Understands sample size trade-offs
  • Uses effect size appropriately

Explain Permutations vs Combinations - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Combinatorics, Counting, Fundamentals | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Permutations (Order Matters):

\[P(n, r) = \frac{n!}{(n-r)!}\]

Combinations (Order Doesn't Matter):

\[C(n, r) = \binom{n}{r} = \frac{n!}{r!(n-r)!}\]

Key Relationship:

\[P(n, r) = C(n, r) \cdot r!\]

Examples:

from math import factorial, comb, perm

# 5 people, select 3 for positions (President, VP, Secretary)
# Order matters → Permutation
positions = perm(5, 3)  # = 5 × 4 × 3 = 60

# 5 people, select 3 for a committee
# Order doesn't matter → Combination
committee = comb(5, 3)  # = 10

# Relationship
assert perm(5, 3) == comb(5, 3) * factorial(3)
# 60 = 10 × 6

With Repetition:

| Type | Formula |
|---|---|
| Permutation with repetition | nʳ |
| Combination with repetition | C(n+r−1, r) |
# 4-digit PIN (0-9): 10^4 = 10000
# Choose 3 scoops from 5 flavors (repeats OK): C(5+3-1, 3) = C(7,3) = 35

Interviewer's Insight

What they're testing: Basic counting principles.

Strong answer signals:

  • Immediate recognition of order relevance
  • Knows formulas without derivation
  • Distinguishes with vs without replacement
  • Gives intuitive examples

What is the Negative Binomial Distribution? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Negative Binomial, Discrete, Failures | Asked by: Amazon, Microsoft, Google

View Answer

Negative Binomial Distribution:

Number of trials until rth success:

\[P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}\]

Alternative: Number of failures before rth success (Y = X - r)

Parameters:

Statistic Formula
Mean r/p
Variance r(1-p)/p²

Special Case:

When r = 1: Negative Binomial → Geometric

Example - Quality Control:

Need 3 good widgets. P(good) = 0.8. Expected total inspections?

from scipy.stats import nbinom

r, p = 3, 0.8

# Expected trials until 3 successes
expected_trials = r / p  # = 3 / 0.8 = 3.75

# P(need exactly 5 trials)?
# 5 trials, 3 successes, 2 failures
p_5 = nbinom.pmf(k=2, n=3, p=0.8)  # k = failures
# = C(4,2) * 0.8^3 * 0.2^2 ≈ 0.1229

# P(need at most 4 trials)?
p_le_4 = nbinom.cdf(k=1, n=3, p=0.8)  # ≤1 failure

Applications:

  • Number of sales calls until quota
  • Waiting for multiple events
  • Overdispersed count data

Interviewer's Insight

What they're testing: Generalized geometric distribution.

Strong answer signals:

  • Knows relationship to geometric
  • Handles both parameterizations
  • Calculates mean = r/p
  • Gives practical applications

What is the Beta Distribution? - Amazon, Google Interview Question

Difficulty: 🔴 Hard | Tags: Beta, Continuous, Bayesian | Asked by: Amazon, Google, Meta

View Answer

Beta Distribution:

Models probabilities (values in [0, 1]):

\[f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}\]

Parameters:

Statistic Formula
Mean α / (α + β)
Mode (α-1) / (α+β-2) for α,β > 1
Variance αβ / [(α+β)²(α+β+1)]

Special Cases:

α β Shape
1 1 Uniform
0.5 0.5 U-shaped
2 5 Right-skewed (mass near 0)
5 2 Left-skewed (mass near 1)

Bayesian Application:

Prior for probability p, with binomial likelihood:

from scipy.stats import beta
import numpy as np

# Prior: Beta(2, 2) - slight preference for 0.5
# Observed: 7 successes, 3 failures
# Posterior: Beta(2+7, 2+3) = Beta(9, 5)

prior_alpha, prior_beta = 2, 2
successes, failures = 7, 3

post_alpha = prior_alpha + successes
post_beta = prior_beta + failures

posterior = beta(post_alpha, post_beta)

mean = posterior.mean()  # 9/14 ≈ 0.643
ci = posterior.interval(0.95)  # 95% credible interval

Why Use Beta?

  • Conjugate prior for binomial
  • Posterior is also Beta
  • Flexible shape for [0,1] data

Interviewer's Insight

What they're testing: Bayesian statistics foundation.

Strong answer signals:

  • Knows it models probabilities
  • Uses as prior in Bayesian inference
  • Understands conjugacy
  • Can update with observed data

What is the Gamma Distribution? - Amazon, Google Interview Question

Difficulty: 🔴 Hard | Tags: Gamma, Continuous, Waiting | Asked by: Amazon, Google, Microsoft

View Answer

Gamma Distribution:

Generalized exponential - time until kth event:

\[f(x; k, \theta) = \frac{x^{k-1}e^{-x/\theta}}{\theta^k \Gamma(k)}\]

Parameters:

Statistic Formula
Mean kθ
Variance kθ²
Mode (k−1)θ for k ≥ 1

Special Cases:

Distribution Gamma Parameters
Exponential k = 1
Chi-squared k = ν/2, θ = 2
Erlang k ∈ integers

Application:

from scipy.stats import gamma

# Phone calls: avg 3 per hour (λ=3)
# Time until 5th call?

k = 5  # 5th event
theta = 1/3  # Scale = 1/rate

waiting = gamma(a=k, scale=theta)

# Expected wait
expected = waiting.mean()  # = 5 * (1/3) = 1.67 hours

# P(wait > 2 hours)?
p_gt_2 = 1 - waiting.cdf(2)

Relationship to Poisson:

  • Poisson: count in fixed time
  • Gamma: time until kth count
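A quick numeric check of that duality (a sketch, not from the original): waiting at most t for the kth event is the same event as seeing at least k arrivals by time t.

from scipy.stats import gamma, poisson

rate, k, t = 3, 5, 2          # 3 calls/hour, waiting for the 5th call, horizon 2 hours

p_wait  = gamma.cdf(t, a=k, scale=1/rate)     # P(T_5 <= 2)
p_count = 1 - poisson.cdf(k - 1, rate * t)    # P(N(2) >= 5)

print(f"{p_wait:.4f} == {p_count:.4f}")       # both ~0.7149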

Interviewer's Insight

What they're testing: Advanced distribution knowledge.

Strong answer signals:

  • Knows exponential is Gamma(1, θ)
  • Connects to Poisson process
  • Uses for waiting time problems
  • Knows chi-squared is special gamma

What is a Markov Chain? - Google, Amazon Interview Question

Difficulty: 🔴 Hard | Tags: Markov Chain, Stochastic Process, Probability | Asked by: Google, Amazon, Meta

View Answer

Markov Chain:

Stochastic process with memoryless property:

\[P(X_{n+1} = j | X_n = i, X_{n-1}, ..., X_0) = P(X_{n+1} = j | X_n = i)\]

"Future depends only on present, not past"

Components:

  • States: Finite or infinite set
  • Transition probabilities: P(i → j)
  • Transition matrix: P where Pᵢⱼ = P(i → j)

Example - Weather:

import numpy as np

# States: Sunny (0), Rainy (1)
# P[i,j] = probability of going from i to j
P = np.array([
    [0.8, 0.2],  # Sunny → Sunny=0.8, Rainy=0.2
    [0.4, 0.6]   # Rainy → Sunny=0.4, Rainy=0.6
])

# After n steps from initial state
def state_after_n_steps(P, initial, n):
    return np.linalg.matrix_power(P, n)[initial]

# Stationary distribution (π = πP)
eigenvalues, eigenvectors = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigenvalues - 1))  # pick the eigenvector for eigenvalue 1
stationary = eigenvectors[:, idx].real
stationary = stationary / stationary.sum()
# [0.667, 0.333] - long-run: 66.7% sunny

Key Properties:

  • Irreducible: Can reach any state from any other
  • Aperiodic: No fixed cycles
  • Ergodic: Irreducible + aperiodic → unique stationary dist

Interviewer's Insight

What they're testing: Stochastic modeling knowledge.

Strong answer signals:

  • States memoryless property clearly
  • Can write transition matrix
  • Knows stationary distribution concept
  • Gives PageRank as application

What is Entropy in Information Theory? - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Entropy, Information Theory, Uncertainty | Asked by: Google, Meta, Amazon

View Answer

Shannon Entropy:

Measures uncertainty/information content:

\[H(X) = -\sum_x P(x) \log_2 P(x)\]

Properties:

Distribution Entropy
Uniform Maximum (log₂n for n outcomes)
Deterministic 0 (no uncertainty)
Binary (p=0.5) 1 bit

Example:

import numpy as np

def entropy(probs):
    """Calculate Shannon entropy in bits"""
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

# Fair coin: maximum entropy
fair_coin = entropy([0.5, 0.5])  # 1.0 bit

# Biased coin
biased = entropy([0.9, 0.1])  # 0.47 bits

# Fair die
fair_die = entropy([1/6] * 6)  # 2.58 bits

Cross-Entropy (ML Loss):

\[H(p, q) = -\sum_x p(x) \log q(x)\]

KL Divergence:

\[D_{KL}(p||q) = H(p, q) - H(p)\]
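A small numeric sketch of the identity H(p, q) = H(p) + D_KL(p‖q), with illustrative distributions:

import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log2(p))         # entropy of p
H_pq = -np.sum(p * np.log2(q))         # cross-entropy H(p, q)
kl   = np.sum(p * np.log2(p / q))      # KL divergence D_KL(p||q)

print(np.isclose(H_pq, H_p + kl))      # True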

Interviewer's Insight

What they're testing: Information theory fundamentals.

Strong answer signals:

  • Knows entropy measures uncertainty
  • Uses log₂ for bits, ln for nats
  • Connects to ML cross-entropy loss
  • Understands maximum entropy principle

What are Joint and Marginal Distributions? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Joint Distribution, Marginal, Multivariate | Asked by: Google, Amazon, Meta

View Answer

Joint Distribution:

Probability distribution over multiple variables:

\[P(X=x, Y=y) = P(X=x \cap Y=y)\]

Marginal Distribution:

Distribution of single variable from joint:

\[P(X=x) = \sum_y P(X=x, Y=y)\]

Example:

import numpy as np

# Joint PMF of X and Y
joint = np.array([
    [0.1, 0.2, 0.1],  # X=0
    [0.2, 0.2, 0.1],  # X=1
    [0.0, 0.05, 0.05] # X=2
])
# Columns: Y=0, Y=1, Y=2

# Marginal of X (sum over Y)
marginal_x = joint.sum(axis=1)  # [0.4, 0.5, 0.1]

# Marginal of Y (sum over X)
marginal_y = joint.sum(axis=0)  # [0.3, 0.45, 0.25]

# Conditional P(Y|X=1)
conditional_y_given_x1 = joint[1] / marginal_x[1]
# [0.4, 0.4, 0.2]

Independence Check:

X and Y independent iff: P(X=x, Y=y) = P(X=x) · P(Y=y) for all x, y
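A short sketch of that check for the joint PMF above: under independence the joint table equals the outer product of the marginals.

import numpy as np

joint = np.array([[0.1, 0.2,  0.1],
                  [0.2, 0.2,  0.1],
                  [0.0, 0.05, 0.05]])
marginal_x, marginal_y = joint.sum(axis=1), joint.sum(axis=0)

# Independent iff P(X=x, Y=y) = P(X=x)·P(Y=y) for every cell
print(np.allclose(joint, np.outer(marginal_x, marginal_y)))  # False -> dependent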

Interviewer's Insight

What they're testing: Multivariate probability.

Strong answer signals:

  • Knows marginalization = summing out
  • Can derive conditional from joint
  • Checks independence via product rule
  • Extends to continuous case

What is the Chi-Squared Distribution and Test? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Chi-Squared, Hypothesis Testing, Categorical | Asked by: Amazon, Microsoft, Google

View Answer

Chi-Squared Distribution:

Sum of squared standard normals:

\[\chi^2_k = Z_1^2 + Z_2^2 + ... + Z_k^2\]

where Zᵢ ~ N(0,1) and k = degrees of freedom

Chi-Squared Test for Independence:

Tests if two categorical variables are independent:

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
  • O = observed frequency
  • E = expected frequency (under independence)

Example:

from scipy.stats import chi2_contingency
import numpy as np

# Observed: Gender vs Product Preference
observed = np.array([
    [30, 10, 15],  # Male: A, B, C
    [20, 25, 10]   # Female: A, B, C
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-squared: {chi2:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")

# If p < 0.05: Reject H₀ → Variables are dependent

Chi-Squared Goodness of Fit:

Tests if data follows expected distribution:

from scipy.stats import chisquare

observed = [18, 22, 28, 32]  # Dice rolls
expected = [25, 25, 25, 25]   # Fair die

stat, p = chisquare(observed, expected)

Interviewer's Insight

What they're testing: Categorical data analysis.

Strong answer signals:

  • Knows χ² tests independence/goodness-of-fit
  • Calculates expected under null
  • Uses at least 5 per cell rule
  • Interprets p-value correctly

What is the t-Distribution? When to Use It? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: t-Distribution, Small Samples, Inference | Asked by: Google, Amazon, Meta

View Answer

t-Distribution:

For inference when σ is unknown (uses sample s):

\[t = \frac{\bar{X} - \mu}{s / \sqrt{n}}\]

Properties:

Property Value
Mean 0 (for ν > 1)
Variance ν/(ν-2) for ν > 2
Shape Bell-shaped, heavier tails than Normal
DOF → ∞ Converges to N(0,1)

When to Use:

Use t Use z
σ unknown σ known
Small n (< 30) Large n (n ≥ 30)
Population ~normal CLT applies

Python:

from scipy import stats

# t-test: is population mean = 100?
data = [102, 98, 105, 99, 103, 101, 97, 104]

# One-sample t-test
t_stat, p_value = stats.ttest_1samp(data, 100)

# Critical value for 95% CI (df = n-1)
t_crit = stats.t.ppf(0.975, df=len(data)-1)

# Two-sample t-test
group1 = [23, 25, 28, 22, 26]
group2 = [19, 21, 24, 20, 22]
t_stat, p_value = stats.ttest_ind(group1, group2)

Interviewer's Insight

What they're testing: Small sample inference.

Strong answer signals:

  • Knows to use t when σ unknown
  • States heavier tails than normal
  • Uses correct degrees of freedom
  • Knows t → z as n → ∞

What is the Uniform Distribution? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Uniform, Continuous, Random | Asked by: Google, Amazon, Meta

View Answer

Continuous Uniform Distribution:

Equal probability over interval [a, b]:

\[f(x) = \frac{1}{b-a}, \quad a \leq x \leq b\]

Properties:

Statistic Formula
Mean (a + b) / 2
Variance (b - a)² / 12
CDF (x - a) / (b - a)

Discrete Uniform:

\[P(X = k) = \frac{1}{n}\]

for k in {1, 2, ..., n}

Python:

from scipy.stats import uniform
import numpy as np

# Uniform[0, 1]
U = uniform(loc=0, scale=1)

# Generate random samples
samples = np.random.uniform(0, 1, 1000)

# Uniform[2, 8]
U = uniform(loc=2, scale=6)  # loc=a, scale=b-a
U.mean()  # 5.0
U.var()   # 3.0

Inverse Transform Sampling:

If U ~ Uniform(0,1), then F⁻¹(U) has distribution F:

# Generate exponential from uniform
u = np.random.uniform(0, 1, 1000)
exponential_samples = -np.log(1 - u)  # Inverse CDF of Exp(1)

Interviewer's Insight

What they're testing: Basic distribution knowledge.

Strong answer signals:

  • Knows mean = (a+b)/2
  • Uses for random number generation
  • Knows inverse transform method
  • Distinguishes continuous vs discrete

Explain Sampling With vs Without Replacement - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Sampling, Replacement, Combinatorics | Asked by: Google, Amazon, Meta

View Answer

With Replacement:

  • Each item can be selected multiple times
  • Trials are independent
  • Probabilities remain constant

Without Replacement:

  • Each item selected at most once
  • Trials are dependent
  • Probabilities change after each selection

Example:

import numpy as np

population = [1, 2, 3, 4, 5]

# With replacement - same item can appear multiple times
with_rep = np.random.choice(population, size=3, replace=True)
# Possible: [3, 3, 1]

# Without replacement - unique items only
without_rep = np.random.choice(population, size=3, replace=False)
# Possible: [4, 1, 3] but never [3, 3, 1]

Probability Differences:

Drawing 2 red cards from deck:

# With replacement
p_with = (26/52) * (26/52)       # = 0.25

# Without replacement
p_without = (26/52) * (25/51)    # ≈ 0.245

When Each is Used:

With Replacement Without Replacement
Bootstrap sampling Survey sampling
Dice rolling Lottery
Monte Carlo Card dealing

Interviewer's Insight

What they're testing: Sampling concepts.

Strong answer signals:

  • Knows independence implications
  • Can calculate both scenarios
  • Mentions hypergeometric for without
  • Knows bootstrap uses with replacement

What is the Hypergeometric Distribution? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Hypergeometric, Sampling, Without Replacement | Asked by: Google, Amazon, Meta

View Answer

Hypergeometric Distribution:

Successes in n draws without replacement:

\[P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}\]
  • N = population size
  • K = successes in population
  • n = sample size
  • k = successes in sample

Example - Quality Control:

Lot of 100 items, 10 defective. Sample 15 without replacement.

from scipy.stats import hypergeom

N, K, n = 100, 10, 15

# P(exactly 2 defective)?
p_2 = hypergeom.pmf(k=2, M=N, n=K, N=n)

# Expected defectives
expected = n * K / N  # = 15 * 10/100 = 1.5

# P(at least 1 defective)?
p_at_least_1 = 1 - hypergeom.pmf(k=0, M=N, n=K, N=n)

Comparison with Binomial:

Hypergeometric Binomial
Without replacement With replacement
p changes p constant
Var < np(1-p) Var = np(1-p)

For large N, hypergeometric ≈ binomial
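A quick sketch of that approximation (illustrative numbers): with a large population the two PMFs nearly coincide.

from scipy.stats import hypergeom, binom

# Population of 10,000 items, 10% defective; sample 15 without replacement
N, K, n = 10_000, 1_000, 15

for k in range(4):
    p_hyper = hypergeom.pmf(k, M=N, n=K, N=n)
    p_binom = binom.pmf(k, n, K / N)
    print(f"k={k}: hypergeom={p_hyper:.4f}, binomial={p_binom:.4f}")
# The two columns agree to about 3 decimal places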

Interviewer's Insight

What they're testing: Finite population sampling.

Strong answer signals:

  • Knows formula intuitively
  • Compares to binomial
  • Uses for quality control problems
  • Knows approximation for large N

What is the F-Distribution? - Google, Amazon Interview Question

Difficulty: 🔴 Hard | Tags: F-Distribution, ANOVA, Variance Comparison | Asked by: Google, Amazon, Microsoft

View Answer

F-Distribution:

Ratio of two chi-squared distributions:

\[F = \frac{\chi^2_1 / d_1}{\chi^2_2 / d_2}\]

Use Cases:

  1. ANOVA (compare group means)
  2. Comparing variances
  3. Regression overall significance

F-Test for Variance:

from scipy import stats
import numpy as np

# Compare variances of two samples
sample1 = [23, 25, 28, 22, 26, 27]
sample2 = [19, 31, 24, 28, 20, 35]

var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)

f_stat = var1 / var2
df1, df2 = len(sample1) - 1, len(sample2) - 1

p_value = 2 * min(
    stats.f.cdf(f_stat, df1, df2),
    1 - stats.f.cdf(f_stat, df1, df2)
)

One-Way ANOVA:

from scipy.stats import f_oneway

group1 = [85, 90, 88, 92, 87]
group2 = [78, 82, 80, 79, 81]
group3 = [91, 95, 89, 94, 92]

f_stat, p_value = f_oneway(group1, group2, group3)
# If p < 0.05: At least one group mean differs

Interviewer's Insight

What they're testing: Advanced statistical tests.

Strong answer signals:

  • Knows F = ratio of variances
  • Uses for ANOVA and regression
  • Understands two df parameters
  • Can interpret F-stat and p-value

How Do You Calculate Sample Size for A/B Tests? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: A/B Testing, Sample Size, Power | Asked by: Google, Meta, Amazon

View Answer

Sample Size Formula (Two Proportions):

\[n = \frac{2(z_{\alpha/2} + z_{\beta})^2 \bar{p}(1-\bar{p})}{\delta^2}\]

where:

  • δ = minimum detectable effect
  • p̄ = average proportion
  • α = significance level (usually 0.05)
  • 1−β = power (usually 0.80)

Python Calculation:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Current conversion: 10%
# Want to detect: 2% absolute lift (to 12%)
p1, p2 = 0.10, 0.12

# Effect size
effect_size = proportion_effectsize(p1, p2)

# Power analysis
power_analysis = NormalIndPower()
n_per_group = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1
)

print(f"Required per group: {n_per_group:.0f}")
# ~3,800 per group for 2% lift detection

Rule of Thumb:

For 80% power, 5% significance:

  • 1% absolute lift: ~15,000 per group
  • 2% absolute lift: ~3,800 per group
  • 5% absolute lift: ~600 per group
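A sketch that roughly reproduces those numbers from the formula above (assuming a 10% baseline conversion, two-sided α = 0.05, 80% power):

from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Classical two-proportion sample size (formula above)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    return 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2

for lift in (0.01, 0.02, 0.05):
    print(f"{lift:.0%} lift: ~{n_per_group(0.10, 0.10 + lift):,.0f} per group")
# ≈ 14,750 / 3,840 / 690 per group — close to the rule of thumb above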

Factors:

Factor Effect on n
Smaller effect → Larger n
Higher power → Larger n
Lower α → Larger n

Interviewer's Insight

What they're testing: Experimental design skills.

Strong answer signals:

  • Knows key inputs (effect, power, α)
  • Uses standard library for calculation
  • Understands trade-offs
  • Gives practical rule of thumb

What is Bayesian vs Frequentist Probability? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Bayesian, Frequentist, Philosophy | Asked by: Google, Amazon, Meta

View Answer

Frequentist:

  • Probability = long-run frequency
  • Parameters are fixed (unknown constants)
  • Inference via sampling distribution
  • Uses p-values and confidence intervals

Bayesian:

  • Probability = degree of belief
  • Parameters have distributions
  • Inference via Bayes' theorem
  • Uses posterior and credible intervals

Comparison:

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency | Belief/uncertainty |
| Parameters | Fixed | Random |
| Prior info | Not used | Used explicitly |
| Intervals | 95% CI: "95% of such intervals contain the true value" | 95% credible: "95% probability the parameter is in the interval" |

Example:

# Frequentist: p-value
from scipy.stats import ttest_1samp
data = [52, 48, 55, 49, 51]
t_stat, p_value = ttest_1samp(data, 50)

# Bayesian: posterior
import pymc as pm
with pm.Model():
    mu = pm.Normal('mu', mu=50, sigma=10)  # Prior
    obs = pm.Normal('obs', mu=mu, sigma=3, observed=data)
    trace = pm.sample(1000)
# 95% credible interval from posterior

Interviewer's Insight

What they're testing: Statistical philosophy understanding.

Strong answer signals:

  • Explains both paradigms fairly
  • Knows interval interpretation difference
  • Mentions when each is preferred
  • Doesn't dogmatically favor one

What is the Multiple Comparisons Problem? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Multiple Testing, FWER, FDR | Asked by: Google, Meta, Amazon

View Answer

The Problem:

With many tests at α=0.05, false positives accumulate:

P(at least 1 false positive) = 1 - (1-α)ⁿ

  • 20 tests: 64% chance of false positive
  • 100 tests: 99.4% chance!
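A two-line sketch verifying those numbers:

alpha = 0.05
for n_tests in (1, 20, 100):
    print(f"{n_tests} tests: P(>=1 false positive) = {1 - (1 - alpha) ** n_tests:.1%}")
# 1 test: 5.0%, 20 tests: 64.2%, 100 tests: 99.4%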

Solutions:

1. Bonferroni Correction (FWER):

Use α/n for each test:

n_tests = 20
alpha = 0.05
bonferroni_alpha = alpha / n_tests  # 0.0025

Conservative but controls family-wise error rate.

2. Benjamini-Hochberg (FDR):

Controls false discovery rate:

from scipy.stats import false_discovery_control

p_values = [0.001, 0.008, 0.012, 0.045, 0.060, 0.120]

# Adjust p-values
adjusted = false_discovery_control(p_values, method='bh')

# Or manually (Benjamini-Hochberg step-up thresholds):
alpha = 0.05
sorted_p = sorted(p_values)
n = len(p_values)
for i, p in enumerate(sorted_p):
    threshold = (i + 1) / n * alpha
    print(f"p={p:.3f}, BH threshold={threshold:.3f}")

When to Use:

Method Use Case
No correction Single pre-specified test
Bonferroni Few tests, must avoid any FP
BH Many tests, some FP acceptable

Interviewer's Insight

What they're testing: Rigorous testing knowledge.

Strong answer signals:

  • Explains why it's a problem
  • Knows Bonferroni is conservative
  • Uses FDR for exploratory analysis
  • Applies to A/B testing scenarios

What is Bootstrap Sampling? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Bootstrap, Resampling, Non-parametric | Asked by: Google, Amazon, Meta

View Answer

Bootstrap:

Resampling with replacement to estimate sampling distribution.

Process:

  1. Draw n samples with replacement from data
  2. Calculate statistic of interest
  3. Repeat B times (e.g., 10,000)
  4. Use distribution of statistics for inference

Example:

import numpy as np

data = [23, 25, 28, 22, 26, 27, 30, 24, 29, 25]
n_bootstrap = 10000

# Bootstrap confidence interval for mean
bootstrap_means = []
for _ in range(n_bootstrap):
    sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(sample))

# 95% CI (percentile method)
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)

print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")

Use Cases:

  • Confidence intervals for any statistic
  • Estimating standard errors
  • When distribution unknown
  • Complex statistics (median, ratios)

Types:

Method Description
Percentile Use quantiles directly
Basic Reflect around estimate
BCa Bias-corrected accelerated

Interviewer's Insight

What they're testing: Modern statistical methods.

Strong answer signals:

  • Knows to resample WITH replacement
  • Uses for non-standard statistics
  • Knows different CI methods
  • Mentions computational cost

Average Score on a Die Rolled at Most 3 Times - Jane Street, Hudson River Trading, Citadel Interview Question

Difficulty: 🔴 Hard | Tags: Probability, Expected Value, Game Theory | Asked by: Jane Street, Hudson River Trading, Citadel

View Answer

Consider a fair 6-sided die. Your aim is to get the highest score you can in at most 3 rolls.

A score is the number showing on the upward face after a roll. You may roll at most 3 times, and after each roll you decide whether to roll again.

The last score counts as your final score.

  • What is the average score if you roll the die only once?
  • What is the average score you can get with at most 3 rolls?
  • If the die is fair, why aren't the averages for at most 3 rolls and for a single roll the same?
Hint 1

First find the expected score of a single roll.

You roll again only when the current score is below the expected value of rolling again.

E.g., if the expected score of a single roll were 4.5, you would re-roll on 1, 2, 3, 4 and keep 5 or 6.

Answer

If you roll a fair die once you can get:

| Score | Probability |
|---|---|
| 1 | 1/6 |
| 2 | 1/6 |
| 3 | 1/6 |
| 4 | 1/6 |
| 5 | 1/6 |
| 6 | 1/6 |

So your average score with one roll is:

Σ (score × probability) = (1+2+3+4+5+6) × (1/6) = 21/6 = 3.5

The average score if you roll the die only once is 3.5.

For at most 3 rolls, let's work backwards. Say you have just made your second roll and must decide whether to take the 3rd.

We just found that a single roll is worth 3.5 on average, so we roll a 3rd time only if the 2nd roll is below 3.5, i.e. 1, 2 or 3.

Possibilities

| 2nd roll score | Probability | Expected final score | Decision |
|---|---|---|---|
| 1 | 1/6 | 3.5 | roll again |
| 2 | 1/6 | 3.5 | roll again |
| 3 | 1/6 | 3.5 | roll again |
| 4 | 1/6 | 4 | keep (score > 3.5) |
| 5 | 1/6 | 5 | keep (score > 3.5) |
| 6 | 1/6 | 6 | keep (score > 3.5) |

So with at most 2 rolls, the average score would be:

[Roll again if the current score is less than 3.5]
(3.5)·(1/6) + (3.5)·(1/6) + (3.5)·(1/6)
+
(4)·(1/6) + (5)·(1/6) + (6)·(1/6)   [decide not to roll again]
=
1.75 + 2.5 = 4.25

The average score with at most two rolls is 4.25.

Now look from the perspective of the first roll. We roll again only if the score is less than 4.25, i.e. 1, 2, 3 or 4.

Possibilities

| 1st roll score | Probability | Expected final score | Decision |
|---|---|---|---|
| 1 | 1/6 | 4.25 | roll again |
| 2 | 1/6 | 4.25 | roll again |
| 3 | 1/6 | 4.25 | roll again |
| 4 | 1/6 | 4.25 | roll again |
| 5 | 1/6 | 5 | keep (score > 4.25) |
| 6 | 1/6 | 6 | keep (score > 4.25) |

So with at most 3 rolls, the average score would be:

[Roll again if the current score is less than 4.25]
(4.25)·(1/6) + (4.25)·(1/6) + (4.25)·(1/6) + (4.25)·(1/6)
+
(5)·(1/6) + (6)·(1/6)   [decide not to roll again]
=
17/6 + 11/6 = 28/6 ≈ 4.67

The average score with at most 3 rolls is ≈ 4.67.

The averages for at most 3 rolls and for a single roll differ not because the die behaves differently, but because the decision to roll again depends on the previous outcome: we keep high scores and re-roll low ones, so the final score is selected in our favor. If we always rolled all 3 times and kept the last score regardless of what came up, the expected final score would again be 3.5.
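A simulation sketch (hypothetical helpers, using the thresholds derived above) that reproduces 3.5, 4.25 and ≈ 4.67:

import random

def expected(rolls_left):
    """Exact expected score with optimal stopping (backward induction)."""
    if rolls_left == 1:
        return 3.5
    cont = expected(rolls_left - 1)   # value of re-rolling
    return sum(face if face > cont else cont for face in range(1, 7)) / 6

def simulate(rolls, trials=200_000):
    total = 0
    for _ in range(trials):
        for left in range(rolls, 0, -1):
            score = random.randint(1, 6)
            if left == 1 or score > expected(left - 1):
                break                  # keep this score
        total += score
    return total / trials

for r in (1, 2, 3):
    print(f"at most {r} roll(s): exact={expected(r):.3f}, simulated={simulate(r):.3f}")
# 3.500, 4.250, ~4.667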

Interviewer's Insight

What they're testing: Optimal stopping and backward induction.

Explain the Coupon Collector Problem - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Coupon Collector, Expected Value, Puzzle | Asked by: Google, Amazon, Meta

View Answer

Problem:

How many items to collect before getting all n types? (Each type equally likely)

Expected Value:

\[E[T] = n \cdot H_n = n \cdot \sum_{i=1}^{n} \frac{1}{i}\]

where Hₙ is the nth harmonic number.

Intuition:

After collecting k types, expected trials until new type = n/(n-k)

Example:

import numpy as np

def expected_trials(n):
    """Expected trials to collect all n types"""
    return n * sum(1/i for i in range(1, n+1))

# 6 types (like Pokemon cards)
print(f"E[trials]: {expected_trials(6):.2f}")  # ~14.7

# Simulation
def simulate_coupon_collector(n, simulations=10000):
    trials = []
    for _ in range(simulations):
        collected = set()
        count = 0
        while len(collected) < n:
            collected.add(np.random.randint(n))
            count += 1
        trials.append(count)
    return np.mean(trials)

print(f"Simulated: {simulate_coupon_collector(6):.2f}")

Applications:

  • A/B testing (all user segments)
  • Load testing (all code paths)
  • Collecting rare items

Interviewer's Insight

What they're testing: Probability puzzle solving.

Strong answer signals:

  • Uses linearity of expectation
  • Knows harmonic series result
  • Can simulate to verify
  • Applies to real scenarios

What is Simpson's Paradox? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Simpson's Paradox, Confounding, Causality | Asked by: Google, Meta, Amazon

View Answer

Simpson's Paradox:

Trend appears in subgroups but reverses when combined.

Classic Example - UC Berkeley Admissions:

|  | Applied | Admitted |
|---|---|---|
| Men | 8,442 | 44% |
| Women | 4,321 | 35% |

Looks like discrimination against women!

But by department:

| Dept | Men: Applied | Men: Admit % | Women: Applied | Women: Admit % |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |

Women had HIGHER rates in each department!

Cause:

Women applied more to competitive departments.

import pandas as pd

# Weighted vs unweighted
data = pd.DataFrame({
    'dept': ['A', 'A', 'B', 'B'],
    'gender': ['M', 'F', 'M', 'F'],
    'applications': [825, 108, 560, 25],
    'rate': [0.62, 0.82, 0.63, 0.68]
})

# Department is a confounding variable

Lesson:

Always consider lurking/confounding variables before drawing conclusions.

Interviewer's Insight

What they're testing: Critical thinking about data.

Strong answer signals:

  • Gives clear example
  • Identifies confounding variable
  • Knows when to aggregate vs stratify
  • Relates to A/B testing concerns

What Are Quantiles and Percentiles? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Quantiles, Percentiles, Descriptive | Asked by: Google, Amazon, Meta

View Answer

Definitions:

  • Quantile: Values dividing distribution into intervals
  • Percentile: Quantile expressed as percentage
  • P-th percentile: Value below which P% of data falls

Common Quantiles:

Name Divides Into
Median (Q2) 2 equal parts
Quartiles (Q1, Q2, Q3) 4 equal parts
Deciles 10 equal parts
Percentiles 100 equal parts

Calculation:

import numpy as np

data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]

# Percentiles
p25 = np.percentile(data, 25)  # Q1
p50 = np.percentile(data, 50)  # Median
p75 = np.percentile(data, 75)  # Q3
p90 = np.percentile(data, 90)  # 90th percentile

# IQR (Interquartile Range)
iqr = p75 - p25

Uses:

  • Latency: "p99 response time < 100ms"
  • Salaries: "In top 10% earners"
  • Outlier detection: Beyond 1.5*IQR

Z-score to Percentile:

from scipy.stats import norm

# Z = 1.645 → 95th percentile
norm.cdf(1.645)  # ≈ 0.95

Interviewer's Insight

What they're testing: Basic statistical literacy.

Strong answer signals:

  • Knows p50 = median
  • Uses for SLA metrics
  • Can convert z-scores to percentiles
  • Understands IQR for robustness

What is the Difference Between Standard Deviation and Standard Error? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Standard Deviation, Standard Error, Sampling | Asked by: Google, Amazon, Meta

View Answer

Standard Deviation (SD):

Measures spread of individual observations:

\[SD = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}\]

Standard Error (SE):

Measures uncertainty in sample mean:

\[SE = \frac{SD}{\sqrt{n}}\]

Key Difference:

SD SE
Describes data spread Describes estimate precision
Doesn't depend on n (conceptually) Decreases with larger n
Used for z-scores Used for confidence intervals

Example:

import numpy as np
from scipy.stats import sem

data = [23, 25, 28, 22, 26, 27, 24, 29, 25, 26]

sd = np.std(data, ddof=1)  # Sample SD
se = sem(data)  # Standard error of mean
# or se = sd / np.sqrt(len(data))

mean = np.mean(data)

# 95% CI using SE
ci = (mean - 1.96*se, mean + 1.96*se)

print(f"SD: {sd:.2f}")   # ~2.21
print(f"SE: {se:.2f}")   # ~0.70
print(f"95% CI: {ci}")

Intuition:

  • SD: "Typical distance of point from mean"
  • SE: "Typical error in our estimate of the mean"

Interviewer's Insight

What they're testing: Sampling variability understanding.

Strong answer signals:

  • Clearly distinguishes the two concepts
  • Knows SE = SD/√n
  • Uses SE for confidence intervals
  • Knows SE decreases with n

What is Moment Generating Function? - Amazon, Microsoft Interview Question

Difficulty: 🔴 Hard | Tags: MGF, Moments, Advanced | Asked by: Amazon, Microsoft, Google

View Answer

Moment Generating Function (MGF):

\[M_X(t) = E[e^{tX}] = \sum_x e^{tx} P(X=x)\]

Why "Moment Generating"?

nth moment = nth derivative at t=0:

\[E[X^n] = M_X^{(n)}(0)\]

Properties:

  1. Uniquely determines distribution
  2. Sum of independent RVs: MGF = product of MGFs
  3. Linear transform: M_{aX+b}(t) = e^{bt} M_X(at)

Examples:

Distribution MGF
Normal(μ,σ²) exp(μt + σ²t²/2)
Exponential(λ) λ/(λ-t) for t < λ
Poisson(λ) exp(λ(eᵗ-1))
Binomial(n,p) (1-p+peᵗ)ⁿ

Deriving Moments:

# For Exponential(λ=2): M(t) = 2/(2-t)
# E[X] = M'(0) = 2/(2-0)² = 1/2
# E[X²] = M''(0) = 4/(2-0)³ = 1/2
# Var(X) = E[X²] - (E[X])² = 1/2 - 1/4 = 1/4
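The same moments can be checked symbolically — a sketch using sympy (an assumption; the document doesn't otherwise rely on it):

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
M = lam / (lam - t)                         # MGF of Exponential(lam), valid for t < lam

EX  = sp.diff(M, t, 1).subs(t, 0)           # E[X]   = M'(0)  -> 1/lam
EX2 = sp.diff(M, t, 2).subs(t, 0)           # E[X^2] = M''(0) -> 2/lam**2
var = sp.simplify(EX2 - EX**2)              # Var(X) -> 1/lam**2

print(EX, EX2, var)                         # with lam = 2: 1/2, 1/2, 1/4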

Application:

Proving CLT: MGF of sum → MGF of normal

Interviewer's Insight

What they're testing: Advanced probability theory.

Strong answer signals:

  • Knows moment derivation via derivatives
  • Uses for proving sum distributions
  • Knows MGF uniquely identifies distribution
  • Can derive simple moments

What is the Waiting Time Paradox? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Waiting Time, Inspection Paradox, Counter-intuitive | Asked by: Google, Amazon, Meta

View Answer

The Paradox:

Average wait for a bus can exceed half the average interval!

Explanation:

You're more likely to arrive during a LONG interval than a short one.

Mathematical:

For Poisson arrivals (rate λ): - Average interval: 1/λ - Expected wait: 1/λ (same as full interval!)

Due to memoryless property.

Example:

import numpy as np

# Buses every 10 minutes on average (Poisson process)
lambda_rate = 0.1  # buses per minute

# Simulate a long timeline of bus arrivals
n_buses = 10000
intervals = np.random.exponential(1/lambda_rate, n_buses)
arrival_times = np.cumsum(intervals)

# Passengers arrive uniformly in time, so they tend to land in LONG intervals
passengers = np.random.uniform(0, arrival_times[-1], 10000)
next_bus = np.searchsorted(arrival_times, passengers)
wait_times = arrival_times[next_bus] - passengers

avg_wait = np.mean(wait_times)
avg_interval = np.mean(intervals)

print(f"Avg interval: {avg_interval:.1f} min")
print(f"Avg wait: {avg_wait:.1f} min")
# Both approximately 10 minutes!

Real-World:

If buses are scheduled (not Poisson), wait ≈ interval/2. But with variability, wait increases due to "length-biased sampling."

Interviewer's Insight

What they're testing: Counter-intuitive probability.

Strong answer signals:

  • Explains length-biased sampling
  • Connects to memoryless property
  • Can simulate to demonstrate
  • Knows scheduled vs random arrivals differ

How Do You Estimate Probability from Rare Events? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Rare Events, Estimation, Confidence | Asked by: Google, Amazon, Meta

View Answer

The Challenge:

0 events in n trials. Is probability really 0?

Rule of Three:

If 0 events in n trials, 95% confident p < 3/n

n = 1000  # trials
events = 0  # observed

# 95% upper bound
upper_bound = 3 / n  # 0.003 or 0.3%

Bayesian Approach:

from scipy.stats import beta

# Prior: Beta(1, 1) = Uniform
# Posterior: Beta(1 + k, 1 + n - k)

n, k = 1000, 0
posterior = beta(1 + k, 1 + n - k)

# 95% credible interval
ci = posterior.interval(0.95)
print(f"95% CI: ({ci[0]:.5f}, {ci[1]:.5f})")
# (0.0, 0.003)

# With 3 events in 1000:
posterior = beta(1 + 3, 1 + 1000 - 3)
mean_estimate = posterior.mean()  # ≈ 0.004

Methods Comparison:

Method Estimate CI
MLE (k/n) 0 Undefined
Rule of 3 - (0, 0.003)
Bayesian 0.001 (0, 0.003)
Wilson 0.0002 (0, 0.002)

Interviewer's Insight

What they're testing: Practical estimation skills.

Strong answer signals:

  • Knows Rule of Three for quick bounds
  • Uses Bayesian for proper intervals
  • Doesn't report 0 as point estimate
  • Mentions sample size requirements

Explain Type I and Type II Errors with Examples - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Hypothesis Testing, Error Types, Statistics | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Type I Error (False Positive):

Rejecting true null hypothesis. Denoted α (significance level).

Type II Error (False Negative):

Failing to reject false null hypothesis. Denoted β.

Power = 1 - β: Probability of correctly rejecting false null.

Medical Test Example:

| Test Result | Truth: No Disease | Truth: Disease |
|---|---|---|
| Negative | ✅ Correct | ❌ Type II Error (β) |
| Positive | ❌ Type I Error (α) | ✅ Correct (Power) |

Criminal Trial Analogy:

H₀: Defendant is innocent
H₁: Defendant is guilty

Type I Error: Convict innocent person (α)
Type II Error: Acquit guilty person (β)

Legal system prefers Type II over Type I
→ Set α = 0.05 (strict threshold)

Trade-off:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Two distributions: H0 and H1
x = np.linspace(-4, 8, 1000)
h0_dist = stats.norm(0, 1)  # Null
h1_dist = stats.norm(3, 1)  # Alternative

# Critical value for α = 0.05
critical = h0_dist.ppf(0.95)  # 1.645

# α: Area under H0 beyond critical
alpha = 1 - h0_dist.cdf(critical)  # 0.05

# β: Area under H1 below critical
beta = h1_dist.cdf(critical)  # 0.09

power = 1 - beta  # 0.91

print(f"α (Type I): {alpha:.3f}")
print(f"β (Type II): {beta:.3f}")
print(f"Power: {power:.3f}")

How to Reduce Errors:

Action Effect on α Effect on β
Increase sample size Same Decreases ↓
Decrease α threshold Decreases ↓ Increases ↑
Increase α threshold Increases ↑ Decreases ↓

Interviewer's Insight

What they're testing: Understanding hypothesis test trade-offs.

Strong answer signals:

  • Uses clear real-world analogy
  • Explains α-β trade-off
  • Knows power = 1 - β
  • Mentions sample size as solution

What is the Likelihood Ratio Test? - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Hypothesis Testing, Likelihood, Statistical Tests | Asked by: Google, Meta, Amazon

View Answer

Likelihood Ratio Test (LRT):

Compares fit of two nested models:

\[\Lambda = \frac{L(\theta_0 | data)}{L(\hat{\theta} | data)}\]

Or equivalently:

\[-2\log(\Lambda) = 2[\log L(\hat{\theta}) - \log L(\theta_0)]\]

Follows χ² distribution with df = difference in parameters.

Example - Coin Fairness:

from scipy import stats
import numpy as np

# Data: 60 heads in 100 flips
heads = 60
n = 100

# H0: p = 0.5 (fair coin)
p0 = 0.5
L0 = stats.binom.pmf(heads, n, p0)

# H1: p = MLE = 60/100
p_hat = heads / n
L1 = stats.binom.pmf(heads, n, p_hat)

# Likelihood ratio
lambda_stat = L0 / L1

# Test statistic (asymptotically χ² with df=1)
test_stat = -2 * np.log(lambda_stat)

# p-value
p_value = 1 - stats.chi2.cdf(test_stat, df=1)

print(f"Test statistic: {test_stat:.3f}")
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Reject H0: Coin is biased")
else:
    print("Fail to reject H0: Coin appears fair")

Why LRT is Powerful:

  • Optimal under certain conditions (Neyman-Pearson lemma)
  • Works for complex hypotheses
  • Asymptotically χ² distributed

Common Applications:

  • Model selection (compare nested models)
  • Goodness of fit tests
  • Testing parameter significance in regression

Interviewer's Insight

What they're testing: Advanced statistical testing knowledge.

Strong answer signals:

  • Knows -2 log(Λ) ~ χ²
  • Can apply to real problem
  • Mentions nested models requirement
  • Links to model selection

Explain the Bias of an Estimator - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Estimation, Bias, Statistics | Asked by: Amazon, Microsoft, Google

View Answer

Bias of Estimator:

\[Bias(\hat{\theta}) = E[\hat{\theta}] - \theta\]

Unbiased: \(E[\hat{\theta}] = \theta\)

Example - Sample Variance:

import numpy as np

# Population variance: divide by n
population = np.random.normal(100, 15, size=10000)
pop_var = np.var(population)  # True variance

# Sample variance (biased): divide by n
sample = np.random.choice(population, 30)
biased_var = np.mean((sample - sample.mean())**2)  # ÷n

# Sample variance (unbiased): divide by n-1
unbiased_var = np.var(sample, ddof=1)  # ÷(n-1)

print(f"Population variance: {pop_var:.2f}")
print(f"Biased estimator: {biased_var:.2f}")
print(f"Unbiased estimator: {unbiased_var:.2f}")

# Repeat 10000 times
biased_estimates = []
unbiased_estimates = []
for _ in range(10000):
    s = np.random.choice(population, 30)
    biased_estimates.append(np.mean((s - s.mean())**2))
    unbiased_estimates.append(np.var(s, ddof=1))

print(f"\nBiased mean: {np.mean(biased_estimates):.2f}")
print(f"Unbiased mean: {np.mean(unbiased_estimates):.2f}")

Why Divide by n-1?

Using sample mean (not true mean) introduces bias: - Sample points closer to sample mean than true mean - Need Bessel's correction: n/(n-1) factor

Bias-Variance Tradeoff:

\[MSE = Bias^2 + Variance\]

Sometimes biased estimators have lower MSE!

Example:

Estimator Bias Variance MSE
Sample mean 0 σ²/n σ²/n
Median (normal) 0 πσ²/(2n) πσ²/(2n)
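To make that MSE point concrete, here is a sketch (assuming normally distributed data) comparing the biased divide-by-n variance estimator with the unbiased divide-by-(n−1) one — the biased one has the lower MSE here:

import numpy as np

rng = np.random.default_rng(0)
true_var, n = 15**2, 10

mle_est, unbiased_est = [], []
for _ in range(50_000):
    s = rng.normal(100, 15, n)
    mle_est.append(np.var(s, ddof=0))        # divide by n   (biased)
    unbiased_est.append(np.var(s, ddof=1))   # divide by n-1 (unbiased)

mse = lambda est: np.mean((np.array(est) - true_var) ** 2)
print(f"MSE, divide by n:   {mse(mle_est):,.0f}")       # ~9,600  (lower)
print(f"MSE, divide by n-1: {mse(unbiased_est):,.0f}")  # ~11,250 (higher)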

Interviewer's Insight

What they're testing: Deep understanding of estimation.

Strong answer signals:

  • Knows formula E[θ̂] - θ
  • Explains Bessel's correction
  • Mentions bias-variance tradeoff
  • Knows unbiased ≠ always better

What is the Maximum Likelihood Estimation (MLE)? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: MLE, Parameter Estimation, Statistics | Asked by: Google, Amazon, Microsoft, Meta

View Answer

Maximum Likelihood Estimation:

Find parameter θ that maximizes probability of observed data:

\[\hat{\theta}_{MLE} = \arg\max_\theta L(\theta | data)\]

Often maximize log-likelihood:

\[\hat{\theta}_{MLE} = \arg\max_\theta \log L(\theta | data)\]

Example - Coin Flip:

import numpy as np
from scipy.optimize import minimize_scalar

# Data: 7 heads in 10 flips
heads = 7
n = 10

# Likelihood function
def neg_log_likelihood(p):
    # Negative because we minimize
    from scipy.stats import binom
    return -binom.logpmf(heads, n, p)

# Find MLE
result = minimize_scalar(neg_log_likelihood, bounds=(0, 1), method='bounded')
p_mle = result.x

print(f"MLE estimate: p = {p_mle:.3f}")
# Expected: 7/10 = 0.7

# Analytical solution
p_analytical = heads / n
print(f"Analytical: p = {p_analytical:.3f}")

Example - Normal Distribution:

# Data
data = np.array([2.1, 1.9, 2.3, 2.0, 1.8, 2.2])

# MLE for normal: μ = mean, σ² = variance (biased)
mu_mle = np.mean(data)
sigma_mle = np.std(data, ddof=0)  # Note: biased MLE

print(f"MLE μ: {mu_mle:.3f}")
print(f"MLE σ: {sigma_mle:.3f}")

Properties of MLE:

Property Description
Consistent θ̂ → θ as n → ∞
Asymptotically normal √n(θ̂-θ) ~ N(0, I⁻¹)
Invariant If θ̂ is MLE, g(θ̂) is MLE of g(θ)
May be biased In finite samples

When to Use:

  • Have parametric model
  • Want point estimate
  • Large sample size
  • No strong prior belief (use MLE over Bayesian)

Interviewer's Insight

What they're testing: Fundamental parameter estimation.

Strong answer signals:

  • Maximizes likelihood of data
  • Takes log for computational ease
  • Knows properties (consistency, asymptotic normality)
  • Can derive analytically for simple cases

Explain the Weak vs Strong Law of Large Numbers - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Law of Large Numbers, Convergence, Theory | Asked by: Google, Meta, Microsoft

View Answer

Weak Law of Large Numbers (WLLN):

Convergence in probability:

\[\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu \quad \text{as } n \to \infty\]

For any ε > 0:

\[P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| > \epsilon\right) \to 0\]

Strong Law of Large Numbers (SLLN):

Almost sure convergence:

\[P\left(\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n X_i = \mu\right) = 1\]

Key Difference:

Type Convergence Meaning
WLLN In probability For large n, probably close to μ
SLLN Almost surely Path converges to μ with prob 1

Visualization:

import numpy as np
import matplotlib.pyplot as plt

# 100 paths of cumulative averages
n = 10000
num_paths = 100

for _ in range(num_paths):
    # Fair coin flips (Bernoulli(0.5))
    flips = np.random.randint(0, 2, n)
    cumsum = np.cumsum(flips)
    cum_avg = cumsum / np.arange(1, n+1)

    plt.plot(cum_avg, alpha=0.1, color='blue')

plt.axhline(y=0.5, color='red', linestyle='--', label='μ = 0.5')
plt.xlabel('Number of flips')
plt.ylabel('Cumulative average')
plt.title('Strong Law: All paths converge')
plt.legend()
plt.show()

Intuition:

  • WLLN: At n=1000, most samples close to μ
  • SLLN: Each individual sequence eventually stays near μ forever

Requirements:

Both need: - Independent observations - Identically distributed - Finite mean μ

A simple Chebyshev-style proof of the WLLN only needs finite variance; the SLLN is the stronger result and implies the WLLN.

Interviewer's Insight

What they're testing: Theoretical understanding of convergence.

Strong answer signals:

  • Distinguishes convergence types
  • "SLLN is stronger than WLLN"
  • Mentions independence requirement
  • Can explain with simulation

What is Chebyshev's Inequality? When to Use It? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Concentration Inequality, Probability Bounds | Asked by: Amazon, Microsoft, Google

View Answer

Chebyshev's Inequality:

For any random variable X with finite mean μ and variance σ²:

\[P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}\]

Or equivalently:

\[P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}\]

Key Insight: Works for ANY distribution!

Examples:

# At least 75% of data within 2 std devs
k = 2
prob_within = 1 - 1/k**2  # 1 - 1/4 = 0.75 or 75%

# At least 88.9% within 3 std devs
k = 3
prob_within = 1 - 1/k**2  # 1 - 1/9 ≈ 0.889

# Compare to normal (68-95-99.7):
# Normal: 95% within 2σ
# Chebyshev: ≥75% within 2σ (works for ANY distribution!)

When to Use:

  1. Unknown distribution: Only know mean and variance
  2. Conservative bounds: Guaranteed bound for any distribution
  3. Worst-case analysis: Planning for extreme scenarios

Application - Sample Size:

# How many samples for X̄ within 0.1 of μ with 95% confidence?

# Want: P(|X̄ - μ| < 0.1) ≥ 0.95
# Chebyshev: P(|X̄ - μ| < kσ/√n) ≥ 1 - 1/k²

# Set: kσ/√n = 0.1 and 1 - 1/k² = 0.95
# → k² = 20, so k = 4.47
# If σ = 1: n = (k*σ/0.1)² = (4.47)²/0.01 ≈ 2000

import numpy as np

sigma = 1.0
epsilon = 0.1
confidence = 0.95

k = 1 / np.sqrt(1 - confidence)      # from 1 - 1/k² = 0.95  →  k ≈ 4.47
n = (k * sigma / epsilon)**2

print(f"Required sample size: {int(np.ceil(n))}")

Comparison:

k Chebyshev Bound Normal (if applicable)
1 ≥ 0% 68%
2 ≥ 75% 95%
3 ≥ 88.9% 99.7%

Chebyshev is conservative but universally applicable!

Interviewer's Insight

What they're testing: Understanding of probability bounds.

Strong answer signals:

  • "Works for ANY distribution"
  • Can apply to sample means
  • Knows it's conservative
  • Uses for worst-case analysis

What is Jensen's Inequality? Give Examples - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Convexity, Inequalities, Theory | Asked by: Google, Meta, Amazon

View Answer

Jensen's Inequality:

For convex function f:

\[f(E[X]) \leq E[f(X)]\]

For concave function f:

\[f(E[X]) \geq E[f(X)]\]

Intuition: Average of function ≥ function of average (if convex)

Example 1 - Variance:

# E[X²] ≥ (E[X])²
# Because f(x) = x² is convex

# This gives us: Var(X) = E[X²] - (E[X])² ≥ 0

Example 2 - Log:

# f(x) = log(x) is concave
# So: log(E[X]) ≥ E[log(X)]

import numpy as np

X = np.array([1, 2, 3, 4, 5])

left = np.log(np.mean(X))  # log(3) ≈ 1.099
right = np.mean(np.log(X))  # mean of [0, 0.69, 1.10, 1.39, 1.61] ≈ 0.958

print(f"log(E[X]) = {left:.3f}")
print(f"E[log(X)] = {right:.3f}")
print(f"Inequality holds: {left >= right}")  # True

Example 3 - Machine Learning (Cross-Entropy):

# KL divergence is always ≥ 0
# Proof uses Jensen on f(x) = -log(x):

# KL(P||Q) = Σ P(x) log(P(x)/Q(x))
#          = -Σ P(x) log(Q(x)/P(x))
#          ≥ -log(Σ P(x) · Q(x)/P(x))  [Jensen]
#          = -log(Σ Q(x))
#          = -log(1) = 0

Applications in Data Science:

  1. Prove variance ≥ 0
  2. Derive information inequalities
  3. Optimization (EM algorithm)
  4. Risk analysis (concave utility functions)

Interviewer's Insight

What they're testing: Advanced mathematical maturity.

Strong answer signals:

  • Knows convex vs concave
  • Can prove Var(X) ≥ 0
  • Mentions ML applications
  • Draws visual representation

Explain the Kullback-Leibler (KL) Divergence - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Information Theory, Divergence, ML | Asked by: Google, Meta, Amazon

View Answer

KL Divergence:

Measures "distance" from distribution Q to P:

\[D_{KL}(P||Q) = \sum_x P(x) \log\frac{P(x)}{Q(x)}\]

Or for continuous:

\[D_{KL}(P||Q) = \int p(x) \log\frac{p(x)}{q(x)} dx\]

Properties:

Property Value
Non-negative D_KL ≥ 0
Zero iff P=Q D_KL = 0 ⟺ P=Q
NOT symmetric D_KL(P‖Q) ≠ D_KL(Q‖P)
NOT a metric Doesn't satisfy triangle inequality

Example:

import numpy as np
from scipy.special import rel_entr

# Two distributions
P = np.array([0.1, 0.2, 0.7])
Q = np.array([0.3, 0.3, 0.4])

# KL divergence P || Q
kl_pq = np.sum(rel_entr(P, Q))
print(f"KL(P||Q) = {kl_pq:.4f}")  # 0.2393

# KL divergence Q || P  
kl_qp = np.sum(rel_entr(Q, P))
print(f"KL(Q||P) = {kl_qp:.4f}")  # 0.2582

# Not symmetric!
print(f"Symmetric? {np.isclose(kl_pq, kl_qp)}")  # False

Interpretation:

  • Information gain: Extra bits needed if using Q instead of P
  • Relative entropy: How much P diverges from Q
  • Surprise: Expected surprise if Q is true but we assume P

ML Applications:

# 1. Variational Autoencoders (VAE)
# Minimize KL between learned Q(z|x) and prior P(z)

# 2. Knowledge Distillation
# Match student Q to teacher P

# 3. Policy Gradient (RL)
# KL constraint on policy updates

# 4. Model Selection
# AIC/BIC based on KL divergence

Cross-Entropy Connection:

\[D_{KL}(P||Q) = H(P, Q) - H(P)\]

Where H(P,Q) is cross-entropy. Minimizing cross-entropy = minimizing KL divergence!

Interviewer's Insight

What they're testing: Information theory for ML.

Strong answer signals:

  • Knows D_KL ≥ 0 (Jensen)
  • NOT symmetric or metric
  • Links to cross-entropy
  • Mentions VAE/RL applications

What is the Poisson Process? Give Real-World Examples - Google, Amazon Interview Question

Difficulty: 🔴 Hard | Tags: Stochastic Processes, Poisson, Applications | Asked by: Google, Amazon, Microsoft

View Answer

Poisson Process:

Models random events occurring continuously over time with:

  1. Events occur independently
  2. Constant average rate λ (events per time unit)
  3. Two events don't occur at exactly same time

Key Results:

Quantity Distribution
N(t) = # events in [0,t] Poisson(λt)
T = time until first event Exponential(λ)
T_n = time until nth event Gamma(n, λ)
S = time between events Exponential(λ)

Example - Customer Arrivals:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson, expon

# λ = 5 customers per hour
lambda_rate = 5
t_max = 2  # 2 hours

# Simulate Poisson process
np.random.seed(42)

# Method 1: Generate inter-arrival times
arrivals = []
t = 0
while t < t_max:
    # Time to next customer ~ Exp(λ)
    dt = np.random.exponential(1/lambda_rate)
    t += dt
    if t < t_max:
        arrivals.append(t)

print(f"Total arrivals in {t_max} hours: {len(arrivals)}")
print(f"Expected: λt = {lambda_rate * t_max}")

# Visualize
plt.figure(figsize=(10, 4))
plt.eventplot(arrivals, lineoffsets=1, linelengths=0.5)
plt.xlim(0, t_max)
plt.xlabel('Time (hours)')
plt.title(f'Poisson Process (λ={lambda_rate}/hour)')
plt.yticks([])
plt.show()

Real-World Applications:

  • Customer service: call center arrivals, queue management
  • Infrastructure: equipment failures, server requests
  • Natural phenomena: radioactive decay, earthquake occurrences
  • Web analytics: page views, ad clicks

Interview Questions:

# Q: Server gets 10 requests/minute. 
# What's P(≥15 requests in next minute)?

from scipy.stats import poisson

lambda_rate = 10
k = 15

# P(X ≥ 15) = 1 - P(X ≤ 14)
p = 1 - poisson.cdf(14, lambda_rate)
print(f"P(X ≥ 15) = {p:.4f}")  # 0.0487

# Q: Average time between requests?
avg_time = 1 / lambda_rate  # 0.1 minutes = 6 seconds

Interviewer's Insight

What they're testing: Applied probability modeling.

Strong answer signals:

  • States 3 key properties
  • Links to exponential distribution
  • Gives relevant examples
  • Can calculate probabilities

What is a Memoryless Property? Which Distributions Have It? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Probability Properties, Exponential, Geometric | Asked by: Amazon, Microsoft, Google

View Answer

Memoryless Property:

For random variable X:

\[P(X > s + t | X > s) = P(X > t)\]

"Given you've waited s time, probability of waiting additional t is same as waiting t from start."

Only Two Distributions:

  1. Exponential (continuous)
  2. Geometric (discrete)

Exponential Example:

import numpy as np
from scipy.stats import expon

# Waiting time for bus: λ = 1/10 (avg 10 min)
lambda_rate = 0.1

s, t = 5, 5  # Already waited 5 min, what's P(wait 5+ more)?

# Direct calculation
p_conditional = expon.sf(s + t, scale=1/lambda_rate) / expon.sf(s, scale=1/lambda_rate)

# Memoryless: should equal P(X > 5)
p_unconditional = expon.sf(t, scale=1/lambda_rate)

print(f"P(X > 10 | X > 5) = {p_conditional:.4f}")
print(f"P(X > 5) = {p_unconditional:.4f}")
print(f"Memoryless? {np.isclose(p_conditional, p_unconditional)}")
# True: both ≈ 0.6065

Geometric Example:

# Rolling die until 6 appears
# Already rolled 3 times without 6
# What's P(need 5+ more rolls)?

from scipy.stats import geom

p = 1/6  # P(6 on single roll)

s, t = 3, 5

# P(X > 8 | X > 3) = P(X > 5)
p_conditional = geom.sf(s + t, p) / geom.sf(s, p)
p_unconditional = geom.sf(t, p)

print(f"Conditional: {p_conditional:.4f}")
print(f"Unconditional: {p_unconditional:.4f}")
# Both ≈ 0.4019

Why Important?

| Context | Implication |
|---|---|
| Queues | Waiting time doesn't depend on time already waited |
| Reliability | Equipment failure rate constant over time |
| Modeling | Simplifies calculations dramatically |

Counter-Example (NOT memoryless):

# Normal distribution is NOT memoryless
# If X ~ N(100, 15), knowing X > 90 changes the distribution of the remainder

import numpy as np
from scipy.stats import norm

s, t = 90, 20

# Memorylessness would require P(X > s+t | X > s) = P(X > t); it fails here:
p1 = norm.sf(s + t, 100, 15) / norm.sf(s, 100, 15)   # P(X > 110 | X > 90)
p2 = norm.sf(t, 100, 15)                             # P(X > 20), essentially 1
print(f"Normal memoryless? {np.isclose(p1, p2)}")  # False

Interviewer's Insight

What they're testing: Deep distribution knowledge.

Strong answer signals:

  • "Only exponential and geometric"
  • Explains with waiting time
  • Can prove mathematically
  • Knows why it matters (simplification)

Explain the Difference Between Probability and Odds - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Fundamentals, Odds, Probability | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Probability:

\[P(A) = \frac{\text{# favorable outcomes}}{\text{# total outcomes}} \quad \text{(for equally likely outcomes)}\]

Range: [0, 1]

Odds:

\[\text{Odds}(A) = \frac{P(A)}{1 - P(A)} = \frac{P(A)}{P(A^c)}\]

Range: [0, ∞)

Conversion:

# Probability → Odds
p = 0.75
odds = p / (1 - p)  # 0.75/0.25 = 3
print(f"Probability {p} = Odds {odds}:1")

# Odds → Probability
odds = 3
p = odds / (1 + odds)  # 3/4 = 0.75
print(f"Odds {odds}:1 = Probability {p}")

Examples:

| Scenario | Probability | Odds | Odds Notation |
|---|---|---|---|
| Coin flip (heads) | 0.5 | 1 | 1:1 or "even" |
| Roll 6 on die | ⅙ ≈ 0.167 | ⅕ = 0.2 | 1:5 or "5 to 1 against" |
| Disease prevalence 1% | 0.01 | 0.0101 | 1:99 |
| Rain 80% | 0.8 | 4 | 4:1 |

Why Odds in Logistic Regression:

# Logistic regression models log-odds:
# log(p/(1-p)) = β₀ + β₁x₁ + ...

import numpy as np

# Example
beta_0 = -2
beta_1 = 0.5
x = 6

# Log-odds
log_odds = beta_0 + beta_1 * x  # -2 + 0.5*6 = 1

# Convert to probability
odds = np.exp(log_odds)  # e^1 ≈ 2.718
p = odds / (1 + odds)     # 2.718/3.718 ≈ 0.731

print(f"Log-odds: {log_odds}")
print(f"Odds: {odds:.3f}")
print(f"Probability: {p:.3f}")

Betting Example:

  • Odds 5:1 against means bet $1 to win $5
  • Implies probability = 1/(5+1) = ⅙ ≈ 0.167
  • Odds 1:2 for means bet $2 to win $1
  • Implies probability = 2/(1+2) = ⅔ ≈ 0.667

Interviewer's Insight

What they're testing: Basic probability literacy.

Strong answer signals:

  • Clear formula for both
  • Can convert between them
  • Mentions logistic regression connection
  • Explains betting context

What is the Gambler's Fallacy vs Hot Hand Fallacy? - Meta, Google Interview Question

Difficulty: 🟢 Easy | Tags: Cognitive Bias, Independence, Misconceptions | Asked by: Meta, Google, Amazon

View Answer

Gambler's Fallacy:

Believing that past independent events affect future probabilities.

"Red came up 5 times, black is 'due' now!"

Hot Hand Fallacy:

Believing that success/failure streaks will continue.

"I made 5 baskets in a row, I'm on fire!"

Why They're Wrong:

For independent events, each trial has the same probability regardless of past outcomes.

# Coin flips
# After 5 heads: P(6th is heads) = 0.5
# NOT higher (hot hand) or lower (gambler's fallacy)

import numpy as np

# Simulation
flips = np.random.randint(0, 2, 100000)

# Find all positions after 5 consecutive heads
streak_positions = []
for i in range(5, len(flips)):
    if all(flips[i-5:i] == 1):  # 5 heads
        streak_positions.append(i)

# What happens next?
if len(streak_positions) > 0:
    next_flips = flips[streak_positions]
    prob_heads = np.mean(next_flips)
    print(f"P(heads after 5 heads) = {prob_heads:.3f}")
    # ≈ 0.5, not different!

Examples:

| Scenario | Fallacy | Reality |
|---|---|---|
| Roulette: 10 reds in a row | "Black is due!" | Still 18/37 ≈ 0.486 |
| Lottery: Same numbers twice | "Won't repeat!" | Same 1-in-millions chance |
| Basketball: 5 made shots | "On fire, keep shooting!" | Might be, if skill varies |

When Hot Hand is REAL:

  • Not independent: Basketball (confidence, defense adjustment)
  • Changing conditions: Weather in sports, market trends
  • Adaptive systems: Video games (difficulty adjustment)

In Data Science:

# A/B test: first 100 users show lift
# Gambler's fallacy: "Next 100 will reverse"
# Reality: If real effect, will persist

# Stock trading: 5 winning trades
# Hot hand: "I'm skilled, bet bigger"
# Reality: Check if strategy or luck

Interviewer's Insight

What they're testing: Understanding independence vs dependence.

Strong answer signals:

  • Distinguishes both fallacies clearly
  • "For independent events..."
  • Knows when hot hand IS real
  • Gives data science examples

What is a Martingale? Give an Example - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Stochastic Processes, Martingale, Finance | Asked by: Google, Meta, Amazon

View Answer

Martingale:

Sequence of random variables {X₀, X₁, X₂, ...} where:

\[E[X_{n+1} | X_0, ..., X_n] = X_n\]

Intuition: Expected future value = current value (given history)

"Fair game" - no expected gain or loss.

Example 1 - Fair Coin Toss:

import numpy as np
import matplotlib.pyplot as plt

# Start with $100, bet $1 per flip
# Win $1 if heads, lose $1 if tails

n_flips = 1000
n_paths = 10

for _ in range(n_paths):
    wealth = [100]
    for _ in range(n_flips):
        flip = np.random.choice([-1, 1])
        wealth.append(wealth[-1] + flip)

    plt.plot(wealth, alpha=0.5)

plt.axhline(y=100, color='red', linestyle='--', label='Starting value')
plt.xlabel('Flip number')
plt.ylabel('Wealth')
plt.title('Martingale: Fair Coin Betting')
plt.legend()
plt.show()

# E[wealth_n | wealth_0, ..., wealth_{n-1}] = wealth_{n-1}

Example 2 - Random Walk:

# S_n = X_1 + X_2 + ... + X_n
# where X_i are independent with E[X_i] = 0

# This is a martingale:
# E[S_{n+1} | S_n] = S_n + E[X_{n+1}] = S_n + 0 = S_n

Properties:

  1. Optional Stopping Theorem: E[X_τ] = E[X_0] for a stopping time τ (under boundedness conditions); intuitively, "you can't beat the house with any strategy" (a quick simulation check follows below)
  2. Martingale Convergence Theorem: bounded martingales converge
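
A quick simulation check of the optional stopping theorem (a minimal sketch: fair ±$1 game starting at $100, stopped when wealth first hits $90 or $120, a bounded stopping time):

import numpy as np

np.random.seed(0)
n_runs = 20_000
finals = []

for _ in range(n_runs):
    wealth = 100
    while wealth not in (90, 120):        # stopping time: first exit from (90, 120)
        wealth += np.random.choice([-1, 1])
    finals.append(wealth)

print(f"E[X_tau] ≈ {np.mean(finals):.2f}")                              # ≈ 100 = X_0
print(f"P(reach 120 first) ≈ {np.mean(np.array(finals) == 120):.3f}")   # ≈ (100-90)/(120-90) = 1/3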

Not a Martingale:

# Unfair coin: P(heads) = 0.6
# Bet $1, win $1 if heads, lose $1 if tails

# E[W_{n+1} | W_n] = W_n + 0.6*1 + 0.4*(-1)
#                   = W_n + 0.2 ≠ W_n

# This is a SUB-martingale (expected increase)

Applications:

| Field | Example |
|---|---|
| Finance | Stock prices (efficient market hypothesis) |
| Gambling | Betting strategies analysis |
| Statistics | Sequential analysis |
| Machine Learning | Stochastic gradient descent analysis |

Interviewer's Insight

What they're testing: Advanced probability/finance knowledge.

Strong answer signals:

  • E[X_{n+1}|history] = X_n
  • "Fair game" intuition
  • Mentions random walk
  • Optional stopping theorem

Explain the Wald's Equation (Wald's Identity) - Amazon, Microsoft Interview Question

Difficulty: 🔴 Hard | Tags: Random Sums, Theory, Expectations | Asked by: Amazon, Microsoft, Google

View Answer

Wald's Equation:

If X₁, X₂, ... are i.i.d. with E[Xᵢ] = μ, and N is a stopping time with E[N] < ∞:

\[E\left[\sum_{i=1}^N X_i\right] = E[N] \cdot E[X]\]

Intuition: Expected sum = (expected # terms) × (expected value per term)

Key Requirement: N must be a stopping time (decision to stop at n only uses X₁,...,Xₙ)

Example 1 - Gambling:

# Play until you win (or 100 games)
# Each game: win $5 with p=0.3, lose $2 with p=0.7

import numpy as np

p_win = 0.3
win_amount = 5
lose_amount = -2

# E[X] per game
E_X = p_win * win_amount + (1 - p_win) * lose_amount
print(f"E[X per game] = ${E_X:.2f}")  # $0.10

# Play until first win (N ~ Geometric)
E_N = 1 / p_win  # 3.33 games
print(f"E[N games] = {E_N:.2f}")

# Total expected winnings (Wald's)
E_total = E_N * E_X
print(f"E[Total] = ${E_total:.2f}")  # $0.33

# Verify with simulation
simulations = []
for _ in range(10000):
    total = 0
    n = 0
    while np.random.rand() > p_win and n < 100:
        total += lose_amount
        n += 1
    total += win_amount  # Final win
    simulations.append(total)

print(f"Simulated E[Total] = ${np.mean(simulations):.2f}")

Example 2 - Quality Control:

# Inspect items until 3rd defect
# Each inspection costs $10
# P(defective) = 0.05

p_defect = 0.05
cost_per_inspection = 10
target_defects = 3

# N ~ Negative Binomial
# E[N] = target_defects / p_defect
E_N = target_defects / p_defect  # 60 inspections

# E[Cost] = E[N] × cost
E_cost = E_N * cost_per_inspection
print(f"Expected cost: ${E_cost:.2f}")  # $600

Why Important:

  • Extends linearity of expectation to random # terms
  • Applies to many real scenarios (queues, sequential sampling)
  • Foundation for renewal theory

Violations (when Wald's fails):

  • N is not a stopping time
  • X's are not i.i.d.
  • E[N] is infinite
  • N depends on future X's

Interviewer's Insight

What they're testing: Advanced expectation theory.

Strong answer signals:

  • E[sum] = E[N]·E[X]
  • "N must be stopping time"
  • Applies to sequential problems
  • Can calculate for geometric/negative binomial

What is Rejection Sampling? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Monte Carlo, Sampling, Simulation | Asked by: Google, Amazon, Meta

View Answer

Rejection Sampling:

Method to sample from difficult distribution f(x) using easy distribution g(x):

  1. Find M where f(x) ≤ M·g(x) for all x
  2. Sample x ~ g(x)
  3. Sample u ~ Uniform(0, 1)
  4. Accept x if u ≤ f(x)/(M·g(x)), otherwise reject and repeat

Example - Beta Distribution:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, uniform

# Target: Beta(2, 5)
target = beta(2, 5)

# Proposal: Uniform(0, 1)
proposal = uniform(0, 1)

# Find M: max of f(x)/g(x)
x_grid = np.linspace(0, 1, 1000)
f_vals = target.pdf(x_grid)
g_vals = proposal.pdf(x_grid)  # All 1's
M = np.max(f_vals / g_vals)

# Rejection sampling
samples = []
attempts = 0

while len(samples) < 1000:
    # Sample from proposal
    x = proposal.rvs()
    u = np.random.rand()

    # Accept/reject
    if u <= target.pdf(x) / (M * proposal.pdf(x)):
        samples.append(x)
    attempts += 1

acceptance_rate = len(samples) / attempts
print(f"Acceptance rate: {acceptance_rate:.2%}")

# Visualize
plt.hist(samples, bins=30, density=True, alpha=0.6, label='Samples')
plt.plot(x_grid, target.pdf(x_grid), 'r-', lw=2, label='Target Beta(2,5)')
plt.legend()
plt.show()

Example - Sampling from Complex Distribution:

# Target: f(x) = c·x²·exp(-x) for x > 0
# Use exponential proposal: g(x) = λ·exp(-λx)

def target_unnormalized(x):
    return x**2 * np.exp(-x)

# Proposal: Exponential(λ=0.5). With λ < 1 the ratio f/g = x²·exp(-(1-λ)x) stays
# bounded, so a finite M with f ≤ M·g exists (λ=1 would give f/g = x², unbounded)
lambda_rate = 0.5

# Find M
from scipy.optimize import minimize_scalar
result = minimize_scalar(
    lambda x: -target_unnormalized(x) / (lambda_rate * np.exp(-lambda_rate * x)),
    bounds=(0, 10),
    method='bounded'
)
M = -result.fun

# Sample
samples = []
for _ in range(10000):
    x = np.random.exponential(1/lambda_rate)
    u = np.random.rand()

    g_x = lambda_rate * np.exp(-lambda_rate * x)
    if u <= target_unnormalized(x) / (M * g_x):
        samples.append(x)

plt.hist(samples, bins=50, density=True)
plt.title('Samples from x²·exp(-x)')
plt.show()

Efficiency:

  • Acceptance rate = 1/M (for a normalized target density)
  • Want M as small as possible
  • Choose g(x) similar to f(x)

When to Use:

  • f(x) known up to normalizing constant
  • Can't sample from f(x) directly
  • Low-dimensional (high-d needs MCMC)

Interviewer's Insight

What they're testing: Sampling method knowledge.

Strong answer signals:

  • Explains accept/reject mechanism
  • Knows acceptance rate = 1/M
  • Mentions need for good proposal
  • Can implement from scratch

Explain Importance Sampling - Meta, Google Interview Question

Difficulty: 🔴 Hard | Tags: Monte Carlo, Variance Reduction, Sampling | Asked by: Meta, Google, Amazon

View Answer

Importance Sampling:

Estimate E_f[h(X)] by sampling from different distribution g(x):

\[E_f[h(X)] = \int h(x) f(x) dx = \int h(x) \frac{f(x)}{g(x)} g(x) dx = E_g\left[h(X) \frac{f(X)}{g(X)}\right]\]

Algorithm:

  1. Sample X₁,...,Xₙ ~ g(x)
  2. Compute weights: wᵢ = f(Xᵢ)/g(Xᵢ)
  3. Estimate: \(\hat{\theta} = \frac{1}{n}\sum_{i=1}^n h(X_i) w_i\)

Example - Rare Event Probability:

import numpy as np
from scipy.stats import norm

# Estimate P(X > 5) where X ~ N(0,1)
# This is rare: P(X > 5) ≈ 2.87×10⁻⁷

# Method 1: Direct sampling (poor)
samples = np.random.normal(0, 1, 1000000)
estimate_direct = np.mean(samples > 5)
print(f"Direct: {estimate_direct:.2e}")
# Often gives 0!

# Method 2: Importance sampling
# Use g(x) = N(5, 1) to focus on rare region

n = 10000
samples_g = np.random.normal(5, 1, n)

# Indicator function
h = (samples_g > 5).astype(float)

# Importance weights: f(x)/g(x)
f_vals = norm.pdf(samples_g, 0, 1)
g_vals = norm.pdf(samples_g, 5, 1)
weights = f_vals / g_vals

estimate_importance = np.mean(h * weights)
print(f"Importance sampling: {estimate_importance:.2e}")

# True value
true_value = 1 - norm.cdf(5, 0, 1)
print(f"True value: {true_value:.2e}")

Variance Comparison:

# Run multiple trials
n_trials = 1000
direct_estimates = []
importance_estimates = []

for _ in range(n_trials):
    # Direct
    samp = np.random.normal(0, 1, 10000)
    direct_estimates.append(np.mean(samp > 5))

    # Importance
    samp_g = np.random.normal(5, 1, 10000)
    h = (samp_g > 5).astype(float)
    w = norm.pdf(samp_g, 0, 1) / norm.pdf(samp_g, 5, 1)
    importance_estimates.append(np.mean(h * w))

print(f"Direct variance: {np.var(direct_estimates):.2e}")
print(f"Importance variance: {np.var(importance_estimates):.2e}")
# Importance sampling has much lower variance!

Choosing Good g(x):

| Criterion | Guideline |
|---|---|
| Coverage | g(x) > 0 wherever f(x)·h(x) > 0 |
| Similarity | g(x) similar in shape to f(x)·h(x) |
| Heavy tails | g(x) should have heavier tails than f(x) |

Applications:

  • Rare event estimation (finance, reliability)
  • Bayesian computation
  • Reinforcement learning (off-policy evaluation)

Interviewer's Insight

What they're testing: Advanced Monte Carlo knowledge.

Strong answer signals:

  • Formula with f(x)/g(x) ratio
  • "Reduce variance for rare events"
  • Knows good g needs heavy tails
  • Can implement and compare variance

What is the Inverse Transform Method? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Random Generation, CDF, Simulation | Asked by: Google, Amazon, Microsoft

View Answer

Inverse Transform Method:

Generate samples from distribution F(x) using:

  1. Generate U ~ Uniform(0,1)
  2. Return X = F⁻¹(U)

Why it works: P(X ≤ x) = P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x) ✓

Example - Exponential Distribution:

import numpy as np
import matplotlib.pyplot as plt

# Generate Exp(λ=0.5) samples
lambda_rate = 0.5

# Method 1: Using inverse CDF
u = np.random.uniform(0, 1, 10000)

# CDF: F(x) = 1 - e^(-λx)
# Inverse: F^(-1)(u) = -log(1-u)/λ
x = -np.log(1 - u) / lambda_rate

# Method 2: Built-in (for comparison)
x_builtin = np.random.exponential(1/lambda_rate, 10000)

# Compare
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(x, bins=50, density=True, alpha=0.6, label='Inverse transform')
plt.hist(x_builtin, bins=50, density=True, alpha=0.6, label='Built-in')
plt.legend()
plt.title('Generated samples')

plt.subplot(1, 2, 2)
from scipy.stats import expon
plt.plot(np.sort(x), np.linspace(0, 1, len(x)), label='Generated')
x_theory = np.linspace(0, 10, 1000)
plt.plot(x_theory, expon.cdf(x_theory, scale=1/lambda_rate), 
         'r--', label='Theoretical')
plt.legend()
plt.title('CDF comparison')
plt.show()

Example - Custom Distribution:

# Generate from triangular distribution on [0,1]
# PDF: f(x) = 2x for x in [0,1]
# CDF: F(x) = x²
# Inverse: F^(-1)(u) = √u

u = np.random.uniform(0, 1, 10000)
x = np.sqrt(u)

plt.hist(x, bins=50, density=True, label='Generated')
x_theory = np.linspace(0, 1, 100)
plt.plot(x_theory, 2*x_theory, 'r-', lw=2, label='True PDF: 2x')
plt.legend()
plt.title('Triangular Distribution')
plt.show()

Example - Discrete Distribution:

# Roll a weighted die
# P(1)=0.1, P(2)=0.2, P(3)=0.3, P(4)=0.25, P(5)=0.1, P(6)=0.05

probs = [0.1, 0.2, 0.3, 0.25, 0.1, 0.05]
cdf = np.cumsum(probs)  # [0.1, 0.3, 0.6, 0.85, 0.95, 1.0]

def weighted_die():
    u = np.random.uniform()
    for i, c in enumerate(cdf):
        if u <= c:
            return i + 1

# Generate 10000 rolls
rolls = [weighted_die() for _ in range(10000)]

# Verify
from collections import Counter
counts = Counter(rolls)
for face in range(1, 7):
    observed = counts[face] / 10000
    expected = probs[face-1]
    print(f"Face {face}: Observed={observed:.3f}, Expected={expected:.3f}")

When to Use:

| Pros | Cons |
|---|---|
| Exact samples (not approximate) | Need closed-form F⁻¹(u) |
| Fast if F⁻¹ is simple | Doesn't work for complex F |
| No tuning needed | Need to derive the inverse |

Interviewer's Insight

What they're testing: Random generation fundamentals.

Strong answer signals:

  • X = F⁻¹(U) formula
  • Can prove why it works
  • Implements for exponential
  • Knows when it's practical

What is Box-Muller Transform? - Amazon, Microsoft Interview Question

Difficulty: 🟡 Medium | Tags: Normal Generation, Transformation, Simulation | Asked by: Amazon, Microsoft, Google

View Answer

Box-Muller Transform:

Generate two independent N(0,1) from two independent U(0,1):

\[Z_0 = \sqrt{-2\ln(U_1)} \cos(2\pi U_2)\]
\[Z_1 = \sqrt{-2\ln(U_1)} \sin(2\pi U_2)\]

Implementation:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def box_muller(n):
    """Generate n pairs of independent N(0,1) samples"""
    # Generate uniform samples
    u1 = np.random.uniform(0, 1, n)
    u2 = np.random.uniform(0, 1, n)

    # Box-Muller transform
    r = np.sqrt(-2 * np.log(u1))
    theta = 2 * np.pi * u2

    z0 = r * np.cos(theta)
    z1 = r * np.sin(theta)

    return z0, z1

# Generate samples
z0, z1 = box_muller(10000)

# Verify normality
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram Z0
axes[0, 0].hist(z0, bins=50, density=True, alpha=0.6)
x = np.linspace(-4, 4, 100)
axes[0, 0].plot(x, norm.pdf(x), 'r-', lw=2)
axes[0, 0].set_title('Z0 Distribution')

# Histogram Z1
axes[0, 1].hist(z1, bins=50, density=True, alpha=0.6)
axes[0, 1].plot(x, norm.pdf(x), 'r-', lw=2)
axes[0, 1].set_title('Z1 Distribution')

# Q-Q plot
from scipy import stats
stats.probplot(z0, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot Z0')

# 2D scatter (independence check)
axes[1, 1].scatter(z0, z1, alpha=0.1, s=1)
axes[1, 1].set_xlabel('Z0')
axes[1, 1].set_ylabel('Z1')
axes[1, 1].set_title('Independence check')
axes[1, 1].axis('equal')

plt.tight_layout()
plt.show()

# Verify mean and variance
print(f"Z0: mean={np.mean(z0):.3f}, std={np.std(z0):.3f}")
print(f"Z1: mean={np.mean(z1):.3f}, std={np.std(z1):.3f}")
print(f"Correlation: {np.corrcoef(z0, z1)[0,1]:.3f}")

Why It Works:

Uses polar coordinates (R, Θ) in 2D:

  • R² = X² + Y² ~ Exponential(rate ½) for X, Y ~ N(0, 1)
  • R² = -2·ln(U₁) has exactly that distribution
  • Θ ~ Uniform(0, 2π) comes from U₂
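
A sketch of the change-of-variables argument: with \(R = \sqrt{-2\ln U_1}\) and \(\Theta = 2\pi U_2\),

\[P(R > r) = P\left(U_1 < e^{-r^2/2}\right) = e^{-r^2/2}, \qquad \Theta \sim \text{Uniform}(0, 2\pi),\]

so converting \((Z_0, Z_1) = (R\cos\Theta, R\sin\Theta)\) to Cartesian coordinates gives the joint density

\[f(z_0, z_1) = \frac{1}{2\pi} e^{-(z_0^2 + z_1^2)/2} = \left(\frac{1}{\sqrt{2\pi}} e^{-z_0^2/2}\right)\left(\frac{1}{\sqrt{2\pi}} e^{-z_1^2/2}\right),\]

which factors into two independent N(0, 1) densities.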

Polar Form (more efficient):

def box_muller_polar(n):
    """Marsaglia polar method - faster"""
    z0, z1 = [], []

    while len(z0) < n:
        # Generate in unit circle
        u1 = np.random.uniform(-1, 1)
        u2 = np.random.uniform(-1, 1)
        s = u1**2 + u2**2

        # Reject if outside circle
        if s >= 1 or s == 0:
            continue

        # Transform
        factor = np.sqrt(-2 * np.log(s) / s)
        z0.append(u1 * factor)
        z1.append(u2 * factor)

    return np.array(z0[:n]), np.array(z1[:n])

# Compare efficiency
import time

start = time.time()
z0, z1 = box_muller(100000)
time_basic = time.time() - start

start = time.time()
z0, z1 = box_muller_polar(100000)
time_polar = time.time() - start

print(f"Basic: {time_basic:.3f}s")
print(f"Polar: {time_polar:.3f}s")

Applications:

  • Monte Carlo simulations
  • Generate multivariate normal (with Cholesky)
  • Random initialization in ML

Interviewer's Insight

What they're testing: Practical random generation.

Strong answer signals:

  • Formulas with √(-2ln) and 2π
  • "Generates TWO independent normals"
  • Mentions polar form as optimization
  • Knows why: polar coordinates

Explain the Alias Method for Discrete Sampling - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Discrete Sampling, Algorithms, Efficiency | Asked by: Google, Meta, Amazon

View Answer

Alias Method:

Sample from discrete distribution in O(1) time after O(n) preprocessing.

Problem: Sample from {x₁,...,xₙ} with probabilities {p₁,...,pₙ}

Idea: Split each probability into two "aliases" to create uniform structure.

Algorithm:

import numpy as np

class AliasMethod:
    def __init__(self, probs):
        """
        Setup alias method for discrete distribution
        probs: array of probabilities (must sum to 1)
        """
        n = len(probs)
        self.n = n
        self.prob = np.zeros(n)
        self.alias = np.zeros(n, dtype=int)

        # Scale probabilities
        scaled = np.array(probs) * n

        # Separate into small and large
        small = []
        large = []
        for i, p in enumerate(scaled):
            if p < 1:
                small.append(i)
            else:
                large.append(i)

        # Build tables
        while small and large:
            s = small.pop()
            l = large.pop()

            self.prob[s] = scaled[s]
            self.alias[s] = l

            # Update large probability
            scaled[l] = scaled[l] - (1 - scaled[s])

            if scaled[l] < 1:
                small.append(l)
            else:
                large.append(l)

        # Remaining probabilities
        while large:
            l = large.pop()
            self.prob[l] = 1.0

        while small:
            s = small.pop()
            self.prob[s] = 1.0

    def sample(self):
        """Generate single sample in O(1)"""
        # Pick random bin
        i = np.random.randint(self.n)

        # Flip biased coin
        if np.random.rand() < self.prob[i]:
            return i
        else:
            return self.alias[i]

# Example: Weighted die
probs = [0.1, 0.2, 0.3, 0.25, 0.1, 0.05]
sampler = AliasMethod(probs)

# Generate samples
samples = [sampler.sample() for _ in range(100000)]

# Verify
from collections import Counter
counts = Counter(samples)
print("Face | Observed | Expected")
for i in range(6):
    obs = counts[i] / 100000
    exp = probs[i]
    print(f"{i+1:4d} | {obs:8.3f} | {exp:8.3f}")

Complexity:

| Operation | Time |
|---|---|
| Setup | O(n) |
| Single sample | O(1) |
| k samples | O(k) |

Comparison:

import time

# Method 1: Linear search (naive)
def naive_sample(probs, k=10000):
    cdf = np.cumsum(probs)
    samples = []
    for _ in range(k):
        u = np.random.rand()
        for i, c in enumerate(cdf):
            if u <= c:
                samples.append(i)
                break
    return samples

# Method 2: Alias method
def alias_sample(probs, k=10000):
    sampler = AliasMethod(probs)
    return [sampler.sample() for _ in range(k)]

probs = [0.1, 0.2, 0.3, 0.25, 0.1, 0.05]
k = 100000

start = time.time()
s1 = naive_sample(probs, k)
time_naive = time.time() - start

start = time.time()
s2 = alias_sample(probs, k)
time_alias = time.time() - start

print(f"Naive:  {time_naive:.3f}s (O(nk))")
print(f"Alias:  {time_alias:.3f}s (O(n+k))")
print(f"Speedup: {time_naive/time_alias:.1f}x")

When to Use:

  • Need many samples from same distribution
  • Distribution doesn't change
  • Want guaranteed O(1) per sample

Interviewer's Insight

What they're testing: Advanced algorithms knowledge.

Strong answer signals:

  • "O(1) sampling after O(n) setup"
  • Explains alias table concept
  • Can implement from scratch
  • Knows use case: many samples

What is Stratified Sampling? When to Use It? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Sampling Methods, Variance Reduction, Survey | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Stratified Sampling:

Divide population into homogeneous subgroups (strata), then sample from each stratum.

\[\bar{X}_{st} = \sum_{h=1}^H W_h \bar{X}_h\]

Where:

  • H = number of strata
  • W_h = proportion of population in stratum h
  • \(\bar{X}_h\) = sample mean from stratum h

Variance:

\[Var(\bar{X}_{st}) = \sum_{h=1}^H W_h^2 \frac{\sigma_h^2}{n_h}\]

With proportional allocation, stratified sampling never has higher variance than simple random sampling; the gain is largest when the strata means differ.

Example - A/B Test by Country:

import numpy as np
import pandas as pd

# Population: users from 3 countries with different conversion rates
np.random.seed(42)

population = pd.DataFrame({
    'country': ['US']*5000 + ['UK']*3000 + ['CA']*2000,
    'converted': (
        list(np.random.binomial(1, 0.10, 5000)) +  # US: 10%
        list(np.random.binomial(1, 0.15, 3000)) +  # UK: 15%
        list(np.random.binomial(1, 0.08, 2000))    # CA: 8%
    )
})

# True overall conversion
true_rate = population['converted'].mean()
print(f"True conversion: {true_rate:.3%}")

# Method 1: Simple random sampling
n_trials = 1000
srs_estimates = []

for _ in range(n_trials):
    sample = population.sample(n=300)
    srs_estimates.append(sample['converted'].mean())

# Method 2: Stratified sampling
stratified_estimates = []

for _ in range(n_trials):
    samples = []
    # Sample proportionally from each stratum
    samples.append(population[population['country']=='US'].sample(n=150))  # 50%
    samples.append(population[population['country']=='UK'].sample(n=90))   # 30%
    samples.append(population[population['country']=='CA'].sample(n=60))   # 20%

    sample = pd.concat(samples)
    stratified_estimates.append(sample['converted'].mean())

# Compare
print(f"\nSimple Random Sampling:")
print(f"  Mean: {np.mean(srs_estimates):.3%}")
print(f"  Std:  {np.std(srs_estimates):.3%}")

print(f"\nStratified Sampling:")
print(f"  Mean: {np.mean(stratified_estimates):.3%}")
print(f"  Std:  {np.std(stratified_estimates):.3%}")

variance_reduction = 1 - np.var(stratified_estimates)/np.var(srs_estimates)
print(f"\nVariance reduction: {variance_reduction:.1%}")

Optimal Allocation (Neyman):

# Allocate samples proportional to σ_h * N_h

def optimal_allocation(strata_sizes, strata_stds, total_n):
    """
    Neyman optimal allocation
    Returns samples per stratum
    """
    products = [n * s for n, s in zip(strata_sizes, strata_stds)]
    total_product = sum(products)

    return [int(total_n * p / total_product) for p in products]

# Example
N = [5000, 3000, 2000]  # Stratum sizes
sigma = [0.3, 0.36, 0.27]  # Stratum std devs
total_n = 300

# Proportional allocation
prop_alloc = [int(300 * n/sum(N)) for n in N]
print(f"Proportional: {prop_alloc}")

# Optimal allocation
opt_alloc = optimal_allocation(N, sigma, total_n)
print(f"Optimal: {opt_alloc}")

When to Use:

| Use Case | Benefit |
|---|---|
| Heterogeneous population | Reduce variance |
| Subgroup analysis | Ensure representation |
| Rare subgroups | Oversample minorities |
| Known stratification | Leverage prior knowledge |

Interviewer's Insight

What they're testing: Practical sampling knowledge.

Strong answer signals:

  • "Divide into homogeneous strata"
  • Lower variance than SRS
  • Mentions Neyman allocation
  • Real examples (A/B tests, surveys)

What is the Coupon Collector's Variance? - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Coupon Collector, Variance, Theory | Asked by: Google, Meta, Amazon

View Answer

Coupon Collector Problem:

How many draws (with replacement) to collect all n distinct coupons?

Expected Value:

\[E[T] = n \sum_{i=1}^n \frac{1}{i} = n \cdot H_n \approx n \ln(n)\]

Where H_n is harmonic number.

Variance:

\[Var(T) = n^2 \sum_{i=1}^n \frac{1}{i^2} - n \sum_{i=1}^n \frac{1}{i}\]
\[Var(T) \approx \frac{\pi^2}{6} n^2 - n \ln n\]

For large n: \(Var(T) \approx n^2 \frac{\pi^2}{6}\)

Simulation:

import numpy as np
import matplotlib.pyplot as plt

def coupon_collector(n_coupons):
    """Simulate single coupon collecting run"""
    collected = set()
    draws = 0

    while len(collected) < n_coupons:
        coupon = np.random.randint(0, n_coupons)
        collected.add(coupon)
        draws += 1

    return draws

# Simulate for different n
n_values = [10, 20, 50, 100]

for n in n_values:
    # Run simulations
    trials = [coupon_collector(n) for _ in range(10000)]

    # Theoretical
    harmonic = sum(1/i for i in range(1, n+1))
    E_T_theory = n * harmonic

    # Var(T) = n²·sum(1/i²) - n·H_n  (sum of the geometric variances)
    var_theory = n**2 * sum(1/i**2 for i in range(1, n+1)) - E_T_theory

    # Empirical
    E_T_sim = np.mean(trials)
    var_sim = np.var(trials)

    print(f"n={n}:")
    print(f"  E[T]: Theory={E_T_theory:.1f}, Sim={E_T_sim:.1f}")
    print(f"  Var(T): Theory={var_theory:.1f}, Sim={var_sim:.1f}")
    print()

# Visualize distribution for n=50
n = 50
trials = [coupon_collector(n) for _ in range(10000)]

plt.hist(trials, bins=50, density=True, alpha=0.6, edgecolor='black')
plt.axvline(np.mean(trials), color='red', linestyle='--', 
            label=f'Mean={np.mean(trials):.0f}')
plt.xlabel('Number of draws')
plt.ylabel('Density')
plt.title(f'Coupon Collector Distribution (n={n})')
plt.legend()
plt.show()

Decomposition:

Let T_i = draws to get ith new coupon (given i-1 collected)

# T_i ~ Geometric(p_i) where p_i = (n-i+1)/n

n = 50

for i in range(1, 6):
    p_i = (n - i + 1) / n
    E_Ti = 1 / p_i
    Var_Ti = (1 - p_i) / p_i**2

    print(f"Coupon {i}:")
    print(f"  p={p_i:.3f}, E[T_{i}]={E_Ti:.2f}, Var(T_{i})={Var_Ti:.2f}")

Applications:

  1. Hash collisions: How many items until collision?
  2. Testing: How many tests to cover all branches?
  3. Matching problems: Collecting pairs/sets

Follow-up Questions:

# Q1: Expected draws to collect m < n coupons?
def expected_m_coupons(n, m):
    return n * sum(1/(n-i) for i in range(m))

# Q2: Probability of collecting all in k draws?
# Use inclusion-exclusion principle (complex!)
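
# A sketch of Q2 via inclusion-exclusion over the events "coupon j never drawn":
# P(T <= k) = sum_{j=0}^{n} (-1)^j * C(n, j) * ((n - j)/n)^k
from math import comb

def prob_all_collected(n, k):
    """P(all n coupons appear within k draws)."""
    return sum((-1)**j * comb(n, j) * ((n - j) / n)**k for j in range(n + 1))

print(f"P(T <= 225) for n=50: {prob_all_collected(50, 225):.3f}")  # 225 ≈ n·H_n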

Interviewer's Insight

What they're testing: Deep combinatorics + probability.

Strong answer signals:

  • E[T] = n·H_n formula
  • Knows variance exists and scales as n²
  • Decomposes into geometric RVs
  • Mentions applications

Explain the Chinese Restaurant Process - Meta, Google Interview Question

Difficulty: 🔴 Hard | Tags: Stochastic Process, Clustering, Bayesian | Asked by: Meta, Google, Amazon

View Answer

Chinese Restaurant Process (CRP):

Stochastic process for clustering with unbounded number of clusters.

Setup:

  • Restaurant with an infinite number of tables
  • Customers enter one by one
  • Each customer:
      • sits at occupied table k with prob ∝ n_k (# customers at table k)
      • sits at a new table with prob ∝ α (concentration parameter)

Probability:

\[P(\text{table } k) = \begin{cases} \frac{n_k}{n-1+\alpha} & \text{if table } k \text{ occupied} \\ \frac{\alpha}{n-1+\alpha} & \text{if new table} \end{cases}\]

Implementation:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

def chinese_restaurant_process(n_customers, alpha):
    """
    Simulate CRP
    Returns: list of table assignments
    """
    tables = [0]  # First customer at table 0

    for i in range(1, n_customers):
        # Count customers at each table
        counts = Counter(tables)

        # Probabilities
        probs = []
        table_ids = []

        for table, count in counts.items():
            probs.append(count)
            table_ids.append(table)

        # New table probability
        probs.append(alpha)
        table_ids.append(max(table_ids) + 1)

        # Normalize
        probs = np.array(probs) / (i + alpha)

        # Sample
        chosen = np.random.choice(len(probs), p=probs)
        tables.append(table_ids[chosen])

    return tables

# Simulate
n_customers = 100
alpha = 2.0

assignments = chinese_restaurant_process(n_customers, alpha)

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(assignments, 'o-', markersize=3, alpha=0.6)
plt.xlabel('Customer')
plt.ylabel('Table')
plt.title(f'CRP: α={alpha}')

plt.subplot(1, 2, 2)
table_sizes = Counter(assignments)
plt.bar(table_sizes.keys(), table_sizes.values())
plt.xlabel('Table ID')
plt.ylabel('Number of customers')
plt.title(f'Table sizes (Total tables: {len(table_sizes)})')

plt.tight_layout()
plt.show()

Effect of α:

# Compare different α values
n = 100
alphas = [0.5, 1.0, 5.0, 10.0]

for alpha in alphas:
    trials = []
    for _ in range(1000):
        tables = chinese_restaurant_process(n, alpha)
        n_tables = len(set(tables))
        trials.append(n_tables)

    print(f"α={alpha:4.1f}: E[# tables] = {np.mean(trials):.1f} ± {np.std(trials):.1f}")

# α small → few large tables (low diversity)
# α large → many small tables (high diversity)

Expected Number of Tables:

\[E[K_n] \approx \alpha \log\left(\frac{n}{\alpha} + 1\right)\]

For large n: \(E[K_n] \approx \alpha \log n\)
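
A quick comparison of this approximation against simulation, reusing `chinese_restaurant_process` from above (the exact expectation is the sum of new-table probabilities, \(\sum_{i=0}^{n-1} \alpha/(\alpha+i)\)):

import numpy as np

n, alpha = 100, 2.0
simulated = np.mean([len(set(chinese_restaurant_process(n, alpha)))
                     for _ in range(2000)])
exact  = sum(alpha / (alpha + i) for i in range(n))   # E[K_n] exactly
approx = alpha * np.log(n / alpha + 1)                # log approximation

print(f"Simulated: {simulated:.2f}, Exact: {exact:.2f}, Approx: {approx:.2f}")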

Applications:

  1. Topic modeling: Dirichlet process mixture models
  2. Clustering: Nonparametric Bayesian clustering
  3. Natural language: Word clustering
  4. Genetics: Species sampling problems

Connection to Dirichlet Process:

CRP is the "exchangeable" partition distribution induced by a Dirichlet Process.

Interviewer's Insight

What they're testing: Advanced ML/statistics knowledge.

Strong answer signals:

  • "Rich get richer" intuition
  • Knows α controls # clusters
  • Can simulate from scratch
  • Mentions DP mixture models

What is the Secretary Problem (Optimal Stopping)? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Optimal Stopping, Decision Theory, Probability | Asked by: Google, Amazon, Meta

View Answer

Secretary Problem:

Interview n candidates sequentially, must accept/reject immediately. Goal: maximize probability of selecting the best candidate.

Optimal Strategy:

  1. Reject first n/e ≈ 0.368n candidates (observation phase)
  2. Then accept first candidate better than all observed

Success Probability: ≈ 1/e ≈ 0.368 (37%)

Proof Intuition:

import numpy as np

def secretary_problem(n, r):
    """
    Simulate secretary problem
    n: number of candidates
    r: number to observe before selecting
    Returns: True if best candidate selected
    """
    # Random permutation (1 = best)
    candidates = np.random.permutation(n) + 1

    # Observe first r
    best_observed = max(candidates[:r])

    # Select first better than best_observed
    for i in range(r, n):
        if candidates[i] > best_observed:
            return candidates[i] == n  # n is the best

    return False  # No one selected

# Find optimal r for different n
n_values = [10, 50, 100, 1000]

for n in n_values:
    best_r = 0
    best_prob = 0

    # Try different r values
    for r in range(1, n):
        # Simulate
        trials = [secretary_problem(n, r) for _ in range(10000)]
        prob = np.mean(trials)

        if prob > best_prob:
            best_prob = prob
            best_r = r

    print(f"n={n:4d}: Optimal r={best_r:3d} (r/n={best_r/n:.3f}), P(success)={best_prob:.3f}")

# All approach r/n ≈ 1/e ≈ 0.368

Visualize for n=100:

n = 100
r_values = range(1, n)
success_probs = []

for r in r_values:
    trials = [secretary_problem(n, r) for _ in range(5000)]
    success_probs.append(np.mean(trials))

plt.figure(figsize=(10, 6))
plt.plot(np.array(r_values)/n, success_probs, 'b-', linewidth=2)
plt.axvline(x=1/np.e, color='r', linestyle='--', label='1/e ≈ 0.368')
plt.axhline(y=1/np.e, color='r', linestyle='--')
plt.xlabel('r/n (fraction observed)')
plt.ylabel('P(selecting best)')
plt.title('Secretary Problem Optimal Strategy')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
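
The simulation agrees with the closed-form success probability for cutoff r, \(P(r) = \frac{r}{n}\sum_{k=r}^{n-1} \frac{1}{k}\), which peaks near r = n/e (a minimal sketch):

import numpy as np

def success_probability(n, r):
    """P(select the overall best) with the 'observe r, then take first record' rule."""
    if r == 0:
        return 1.0 / n          # accept the very first candidate
    return (r / n) * sum(1.0 / k for k in range(r, n))

n = 100
probs = [success_probability(n, r) for r in range(n)]
best_r = int(np.argmax(probs))
print(f"Optimal r = {best_r} (n/e ≈ {n/np.e:.1f}), P(success) = {probs[best_r]:.3f}")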

Variants:

# 1. Want top k candidates (not just best)
# Strategy: observe n/k candidates, select next better

# 2. Know value distribution
# Use threshold strategy with known distribution

# 3. Maximize expected rank
# Different strategy, different cutoff

# 4. Multiple positions
# Generalized secretary problem

Real-World Applications:

| Domain | Application |
|---|---|
| Hiring | When to stop interviewing |
| Dating | When to propose |
| Real estate | When to make an offer |
| Trading | When to sell an asset |
| Parking | When to take a spot |

Interviewer's Insight

What they're testing: Decision theory under uncertainty.

Strong answer signals:

  • "Observe n/e, then select"
  • Success probability 1/e
  • Can simulate strategy
  • Mentions real applications

What is the False Discovery Rate (FDR)? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Multiple Testing, FDR, Statistics | Asked by: Google, Meta, Amazon, Microsoft

View Answer

False Discovery Rate:

Among all rejected (declared significant) null hypotheses, what proportion are false rejections?

\[FDR = E\left[\frac{V}{R}\right] = E\left[\frac{\text{# false positives}}{\text{# rejections}}\right]\]

Where V = false discoveries, R = total rejections.

Contrast with FWER:

| Metric | Definition | Control |
|---|---|---|
| FWER | P(≥1 false positive) | Bonferroni: α/m |
| FDR | E[false positives / rejections] | BH: less stringent |

Benjamini-Hochberg Procedure:

import numpy as np
from scipy import stats

def benjamini_hochberg(p_values, alpha=0.05):
    """
    BH procedure for FDR control
    Returns: boolean array of rejections
    """
    m = len(p_values)

    # Sort p-values with indices
    sorted_indices = np.argsort(p_values)
    sorted_p = p_values[sorted_indices]

    # BH threshold: p_i ≤ (i/m)·α
    thresholds = np.arange(1, m+1) / m * alpha

    # Find largest i where p_i ≤ threshold
    comparisons = sorted_p <= thresholds
    if not np.any(comparisons):
        return np.zeros(m, dtype=bool)

    k = np.max(np.where(comparisons)[0])

    # Reject all hypotheses up to k
    reject = np.zeros(m, dtype=bool)
    reject[sorted_indices[:k+1]] = True

    return reject

# Example: 100 hypotheses
np.random.seed(42)

# 90 true nulls (p ~ Uniform)
# 10 false nulls (p ~ small values)
p_true_null = np.random.uniform(0, 1, 90)
p_false_null = np.random.beta(0.5, 10, 10)  # Skewed to small values

p_values = np.concatenate([p_true_null, p_false_null])
truth = np.array([False]*90 + [True]*10)  # True = alternative is true

# Method 1: Bonferroni (control FWER)
alpha = 0.05
bonf_reject = p_values < alpha / len(p_values)

# Method 2: BH (control FDR)
bh_reject = benjamini_hochberg(p_values, alpha)

# Evaluate
print("Bonferroni:")
print(f"  Rejections: {bonf_reject.sum()}")
print(f"  True positives: {(bonf_reject & truth).sum()}")
print(f"  False positives: {(bonf_reject & ~truth).sum()}")

print("\nBenjamini-Hochberg:")
print(f"  Rejections: {bh_reject.sum()}")
print(f"  True positives: {(bh_reject & truth).sum()}")
print(f"  False positives: {(bh_reject & ~truth).sum()}")
if bh_reject.sum() > 0:
    fdr = (bh_reject & ~truth).sum() / bh_reject.sum()
    print(f"  Empirical FDR: {fdr:.2%}")

Visualization:

# Plot sorted p-values with BH threshold
sorted_p = np.sort(p_values)
bh_line = np.arange(1, len(p_values)+1) / len(p_values) * alpha

plt.figure(figsize=(10, 6))
plt.plot(sorted_p, 'bo', markersize=4, label='Sorted p-values')
plt.plot(bh_line, 'r-', linewidth=2, label=f'BH line: (i/m)·{alpha}')
plt.xlabel('Rank')
plt.ylabel('p-value')
plt.title('Benjamini-Hochberg Procedure')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

When to Use FDR:

  • Genomics: Test thousands of genes
  • A/B testing: Multiple variants
  • Feature selection: Many candidate features
  • Exploratory analysis: Generate hypotheses

Interviewer's Insight

What they're testing: Multiple testing awareness.

Strong answer signals:

  • Distinguishes FDR from FWER
  • "Less conservative than Bonferroni"
  • Knows BH procedure
  • Mentions genomics/big data

Explain the Bonferroni Correction - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Multiple Testing, FWER, Correction | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Bonferroni Correction:

When performing m hypothesis tests, control family-wise error rate (FWER) by testing each at level:

\[\alpha_{Bonf} = \frac{\alpha}{m}\]

Guarantees: P(at least one false positive) ≤ α

Proof:

By the union bound:

\[P\left(\bigcup_{i=1}^m \{\text{reject } H_i \mid H_i \text{ true}\}\right) \leq \sum_{i=1}^m P(\text{reject } H_i \mid H_i \text{ true}) = m \cdot \frac{\alpha}{m} = \alpha\]

Example:

import numpy as np
from scipy import stats

# Test 20 features for correlation with outcome
np.random.seed(42)

n = 100  # samples
m = 20   # features
alpha = 0.05

# Generate data: all features independent of outcome
features = np.random.randn(n, m)
outcome = np.random.randn(n)

# Test each feature
p_values = []
for i in range(m):
    corr, p = stats.pearsonr(features[:, i], outcome)
    p_values.append(p)

p_values = np.array(p_values)

# No correction
naive_reject = p_values < alpha
print(f"No correction: {naive_reject.sum()} rejections")

# Bonferroni correction
bonf_reject = p_values < alpha / m
print(f"Bonferroni: {bonf_reject.sum()} rejections")

# All features are actually null, so any rejection is false positive
print(f"\nFalse positives (no correction): {naive_reject.sum()}")
print(f"False positives (Bonferroni): {bonf_reject.sum()}")

Simulation - FWER Control:

# Verify FWER control over many trials
n_trials = 10000
fwer_naive = 0
fwer_bonf = 0

for _ in range(n_trials):
    # Generate null data
    features = np.random.randn(n, m)
    outcome = np.random.randn(n)

    # Test
    p_values = []
    for i in range(m):
        _, p = stats.pearsonr(features[:, i], outcome)
        p_values.append(p)
    p_values = np.array(p_values)

    # Check if any false positive
    if np.any(p_values < alpha):
        fwer_naive += 1
    if np.any(p_values < alpha/m):
        fwer_bonf += 1

print(f"Empirical FWER (no correction): {fwer_naive/n_trials:.3f}")
print(f"Empirical FWER (Bonferroni): {fwer_bonf/n_trials:.3f}")
print(f"Target FWER: {alpha}")
# Bonferroni successfully controls FWER ≤ 0.05

Limitations:

| Issue | Impact |
|---|---|
| Too conservative | Low power for large m |
| Often assumed to require independence | Union bound actually holds under any dependence |
| Loses power | May miss true effects |

When to Use:

  • Small number of tests (m < 20)
  • Need strong FWER control
  • Tests are critical (avoid any false positive)

Alternatives:

  • Šidák correction: \(1-(1-\alpha)^{1/m}\) (assumes independence)
  • Holm-Bonferroni: More powerful, still controls FWER
  • FDR methods: BH for exploratory analysis
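
A minimal sketch of the Holm step-down procedure (the function name is just illustrative): compare the i-th smallest p-value (i = 0, 1, ...) to α/(m − i) and stop at the first failure; this keeps FWER ≤ α while rejecting at least as much as plain Bonferroni.

import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: test the i-th smallest p-value against alpha/(m - i)."""
    p_values = np.asarray(p_values)
    m = len(p_values)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break                      # stop at the first non-rejection
    return reject

print(holm_bonferroni([0.001, 0.015, 0.04, 0.30]))
# [ True  True False False]; plain Bonferroni (0.05/4) would reject only the first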

Interviewer's Insight

What they're testing: Multiple testing basics.

Strong answer signals:

  • Formula α/m
  • "Controls FWER"
  • Knows it's conservative
  • Mentions alternatives (FDR, Holm)

What is the Two-Child Problem (Boy-Girl Paradox)? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: Conditional Probability, Paradox, Bayes | Asked by: Google, Meta, Amazon

View Answer

The Problem:

A family has two children. Given different information, what's P(both boys)?

Scenario 1: "At least one is a boy"

Scenario 2: "The eldest is a boy"

Solution:

# Sample space: {BB, BG, GB, GG} where first is eldest

# Scenario 1: At least one boy
# Condition eliminates GG
# Remaining: {BB, BG, GB}
# P(both boys | at least one boy) = 1/3

# Scenario 2: Eldest is boy
# Condition eliminates {GG, GB}
# Remaining: {BB, BG}
# P(both boys | eldest is boy) = 1/2

Simulation:

import numpy as np

def simulate_two_child(n_families=100000):
    """Simulate two-child families"""
    # 0 = girl, 1 = boy
    eldest = np.random.randint(0, 2, n_families)
    youngest = np.random.randint(0, 2, n_families)

    # Scenario 1: At least one boy
    at_least_one_boy = (eldest == 1) | (youngest == 1)
    both_boys_given_one = ((eldest == 1) & (youngest == 1))[at_least_one_boy]
    prob1 = np.mean(both_boys_given_one)

    # Scenario 2: Eldest is boy
    eldest_is_boy = eldest == 1
    both_boys_given_eldest = ((eldest == 1) & (youngest == 1))[eldest_is_boy]
    prob2 = np.mean(both_boys_given_eldest)

    return prob1, prob2

p1, p2 = simulate_two_child()
print(f"P(both boys | at least one boy) = {p1:.3f} ≈ 1/3")
print(f"P(both boys | eldest is boy) = {p2:.3f} ≈ 1/2")

Bayes' Theorem Calculation:

# Scenario 1: At least one boy
# Prior: P(BB) = P(BG) = P(GB) = P(GG) = 1/4

# P(at least one boy | BB) = 1
# P(at least one boy | BG) = 1
# P(at least one boy | GB) = 1
# P(at least one boy | GG) = 0

# P(at least one boy) = 3/4

# P(BB | at least one boy) = P(at least one | BB)·P(BB) / P(at least one)
#                          = 1 · (1/4) / (3/4) = 1/3

Famous Variant - Tuesday Boy:

# "I have two children, one is a boy born on Tuesday"
# What's P(both boys)?

# This is surprisingly DIFFERENT from 1/3!

# Sample space: 14×14 = 196 equally likely outcomes
# (day_eldest, sex_eldest) × (day_youngest, sex_youngest)

def tuesday_boy_problem():
    days = 7
    count_condition = 0  # At least one Tuesday boy
    count_both_boys = 0   # Both boys given condition

    for day1 in range(days):
        for sex1 in [0, 1]:  # 0=girl, 1=boy
            for day2 in range(days):
                for sex2 in [0, 1]:
                    # Check condition: at least one Tuesday boy
                    tuesday_boy = (sex1==1 and day1==2) or (sex2==1 and day2==2)

                    if tuesday_boy:
                        count_condition += 1
                        if sex1 == 1 and sex2 == 1:
                            count_both_boys += 1

    return count_both_boys / count_condition

prob = tuesday_boy_problem()
print(f"P(both boys | Tuesday boy) = {prob:.3f} ≈ 13/27 ≈ 0.481")
# NOT 1/3! The specific day information changes probability

Why This Matters:

Subtle differences in information drastically change probabilities!

Interviewer's Insight

What they're testing: Careful conditional probability reasoning.

Strong answer signals:

  • Distinguishes the two scenarios clearly
  • ⅓ vs ½ with explanation
  • Can use Bayes or counting
  • Mentions Tuesday boy variant

Explain the St. Petersburg Paradox - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Expected Value, Utility Theory, Paradox | Asked by: Google, Meta, Amazon

View Answer

The Paradox:

Game: Flip fair coin repeatedly until tails. You win $2^n where n = number of flips.

| Outcome | Probability | Payoff |
|---|---|---|
| T | ½ | $2 |
| HT | ¼ | $4 |
| HHT | ⅛ | $8 |
| HHH...T | ½^n | $2^n |

Expected Value:

\[E[X] = \sum_{n=1}^\infty \frac{1}{2^n} \cdot 2^n = \sum_{n=1}^\infty 1 = \infty\]

The Paradox: Infinite expected value, but no one would pay much to play!

Simulation:

import numpy as np
import matplotlib.pyplot as plt

def play_st_petersburg():
    """Play one game"""
    n = 1
    while np.random.rand() >= 0.5:  # While heads
        n += 1
    return 2**n

# Simulate many games
payoffs = [play_st_petersburg() for _ in range(10000)]

print(f"Mean payoff: ${np.mean(payoffs):.2f}")
print(f"Median payoff: ${np.median(payoffs):.2f}")
print(f"Max payoff: ${np.max(payoffs):.2f}")

# Distribution is extremely skewed!
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(payoffs, bins=50, edgecolor='black')
plt.xlabel('Payoff ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Payoffs')

plt.subplot(1, 2, 2)
plt.hist(np.log2(payoffs), bins=50, edgecolor='black')
plt.xlabel('log₂(Payoff)')
plt.ylabel('Frequency')
plt.title('Log-scale Distribution')

plt.tight_layout()
plt.show()

Resolution 1: Utility Theory:

People maximize expected utility, not expected value.

# Log utility: U(x) = log(x)

def expected_utility_log():
    """Expected utility with log utility"""
    eu = 0
    for n in range(1, 100):  # Approximate infinite sum
        prob = 1 / 2**n
        payoff = 2**n
        utility = np.log(payoff)  # log utility
        eu += prob * utility
    return eu

eu = expected_utility_log()
print(f"Expected utility (log): {eu:.3f}")

# Certainty equivalent: amount x where U(x) = E[U(game)]
ce = np.exp(eu)
print(f"Certainty equivalent: ${ce:.2f}")
# Person would pay ~$4-5, not infinite!

Resolution 2: Finite Wealth:

Casino has finite wealth → can't pay arbitrarily large payoffs.

# Casino has $1M
max_payoff = 1_000_000

# Find maximum n where 2^n ≤ 1M
max_n = int(np.log2(max_payoff))  # 19

# Expected value with cap
ev_capped = sum((1/2**n) * 2**n for n in range(1, max_n + 1)) + \
            max_payoff / 2**max_n   # all longer runs pay the capped $1M

print(f"Expected value (capped at ${max_payoff}): ${ev_capped:.2f}")
# Now finite: ≈ $21

Resolution 3: Diminishing Marginal Utility:

# Square root utility: U(x) = √x

def expected_utility_sqrt():
    eu = 0
    for n in range(1, 100):
        prob = 1 / 2**n
        payoff = 2**n
        utility = np.sqrt(payoff)
        eu += prob * utility
    return eu

eu = expected_utility_sqrt()
ce = eu**2  # Invert square root
print(f"Certainty equivalent (√ utility): ${ce:.2f}")

Interviewer's Insight

What they're testing: Deep understanding of expected value limitations.

Strong answer signals:

  • Knows E[X] = ∞
  • "But no one would pay infinite!"
  • Mentions utility theory
  • Discusses finite wealth constraint

What is the Gambler's Ruin Problem? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Random Walk, Absorbing States, Markov Chain | Asked by: Google, Amazon, Meta

View Answer

Problem Setup:

Gambler starts with $a, plays until wealth = $0 or $N. Each bet wins $1 with prob p, loses $1 with prob q=1-p.

Question: P(ruin | start at $a)?

Solution:

Let P_i = probability of ruin starting at $i.

Boundary: P_0 = 1, P_N = 0

Recurrence: \(P_i = p \cdot P_{i+1} + q \cdot P_{i-1}\)

Closed Form:

If p ≠ ½:

\[P_a = \frac{(q/p)^a - (q/p)^N}{1 - (q/p)^N}\]

If p = ½ (fair game):

\[P_a = 1 - \frac{a}{N}\]

Implementation:

import numpy as np
import matplotlib.pyplot as plt

def gamblers_ruin_analytical(a, N, p):
    """Analytical solution"""
    if p == 0.5:
        return 1 - a / N
    else:
        q = 1 - p
        ratio = q / p
        return (ratio**a - ratio**N) / (1 - ratio**N)

def gamblers_ruin_simulation(a, N, p, n_sims=10000):
    """Monte Carlo simulation"""
    ruins = 0

    for _ in range(n_sims):
        wealth = a

        while 0 < wealth < N:
            if np.random.rand() < p:
                wealth += 1
            else:
                wealth -= 1

        if wealth == 0:
            ruins += 1

    return ruins / n_sims

# Example: Start with $30, play until $0 or $100
a = 30
N = 100

# Fair game (p=0.5)
p = 0.5
prob_analytical = gamblers_ruin_analytical(a, N, p)
prob_simulation = gamblers_ruin_simulation(a, N, p)

print(f"Fair game (p={p}):")
print(f"  Analytical: P(ruin) = {prob_analytical:.3f}")
print(f"  Simulation: P(ruin) = {prob_simulation:.3f}")

# Unfavorable game (p=0.48)
p = 0.48
prob_analytical = gamblers_ruin_analytical(a, N, p)
prob_simulation = gamblers_ruin_simulation(a, N, p)

print(f"\nUnfavorable game (p={p}):")
print(f"  Analytical: P(ruin) = {prob_analytical:.3f}")
print(f"  Simulation: P(ruin) = {prob_simulation:.3f}")

Visualize P(ruin) vs Starting Wealth:

N = 100
starting_wealths = range(1, N)

plt.figure(figsize=(10, 6))

for p in [0.45, 0.48, 0.50, 0.52]:
    probs = [gamblers_ruin_analytical(a, N, p) for a in starting_wealths]
    plt.plot(starting_wealths, probs, label=f'p={p}')

plt.xlabel('Starting wealth ($)')
plt.ylabel('P(ruin)')
plt.title("Gambler's Ruin Probability")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key Insights:

  1. Fair game (p=0.5): Eventually go broke with prob 1 if N=∞
  2. Unfavorable game (p<0.5): Very likely to go broke
  3. Favorable game (p>0.5): Can win if enough capital

Expected Duration:

# Expected # games until absorbing state

def expected_duration(a, N, p):
    if p == 0.5:
        return a * (N - a)
    else:
        P_win = 1 - gamblers_ruin_analytical(a, N, p)
        # Wald: E[duration]·(p - q) = E[net gain] = N·P_win - a
        return (N * P_win - a) / (2*p - 1)

a, N = 50, 100
for p in [0.48, 0.50, 0.52]:
    duration = expected_duration(a, N, p)
    print(f"p={p}: E[duration] = {duration:.0f} games")

Interviewer's Insight

What they're testing: Random walk & Markov chain knowledge.

Strong answer signals:

  • States recurrence relation
  • Knows closed form solution
  • Fair game: linear in a/N
  • Can simulate to verify

Explain Benford's Law - Meta, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: First Digit Law, Applications, Fraud Detection | Asked by: Meta, Amazon, Google

View Answer

Benford's Law:

In many naturally occurring datasets, the leading digit d appears with probability:

\[P(d) = \log_{10}\left(1 + \frac{1}{d}\right)\]

Distribution:

| Digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |

Why It Works:

Benford's law tends to hold for data that span several orders of magnitude and are approximately scale-invariant: multiplying every value by a constant should not change the first-digit frequencies.
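
A quick check of that scale-invariance point, using simulated log-normal data (a minimal sketch): rescaling every value, as in a change of currency, leaves the first-digit frequencies roughly unchanged.

import numpy as np
from collections import Counter

def first_digit_freqs(data):
    digits = [int(str(int(x))[0]) for x in data]
    counts = Counter(digits)
    return np.array([counts[d] / len(digits) for d in range(1, 10)])

np.random.seed(0)
data = np.random.lognormal(10, 2, 100_000)   # spans many orders of magnitude
data = data[data >= 1]                       # keep values with an integer leading digit

original = first_digit_freqs(data)
rescaled = first_digit_freqs(data * 3.7)     # arbitrary change of units

for d in range(9):
    print(f"digit {d+1}: original={original[d]:.3f}, rescaled={rescaled[d]:.3f}")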

Test with Real Data:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

def benfords_law(d):
    """Theoretical probability for digit d"""
    return np.log10(1 + 1/d)

# Example 1: Population of US cities
# (You'd load real data, here we simulate)
populations = np.random.lognormal(10, 2, 1000)
first_digits = [int(str(int(p))[0]) for p in populations]

# Count frequencies
counts = Counter(first_digits)
observed = [counts[d]/len(first_digits) for d in range(1, 10)]
expected = [benfords_law(d) for d in range(1, 10)]

# Plot
plt.figure(figsize=(10, 6))
x = np.arange(1, 10)
width = 0.35

plt.bar(x - width/2, observed, width, label='Observed', alpha=0.7)
plt.bar(x + width/2, expected, width, label="Benford's Law", alpha=0.7)

plt.xlabel('First Digit')
plt.ylabel('Probability')
plt.title("Benford's Law: Population Data")
plt.xticks(x)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Chi-Square Test:

from scipy.stats import chisquare

# Test if data follows Benford's law
observed_counts = [counts[d] for d in range(1, 10)]
expected_counts = [benfords_law(d) * len(first_digits) for d in range(1, 10)]

chi2, p_value = chisquare(observed_counts, expected_counts)

print(f"Chi-square test:")
print(f"  χ² = {chi2:.2f}")
print(f"  p-value = {p_value:.4f}")

if p_value > 0.05:
    print("  → Consistent with Benford's law")
else:
    print("  → Deviates from Benford's law")

Fraud Detection Application:

# Example: Expense reports

# Legitimate expenses (log-normal)
legit = np.random.lognormal(4, 1, 1000)

# Fraudulent expenses (made up, tend to use all digits equally)
fraud = np.random.uniform(10, 999, 300)

def get_first_digit_dist(data):
    first_digits = [int(str(int(x))[0]) for x in data]
    counts = Counter(first_digits)
    return [counts[d]/len(first_digits) for d in range(1, 10)]

legit_dist = get_first_digit_dist(legit)
fraud_dist = get_first_digit_dist(fraud)
benfords = [benfords_law(d) for d in range(1, 10)]

# Calculate deviation from Benford
legit_dev = np.sum((np.array(legit_dist) - np.array(benfords))**2)
fraud_dev = np.sum((np.array(fraud_dist) - np.array(benfords))**2)

print(f"Deviation from Benford's law:")
print(f"  Legitimate: {legit_dev:.4f}")
print(f"  Fraudulent: {fraud_dev:.4f}")

When Benford's Law Applies:

  • Financial data (stock prices, expenses)
  • Scientific data (physical constants, populations)
  • Data spanning orders of magnitude

When It Doesn't Apply:

  • Assigned numbers (phone numbers, SSN)
  • Data with artificial limits
  • Uniform distributions

Interviewer's Insight

What they're testing: Awareness of statistical patterns.

Strong answer signals:

  • log₁₀(1 + 1/d) formula
  • "1 appears ~30% of time"
  • Mentions fraud detection
  • Knows scale-invariance property

What is the Hyperparameter Tuning Problem in Bayesian Terms? - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Bayesian Optimization, ML, Probability | Asked by: Google, Meta, Amazon

View Answer

Problem:

Given expensive black-box function f(x) (e.g., model validation accuracy), find x* that maximizes f.

Bayesian Optimization Approach:

  1. Prior: Gaussian Process over f
  2. Acquisition: Balance exploration/exploitation
  3. Update: Posterior after observing f(x)

Key Components:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

class BayesianOptimization:
    def __init__(self):
        self.X_observed = []
        self.y_observed = []

    def gp_predict(self, X_test):
        """
        Simplified GP prediction
        Returns: mean, std
        """
        if len(self.X_observed) == 0:
            return np.zeros(len(X_test)), np.ones(len(X_test))

        # Simplified: use nearest neighbor
        # (Real implementation uses kernel functions)
        means = []
        stds = []

        for x in X_test:
            # Find nearest observed point
            distances = [abs(x - x_obs) for x_obs in self.X_observed]
            nearest_idx = np.argmin(distances)
            nearest_dist = distances[nearest_idx]

            # Interpolate
            mean = self.y_observed[nearest_idx]
            std = 0.1 + nearest_dist * 0.5  # Uncertainty increases with distance

            means.append(mean)
            stds.append(std)

        return np.array(means), np.array(stds)

    def acquisition_ucb(self, X, kappa=2.0):
        """Upper Confidence Bound acquisition"""
        mean, std = self.gp_predict(X)
        return mean + kappa * std

    def acquisition_ei(self, X, xi=0.01):
        """Expected Improvement acquisition"""
        mean, std = self.gp_predict(X)

        if len(self.y_observed) == 0:
            return np.zeros(len(X))

        best = max(self.y_observed)

        # Expected improvement
        z = (mean - best - xi) / (std + 1e-9)
        ei = (mean - best - xi) * norm.cdf(z) + std * norm.pdf(z)

        return ei

    def suggest_next(self, bounds, acquisition='ei'):
        """Suggest next point to evaluate"""
        # Grid search over acquisition function
        X_candidates = np.linspace(bounds[0], bounds[1], 1000)

        if acquisition == 'ei':
            scores = self.acquisition_ei(X_candidates)
        else:
            scores = self.acquisition_ucb(X_candidates)

        best_idx = np.argmax(scores)
        return X_candidates[best_idx]

    def observe(self, x, y):
        """Add observation"""
        self.X_observed.append(x)
        self.y_observed.append(y)

# Example: Optimize noisy function
def objective(x):
    """True function to optimize (unknown to optimizer)"""
    return np.sin(x) + 0.1 * np.random.randn()

# Run Bayesian Optimization
bo = BayesianOptimization()
bounds = [0, 10]

# Initial random samples
for _ in range(3):
    x = np.random.uniform(bounds[0], bounds[1])
    y = objective(x)
    bo.observe(x, y)

# Iterative optimization
for iteration in range(20):
    # Suggest next point
    x_next = bo.suggest_next(bounds, acquisition='ei')
    y_next = objective(x_next)
    bo.observe(x_next, y_next)

    print(f"Iter {iteration+1}: x={x_next:.3f}, f(x)={y_next:.3f}, best={max(bo.y_observed):.3f}")

# Visualize
import matplotlib.pyplot as plt

X_plot = np.linspace(bounds[0], bounds[1], 200)
y_true = [np.sin(x) for x in X_plot]

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(X_plot, y_true, 'k-', label='True function')
plt.scatter(bo.X_observed, bo.y_observed, c='red', s=100, zorder=5, label='Observations')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Bayesian Optimization Progress')
plt.legend()

plt.subplot(1, 2, 2)
best_so_far = [max(bo.y_observed[:i+1]) for i in range(len(bo.y_observed))]
plt.plot(best_so_far, 'b-', linewidth=2)
plt.axhline(y=max(y_true), color='k', linestyle='--', label='True maximum')
plt.xlabel('Iteration')
plt.ylabel('Best f(x) found')
plt.title('Convergence')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Acquisition Functions:

Function Formula Trade-off
UCB μ + κσ κ controls exploration
EI E[max(0, f - f_best)] Expected magnitude of improvement
PI P(f > f_best) Probability of any improvement, ignores magnitude
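The class above implements UCB and EI; for completeness, a minimal sketch of the PI acquisition in the same style (it assumes the same gp_predict interface, numpy imported as np, and the norm import from scipy.stats used earlier; xi is an illustrative exploration margin):

def acquisition_pi(self, X, xi=0.01):
    """Probability of Improvement: P(f(x) > f_best + xi)"""
    mean, std = self.gp_predict(X)
    if len(self.y_observed) == 0:
        return np.ones(len(X))  # no data yet: every candidate is equally promising
    best = max(self.y_observed)
    z = (mean - best - xi) / (std + 1e-9)
    return norm.cdf(z)

# Attach to the class above if desired:
# BayesianOptimization.acquisition_pi = acquisition_pi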

Applications:

  • Hyperparameter tuning (learning rate, depth, etc.)
  • Neural architecture search
  • A/B test parameter optimization

Interviewer's Insight

What they're testing: ML + probability integration.

Strong answer signals:

  • Mentions Gaussian Process
  • Knows acquisition functions (EI, UCB)
  • Exploration vs exploitation
  • "Fewer expensive evaluations"

What is a Sufficient Statistic? Give Examples - Google, Amazon Interview Question

Difficulty: 🔴 Hard | Tags: Sufficient Statistics, Theory, Estimation | Asked by: Google, Amazon, Microsoft

View Answer

Sufficient Statistic:

Statistic T(X) is sufficient for parameter θ if:

\[P(X | T(X), \theta) = P(X | T(X))\]

Meaning: Given T(X), the data X provides no additional information about θ.

Factorization Theorem:

T(X) is sufficient iff:

\[f(x|\theta) = g(T(x), \theta) \cdot h(x)\]

Example 1 - Bernoulli:

import numpy as np
from scipy.stats import binom

# Data: n Bernoulli trials
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n = len(data)

# Sufficient statistic: T(X) = sum(X)
T = data.sum()  # = 7

# Given T=7 out of n=10, any sequence with 7 ones is equally likely
# The order doesn't matter for estimating p!

# MLE using full data
p_mle_full = data.mean()

# MLE using only sufficient statistic
p_mle_suff = T / n

print(f"MLE from full data: {p_mle_full}")
print(f"MLE from sufficient stat: {p_mle_suff}")
# Identical!

Example 2 - Normal Distribution:

# X₁,...,Xₙ ~ N(μ, σ²) with σ² known
# Sufficient statistic: T(X) = mean(X)

data = np.random.normal(5, 2, 100)

# Estimate μ using full data
mu_mle_full = np.mean(data)

# Estimate μ using only sufficient statistic (sample mean)
T = np.mean(data)  # This is sufficient
mu_mle_suff = T

print(f"\nμ MLE from full data: {mu_mle_full:.3f}")
print(f"μ MLE from sufficient stat: {mu_mle_suff:.3f}")

# If both μ and σ² unknown:
# Sufficient statistic: (mean(X), variance(X))
T = (np.mean(data), np.var(data, ddof=1))
print(f"\nSufficient stat (μ,σ² unknown): mean={T[0]:.3f}, var={T[1]:.3f}")

Example 3 - Uniform(0, θ):

# X₁,...,Xₙ ~ Uniform(0, θ)
# Sufficient statistic: T(X) = max(X)

true_theta = 10
data = np.random.uniform(0, true_theta, 50)

# MLE: θ̂ = max(X)
theta_mle = np.max(data)

print(f"\nTrue θ: {true_theta}")
print(f"MLE (using max only): {theta_mle:.3f}")

# The maximum is sufficient - we don't need individual values!

Minimal Sufficient Statistic:

A minimal sufficient statistic is one that cannot be reduced further without losing information about θ; equivalently, it is a function of every other sufficient statistic.

# For Bernoulli: sum(X) is minimal sufficient
# Can't do better than just counting successes

# For Normal(μ, σ²): (mean, variance) is minimal sufficient
# Need both for complete inference

Why It Matters:

Benefit Explanation
Data reduction Store T(X) instead of all data
Efficient estimation Use T(X) for MLE
Theory Basis for optimal tests/estimators

Checking Sufficiency:

# Method: Show factorization

# Bernoulli example:
# L(p|x) = ∏ p^xᵢ(1-p)^(1-xᵢ)
#        = p^Σxᵢ (1-p)^(n-Σxᵢ)
#        = g(T(x), p) · h(x)
# where T(x) = Σxᵢ, h(x) = 1
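A small numerical companion to the factorization argument (a sketch; the two sequences are illustrative): any two Bernoulli sequences with the same sum have identical likelihoods for every p, so the likelihood depends on the data only through T(x) = Σxᵢ.

import numpy as np

def bernoulli_likelihood(x, p):
    """L(p | x) = ∏ p^xᵢ (1-p)^(1-xᵢ)"""
    x = np.asarray(x)
    return np.prod(p**x * (1 - p)**(1 - x))

seq_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]  # sum = 7
seq_b = [0, 1, 1, 1, 1, 1, 0, 1, 1, 0]  # sum = 7, different order

for p in [0.2, 0.5, 0.7]:
    la, lb = bernoulli_likelihood(seq_a, p), bernoulli_likelihood(seq_b, p)
    print(f"p={p}: L(seq_a)={la:.3e}, L(seq_b)={lb:.3e}, equal={np.isclose(la, lb)}")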

Interviewer's Insight

What they're testing: Deep statistical theory.

Strong answer signals:

  • Factorization theorem
  • Examples: sum for Bernoulli, mean for normal
  • "Captures all info about θ"
  • Mentions data reduction benefit

Explain the Rao-Blackwell Theorem - Google, Microsoft Interview Question

Difficulty: 🔴 Hard | Tags: Estimation Theory, UMVUE, Statistics | Asked by: Google, Microsoft, Amazon

View Answer

Rao-Blackwell Theorem:

Given:

  • An unbiased estimator δ(X) for θ
  • A sufficient statistic T(X)

Define: \(\delta^*(X) = E[\delta(X) | T(X)]\)

Then:

  1. δ* is unbiased
  2. Var(δ*) ≤ Var(δ) (improvement!)

Intuition: Conditioning on sufficient statistic T can only reduce variance.

Example - Bernoulli:

import numpy as np

# X₁,...,Xₙ ~ Bernoulli(p)
# Goal: Estimate p

n = 10
p_true = 0.6
n_sims = 10000

# Original (crude) estimator: δ(X) = X₁ (just use first obs!)
estimates_crude = []
estimates_rb = []

for _ in range(n_sims):
    X = np.random.binomial(1, p_true, n)

    # Crude estimator
    delta = X[0]  # Just first observation
    estimates_crude.append(delta)

    # Rao-Blackwellized estimator
    # T(X) = sum(X) is sufficient
    # E[X₁ | T(X) = sum(X)] = sum(X) / n
    T = X.sum()
    delta_star = T / n
    estimates_rb.append(delta_star)

print(f"True p: {p_true}")
print(f"\nCrude estimator (X₁):")
print(f"  Mean: {np.mean(estimates_crude):.3f}")
print(f"  Variance: {np.var(estimates_crude):.4f}")

print(f"\nRao-Blackwellized (X̄):")
print(f"  Mean: {np.mean(estimates_rb):.3f}")
print(f"  Variance: {np.var(estimates_rb):.4f}")

# Theoretical variances
var_crude = p_true * (1 - p_true)  # Var(Bernoulli)
var_rb = p_true * (1 - p_true) / n  # Var(mean)

print(f"\nTheoretical reduction: {var_crude / var_rb:.1f}x")

Example - Normal:

# X ~ N(μ, σ²), σ² known
# Crude estimator: median(X)
# Sufficient statistic: mean(X)

mu_true = 5
sigma = 2
n = 20

estimates_median = []
estimates_mean = []

for _ in range(10000):
    X = np.random.normal(mu_true, sigma, n)

    # Crude: sample median
    estimates_median.append(np.median(X))

    # RB: sample mean (conditioning on sufficient stat)
    estimates_mean.append(np.mean(X))

print(f"\nNormal example:")
print(f"Median variance: {np.var(estimates_median):.4f}")
print(f"Mean variance: {np.var(estimates_mean):.4f}")
print(f"Improvement: {np.var(estimates_median) / np.var(estimates_mean):.2f}x")

Proof Sketch:

# Var(δ) = E[Var(δ|T)] + Var(E[δ|T])
#        = E[Var(δ|T)] + Var(δ*)
# 
# Since E[Var(δ|T)] ≥ 0:
# Var(δ) ≥ Var(δ*)
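The decomposition can be verified numerically for the Bernoulli example above (a sketch; it relies on the fact that, given T = ΣXᵢ = t, X₁ is Bernoulli(t/n) by exchangeability):

import numpy as np

n, p, n_sims = 10, 0.6, 200_000
X = np.random.binomial(1, p, size=(n_sims, n))
T = X.sum(axis=1)

delta_star = T / n                        # E[X₁ | T]
cond_var = (T / n) * (1 - T / n)          # Var(X₁ | T)

lhs = p * (1 - p)                         # Var(X₁)
rhs = cond_var.mean() + delta_star.var()  # E[Var(X₁|T)] + Var(δ*)

print(f"Var(X₁)                     = {lhs:.4f}")
print(f"E[Var(X₁|T)] + Var(E[X₁|T]) = {rhs:.4f}")
# The two sides agree up to Monte Carlo error, and Var(δ*) ≤ Var(X₁)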

Connection to UMVUE:

If the sufficient statistic T is also complete, then δ* = E[δ|T] is the Uniformly Minimum Variance Unbiased Estimator (UMVUE); this is the Lehmann-Scheffé theorem.

Interviewer's Insight

What they're testing: Advanced estimation theory.

Strong answer signals:

  • "Condition on sufficient statistic"
  • "Always reduces variance"
  • Example: X̄ improves on X₁
  • Mentions UMVUE connection

What is Simpson's Paradox? Provide Examples - Most Tech Companies Interview Question

Difficulty: 🟡 Medium | Tags: Paradox, Confounding, Causal Inference | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Simpson's Paradox:

A trend appears in subgroups but disappears/reverses when groups are combined.

Classic Example - UC Berkeley Admissions:

import pandas as pd
import numpy as np

# Berkeley admission data (departments A and F from the 1973 dataset, counts rounded)
data = pd.DataFrame({
    'Department': ['A', 'A', 'F', 'F'],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Applied': [825, 108, 373, 341],
    'Admitted': [512, 89, 22, 24]
})

data['Rate'] = data['Admitted'] / data['Applied']

print("By Department:")
print(data)

# Aggregate (ignoring department)
total_male = data[data['Gender'] == 'Male']['Applied'].sum()
total_female = data[data['Gender'] == 'Female']['Applied'].sum()
admit_male = data[data['Gender'] == 'Male']['Admitted'].sum()
admit_female = data[data['Gender'] == 'Female']['Admitted'].sum()

print(f"\nOverall:")
print(f"Male: {admit_male}/{total_male} = {admit_male/total_male:.1%}")
print(f"Female: {admit_female}/{total_female} = {admit_female/total_female:.1%}")

# Paradox: Females have the higher admission rate within each department,
# yet the lower overall rate, because most women applied to the far more selective department F

Medical Example:

# Treatment effectiveness paradox

treatment_data = pd.DataFrame({
    'Group': ['Healthy', 'Healthy', 'Sick', 'Sick'],
    'Treatment': ['Drug', 'Control', 'Drug', 'Control'],
    'Total': [50, 450, 450, 50],
    'Recovered': [45, 405, 360, 30]
})

treatment_data['Recovery_Rate'] = treatment_data['Recovered'] / treatment_data['Total']

print("\nBy Health Status:")
print(treatment_data)

# Aggregate
drug_total = treatment_data[treatment_data['Treatment'] == 'Drug']['Total'].sum()
drug_recovered = treatment_data[treatment_data['Treatment'] == 'Drug']['Recovered'].sum()

control_total = treatment_data[treatment_data['Treatment'] == 'Control']['Total'].sum()
control_recovered = treatment_data[treatment_data['Treatment'] == 'Control']['Recovered'].sum()

print(f"\nOverall:")
print(f"Drug: {drug_recovered}/{drug_total} = {drug_recovered/drug_total:.1%}")
print(f"Control: {control_recovered}/{control_total} = {control_recovered/control_total:.1%}")

# Drug is at least as good within each group (equal for the healthy, better for the sick),
# yet worse overall, because the sick (who recover less often) were far more likely to receive the drug

Visualization:

import matplotlib.pyplot as plt

# Simpson's paradox visualization

# Group 1
x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([2, 3, 4, 5, 6])  # Positive trend

# Group 2
x2 = np.array([6, 7, 8, 9, 10])
y2 = np.array([-2, -1, 0, 1, 2])  # Positive trend within the group, but shifted lower

# Combined negative trend
x_all = np.concatenate([x1, x2])
y_all = np.concatenate([y1, y2])

plt.figure(figsize=(10, 6))
plt.scatter(x1, y1, color='blue', s=100, label='Group 1', alpha=0.6)
plt.scatter(x2, y2, color='red', s=100, label='Group 2', alpha=0.6)

# Fit lines
z1 = np.polyfit(x1, y1, 1)
z2 = np.polyfit(x2, y2, 1)
z_all = np.polyfit(x_all, y_all, 1)

plt.plot(x1, np.poly1d(z1)(x1), 'b-', linewidth=2, label='Group 1 trend')
plt.plot(x2, np.poly1d(z2)(x2), 'r-', linewidth=2, label='Group 2 trend')
plt.plot(x_all, np.poly1d(z_all)(x_all), 'k--', linewidth=2, label='Combined trend')

plt.xlabel('X')
plt.ylabel('Y')
plt.title("Simpson's Paradox")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Why It Happens:

Confounding variable Z affects both X and Y:

  • Within each Z value: X → Y positive
  • Aggregated: X → Y negative (Z confounds)

Resolution:

# Use stratification or causal inference

# Correct analysis: stratify by confounder
for dept in ['A', 'F']:
    subset = data[data['Department'] == dept]
    print(f"\nDepartment {dept}:")
    print(subset[['Gender', 'Rate']])

Interviewer's Insight

What they're testing: Confounding awareness.

Strong answer signals:

  • Berkeley admission example
  • "Trend reverses when aggregated"
  • Mentions confounding variable
  • Knows to stratify

Explain the Delta Method - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Asymptotics, Central Limit Theorem, Approximation | Asked by: Google, Meta, Amazon

View Answer

Delta Method:

If \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \sigma^2)\), then for smooth g:

\[\sqrt{n}(g(\hat{\theta}) - g(\theta)) \xrightarrow{d} N(0, [g'(\theta)]^2 \sigma^2)\]

Intuition: a first-order Taylor expansion of g around θ carries the asymptotic normality of θ̂ over to the transformed estimator g(θ̂).
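A one-line derivation of the variance factor: expanding g to first order around θ,

\[g(\hat{\theta}) \approx g(\theta) + g'(\theta)(\hat{\theta} - \theta) \quad\Rightarrow\quad \text{Var}\big(g(\hat{\theta})\big) \approx [g'(\theta)]^2 \, \text{Var}(\hat{\theta})\]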

Example 1 - Bernoulli Variance:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Estimate variance of Bernoulli: θ(1-θ)

p_true = 0.3
n = 100
n_sims = 10000

# Simulation
estimates_var = []

for _ in range(n_sims):
    X = np.random.binomial(1, p_true, n)
    p_hat = X.mean()

    # Plug-in estimator for variance
    var_hat = p_hat * (1 - p_hat)
    estimates_var.append(var_hat)

# True variance
true_var = p_true * (1 - p_true)

# Delta method approximation
# g(p) = p(1-p), g'(p) = 1 - 2p
g_prime = 1 - 2*p_true

# Var(p̂) = p(1-p)/n
var_p_hat = p_true * (1 - p_true) / n

# Delta method: Var(g(p̂)) ≈ [g'(p)]² · Var(p̂)
var_delta = (g_prime**2) * var_p_hat
std_delta = np.sqrt(var_delta)

# Compare
print(f"True variance: {true_var:.4f}")
print(f"Mean estimate: {np.mean(estimates_var):.4f}")
print(f"Std of estimates (simulation): {np.std(estimates_var):.4f}")
print(f"Std (Delta method): {std_delta:.4f}")

# Plot
plt.figure(figsize=(10, 6))
plt.hist(estimates_var, bins=50, density=True, alpha=0.7, label='Simulation')

# Delta method normal approximation
x = np.linspace(true_var - 4*std_delta, true_var + 4*std_delta, 100)
plt.plot(x, stats.norm.pdf(x, true_var, std_delta), 'r-', linewidth=2, label='Delta method')

plt.axvline(true_var, color='black', linestyle='--', label='True value')
plt.xlabel('Estimated variance')
plt.ylabel('Density')
plt.title('Delta Method Approximation')
plt.legend()
plt.show()

Example 2 - Log Odds:

# Transform: g(p) = log(p/(1-p))  [log-odds]

p_true = 0.6
n = 200

estimates_logodds = []

for _ in range(10000):
    X = np.random.binomial(1, p_true, n)
    p_hat = X.mean()

    # Avoid 0/1
    p_hat = np.clip(p_hat, 0.01, 0.99)

    logodds_hat = np.log(p_hat / (1 - p_hat))
    estimates_logodds.append(logodds_hat)

# True log-odds
true_logodds = np.log(p_true / (1 - p_true))

# Delta method
# g(p) = log(p/(1-p))
# g'(p) = 1/(p(1-p))
g_prime = 1 / (p_true * (1 - p_true))

var_p_hat = p_true * (1 - p_true) / n
var_delta = (g_prime**2) * var_p_hat
std_delta = np.sqrt(var_delta)

print(f"\nLog-odds example:")
print(f"True log-odds: {true_logodds:.4f}")
print(f"Std (simulation): {np.std(estimates_logodds):.4f}")
print(f"Std (Delta method): {std_delta:.4f}")

Multivariate Delta Method:

# For vector θ̂ → g(θ̂)

# Example: Ratio estimator
# X ~ N(μx, σx²), Y ~ N(μy, σy²)
# Estimate ratio R = μx/μy

mu_x, mu_y = 10, 5
sigma_x, sigma_y = 2, 1
n = 100

estimates_ratio = []

for _ in range(10000):
    X = np.random.normal(mu_x, sigma_x, n)
    Y = np.random.normal(mu_y, sigma_y, n)

    ratio = X.mean() / Y.mean()
    estimates_ratio.append(ratio)

true_ratio = mu_x / mu_y

# Multivariate delta method
# g(μx, μy) = μx/μy
# ∇g = [1/μy, -μx/μy²]

gradient = np.array([1/mu_y, -mu_x/mu_y**2])

# Covariance matrix of (X̄, Ȳ)
cov_matrix = np.array([
    [sigma_x**2/n, 0],
    [0, sigma_y**2/n]
])

# Var(g(θ̂)) ≈ ∇g^T Σ ∇g
var_delta = gradient @ cov_matrix @ gradient
std_delta = np.sqrt(var_delta)

print(f"\nRatio example:")
print(f"True ratio: {true_ratio:.4f}")
print(f"Std (simulation): {np.std(estimates_ratio):.4f}")
print(f"Std (Delta method): {std_delta:.4f}")

Interviewer's Insight

What they're testing: Asymptotic theory knowledge.

Strong answer signals:

  • Taylor expansion intuition
  • Formula: [g'(θ)]² σ²
  • "First-order approximation"
  • Example: log-odds or ratio

What is the Likelihood Ratio in Hypothesis Testing? - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Hypothesis Testing, Likelihood, Neyman-Pearson | Asked by: Google, Amazon, Meta

View Answer

Likelihood Ratio:

\[\Lambda(x) = \frac{L(\theta_1 | x)}{L(\theta_0 | x)} = \frac{P(X=x | H_1)}{P(X=x | H_0)}\]

Decision Rule: Reject H₀ if Λ(x) > k (threshold)

Neyman-Pearson Lemma:

For testing H₀: θ = θ₀ vs H₁: θ = θ₁, the LR test is most powerful (maximizes power for fixed α).

Example - Normal Mean:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# H₀: μ = 0 vs H₁: μ = 1
# X ~ N(μ, σ²=1)

sigma = 1
n = 20
alpha = 0.05

# Generate data under H₁
np.random.seed(42)
mu_0 = 0
mu_1 = 1

X = np.random.normal(mu_1, sigma, n)
x_bar = X.mean()

# Likelihood ratio
# L(μ|x) ∝ exp(-n(x̄-μ)²/(2σ²))

L_0 = np.exp(-n * (x_bar - mu_0)**2 / (2 * sigma**2))
L_1 = np.exp(-n * (x_bar - mu_1)**2 / (2 * sigma**2))

LR = L_1 / L_0

print(f"Sample mean: {x_bar:.3f}")
print(f"L(μ=0|x): {L_0:.6f}")
print(f"L(μ=1|x): {L_1:.6f}")
print(f"Likelihood Ratio: {LR:.3f}")

# Critical value
# Under H₀, x̄ ~ N(0, σ²/n)
# Reject if x̄ > c where P(x̄ > c | H₀) = α

c = stats.norm.ppf(1 - alpha, loc=mu_0, scale=sigma/np.sqrt(n))
print(f"\nCritical value: {c:.3f}")
print(f"Decision: {'Reject H₀' if x_bar > c else 'Fail to reject H₀'}")

# Equivalence: LR test ↔ reject if x̄ > c
# LR > k ↔ x̄ > some threshold

ROC Curve:

# Vary threshold, plot TPR vs FPR

def compute_roc(mu_0, mu_1, sigma, n, n_sims=10000):
    # Generate data under both hypotheses
    data_H0 = np.random.normal(mu_0, sigma, (n_sims, n))
    data_H1 = np.random.normal(mu_1, sigma, (n_sims, n))

    x_bar_H0 = data_H0.mean(axis=1)
    x_bar_H1 = data_H1.mean(axis=1)

    # Try different thresholds
    thresholds = np.linspace(mu_0 - 3*sigma/np.sqrt(n), 
                             mu_1 + 3*sigma/np.sqrt(n), 100)

    fpr = []
    tpr = []

    for t in thresholds:
        # False positive rate: P(reject | H₀)
        fpr.append(np.mean(x_bar_H0 > t))

        # True positive rate: P(reject | H₁)
        tpr.append(np.mean(x_bar_H1 > t))

    return fpr, tpr

fpr, tpr = compute_roc(mu_0, mu_1, sigma, n)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, 'b-', linewidth=2, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Likelihood Ratio Test')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# AUC
from sklearn.metrics import auc
roc_auc = auc(fpr, tpr)
print(f"\nAUC: {roc_auc:.3f}")

Generalized Likelihood Ratio Test (GLRT):

# Composite hypotheses: unknown parameters

# H₀: μ = 0, σ unknown
# H₁: μ ≠ 0, σ unknown

X = np.random.normal(0.5, 1, 50)

# MLE under H₀
mu_0_mle = 0
sigma_0_mle = np.sqrt(np.mean((X - mu_0_mle)**2))

# MLE under H₁ (unconstrained)
mu_1_mle = X.mean()
sigma_1_mle = np.sqrt(np.mean((X - mu_1_mle)**2))

# Generalized likelihood ratio: λ = sup_{H₀} L / sup_{H₁} L
# For a normal model the maximized likelihood is ∝ σ̂⁻ⁿ, so λ = (σ̂₁/σ̂₀)ⁿ ≤ 1
n = len(X)
lam = (sigma_1_mle / sigma_0_mle)**n

# Test statistic: -2 log(λ) ~ χ²(df) under H₀
test_stat = -2 * np.log(lam)

# df = (# free params under H₁) - (# free params under H₀) = 1
p_value = 1 - stats.chi2.cdf(test_stat, df=1)

print(f"\nGLRT:")
print(f"Test statistic: {test_stat:.3f}")
print(f"p-value: {p_value:.4f}")

Interviewer's Insight

What they're testing: Hypothesis testing theory.

Strong answer signals:

  • Ratio of likelihoods under H₁/H₀
  • Neyman-Pearson lemma (most powerful)
  • Connection to ROC
  • GLRT for composite hypotheses

Explain the Bootstrap Method - Most Tech Companies Interview Question

Difficulty: 🟡 Medium | Tags: Resampling, Inference, Confidence Intervals | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Bootstrap:

Estimate sampling distribution of statistic by resampling from data with replacement.

Algorithm:

  1. Original sample: \(X = \{x_1, ..., x_n\}\)
  2. For b = 1 to B:
  3. Draw \(X^*_b\) by sampling n points from X with replacement
  4. Compute statistic \(\theta^*_b = T(X^*_b)\)
  5. Use \(\{\theta^*_1, ..., \theta^*_B\}\) to approximate distribution of T

Implementation:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Original data
np.random.seed(42)
data = np.random.exponential(scale=2, size=50)

# Statistic: median
observed_median = np.median(data)

# Bootstrap
n_bootstrap = 10000
bootstrap_medians = []

for _ in range(n_bootstrap):
    # Resample with replacement
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_medians.append(np.median(bootstrap_sample))

bootstrap_medians = np.array(bootstrap_medians)

# Bootstrap standard error
se_bootstrap = np.std(bootstrap_medians)

print(f"Observed median: {observed_median:.3f}")
print(f"Bootstrap SE: {se_bootstrap:.3f}")

# Bootstrap confidence interval (percentile method)
ci_lower = np.percentile(bootstrap_medians, 2.5)
ci_upper = np.percentile(bootstrap_medians, 97.5)

print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")

# Plot
plt.figure(figsize=(10, 6))
plt.hist(bootstrap_medians, bins=50, density=True, alpha=0.7, edgecolor='black')
plt.axvline(observed_median, color='red', linestyle='--', linewidth=2, label='Observed')
plt.axvline(ci_lower, color='green', linestyle='--', label='95% CI')
plt.axvline(ci_upper, color='green', linestyle='--')
plt.xlabel('Median')
plt.ylabel('Density')
plt.title('Bootstrap Distribution of Median')
plt.legend()
plt.show()

Types of Bootstrap CI:

# 1. Percentile method (above)
ci_percentile = (np.percentile(bootstrap_medians, 2.5),
                 np.percentile(bootstrap_medians, 97.5))

# 2. Normal approximation (observed statistic ± 1.96 × bootstrap SE)
ci_normal = (observed_median - 1.96 * se_bootstrap,
             observed_median + 1.96 * se_bootstrap)

# 3. BCa (bias-corrected and accelerated)
# More complex, accounts for bias and skewness

print(f"\nPercentile CI: {ci_percentile}")
print(f"Normal CI: {ci_normal}")

Bootstrap for Hypothesis Testing:

# Test: H₀: median = 1 vs H₁: median ≠ 1

null_value = 1

# Center bootstrap samples at null
shifted_data = data - observed_median + null_value

bootstrap_null = []
for _ in range(n_bootstrap):
    bootstrap_sample = np.random.choice(shifted_data, size=len(data), replace=True)
    bootstrap_null.append(np.median(bootstrap_sample))

# p-value: proportion of bootstrap stats as extreme as observed
bootstrap_null = np.array(bootstrap_null)
p_value = np.mean(np.abs(bootstrap_null - null_value) >= 
                  np.abs(observed_median - null_value))

print(f"\nBootstrap hypothesis test:")
print(f"p-value: {p_value:.4f}")

Comparison with Analytical:

# For mean of normal data, we have analytical SE
data_normal = np.random.normal(5, 2, 100)

# Analytical
se_analytical = stats.sem(data_normal)

# Bootstrap
bootstrap_means = []
for _ in range(10000):
    bootstrap_sample = np.random.choice(data_normal, size=len(data_normal), replace=True)
    bootstrap_means.append(np.mean(bootstrap_sample))

se_bootstrap = np.std(bootstrap_means)

print(f"\nComparison for mean:")
print(f"Analytical SE: {se_analytical:.4f}")
print(f"Bootstrap SE: {se_bootstrap:.4f}")

When to Use Bootstrap:

Scenario Bootstrap?
Complex statistic (median, ratio) ✓ Yes
Small sample, non-normal ✓ Yes
Simple mean, large n △ Optional (analytical works)
Time series, dependence ✗ Need block bootstrap

Limitations:

  • Assumes sample represents population
  • Can fail for extreme statistics (max, min)
  • Computational cost

Interviewer's Insight

What they're testing: Practical inference methods.

Strong answer signals:

  • "Resample with replacement"
  • Explains SE and CI
  • Mentions percentile method
  • Knows when it's useful

What is the Curse of Dimensionality in Probability? - Google, Meta Interview Question

Difficulty: 🟡 Medium | Tags: High Dimensions, Geometry, ML | Asked by: Google, Meta, Amazon, Microsoft

View Answer

Curse of Dimensionality:

As dimensions increase, intuitions from low dimensions fail dramatically.

Phenomenon 1 - Volume Concentration:

import numpy as np
import matplotlib.pyplot as plt

# Hypersphere volume as fraction of hypercube

def sphere_volume_fraction(d):
    """Volume of unit sphere / volume of unit cube in d dimensions"""
    # Unit sphere: radius = 1/2 (to fit in unit cube)
    # V_sphere = π^(d/2) / Γ(d/2 + 1) * r^d
    # V_cube = 1
    from scipy.special import gamma
    r = 0.5
    vol_sphere = np.pi**(d/2) / gamma(d/2 + 1) * r**d
    return vol_sphere

dims = range(1, 21)
fractions = [sphere_volume_fraction(d) for d in dims]

plt.figure(figsize=(10, 6))
plt.plot(dims, fractions, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Dimension')
plt.ylabel('Sphere volume / Cube volume')
plt.title('Curse of Dimensionality: Volume Concentration')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()

print("Sphere volume as % of cube:")
for d in [2, 5, 10, 20]:
    print(f"  d={d}: {sphere_volume_fraction(d)*100:.4f}%")
# Almost all volume is in corners!

Phenomenon 2 - Distance Concentration:

# In high dimensions, all points are nearly equidistant

def distance_ratio_simulation(n_points=100, dimensions=[2, 10, 100]):
    results = {}

    for d in dimensions:
        # Random points in unit hypercube
        points = np.random.rand(n_points, d)

        # Compute all pairwise distances
        from scipy.spatial.distance import pdist
        distances = pdist(points)

        # Ratio of max to min distance
        ratio = np.max(distances) / np.min(distances)

        # Relative standard deviation
        rel_std = np.std(distances) / np.mean(distances)

        results[d] = {
            'mean': np.mean(distances),
            'std': np.std(distances),
            'rel_std': rel_std,
            'ratio': ratio
        }

    return results

results = distance_ratio_simulation()

print("\nDistance concentration:")
for d, stats in results.items():
    print(f"d={d}:")
    print(f"  Mean distance: {stats['mean']:.4f}")
    print(f"  Rel. std: {stats['rel_std']:.4f}")
    print(f"  Max/min ratio: {stats['ratio']:.2f}")
# In high dims: all distances ≈ same!

Phenomenon 3 - Sampling Sparsity:

# To maintain same density, need exponentially more samples

def samples_needed(d, density_per_dim=10):
    """Number of samples to maintain density"""
    return density_per_dim ** d

print("\nSamples needed for fixed density:")
for d in [1, 2, 3, 5, 10]:
    n = samples_needed(d)
    print(f"d={d}: {n:,} samples")
# Explodes exponentially!

Impact on ML:

# K-NN becomes useless in high dimensions

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Test k-NN performance vs dimensionality
dimensions = [2, 5, 10, 20, 50, 100]
accuracies = []

for d in dimensions:
    X, y = make_classification(n_samples=200, n_features=d, 
                               n_informative=d, n_redundant=0, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_val_score(knn, X, y, cv=5)
    accuracies.append(scores.mean())

plt.figure(figsize=(10, 6))
plt.plot(dimensions, accuracies, 'ro-', linewidth=2, markersize=8)
plt.xlabel('Dimension')
plt.ylabel('Cross-val accuracy')
plt.title('k-NN Performance Degrades in High Dimensions')
plt.axhline(y=0.5, color='k', linestyle='--', label='Random')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Solutions:

Problem Solution
Too many features Dimensionality reduction (PCA, etc.)
Distance meaningless Use distance-agnostic methods
Sparse data Regularization, feature selection
Curse of volume Manifold assumption
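A minimal sketch of one mitigation from the table: supervised feature selection before k-NN when most features are noise (the dataset sizes and k=10 selected features are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 100 features, only 5 carry signal; the rest are pure noise
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           n_redundant=0, random_state=42)

knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_selected = make_pipeline(SelectKBest(f_classif, k=10),
                             KNeighborsClassifier(n_neighbors=5))

print(f"k-NN on all 100 features:     {cross_val_score(knn_raw, X, y, cv=5).mean():.3f}")
print(f"k-NN on 10 selected features: {cross_val_score(knn_selected, X, y, cv=5).mean():.3f}")
# Dropping the noise dimensions typically recovers much of the lost accuracy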

Interviewer's Insight

What they're testing: High-dimensional intuition.

Strong answer signals:

  • "Volume concentrates in corners"
  • "All points equidistant"
  • "Need exponentially more data"
  • Mentions PCA/regularization

Explain the Pitman-Koopman-Darmois Theorem - Google Interview Question

Difficulty: 🔴 Hard | Tags: Exponential Family, Sufficient Statistics, Theory | Asked by: Google, Microsoft

View Answer

Pitman-Koopman-Darmois Theorem:

If a family of distributions with smooth densities and support not depending on θ admits a sufficient statistic whose dimension stays fixed as the sample size grows, then the family must belong to the exponential family.

Exponential Family Form:

\[f(x|\theta) = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\}\]

Where:

  • T(x): Sufficient statistic
  • η(θ): Natural parameter
  • A(θ): Log-partition function

Examples:

import numpy as np
from scipy import stats

# 1. Bernoulli: exponential family
# f(x|p) = p^x (1-p)^(1-x)
#        = exp{x log(p/(1-p)) + log(1-p)}
# T(x) = x, η = log(p/(1-p))

# Sufficient statistic: sum(X) has fixed dimension 1
data = np.random.binomial(1, 0.6, 100)
print(f"Bernoulli sufficient stat: sum = {data.sum()}")

# 2. Normal (μ unknown, σ² known): exponential family
# T(x) = mean(X)

# 3. Uniform(0, θ): NOT an exponential family
# max(X) is a one-dimensional sufficient statistic, but the support (0, θ)
# depends on θ, so Uniform(0, θ) falls outside the theorem's regularity conditions

data_unif = np.random.uniform(0, 10, 100)
print(f"Uniform sufficient stat: max = {data_unif.max()}")
# Sufficient, yet the family is not exponential

Why It Matters:

# Exponential family has nice properties:

# 1. Natural conjugate priors
# Example: Bernoulli + Beta

from scipy.stats import beta, binom

# Prior: Beta(a, b)
a, b = 2, 2

# Data: n trials, k successes
n, k = 100, 60

# Posterior: Beta(a+k, b+n-k)
a_post = a + k
b_post = b + n - k

print(f"\nConjugate prior example:")
print(f"Prior: Beta({a}, {b})")
print(f"Data: {k}/{n} successes")
print(f"Posterior: Beta({a_post}, {b_post})")

# 2. MLE has closed form
# 3. Sufficient statistics compress data optimally

Complete Proof Sketch:

# Theorem: Fixed-dimension sufficient stat → exponential family

# Proof idea:
# If T(X) is sufficient with fixed dimension d,
# then by factorization theorem:
# 
# f(x|θ) = g(T(x), θ) h(x)
#
# For this to work for all n with T having fixed dimension,
# must have exponential family structure.

# Contrapositive (within the regular, fixed-support families the theorem covers):
# not exponential family → no fixed-dimension sufficient statistic.
# Uniform(0, θ) has the 1-D sufficient statistic max(X), but its support depends on θ,
# so it sits outside the theorem's hypotheses rather than contradicting it

Identifying Exponential Families:

# Check if distribution can be written in form:
# f(x|θ) = h(x) exp{η(θ)·T(x) - A(θ)}

distributions = {
    'Bernoulli': 'Yes - T(x)=x',
    'Normal': 'Yes - T(x)=(x, x²) if both μ,σ² unknown',
    'Poisson': 'Yes - T(x)=x',
    'Exponential': 'Yes - T(x)=x',
    'Gamma': 'Yes - T(x)=(x, log x)',
    'Beta': 'Yes - T(x)=(log x, log(1-x))',
    'Uniform(0,θ)': 'No - not exponential family',
    'Cauchy': 'No - no fixed-dimension sufficient statistic (beyond the full sample)'
}

print("\nExponential family membership:")
for dist, status in distributions.items():
    print(f"  {dist}: {status}")

Interviewer's Insight

What they're testing: Deep theoretical knowledge.

Strong answer signals:

  • "Fixed-dimension sufficient stat → exponential family"
  • Knows exponential family form
  • Examples: Bernoulli yes, Uniform no
  • Mentions conjugate priors

What is the Cramér-Rao Lower Bound? - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Estimation Theory, Fisher Information, Lower Bound | Asked by: Google, Meta, Microsoft

View Answer

Cramér-Rao Lower Bound (CRLB):

For any unbiased estimator \(\hat{\theta}\) of parameter θ:

\[\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot I(\theta)}\]

Where I(θ) is the Fisher Information:

\[I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X|\theta)\right)^2\right]\]

Interpretation: No unbiased estimator can have variance below this bound.

Example - Bernoulli:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# X ~ Bernoulli(p)
# Find CRLB for estimating p

# Log-likelihood: log f(x|p) = x log(p) + (1-x) log(1-p)
# Score: d/dp log f = x/p - (1-x)/(1-p)

# Fisher information
# I(p) = E[(d/dp log f)²]
#      = E[(x/p - (1-x)/(1-p))²]
#      = 1/(p(1-p))

def fisher_info_bernoulli(p):
    return 1 / (p * (1 - p))

def crlb_bernoulli(p, n):
    return 1 / (n * fisher_info_bernoulli(p))

# Simulation
p_true = 0.3
n = 100
n_sims = 10000

estimates = []
for _ in range(n_sims):
    X = np.random.binomial(1, p_true, n)
    p_hat = X.mean()
    estimates.append(p_hat)

var_empirical = np.var(estimates)
crlb = crlb_bernoulli(p_true, n)

print(f"Bernoulli estimation:")
print(f"True p: {p_true}")
print(f"CRLB: {crlb:.6f}")
print(f"Var(p̂) empirical: {var_empirical:.6f}")
print(f"Var(p̂) theoretical: {p_true*(1-p_true)/n:.6f}")
print(f"Efficiency: {crlb/var_empirical:.4f}")
# MLE for Bernoulli achieves CRLB (efficient!)

Example - Normal:

# X ~ N(μ, σ²), σ² known
# Estimate μ

sigma = 2
mu_true = 5
n = 50

# Fisher information
# I(μ) = n/σ²
fisher_info = n / sigma**2
crlb = 1 / fisher_info

print(f"\nNormal estimation (μ):")
print(f"CRLB: {crlb:.6f}")
print(f"Var(X̄): {sigma**2 / n:.6f}")
# X̄ achieves CRLB

Efficiency:

# Efficiency = CRLB / Var(estimator)
# Efficient estimator: efficiency = 1

def compare_estimators(p_true, n, n_sims=10000):
    """Compare different estimators for Bernoulli p"""

    crlb = crlb_bernoulli(p_true, n)

    estimators = {}

    for _ in range(n_sims):
        X = np.random.binomial(1, p_true, n)

        # MLE: sample mean
        estimators.setdefault('MLE', []).append(X.mean())

        # Inefficient: use only first half
        estimators.setdefault('Half', []).append(X[:n//2].mean())

    print(f"\nEstimator comparison (p={p_true}, n={n}):")
    print(f"CRLB: {crlb:.6f}")

    for name, ests in estimators.items():
        var = np.var(ests)
        eff = crlb / var
        print(f"{name}:")
        print(f"  Variance: {var:.6f}")
        print(f"  Efficiency: {eff:.4f}")

compare_estimators(0.3, 100)

Multivariate CRLB:

# For vector parameter θ
# Cov(θ̂) ⪰ I(θ)⁻¹  (matrix inequality)

# Example: Normal(μ, σ²), both unknown

mu_true = 5
sigma_true = 2
n = 100

# Fisher information matrix
# I(μ,σ²) = [[n/σ², 0], [0, n/(2σ⁴)]]

I_matrix = np.array([
    [n / sigma_true**2, 0],
    [0, n / (2 * sigma_true**4)]
])

# CRLB: inverse of Fisher information
crlb_matrix = np.linalg.inv(I_matrix)

print(f"\nMultivariate CRLB:")
print("Covariance lower bound:")
print(crlb_matrix)

print(f"\nVar(μ̂) ≥ {crlb_matrix[0,0]:.6f}")
print(f"Var(σ̂²) ≥ {crlb_matrix[1,1]:.6f}")

Visualization:

# Plot CRLB vs p for Bernoulli

p_values = np.linspace(0.01, 0.99, 100)
n = 50

crlb_values = [crlb_bernoulli(p, n) for p in p_values]

plt.figure(figsize=(10, 6))
plt.plot(p_values, crlb_values, 'b-', linewidth=2)
plt.xlabel('True p')
plt.ylabel('CRLB for Var(p̂)')
plt.title(f'Cramér-Rao Lower Bound (n={n})')
plt.grid(True, alpha=0.3)
plt.show()
# Hardest to estimate: p near 0.5

Interviewer's Insight

What they're testing: Advanced estimation theory.

Strong answer signals:

  • Formula: 1/(n·I(θ))
  • "Lower bound on variance"
  • Knows Fisher Information
  • Mentions efficiency
  • Example: MLE achieves bound

What is the Difference Between Joint, Marginal, and Conditional Distributions? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Distributions, Fundamentals, Probability | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Joint Distribution:

P(X, Y) - probability of X and Y together

Marginal Distribution:

P(X) = ∑_y P(X, Y=y) - distribution of X alone

Conditional Distribution:

P(X | Y) = P(X, Y) / P(Y) - distribution of X given Y

Example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Joint distribution: dice rolls
# X = first die, Y = second die

# Create joint distribution
outcomes = []
for x in range(1, 7):
    for y in range(1, 7):
        outcomes.append((x, y))

joint = pd.DataFrame(outcomes, columns=['X', 'Y'])
joint_prob = joint.groupby(['X', 'Y']).size() / len(joint)

# Reshape to matrix
joint_matrix = joint_prob.unstack(fill_value=0)

print("Joint Distribution P(X, Y):")
print(joint_matrix)

# Marginal distributions
marginal_X = joint_matrix.sum(axis=1)  # Sum over Y
marginal_Y = joint_matrix.sum(axis=0)  # Sum over X

print("\nMarginal P(X):")
print(marginal_X)

print("\nMarginal P(Y):")
print(marginal_Y)

# Conditional distribution: P(X | Y=3)
y_value = 3
conditional_X_given_Y = joint_matrix[y_value] / marginal_Y[y_value]

print(f"\nConditional P(X | Y={y_value}):")
print(conditional_X_given_Y)

Real Example - Customer Data:

# Customer age and purchase amount
np.random.seed(42)

# Age groups: Young, Middle, Senior
# Purchase: Low, Medium, High

data = pd.DataFrame({
    'Age': np.random.choice(['Young', 'Middle', 'Senior'], 1000, p=[0.3, 0.5, 0.2]),
    'Purchase': np.random.choice(['Low', 'Mid', 'High'], 1000)
})

# Joint distribution
joint = pd.crosstab(data['Age'], data['Purchase'], normalize='all')
print("\nJoint P(Age, Purchase):")
print(joint)

# Marginal distributions
marginal_age = pd.crosstab(data['Age'], data['Purchase'], normalize='all').sum(axis=1)
marginal_purchase = pd.crosstab(data['Age'], data['Purchase'], normalize='all').sum(axis=0)

print("\nMarginal P(Age):")
print(marginal_age)

# Conditional: P(Purchase | Age=Young)
conditional = pd.crosstab(data['Age'], data['Purchase'], normalize='index')
print("\nConditional P(Purchase | Age):")
print(conditional)

Visualization:

# Visualize joint vs marginals

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Joint distribution
axes[0, 0].imshow(joint_matrix, cmap='Blues', aspect='auto')
axes[0, 0].set_title('Joint P(X, Y)')
axes[0, 0].set_xlabel('Y')
axes[0, 0].set_ylabel('X')

# Marginal X
axes[0, 1].bar(marginal_X.index, marginal_X.values)
axes[0, 1].set_title('Marginal P(X)')
axes[0, 1].set_xlabel('X')

# Marginal Y
axes[1, 0].bar(marginal_Y.index, marginal_Y.values)
axes[1, 0].set_title('Marginal P(Y)')
axes[1, 0].set_xlabel('Y')

# Conditional P(X|Y=3)
axes[1, 1].bar(conditional_X_given_Y.index, conditional_X_given_Y.values, color='orange')
axes[1, 1].set_title(f'Conditional P(X | Y={y_value})')
axes[1, 1].set_xlabel('X')

plt.tight_layout()
plt.show()

Key Relationships:

Relationship Formula
Marginal from joint P(X) = ∑_y P(X,y)
Conditional from joint P(X|Y) = P(X,Y)/P(Y)
Joint from conditional P(X,Y) = P(X|Y)P(Y)
Independence P(X,Y) = P(X)P(Y)
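The independence row can be checked directly on the dice example (a sketch reusing joint_matrix, marginal_X, and marginal_Y from the code above):

# Two fair dice are independent: P(X, Y) should equal the outer product P(X)·P(Y)
product_of_marginals = np.outer(marginal_X.values, marginal_Y.values)
print("P(X,Y) == P(X)P(Y)?", np.allclose(joint_matrix.values, product_of_marginals))  # True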

Interviewer's Insight

What they're testing: Basic probability concepts.

Strong answer signals:

  • Clear definitions
  • "Marginal = sum over other variables"
  • "Conditional = joint / marginal"
  • Can compute from each other

Explain the Expectation-Maximization (EM) Algorithm - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: EM, Latent Variables, ML | Asked by: Google, Meta, Amazon, Microsoft

View Answer

EM Algorithm:

Iterative method to find MLE when data has latent (hidden) variables.

E-step: Compute the expected complete-data log-likelihood, taking the expectation over the latent variables' posterior given the current parameters

M-step: Maximize that expectation with respect to the parameters to obtain the updated estimates

Example - Gaussian Mixture Model:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

class GaussianMixture:
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.weights = None
        self.means = None
        self.stds = None

    def fit(self, X, max_iters=100, tol=1e-4):
        n = len(X)
        k = self.n_components

        # Initialize
        self.weights = np.ones(k) / k
        self.means = np.random.choice(X, k)
        self.stds = np.ones(k) * np.std(X)

        log_likelihoods = []

        for iteration in range(max_iters):
            # E-step: Compute responsibilities
            responsibilities = np.zeros((n, k))

            for j in range(k):
                responsibilities[:, j] = self.weights[j] * \
                    norm.pdf(X, self.means[j], self.stds[j])

            # Normalize
            responsibilities /= responsibilities.sum(axis=1, keepdims=True)

            # M-step: Update parameters
            N = responsibilities.sum(axis=0)

            self.weights = N / n
            self.means = (responsibilities.T @ X) / N

            # Update standard deviations
            for j in range(k):
                diff = X - self.means[j]
                self.stds[j] = np.sqrt((responsibilities[:, j] * diff**2).sum() / N[j])

            # Compute log-likelihood
            log_likelihood = self._log_likelihood(X)
            log_likelihoods.append(log_likelihood)

            # Check convergence
            if iteration > 0 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol:
                break

        return log_likelihoods

    def _log_likelihood(self, X):
        ll = 0
        for j in range(self.n_components):
            ll += self.weights[j] * norm.pdf(X, self.means[j], self.stds[j])
        return np.log(ll).sum()

    def predict_proba(self, X):
        """Predict cluster probabilities"""
        n = len(X)
        k = self.n_components
        probs = np.zeros((n, k))

        for j in range(k):
            probs[:, j] = self.weights[j] * norm.pdf(X, self.means[j], self.stds[j])

        probs /= probs.sum(axis=1, keepdims=True)
        return probs

# Generate data from mixture
np.random.seed(42)

# True parameters
true_weights = [0.3, 0.7]
true_means = [-2, 3]
true_stds = [0.8, 1.2]

n_samples = 500
components = np.random.choice([0, 1], n_samples, p=true_weights)
X = np.array([
    np.random.normal(true_means[c], true_stds[c]) 
    for c in components
])

# Fit GMM with EM
gmm = GaussianMixture(n_components=2)
log_likelihoods = gmm.fit(X)

print("Learned parameters:")
print(f"Weights: {gmm.weights}")
print(f"Means: {gmm.means}")
print(f"Stds: {gmm.stds}")

print(f"\nTrue parameters:")
print(f"Weights: {true_weights}")
print(f"Means: {true_means}")
print(f"Stds: {true_stds}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Data and fitted components
axes[0].hist(X, bins=50, density=True, alpha=0.5, label='Data')

x_plot = np.linspace(X.min(), X.max(), 1000)
for j in range(2):
    y = gmm.weights[j] * norm.pdf(x_plot, gmm.means[j], gmm.stds[j])
    axes[0].plot(x_plot, y, linewidth=2, label=f'Component {j+1}')

# Total density
total = sum(gmm.weights[j] * norm.pdf(x_plot, gmm.means[j], gmm.stds[j]) 
            for j in range(2))
axes[0].plot(x_plot, total, 'k--', linewidth=2, label='Mixture')

axes[0].set_xlabel('X')
axes[0].set_ylabel('Density')
axes[0].set_title('Fitted Gaussian Mixture')
axes[0].legend()

# Log-likelihood convergence
axes[1].plot(log_likelihoods, 'b-', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Log-likelihood')
axes[1].set_title('EM Convergence')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key Properties:

  1. Guaranteed improvement: Log-likelihood never decreases
  2. Local maxima: May not find global optimum
  3. Sensitive to initialization: multiple random restarts help (see the sketch below)
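A minimal restart loop for point 3, reusing the GaussianMixture class and the data X generated above (five restarts is an illustrative choice):

best_gmm, best_ll = None, -np.inf

for seed in range(5):                     # a few random restarts
    np.random.seed(seed)
    candidate = GaussianMixture(n_components=2)
    lls = candidate.fit(X)
    if lls[-1] > best_ll:                 # keep the run with the highest final log-likelihood
        best_ll = lls[-1]
        best_gmm = candidate

print(f"Best final log-likelihood over restarts: {best_ll:.2f}")
print(f"Means of best run: {best_gmm.means}")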

Applications:

  • Clustering (GMM)
  • Hidden Markov Models
  • Missing data imputation
  • Topic modeling (LDA)

Interviewer's Insight

What they're testing: ML algorithms + probability.

Strong answer signals:

  • E-step: compute expectations
  • M-step: maximize
  • "Handles latent variables"
  • Example: Gaussian mixture
  • "Increases likelihood each iteration"

What is Rejection Sampling vs Importance Sampling? - Meta, Google Interview Question

Difficulty: 🟡 Medium | Tags: Monte Carlo, Sampling, Simulation | Asked by: Meta, Google, Amazon

View Answer

Rejection Sampling:

Sample from target p(x) using proposal q(x):

  1. Sample x ~ q(x)
  2. Accept x with probability p(x)/(M·q(x))
  3. Repeat until accepted

Importance Sampling:

Estimate E_p[f(X)] using samples from q(x):

\[E_p[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \frac{p(x_i)}{q(x_i)}\]

Comparison:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, expon

# Target: Standard normal truncated to [0, ∞)
def target_pdf(x):
    if x < 0:
        return 0
    return 2 * norm.pdf(x, 0, 1)  # Normalized truncated normal

# Proposal: Exponential(1)
def proposal_pdf(x):
    return expon.pdf(x, scale=1)

# Find M: max(target/proposal)
x_test = np.linspace(0, 5, 1000)
ratios = [target_pdf(x) / proposal_pdf(x) for x in x_test]
M = max(ratios) * 1.1  # Add margin

print(f"M = {M:.2f}")

# Method 1: Rejection Sampling
def rejection_sampling(n_samples):
    samples = []
    n_rejected = 0

    while len(samples) < n_samples:
        # Propose
        x = np.random.exponential(scale=1)

        # Accept/reject
        u = np.random.uniform(0, 1)
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
        else:
            n_rejected += 1

    acceptance_rate = n_samples / (n_samples + n_rejected)
    return np.array(samples), acceptance_rate

samples_rejection, acc_rate = rejection_sampling(10000)
print(f"\nRejection Sampling:")
print(f"Acceptance rate: {acc_rate:.2%}")

# Method 2: Importance Sampling
def importance_sampling(n_samples):
    # Sample from proposal
    samples = np.random.exponential(scale=1, size=n_samples)

    # Compute weights
    weights = np.array([target_pdf(x) / proposal_pdf(x) for x in samples])

    return samples, weights

samples_importance, weights = importance_sampling(10000)

# Normalize weights
weights_normalized = weights / weights.sum()

# Estimate mean
true_mean = 0.798  # For truncated normal [0,∞)

mean_rejection = samples_rejection.mean()
mean_importance = (samples_importance * weights_normalized).sum()

print(f"\nMean estimation:")
print(f"True: {true_mean:.3f}")
print(f"Rejection: {mean_rejection:.3f}")
print(f"Importance: {mean_importance:.3f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Rejection sampling
x_plot = np.linspace(0, 5, 1000)
target_vals = [target_pdf(x) for x in x_plot]
proposal_vals = [M * proposal_pdf(x) for x in x_plot]

axes[0].fill_between(x_plot, 0, target_vals, alpha=0.3, label='Target')
axes[0].plot(x_plot, proposal_vals, 'r-', linewidth=2, label=f'M·Proposal')
axes[0].set_title('Rejection Sampling Setup')
axes[0].legend()
axes[0].set_xlabel('x')
axes[0].set_ylabel('Density')

# Rejection sampling histogram
axes[1].hist(samples_rejection, bins=50, density=True, alpha=0.7, label='Samples')
axes[1].plot(x_plot, target_vals, 'k-', linewidth=2, label='Target')
axes[1].set_title('Rejection Sampling Result')
axes[1].legend()
axes[1].set_xlabel('x')

# Importance sampling (weighted histogram)
axes[2].hist(samples_importance, bins=50, density=True, 
             weights=weights_normalized*len(samples_importance), 
             alpha=0.7, label='Weighted samples')
axes[2].plot(x_plot, target_vals, 'k-', linewidth=2, label='Target')
axes[2].set_title('Importance Sampling Result')
axes[2].legend()
axes[2].set_xlabel('x')

plt.tight_layout()
plt.show()

Comparison Table:

Aspect Rejection Sampling Importance Sampling
Output Exact samples from p(x) Weighted samples
Efficiency Wastes rejected samples Uses all samples
Requirement Need M with p(x) ≤ M·q(x) Need q(x) > 0 wherever p(x) > 0
Best use Generate samples Estimate expectations
High dimensions Poor (low acceptance) Better

Interviewer's Insight

What they're testing: Sampling methods knowledge.

Strong answer signals:

  • Rejection: accept/reject mechanism
  • Importance: reweight samples
  • "Rejection gives exact samples"
  • "Importance better for high-D"
  • Mentions proposal distribution

Explain the Poisson Distribution and Its Applications - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Distributions, Poisson, Applications | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Poisson Distribution:

Models number of events in fixed interval:

\[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

Where λ = expected number of events

Properties:

  • E[X] = λ
  • Var(X) = λ
  • Sum of independent Poissons: X ~ Pois(λ₁), Y ~ Pois(λ₂) independent → X+Y ~ Pois(λ₁+λ₂) (verified in the sketch below)
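A quick simulation of the sum property (a sketch; the rates 2 and 3 are arbitrary):

import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2, 3
X = poisson.rvs(lam1, size=200_000)
Y = poisson.rvs(lam2, size=200_000)
S = X + Y

k = np.arange(0, 16)
empirical = np.array([(S == ki).mean() for ki in k])
theoretical = poisson.pmf(k, lam1 + lam2)

print(f"Max |empirical - Poisson({lam1 + lam2}) pmf| = {np.abs(empirical - theoretical).max():.4f}")
# The distribution of X + Y matches Poisson(λ₁ + λ₂) up to sampling noise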

Implementation:

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

# Example: Website visitors per hour
lambda_rate = 5  # Average 5 visitors/hour

# PMF
k = np.arange(0, 15)
pmf = poisson.pmf(k, lambda_rate)

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.bar(k, pmf, edgecolor='black', alpha=0.7)
plt.xlabel('Number of events')
plt.ylabel('Probability')
plt.title(f'Poisson PMF (λ={lambda_rate})')
plt.grid(True, alpha=0.3)

# Compare different λ
plt.subplot(1, 3, 2)
for lam in [1, 3, 5, 10]:
    pmf = poisson.pmf(k, lam)
    plt.plot(k, pmf, 'o-', label=f'λ={lam}', markersize=6)
plt.xlabel('k')
plt.ylabel('P(X=k)')
plt.title('Different Rates')
plt.legend()
plt.grid(True, alpha=0.3)

# Simulation
plt.subplot(1, 3, 3)
samples = poisson.rvs(lambda_rate, size=1000)
plt.hist(samples, bins=range(0, 16), density=True, alpha=0.7, 
         edgecolor='black', label='Simulation')
pmf = poisson.pmf(k, lambda_rate)  # recompute for λ=5: the loop above left pmf at λ=10
plt.plot(k, pmf, 'ro-', linewidth=2, markersize=8, label='Theory')
plt.xlabel('Number of events')
plt.ylabel('Probability')
plt.title('Simulation vs Theory')
plt.legend()

plt.tight_layout()
plt.show()

Real Applications:

# Application 1: Server requests
def server_load_analysis():
    """Model requests per minute"""
    avg_requests = 50

    # P(more than 60 requests)
    prob_overload = 1 - poisson.cdf(60, avg_requests)
    print(f"P(overload) = {prob_overload:.3f}")

    # 95th percentile (capacity planning)
    capacity = poisson.ppf(0.95, avg_requests)
    print(f"95th percentile: {capacity:.0f} requests")

    return capacity

# Application 2: A/B test - rare events
def ab_test_poisson(control_rate, treatment_rate, n_days):
    """Test if treatment changes conversion rate"""
    # Control: λ₀ = control_rate per day
    # Treatment: λ₁ = treatment_rate per day

    # Simulate
    control_conversions = poisson.rvs(control_rate, size=n_days)
    treatment_conversions = poisson.rvs(treatment_rate, size=n_days)

    # Test difference
    from scipy.stats import ttest_ind
    t_stat, p_value = ttest_ind(control_conversions, treatment_conversions)

    print(f"\nA/B Test (Poisson events):")
    print(f"Control: mean={control_conversions.mean():.2f}, total={control_conversions.sum()}")
    print(f"Treatment: mean={treatment_conversions.mean():.2f}, total={treatment_conversions.sum()}")
    print(f"p-value: {p_value:.4f}")

    return p_value

# Application 3: Call center staffing
def call_center_staffing(avg_calls_per_hour, service_time_minutes):
    """Determine number of agents needed"""
    lambda_per_minute = avg_calls_per_hour / 60

    # Erlang C formula approximation
    # For simplicity, use rule of thumb:
    # Need enough capacity for 90th percentile

    calls_90th = poisson.ppf(0.90, lambda_per_minute)
    agents_needed = int(np.ceil(calls_90th * service_time_minutes))

    print(f"\nCall center staffing:")
    print(f"Average: {lambda_per_minute:.2f} calls/minute")
    print(f"90th percentile: {calls_90th:.0f} calls/minute")
    print(f"Agents needed: {agents_needed}")

    return agents_needed

# Run examples
server_load_analysis()
ab_test_poisson(control_rate=3, treatment_rate=3.5, n_days=30)
call_center_staffing(avg_calls_per_hour=120, service_time_minutes=5)

Poisson Approximation to Binomial:

# When n large, p small, np moderate: Binomial ≈ Poisson

n = 1000
p = 0.005
lambda_approx = n * p

k = np.arange(0, 15)

# Exact binomial
from scipy.stats import binom
pmf_binom = binom.pmf(k, n, p)

# Poisson approximation
pmf_poisson = poisson.pmf(k, lambda_approx)

plt.figure(figsize=(10, 6))
plt.bar(k - 0.2, pmf_binom, width=0.4, label='Binomial', alpha=0.7)
plt.bar(k + 0.2, pmf_poisson, width=0.4, label='Poisson approx', alpha=0.7)
plt.xlabel('k')
plt.ylabel('Probability')
plt.title(f'Binomial({n}, {p}) ≈ Poisson({lambda_approx})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Error
max_error = np.max(np.abs(pmf_binom - pmf_poisson))
print(f"\nMax approximation error: {max_error:.6f}")

Interviewer's Insight

What they're testing: Practical probability knowledge.

Strong answer signals:

  • Formula: λᵏe⁻λ/k!
  • "Count of rare events"
  • E[X] = Var(X) = λ
  • Examples: web traffic, defects, calls
  • Approximates binomial when n↑, p↓

What is the Law of Total Probability? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Probability Theory, Fundamentals, Partition | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Law of Total Probability:

If {B₁, B₂, ..., Bₙ} partition the sample space:

\[P(A) = \sum_{i=1}^n P(A|B_i) P(B_i)\]

Continuous version:

\[P(A) = \int P(A|B=b) P(B=b) db\]

Example - Medical Testing:

import numpy as np
import matplotlib.pyplot as plt

# Disease prevalence by age group
age_groups = ['Young', 'Middle', 'Senior']
age_probs = [0.30, 0.50, 0.20]  # P(Age group)

# Disease rate in each age group
disease_rate = {
    'Young': 0.01,
    'Middle': 0.05,
    'Senior': 0.15
}

# Law of total probability: P(Disease)
p_disease = sum(age_probs[i] * disease_rate[group] 
                for i, group in enumerate(age_groups))

print("Law of Total Probability:")
print(f"P(Disease) = ", end="")
for i, group in enumerate(age_groups):
    print(f"P(Disease|{group})·P({group})", end="")
    if i < len(age_groups) - 1:
        print(" + ", end="")
print(f"\n           = {p_disease:.4f}")

# Breakdown
print("\nContributions:")
for i, group in enumerate(age_groups):
    contrib = age_probs[i] * disease_rate[group]
    print(f"{group}: {age_probs[i]:.2f} × {disease_rate[group]:.2f} = {contrib:.4f}")

Example - Machine Learning:

# Classification error rate

# Classes and their proportions
classes = ['A', 'B', 'C']
class_probs = [0.5, 0.3, 0.2]

# Error rate for each class
error_rates = {
    'A': 0.10,
    'B': 0.15,
    'C': 0.20
}

# Total error rate (law of total probability)
total_error = sum(class_probs[i] * error_rates[cls] 
                 for i, cls in enumerate(classes))

print(f"\nML Example:")
print(f"Overall error rate: {total_error:.2%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution
axes[0].bar(classes, class_probs, edgecolor='black', alpha=0.7)
axes[0].set_title('Class Distribution P(Class)')
axes[0].set_ylabel('Probability')
axes[0].grid(True, alpha=0.3)

# Error contributions
contributions = [class_probs[i] * error_rates[cls] 
                for i, cls in enumerate(classes)]

axes[1].bar(classes, contributions, edgecolor='black', alpha=0.7, color='orange')
axes[1].axhline(y=total_error, color='red', linestyle='--', 
                linewidth=2, label=f'Total error: {total_error:.2%}')
axes[1].set_title('Error Contribution by Class')
axes[1].set_ylabel('Contribution to total error')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Example - Continuous:

# Mixture of normals

# Component probabilities
weights = [0.3, 0.7]

# Component parameters
means = [0, 5]
stds = [1, 2]

from scipy.stats import norm

# Total density at x
def total_density(x):
    return sum(weights[i] * norm.pdf(x, means[i], stds[i]) 
              for i in range(len(weights)))

# Plot
x = np.linspace(-5, 10, 1000)

plt.figure(figsize=(10, 6))

# Individual components
for i in range(len(weights)):
    y = weights[i] * norm.pdf(x, means[i], stds[i])
    plt.plot(x, y, '--', linewidth=2, label=f'Component {i+1}', alpha=0.7)

# Total density
y_total = [total_density(xi) for xi in x]
plt.plot(x, y_total, 'k-', linewidth=3, label='Total (Law of Total Prob)')

plt.xlabel('x')
plt.ylabel('Density')
plt.title('Continuous Law of Total Probability')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Connection to Bayes:

# Law of total probability gives denominator in Bayes' theorem

# P(B|A) = P(A|B)P(B) / P(A)
#
# Where P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)  [Law of total probability]

# Example: Disease testing
p_pos_given_disease = 0.95  # Sensitivity
p_pos_given_no_disease = 0.05  # FPR
p_disease = 0.01  # Prevalence

# P(Positive) by law of total probability
p_positive = (p_pos_given_disease * p_disease + 
             p_pos_given_no_disease * (1 - p_disease))

# P(Disease | Positive) by Bayes
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive

print(f"\nBayes + Law of Total Prob:")
print(f"P(Positive) = {p_positive:.4f}")
print(f"P(Disease|Positive) = {p_disease_given_pos:.4f}")

Interviewer's Insight

What they're testing: Fundamental probability rules.

Strong answer signals:

  • "Partition sample space"
  • Formula: Σ P(A|Bᵢ)P(Bᵢ)
  • "Weighted average over conditions"
  • Connection to Bayes denominator
  • Example: disease by age group

Explain the Geometric Distribution - Google, Amazon Interview Question

Difficulty: 🟢 Easy | Tags: Distributions, Geometric, Waiting Time | Asked by: Google, Amazon, Meta

View Answer

Geometric Distribution:

Number of independent Bernoulli(p) trials needed to get the first success.

\[P(X=k) = (1-p)^{k-1} p\]

Properties:

  • E[X] = 1/p
  • Var(X) = (1-p)/p²
  • Memoryless: P(X > n+k | X > n) = P(X > k)

Implementation:

import numpy as np
from scipy.stats import geom
import matplotlib.pyplot as plt

# Example: Coin flips until heads
p = 0.3  # P(heads)

# PMF
k = np.arange(1, 21)
pmf = geom.pmf(k, p)

plt.figure(figsize=(12, 4))

# PMF
plt.subplot(1, 3, 1)
plt.bar(k, pmf, edgecolor='black', alpha=0.7)
plt.xlabel('Number of trials until success')
plt.ylabel('Probability')
plt.title(f'Geometric PMF (p={p})')
plt.grid(True, alpha=0.3)

# Different p values
plt.subplot(1, 3, 2)
for p_val in [0.1, 0.3, 0.5, 0.8]:
    pmf = geom.pmf(k, p_val)
    plt.plot(k, pmf, 'o-', label=f'p={p_val}', markersize=6)
plt.xlabel('k')
plt.ylabel('P(X=k)')
plt.title('Different Success Probabilities')
plt.legend()
plt.grid(True, alpha=0.3)

# CDF
plt.subplot(1, 3, 3)
cdf = geom.cdf(k, p)
plt.plot(k, cdf, 'bo-', linewidth=2, markersize=6)
plt.xlabel('k')
plt.ylabel('P(X ≤ k)')
plt.title('CDF')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Properties
print(f"Geometric(p={p}):")
print(f"E[X] = {geom.mean(p):.2f} (theory: {1/p:.2f})")
print(f"Var(X) = {geom.var(p):.2f} (theory: {(1-p)/p**2:.2f})")

Memoryless Property:

# "The coin doesn't remember past failures"

p = 0.3

# P(X > 5)
prob_more_than_5 = 1 - geom.cdf(5, p)

# P(X > 10 | X > 5) = P(X > 5)
# Conditional probability
prob_10_given_5 = (1 - geom.cdf(10, p)) / (1 - geom.cdf(5, p))

print(f"\nMemoryless property:")
print(f"P(X > 5) = {prob_more_than_5:.4f}")
print(f"P(X > 10 | X > 5) = {prob_10_given_5:.4f}")
print(f"Equal? {np.isclose(prob_more_than_5, prob_10_given_5)}")

# Simulation
n_sims = 100000

# All trials
trials = geom.rvs(p, size=n_sims)

# Conditional: given X > 5
conditional = trials[trials > 5] - 5  # Reset counter

# Check distributions match
from scipy.stats import ks_2samp
stat, p_value = ks_2samp(trials, conditional)
print(f"\nKS test p-value: {p_value:.4f}")
print("(High p-value confirms memoryless property)")

Applications:

# Application 1: Customer acquisition
def customer_acquisition(conversion_rate=0.05):
    """Expected ads until conversion"""
    expected_ads = 1 / conversion_rate

    # P(convert within 50 ads)
    prob_within_50 = geom.cdf(50, conversion_rate)

    print(f"\nCustomer acquisition (p={conversion_rate}):")
    print(f"Expected ads to convert: {expected_ads:.0f}")
    print(f"P(convert within 50 ads): {prob_within_50:.2%}")

    # Budget planning: 95% confidence
    ads_95 = geom.ppf(0.95, conversion_rate)
    print(f"95th percentile: {ads_95:.0f} ads")

# Application 2: Reliability testing
def reliability_testing(failure_rate=0.01):
    """Device testing until first failure"""
    expected_tests = 1 / failure_rate

    # P(survive at least 100 tests)
    prob_survive_100 = 1 - geom.cdf(100, failure_rate)

    print(f"\nReliability testing (p={failure_rate}):")
    print(f"Expected tests until failure: {expected_tests:.0f}")
    print(f"P(survive ≥100 tests): {prob_survive_100:.2%}")

# Application 3: A/B test duration
def ab_test_duration(daily_conversion=0.10):
    """Days until first conversion"""
    expected_days = 1 / daily_conversion

    # Distribution of wait time
    days = np.arange(1, 31)
    probs = geom.pmf(days, daily_conversion)

    plt.figure(figsize=(10, 6))
    plt.bar(days, probs, edgecolor='black', alpha=0.7)
    plt.xlabel('Days until first conversion')
    plt.ylabel('Probability')
    plt.title(f'A/B Test First Conversion (p={daily_conversion})')
    plt.axvline(expected_days, color='red', linestyle='--', 
               linewidth=2, label=f'Expected: {expected_days:.1f} days')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    print(f"\nA/B test duration:")
    print(f"Expected days: {expected_days:.1f}")

customer_acquisition()
reliability_testing()
ab_test_duration()

Interviewer's Insight

What they're testing: Discrete distributions knowledge.

Strong answer signals:

  • "Trials until first success"
  • E[X] = 1/p
  • Memoryless property
  • Examples: waiting time, retries
  • Relates to exponential (continuous)

What is Power Analysis in Hypothesis Testing? - Most Tech Companies Interview Question

Difficulty: 🟡 Medium | Tags: Power, Sample Size, Hypothesis Testing | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Power:

Probability of correctly rejecting H₀ when it's false.

\[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ true})\]

Four key quantities:

  1. Effect size (Δ)
  2. Sample size (n)
  3. Significance level (α)
  4. Power (1-β)

Given any three of these, you can solve for the fourth.

Implementation:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def power_analysis_proportion(p0, p1, alpha=0.05, power=0.80):
    """
    Calculate sample size for proportion test
    H₀: p = p0 vs H₁: p = p1
    """
    # Standard errors
    se0 = np.sqrt(p0 * (1 - p0))
    se1 = np.sqrt(p1 * (1 - p1))

    # Critical value for two-sided test
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Sample size formula
    n = ((z_alpha * se0 + z_beta * se1) / (p1 - p0))**2

    return int(np.ceil(n))

# Example: A/B test
p_control = 0.10  # Current conversion rate
p_treatment = 0.12  # Target improvement

n_needed = power_analysis_proportion(p_control, p_treatment)

print(f"Power Analysis:")
print(f"Control rate: {p_control:.1%}")
print(f"Treatment rate: {p_treatment:.1%}")
print(f"Effect size: {p_treatment - p_control:.1%}")
print(f"α = 0.05, Power = 0.80")
print(f"Sample size needed: {n_needed:,} per group")

Power Curve:

def compute_power(n, p0, p1, alpha=0.05):
    """Compute power for given sample size"""
    # Critical value
    z_alpha = stats.norm.ppf(1 - alpha/2)

    # Under H₁
    se0 = np.sqrt(p0 * (1 - p0) / n)
    se1 = np.sqrt(p1 * (1 - p1) / n)

    # Critical region boundaries
    critical_lower = p0 - z_alpha * se0
    critical_upper = p0 + z_alpha * se0

    # Power: P(fall in rejection region | H₁)
    z_lower = (critical_lower - p1) / se1
    z_upper = (critical_upper - p1) / se1

    power = stats.norm.cdf(z_lower) + (1 - stats.norm.cdf(z_upper))

    return power

# Plot power vs sample size
sample_sizes = np.arange(100, 5000, 50)
powers = [compute_power(n, p_control, p_treatment) for n in sample_sizes]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(sample_sizes, powers, 'b-', linewidth=2)
plt.axhline(0.80, color='red', linestyle='--', label='Target power=0.80')
plt.axvline(n_needed, color='green', linestyle='--', 
            label=f'n={n_needed}')
plt.xlabel('Sample size per group')
plt.ylabel('Power')
plt.title('Power vs Sample Size')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot power vs effect size
effect_sizes = np.linspace(0.005, 0.05, 50)
n_fixed = 1000
powers = [compute_power(n_fixed, p_control, p_control + delta) 
         for delta in effect_sizes]

plt.subplot(1, 2, 2)
plt.plot(effect_sizes * 100, powers, 'b-', linewidth=2)
plt.axhline(0.80, color='red', linestyle='--', label='Target power=0.80')
plt.xlabel('Effect size (%)')
plt.ylabel('Power')
plt.title(f'Power vs Effect Size (n={n_fixed})')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Simulation-based Power:

def simulate_power(n, p0, p1, alpha=0.05, n_sims=10000):
    """Estimate power via simulation"""
    rejections = 0

    for _ in range(n_sims):
        # Generate data under H₁
        data = np.random.binomial(1, p1, n)
        p_hat = data.mean()

        # Test H₀: p = p0
        se = np.sqrt(p0 * (1 - p0) / n)
        z = (p_hat - p0) / se

        # Two-sided test
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        if p_value < alpha:
            rejections += 1

    return rejections / n_sims

# Verify analytical power
n_test = 2000
power_analytical = compute_power(n_test, p_control, p_treatment)
power_simulated = simulate_power(n_test, p_control, p_treatment)

print(f"\nPower validation (n={n_test}):")
print(f"Analytical: {power_analytical:.3f}")
print(f"Simulated: {power_simulated:.3f}")

Trade-offs:

# Explore α vs power trade-off

alphas = [0.01, 0.05, 0.10]
n = 1500

print(f"\nα vs Power trade-off (n={n}):")
for alpha in alphas:
    power = compute_power(n, p_control, p_treatment, alpha=alpha)
    print(f"α = {alpha:.2f}: Power = {power:.3f}")

# More liberal α → higher power (but more Type I errors)

Interviewer's Insight

What they're testing: Experimental design knowledge.

Strong answer signals:

  • "Power = 1 - β"
  • "P(reject H₀ | H₁ true)"
  • Four quantities: α, power, n, effect
  • "Need before collecting data"
  • Trade-off: sample size vs power

Explain the Negative Binomial Distribution - Google, Amazon Interview Question

Difficulty: 🟡 Medium | Tags: Distributions, Negative Binomial, Waiting Time | Asked by: Google, Amazon, Meta

View Answer

Negative Binomial:

Number of trials needed to get r successes (generalizes the geometric distribution, which is the r = 1 case).

\[P(X=k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}\]

Properties:

  • E[X] = r/p
  • Var(X) = r(1-p)/p²
  • Sum of r independent Geometric(p) random variables (verified by simulation below)

Implementation:

import numpy as np
from scipy.stats import nbinom
import matplotlib.pyplot as plt

# Example: Trials until 5 successes
r = 5  # Number of successes
p = 0.3  # Success probability

# PMF (scipy uses different parameterization: n=r, p=p)
k = np.arange(r, 50)
pmf = nbinom.pmf(k - r, r, p)  # k-r failures before r successes

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.bar(k, pmf, edgecolor='black', alpha=0.7)
plt.xlabel('Number of trials until r successes')
plt.ylabel('Probability')
plt.title(f'Negative Binomial (r={r}, p={p})')
plt.axvline(r/p, color='red', linestyle='--', linewidth=2, label=f'E[X]={r/p:.1f}')
plt.legend()
plt.grid(True, alpha=0.3)

# Different r values
plt.subplot(1, 2, 2)
for r_val in [1, 3, 5, 10]:
    k_plot = np.arange(r_val, 80)
    pmf = nbinom.pmf(k_plot - r_val, r_val, p)
    plt.plot(k_plot, pmf, 'o-', label=f'r={r_val}', markersize=4)
plt.xlabel('k (trials)')
plt.ylabel('P(X=k)')
plt.title(f'Different r (p={p})')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Negative Binomial(r={r}, p={p}):")
print(f"E[X] = {r/p:.2f}")
print(f"Var(X) = {r*(1-p)/p**2:.2f}")

Overdispersion Modeling:

# Negative binomial for count data with variance > mean

# Compare Poisson vs Negative Binomial

# Simulate overdispersed count data
np.random.seed(42)

# Negative binomial can model overdispersion
# Poisson has Var = Mean, NB has Var > Mean

from scipy.stats import poisson

# True: Negative Binomial
r_true = 5
p_true = 0.5
data = nbinom.rvs(r_true, p_true, size=1000)

print(f"\nData statistics:")
print(f"Mean: {np.mean(data):.2f}")
print(f"Variance: {np.var(data):.2f}")
print(f"Variance/Mean: {np.var(data)/np.mean(data):.2f}")

# Fit Poisson (will be poor fit)
lambda_mle = np.mean(data)

# Fit Negative Binomial
# (In practice, use MLE; here we use true params)

# Compare fits
count_range = np.arange(0, 30)

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, density=True, alpha=0.5, label='Data', edgecolor='black')

# Poisson fit
poisson_pmf = poisson.pmf(count_range, lambda_mle)
plt.plot(count_range, poisson_pmf, 'ro-', linewidth=2, label='Poisson', markersize=6)

# Negative Binomial fit
nb_pmf = nbinom.pmf(count_range, r_true, p_true)
plt.plot(count_range, nb_pmf, 'bo-', linewidth=2, label='Negative Binomial', markersize=6)

plt.xlabel('Count')
plt.ylabel('Probability')
plt.title('Negative Binomial handles overdispersion')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
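
The fitting step skipped above can be sketched with method-of-moments estimates (simpler than full MLE, and reusing the data, r_true, and p_true defined above):

# Method of moments for scipy's nbinom (X = number of failures before r successes):
#   E[X] = r(1-p)/p,  Var(X) = r(1-p)/p²  =>  p̂ = mean/var,  r̂ = mean²/(var - mean)
m, v = data.mean(), data.var()
p_hat = m / v
r_hat = m**2 / (v - m)
print(f"\nMethod-of-moments fit: r = {r_hat:.2f} (true {r_true}), p = {p_hat:.2f} (true {p_true})")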

Applications:

  • Customer retention: Trials until r customers acquired
  • Reliability: Tests until r failures
  • Overdispersed counts: When Poisson doesn't fit (Var > Mean)

Interviewer's Insight

What they're testing: Extended distributions knowledge.

Strong answer signals:

  • "Trials until r successes"
  • Generalizes geometric (r=1)
  • E[X] = r/p
  • "Models overdispersion"
  • Variance > mean (vs Poisson)

What is the Exponential Distribution? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Distributions, Exponential, Memoryless | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Exponential Distribution:

Continuous analog of the geometric distribution: the waiting time until an event occurs.

\[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

Properties:

  • E[X] = 1/λ
  • Var(X) = 1/λ²
  • Memoryless: P(X > s+t | X > s) = P(X > t)
  • CDF: \(F(x) = 1 - e^{-\lambda x}\), which is invertible in closed form (see the sampling sketch below)

Implementation:

import numpy as np
from scipy.stats import expon
import matplotlib.pyplot as plt

# Example: Server response time
lambda_rate = 2  # Events per unit time (rate)
mean_time = 1 / lambda_rate  # Mean = 1/λ

# PDF and CDF
x = np.linspace(0, 5, 1000)
pdf = expon.pdf(x, scale=mean_time)  # scipy uses scale=1/λ
cdf = expon.cdf(x, scale=mean_time)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(x, pdf, 'b-', linewidth=2)
plt.fill_between(x, 0, pdf, alpha=0.3)
plt.xlabel('Time')
plt.ylabel('Density')
plt.title(f'Exponential PDF (λ={lambda_rate})')
plt.axvline(mean_time, color='red', linestyle='--', label=f'Mean={mean_time:.2f}')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(x, cdf, 'b-', linewidth=2)
plt.xlabel('Time')
plt.ylabel('P(X ≤ x)')
plt.title('CDF')
plt.axhline(0.5, color='gray', linestyle=':', alpha=0.5)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Exponential(λ={lambda_rate}):")
print(f"Mean: {mean_time:.3f}")
print(f"Std: {1/lambda_rate:.3f}")
print(f"Median: {np.log(2)/lambda_rate:.3f}")

Memoryless Property:

# Time already waited doesn't affect future waiting time

lambda_rate = 1

# P(X > 2)
prob_gt_2 = 1 - expon.cdf(2, scale=1/lambda_rate)

# P(X > 5 | X > 3) should equal P(X > 2)
prob_gt_5 = 1 - expon.cdf(5, scale=1/lambda_rate)
prob_gt_3 = 1 - expon.cdf(3, scale=1/lambda_rate)
prob_conditional = prob_gt_5 / prob_gt_3

print(f"\nMemoryless property:")
print(f"P(X > 2) = {prob_gt_2:.4f}")
print(f"P(X > 5 | X > 3) = {prob_conditional:.4f}")
print(f"Equal? {np.isclose(prob_gt_2, prob_conditional)}")

# Simulation
samples = expon.rvs(scale=1/lambda_rate, size=100000)

# Conditional samples
conditional = samples[samples > 3] - 3

print(f"\nSimulation:")
print(f"Mean of all samples: {samples.mean():.3f}")
print(f"Mean of conditional: {conditional.mean():.3f}")
# Both have same mean!

Applications:

# Application 1: System reliability
def reliability_analysis(failure_rate=0.001, mission_time=1000):
    """
    failure_rate: failures per hour
    mission_time: hours
    """
    # P(survive mission)
    reliability = 1 - expon.cdf(mission_time, scale=1/failure_rate)

    # Mean time to failure
    mttf = 1 / failure_rate

    print(f"\nReliability Analysis:")
    print(f"Failure rate: {failure_rate} per hour")
    print(f"MTTF: {mttf:.0f} hours")
    print(f"P(survive {mission_time}h): {reliability:.2%}")

    # Time for 90% survival
    t_90 = expon.ppf(0.10, scale=1/failure_rate)  # F(t) = 0.10
    print(f"90% survive up to: {t_90:.0f} hours")

# Application 2: Queue waiting time
def queue_analysis(arrival_rate=10):
    """
    arrival_rate: customers per minute
    """
    # Time between arrivals ~ Exp(arrival_rate)
    mean_interarrival = 1 / arrival_rate

    # P(wait < 0.1 minutes)
    prob_short_wait = expon.cdf(0.1, scale=mean_interarrival)

    print(f"\nQueue Analysis:")
    print(f"Arrival rate: {arrival_rate} per minute")
    print(f"Mean interarrival: {mean_interarrival*60:.1f} seconds")
    print(f"P(next arrival < 6 seconds): {prob_short_wait:.2%}")

# Application 3: Poisson process connection
def poisson_connection():
    """Exponential interarrival → Poisson count"""
    lambda_rate = 5  # Rate
    T = 10  # Time interval

    # Simulate the Poisson process by accumulating exponential interarrival times

    # Count events in [0, T]
    events_in_T = []
    for _ in range(10000):
        times = np.cumsum(expon.rvs(scale=1/lambda_rate, size=100))
        count = np.sum(times <= T)
        events_in_T.append(count)

    # Should be Poisson(λT)
    expected_count = lambda_rate * T

    print(f"\nPoisson Process:")
    print(f"Rate: {lambda_rate} per unit time")
    print(f"Interval: {T}")
    print(f"Expected count: {expected_count}")
    print(f"Simulated mean: {np.mean(events_in_T):.2f}")

reliability_analysis()
queue_analysis()
poisson_connection()

Interviewer's Insight

What they're testing: Continuous distributions.

Strong answer signals:

  • Formula: λe^(-λx)
  • "Time until event"
  • Memoryless property
  • E[X] = 1/λ
  • Connection to Poisson process
  • Examples: failures, queues

Explain Variance Reduction Techniques in Monte Carlo - Google, Meta Interview Question

Difficulty: 🔴 Hard | Tags: Monte Carlo, Variance Reduction, Simulation | Asked by: Google, Meta, Amazon

View Answer

Variance Reduction:

Techniques that reduce the variance of Monte Carlo estimates without increasing the number of samples.

1. Antithetic Variables:

import numpy as np
import matplotlib.pyplot as plt

# Estimate E[f(X)] where X ~ Uniform(0,1)
def f(x):
    return x**2

# Standard Monte Carlo
def standard_mc(n):
    samples = np.random.uniform(0, 1, n)
    return np.mean(f(samples))

# Antithetic variables
def antithetic_mc(n):
    # Use U and 1-U (negatively correlated)
    u = np.random.uniform(0, 1, n//2)
    samples1 = f(u)
    samples2 = f(1 - u)  # Antithetic
    return np.mean(np.concatenate([samples1, samples2]))

# Compare variances
n_trials = 1000
n_samples = 100

estimates_standard = [standard_mc(n_samples) for _ in range(n_trials)]
estimates_antithetic = [antithetic_mc(n_samples) for _ in range(n_trials)]

true_value = 1/3  # ∫₀¹ x² dx

print("Antithetic Variables:")
print(f"True value: {true_value:.4f}")
print(f"Standard MC variance: {np.var(estimates_standard):.6f}")
print(f"Antithetic variance: {np.var(estimates_antithetic):.6f}")
print(f"Variance reduction: {np.var(estimates_standard)/np.var(estimates_antithetic):.2f}x")

2. Control Variates:

# Use correlation with known expectation

def control_variate_mc(n):
    """
    Estimate E[e^X] where X ~ N(0,1)
    Use Y = X as control (E[Y] = 0 is known)
    """
    X = np.random.randn(n)
    Y = X  # Control variate

    # Target
    f_X = np.exp(X)

    # Optimal coefficient
    cov_fY = np.cov(f_X, Y)[0, 1]
    var_Y = np.var(Y)
    c = cov_fY / var_Y
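    # Note: estimating c from the same samples adds a small bias; it vanishes for large n (or use a pilot sample)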

    # Controlled estimator
    estimate = np.mean(f_X - c * (Y - 0))  # E[Y] = 0

    return estimate

# Compare
estimates_naive = []
estimates_cv = []

for _ in range(1000):
    X = np.random.randn(100)
    estimates_naive.append(np.mean(np.exp(X)))
    estimates_cv.append(control_variate_mc(100))

true_value = np.exp(0.5)  # E[e^X] for X~N(0,1)

print("\nControl Variates:")
print(f"True value: {true_value:.4f}")
print(f"Naive variance: {np.var(estimates_naive):.6f}")
print(f"Control variate variance: {np.var(estimates_cv):.6f}")
print(f"Variance reduction: {np.var(estimates_naive)/np.var(estimates_cv):.2f}x")

3. Stratified Sampling:

# Divide domain into strata, sample proportionally

def stratified_sampling(f, n_samples, n_strata=4):
    """Estimate ∫₀¹ f(x) dx using stratification"""
    estimates = []

    # Divide [0,1] into strata
    strata_size = 1 / n_strata
    samples_per_stratum = n_samples // n_strata

    for i in range(n_strata):
        # Sample uniformly within stratum
        lower = i * strata_size
        upper = (i + 1) * strata_size

        samples = np.random.uniform(lower, upper, samples_per_stratum)
        stratum_estimate = np.mean(f(samples)) * strata_size
        estimates.append(stratum_estimate)

    return np.sum(estimates)

# Compare
f = lambda x: np.sin(np.pi * x)  # True integral = 2/π
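# Note: standard_mc (defined earlier) reads the global f at call time, so it now estimates this sine integral too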

estimates_standard = [standard_mc(100) for _ in range(1000)]
estimates_stratified = [stratified_sampling(f, 100) for _ in range(1000)]

true_value = 2 / np.pi

print("\nStratified Sampling:")
print(f"True value: {true_value:.4f}")
print(f"Standard variance: {np.var(estimates_standard):.6f}")
print(f"Stratified variance: {np.var(estimates_stratified):.6f}")
print(f"Variance reduction: {np.var(estimates_standard)/np.var(estimates_stratified):.2f}x")

4. Importance Sampling:

# Sample from different distribution, reweight

def importance_sampling_mc(n):
    """
    Estimate E[X²] for X ~ Exp(1) using importance sampling.
    Proposal: Exp(0.5), i.e. scale=2: a heavier tail than the target,
    so it puts more mass where x²·e^(-x) is large and the weights stay bounded.
    """
    from scipy.stats import expon

    # Sample from the heavier-tailed proposal (slower decay than the target)
    samples = expon.rvs(scale=2.0, size=n)  # Exp(0.5)

    # Target: Exp(1)
    # Weights: p(x) / q(x) = e^(-x) / (0.5·e^(-x/2)) ≤ 2
    weights = expon.pdf(samples, scale=1) / expon.pdf(samples, scale=2.0)

    # Weighted average
    estimate = np.mean(samples**2 * weights)

    return estimate

estimates_naive = []
estimates_is = []

from scipy.stats import expon

for _ in range(1000):
    # Naive
    samples_naive = expon.rvs(scale=1, size=100)
    estimates_naive.append(np.mean(samples_naive**2))

    # Importance sampling
    estimates_is.append(importance_sampling_mc(100))

true_value = 2  # E[X²] for Exp(1)

print("\nImportance Sampling:")
print(f"True value: {true_value:.4f}")
print(f"Naive variance: {np.var(estimates_naive):.6f}")
print(f"Importance sampling variance: {np.var(estimates_is):.6f}")
print(f"Variance reduction: {np.var(estimates_naive)/np.var(estimates_is):.2f}x")

Summary:

| Technique | Idea | Best for |
|-----------|------|----------|
| Antithetic | Use negatively correlated samples | Smooth functions |
| Control variates | Use correlation with known E[Y] | Related known quantity |
| Stratified | Sample proportionally from strata | Non-uniform importance |
| Importance | Sample from better distribution | Rare events |

Interviewer's Insight

What they're testing: Advanced Monte Carlo methods.

Strong answer signals:

  • Lists multiple techniques
  • Antithetic: 1-U
  • Control: use E[Y] known
  • Stratified: divide domain
  • Importance: reweight samples
  • "Reduce variance, same # samples"

What is the Chi-Square Distribution? - Most Tech Companies Interview Question

Difficulty: 🟢 Easy | Tags: Distributions, Chi-Square, Hypothesis Testing | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Chi-Square Distribution:

Sum of k independent squared standard normals.

\[X = Z_1^2 + Z_2^2 + ... + Z_k^2 \sim \chi^2_k\]

Properties:

  • E[X] = k (degrees of freedom)
  • Var(X) = 2k
  • Non-negative, right-skewed
  • As k→∞, approaches a normal with mean k and variance 2k (checked numerically below)

Implementation:

import numpy as np
from scipy.stats import chi2, norm
import matplotlib.pyplot as plt

# PDF for different df
x = np.linspace(0, 30, 1000)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for df in [1, 2, 3, 5, 10]:
    pdf = chi2.pdf(x, df)
    plt.plot(x, pdf, linewidth=2, label=f'df={df}')

plt.xlabel('x')
plt.ylabel('Density')
plt.title('Chi-Square PDF')
plt.legend()
plt.grid(True, alpha=0.3)

# Verify: sum of squared normals
plt.subplot(1, 2, 2)

df = 5
n_samples = 10000

# Method 1: Direct chi-square
samples_direct = chi2.rvs(df, size=n_samples)

# Method 2: Sum of squared normals
samples_constructed = np.sum(np.random.randn(n_samples, df)**2, axis=1)

plt.hist(samples_direct, bins=50, density=True, alpha=0.5, label='chi2.rvs()', edgecolor='black')
plt.hist(samples_constructed, bins=50, density=True, alpha=0.5, label='Sum of Z²', edgecolor='black')

# Theoretical
plt.plot(x, chi2.pdf(x, df), 'k-', linewidth=2, label='Theory')

plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Chi-Square (df={df})')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Chi-Square(df={df}):")
print(f"E[X] = {df}")
print(f"Sample mean: {samples_direct.mean():.2f}")
print(f"Var(X) = {2*df}")
print(f"Sample variance: {samples_direct.var():.2f}")

Goodness-of-Fit Test:

# Test if data follows hypothesized distribution

# Example: Die fairness
observed = np.array([45, 52, 48, 55, 50, 50])  # Rolls
expected = np.array([50, 50, 50, 50, 50, 50])  # Fair die

# Chi-square test statistic
chi2_stat = np.sum((observed - expected)**2 / expected)

# p-value
df = len(observed) - 1  # k - 1
p_value = 1 - chi2.cdf(chi2_stat, df)

print(f"\nGoodness-of-Fit Test:")
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"χ² statistic: {chi2_stat:.3f}")
print(f"df: {df}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Fair' if p_value > 0.05 else 'Biased'} die")

# Visualize
from scipy.stats import chisquare
stat, p = chisquare(observed, expected)
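# (scipy's chisquare returns the same statistic and p-value as the manual computation above)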

x_plot = np.linspace(0, 20, 1000)
pdf = chi2.pdf(x_plot, df)

plt.figure(figsize=(10, 6))
plt.plot(x_plot, pdf, 'b-', linewidth=2, label=f'χ²({df})')
plt.axvline(chi2_stat, color='red', linestyle='--', linewidth=2, label=f'Statistic={chi2_stat:.2f}')

# Critical region
critical = chi2.ppf(0.95, df)
plt.axvline(critical, color='orange', linestyle='--', linewidth=2, label=f'Critical={critical:.2f}')
plt.fill_between(x_plot[x_plot >= critical], 0, chi2.pdf(x_plot[x_plot >= critical], df), 
                 alpha=0.3, color='orange', label='Rejection region (α=0.05)')

plt.xlabel('χ²')
plt.ylabel('Density')
plt.title('Chi-Square Goodness-of-Fit Test')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Independence Test:

# Test independence in contingency table

from scipy.stats import chi2_contingency

# Example: Gender vs Product preference
observed = np.array([
    [30, 20, 10],  # Male
    [20, 30, 20]   # Female
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"\nIndependence Test:")
print("Observed:")
print(observed)
print("\nExpected (if independent):")
print(expected.round(2))
print(f"\nχ² statistic: {chi2_stat:.3f}")
print(f"df: {dof}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Independent' if p_value > 0.05 else 'Dependent'}")

Applications:

  • Goodness-of-fit tests
  • Independence tests
  • Variance estimation
  • Confidence intervals for variance (sketched below)
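
The last application can be sketched directly: for a normal sample, (n-1)s²/σ² follows χ² with n-1 degrees of freedom, which yields a confidence interval for the variance.

# 95% CI for σ² of a normal sample: [(n-1)s²/χ²_{0.975}, (n-1)s²/χ²_{0.025}]
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=3, size=50)   # true σ² = 9
n = len(sample)
s2 = sample.var(ddof=1)
ci_low = (n - 1) * s2 / chi2.ppf(0.975, n - 1)
ci_high = (n - 1) * s2 / chi2.ppf(0.025, n - 1)
print(f"Sample variance: {s2:.2f}, 95% CI for σ²: ({ci_low:.2f}, {ci_high:.2f})")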

Interviewer's Insight

What they're testing: Statistical testing knowledge.

Strong answer signals:

  • "Sum of squared normals"
  • E[X] = df, Var = 2·df
  • Goodness-of-fit test
  • Independence test
  • Non-negative, right-skewed

Explain the Multiple Comparisons Problem - Most Tech Companies Interview Question

Difficulty: 🟡 Medium | Tags: Multiple Testing, FWER, FDR | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Multiple Comparisons Problem:

When performing many hypothesis tests, the probability of at least one false positive increases dramatically.

\[P(\text{at least one false positive}) = 1 - (1-\alpha)^m\]

For m tests at α = 0.05:

  • m = 1: 5%
  • m = 10: 40%
  • m = 20: 64%
  • m = 100: 99.4%

Demonstration:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate multiple testing
def simulate_multiple_tests(n_tests, alpha=0.05, n_sims=10000):
    """All null hypotheses are true"""
    false_positives = []

    for _ in range(n_sims):
        # Generate data under null
        p_values = np.random.uniform(0, 1, n_tests)

        # Count false positives
        fp = np.sum(p_values < alpha)
        false_positives.append(fp)

    return np.array(false_positives)

# Test different numbers of comparisons
test_counts = [1, 5, 10, 20, 50, 100]
fwer_empirical = []
fwer_theoretical = []

alpha = 0.05

for m in test_counts:
    fp = simulate_multiple_tests(m, alpha)
    fwer_empirical.append(np.mean(fp > 0))  # At least one FP
    fwer_theoretical.append(1 - (1 - alpha)**m)

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(test_counts, fwer_empirical, 'bo-', linewidth=2, markersize=8, label='Simulated')
plt.plot(test_counts, fwer_theoretical, 'r--', linewidth=2, label='Theoretical')
plt.axhline(alpha, color='green', linestyle=':', linewidth=2, label=f'Target α={alpha}')
plt.xlabel('Number of tests')
plt.ylabel('P(at least one false positive)')
plt.title('Family-Wise Error Rate')
plt.legend()
plt.grid(True, alpha=0.3)

# Distribution of false positives
plt.subplot(1, 2, 2)
m = 20
fp = simulate_multiple_tests(m, alpha)
plt.hist(fp, bins=range(0, m+1), density=True, edgecolor='black', alpha=0.7)
plt.xlabel('Number of false positives')
plt.ylabel('Probability')
plt.title(f'False Positives (m={m} tests, all null true)')
plt.axvline(fp.mean(), color='red', linestyle='--', linewidth=2, 
            label=f'Mean={fp.mean():.2f}')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Multiple Comparisons Problem:")
print(f"α = {alpha}")
for m, fwer in zip(test_counts, fwer_theoretical):
    print(f"m={m:3d} tests: FWER = {fwer:.1%}")

Correction Methods:

# Generate test scenario
np.random.seed(42)
m = 20

# 80% true nulls, 20% false nulls
n_true_null = int(0.8 * m)
n_false_null = m - n_true_null

# p-values
p_true_null = np.random.uniform(0, 1, n_true_null)
p_false_null = np.random.beta(0.5, 10, n_false_null)  # Small p-values

p_values = np.concatenate([p_true_null, p_false_null])
truth = np.array([False]*n_true_null + [True]*n_false_null)

alpha = 0.05

# Method 1: No correction
reject_none = p_values < alpha

# Method 2: Bonferroni
reject_bonf = p_values < alpha / m

# Method 3: Holm-Bonferroni
sorted_idx = np.argsort(p_values)
reject_holm = np.zeros(m, dtype=bool)

for i, idx in enumerate(sorted_idx):
    if p_values[idx] < alpha / (m - i):
        reject_holm[idx] = True
    else:
        break

# Method 4: Benjamini-Hochberg (FDR)
sorted_p = p_values[sorted_idx]
thresholds = np.arange(1, m+1) / m * alpha
comparisons = sorted_p <= thresholds

reject_bh = np.zeros(m, dtype=bool)
if np.any(comparisons):
    k = np.max(np.where(comparisons)[0])
    reject_bh[sorted_idx[:k+1]] = True

# Evaluate
methods = {
    'No correction': reject_none,
    'Bonferroni': reject_bonf,
    'Holm': reject_holm,
    'Benjamini-Hochberg': reject_bh
}

print("\nCorrection Method Comparison:")
print(f"True situation: {n_true_null} nulls true, {n_false_null} nulls false\n")

for name, reject in methods.items():
    tp = np.sum(reject & truth)  # True positives
    fp = np.sum(reject & ~truth)  # False positives
    fn = np.sum(~reject & truth)  # False negatives

    power = tp / n_false_null if n_false_null > 0 else 0
    fdr = fp / reject.sum() if reject.sum() > 0 else 0

    print(f"{name}:")
    print(f"  Rejections: {reject.sum()}")
    print(f"  True positives: {tp}")
    print(f"  False positives: {fp}")
    print(f"  Power: {power:.2%}")
    print(f"  FDR: {fdr:.2%}")
    print()
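
In practice these corrections rarely need to be hand-rolled; statsmodels bundles them, and its results should essentially match the manual implementations above (a sketch, assuming statsmodels is installed):

from statsmodels.stats.multitest import multipletests

for method, label in [('bonferroni', 'Bonferroni'), ('holm', 'Holm'), ('fdr_bh', 'Benjamini-Hochberg')]:
    reject_sm, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method=method)
    print(f"{label} (statsmodels): {reject_sm.sum()} rejections")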

When to Use Each:

| Method | Controls | Use When |
|--------|----------|----------|
| Bonferroni | FWER | Few tests, need strict control |
| Holm | FWER | Uniformly better than Bonferroni |
| BH | FDR | Many tests, exploratory |
| No correction | Nothing | Single planned test only! |

Interviewer's Insight

What they're testing: Multiple testing awareness.

Strong answer signals:

  • "More tests → more false positives"
  • Formula: 1-(1-α)^m
  • Bonferroni: α/m
  • BH for FDR control
  • "Critical in A/B testing, genomics"

Quick Reference: 100+ Interview Questions

| Sno | Question Title | Practice Links | Companies Asking | Difficulty | Topics |
|-----|----------------|----------------|------------------|------------|--------|
| 1 | Basic Probability Concepts: Definitions of Sample Space, Event, Outcome | Wikipedia: Probability | Google, Amazon, Microsoft | Easy | Fundamental Concepts |
| 2 | Conditional Probability and Independence | Khan Academy: Conditional Probability | Google, Facebook, Amazon | Medium | Conditional Probability, Independence |
| 3 | Bayes’ Theorem: Statement and Application | Wikipedia: Bayes' Theorem | Google, Amazon, Microsoft | Medium | Bayesian Inference |
| 4 | Law of Total Probability | Wikipedia: Law of Total Probability | Google, Facebook | Medium | Theoretical Probability |
| 5 | Expected Value and Variance | Khan Academy: Expected Value | Google, Amazon, Facebook | Medium | Random Variables, Moments |
| 6 | Probability Distributions: Discrete vs. Continuous | Wikipedia: Probability Distribution | Google, Amazon, Microsoft | Easy | Distributions |
| 7 | Binomial Distribution: Definition and Applications | Khan Academy: Binomial Distribution | Amazon, Facebook | Medium | Discrete Distributions |
| 8 | Poisson Distribution: Characteristics and Uses | Wikipedia: Poisson Distribution | Google, Amazon | Medium | Discrete Distributions |
| 9 | Exponential Distribution: Properties and Applications | Wikipedia: Exponential Distribution | Google, Amazon | Medium | Continuous Distributions |
| 10 | Normal Distribution and the Central Limit Theorem | Khan Academy: Normal Distribution | Google, Microsoft, Facebook | Medium | Continuous Distributions, CLT |
| 11 | Law of Large Numbers | Wikipedia: Law of Large Numbers | Google, Amazon | Medium | Statistical Convergence |
| 12 | Covariance and Correlation: Definitions and Differences | Khan Academy: Covariance and Correlation | Google, Facebook | Medium | Statistics, Dependency |
| 13 | Moment Generating Functions (MGFs) | Wikipedia: Moment-generating function | Amazon, Microsoft | Hard | Random Variables, Advanced Concepts |
| 14 | Markov Chains: Basics and Applications | Wikipedia: Markov chain | Google, Amazon, Facebook | Hard | Stochastic Processes |
| 15 | Introduction to Stochastic Processes | Wikipedia: Stochastic process | Google, Microsoft | Hard | Advanced Probability |
| 16 | Difference Between Independent and Mutually Exclusive Events | Wikipedia: Independent events | Google, Facebook | Easy | Fundamental Concepts |
| 17 | Geometric Distribution: Concept and Use Cases | Wikipedia: Geometric distribution | Amazon, Microsoft | Medium | Discrete Distributions |
| 18 | Hypergeometric Distribution: When to Use It | Wikipedia: Hypergeometric distribution | Google, Amazon | Medium | Discrete Distributions |
| 19 | Confidence Intervals: Definition and Calculation | Khan Academy: Confidence intervals | Microsoft, Facebook | Medium | Inferential Statistics |
| 20 | Hypothesis Testing: p-values, Type I and Type II Errors | Khan Academy: Hypothesis testing | Google, Amazon, Facebook | Medium | Inferential Statistics |
| 21 | Chi-Squared Test: Basics and Applications | Wikipedia: Chi-squared test | Amazon, Microsoft | Medium | Inferential Statistics |
| 22 | Permutations and Combinations | Khan Academy: Permutations and Combinations | Google, Facebook | Easy | Combinatorics |
| 23 | The Birthday Problem and Its Implications | Wikipedia: Birthday problem | Google, Amazon | Medium | Probability Puzzles |
| 24 | The Monty Hall Problem | Wikipedia: Monty Hall problem | Google, Facebook | Medium | Probability Puzzles, Conditional Probability |
| 25 | Marginal vs. Conditional Probabilities | Khan Academy: Conditional Probability | Google, Amazon | Medium | Theoretical Concepts |
| 26 | Real-World Application of Bayes’ Theorem | Towards Data Science: Bayes’ Theorem Applications | Google, Amazon | Medium | Bayesian Inference |
| 27 | Probability Mass Function (PMF) vs. Probability Density Function (PDF) | Wikipedia: Probability density function | Amazon, Facebook | Medium | Distributions |
| 28 | Cumulative Distribution Function (CDF): Definition and Uses | Wikipedia: Cumulative distribution function | Google, Microsoft | Medium | Distributions |
| 29 | Determining Independence of Events | Khan Academy: Independent Events | Google, Amazon | Easy | Fundamental Concepts |
| 30 | Entropy in Information Theory | Wikipedia: Entropy (information theory) | Google, Facebook | Hard | Information Theory, Probability |
| 31 | Joint Probability Distributions | Khan Academy: Joint Probability | Microsoft, Amazon | Medium | Multivariate Distributions |
| 32 | Conditional Expectation | Wikipedia: Conditional expectation | Google, Facebook | Hard | Advanced Concepts |
| 33 | Sampling Methods: With and Without Replacement | Khan Academy: Sampling | Amazon, Microsoft | Easy | Sampling, Combinatorics |
| 34 | Risk Modeling Using Probability | Investopedia: Risk Analysis | Google, Amazon | Medium | Applications, Finance |
| 35 | In-Depth: Central Limit Theorem and Its Importance | Khan Academy: Central Limit Theorem | Google, Microsoft | Medium | Theoretical Concepts, Distributions |
| 36 | Variance under Linear Transformations | Wikipedia: Variance | Amazon, Facebook | Hard | Advanced Statistics |
| 37 | Quantiles: Definition and Interpretation | Khan Academy: Percentiles | Google, Amazon | Medium | Descriptive Statistics |
| 38 | Common Probability Puzzles and Brain Teasers | Brilliant.org: Probability Puzzles | Google, Facebook | Medium | Puzzles, Recreational Mathematics |
| 39 | Real-World Applications of Probability in Data Science | Towards Data Science (Search for probability applications in DS) | Google, Amazon, Facebook | Medium | Applications, Data Science |
| 40 | Advanced Topic: Introduction to Stochastic Calculus | Wikipedia: Stochastic calculus | Microsoft, Amazon | Hard | Advanced Probability, Finance |

Questions asked in Google interview

  • Bayes’ Theorem: Statement and Application
  • Conditional Probability and Independence
  • The Birthday Problem
  • The Monty Hall Problem
  • Normal Distribution and the Central Limit Theorem
  • Law of Large Numbers

Questions asked in Facebook interview

  • Conditional Probability and Independence
  • Bayes’ Theorem
  • Chi-Squared Test
  • The Monty Hall Problem
  • Entropy in Information Theory

Questions asked in Amazon interview

  • Basic Probability Concepts
  • Bayes’ Theorem
  • Expected Value and Variance
  • Binomial and Poisson Distributions
  • Permutations and Combinations
  • Real-World Applications of Bayes’ Theorem

Questions asked in Microsoft interview

  • Bayes’ Theorem
  • Markov Chains
  • Stochastic Processes
  • Central Limit Theorem
  • Variance under Linear Transformations