A/B Testing Interview Questions

This document provides a curated list of A/B Testing and Experimentation interview questions. It covers statistical foundations, experimental design, metric selection, and advanced topics like interference (network effects) and sequential testing. These topics are critical for roles at data-driven companies such as Netflix, Airbnb, and Uber.


No. | Question Title | Practice Links | Companies Asking | Difficulty | Topics
1 | What is A/B Testing? | Optimizely | Most Tech Companies | Easy | Basics
2 | Explain Null Hypothesis (\(H_0\)) vs Alternative Hypothesis (\(H_1\)) | Khan Academy | Most Tech Companies | Easy | Statistics
3 | What is a p-value? Explain it to a non-technical person. | Harvard Business Review | Google, Meta, Amazon | Medium | Statistics, Communication
4 | What is Statistical Power? | Machine Learning Plus | Google, Netflix, Uber | Medium | Statistics
5 | What is Type I error (False Positive) vs Type II error (False Negative)? | Towards Data Science | Most Tech Companies | Easy | Statistics
6 | How do you calculate sample size for an experiment? | Evan Miller | Google, Amazon, Meta | Medium | Experimental Design
7 | What is Minimum Detectable Effect (MDE)? | StatsEngine | Netflix, Airbnb | Medium | Experimental Design
8 | Explain Confidence Intervals. | Coursera | Most Tech Companies | Easy | Statistics
9 | Difference between One-tailed and Two-tailed tests. | Investopedia | Google, Amazon | Easy | Statistics
10 | What is the Central Limit Theorem? Why is it important? | Khan Academy | Google, HFT Firms | Medium | Statistics
11 | How long should you run an A/B test? | CXL | Airbnb, Booking.com | Medium | Experimental Design
12 | Can you stop an experiment as soon as it reaches significance? (Peeking) | Evan Miller | Netflix, Uber, Airbnb | Hard | Pitfalls
13 | What is SRM (Sample Ratio Mismatch)? How to debug? | Microsoft Research | Microsoft, LinkedIn | Hard | Debugging
14 | What is Randomization Unit vs Analysis Unit? | Udacity | Uber, DoorDash | Medium | Experimental Design
15 | How to handle outliers in A/B testing metrics? | Towards Data Science | Google, Meta | Medium | Data Cleaning
16 | Mean vs Median: Which metric to use? | Stack Overflow | Most Tech Companies | Easy | Metrics
17 | What are Guardrail Metrics? | Airbnb Tech Blog | Airbnb, Netflix | Medium | Metrics
18 | What is a North Star Metric? | Amplitude | Product Roles | Easy | Metrics
19 | Difference between Z-test and T-test. | Statistics By Jim | Google, Amazon | Medium | Statistics
20 | How to test multiple variants? (A/B/n testing) | VWO | Booking.com, Expedia | Medium | Experimental Design
21 | What is the Bonferroni Correction? | Wikipedia | Google, Meta | Hard | Statistics
22 | What is A/A Testing? Why do it? | Optimizely | Microsoft, LinkedIn | Medium | Validity
23 | Explain Covariate Adjustment (CUPED). | Booking.com Data | Booking.com, Microsoft, Meta | Hard | Optimization
24 | How to measure retention in A/B tests? | Reforge | Netflix, Spotify | Medium | Metrics
25 | What is a Novelty Effect? | CXL | Facebook, Instagram | Medium | Pitfalls
26 | What is a Primacy Effect? | CXL | Facebook, Instagram | Medium | Pitfalls
27 | How to handle interference (Network Effects)? | Uber Eng Blog | Uber, Lyft, DoorDash | Hard | Network Effects
28 | What is a Switchback (Time-split) Experiment? | DoorDash Eng | DoorDash, Uber | Hard | Experimental Design
29 | What is Cluster Randomization? | Wikipedia | Facebook, LinkedIn | Hard | Experimental Design
30 | How to test on a 2-sided marketplace? | Lyft Eng | Uber, Lyft, Airbnb | Hard | Marketplace
31 | Explain Bayesian A/B Testing vs Frequentist. | VWO | Stitch Fix, Netflix | Hard | Statistics
32 | What is a Multi-Armed Bandit (MAB)? | Towards Data Science | Netflix, Amazon | Hard | Bandits
33 | Thompson Sampling vs Epsilon-Greedy. | GeeksforGeeks | Netflix, Amazon | Hard | Bandits
34 | How to deal with low traffic experiments? | CXL | Startups | Medium | Strategy
35 | How to select metrics for a new feature? | Product School | Meta, Google | Medium | Metrics
36 | What is Simpson's Paradox? | Britannica | Google, Amazon | Medium | Paradoxes
37 | How to analyze ratio metrics (e.g., CTR)? | Deltamethod | Google, Meta | Hard | Statistics, Delta Method
38 | What is Bootstrapping? When to use it? | Investopedia | Amazon, Netflix | Medium | Statistics
39 | How to detect and handle Seasonality? | Towards Data Science | Retail/E-comm | Medium | Time Series
40 | What is Change Aversion? | Google UX | Google, YouTube | Medium | UX
41 | How to design an experiment for a search algorithm? | Airbnb Eng | Google, Airbnb, Amazon | Hard | Search, Ranking
42 | How to test pricing changes? | PriceIntelligently | Uber, Airbnb | Hard | Pricing, Strategy
43 | What is interference between experiments? | Microsoft Exp | Google, Meta, Microsoft | Hard | Platform
44 | Explain Sequential Testing. | Evan Miller | Optimizely, Netflix | Hard | Statistics
45 | What is Variance Reduction? | Meta Research | Meta, Microsoft, Booking | Hard | Optimization
46 | How to handle attribution (First-touch vs Last-touch)? | Google Analytics | Marketing Tech | Medium | Marketing
47 | How to validate if randomization worked? | Stats StackExchange | Most Tech Companies | Easy | Validity
48 | What is stratification? | Wikipedia | Most Tech Companies | Medium | Sampling
49 | When should you NOT A/B test? | Reforge | Product Roles | Medium | Strategy
50 | How to estimate long-term impact from short-term tests? | Netflix TechBlog | Netflix, Meta | Hard | Strategy, Proxy Metrics
51 | What is Binomial Distribution? | Khan Academy | Most Tech Companies | Easy | Statistics
52 | What is Poisson Distribution? | Khan Academy | Uber, Lyft (Rides) | Medium | Statistics
53 | Difference between Correlation and Causation. | Khan Academy | Most Tech Companies | Easy | Basics
54 | What is a Confounding Variable? | Scribbr | Most Tech Companies | Easy | Causal Inference
55 | Explain Regression Discontinuity Design (RDD). | Wikipedia | Economics/Policy Roles | Hard | Causal Inference
56 | Explain Difference-in-Differences (DiD). | Wikipedia | Uber, Airbnb | Hard | Causal Inference
57 | What is Propensity Score Matching? | Wikipedia | Meta, Netflix | Hard | Causal Inference
58 | How to Handle Heterogeneous Treatment Effects? | CausalML | Uber, Meta | Hard | Causal ML
59 | What is Interference in social networks? | Meta Research | Meta, LinkedIn, Snap | Hard | Network Effects
60 | Explain the concept of "Holdout Groups". | Airbnb Eng | Amazon, Airbnb | Medium | Strategy
61 | How to test infrastructure changes? (Canary Deployment) | Google SRE | Google, Netflix | Medium | DevOps/SRE
62 | What is Client-side vs Server-side testing? | Optimizely | Full Stack Roles | Medium | Implementation
63 | How to deal with flickering? | VWO | Frontend Roles | Medium | Implementation
64 | What is a Trigger selection in A/B testing? | Microsoft Exp | Microsoft, Airbnb | Hard | Experimental Design
65 | How to analyze user funnel drop-offs? | Mixpanel | Product Analysts | Medium | Analytics
66 | What is Geometric Distribution? | Wikipedia | Most Tech Companies | Medium | Statistics
67 | Explain Inverse Propensity Weighting (IPW). | Wikipedia | Causal Inference Roles | Hard | Causal Inference
68 | How to calculate Standard Error of Mean (SEM)? | Investopedia | Most Tech Companies | Easy | Statistics
69 | What is Statistical Significance vs Practical Significance? | Towards Data Science | Google, Meta | Medium | Strategy
70 | How to handle cookies and tracking prevention (ITP)? | WebKit | AdTech, Marketing | Hard | Privacy
71 | [HARD] Explain the Delta Method for ratio metrics. | Deltamethod | Google, Meta, Uber | Hard | Statistics
72 | [HARD] How does Switchback testing solve interference? | DoorDash Eng | DoorDash, Uber | Hard | Experimental Design
73 | [HARD] Derive the sample size formula. | Stats Exchange | Google, HFT Firms | Hard | Math
74 | [HARD] How to implement CUPED in Python/SQL? | Booking.com | Booking, Microsoft | Hard | Optimization
75 | [HARD] Explain Sequential Probability Ratio Test (SPRT). | Wikipedia | Optimizely, Netflix | Hard | Statistics
76 | [HARD] How to estimate Network Effects (Cluster-Based)? | MIT Paper | Meta, LinkedIn | Hard | Network Effects
77 | [HARD] Design an experiment for a 3-sided marketplace. | Uber Eng | Uber, DoorDash | Hard | Marketplace
78 | [HARD] How to correct for multiple comparisons (FDR vs FWER)? | Wikipedia | Pharma, BioTech, Tech | Hard | Statistics
79 | [HARD] Explain Instrumental Variables (IV). | Wikipedia | Economics, Uber | Hard | Causal Inference
80 | [HARD] How to build an Experimentation Platform? | Microsoft Exp | Microsoft, Netflix, Airbnb | Hard | System Design
81 | [HARD] How to handle user identity resolution across devices? | Segment | Meta, Google | Hard | Data Engineering
82 | [HARD] What is "Carryover Effect" in Switchback tests? | DoorDash Eng | DoorDash, Uber | Hard | Pitfalls
83 | [HARD] Explain "Washout Period". | Clinical Trials | DoorDash, Uber | Hard | Experimental Design
84 | [HARD] How to test Ranking algorithms (Interleaving)? | Netflix TechBlog | Netflix, Google, Airbnb | Hard | Search/Ranking
85 | [HARD] Explain Always-Valid Inference. | Optimizely | Optimizely, Netflix | Hard | Statistics
86 | [HARD] How to measure cannibalization? | Harvard Business Review | Retail, E-comm | Hard | Strategy
87 | [HARD] Explain Thompson Sampling Implementation. | TDS | Amazon, Netflix | Hard | Bandits
88 | [HARD] How to detect Heterogeneous Treatment Effects (Causal Forest)? | Wager & Athey | Uber, Meta | Hard | Causal ML
89 | [HARD] How to handle "dilution" in experiment metrics? | Reforge | Product Roles | Hard | Metrics
90 | [HARD] Explain Synthetic Control Method. | Wikipedia | Uber (City-level tests) | Hard | Causal Inference
91 | [HARD] How to optimize for Long-term Customer Value (LTV)? | ThetaCLV | Subscription roles | Hard | Metrics
92 | [HARD] Explain "Winner's Curse" in A/B testing. | Airbnb Eng | Airbnb, Booking | Hard | Bias
93 | [HARD] How to handle heavy-tailed metric distributions? | TDS | HFT, Fintech | Hard | Statistics
94 | [HARD] How to implement Stratified Sampling in SQL? | Stack Overflow | Data Eng | Hard | Sampling
95 | [HARD] Explain "Regression to the Mean". | Wikipedia | Most Tech Companies | Hard | Statistics
96 | [HARD] How to budget "Error Rate" across the company? | Microsoft Exp | Microsoft, Google | Hard | Strategy
97 | [HARD] How to detect bot traffic in experiments? | Google Analytics | Security, Fraud | Hard | Data Quality
98 | [HARD] Explain "Interaction Effects" in Factorial Designs. | Wikipedia | Meta, Google | Hard | Statistics
99 | [HARD] How to use Surrogate Metrics? | Netflix TechBlog | Netflix | Hard | Metrics
100 | [HARD] How to implement A/B testing in a Microservices architecture? | Split.io | Netflix, Uber | Hard | Engineering

Code Examples

1. Power Analysis and Sample Size (Python)

Calculating the required sample size before starting an experiment.

from statsmodels.stats.power import TTestIndPower
import numpy as np

# Parameters
effect_size = 0.1  # Cohen's d (Standardized difference)
alpha = 0.05       # Significance level (5%)
power = 0.8        # Power (80%)

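analysis = TTestIndPower()
# Solve for the per-group sample size (equal allocation, two-sided test)
sample_size = analysis.solve_power(
    effect_size=effect_size, power=power, alpha=alpha,
    ratio=1.0, alternative="two-sided",
)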

print(f"Required sample size per group: {int(np.ceil(sample_size))}")
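
The solver above works with a standardized effect size (Cohen's d). For conversion-rate experiments it is often easier to reason in absolute rates, so the following is a minimal sketch of the usual normal-approximation formula for two proportions. The function name `sample_size_two_proportions` and the 10% to 12% example rates are illustrative assumptions, not part of the original material.

```python
import math
from scipy.stats import norm


def sample_size_two_proportions(p_control, p_variant, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test of two proportions.

    Normal approximation with unpooled variances (hypothetical helper).
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = norm.ppf(power)           # quantile matching the desired power
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    delta = p_variant - p_control      # absolute minimum detectable effect
    n = (z_alpha + z_beta) ** 2 * variance_sum / delta ** 2
    return math.ceil(n)


# Assumed example: baseline 10% conversion, hoping to detect a lift to 12%
n_per_group = sample_size_two_proportions(0.10, 0.12)
print(f"Required sample size per group: {n_per_group}")
```

Halving the detectable lift roughly quadruples the required sample, because the effect enters the formula squared in the denominator.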

2. Bayesian A/B Test (Beta-Binomial)

Updating beliefs about conversion rates.

from scipy.stats import beta

# Prior: Uniform distribution (Beta(1,1))
alpha_prior = 1
beta_prior = 1

# Data: Group A
conversions_A = 120
failures_A = 880

# Data: Group B
conversions_B = 140
failures_B = 860

# Posterior
posterior_A = beta(alpha_prior + conversions_A, beta_prior + failures_A)
posterior_B = beta(alpha_prior + conversions_B, beta_prior + failures_B)

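# Probability B > A (Monte Carlo estimate from independent posterior draws)
samples = 100000
samples_A = posterior_A.rvs(samples, random_state=0)  # seeded for reproducibility
samples_B = posterior_B.rvs(samples, random_state=1)
prob_b_better = (samples_B > samples_A).mean()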

print(f"Probability B is better than A: {prob_b_better:.4f}")

3. Bootstrap Confidence Interval

Calculating CI for non-normal metrics (e.g., Revenue per User).

import numpy as np

rng = np.random.default_rng(42)  # seeded so the interval is reproducible

# Simulated heavy-tailed revenue data, for illustration only
data_control = rng.lognormal(mean=2, sigma=1, size=1000)
data_variant = rng.lognormal(mean=2.1, sigma=1, size=1000)

def bootstrap_mean_diff(data1, data2, n_bootstrap=1000):
    """Percentile bootstrap 95% CI for mean(data2) - mean(data1)."""
    diffs = []
    for _ in range(n_bootstrap):
        # Resample each group independently, with replacement
        sample1 = rng.choice(data1, len(data1), replace=True)
        sample2 = rng.choice(data2, len(data2), replace=True)
        diffs.append(sample2.mean() - sample1.mean())
    return np.percentile(diffs, [2.5, 97.5])

ci = bootstrap_mean_diff(data_control, data_variant)
print(f"95% CI for difference (variant - control): {ci}")
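
4. Sample Ratio Mismatch (SRM) Check

Checking whether observed group sizes match the intended split. This is a minimal sketch using a chi-square goodness-of-fit test; the helper name `srm_p_value`, the example counts, and the 50/50 design are assumptions for illustration.

```python
from scipy.stats import chisquare


def srm_p_value(observed_counts, expected_ratios):
    """Chi-square goodness-of-fit p-value for the observed group sizes."""
    total = sum(observed_counts)
    ratio_sum = sum(expected_ratios)
    # Expected counts under the intended allocation
    expected_counts = [total * r / ratio_sum for r in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    return p_value


# Hypothetical user counts for a 50/50 experiment
p_balanced = srm_p_value([5000, 5050], [0.5, 0.5])
p_skewed = srm_p_value([5000, 5400], [0.5, 0.5])

print(f"Balanced split p-value: {p_balanced:.4f}")
print(f"Skewed split p-value:   {p_skewed:.6f}")
```

A very small p-value suggests the assignment or logging pipeline should be investigated before reading any metric results. The alerting threshold is a team-level choice, not something this sketch prescribes.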

Questions asked in Google interviews

  • Explain the difference between Type I and Type II errors.
  • How do you design an experiment to test a change in the Search Ranking algorithm?
  • How to handle multiple metrics in an experiment? (Overall Evaluation Criterion).
  • Explain the trade-off between sample size and experiment duration.
  • Derive the variance of the difference between two means.
  • How to detect if your randomization algorithm is broken?
  • Explain how you would test a feature with strong network effects.
  • How to measure the long-term impact of a UI change?
  • What metric would you use for a "User Happiness" experiment?
  • Explain the concept of "Regression to the Mean" in the context of A/B testing.
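
The variance derivation mentioned in the list above can be written out in a few lines. This is a sketch under the standard assumption that groups A and B are independent samples of sizes \(n_A\) and \(n_B\) with variances \(\sigma_A^2\) and \(\sigma_B^2\); the symbols are introduced here for illustration.

```latex
\begin{align*}
\operatorname{Var}(\bar{X}_B - \bar{X}_A)
  &= \operatorname{Var}(\bar{X}_B) + \operatorname{Var}(\bar{X}_A)
     - 2\operatorname{Cov}(\bar{X}_B, \bar{X}_A) \\
  &= \frac{\sigma_B^2}{n_B} + \frac{\sigma_A^2}{n_A}
     \qquad \text{(independent groups, so the covariance is zero)} \\
\operatorname{SE}(\bar{X}_B - \bar{X}_A)
  &= \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}
\end{align*}
```

With equal group sizes \(n\) and a common variance \(\sigma^2\), this reduces to \(2\sigma^2 / n\), which is why the required sample size grows with the metric's variance and shrinks with the square of the effect to be detected.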

Questions asked in Meta (Facebook) interviews

  • How to measure network effects in a social network experiment?
  • Explain Cluster-based randomization. Why use it?
  • How to handle "Novelty Effect" when launching a new feature?
  • Explain CUPED (Controlled-experiment Using Pre-Experiment Data).
  • How to design an experiment for the News Feed ranking?
  • What are the potential bounds of network interference?
  • How to detect if an experiment has a Sample Ratio Mismatch (SRM)?
  • Explain the difference between Average Treatment Effect (ATE) and Conditional ATE (CATE).
  • How to optimize for long-term user retention?
  • Design a test to measure the impact of ads on user engagement.
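
The CUPED question in the list above is easier to discuss with a concrete sketch. The example below assumes a single pre-experiment covariate and simulated data; the helper name `cuped_adjust`, the seed, and the simulated effect of 0.2 are illustrative assumptions rather than any company's implementation.

```python
import numpy as np


def cuped_adjust(metric, covariate):
    """Return the CUPED-adjusted metric: Y - theta * (X - mean(X))."""
    # theta = Cov(Y, X) / Var(X), estimated on pooled data from both groups
    theta = np.cov(metric, covariate, ddof=1)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())


rng = np.random.default_rng(42)
n = 5000
pre_metric = rng.normal(loc=10.0, scale=2.0, size=n)   # pre-experiment covariate
treated = rng.integers(0, 2, size=n).astype(bool)      # random assignment
# Simulated in-experiment metric with a true treatment effect of 0.2
post_metric = pre_metric + rng.normal(0.0, 1.0, size=n) + 0.2 * treated

adjusted = cuped_adjust(post_metric, pre_metric)

raw_effect = post_metric[treated].mean() - post_metric[~treated].mean()
cuped_effect = adjusted[treated].mean() - adjusted[~treated].mean()

print(f"Raw effect estimate:   {raw_effect:.3f}")
print(f"CUPED effect estimate: {cuped_effect:.3f}")
print(f"Variance before: {post_metric.var():.3f}, after: {adjusted.var():.3f}")
```

Because the covariate is measured before assignment, subtracting its centered, scaled value leaves the expected treatment effect unchanged while removing variance the covariate explains.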

Questions asked in Netflix interviews

  • How to A/B test a new recommendation algorithm?
  • Explain "Interleaving" in ranking experiments.
  • How to choose between "member-level" vs "profile-level" assignment?
  • How to estimate the causal impact of a TV show launch on subscriptions? (Quasi-experiment).
  • Explain the concept of "Proxy Metrics".
  • How to handle outlier users (e.g., bots, heavy users) in analysis?
  • Explain "Switchback" testing infrastructure.
  • How to balance "Exploration" vs "Exploitation" (Bandits)?
  • Design a test for artwork personalization (thumbnails).
  • How to measure the "Incremental Reach" of a marketing campaign?
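
For the exploration vs exploitation question in the list above, a small simulation can make the idea tangible. This is a sketch of Beta-Bernoulli Thompson sampling under assumed conversion rates; the function name `thompson_sampling`, the rates, and the round count are hypothetical.

```python
import numpy as np


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    successes = np.ones(n_arms)  # alpha parameters of the Beta(1, 1) priors
    failures = np.ones(n_arms)   # beta parameters of the Beta(1, 1) priors
    pulls = np.zeros(n_arms, dtype=int)

    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its posterior
        sampled_rates = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_rates))
        # Observe a simulated Bernoulli reward for the chosen arm
        reward = int(rng.random() < true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Assumed true conversion rates, unknown to the algorithm
pulls = thompson_sampling([0.02, 0.05, 0.10])
print(f"Pulls per arm: {pulls}")
```

Arms with uncertain posteriors still get sampled occasionally (exploration), while traffic shifts toward the arm whose posterior concentrates on the highest rate (exploitation).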

Questions asked in Uber/Lyft interviews (Marketplace)

  • How to test changes in a two-sided marketplace (Rider vs Driver)?
  • Explain "Switchback" designs for marketplace experiments.
  • How to handle "Spillover" or "Cannibalization" effects?
  • Explain "Difference-in-Differences" method.
  • How to measure the impact of surge pricing changes?
  • Explain "Synthetic Control" methods for city-level tests.
  • How to calculate "Marketplace Liquidity" metrics?
  • Design an experiment to reduce driver cancellations.
  • How to test a new matching algorithm?
  • Explain Interference in a geo-spatial context.
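
The Difference-in-Differences question in the list above comes down to simple arithmetic in the two-group, two-period case. The sketch below uses made-up daily trip counts for a treated city and a comparison city; the helper name `diff_in_diff` and all numbers are hypothetical, and the estimate is only meaningful under the parallel-trends assumption.

```python
import numpy as np


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period, two-group difference-in-differences point estimate."""
    treated_change = np.mean(treated_post) - np.mean(treated_pre)
    control_change = np.mean(control_post) - np.mean(control_pre)
    # Subtract the control group's change to remove the shared time trend
    return treated_change - control_change


# Hypothetical daily trips (in thousands) before and after a launch
treated_pre = [100, 102, 98, 101]
treated_post = [112, 115, 110, 113]
control_pre = [80, 82, 79, 81]
control_post = [85, 86, 84, 87]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated effect: {effect:.2f}")
```

In practice the same estimate is usually obtained from a regression with a treatment-by-period interaction term, which also provides standard errors.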

Additional Resources