A/B Testing Interview Questions
This document provides a curated list of A/B testing and experimentation interview questions. It covers statistical foundations, experimental design, metric selection, and advanced topics such as interference (network effects) and sequential testing. These topics are critical for roles at data-driven companies like Netflix, Airbnb, and Uber.
| Sno | Question Title | Practice Links | Companies Asking | Difficulty | Topics |
|---|---|---|---|---|---|
| 1 | What is A/B Testing? | Optimizely | Most Tech Companies | Easy | Basics |
| 2 | Explain Null Hypothesis (\(H_0\)) vs Alternative Hypothesis (\(H_1\)) | Khan Academy | Most Tech Companies | Easy | Statistics |
| 3 | What is a p-value? Explain it to a non-technical person. | Harvard Business Review | Google, Meta, Amazon | Medium | Statistics, Communication |
| 4 | What is Statistical Power? | Machine Learning Plus | Google, Netflix, Uber | Medium | Statistics |
| 5 | What is Type I error (False Positive) vs Type II error (False Negative)? | Towards Data Science | Most Tech Companies | Easy | Statistics |
| 6 | How do you calculate sample size for an experiment? | Evan Miller | Google, Amazon, Meta | Medium | Experimental Design |
| 7 | What is Minimum Detectable Effect (MDE)? | StatsEngine | Netflix, Airbnb | Medium | Experimental Design |
| 8 | Explain Confidence Intervals. | Coursera | Most Tech Companies | Easy | Statistics |
| 9 | Difference between One-tailed and Two-tailed tests. | Investopedia | Google, Amazon | Easy | Statistics |
| 10 | What is the Central Limit Theorem? Why is it important? | Khan Academy | Google, HFT Firms | Medium | Statistics |
| 11 | How long should you run an A/B test? | CXL | Airbnb, Booking.com | Medium | Experimental Design |
| 12 | Can you stop an experiment as soon as it reaches significance? (Peeking) | Evan Miller | Netflix, Uber, Airbnb | Hard | Pitfalls |
| 13 | What is SRM (Sample Ratio Mismatch)? How to debug? | Microsoft Research | Microsoft, LinkedIn | Hard | Debugging |
| 14 | What is Randomization Unit vs Analysis Unit? | Udacity | Uber, DoorDash | Medium | Experimental Design |
| 15 | How to handle outliers in A/B testing metrics? | Towards Data Science | Google, Meta | Medium | Data Cleaning |
| 16 | Mean vs Median: Which metric to use? | Stack Overflow | Most Tech Companies | Easy | Metrics |
| 17 | What are Guardrail Metrics? | Airbnb Tech Blog | Airbnb, Netflix | Medium | Metrics |
| 18 | What is a North Star Metric? | Amplitude | Product Roles | Easy | Metrics |
| 19 | Difference between Z-test and T-test. | Statistics By Jim | Google, Amazon | Medium | Statistics |
| 20 | How to test multiple variants? (A/B/n testing) | VWO | Booking.com, Expedia | Medium | Experimental Design |
| 21 | What is the Bonferroni Correction? | Wikipedia | Google, Meta | Hard | Statistics |
| 22 | What is A/A Testing? Why do it? | Optimizely | Microsoft, LinkedIn | Medium | Validity |
| 23 | Explain Covariate Adjustment (CUPED). | Booking.com Data | Booking.com, Microsoft, Meta | Hard | Optimization |
| 24 | How to measure retention in A/B tests? | Reforge | Netflix, Spotify | Medium | Metrics |
| 25 | What is a Novelty Effect? | CXL | Facebook, Instagram | Medium | Pitfalls |
| 26 | What is a Primacy Effect? | CXL | Facebook, Instagram | Medium | Pitfalls |
| 27 | How to handle interference (Network Effects)? | Uber Eng Blog | Uber, Lyft, DoorDash | Hard | Network Effects |
| 28 | What is a Switchback (Time-split) Experiment? | DoorDash Eng | DoorDash, Uber | Hard | Experimental Design |
| 29 | What is Cluster Randomization? | Wikipedia | Facebook, LinkedIn | Hard | Experimental Design |
| 30 | How to test on a 2-sided marketplace? | Lyft Eng | Uber, Lyft, Airbnb | Hard | Marketplace |
| 31 | Explain Bayesian A/B Testing vs Frequentist. | VWO | Stitch Fix, Netflix | Hard | Statistics |
| 32 | What is a Multi-Armed Bandit (MAB)? | Towards Data Science | Netflix, Amazon | Hard | Bandits |
| 33 | Thompson Sampling vs Epsilon-Greedy. | GeeksforGeeks | Netflix, Amazon | Hard | Bandits |
| 34 | How to deal with low traffic experiments? | CXL | Startups | Medium | Strategy |
| 35 | How to select metrics for a new feature? | Product School | Meta, Google | Medium | Metrics |
| 36 | What is Simpson's Paradox? | Britannica | Google, Amazon | Medium | Paradoxes |
| 37 | How to analyze ratio metrics (e.g., CTR)? | Deltamethod | Google, Meta | Hard | Statistics, Delta Method |
| 38 | What is Bootstrapping? When to use it? | Investopedia | Amazon, Netflix | Medium | Statistics |
| 39 | How to detect and handle Seasonality? | Towards Data Science | Retail/E-comm | Medium | Time Series |
| 40 | What is Change Aversion? | Google UX | Google, YouTube | Medium | UX |
| 41 | How to design an experiment for a search algorithm? | Airbnb Eng | Google, Airbnb, Amazon | Hard | Search, Ranking |
| 42 | How to test pricing changes? | PriceIntelligently | Uber, Airbnb | Hard | Pricing, Strategy |
| 43 | What is interference between experiments? | Microsoft Exp | Google, Meta, Microsoft | Hard | Platform |
| 44 | Explain Sequential Testing. | Evan Miller | Optimizely, Netflix | Hard | Statistics |
| 45 | What is Variance Reduction? | Meta Research | Meta, Microsoft, Booking | Hard | Optimization |
| 46 | How to handle attribution (First-touch vs Last-touch)? | Google Analytics | Marketing Tech | Medium | Marketing |
| 47 | How to validate if randomization worked? | Stats StackExchange | Most Tech Companies | Easy | Validity |
| 48 | What is stratification? | Wikipedia | Most Tech Companies | Medium | Sampling |
| 49 | When should you NOT A/B test? | Reforge | Product Roles | Medium | Strategy |
| 50 | How to estimate long-term impact from short-term tests? | Netflix TechBlog | Netflix, Meta | Hard | Strategy, Proxy Metrics |
| 51 | What is Binomial Distribution? | Khan Academy | Most Tech Companies | Easy | Statistics |
| 52 | What is Poisson Distribution? | Khan Academy | Uber, Lyft (Rides) | Medium | Statistics |
| 53 | Difference between Correlation and Causation. | Khan Academy | Most Tech Companies | Easy | Basics |
| 54 | What is a Confounding Variable? | Scribbr | Most Tech Companies | Easy | Causal Inference |
| 55 | Explain Regression Discontinuity Design (RDD). | Wikipedia | Economics/Policy Roles | Hard | Causal Inference |
| 56 | Explain Difference-in-Differences (DiD). | Wikipedia | Uber, Airbnb | Hard | Causal Inference |
| 57 | What is Propensity Score Matching? | Wikipedia | Meta, Netflix | Hard | Causal Inference |
| 58 | How to Handle Heterogeneous Treatment Effects? | CausalML | Uber, Meta | Hard | Causal ML |
| 59 | What is Interference in social networks? | Meta Research | Meta, LinkedIn, Snap | Hard | Network Effects |
| 60 | Explain the concept of "Holdout Groups". | Airbnb Eng | Amazon, Airbnb | Medium | Strategy |
| 61 | How to test infrastructure changes? (Canary Deployment) | Google SRE | Google, Netflix | Medium | DevOps/SRE |
| 62 | What is Client-side vs Server-side testing? | Optimizely | Full Stack Roles | Medium | Implementation |
| 63 | How to deal with flickering? | VWO | Frontend Roles | Medium | Implementation |
| 64 | What is Trigger Selection (triggered analysis) in A/B testing? | Microsoft Exp | Microsoft, Airbnb | Hard | Experimental Design |
| 65 | How to analyze user funnel drop-offs? | Mixpanel | Product Analysts | Medium | Analytics |
| 66 | What is Geometric Distribution? | Wikipedia | Most Tech Companies | Medium | Statistics |
| 67 | Explain Inverse Propensity Weighting (IPW). | Wikipedia | Causal Inference Roles | Hard | Causal Inference |
| 68 | How to calculate Standard Error of Mean (SEM)? | Investopedia | Most Tech Companies | Easy | Statistics |
| 69 | What is Statistical Significance vs Practical Significance? | Towards Data Science | Google, Meta | Medium | Strategy |
| 70 | How to handle cookies and tracking prevention (ITP)? | WebKit | AdTech, Marketing | Hard | Privacy |
| 71 | [HARD] Explain the Delta Method for ratio metrics. | Deltamethod | Google, Meta, Uber | Hard | Statistics |
| 72 | [HARD] How does Switchback testing solve interference? | DoorDash Eng | DoorDash, Uber | Hard | Experimental Design |
| 73 | [HARD] Derive the sample size formula. | Stats Exchange | Google, HFT Firms | Hard | Math |
| 74 | [HARD] How to implement CUPED in Python/SQL? | Booking.com | Booking, Microsoft | Hard | Optimization |
| 75 | [HARD] Explain Sequential Probability Ratio Test (SPRT). | Wikipedia | Optimizely, Netflix | Hard | Statistics |
| 76 | [HARD] How to estimate Network Effects (Cluster-Based)? | MIT Paper | Meta, LinkedIn | Hard | Network Effects |
| 77 | [HARD] Design an experiment for a 3-sided marketplace. | Uber Eng | Uber, DoorDash | Hard | Marketplace |
| 78 | [HARD] How to correct for multiple comparisons (FDR vs FWER)? | Wikipedia | Pharma, BioTech, Tech | Hard | Statistics |
| 79 | [HARD] Explain Instrumental Variables (IV). | Wikipedia | Economics, Uber | Hard | Causal Inference |
| 80 | [HARD] How to build an Experimentation Platform? | Microsoft Exp | Microsoft, Netflix, Airbnb | Hard | System Design |
| 81 | [HARD] How to handle user identity resolution across devices? | Segment | Meta, Google | Hard | Data Engineering |
| 82 | [HARD] What is "Carryover Effect" in Switchback tests? | DoorDash Eng | DoorDash, Uber | Hard | Pitfalls |
| 83 | [HARD] Explain "Washout Period". | Clinical Trials | DoorDash, Uber | Hard | Experimental Design |
| 84 | [HARD] How to test Ranking algorithms (Interleaving)? | Netflix TechBlog | Netflix, Google, Airbnb | Hard | Search/Ranking |
| 85 | [HARD] Explain Always-Valid Inference. | Optimizely | Optimizely, Netflix | Hard | Statistics |
| 86 | [HARD] How to measure cannibalization? | Harvard Business Review | Retail, E-comm | Hard | Strategy |
| 87 | [HARD] Explain Thompson Sampling Implementation. | TDS | Amazon, Netflix | Hard | Bandits |
| 88 | [HARD] How to detect Heterogeneous Treatment Effects (Causal Forest)? | Wager & Athey | Uber, Meta | Hard | Causal ML |
| 89 | [HARD] How to handle "dilution" in experiment metrics? | Reforge | Product Roles | Hard | Metrics |
| 90 | [HARD] Explain Synthetic Control Method. | Wikipedia | Uber (City-level tests) | Hard | Causal Inference |
| 91 | [HARD] How to optimize for Long-term Customer Value (LTV)? | ThetaCLV | Subscription roles | Hard | Metrics |
| 92 | [HARD] Explain "Winner's Curse" in A/B testing. | Airbnb Eng | Airbnb, Booking | Hard | Bias |
| 93 | [HARD] How to handle heavy-tailed metric distributions? | TDS | HFT, Fintech | Hard | Statistics |
| 94 | [HARD] How to implement Stratified Sampling in SQL? | Stack Overflow | Data Eng | Hard | Sampling |
| 95 | [HARD] Explain "Regression to the Mean". | Wikipedia | Most Tech Companies | Hard | Statistics |
| 96 | [HARD] How to budget "Error Rate" across the company? | Microsoft Exp | Microsoft, Google | Hard | Strategy |
| 97 | [HARD] How to detect bot traffic in experiments? | Google Analytics | Security, Fraud | Hard | Data Quality |
| 98 | [HARD] Explain "Interaction Effects" in Factorial Designs. | Wikipedia | Meta, Google | Hard | Statistics |
| 99 | [HARD] How to use Surrogate Metrics? | Netflix TechBlog | Netflix | Hard | Metrics |
| 100 | [HARD] How to implement A/B testing in a Microservices architecture? | Split.io | Netflix, Uber | Hard | Engineering |
Code Examples
1. Power Analysis and Sample Size (Python)
Calculating the required sample size before starting an experiment.
```python
from statsmodels.stats.power import TTestIndPower
import numpy as np

# Parameters
effect_size = 0.1  # Cohen's d (standardized difference)
alpha = 0.05       # Significance level (5%)
power = 0.8        # Power (80%)

analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha)
print(f"Required sample size per group: {int(np.ceil(sample_size))}")
```
2. Bayesian A/B Test (Beta-Binomial)
Updating beliefs about conversion rates.
```python
from scipy.stats import beta

# Prior: Uniform distribution (Beta(1, 1))
alpha_prior = 1
beta_prior = 1

# Data: Group A
conversions_A = 120
failures_A = 880

# Data: Group B
conversions_B = 140
failures_B = 860

# Posteriors
posterior_A = beta(alpha_prior + conversions_A, beta_prior + failures_A)
posterior_B = beta(alpha_prior + conversions_B, beta_prior + failures_B)

# Probability B > A (approximated via Monte Carlo simulation)
samples = 100000
prob_b_better = (posterior_B.rvs(samples) > posterior_A.rvs(samples)).mean()
print(f"Probability B is better than A: {prob_b_better:.4f}")
```
3. Bootstrap Confidence Interval
Calculating CI for non-normal metrics (e.g., Revenue per User).
```python
import numpy as np

data_control = np.random.lognormal(mean=2, sigma=1, size=1000)
data_variant = np.random.lognormal(mean=2.1, sigma=1, size=1000)

def bootstrap_mean_diff(data1, data2, n_bootstrap=1000):
    diffs = []
    for _ in range(n_bootstrap):
        # Sample with replacement
        sample1 = np.random.choice(data1, len(data1), replace=True)
        sample2 = np.random.choice(data2, len(data2), replace=True)
        diffs.append(sample2.mean() - sample1.mean())
    return np.percentile(diffs, [2.5, 97.5])

ci = bootstrap_mean_diff(data_control, data_variant)
print(f"95% CI for difference: {ci}")
```
Questions asked in Google interviews
- Explain the difference between Type I and Type II errors.
- How do you design an experiment to test a change in the Search Ranking algorithm?
- How to handle multiple metrics in an experiment? (Overall Evaluation Criterion).
- Explain the trade-off between sample size and experiment duration.
- Derive the variance of the difference between two means (see the worked sketch after this list).
- How to detect if your randomization algorithm is broken?
- Explain how you would test a feature with strong network effects.
- How to measure the long-term impact of a UI change?
- What metric would you use for a "User Happiness" experiment?
- Explain the concept of "Regression to the Mean" in the context of A/B testing.
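A worked sketch for the variance derivation above, assuming the two groups are independent samples: each sample mean has variance \(\mathrm{Var}(\bar{X}_A) = \sigma_A^2 / n_A\) and \(\mathrm{Var}(\bar{X}_B) = \sigma_B^2 / n_B\), and independence makes the covariance term zero, so

\[ \mathrm{Var}(\bar{X}_B - \bar{X}_A) = \mathrm{Var}(\bar{X}_B) + \mathrm{Var}(\bar{X}_A) = \frac{\sigma_B^2}{n_B} + \frac{\sigma_A^2}{n_A}. \]

The standard error of the difference is the square root of this sum; the sample-size formula follows by requiring the minimum detectable effect to be a fixed multiple (determined by \(\alpha\) and power) of this standard error.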
Questions asked in Meta (Facebook) interviews
- How to measure network effects in a social network experiment?
- Explain Cluster-based randomization. Why use it?
- How to handle "Novelty Effect" when launching a new feature?
- Explain CUPED (Controlled-experiment Using Pre-Experiment Data); see the sketch after this list.
- How to design an experiment for the News Feed ranking?
- What are the potential bounds of network interference?
- How to detect if an experiment has a Sample Ratio Mismatch (SRM)?
- Explain the difference between Average Treatment Effect (ATE) and Conditional ATE (CATE).
- How to optimize for long-term user retention?
- Design a test to measure the impact of ads on user engagement.
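A minimal CUPED sketch on simulated data (the metric, covariate, and +0.1 true lift below are illustrative assumptions): adjust the in-experiment metric with its pre-experiment value to shrink variance without biasing the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(10, 3, n)                   # pre-experiment metric (covariate)
treat = rng.integers(0, 2, n)              # random 50/50 assignment
y = x + rng.normal(0, 2, n) + 0.1 * treat  # in-experiment metric, true lift 0.1

# theta = Cov(x, y) / Var(x), estimated on pooled data
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())       # adjusted metric keeps the same mean

def estimate(metric):
    a, b = metric[treat == 0], metric[treat == 1]
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return diff, se

for name, metric in [("raw", y), ("CUPED", y_cuped)]:
    diff, se = estimate(metric)
    print(f"{name}: estimate={diff:.3f}, std_err={se:.3f}")
```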
Questions asked in Netflix interviews
- How to A/B test a new recommendation algorithm?
- Explain "Interleaving" in ranking experiments.
- How to choose between "member-level" vs "profile-level" assignment?
- How to estimate the causal impact of a TV show launch on subscriptions? (Quasi-experiment).
- Explain the concept of "Proxy Metrics".
- How to handle outlier users (e.g., bots, heavy users) in analysis?
- Explain "Switchback" testing infrastructure.
- How to balance "Exploration" vs "Exploitation" (Bandits)? (See the Thompson Sampling sketch after this list.)
- Design a test for artwork personalization (thumbnails).
- How to measure the "Incremental Reach" of a marketing campaign?
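A minimal Thompson Sampling sketch for Bernoulli rewards; the per-variant click-through rates are simulated assumptions, loosely in the spirit of the thumbnail question above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.04, 0.05, 0.06]              # hidden per-variant CTRs (simulated)
alpha = np.ones(len(true_ctr))             # Beta prior: successes + 1 per arm
beta = np.ones(len(true_ctr))              # Beta prior: failures + 1 per arm

for _ in range(50_000):
    sampled = rng.beta(alpha, beta)        # sample a plausible CTR for each arm
    arm = int(np.argmax(sampled))          # play the arm that looks best this round
    reward = rng.random() < true_ctr[arm]  # simulated click / no click
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("Posterior mean CTR per arm:", np.round(alpha / (alpha + beta), 4))
print("Plays per arm:", (alpha + beta - 2).astype(int))
```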
Questions asked in Uber/Lyft interviews (Marketplace)
- How to test changes in a two-sided marketplace (Rider vs Driver)?
- Explain "Switchback" designs for marketplace experiments.
- How to handle "Spillover" or "Cannibalization" effects?
- Explain "Difference-in-Differences" method.
- How to measure the impact of surge pricing changes?
- Explain "Synthetic Control" methods for city-level tests.
- How to calculate "Marketplace Liquidity" metrics?
- Design an experiment to reduce driver cancellations.
- How to test a new matching algorithm?
- Explain Interference in a geo-spatial context.
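A minimal Difference-in-Differences sketch on simulated pre/post data for treated and control cities (all numbers, including the +2.0 true effect, are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),    # 1 = treated city group, 0 = control
    "post": rng.integers(0, 2, n),       # 1 = after the change, 0 = before
})
df["y"] = (
    10
    + 1.5 * df["treated"]                # baseline difference between groups
    + 0.8 * df["post"]                   # common time trend
    + 2.0 * df["treated"] * df["post"]   # true treatment effect
    + rng.normal(0, 1, n)
)

# DiD estimate: (treated post - treated pre) - (control post - control pre)
means = df.groupby(["treated", "post"])["y"].mean()
did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(f"DiD estimate of the treatment effect: {did:.2f}")  # close to 2.0
```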
Additional Resources
- Microsoft Experimentation Platform (Exp) - Best technical papers.
- Netflix Tech Blog - Experimentation - Real-world case studies.
- Causal Inference for the Brave and True - Python handbook.
- Trustworthy Online Controlled Experiments (Book) - The "Bible" of A/B testing (Kohavi, Tang, and Xu).
- Uber Engineering - Data - Marketplace testing concepts.