

Data-driven product decision making with expert-level analytics methodology. Covers north star metrics, funnel and cohort analysis, A/B testing, event taxonomy design, attribution modeling, session recording patterns, and the balance between data-informed and data-driven decisions.

MoltbotDen
Product & Design

Product Analytics

Analytics without a measurement strategy is just expensive data hoarding. Most product teams track dozens of metrics but can't tell you what their north star is or whether it moved last quarter. This skill covers how to instrument products correctly, analyze data rigorously, and — critically — know when to override the data with judgment.

Core Mental Model

Metrics are proxies for value, not value itself. Optimizing a metric blindly causes Goodhart's Law failures: "When a measure becomes a target, it ceases to be a good measure." Daily active users can be gamed with dark patterns. Conversion rates can be inflated with low-quality sign-ups. The key is building a metric tree where improving each leaf metric reflects genuine value creation for users.

Layer your metrics:

  1. North Star Metric — single metric most correlated with long-term value delivery

  2. Leading Indicators — metrics that predict north star movement (can act on now)

  3. Guardrail Metrics — metrics you cannot degrade while chasing the north star

  4. Diagnostic Metrics — help you understand why north star moved


North Star Metric

The north star is the ONE metric that best captures the value your product delivers to customers at scale.

Characteristics of a Good North Star

  • Reflects customer value received, not activity
  • Predictive of long-term revenue
  • Lagging enough to matter, leading enough to act on
  • Understandable by the whole company
  • One number (not a composite)
Company          North Star
---------------------------------
Airbnb           Nights booked
Spotify          Time spent listening
Slack            Messages sent within a workspace
Facebook         Daily Active Users (their original, now controversial)
Stripe           Total payment volume
Duolingo         Daily active learners
HubSpot          Weekly active teams using ≥3 features

Anti-patterns:
Revenue          (lagging; can obscure user value loss before churn)
Pageviews        (activity, not value)
Sign-ups         (output, not engagement)
App installs     (output)

North Star Metric Framework

Step 1: List the value moments in your product
        "User gets value when they: send first message / complete a project / 
         receive first payment / reach their goal"

Step 2: Find the metric that best proxies that moment at scale
        "Weekly projects completed" captures the value moment

Step 3: Stress-test it
        - Can it be gamed without delivering real value? 
        - Does it degrade if we compromise quality?
        - Does it grow when our best customers engage more?

Step 4: Build the metric tree under it
        North Star: "Weekly projects completed"
        └── Activation: % who complete first project in 7 days
        └── Engagement: Projects/user/week
        └── Retention: % users active week-over-week
        └── Expansion: Teams inviting 2+ members

Funnel Analysis

Funnels show conversion rates between sequential steps. Use them to find where users drop off and prioritize optimization.

Funnel Construction

-- Example: Signup funnel (Mixpanel/Amplitude SQL equivalent)
SELECT
  step,
  COUNT(DISTINCT user_id) as users,
  COUNT(DISTINCT user_id) * 100.0 / MAX(COUNT(DISTINCT user_id)) OVER() as pct_of_top
FROM (
  SELECT user_id, 'visited_landing' as step, created_at FROM page_views WHERE path = '/'
  UNION ALL
  SELECT user_id, 'clicked_signup', created_at FROM events WHERE event = 'signup_clicked'
  UNION ALL
  SELECT user_id, 'completed_signup', created_at FROM events WHERE event = 'signup_completed'
  UNION ALL
  SELECT user_id, 'completed_onboarding', created_at FROM events WHERE event = 'onboarding_finished'
) AS funnel_steps
GROUP BY step

Interpreting Drop-Off

Funnel:
Visited landing:        100,000  (100%)
Clicked sign-up:         22,000  (22%) ← 78% drop here
Completed sign-up:       15,000  (68% of prev, 15% total)
Completed onboarding:     6,000  (40% of prev, 6% total)
Activated (week 1):       3,200  (53% of prev, 3.2% total)

Analysis:
- Biggest absolute drop: landing → click (78K users lost)
  → A/B test headline, CTA, value prop
- Biggest % drop inside the product: sign-up → onboarding (60% lost)
  → Investigate: too many steps? Email verification blocking?
- Highest leverage: improving landing→click by 5pp adds ~5K clicks
  (~3.4K extra sign-ups at the current 68% completion rate)
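The leverage arithmetic can be sketched directly. A minimal helper using the example funnel's counts; `extra_conversions` is an illustrative name, not a library function:

```python
# Counts mirror the example funnel above; `extra_conversions` is an
# illustrative helper, not a library function.
funnel = [
    ("visited_landing", 100_000),
    ("clicked_signup", 22_000),
    ("completed_signup", 15_000),
    ("completed_onboarding", 6_000),
]

def extra_conversions(funnel, improved_step, lift_pp, target_step=2):
    """Extra users reaching `target_step` if conversion into `improved_step`
    rises by `lift_pp` percentage points, later conversions held constant."""
    counts = [n for _, n in funnel]
    extra = counts[improved_step - 1] * lift_pp / 100
    for i in range(improved_step, target_step):
        extra *= counts[i + 1] / counts[i]
    return extra

# +5pp on landing -> click: 5,000 extra clicks, ~3,409 extra sign-ups
print(round(extra_conversions(funnel, 1, 5)))
```

Running the same helper for each step is a quick way to rank which drop-off is worth attacking first.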

Statistical Significance for Funnel Changes

Before declaring a funnel improvement, ask: did the change cause it?

from scipy import stats

# Chi-squared test for conversion rate changes
control_conversions = 150
control_visitors = 1000
test_conversions = 175
test_visitors = 1000

chi2, p_value = stats.chi2_contingency([
    [control_conversions, control_visitors - control_conversions],
    [test_conversions, test_visitors - test_conversions]
])[:2]

print(f"p-value: {p_value:.4f}")  # < 0.05 = statistically significant

Cohort Analysis

Cohorts group users by when they first performed an action. Essential for understanding retention and the impact of product changes on different user groups.

Retention Cohort Table

Sign-up cohort   Day 1  Day 7  Day 14  Day 30  Day 60  Day 90
Jan 2025         45%    28%    22%     18%     15%     14%  ← flattening
Feb 2025         48%    30%    24%     19%     16%     -
Mar 2025         52%    35%    28%     22%     -       -
Apr 2025         55%    38%    30%     -       -       -

Reading: Jan's 14% D90 retention means 14% of Jan signups 
         were still active 90 days later.

Sign of PMF: Retention flattens (stops declining) — means 
             you have a core audience that keeps coming back.
Sign of trouble: Retention approaches 0% — bleeding all users.
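A retention table like the one above can be computed from a raw activity log. A minimal pandas sketch, assuming one row per (user_id, event_date); the tiny inline dataset is invented for illustration:

```python
import pandas as pd

# Minimal cohort retention sketch. Assumes an activity log with one row
# per (user_id, event_date); signup date = each user's first event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2025-01-01", "2025-01-08", "2025-01-02",
        "2025-01-02", "2025-02-01", "2025-02-08", "2025-03-02",
    ]),
})
signups = events.groupby("user_id")["event_date"].min().rename("signup_date")
df = events.join(signups, on="user_id")
df["cohort"] = df["signup_date"].dt.to_period("M")
df["days_since_signup"] = (df["event_date"] - df["signup_date"]).dt.days

# Share of each monthly cohort still active at day 7 or later
cohort_size = df.groupby("cohort")["user_id"].nunique()
d7_active = df[df["days_since_signup"] >= 7].groupby("cohort")["user_id"].nunique()
d7_retention = (d7_active / cohort_size).fillna(0)
print(d7_retention)
```

Extending this to D14/D30/D90 is a matter of repeating the last three lines per window, or pivoting on a bucketed `days_since_signup` column.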

Cohort Analysis for Feature Impact

Scenario: Did the new onboarding (shipped March 15) improve retention?

Pre-onboarding cohorts (Jan-Mar): D30 retention = 18% avg
Post-onboarding cohorts (Apr-Jun): D30 retention = 24% avg

Is this the feature? Check:
1. Did other things change? (marketing channel, seasonality)
2. Are cohort sizes similar? (mix shift can distort)
3. Is the difference statistically significant? (run t-test on user-level data)
4. Is the new cohort large enough? (wait 30 days for D30 data)
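Check 3 can be sketched with a chi-squared test on user-level retention flags, the same scipy pattern used for the funnel test earlier; the counts below are illustrative, not real data:

```python
from scipy import stats

# Illustrative user-level D30 retention counts (invented numbers):
# pre-launch cohorts: 1,800 of 10,000 retained; post-launch: 2,400 of 10,000.
pre_retained, pre_n = 1_800, 10_000
post_retained, post_n = 2_400, 10_000

table = [
    [pre_retained, pre_n - pre_retained],
    [post_retained, post_n - post_retained],
]
chi2, p_value, _, _ = stats.chi2_contingency(table)
print(f"D30: {pre_retained/pre_n:.0%} -> {post_retained/post_n:.0%}, p = {p_value:.2e}")
```

A significant p-value here still only rules out chance, not the confounds in checks 1 and 2.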

A/B Testing Fundamentals

Before You Test — Sample Size

Calculate required sample size BEFORE running the experiment. Running until it "looks significant" inflates false positive rates.

import math

def required_sample_size(baseline_rate, min_detectable_effect, power=0.8, significance=0.05):
    """
    baseline_rate:        current conversion rate (e.g., 0.05 for 5%)
    min_detectable_effect: smallest relative change worth detecting (e.g., 0.10 for 10%)
    power:                probability of detecting a real effect (0.8 = 80%)
    significance:         acceptable false positive rate (0.05 = 5%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    
    z_alpha = 1.96  # for 95% confidence (two-tailed)
    z_beta  = 0.842 # for 80% power
    
    n = (z_alpha * math.sqrt(2 * p1 * (1-p1)) + z_beta * math.sqrt(p1*(1-p1) + p2*(1-p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

# Example: 5% baseline, want to detect 10% lift
n = required_sample_size(0.05, 0.10)
print(f"Required per variant: {n:,}")  # ~30,000 per variant
# Test duration in days ≈ (n × 2) / daily_traffic

Interpreting Results

p-value: probability of seeing this result if null hypothesis is true
         p < 0.05 → reject null (likely real effect)
         p > 0.05 → don't reject null (effect may not be real)

Confidence interval: [lower, upper] — if it excludes 0, the effect is statistically significant at the chosen confidence level

Practical significance: Is the lift large enough to matter?
  Statistically significant 0.1% lift on checkout: not worth shipping complexity
  Statistically significant 5% lift on checkout: ship immediately

Novelty effect: New features often show inflated early results.
  Run tests for at least 2 full business cycles (2 weeks minimum).
  Segment: "new users during test" vs "existing users during test"
  — existing users show the novelty effect; new users show steady state.

A/B Test Decision Framework

Result           Action
-----------      --------
Significant positive  →  Ship (verify guardrails didn't degrade)
Significant negative  →  Drop + analyze why
Inconclusive          →  Assess: extend runtime? Increase sample? Simplify hypothesis?
"Directionally positive" → Be skeptical. Either extend or run a bigger bet.

Multi-armed bandit: Use for content/messaging experiments where you 
                    want to exploit winning variants quickly.
                    Use classic A/B for feature experiments where you need 
                    clean before/after attribution.
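The bandit idea can be sketched with Thompson sampling over Bernoulli conversion rates, using only the standard library; the "true" rates below are invented for the simulation (in production they are unknown):

```python
import random

# Thompson sampling for two message variants with Bernoulli conversions.
# True rates are invented for this simulation.
true_rates = {"A": 0.05, "B": 0.08}
alpha = {v: 1 for v in true_rates}  # Beta prior: successes + 1
beta = {v: 1 for v in true_rates}   # Beta prior: failures + 1

random.seed(42)
pulls = {v: 0 for v in true_rates}
for _ in range(20_000):
    # Sample a plausible rate per variant, serve the best sample
    samples = {v: random.betavariate(alpha[v], beta[v]) for v in true_rates}
    choice = max(samples, key=samples.get)
    pulls[choice] += 1
    if random.random() < true_rates[choice]:
        alpha[choice] += 1
    else:
        beta[choice] += 1

print(pulls)  # traffic shifts toward the better variant over time
```

The trade-off versus a classic A/B test is visible in `pulls`: traffic concentrates on the winner quickly, which reduces regret but leaves the losing arm with a small, unevenly-timed sample.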

Event Taxonomy Design

Well-designed events are the foundation of all analytics. Bad taxonomy = unmaintainable, uninterpretable data.

Noun-Verb Convention

Format: {object}_{action}  (snake_case, past tense)

✅ Good events:
user_signed_up
project_created
payment_completed
team_member_invited
feature_flag_enabled
file_exported
search_performed
onboarding_step_completed

❌ Bad events:
button_clicked          (what button? what did it do?)
page_viewed             (use consistent noun: dashboard_viewed)
action_performed        (meaningless)
UserSignedUp            (wrong casing convention)
sign up complete        (spaces, ambiguous)
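The convention can be linted mechanically at instrumentation-review time. A regex sketch; the verb allowlist is a starting point you would extend for your own taxonomy:

```python
import re

# snake_case with at least two tokens: {object}_{action}
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
# Past-tense verbs accepted as the final token; extend per your taxonomy.
PAST_TENSE = {"created", "completed", "signed_up", "viewed", "invited",
              "performed", "exported", "enabled", "deleted", "clicked"}

def valid_event_name(name: str) -> bool:
    """Check {object}_{action}: snake_case with a past-tense final verb."""
    if not SNAKE.match(name):
        return False
    tokens = name.split("_")
    # Allow two-token verbs like signed_up
    return tokens[-1] in PAST_TENSE or "_".join(tokens[-2:]) in PAST_TENSE

print(valid_event_name("project_created"))   # True
print(valid_event_name("UserSignedUp"))      # False (wrong casing)
print(valid_event_name("sign up complete"))  # False (spaces)
```

Running a check like this in CI against the tracking plan is cheaper than cleaning up an inconsistent event stream later.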

Event Properties (the real value is in properties)

// user_signed_up event with rich properties
analytics.track('user_signed_up', {
  // Identity
  user_id: 'usr_abc123',
  email_domain: 'acme.com',    // not full email — privacy
  
  // Acquisition
  signup_source: 'organic',    // 'organic' | 'paid' | 'referral' | 'direct'
  utm_campaign: 'spring-sale',
  utm_medium: 'email',
  referrer_user_id: 'usr_xyz', // if invited
  
  // Context
  signup_method: 'google_oauth', // 'email_password' | 'google_oauth' | 'github'
  plan_selected: 'pro',
  
  // Experiment
  experiment_variant: 'onboarding_v2', // which A/B variant they saw
  
  // Timing
  time_from_landing_to_signup_seconds: 142,
})

Event Taxonomy Schema (Amplitude/Mixpanel)

Object Type → Action → Properties
------------------------------------
User        → signed_up, logged_in, profile_updated, deleted_account
Project     → created, renamed, archived, deleted, shared, exported
Content     → created, published, edited, deleted, viewed, liked, shared
Payment     → initiated, completed, failed, refunded, subscription_created
Team        → created, member_invited, member_removed, role_changed
Feature     → enabled, disabled, usage (with feature_name property)
Error       → api_error, validation_error (with error_code, error_message)

Amplitude/Mixpanel Implementation Patterns

// Mixpanel: identify user with traits
mixpanel.identify(user.id)
mixpanel.people.set({
  '$email':    user.email,
  '$name':     user.name,
  'plan':      user.plan,
  'company':   user.company,
  'created_at': user.createdAt,
})

// Amplitude: set user properties
amplitude.setUserId(user.id)
const identify = new amplitude.Identify()
  .set('plan', user.plan)
  .set('company_size', user.companySize)
  .setOnce('signup_date', user.createdAt)  // setOnce prevents overwrites
amplitude.identify(identify)

// Group analytics (org-level metrics)
mixpanel.set_group('company', user.companyId)
mixpanel.get_group('company', user.companyId).set({
  'plan': org.plan,
  'seat_count': org.seats,
})

Attribution Modeling

How do you credit conversions across multiple touchpoints?

Customer journey:
Day 1: Saw Twitter ad          → Ad spend: $0.50
Day 3: Read blog post (organic)
Day 7: Clicked Google Search ad → Ad spend: $2.00
Day 8: Opened welcome email
Day 10: Converted (paid $99)

Attribution models:
First-touch: Twitter ad gets 100% credit ($99)
Last-touch:  Google Search ad gets 100% credit ($99) ← default in GA4
Linear:      Each touchpoint gets $24.75 (4 touchpoints)
Time-decay:  More credit to touchpoints closer to conversion
  Google Search: ~40%, Email: ~30%, Blog: ~20%, Twitter: ~10%
Data-driven:  ML model based on historical patterns (requires lots of data)

What to use when:

  • First-touch: Understanding top-of-funnel channel value

  • Last-touch: SEM/paid campaigns where click → convert is the model

  • Linear/Time-decay: Content marketing attribution

  • Data-driven: Large companies with enough events for ML models
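The linear and time-decay splits can be sketched as follows. The 7-day half-life is an assumption for illustration; real tools use their own decay curves:

```python
# Touchpoints from the journey above: (channel, days before conversion).
touches = [("twitter_ad", 9), ("blog_post", 7), ("search_ad", 3), ("email", 2)]
revenue = 99.0

def linear(touches, revenue):
    """Equal credit to every touchpoint."""
    share = revenue / len(touches)
    return {ch: round(share, 2) for ch, _ in touches}

def time_decay(touches, revenue, half_life_days=7):
    """Credit halves for every `half_life_days` further from conversion."""
    weights = {ch: 0.5 ** (days / half_life_days) for ch, days in touches}
    total = sum(weights.values())
    return {ch: round(revenue * w / total, 2) for ch, w in weights.items()}

print(linear(touches, revenue))      # each touch gets $24.75
print(time_decay(touches, revenue))  # later touches get more credit
```

Swapping the half-life shows how sensitive "channel ROI" is to a modeling choice that marketing tools rarely surface.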


Session Recording Analysis

Tools: Hotjar, FullStory, Microsoft Clarity (free)

Patterns to Look For

Rage clicks:     User clicks same area repeatedly → something isn't working
Dead clicks:     Clicking non-interactive elements → perceived affordance mismatch
Scroll depth:    Where do users stop reading? → CTA placement optimization
U-turns:         Back-and-forth between two pages → navigation confusion
Form abandonment: Which field causes drop-off? → form friction analysis
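Rage clicks can also be detected offline from raw click logs. A sketch assuming (timestamp_ms, x, y) click tuples sorted by time; the thresholds are common defaults, not a standard:

```python
# Rage-click detection sketch: N+ clicks within a small radius and short gaps.
# Thresholds here are common defaults, not a standard.
def find_rage_clicks(clicks, max_gap_ms=700, radius_px=30, min_clicks=3):
    """clicks: list of (timestamp_ms, x, y), sorted by timestamp."""
    bursts, run = [], [clicks[0]]
    for click in clicks[1:]:
        t, x, y = click
        t0, x0, y0 = run[-1]
        near = abs(x - x0) <= radius_px and abs(y - y0) <= radius_px
        if t - t0 <= max_gap_ms and near:
            run.append(click)
        else:
            if len(run) >= min_clicks:
                bursts.append(run)
            run = [click]
    if len(run) >= min_clicks:
        bursts.append(run)
    return bursts

clicks = [(0, 100, 200), (300, 102, 198), (550, 101, 201),  # rage burst
          (5000, 400, 300)]                                  # normal click
print(len(find_rage_clicks(clicks)))
```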

Hotjar heatmap reading:
  Dark red spots → most attention/interaction
  Cold blue spots → ignored content
  
  Common findings:
  - Hero image gets more clicks than CTA
  - Users try to click non-link text
  - Footer links have surprising engagement

Data-Informed vs Data-Driven

The most important distinction in analytics philosophy.

Data-driven: The data makes the decision. The metrics determine the action. No override.

Data-informed: Data is a critical input, but judgment, strategy, and ethics also inform the decision.

When to trust the data over judgment:
- A/B test with sufficient power and clear result
- Funnel drop-off is obvious and unambiguous
- Retention cohort shows clear inflection from product change

When to override data with judgment:
- Metric being optimized conflicts with long-term user trust
  ("notification click rates are up" — but we're burning goodwill)
- Small n — the sample is too small to support even a directional read
- Survivorship bias — data only reflects users who stayed
- The strategy requires investment before metrics improve
  (new market expansion looks "bad" in data before it gets good)
- Ethical concerns about a tactic that "works" in the data

Anti-Patterns

Vanity metrics — pageviews, downloads, Twitter followers. They move easily, don't predict revenue, and create a false sense of progress.

Peeking at A/B tests — checking results before hitting your required sample size inflates false positives dramatically (up to 3x more false positives).

One-size-fits-all metrics — different user segments should have segment-specific KPIs. Power users and casual users have different value patterns.

Event names that change — signup_complete becomes registration_finished in v2. Now you can't compare cohorts. Lock naming conventions and enforce them.

Tracking everything — 500 events with no ownership creates a graveyard. Each event should answer a specific question. Delete events not used in 6 months.

Correlation is causation — "Sign-ups increased the week we shipped X" is not evidence X caused it. You need controlled experiments.

Quick Reference

A/B Test Checklist

  • [ ] Hypothesis written (If X, then Y, because Z)
  • [ ] Primary metric defined before launch
  • [ ] Guardrail metrics defined (won't degrade)
  • [ ] Sample size calculated (not based on time, based on events)
  • [ ] Test runs minimum 2 full business cycles
  • [ ] Segment analysis planned (new vs existing users)
  • [ ] Ship/no-ship threshold defined upfront

North Star Metric Check

  • [ ] Reflects value delivered to customers (not activity)
  • [ ] Predictive of long-term revenue
  • [ ] Can be decomposed into leading indicators
  • [ ] Can't be easily gamed without delivering real value
  • [ ] Whole team understands what it means and how to move it

Event Taxonomy Rules

Format:      {noun}_{verb}  (past tense, snake_case)
Properties:  Always include user_id, timestamp (auto), experiment_variant
Avoid:       PII (email, phone), button/UI names (use semantic names)
Test in:     Dev environment before production
Review:      Analytics review every 6 months — delete unused events
