

Expert-level Python data science patterns covering pandas, polars, numpy, EDA workflows, feature engineering, visualization, and memory optimization. Trigger phrases: when doing data analysis in Python, pandas vs polars, pandas performance,

MoltbotDen
Data & Analytics

Python Data Science

Python data science is dominated by two traps: using loops where vectorized operations exist, and reaching for complex solutions when pandas already has a built-in method. The expert mindset is columnar — think about operations on entire columns at once, not row by row. Know when pandas is fast (vectorized NumPy ops), when it's slow (apply(), object dtype, Python loops), and when polars is the better choice entirely.

Core Mental Model

DataFrames are columnar: each column is a contiguous array in memory. Operations on entire columns are handed off to C/NumPy implementations and run at near-C speed. Operations that process row-by-row in Python (apply, for loops, iterrows) are 10-1000x slower. The query pipeline is: load → filter → transform → aggregate → shape output. Filter early (reduce rows), select columns you need (reduce memory), then transform. Always profile before optimizing — most "slow" pandas code has one bottleneck, not many.
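
The 10-1000x gap above is easy to verify directly. A minimal timing sketch on synthetic data (column names and sizes are illustrative, not from the original), contrasting a Python-level row loop with the equivalent columnar operation:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(50_000),
    "quantity": np.random.randint(1, 10, 50_000),
})

def row_loop():
    # Row-by-row: each iteration materializes a Series object — very slow
    return [row["price"] * row["quantity"] for _, row in df.iterrows()]

def vectorized():
    # One columnar multiply, handed off to NumPy's C implementation
    return df["price"] * df["quantity"]

t_loop = timeit.timeit(row_loop, number=1)
t_vec = timeit.timeit(vectorized, number=1)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.5f}s  speedup: {t_loop / t_vec:,.0f}x")
```

The exact ratio depends on hardware and row count, but the loop is reliably orders of magnitude slower, which is why profiling usually points at a single row-wise bottleneck.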


Pandas vs Polars

Use pandas when:
  - Ecosystem integration needed (scikit-learn, statsmodels, seaborn)
  - Small-medium datasets (< 1GB)
  - Exploratory work, Jupyter notebooks
  - Complex string operations, apply() with complex logic
  - GroupBy with custom aggregations

Use polars when:
  - Large datasets (> 1GB) where pandas is slow/OOM
  - Multi-threaded computation needed (polars uses all cores)
  - Lazy evaluation + query optimization (like Spark but in-process)
  - Streaming data larger than RAM
  - Pipeline-heavy ETL transformations

Speed comparison: polars is typically 5-20x faster than pandas on large datasets.

# Polars lazy API: query optimizer rewrites the plan
import polars as pl

# Lazy scan — doesn't load data until collect()
df = (
    pl.scan_parquet("large_dataset/*.parquet")
    .filter(pl.col("status").eq("active"))
    .filter(pl.col("created_at") >= pl.lit("2024-01-01").str.to_date())
    .select([
        "user_id", "amount", "category", "created_at"
    ])
    .with_columns([
        pl.col("amount").log1p().alias("log_amount"),
        pl.col("created_at").dt.month().alias("month"),
        (pl.col("amount") / pl.col("amount").mean().over("category")).alias("amount_vs_cat_mean")
    ])
    .group_by(["user_id", "month"])
    .agg([
        pl.col("amount").sum().alias("total_spend"),
        pl.col("amount").count().alias("transaction_count"),
        pl.col("category").n_unique().alias("category_diversity")
    ])
    .sort("total_spend", descending=True)
    .collect()  # execute the plan here
)

# Polars expression context (no UDFs needed for most ops)
df.with_columns([
    # String ops
    pl.col("email").str.to_lowercase().alias("email_lower"),
    pl.col("name").str.split(" ").list.first().alias("first_name"),
    # Conditional
    pl.when(pl.col("amount") > 1000).then(pl.lit("high"))
      .when(pl.col("amount") > 100).then(pl.lit("medium"))
      .otherwise(pl.lit("low")).alias("amount_tier"),
    # Window function
    pl.col("amount").rank("dense").over("category").alias("rank_in_category")
])

Pandas Vectorized Idioms

import pandas as pd
import numpy as np

# ❌ Loop (10-100x slower)
for idx, row in df.iterrows():
    df.at[idx, 'total'] = row['price'] * row['quantity']

# ✅ Vectorized
df['total'] = df['price'] * df['quantity']

# ❌ apply() for simple operations
df['tax'] = df['total'].apply(lambda x: x * 0.08)

# ✅ Vectorized arithmetic
df['tax'] = df['total'] * 0.08

# ✅ np.where for if/else (fastest conditional)
df['tier'] = np.where(df['total'] > 1000, 'high', 'low')

# ✅ np.select for multiple conditions
conditions = [
    df['total'] > 1000,
    df['total'] > 100,
    df['total'] > 10
]
choices = ['high', 'medium', 'low']
df['tier'] = np.select(conditions, choices, default='minimal')

# ✅ .str accessor for string operations
df['email_domain'] = df['email'].str.split('@').str[-1]
df['name_upper']   = df['name'].str.upper()
df['has_promo']    = df['code'].str.contains(r'^PROMO\d{4}', na=False)

# ✅ .dt accessor for datetime operations
df['year']        = df['created_at'].dt.year
df['day_of_week'] = df['created_at'].dt.day_name()
df['days_since']  = (pd.Timestamp.now() - df['created_at']).dt.days
df['week_start']  = df['created_at'].dt.to_period('W').dt.start_time

Method Chaining with pipe()

# Method chaining: readable, testable pipeline
def clean_emails(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(email=df['email'].str.lower().str.strip())

def add_customer_age(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        customer_age_days=(pd.Timestamp.now() - df['created_at']).dt.days
    )

def filter_active(df: pd.DataFrame, min_orders: int = 1) -> pd.DataFrame:
    return df[df['order_count'] >= min_orders]

def add_spend_tier(df: pd.DataFrame) -> pd.DataFrame:
    conditions = [df['ltv'] > 1000, df['ltv'] > 100]
    choices    = ['vip', 'regular']
    return df.assign(spend_tier=np.select(conditions, choices, default='new'))

# Clean pipeline: each step is pure function, easily unit-tested
result = (
    raw_df
    .pipe(clean_emails)
    .pipe(add_customer_age)
    .pipe(filter_active, min_orders=2)
    .pipe(add_spend_tier)
    .query("customer_age_days < 365")
    .sort_values("ltv", ascending=False)
    .reset_index(drop=True)
)

GroupBy Patterns

# transform: broadcast aggregated value back to original shape (for new column)
# Use when: you want group stats AS a column in the original DataFrame
df['mean_amount_in_group'] = df.groupby('category')['amount'].transform('mean')
df['amount_z_score'] = (
    (df['amount'] - df.groupby('category')['amount'].transform('mean')) /
    df.groupby('category')['amount'].transform('std')
)

# agg: aggregate down to one row per group
# Use when: you want summary statistics per group
summary = df.groupby('category').agg(
    total_revenue=('amount', 'sum'),
    order_count=('order_id', 'nunique'),
    avg_amount=('amount', 'mean'),
    p90_amount=('amount', lambda x: x.quantile(0.9)),
    first_order=('created_at', 'min'),
    last_order=('created_at', 'max')
).reset_index()

# apply: arbitrary function per group (use only when agg/transform won't do)
# Note: 5-10x slower than agg/transform — use as last resort
def custom_cohort_metric(group):
    return pd.Series({
        'cohort_ltr': group['amount'].sum() / group['customer_id'].nunique(),
        'repurchase_rate': (group.groupby('customer_id').size() > 1).mean()
    })

cohort_metrics = df.groupby('cohort_month').apply(custom_cohort_metric).reset_index()

# Named aggregation (pandas 0.25+): cleaner than dictionary agg
result = df.groupby(['region', 'category']).agg(
    revenue=('amount', 'sum'),
    orders=('order_id', 'count'),
    customers=('customer_id', 'nunique')
)

Time Series with DatetimeIndex

# Set DatetimeIndex for time series operations
ts = df.set_index('timestamp').sort_index()

# Resample: like groupby for time intervals
daily_revenue = (
    ts['amount']
    .resample('D')         # D=day, W=week; pandas 2.2+ uses 'ME' for month-end, 'h' for hour
    .agg(['sum', 'count', 'mean'])
    .rename(columns={'sum': 'revenue', 'count': 'orders', 'mean': 'avg_order'})
)

# Rolling window statistics
ts['revenue_7d_ma']   = ts['amount'].rolling('7D').mean()  # 7-day moving average
ts['revenue_7d_std']  = ts['amount'].rolling('7D').std()
ts['revenue_28d_sum'] = ts['amount'].rolling('28D').sum()

# Expanding window (cumulative)
ts['cumulative_revenue'] = ts['amount'].expanding().sum()

# Lag/lead features for forecasting
ts['revenue_lag_7d']  = ts['amount'].shift(7)   # value from 7 periods ago
ts['revenue_lag_28d'] = ts['amount'].shift(28)
ts['revenue_pct_change_7d'] = ts['amount'].pct_change(7)

# Period comparisons (on the daily series built above — ts itself has no 'revenue' column)
daily_revenue['wow_growth'] = daily_revenue['revenue'].pct_change(7)   # week-over-week
daily_revenue['yoy_growth'] = daily_revenue['revenue'].pct_change(365) # year-over-year

# Localize and convert timezones
ts.index = ts.index.tz_localize('UTC').tz_convert('America/Chicago')

Memory Optimization

# dtype selection: massive impact on memory usage
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # baseline

# Downcast integers
df['user_id'] = pd.to_numeric(df['user_id'], downcast='unsigned')
df['count'] = df['count'].astype('int16')     # if max < 32767
df['amount'] = df['amount'].astype('float32') # if full precision not needed

# Categorical for low-cardinality strings (huge savings)
df['status']   = df['status'].astype('category')   # 10 unique values → ~95% memory saving
df['country']  = df['country'].astype('category')
df['category'] = df['category'].astype('category')

# Before/after
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

# Chunked reading for large files
chunk_results = []
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000,
                          dtype={'user_id': 'int32', 'amount': 'float32'},
                          parse_dates=['created_at']):
    # Process each chunk
    processed = chunk.pipe(your_processing_pipeline)
    chunk_results.append(processed)
result = pd.concat(chunk_results, ignore_index=True)

# Select only needed columns on load (skip reading unwanted columns)
df = pd.read_csv('data.csv', usecols=['user_id', 'amount', 'created_at', 'status'])

# Sparse arrays for data with many NaN values
df['optional_field'] = pd.arrays.SparseArray(df['optional_field'])

EDA Checklist

def eda(df: pd.DataFrame, target_col: str | None = None) -> None:
    """Systematic exploratory data analysis."""
    
    print("=== SHAPE ===")
    print(f"Rows: {len(df):,}, Cols: {df.shape[1]}")
    
    print("\n=== DTYPES & NULLS ===")
    null_report = pd.DataFrame({
        'dtype':    df.dtypes,
        'nulls':    df.isnull().sum(),
        'null_pct': (df.isnull().sum() / len(df) * 100).round(1),
        'nunique':  df.nunique()
    })
    print(null_report[null_report['nulls'] > 0].sort_values('null_pct', ascending=False))
    
    print("\n=== NUMERIC DISTRIBUTIONS ===")
    print(df.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).round(2))
    
    print("\n=== CATEGORICAL DISTRIBUTIONS ===")
    for col in df.select_dtypes(include=['object', 'category']).columns:
        vc = df[col].value_counts()
        print(f"\n{col}: {df[col].nunique()} unique")
        print(vc.head(10).to_string())
    
    if target_col and target_col in df.columns:
        print(f"\n=== TARGET: {target_col} ===")
        print(df[target_col].describe())
        
        print("\n=== CORRELATIONS WITH TARGET ===")
        correlations = (
            df.select_dtypes('number').corr()[target_col]
            .abs()
            .drop(target_col)  # exclude the target's self-correlation of 1.0
        )
        print(correlations.sort_values(ascending=False).head(20))
    
    print("\n=== DUPLICATE ROWS ===")
    print(f"Exact duplicates: {df.duplicated().sum():,}")
    
    print("\n=== DATE RANGE (if datetime cols) ===")
    for col in df.select_dtypes(include=['datetime64']).columns:
        print(f"{col}: {df[col].min()} → {df[col].max()}")

Visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Style
sns.set_theme(style="darkgrid", palette="muted")
plt.rcParams['figure.dpi'] = 120

# Correlation heatmap
def plot_correlation_heatmap(df: pd.DataFrame, figsize=(12, 10)):
    numeric_df = df.select_dtypes('number')
    corr = numeric_df.corr()
    
    mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle
    
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
                center=0, vmin=-1, vmax=1, ax=ax,
                cbar_kws={'shrink': 0.8})
    ax.set_title('Feature Correlations', fontsize=14)
    plt.tight_layout()
    return fig

# Distribution comparison
def plot_distributions(df: pd.DataFrame, cols: list, hue: str = None):
    n = len(cols)
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))
    if n == 1:
        axes = [axes]
    for ax, col in zip(axes, cols):
        sns.histplot(data=df, x=col, hue=hue, kde=True, ax=ax)
        ax.set_title(col)
    plt.tight_layout()
    return fig

# Time series with matplotlib (better than seaborn for complex time charts)
def plot_time_series(ts: pd.Series, title: str, ma_window: int = 7):
    fig, ax = plt.subplots(figsize=(14, 5))
    ax.plot(ts.index, ts.values, alpha=0.4, linewidth=1, label='Daily')
    ax.plot(ts.index, ts.rolling(ma_window).mean(), linewidth=2,
            label=f'{ma_window}-day MA')
    ax.set_title(title)
    ax.legend()
    ax.xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y-%m'))
    plt.xticks(rotation=45)
    plt.tight_layout()
    return fig

# Interactive: plotly for dashboards / notebooks
import plotly.express as px
fig = px.scatter(df, x='amount', y='frequency', color='category',
                 size='ltv', hover_data=['user_id'],
                 title='RFM Scatter Plot')
fig.show()

Anti-Patterns

# ❌ iterrows() — Python loop, 100-1000x slower than vectorized
for idx, row in df.iterrows():
    result.append(compute(row))
# ✅ Better: itertuples() or df.apply(compute, axis=1) — still Python loops; best: vectorized column ops

# ❌ Concatenating in a loop (O(n²) memory copying)
result = pd.DataFrame()
for chunk in chunks:
    result = pd.concat([result, chunk])
# ✅ Collect then concat once
result = pd.concat(chunks, ignore_index=True)

# ❌ Chained indexing (SettingWithCopyWarning, silent bugs)
df[df['status'] == 'active']['amount'] = 0
# ✅ Use .loc with boolean mask
df.loc[df['status'] == 'active', 'amount'] = 0

# ❌ object dtype for numeric columns (slow, memory-hungry)
df['amount'] = df['amount'].astype(object)  # accidentally or from CSV
# ✅ Always enforce dtypes on load
df = pd.read_csv('data.csv', dtype={'amount': 'float32', 'user_id': 'int32'})

# ❌ apply() for simple math
df['total'] = df.apply(lambda r: r['price'] * r['qty'], axis=1)
# ✅ Vectorized arithmetic
df['total'] = df['price'] * df['qty']

Quick Reference

Vectorization Speed Order (fastest → slowest):
  1. Built-in pandas/NumPy operations (sum, mean, std)
  2. NumPy ufuncs on .values array
  3. .str / .dt accessors
  4. np.where / np.select for conditionals
  5. .apply(lambda, axis=0) — column-wise
  6. .apply(lambda, axis=1) — row-wise (use sparingly)
  7. iterrows() / itertuples()  ← NEVER for large datasets

GroupBy Guide:
  New column with group stat        → transform
  Summary table                     → agg
  Complex custom logic              → apply (last resort)
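
The transform-vs-agg distinction in the guide comes down to output shape. A toy example (hypothetical data) makes it concrete:

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "amount": [1.0, 3.0, 10.0]})

# transform: result is aligned to the original rows (length 3),
# so it can be assigned straight back as a new column
per_row = df.groupby("category")["amount"].transform("mean")

# agg: result collapses to one row per group (length 2) — a summary table
per_group = df.groupby("category")["amount"].agg("mean")

print(per_row.tolist())    # [2.0, 2.0, 10.0]
print(per_group.tolist())  # [2.0, 10.0]
```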

Memory Hierarchy:
  bool          1 byte
  int8/uint8    1 byte
  int16/uint16  2 bytes
  int32/float32 4 bytes
  int64/float64 8 bytes (default)
  object        variable (worst)
  category      ~4 bytes/value (for repeated strings)
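
The hierarchy can be measured directly. A small sketch (made-up values; exact sizes vary by pandas version and platform) comparing the same repeated-string column stored as object vs category:

```python
import pandas as pd

# 300k rows, only 3 distinct values — the classic low-cardinality case
s_object = pd.Series(["active", "inactive", "pending"] * 100_000)
s_category = s_object.astype("category")

mb_object = s_object.memory_usage(deep=True) / 1e6
mb_category = s_category.memory_usage(deep=True) / 1e6
print(f"object: {mb_object:.1f} MB  category: {mb_category:.2f} MB")
```

With 3 categories the codes fit in int8, so the categorical column is typically well over 90% smaller than its object-dtype equivalent.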

Polars vs Pandas Decision:
  > 1GB dataset                     → polars
  Multi-core needed                 → polars
  scikit-learn integration          → pandas
  Complex apply() logic             → pandas (or polars expr)
