bigquery-expert
Expert knowledge of BigQuery performance, cost optimization, clustering, partitioning, BigQuery ML, authorized views, materialized views, slot reservations, and advanced SQL patterns. Trigger phrases: when working with BigQuery, BigQuery cost optimization, BigQuery partitioning and clustering.
BigQuery Expert
BigQuery's billing model fundamentally changes how you write SQL. Every byte scanned costs money; every query that doesn't use partition pruning or column selection is a cost leak. The expert mindset is: filter partition columns first, select only needed columns, use clustering to reduce rows scanned within a partition, and understand that storage is cheap while query compute (slots) is the primary cost lever. BigQuery's columnar format means SELECT * is particularly expensive — it reads every column.
Core Mental Model
BigQuery is a serverless, column-oriented, distributed SQL engine. It stores data in Capacitor (columnar format), which means it reads only the columns you select. Partitioning divides the table into segments by a date/timestamp/integer column — queries filtering on the partition column skip non-matching segments entirely. Clustering sorts data within each partition by up to 4 columns — reducing rows scanned within a partition. Always: partition first, cluster second, and filter on both in every query.
Partitioning and Clustering
-- Partitioned + clustered table (the production-standard DDL)
CREATE TABLE `myproject.analytics.events`
(
event_id STRING NOT NULL,
user_id STRING NOT NULL,
event_type STRING NOT NULL,
payload JSON,
event_date DATE NOT NULL, -- partition column
created_at TIMESTAMP NOT NULL
)
PARTITION BY event_date
CLUSTER BY user_id, event_type -- clustering columns (up to 4)
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = TRUE, -- prevents accidental full table scans
description = "User events, partitioned by day, clustered by user+type"
);
-- Timestamp partitioning (auto-creates DATE partition from TIMESTAMP)
CREATE TABLE `myproject.analytics.orders`
(
order_id STRING NOT NULL,
customer_id STRING NOT NULL,
amount NUMERIC,
status STRING,
created_at TIMESTAMP NOT NULL
)
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, status;
-- Integer range partitioning (useful for non-time data)
CREATE TABLE `myproject.analytics.users`
(
user_id INT64 NOT NULL,
email STRING,
tier STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 10000000, 1000000));
-- Creates 10 partitions of 1M users each
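Range partitions prune the same way date partitions do: a constant filter on the bucketing column skips non-matching buckets. A minimal sketch against the users table above:

-- A constant range on the bucketing column prunes to the matching buckets
SELECT user_id, email, tier
FROM `myproject.analytics.users`
WHERE user_id BETWEEN 2000000 AND 2999999; -- touches ~1 of the 10 partitions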
Partition Pruning
-- ✅ Always filter on partition column (DATE, TIMESTAMP, or DATE() of TIMESTAMP)
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) -- partition pruning ✅
AND user_id = 'user_123';
-- ❌ Wrapping partition column in a function (breaks partition pruning)
WHERE DATE_ADD(event_date, INTERVAL 1 DAY) > '2024-01-01' -- no partition pruning!
-- ✅ For a table partitioned by DATE(created_at), a raw TIMESTAMP filter prunes
WHERE created_at >= TIMESTAMP('2024-01-01') AND created_at < TIMESTAMP('2024-02-01')
-- ✅ DATE() over the partitioning TIMESTAMP is also recognized by the pruner
WHERE DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31'
-- When in doubt, dry-run the query and check estimated bytes (BigQuery has no EXPLAIN)
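Ingestion-time partitioned tables (no explicit partition column) expose pseudo-columns instead. A sketch, assuming a hypothetical daily ingestion-time partitioned raw_events table:

-- _PARTITIONDATE / _PARTITIONTIME are pseudo-columns on ingestion-time partitioned tables
SELECT event_id
FROM `myproject.analytics.raw_events` -- hypothetical ingestion-time partitioned table
WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY); -- prunes by load date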
Query Cost Analysis
-- INFORMATION_SCHEMA: analyze query costs per project/user/table
SELECT
job_id,
user_email,
query,
total_bytes_processed / POW(2, 30) AS gb_scanned,
total_bytes_processed / POW(2, 40) * 6.25 AS estimated_cost_usd, -- on-demand $6.25/TB
total_slot_ms / 1000.0 AS slot_seconds,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND state = 'DONE'
AND error_result IS NULL
ORDER BY total_bytes_processed DESC
LIMIT 50;
-- Find most expensive queries per user
SELECT
user_email,
COUNT(*) AS query_count,
SUM(total_bytes_processed) / POW(2, 40) AS total_tb_scanned,
SUM(total_bytes_processed) / POW(2, 40) * 6.25 AS estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY total_tb_scanned DESC;
-- Tables with most scanned bytes (candidates for partitioning/clustering)
SELECT
referenced_table.table_id,
COUNT(*) AS query_count,
SUM(total_bytes_processed) / POW(2, 30) AS total_gb_scanned,
AVG(total_bytes_processed) / POW(2, 30) AS avg_gb_per_query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT j,
UNNEST(referenced_tables) AS referenced_table
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY referenced_table.table_id
ORDER BY total_gb_scanned DESC;
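To check whether partitioning is actually paying off, inspect per-partition sizes. A sketch using the dataset-level INFORMATION_SCHEMA.PARTITIONS view (verify the column names against the current BigQuery docs):

-- Per-partition row counts and sizes for the events table
SELECT
partition_id,
total_rows,
total_logical_bytes / POW(2, 30) AS logical_gb
FROM `myproject.analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'events'
ORDER BY partition_id DESC;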
Standard SQL: ARRAY, STRUCT, UNNEST
-- ARRAY: list of values in one column
SELECT
order_id,
ARRAY_AGG(product_id ORDER BY created_at) AS product_ids,
ARRAY_AGG(STRUCT(product_id, quantity, price) ORDER BY created_at) AS line_items
FROM order_items
GROUP BY order_id;
-- UNNEST: explode array into rows
SELECT
o.order_id,
item.product_id,
item.quantity,
item.price
FROM orders o,
UNNEST(o.line_items) AS item; -- cross join lateral on array
-- STRUCT: nested record
SELECT
user_id,
STRUCT(
first_name AS first,
last_name AS last,
email
) AS contact_info
FROM users;
-- Access nested fields
SELECT
user_id,
contact_info.email,
contact_info.first
FROM users_with_struct;
-- Combine ARRAY + STRUCT for complex analytics
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent,
ARRAY_AGG(
STRUCT(order_id, amount, status, DATE(created_at) AS order_date)
ORDER BY created_at DESC
LIMIT 5
) AS recent_orders
FROM orders
GROUP BY customer_id;
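The payload JSON column from the events DDL pairs naturally with the JSON functions; the $.plan and $.items paths here are hypothetical payload fields:

-- JSON_VALUE returns a scalar STRING; JSON_QUERY returns a JSON subtree
SELECT
event_id,
JSON_VALUE(payload, '$.plan') AS plan,
SAFE_CAST(JSON_VALUE(payload, '$.amount') AS NUMERIC) AS amount,
JSON_QUERY(payload, '$.items') AS items_json
FROM `myproject.analytics.events`
WHERE event_date = CURRENT_DATE();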
Window Functions
-- Window functions: computations across related rows without collapsing
SELECT
user_id,
order_id,
amount,
created_at,
-- Running total per customer
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY created_at
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_spend,
-- Rank within category by amount
RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS rank_in_category,
DENSE_RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS dense_rank_in_category,
-- Lag/lead for time-series comparisons
LAG(amount, 1) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_order_amount,
LEAD(amount, 1) OVER (PARTITION BY user_id ORDER BY created_at) AS next_order_amount,
-- Row number for deduplication
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS recency_rank
FROM orders;
-- QUALIFY: filter on window function result (BigQuery extension)
-- Find each customer's most recent order without a subquery
SELECT user_id, order_id, amount, created_at
FROM orders
WHERE TRUE -- QUALIFY requires a WHERE, GROUP BY, or HAVING clause in BigQuery
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1;
-- vs standard SQL (requires subquery)
SELECT user_id, order_id, amount, created_at
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM orders
)
WHERE rn = 1;
BigQuery ML
-- Train a logistic regression model to predict churn
CREATE OR REPLACE MODEL `myproject.analytics.customer_churn_model`
OPTIONS (
model_type = 'logistic_reg',
input_label_cols = ['churned'],
data_split_method = 'auto_split',
auto_class_weights = TRUE,
l2_reg = 0.1
) AS
SELECT
total_orders,
days_since_last_order,
avg_order_value,
order_frequency_30d,
support_tickets,
customer_tenure_days,
churned
FROM `myproject.analytics.training_features`
WHERE training_date < '2024-01-01';
-- Evaluate model
SELECT *
FROM ML.EVALUATE(MODEL `myproject.analytics.customer_churn_model`,
(SELECT * FROM `myproject.analytics.training_features`
WHERE training_date >= '2024-01-01'));
-- Make predictions
SELECT
customer_id,
churned AS actual,
predicted_churned,
predicted_churned_probs
FROM ML.PREDICT(MODEL `myproject.analytics.customer_churn_model`,
(SELECT * FROM `myproject.analytics.customer_features`));
-- Supported model types: linear_reg, logistic_reg, kmeans, matrix_factorization,
-- boosted_tree_classifier, boosted_tree_regressor, random_forest_*,
-- dnn_classifier, dnn_regressor, automl_*, tensorflow (BQML imports TF SavedModels)
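Preprocessing can be baked into the model with a TRANSFORM clause, so ML.PREDICT applies the same scaling and bucketizing automatically at inference time. A sketch reusing the churn features above (treat the specific preprocessing functions as assumptions to verify against the BQML docs):

-- TRANSFORM: preprocessing travels with the model
CREATE OR REPLACE MODEL `myproject.analytics.customer_churn_model_v2`
TRANSFORM (
ML.STANDARD_SCALER(avg_order_value) OVER () AS avg_order_value_scaled,
ML.QUANTILE_BUCKETIZE(customer_tenure_days, 5) OVER () AS tenure_bucket,
total_orders,
churned
)
OPTIONS (
model_type = 'logistic_reg',
input_label_cols = ['churned']
) AS
SELECT total_orders, avg_order_value, customer_tenure_days, churned
FROM `myproject.analytics.training_features`;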
Authorized Views
-- Authorized views: column/row level security without exposing underlying tables
-- Step 1: Create the restricted view
CREATE VIEW `myproject.analytics_restricted.customer_pii_view`
AS
SELECT
customer_id,
-- Mask PII for non-admin users
CASE
WHEN SESSION_USER() IN ('[email protected]', '[email protected]')
THEN email
ELSE CONCAT(LEFT(email, 3), '***@***')
END AS email,
CASE
WHEN SESSION_USER() IN ('[email protected]')
THEN full_name
ELSE NULL
END AS full_name,
tier,
created_at
FROM `myproject.analytics.customers`
WHERE
-- Row-level security: analysts see only their region
region = (
SELECT region
FROM `myproject.access_control.analyst_assignments`
WHERE analyst_email = SESSION_USER()
)
OR SESSION_USER() IN ('[email protected]');
-- Step 2: Authorize the view on the SOURCE dataset (not the table), so readers of
-- the view need no access to analytics.customers itself.
-- Console: analytics dataset → Sharing → Authorize views → add the view
-- bq CLI: edit the dataset ACL JSON and push it back:
--   bq show --format=prettyjson myproject:analytics > dataset.json
--   (add {"view": {"projectId": "myproject", "datasetId": "analytics_restricted",
--    "tableId": "customer_pii_view"}} to the "access" array)
--   bq update --source dataset.json myproject:analytics
Materialized Views
-- Materialized view: auto-refreshes when base table changes
CREATE MATERIALIZED VIEW `myproject.analytics.daily_revenue_mv`
PARTITION BY event_date
CLUSTER BY region
OPTIONS (
enable_refresh = TRUE,
refresh_interval_minutes = 60
)
AS
SELECT
DATE(created_at) AS event_date,
region,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value,
APPROX_COUNT_DISTINCT(customer_id) AS unique_customers -- COUNT(DISTINCT) isn't supported in incremental MVs
FROM `myproject.analytics.orders`
WHERE status IN ('completed', 'shipped')
GROUP BY event_date, region;
-- Smart tuning: BigQuery automatically rewrites eligible queries against the
-- BASE table to read from the materialized view instead.
-- This query is served from daily_revenue_mv rather than scanning orders:
SELECT region, SUM(amount) AS total_revenue
FROM `myproject.analytics.orders`
WHERE status IN ('completed', 'shipped')
AND DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY region;
Slot Allocation and Reservations
BigQuery pricing models:
On-demand: $6.25 per TB scanned (first 1 TB per month is free). No slot management. Good for exploration and spiky workloads.
Capacity (editions): pay for slot capacity, via autoscaling or commitments. Good for predictable, heavy workloads.
Slot math:
1 slot ≈ 1 virtual CPU's worth of compute
A 100 GB query with complex joins might consume ~200-500 slot-seconds (highly workload-dependent)
Reservations:
Baseline: slots always available to the assignment
Max: ceiling the reservation can burst to when idle slots exist
Autoscale: extra slots added automatically during bursts, billed per slot-second
INFORMATION_SCHEMA for slot analysis:
-- Slot utilization analysis
SELECT
TIMESTAMP_TRUNC(start_time, HOUR) AS hour,
SUM(total_slot_ms) / (1000.0 * 3600) AS slot_hours_used,
COUNT(*) AS queries
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
ORDER BY hour;
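The on-demand vs. capacity decision can be sanity-checked from the same JOBS history: compare what the last 30 days cost on-demand against the slot-hours a reservation would have had to cover:

-- Monthly on-demand spend vs. slot-hours consumed (for reservation sizing)
SELECT
SUM(total_bytes_billed) / POW(2, 40) * 6.25 AS on_demand_cost_usd,
SUM(total_slot_ms) / (1000.0 * 3600) AS total_slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
AND state = 'DONE';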
Wildcard Tables and Exports
-- Wildcard tables: query date-sharded tables (legacy pattern)
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
AND user_id = 'user_123';
-- Export to GCS
EXPORT DATA OPTIONS (
uri = 'gs://my-bucket/exports/orders-*.parquet',
format = 'PARQUET',
overwrite = TRUE,
compression = 'SNAPPY'
) AS
SELECT order_id, customer_id, amount, created_at
FROM `myproject.analytics.orders`
WHERE DATE(created_at) = CURRENT_DATE();
Anti-Patterns
-- ❌ SELECT * on large tables (reads ALL columns — kills performance and costs)
SELECT * FROM `myproject.analytics.events`
WHERE event_date = CURRENT_DATE();
-- ✅ Select only needed columns
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events`
WHERE event_date = CURRENT_DATE();
-- ❌ No partition filter (full table scan)
SELECT COUNT(*) FROM `myproject.analytics.events` WHERE user_id = 'abc';
-- ✅ Always filter partition column
SELECT COUNT(*) FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) -- partition pruning
AND user_id = 'abc';
-- ❌ CROSS JOIN without UNNEST (accidental cartesian product)
SELECT * FROM table_a, table_b; -- implicit cross join
-- ✅ Explicit joins with conditions
-- ❌ Repeated identical subqueries (each evaluated separately)
SELECT
(SELECT MAX(amount) FROM orders WHERE status = 'completed') AS max_order,
(SELECT MIN(amount) FROM orders WHERE status = 'completed') AS min_order;
-- ✅ Single subquery or CTE
WITH order_stats AS (
SELECT MAX(amount) AS max_order, MIN(amount) AS min_order
FROM orders WHERE status = 'completed'
)
SELECT max_order, min_order FROM order_stats;
-- Expressions on the partition column: only a few forms are recognized by the pruner
WHERE event_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) -- ✅ constant expression, prunes
WHERE EXTRACT(YEAR FROM event_date) = 2024 -- ⚠️ not guaranteed to prune; prefer a range
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31' -- ✅ equivalent range, prunes
WHERE TO_JSON_STRING(event_date) = '"2024-01-01"' -- ❌ function wrapping breaks pruning
Quick Reference
Table Design:
Always partition: DATE/TIMESTAMP column
Always cluster: first 1-4 query filter columns after partition
require_partition_filter: TRUE for big tables (prevent accidental scans)
Cost Reduction:
1. Select only needed columns (columnar = column cost)
2. Filter partition column first in WHERE
3. Use clustering to reduce rows within partition
4. Materialized views for repeated expensive aggregations
5. Use INFORMATION_SCHEMA.JOBS to find expensive queries
Standard SQL Reminders:
QUALIFY → filter on window function (no subquery needed)
UNNEST → explode ARRAY into rows
STRUCT → nested record
ARRAY_AGG → aggregate rows into array
BigQuery ML:
Train: CREATE MODEL ... OPTIONS(model_type=...) AS SELECT ...
Eval: ML.EVALUATE(MODEL ..., (SELECT ...))
Predict: ML.PREDICT(MODEL ..., (SELECT ...))
Security:
Column masking: CASE WHEN SESSION_USER() IN (...) THEN ... END
Row filtering: WHERE in view definition
Authorized view: grant view access to base table