
# Introduction
Focusing solely on SELECT, WHERE, and GROUP BY is sufficient for primary aggregation, however many actual analytical duties require patterns that transcend easy queries. Examples embrace detecting consecutive exercise streaks, segmenting prospects by spend tier, smoothing noisy time-series information, or tracing plan improve paths throughout rows.
This text walks by means of 7 sensible SQL patterns past the fundamentals, specializing in strategies that resolve actual analytical issues.
# Setting Up the Dataset
We’ll use a pattern buyer transactions desk from a fictional subscription software program as a service (SaaS) firm:
CREATE TABLE transactions (
transaction_id SERIAL PRIMARY KEY,
customer_id INT,
plan_type VARCHAR(20), -- 'starter', 'professional', 'enterprise'
quantity NUMERIC(10,2),
standing VARCHAR(20), -- 'accomplished', 'refunded', 'failed'
created_at TIMESTAMP
);
The total dataset of 36 transactions throughout 7 prospects, spanning September 2023 by means of June 2024, is on the market in seed.sql. Run it earlier than you progress on to the queries.
# 1. Measuring Time Between Occasions with LAG()
LAG() and LEAD() allow you to entry a earlier or subsequent row’s worth and not using a self-join. They’re notably helpful for calculating gaps between occasions like renewal cadence, churn alerts, and re-engagement delays.
Job: Calculate what number of days elapsed between every buyer’s successive accomplished transactions.
SELECT
customer_id,
created_at,
LAG(created_at) OVER (
PARTITION BY customer_id
ORDER BY created_at
) AS previous_transaction_at,
ROUND(
EXTRACT(EPOCH FROM (
created_at - LAG(created_at) OVER (
PARTITION BY customer_id
ORDER BY created_at
)
)) / 86400
) AS days_since_last
FROM transactions
WHERE standing="accomplished"
ORDER BY customer_id, created_at;
Output (truncated):
customer_id | created_at | previous_transaction_at | days_since_last
-------------+---------------------+-------------------------+-----------------
3317 | 2024-01-03 11:02:00 | |
3317 | 2024-03-15 10:45:00 | 2024-01-03 11:02:00 | 72
3317 | 2024-05-22 09:30:00 | 2024-03-15 10:45:00 | 68
4482 | 2023-09-10 09:00:00 | |
4482 | 2023-10-10 09:00:00 | 2023-09-10 09:00:00 | 30
4482 | 2023-11-10 09:14:00 | 2023-10-10 09:00:00 | 31
4482 | 2024-01-03 09:14:00 | 2023-11-10 09:14:00 | 54
4482 | 2024-03-03 08:20:00 | 2024-01-03 09:14:00 | 60
4482 | 2024-04-03 10:00:00 | 2024-03-03 08:20:00 | 31
4482 | 2024-05-01 11:00:00 | 2024-04-03 10:00:00 | 28
...
7891 | 2024-02-01 09:00:00 | |
7891 | 2024-04-01 09:00:00 | 2024-02-01 09:00:00 | 60
7891 | 2024-05-15 09:00:00 | 2024-04-01 09:00:00 | 44
8810 | 2024-01-05 12:00:00 | |
8810 | 2024-02-05 12:00:00 | 2024-01-05 12:00:00 | 31
8810 | 2024-04-05 12:00:00 | 2024-02-05 12:00:00 | 60
(29 rows)
The primary row per buyer all the time has NULL for each columns — there is not any prior occasion to reference. EXTRACT(EPOCH ...) converts the timestamp interval to seconds; dividing by 86400 provides days.
LEAD() works the identical method however seems to be ahead as an alternative of backward, making it helpful for calculating time-to-next-renewal or flagging the final transaction earlier than churn.
# 2. Evaluating a Row to Different Rows within the Identical Desk with a Self-Be a part of
A self-join relates rows inside the identical desk to one another. It is the fitting instrument when you want to evaluate two occasions for a similar entity throughout time — upgrades, downgrades, re-activations, or any earlier than/after sample.
Job: Discover prospects who upgraded from starter to professional (or professional to enterprise) at any level.
SELECT DISTINCT t1.customer_id
FROM transactions t1
JOIN transactions t2
ON t1.customer_id = t2.customer_id
AND t1.plan_type="starter"
AND t2.plan_type="professional"
AND t2.created_at > t1.created_at
WHERE t1.standing="accomplished"
AND t2.standing="accomplished"
ORDER BY t1.customer_id;
Output:
customer_id
-------------
4482
6204
7891
(3 rows)
The desk is aliased twice (t1, t2) so every alias can symbolize a special time limit for a similar buyer. The situation t2.created_at > t1.created_at enforces temporal order — with out it, you’d match prospects who merely had each plan varieties in any order, together with the flawed one. DISTINCT collapses circumstances the place a buyer had a number of starter transactions earlier than upgrading, which might in any other case produce duplicate rows.
This identical construction works for detecting downgrades, discovering prospects who churned and got here again, or evaluating any two states that have to be ordered by time.
# 3. Deciding on the High Row per Group with ROW_NUMBER()
While you want the top-N rows per class — highest transaction per buyer, most up-to-date occasion per account, first buy per cohort — ROW_NUMBER() inside a standard desk expression (CTE) is the usual method.
Job: Get every buyer’s single highest accomplished transaction.
WITH ranked AS (
SELECT
customer_id,
transaction_id,
quantity,
plan_type,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY quantity DESC, created_at DESC
) AS rn
FROM transactions
WHERE standing="accomplished"
)
SELECT customer_id, transaction_id, quantity, plan_type
FROM ranked
WHERE rn = 1
ORDER BY customer_id;
Output:
customer_id | transaction_id | quantity | plan_type
-------------+----------------+--------+------------
3317 | 12 | 19.00 | starter
4482 | 8 | 299.00 | enterprise
5901 | 19 | 299.00 | enterprise
6103 | 25 | 299.00 | enterprise
6204 | 28 | 79.00 | professional
7891 | 32 | 79.00 | professional
8810 | 36 | 79.00 | professional
(7 rows)
ROW_NUMBER() assigns 1 to the row that kinds first inside every partition. The outer question then filters to solely these rows. The secondary kind on created_at DESC acts as a tiebreaker; when two transactions have the identical quantity, the newer one wins.
In order for you ties included slightly than damaged, swap ROW_NUMBER() for RANK(). RANK() assigns the identical quantity to tied rows and skips the subsequent rank (1, 1, 3), whereas DENSE_RANK() does the identical with out skipping (1, 1, 2).
# 4. Segmenting Clients by Spend with NTILE(n)
NTILE(n) divides ordered rows into n roughly equal buckets and assigns every row a bucket quantity. It is the fitting instrument for buyer tiering, spend quartiles, or constructing cohorts for A/B evaluation with out hardcoding thresholds.
Job: Rank prospects into spend quartiles based mostly on their whole accomplished transaction worth.
WITH customer_spend AS (
SELECT
customer_id,
SUM(quantity) AS total_spend,
COUNT(*) AS total_transactions
FROM transactions
WHERE standing="accomplished"
GROUP BY customer_id
)
SELECT
customer_id,
total_spend,
total_transactions,
NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM customer_spend
ORDER BY total_spend DESC;
Output:
customer_id | total_spend | total_transactions | spend_quartile
-------------+-------------+--------------------+----------------
5901 | 1495.00 | 5 | 4
6103 | 835.00 | 5 | 3
4482 | 653.00 | 7 | 3
8810 | 237.00 | 3 | 2
6204 | 177.00 | 3 | 2
7891 | 177.00 | 3 | 1
3317 | 57.00 | 3 | 1
(7 rows)
Quartile 4 is your highest spenders; quartile 1 is your lowest. NTILE() would not hardcode spend thresholds, so the buckets recalibrate mechanically as new prospects are added. This makes it extra strong than static cutoffs like CASE WHEN total_spend > 500.
# 5. Smoothing Noisy Information with a Rolling Window
A rolling (or transferring) common smooths out month-to-month volatility, making traits in time-series information a lot simpler to learn. Window capabilities with an express ROWS BETWEEN body offer you exact management over what number of durations to incorporate.
Job: Calculate a 3-month rolling common of month-to-month income to clean out noise.
WITH month-to-month AS (
SELECT
DATE_TRUNC('month', created_at)::DATE AS month,
SUM(quantity) AS monthly_revenue
FROM transactions
WHERE standing="accomplished"
GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
month,
monthly_revenue,
ROUND(AVG(monthly_revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2) AS revenue_3mo_avg
FROM month-to-month
ORDER BY month;
Output:
month | monthly_revenue | revenue_3mo_avg
-------------+-----------------+-----------------
2023-09-01 | 19.00 | 19.00
2023-10-01 | 19.00 | 19.00
2023-11-01 | 79.00 | 39.00
2024-01-01 | 275.00 | 124.33
2024-02-01 | 476.00 | 276.67
2024-03-01 | 555.00 | 435.33
2024-04-01 | 835.00 | 622.00
2024-05-01 | 775.00 | 721.67
2024-06-01 | 598.00 | 736.00
(9 rows)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW tells the window perform to have a look at the present row and the 2 rows earlier than it. The primary two rows use fewer inputs since there is not any prior historical past, in order that they act as a 1-month and 2-month common respectively.
Swap ROWS for RANGE if you wish to embrace all rows with the identical ORDER BY worth (helpful when a number of rows share a timestamp). For longer smoothing, change 2 PRECEDING to 5 PRECEDING for a 6-month window.
# 6. Aggregating Conditionally with FILTER
FILTER allows you to apply a WHERE situation to a particular combination with out splitting the question into a number of subqueries. The result’s a number of conditional aggregations in a single move over the info.
Job: Get whole income, refunds, and failed transaction counts damaged out by month — multi function row monthly.
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(quantity) FILTER (WHERE standing="accomplished") AS revenue_completed,
SUM(quantity) FILTER (WHERE standing="refunded") AS revenue_refunded,
COUNT(*) FILTER (WHERE standing="failed") AS failed_count
FROM transactions
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;
Output:
month | revenue_completed | revenue_refunded | failed_count
------------------------+-------------------+------------------+--------------
2023-09-01 00:00:00+00 | 19.00 | | 0
2023-10-01 00:00:00+00 | 19.00 | | 0
2023-11-01 00:00:00+00 | 79.00 | | 0
2024-01-01 00:00:00+00 | 275.00 | | 0
2024-02-01 00:00:00+00 | 476.00 | 79.00 | 1
2024-03-01 00:00:00+00 | 555.00 | 79.00 | 0
2024-04-01 00:00:00+00 | 835.00 | 299.00 | 0
2024-05-01 00:00:00+00 | 775.00 | | 1
2024-06-01 00:00:00+00 | 598.00 | | 2
(9 rows)
The choice to FILTER is three separate subqueries joined collectively — extra code, tougher to learn, and infrequently slower. Be aware that SUM with FILTER returns NULL (not zero) when no rows match in a given month, which is correct: there genuinely had been no refunds in these months. Wrap in COALESCE(..., 0) if you happen to desire zeros.
FILTER is commonplace SQL and works in PostgreSQL and BigQuery. In Snowflake and a few others, use SUM(CASE WHEN standing="accomplished" THEN quantity END) as an alternative.
# 7. Detecting Consecutive Exercise Streaks with Window Capabilities
Discovering unbroken sequences — lively months and not using a hole, consecutive days with transactions, subscription streaks — is without doubt one of the trickier SQL issues. The traditional answer makes use of a window perform to group rows into streaks and not using a recursive CTE.
The approach: assign every lively month a sequential row quantity inside its buyer partition. If the months are really consecutive, subtracting that row quantity from the month date produces the identical fixed worth for each month within the streak. A niche breaks the fixed.
Job: Discover every buyer’s consecutive lively months (months with at the least one accomplished transaction).
WITH monthly_activity AS (
SELECT
customer_id,
DATE_TRUNC('month', created_at)::DATE AS active_month
FROM transactions
WHERE standing="accomplished"
GROUP BY customer_id, DATE_TRUNC('month', created_at)
),
with_prev AS (
SELECT
customer_id,
active_month,
LAG(active_month) OVER (
PARTITION BY customer_id
ORDER BY active_month
) AS prev_month
FROM monthly_activity
),
streak_groups AS (
SELECT
customer_id,
active_month,
SUM(CASE WHEN active_month = prev_month + INTERVAL '1 month' THEN 0 ELSE 1 END)
OVER (PARTITION BY customer_id ORDER BY active_month) AS streak_id
FROM with_prev
),
streaks AS (
SELECT
customer_id,
streak_id,
MIN(active_month) AS streak_start,
MAX(active_month) AS streak_end,
COUNT(*) AS streak_length_months
FROM streak_groups
GROUP BY customer_id, streak_id
)
SELECT customer_id, streak_start, streak_end, streak_length_months
FROM streaks
ORDER BY customer_id, streak_start;
Output:
customer_id | streak_start | streak_end | streak_length_months
-------------+--------------+------------+----------------------
3317 | 2024-01-01 | 2024-01-01 | 1
3317 | 2024-03-01 | 2024-03-01 | 1
3317 | 2024-05-01 | 2024-05-01 | 1
4482 | 2023-09-01 | 2023-11-01 | 3
4482 | 2024-01-01 | 2024-01-01 | 1
4482 | 2024-03-01 | 2024-05-01 | 3
5901 | 2024-02-01 | 2024-06-01 | 5
6103 | 2024-01-01 | 2024-04-01 | 4
6103 | 2024-06-01 | 2024-06-01 | 1
6204 | 2024-01-01 | 2024-01-01 | 1
6204 | 2024-03-01 | 2024-03-01 | 1
6204 | 2024-05-01 | 2024-05-01 | 1
7891 | 2024-02-01 | 2024-02-01 | 1
7891 | 2024-04-01 | 2024-05-01 | 2
8810 | 2024-01-01 | 2024-02-01 | 2
8810 | 2024-04-01 | 2024-04-01 | 1
(16 rows)
# Fast Reference
These patterns work in commonplace SQL with out counting on database-specific options, and so they seem steadily in analytical workflows similar to retention evaluation, improve funnel monitoring, and income reporting.
| Tip | When to Use It |
|---|---|
LAG() / LEAD()
|
Time between occasions, earlier than/after comparisons per entity |
| Self-join | Detect transitions between states (upgrades, re-activations) |
ROW_NUMBER()
|
High-N rows per group, deduplication |
NTILE(n)
|
Buyer segmentation into spend/exercise tiers |
Rolling window (ROWS BETWEEN)
|
Clean noisy time-series, transferring averages |
FILTER
|
A number of conditional aggregations in a single question move |
| Consecutive streak detection | Subscription streaks, retention evaluation, session gaps |
When you’re comfy with them, many multi-step information transformations which are usually dealt with in Python might be expressed extra cleanly and effectively in a single SQL question.
Bala Priya C is a developer and technical author from India. She likes working on the intersection of math, programming, information science, and content material creation. Her areas of curiosity and experience embrace DevOps, information science, and pure language processing. She enjoys studying, writing, coding, and occasional! Presently, she’s engaged on studying and sharing her information with the developer group by authoring tutorials, how-to guides, opinion items, and extra. Bala additionally creates partaking useful resource overviews and coding tutorials.
