4. Aggregation Analytical Functions

PDF Slides

Open PDF

1. Title & Short Summary

Aggregate & Window (Analytical) Functions in SQL

This lesson shows how to:

Summarize data with aggregate functions (COUNT, SUM, AVG, MIN, MAX).
Use GROUP BY to turn detailed rows into summary rows.
Use window (analytic) functions to compute aggregates, rankings, and comparisons without losing row-level detail.

2. Big Picture – Where This Fits

In SQL learning, this chapter sits in “data analysis & analytics”:

Earlier you learned basic SELECT–FROM–WHERE and maybe simple GROUP BY.
Here you go beyond simple aggregation:
- Aggregate functions + GROUP BY → classic reports (“sales per customer”).
- Window functions → advanced analytics: running totals, moving averages, top-N per group, month-over-month changes, percentiles, etc.

Think of it like this:

GROUP BY → compress rows into summaries (you lose row detail).
WINDOW → add extra calculated columns alongside the original rows. (Diagram on pages 6–8 of the slides illustrates this contrast. )

3. Key Concepts & Definitions

3.1 Core aggregates

Concept	Meaning	Notes / Data types
`COUNT(expr)`	Number of rows (or non-NULL values of `expr`)	Works for any type. `COUNT(*)` counts rows including NULLs; `COUNT(col)` ignores NULLs.
`SUM(expr)`	Sum of numeric values	Numbers only. NULLs are ignored.
`AVG(expr)`	Average of numeric values	Numbers only. NULLs ignored by default. Use `COALESCE` to treat NULLs as 0.
`MIN(expr)`	Smallest value	Any comparable type.
`MAX(expr)`	Largest value	Any comparable type.

3.2 GROUP BY vs Window functions

Feature	GROUP BY	Window functions (OVER)
Output row count	Decreases (rows collapsed per group)	Same as input (row-level)
Purpose	Simple aggregation reports	Aggregation + detailed rows
Typical use	“Total sales per product”	“Total sales per product on each row”, running totals, rankings
Syntax idea	`SELECT group_col, SUM(x) FROM t GROUP BY group_col`	`SELECT x, SUM(x) OVER (PARTITION BY group_col) FROM t`

(See the side-by-side diagrams on pages 6–10. )

3.3 Window syntax pieces

From the slides, a generic window expression looks like:

<window_function>(expression)
OVER (
    [PARTITION BY ...]
    [ORDER BY ...]
    [frame_clause]
)

Pieces:

Window function – what calculation to do (e.g. SUM, ROW_NUMBER, LAG).
Expression – argument(s) you pass, e.g. Sales, NTILE(3), LEAD(Sales, 2, 0).
PARTITION BY – splits rows into windows (groups); calculations restart in each partition.
ORDER BY – defines the order inside each partition (essential for running totals, rankings, LAG/LEAD).
FRAME clause (ROWS/RANGE ...) – defines the subset of rows in the partition the function can “see” (for running vs rolling totals).

3.4 Categories of window functions

Visual on pages 9, 15, 39, 41, 98 groups window functions into:

Category	Functions	What they do
Window aggregates	`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`	One summary value per row (over a window)
Ranking	`ROW_NUMBER`, `RANK`, `DENSE_RANK`, `NTILE`, `CUME_DIST`, `PERCENT_RANK`	Assign positions / buckets based on ordering
Value / analytic	`LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`	Pull a value from another row in the window

4. Syntax Cheat-Sheet

4.1 Plain aggregate functions

From 13_Aggregate_Functions.sql:

-- Count rows
SELECT COUNT(*) AS total_customers
FROM customers;

-- Total sales
SELECT SUM(sales) AS total_sales
FROM orders;

-- Average, max, min
SELECT AVG(sales) AS avg_sales FROM orders;
SELECT MAX(score) AS max_score FROM customers;
SELECT MIN(score) AS min_score FROM customers;

4.2 GROUP BY pattern

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(sales) AS total_sales,
    AVG(sales) AS avg_sales,
    MAX(sales) AS highest_sales,
    MIN(sales) AS lowest_sales
FROM orders
GROUP BY customer_id;

This matches the classic “aggregations per group” example.

4.3 Basic window function patterns

Total across all rows (no partition):

SUM(Sales) OVER () AS Total_Sales

Used in several scripts and in slides to show “entire dataset” calculations.

Per group / partition:

SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product

Multiple partition levels:

SUM(Sales) OVER () AS Total_Sales,
SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product,
SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS Sales_By_Product_Status

(from Window Basics Task 5).

4.4 ORDER and FRAME – running vs rolling

Running total by status (default frame):

SUM(Sales) OVER (
    PARTITION BY OrderStatus
    ORDER BY OrderDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS Total_Sales

Rolling window (current + next 2 rows):

SUM(Sales) OVER (
    PARTITION BY OrderStatus
    ORDER BY OrderDate
    ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) AS Total_Sales

(from Tasks 7–11 in Window Basics and the rolling/ running diagrams on pages 59–64).

4.5 Window aggregate samples

-- Count orders overall and per customer
SELECT
    OrderID,
    OrderDate,
    CustomerID,
    COUNT(*) OVER () AS TotalOrders,
    COUNT(*) OVER (PARTITION BY CustomerID) AS OrdersByCustomers
FROM Sales.Orders;

-- Moving average of Sales per product
SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    AVG(Sales) OVER (PARTITION BY ProductID) AS AvgByProduct,
    AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAvg
FROM Sales.Orders;

(from 15_Window_Aggregations.sql).

4.6 Ranking function patterns

-- Basic ranking three ways
SELECT
    OrderID,
    ProductID,
    Sales,
    ROW_NUMBER() OVER (ORDER BY Sales DESC)     AS SalesRank_Row,
    RANK()       OVER (ORDER BY Sales DESC)     AS SalesRank_Rank,
    DENSE_RANK() OVER (ORDER BY Sales DESC)     AS SalesRank_Dense
FROM Sales.Orders;

-- Top-1 sale per product (Top-N per group)
SELECT *
FROM (
    SELECT
        OrderID,
        ProductID,
        Sales,
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
    FROM Sales.Orders
) AS TopProductSales
WHERE RankByProduct = 1;

-- Buckets with NTILE
SELECT 
    OrderID,
    Sales,
    NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
FROM Sales.Orders;

(from 16_Window_Ranking.sql and ranking diagrams on pages 75–85).

4.7 Value function patterns (LEAD/LAG/FIRST/LAST)

-- Month-over-month change using LAG
SELECT
    * ,
    CurrentMonthSales - PreviousMonthSales AS MoM_Change,
    ROUND(
        CAST((CurrentMonthSales - PreviousMonthSales) AS FLOAT)
        / PreviousMonthSales * 100, 1
    ) AS MoM_Perc
FROM (
    SELECT
        MONTH(OrderDate) AS OrderMonth,
        SUM(Sales) AS CurrentMonthSales,
        LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
    FROM Sales.Orders
    GROUP BY MONTH(OrderDate)
) AS MonthlySales;

-- FIRST_VALUE / LAST_VALUE per product
SELECT
    OrderID,
    ProductID,
    Sales,
    FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS LowestSales,
    LAST_VALUE(Sales) OVER (
        PARTITION BY ProductID 
        ORDER BY Sales 
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS HighestSales,
    Sales - FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS SalesDifference
FROM Sales.Orders;

(from 17_Window_Value_Functions.sql and the value-function diagrams on pages 99–104, 106–110).

5. Worked Examples with Explanations

5.1 Aggregates vs GROUP BY

Example 1 – basic aggregates

SELECT COUNT(*) AS total_customers
FROM customers;

What it does: returns one row with the total count of customers.
Pattern: “Apply aggregate to whole table → 1-row summary”.

Example 2 – grouped aggregates

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(sales) AS total_sales
FROM orders
GROUP BY customer_id;

Each customer_id becomes one row.
Aggregates are computed within each group.
This matches the “squeeze many rows per customer into 1 summary row” diagram on page 6.

5.2 GROUP BY vs Window SUM

From Window Basics Tasks 1–4:

-- Task 1: plain total
SELECT SUM(Sales) AS Total_Sales
FROM Sales.Orders;

-- Task 2: total per product using GROUP BY
SELECT 
    ProductID,
    SUM(Sales) AS Total_Sales
FROM Sales.Orders
GROUP BY ProductID;

-- Task 3: total across all orders as a window (detail kept)
SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS Total_Sales
FROM Sales.Orders;

-- Task 4: per-product totals as a window
SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS Total_Sales,
    SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product
FROM Sales.Orders;

Key idea:

GROUP BY version returns one row per product (no order details).
Window version returns one row per order, plus columns showing:
- total over all orders
- total per product

This exactly matches the “Window Functions keep details” diagram on page 7.

5.3 Multi-level partitions

From Task 5:

SELECT
    OrderID,
    OrderDate,
    ProductID,
    OrderStatus,
    Sales,
    SUM(Sales) OVER () AS Total_Sales,
    SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product,
    SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS Sales_By_Product_Status
FROM Sales.Orders;

SUM(Sales) OVER () – full dataset total.
PARTITION BY ProductID – total per product.
PARTITION BY ProductID, OrderStatus – total per product + status pair. This corresponds to the slide that shows combined columns in PARTITION BY.

5.4 Frames: running & rolling totals

Compare with the running/rolling Total diagrams on pages 59–64.

Running total example:

SELECT
    OrderID,
    OrderDate,
    ProductID,
    OrderStatus,
    Sales,
    SUM(Sales) OVER (
        PARTITION BY OrderStatus 
        ORDER BY OrderDate 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS Total_Sales
FROM Sales.Orders;

For each order status, sums from first order in time up to current order.

Rolling total example:

SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    AVG(Sales) OVER (
        PARTITION BY ProductID 
        ORDER BY OrderDate 
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) AS RollingAvg
FROM Sales.Orders;

For each product, takes the average of current order + next order only (2-row moving window).

5.5 Window aggregates in analysis use cases

Counting & duplicate detection

SELECT 
    * 
FROM (
    SELECT 
        *,
        COUNT(*) OVER(PARTITION BY OrderID) AS CheckDuplicates
    FROM Sales.OrdersArchive
) t
WHERE CheckDuplicates > 1;

COUNT(*) OVER(PARTITION BY OrderID) counts how many rows share the same OrderID.
Filtering for > 1 returns only duplicated orders.

Percent of total

SELECT
    OrderID,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS TotalSales,
    ROUND(CAST(Sales AS FLOAT) / SUM(Sales) OVER () * 100, 2) AS PercentageOfTotal
FROM Sales.Orders;

Each row knows its own share of the global total.

Deviation from min / max

SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    MAX(Sales) OVER () AS HighestSales,
    MIN(Sales) OVER () AS LowestSales,
    Sales - MIN(Sales) OVER () AS DeviationFromMin,
    MAX(Sales) OVER () - Sales AS DeviationFromMax
FROM Sales.Orders;

Lets you compare each row to global extremes quickly.

5.6 Ranking examples

From 16_Window_Ranking.sql and ranking slides.

Top sale per product (Top-N per group)

SELECT *
FROM (
    SELECT
        OrderID,
        ProductID,
        Sales,
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
    FROM Sales.Orders
) AS TopProductSales
WHERE RankByProduct = 1;

Bottom 2 customers by total sales

SELECT *
FROM (
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales,
        ROW_NUMBER() OVER (ORDER BY SUM(Sales)) AS RankCustomers
    FROM Sales.Orders
    GROUP BY CustomerID
) AS BottomCustomerSales
WHERE RankCustomers <= 2;

Bucketization with NTILE

SELECT
    OrderID,
    Sales,
    Buckets,
    CASE 
        WHEN Buckets = 1 THEN 'High'
        WHEN Buckets = 2 THEN 'Medium'
        WHEN Buckets = 3 THEN 'Low'
    END AS SalesSegmentations
FROM (
    SELECT
        OrderID,
        Sales,
        NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
    FROM Sales.Orders
) AS SalesBuckets;

Similar to NTILE bucket diagrams on pages 83–85: rows are split into roughly equal groups; higher buckets get bigger groups first.

CUME_DIST for top 40% of prices

SELECT 
    Product,
    Price,
    DistRank,
    CONCAT(DistRank * 100, '%') AS DistRankPerc
FROM (
    SELECT
        Product,
        Price,
        CUME_DIST() OVER (ORDER BY Price DESC) AS DistRank
    FROM Sales.Products
) AS PriceDistribution
WHERE DistRank <= 0.4;

CUME_DIST = (# rows with value ≥ current when ordered DESC) / total rows, as in formulas on pages 89–91.

5.7 Value function examples

From 17_Window_Value_Functions.sql and LAG/LEAD diagrams on pages 102–110.

Month-over-month sales change

SELECT
    *,
    CurrentMonthSales - PreviousMonthSales AS MoM_Change,
    ROUND(
        CAST((CurrentMonthSales - PreviousMonthSales) AS FLOAT)
        / PreviousMonthSales * 100, 1
    ) AS MoM_Perc
FROM (
    SELECT
        MONTH(OrderDate) AS OrderMonth,
        SUM(Sales) AS CurrentMonthSales,
        LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
    FROM Sales.Orders
    GROUP BY MONTH(OrderDate)
) AS MonthlySales;

LAG peeks at the previous month’s aggregated sales.

Customer loyalty – avg days between orders

SELECT
    CustomerID,
    AVG(DaysUntilNextOrder) AS AvgDays,
    RANK() OVER (ORDER BY COALESCE(AVG(DaysUntilNextOrder), 999999)) AS RankAvg
FROM (
    SELECT
        OrderID,
        CustomerID,
        OrderDate AS CurrentOrder,
        LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS NextOrder,
        DATEDIFF(
            day,
            OrderDate,
            LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate)
        ) AS DaysUntilNextOrder
    FROM Sales.Orders
) AS CustomerOrdersWithNext
GROUP BY CustomerID;

Shorter average gaps → more frequent, “loyal” customers.

First & last value per product

Already shown in syntax section; it matches the “timeline” diagrams where FIRST_VALUE picks the first sales value and LAST_VALUE the last one (with an explicit frame).

6. Common Mistakes & Tips

Using window functions in WHERE or GROUP BY
- Rule slide: “Window functions can only be used in SELECT or ORDER BY clauses.”
- The script shows an invalid example: WHERE SUM(Sales) OVER (...) > 100. Instead, wrap the query and filter in an outer SELECT.
Nesting window functions
- Slides explicitly say: “Not allowed to nest window functions!” and show AVG( SUM(Sales) OVER() ) OVER(...) as invalid.
Confusing COUNT(*) vs COUNT(col)
- COUNT(*) / COUNT(1) counts rows including NULLs.
- COUNT(col) ignores rows where col is NULL (example table with Gloves NULL only counted by COUNT(*) in the slides).
Forgetting that AVG ignores NULLs
- Slides and script show using COALESCE(Score,0) to include NULLs in average.
Misunderstanding frames
- Frame requires ORDER BY.
- “Lower” boundary must not be after the “higher” boundary.
- Default frame for ordered windows is effectively RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running from start to current).
Assuming all window functions allow frames
- Slides show a matrix:
  - Aggregate and value functions: frame optional, and especially recommended for LAST_VALUE.
  - Ranking functions: frame not allowed; they always consider the entire partition.
Expecting ROW_NUMBER to handle ties “nicely”
- ROW_NUMBER always gives unique numbers → ties are broken arbitrarily by the ORDER BY.
- Use RANK (with gaps) or DENSE_RANK (no gaps) when ties should share the same rank, as highlighted in the ranking diagrams.
Forgetting default offset / default value in LEAD/LAG
- Offset default is 1.
- Default value is NULL, but you can override it (slides show LEAD(Sales, 2, 10)).

7. Memory Boosters (Mnemonics & Hooks)

“C-S-A-M-M” for aggregates
- Count, Sum, Average, Min, Max.
- Imagine squeezing an orange: you Count oranges, Sum the juice, find Average juice per orange, Min/Max juiciest and driest.
“P-O-F inside OVER” – Window clause order
- PARTITION BY → groups.
- ORDER BY → sequence.
- FRAME → slice of that sequence.
“RowNumber, Rank, DenseRank – 1, gap, no-gap”
- ROW_NUMBER → 1-2-3-4… always unique.
- RANK → 1-2-2-4 (gap after tie).
- DENSE_RANK → 1-2-2-3 (no gap).
Running vs Rolling totals: “Snowball vs Window”
- Running total = snowball rolling downhill, keeps growing (from first row to current).
- Rolling total = fixed window sliding along (e.g., last 3 months).
Value functions = “Time machine”
- LAG = travel to previous row.
- LEAD = jump to future row.
- FIRST_VALUE = go to the start; LAST_VALUE = go to the end of the partition.

8. Active-Recall Questions & Mini-Exercises

Use these as flashcards. Try to answer without looking back.

8.1 Aggregates & GROUP BY

What is the difference between COUNT(*) and COUNT(Sales) when some Sales values are NULL?
Write a query to show total sales per customer using GROUP BY.
Why does GROUP BY usually produce fewer rows than the original table?

8.2 Window basics & PARTITION / ORDER / FRAME

Rewrite “total sales per product” using a window function instead of GROUP BY so that each order row still appears.
What does PARTITION BY ProductID, OrderStatus do compared to PARTITION BY ProductID alone?
What is the effect of ORDER BY inside OVER when computing a running total?

8.3 Window aggregates

How would you compute a moving average of Sales per product, ordered by OrderDate?
How can you use window functions to detect duplicate OrderID values?
How could you calculate, for each order, the percentage of total sales?

8.4 Ranking functions

When should you choose ROW_NUMBER vs RANK vs DENSE_RANK? Give a scenario for each.
How would you find the top 1 order per product using ROW_NUMBER?
Explain what NTILE(4) OVER (ORDER BY Sales DESC) does.
In the CUME_DIST price example, what does a value of 0.4 mean?

8.5 Value functions

How can LAG be used to calculate month-over-month sales change?
How would you compute the average days between orders per customer using LEAD?
Why does LAST_VALUE often require an explicit frame (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)?

9. Ultra-Short Recap (Key Takeaways)

Aggregate functions (COUNT, SUM, AVG, MIN, MAX) summarize multiple rows into one value.
GROUP BY collapses rows into summary rows; window functions keep all rows and just add extra columns.
A window function is func(expr) OVER (PARTITION BY ... ORDER BY ... frame); remember P-O-F.
Use window aggregates for running totals, rolling averages, percent of total, min/max comparisons, and duplicate checks.
Ranking functions (ROW_NUMBER, RANK, DENSE_RANK, NTILE, CUME_DIST, PERCENT_RANK) let you do top-N, segmentation, and distribution analysis.
Value functions (LAG, LEAD, FIRST_VALUE, LAST_VALUE) are perfect for time-based and before/after comparisons.
Window functions cannot be used in WHERE and cannot be nested. Put them in a subquery if you need to filter by them.
COUNT(*) vs COUNT(col) and default AVG behaviour with NULLs are classic gotchas—always check how NULLs are treated.

Keyboard shortcuts

Note Book