Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

4. Aggregation Analytical Functions


PDF Slides


1. Title & Short Summary

Aggregate & Window (Analytical) Functions in SQL

This lesson shows how to:

  • Summarize data with aggregate functions (COUNT, SUM, AVG, MIN, MAX).
  • Use GROUP BY to turn detailed rows into summary rows.
  • Use window (analytic) functions to compute aggregates, rankings, and comparisons without losing row-level detail.

2. Big Picture – Where This Fits

In SQL learning, this chapter sits in “data analysis & analytics”:

  • Earlier you learned basic SELECT–FROM–WHERE and maybe simple GROUP BY.

  • Here you go beyond simple aggregation:

    • Aggregate functions + GROUP BY → classic reports (“sales per customer”).
    • Window functions → advanced analytics: running totals, moving averages, top-N per group, month-over-month changes, percentiles, etc.

Think of it like this:

  • GROUP BY → compress rows into summaries (you lose row detail).
  • WINDOWadd extra calculated columns alongside the original rows. (Diagram on pages 6–8 of the slides illustrates this contrast. )

3. Key Concepts & Definitions

3.1 Core aggregates

ConceptMeaningNotes / Data types
COUNT(expr)Number of rows (or non-NULL values of expr)Works for any type. COUNT(*) counts rows including NULLs; COUNT(col) ignores NULLs.
SUM(expr)Sum of numeric valuesNumbers only. NULLs are ignored.
AVG(expr)Average of numeric valuesNumbers only. NULLs ignored by default. Use COALESCE to treat NULLs as 0.
MIN(expr)Smallest valueAny comparable type.
MAX(expr)Largest valueAny comparable type.

3.2 GROUP BY vs Window functions

FeatureGROUP BYWindow functions (OVER)
Output row countDecreases (rows collapsed per group)Same as input (row-level)
PurposeSimple aggregation reportsAggregation + detailed rows
Typical use“Total sales per product”“Total sales per product on each row”, running totals, rankings
Syntax ideaSELECT group_col, SUM(x) FROM t GROUP BY group_colSELECT x, SUM(x) OVER (PARTITION BY group_col) FROM t

(See the side-by-side diagrams on pages 6–10. )

3.3 Window syntax pieces

From the slides, a generic window expression looks like:

<window_function>(expression)
OVER (
    [PARTITION BY ...]
    [ORDER BY ...]
    [frame_clause]
)

Pieces:

  • Window function – what calculation to do (e.g. SUM, ROW_NUMBER, LAG).
  • Expression – argument(s) you pass, e.g. Sales, NTILE(3), LEAD(Sales, 2, 0).
  • PARTITION BY – splits rows into windows (groups); calculations restart in each partition.
  • ORDER BY – defines the order inside each partition (essential for running totals, rankings, LAG/LEAD).
  • FRAME clause (ROWS/RANGE ...) – defines the subset of rows in the partition the function can “see” (for running vs rolling totals).

3.4 Categories of window functions

Visual on pages 9, 15, 39, 41, 98 groups window functions into:

CategoryFunctionsWhat they do
Window aggregatesCOUNT, SUM, AVG, MIN, MAXOne summary value per row (over a window)
RankingROW_NUMBER, RANK, DENSE_RANK, NTILE, CUME_DIST, PERCENT_RANKAssign positions / buckets based on ordering
Value / analyticLAG, LEAD, FIRST_VALUE, LAST_VALUEPull a value from another row in the window

4. Syntax Cheat-Sheet

4.1 Plain aggregate functions

From 13_Aggregate_Functions.sql:

-- Count rows
SELECT COUNT(*) AS total_customers
FROM customers;

-- Total sales
SELECT SUM(sales) AS total_sales
FROM orders;

-- Average, max, min
SELECT AVG(sales) AS avg_sales FROM orders;
SELECT MAX(score) AS max_score FROM customers;
SELECT MIN(score) AS min_score FROM customers;

4.2 GROUP BY pattern

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(sales) AS total_sales,
    AVG(sales) AS avg_sales,
    MAX(sales) AS highest_sales,
    MIN(sales) AS lowest_sales
FROM orders
GROUP BY customer_id;

This matches the classic “aggregations per group” example.


4.3 Basic window function patterns

Total across all rows (no partition):

SUM(Sales) OVER () AS Total_Sales

Used in several scripts and in slides to show “entire dataset” calculations.

Per group / partition:

SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product

Multiple partition levels:

SUM(Sales) OVER () AS Total_Sales,
SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product,
SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS Sales_By_Product_Status

(from Window Basics Task 5).


4.4 ORDER and FRAME – running vs rolling

Running total by status (default frame):

SUM(Sales) OVER (
    PARTITION BY OrderStatus
    ORDER BY OrderDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS Total_Sales

Rolling window (current + next 2 rows):

SUM(Sales) OVER (
    PARTITION BY OrderStatus
    ORDER BY OrderDate
    ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) AS Total_Sales

(from Tasks 7–11 in Window Basics and the rolling/ running diagrams on pages 59–64).


4.5 Window aggregate samples

-- Count orders overall and per customer
SELECT
    OrderID,
    OrderDate,
    CustomerID,
    COUNT(*) OVER () AS TotalOrders,
    COUNT(*) OVER (PARTITION BY CustomerID) AS OrdersByCustomers
FROM Sales.Orders;
-- Moving average of Sales per product
SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    AVG(Sales) OVER (PARTITION BY ProductID) AS AvgByProduct,
    AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAvg
FROM Sales.Orders;

(from 15_Window_Aggregations.sql).


4.6 Ranking function patterns

-- Basic ranking three ways
SELECT
    OrderID,
    ProductID,
    Sales,
    ROW_NUMBER() OVER (ORDER BY Sales DESC)     AS SalesRank_Row,
    RANK()       OVER (ORDER BY Sales DESC)     AS SalesRank_Rank,
    DENSE_RANK() OVER (ORDER BY Sales DESC)     AS SalesRank_Dense
FROM Sales.Orders;
-- Top-1 sale per product (Top-N per group)
SELECT *
FROM (
    SELECT
        OrderID,
        ProductID,
        Sales,
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
    FROM Sales.Orders
) AS TopProductSales
WHERE RankByProduct = 1;
-- Buckets with NTILE
SELECT 
    OrderID,
    Sales,
    NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
FROM Sales.Orders;

(from 16_Window_Ranking.sql and ranking diagrams on pages 75–85).


4.7 Value function patterns (LEAD/LAG/FIRST/LAST)

-- Month-over-month change using LAG
SELECT
    * ,
    CurrentMonthSales - PreviousMonthSales AS MoM_Change,
    ROUND(
        CAST((CurrentMonthSales - PreviousMonthSales) AS FLOAT)
        / PreviousMonthSales * 100, 1
    ) AS MoM_Perc
FROM (
    SELECT
        MONTH(OrderDate) AS OrderMonth,
        SUM(Sales) AS CurrentMonthSales,
        LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
    FROM Sales.Orders
    GROUP BY MONTH(OrderDate)
) AS MonthlySales;
-- FIRST_VALUE / LAST_VALUE per product
SELECT
    OrderID,
    ProductID,
    Sales,
    FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS LowestSales,
    LAST_VALUE(Sales) OVER (
        PARTITION BY ProductID 
        ORDER BY Sales 
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS HighestSales,
    Sales - FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS SalesDifference
FROM Sales.Orders;

(from 17_Window_Value_Functions.sql and the value-function diagrams on pages 99–104, 106–110).


5. Worked Examples with Explanations

5.1 Aggregates vs GROUP BY

Example 1 – basic aggregates

SELECT COUNT(*) AS total_customers
FROM customers;
  • What it does: returns one row with the total count of customers.
  • Pattern: “Apply aggregate to whole table → 1-row summary”.

Example 2 – grouped aggregates

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(sales) AS total_sales
FROM orders
GROUP BY customer_id;
  • Each customer_id becomes one row.
  • Aggregates are computed within each group.
  • This matches the “squeeze many rows per customer into 1 summary row” diagram on page 6.

5.2 GROUP BY vs Window SUM

From Window Basics Tasks 1–4:

-- Task 1: plain total
SELECT SUM(Sales) AS Total_Sales
FROM Sales.Orders;
-- Task 2: total per product using GROUP BY
SELECT 
    ProductID,
    SUM(Sales) AS Total_Sales
FROM Sales.Orders
GROUP BY ProductID;
-- Task 3: total across all orders as a window (detail kept)
SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS Total_Sales
FROM Sales.Orders;
-- Task 4: per-product totals as a window
SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS Total_Sales,
    SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product
FROM Sales.Orders;

Key idea:

  • GROUP BY version returns one row per product (no order details).

  • Window version returns one row per order, plus columns showing:

    • total over all orders
    • total per product

This exactly matches the “Window Functions keep details” diagram on page 7.


5.3 Multi-level partitions

From Task 5:

SELECT
    OrderID,
    OrderDate,
    ProductID,
    OrderStatus,
    Sales,
    SUM(Sales) OVER () AS Total_Sales,
    SUM(Sales) OVER (PARTITION BY ProductID) AS Sales_By_Product,
    SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS Sales_By_Product_Status
FROM Sales.Orders;
  • SUM(Sales) OVER () – full dataset total.
  • PARTITION BY ProductID – total per product.
  • PARTITION BY ProductID, OrderStatus – total per product + status pair. This corresponds to the slide that shows combined columns in PARTITION BY.

5.4 Frames: running & rolling totals

Compare with the running/rolling Total diagrams on pages 59–64.

Running total example:

SELECT
    OrderID,
    OrderDate,
    ProductID,
    OrderStatus,
    Sales,
    SUM(Sales) OVER (
        PARTITION BY OrderStatus 
        ORDER BY OrderDate 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS Total_Sales
FROM Sales.Orders;
  • For each order status, sums from first order in time up to current order.

Rolling total example:

SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    AVG(Sales) OVER (
        PARTITION BY ProductID 
        ORDER BY OrderDate 
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) AS RollingAvg
FROM Sales.Orders;
  • For each product, takes the average of current order + next order only (2-row moving window).

5.5 Window aggregates in analysis use cases

  1. Counting & duplicate detection

    SELECT 
        * 
    FROM (
        SELECT 
            *,
            COUNT(*) OVER(PARTITION BY OrderID) AS CheckDuplicates
        FROM Sales.OrdersArchive
    ) t
    WHERE CheckDuplicates > 1;
    
    • COUNT(*) OVER(PARTITION BY OrderID) counts how many rows share the same OrderID.
    • Filtering for > 1 returns only duplicated orders.
  2. Percent of total

    SELECT
        OrderID,
        ProductID,
        Sales,
        SUM(Sales) OVER () AS TotalSales,
        ROUND(CAST(Sales AS FLOAT) / SUM(Sales) OVER () * 100, 2) AS PercentageOfTotal
    FROM Sales.Orders;
    
    • Each row knows its own share of the global total.
  3. Deviation from min / max

    SELECT
        OrderID,
        OrderDate,
        ProductID,
        Sales,
        MAX(Sales) OVER () AS HighestSales,
        MIN(Sales) OVER () AS LowestSales,
        Sales - MIN(Sales) OVER () AS DeviationFromMin,
        MAX(Sales) OVER () - Sales AS DeviationFromMax
    FROM Sales.Orders;
    
    • Lets you compare each row to global extremes quickly.

5.6 Ranking examples

From 16_Window_Ranking.sql and ranking slides.

  1. Top sale per product (Top-N per group)

    SELECT *
    FROM (
        SELECT
            OrderID,
            ProductID,
            Sales,
            ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
        FROM Sales.Orders
    ) AS TopProductSales
    WHERE RankByProduct = 1;
    
  2. Bottom 2 customers by total sales

    SELECT *
    FROM (
        SELECT
            CustomerID,
            SUM(Sales) AS TotalSales,
            ROW_NUMBER() OVER (ORDER BY SUM(Sales)) AS RankCustomers
        FROM Sales.Orders
        GROUP BY CustomerID
    ) AS BottomCustomerSales
    WHERE RankCustomers <= 2;
    
  3. Bucketization with NTILE

    SELECT
        OrderID,
        Sales,
        Buckets,
        CASE 
            WHEN Buckets = 1 THEN 'High'
            WHEN Buckets = 2 THEN 'Medium'
            WHEN Buckets = 3 THEN 'Low'
        END AS SalesSegmentations
    FROM (
        SELECT
            OrderID,
            Sales,
            NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
        FROM Sales.Orders
    ) AS SalesBuckets;
    
    • Similar to NTILE bucket diagrams on pages 83–85: rows are split into roughly equal groups; higher buckets get bigger groups first.
  4. CUME_DIST for top 40% of prices

    SELECT 
        Product,
        Price,
        DistRank,
        CONCAT(DistRank * 100, '%') AS DistRankPerc
    FROM (
        SELECT
            Product,
            Price,
            CUME_DIST() OVER (ORDER BY Price DESC) AS DistRank
        FROM Sales.Products
    ) AS PriceDistribution
    WHERE DistRank <= 0.4;
    
    • CUME_DIST = (# rows with value ≥ current when ordered DESC) / total rows, as in formulas on pages 89–91.

5.7 Value function examples

From 17_Window_Value_Functions.sql and LAG/LEAD diagrams on pages 102–110.

  1. Month-over-month sales change

    SELECT
        *,
        CurrentMonthSales - PreviousMonthSales AS MoM_Change,
        ROUND(
            CAST((CurrentMonthSales - PreviousMonthSales) AS FLOAT)
            / PreviousMonthSales * 100, 1
        ) AS MoM_Perc
    FROM (
        SELECT
            MONTH(OrderDate) AS OrderMonth,
            SUM(Sales) AS CurrentMonthSales,
            LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
        FROM Sales.Orders
        GROUP BY MONTH(OrderDate)
    ) AS MonthlySales;
    
    • LAG peeks at the previous month’s aggregated sales.
  2. Customer loyalty – avg days between orders

    SELECT
        CustomerID,
        AVG(DaysUntilNextOrder) AS AvgDays,
        RANK() OVER (ORDER BY COALESCE(AVG(DaysUntilNextOrder), 999999)) AS RankAvg
    FROM (
        SELECT
            OrderID,
            CustomerID,
            OrderDate AS CurrentOrder,
            LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS NextOrder,
            DATEDIFF(
                day,
                OrderDate,
                LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate)
            ) AS DaysUntilNextOrder
        FROM Sales.Orders
    ) AS CustomerOrdersWithNext
    GROUP BY CustomerID;
    
    • Shorter average gaps → more frequent, “loyal” customers.
  3. First & last value per product

    Already shown in syntax section; it matches the “timeline” diagrams where FIRST_VALUE picks the first sales value and LAST_VALUE the last one (with an explicit frame).


6. Common Mistakes & Tips

  1. Using window functions in WHERE or GROUP BY

    • Rule slide: “Window functions can only be used in SELECT or ORDER BY clauses.”
    • The script shows an invalid example: WHERE SUM(Sales) OVER (...) > 100. Instead, wrap the query and filter in an outer SELECT.
  2. Nesting window functions

    • Slides explicitly say: “Not allowed to nest window functions!” and show AVG( SUM(Sales) OVER() ) OVER(...) as invalid.
  3. Confusing COUNT(*) vs COUNT(col)

    • COUNT(*) / COUNT(1) counts rows including NULLs.
    • COUNT(col) ignores rows where col is NULL (example table with Gloves NULL only counted by COUNT(*) in the slides).
  4. Forgetting that AVG ignores NULLs

    • Slides and script show using COALESCE(Score,0) to include NULLs in average.
  5. Misunderstanding frames

    • Frame requires ORDER BY.
    • “Lower” boundary must not be after the “higher” boundary.
    • Default frame for ordered windows is effectively RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running from start to current).
  6. Assuming all window functions allow frames

    • Slides show a matrix:

      • Aggregate and value functions: frame optional, and especially recommended for LAST_VALUE.
      • Ranking functions: frame not allowed; they always consider the entire partition.
  7. Expecting ROW_NUMBER to handle ties “nicely”

    • ROW_NUMBER always gives unique numbers → ties are broken arbitrarily by the ORDER BY.
    • Use RANK (with gaps) or DENSE_RANK (no gaps) when ties should share the same rank, as highlighted in the ranking diagrams.
  8. Forgetting default offset / default value in LEAD/LAG

    • Offset default is 1.
    • Default value is NULL, but you can override it (slides show LEAD(Sales, 2, 10)).

7. Memory Boosters (Mnemonics & Hooks)

  1. “C-S-A-M-M” for aggregates

    • Count, Sum, Average, Min, Max.
    • Imagine squeezing an orange: you Count oranges, Sum the juice, find Average juice per orange, Min/Max juiciest and driest.
  2. “P-O-F inside OVER” – Window clause order

    • PARTITION BY → groups.
    • ORDER BY → sequence.
    • FRAME → slice of that sequence.
  3. “RowNumber, Rank, DenseRank – 1, gap, no-gap”

    • ROW_NUMBER1-2-3-4… always unique.
    • RANK1-2-2-4 (gap after tie).
    • DENSE_RANK1-2-2-3 (no gap).
  4. Running vs Rolling totals: “Snowball vs Window”

    • Running total = snowball rolling downhill, keeps growing (from first row to current).
    • Rolling total = fixed window sliding along (e.g., last 3 months).
  5. Value functions = “Time machine”

    • LAG = travel to previous row.
    • LEAD = jump to future row.
    • FIRST_VALUE = go to the start; LAST_VALUE = go to the end of the partition.

8. Active-Recall Questions & Mini-Exercises

Use these as flashcards. Try to answer without looking back.

8.1 Aggregates & GROUP BY

  1. What is the difference between COUNT(*) and COUNT(Sales) when some Sales values are NULL?
  2. Write a query to show total sales per customer using GROUP BY.
  3. Why does GROUP BY usually produce fewer rows than the original table?

8.2 Window basics & PARTITION / ORDER / FRAME

  1. Rewrite “total sales per product” using a window function instead of GROUP BY so that each order row still appears.
  2. What does PARTITION BY ProductID, OrderStatus do compared to PARTITION BY ProductID alone?
  3. What is the effect of ORDER BY inside OVER when computing a running total?

8.3 Window aggregates

  1. How would you compute a moving average of Sales per product, ordered by OrderDate?
  2. How can you use window functions to detect duplicate OrderID values?
  3. How could you calculate, for each order, the percentage of total sales?

8.4 Ranking functions

  1. When should you choose ROW_NUMBER vs RANK vs DENSE_RANK? Give a scenario for each.
  2. How would you find the top 1 order per product using ROW_NUMBER?
  3. Explain what NTILE(4) OVER (ORDER BY Sales DESC) does.
  4. In the CUME_DIST price example, what does a value of 0.4 mean?

8.5 Value functions

  1. How can LAG be used to calculate month-over-month sales change?
  2. How would you compute the average days between orders per customer using LEAD?
  3. Why does LAST_VALUE often require an explicit frame (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)?

9. Ultra-Short Recap (Key Takeaways)

  • Aggregate functions (COUNT, SUM, AVG, MIN, MAX) summarize multiple rows into one value.
  • GROUP BY collapses rows into summary rows; window functions keep all rows and just add extra columns.
  • A window function is func(expr) OVER (PARTITION BY ... ORDER BY ... frame); remember P-O-F.
  • Use window aggregates for running totals, rolling averages, percent of total, min/max comparisons, and duplicate checks.
  • Ranking functions (ROW_NUMBER, RANK, DENSE_RANK, NTILE, CUME_DIST, PERCENT_RANK) let you do top-N, segmentation, and distribution analysis.
  • Value functions (LAG, LEAD, FIRST_VALUE, LAST_VALUE) are perfect for time-based and before/after comparisons.
  • Window functions cannot be used in WHERE and cannot be nested. Put them in a subquery if you need to filter by them.
  • COUNT(*) vs COUNT(col) and default AVG behaviour with NULLs are classic gotchas—always check how NULLs are treated.