SQL Advanced Cheatsheet — Window Functions

Window functions are similar to aggregate functions in that they compute aggregate values by performing calculations across a set of rows related to the current row. Unlike aggregate functions, using window functions does not group the result set (into a single output row) but instead retains the same number of rows as the input.

Does this sound confusing? Let’s dissect window functions by going over the basic structures and a few examples.

Window Function Basics

The above query sums duration seconds over the entire result set, in order by start time. Let’s look at the basic syntax of window functions:

  • SUM(duration) acts as the aggregation. You can also use COUNT, AVG, MIN, or any other aggregate function.
  • OVER is the key phrase that designates this query as a window function.

Suppose we’d like to narrow the window function from the whole dataset to individual groups, we can use PARTITION BY, which is similar to GROUP BY:

Now, the query is grouped by start_terminal and within each start_terminal value, it is ordered by start_time.

Notice that running_total (summing duration seconds) starts over when we reach a new start_terminal. This happens when we group using PARTITION BY — all aggregate functions are performed within individual groups. In addition, ORDER BY treats every partition separately.

Essentially, ORDER BY and PARTITION BY defines the “window” — the ordered subset of data in which aggregate calculations are made.


ROW_NUMBER() returns the number of a given row. Suppose we’re interested in ranking the highest-paid employees in each department.


We can also use RANK() or DENSE_RANK() to display rankings within individual groups. Unlike ROW_NUMBER(), both RANK() and DENSE_RANK() will display identical rankings in the case of ties. Let’s return to the earlier bike share example. Suppose we want to return the start time rankings (in ascending chronological order) within each start terminal:

We will use ROW_NUMBER(), RANK(), and DENSE_RANK() in the above query to compare the differences:

In this case, the 4th and 5th rows have the same value. ROW_NUMBER() assigns a distinct value to each row, while the other two give the identical rows the same rank. However, RANK() will skip the duplicate rank (#5 in this case is skipped due to a tie for #4) while DENSE_RANK() will not skip any ranks.


LAG() pulls from previous rows and LEAD() pulls from the following rows. This will be especially useful if you want to compute the differences between rows.

  • LAG() outputs the previous row, so the first row of LAG() will be null.
  • LEAD() outputs the subsequent row, so the last row of LEAD() will be null.
  • Here, we calculate the difference as the current row minus the previous row (lag) duration. Thus, the first row of difference will also be null because of LAG().


NTILE() will output what percentile a given row falls under (more specifically NTILE(100)). You could also specify quartile using NTILE(4).

Suppose we wanted to find each user’s email activity percentile, which is defined by the total number of emails sent:

  • We use ORDER BY to display the highest number of emails first.
  • NTILE(100) is used to return the percentile in which the given row of email count falls under.

M.S.E Data Science @Johns Hopkins University with a B.S. in Applied Mathematics. Previously @MongoDB, current Data Science Intern @EA