SQL Advanced Cheatsheet — Window Functions

Window functions are similar to aggregate functions in that they compute aggregate values by performing calculations across a set of rows related to the current row. Unlike aggregate functions, using window functions does not group the result set (into a single output row) but instead retains the same number of rows as the input.

Does this sound confusing? Let’s dissect window functions by going over the basic structures and a few examples.

Window Function Basics

The above query sums duration seconds over the entire result set, in order by start time. Let’s look at the basic syntax of window functions:

  • SUM(duration) acts as the aggregation. You can also use COUNT, AVG, MIN, or any other aggregate function.
  • OVER is the key phrase that designates this query as a window function.
Image for post
Image for post

Suppose we’d like to narrow the window function from the whole dataset to individual groups, we can use PARTITION BY, which is similar to GROUP BY:

Now, the query is grouped by start_terminal and within each start_terminal value, it is ordered by start_time.

Image for post
Image for post
Image for post
Image for post

Notice that running_total (summing duration seconds) starts over when we reach a new start_terminal. This happens when we group using PARTITION BY — all aggregate functions are performed within individual groups. In addition, ORDER BY treats every partition separately.

Essentially, ORDER BY and PARTITION BY defines the “window” — the ordered subset of data in which aggregate calculations are made.

ROW_NUMBER

ROW_NUMBER() returns the number of a given row. Suppose we’re interested in ranking the highest-paid employees in each department.

RANK and DENSE_RANK

We can also use RANK() or DENSE_RANK() to display rankings within individual groups. Unlike ROW_NUMBER(), both RANK() and DENSE_RANK() will display identical rankings in the case of ties. Let’s return to the earlier bike share example. Suppose we want to return the start time rankings (in ascending chronological order) within each start terminal:

We will use ROW_NUMBER(), RANK(), and DENSE_RANK() in the above query to compare the differences:

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Rows 4 and 5 have identical start_time

In this case, the 4th and 5th rows have the same value. ROW_NUMBER() assigns a distinct value to each row, while the other two give the identical rows the same rank. However, RANK() will skip the duplicate rank (#5 in this case is skipped due to a tie for #4) while DENSE_RANK() will not skip any ranks.

LAG and LEAD

LAG() pulls from previous rows and LEAD() pulls from the following rows. This will be especially useful if you want to compute the differences between rows.

Image for post
Image for post
  • LAG() outputs the previous row, so the first row of LAG() will be null.
  • LEAD() outputs the subsequent row, so the last row of LEAD() will be null.
  • Here, we calculate the difference as the current row minus the previous row (lag) duration. Thus, the first row of difference will also be null because of LAG().

NTILE

NTILE() will output what percentile a given row falls under (more specifically NTILE(100)). You could also specify quartile using NTILE(4).

Suppose we wanted to find each user’s email activity percentile, which is defined by the total number of emails sent:

  • We use ORDER BY to display the highest number of emails first.
  • NTILE(100) is used to return the percentile in which the given row of email count falls under.

M.S.E Data Science @Johns Hopkins University with a B.S. in Applied Mathematics. Previously @MongoDB, current Data Science Intern @EA

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store