SQL Intermediate Cheatsheet — Aggregate Functions

The intermediate SQL tutorial will go over 2 main areas: aggregate functions (aggregate data using summary statistics) and joins (combine 2 or more tables). I will cover various aggregate functions in this section, ranging from COUNT and SUM to GROUP BY and CASE WHEN. In the next part, I will also go over the various types of joins, such as INNER JOIN and LEFT JOIN.

These cheatsheets focus on first presenting the theories and then applying them to real-world examples to illustrate how the concepts work in practice.

Aggregate Functions

SQL is excellent at aggregating data similar to how you might in an Excel pivot table. Aggregate functions are used all the time, so it’s definitely important to get comfortable with them.

The specific functions I will cover are COUNT, SUM, DISTINCT, GROUP BY, HAVING, and CASE. There are also other basic aggregate functions such as MIN/MAX and AVG. Specifically, MIN(price), MAX(price), and AVG(price) will output the minimum, maximum, and average price from a given table.

COUNT

COUNT is the easiest aggregate function, which counts the number of rows in a particular column. The COUNT function can be used on both numerical and non-numerical (i.e. text) values.

SELECT COUNT(*)
FROM table_name

COUNT(*) will output a single number which equals the total number of rows in the entire table. (Note: using COUNT(1) has the same effect as COUNT(*) )

SELECT COUNT(user_ID)
FROM table_name

The query above returns the number of rows in which user_ID is not null.

SELECT COUNT(DISTINCT user_ID)
FROM table_name

This returns the total unique values of the user_ID column.

SUM

SUM totals the values in a given column, Unlike COUNT, SUM can only be applied to numerical columns. Another key difference between the two is the presence of null values. As SUM treats nulls as 0, you don’t need to worry as much about null values here as you would with COUNT.

The query below will calculate the sum of the volume column from a table with Apple stock prices.

SELECT SUM(volume)
FROM apple_stock_price

Note that aggregators like SUM only aggregate vertically. Calculation across rows can be done using simple arithmetic.

GROUP BY

GROUP BY comes in handy when you want to aggregate only a subset of the table by separating the data into groups, which can be aggregated independently. On the other hand, functions like COUNT and SUM have the commonality of aggregating across entire tables.

GROUP BY is often used with the aforementioned aggregate functions: COUNT, SUM, AVG, MIN, MAX.

SELECT year, month, SUM(volume)
FROM apple_stock_price
GROUP BY year, month
ORDER BY year, month

The above query will output the total number of Apple shares traded (grouped by) each month, sorted in (ascending) chronological order.

SELECT product, COUNT(*)
FROM transactions
GROUP BY product

The query above will calculate the number of transactions for (grouped by) each product.

HAVING

HAVING is used to filter aggregate columns, which is similar to the WHERE clause. Note that HAVING always comes after GROUP BY (filter aggregate functions), whereas WHERE comes before GROUP BY.

SELECT year, month, MAX(price)
FROM apple_stock_price
GROUP BY year, month
HAVING MAX(price) > 400

Here we’re finding every month in which Apple stock worked its way over $400.

SELECT product, AVG(price)
FROM transactions
GROUP BY product
HAVING AVG(price) <= 200

This query above will output which product had an average price of $200 or less.

CASE WHEN

The CASE WHEN statement is SQL’s version of if/then logic. Let’s take a closer look at the syntax:

SELECT name, school_year, CASE WHEN school_year IN ('senior', 'junior') THEN 'yes' ELSE NULL END AS is_senior 
FROM ncaa_players

The query above can be interpreted as the following:

  1. CASE WHEN checks each row to see if school_year is ‘junior’ or ‘senior.
  2. For a given row: if the conditional statement is true, ‘yes’ gets printed in the newly created is_senior column. If false, there will be a null value in the is_senior column.
  3. Meanwhile, SQL will output all values in the name and school_year columns.

Here are two specific examples that illustrate the power of CASE WHEN.

SELECT CASE WHEN state IN ('CA', 'WA', 'OR') THEN 'west coast' WHEN state = 'TX' THEN 'texas' ELSE 'other' END AS region, COUNT(1) AS player_count
FROM ncaa_players
WHERE weight > 200
GROUP BY 1

Count the number of players weighing more than 200 pounds for West Coast, Texas, and the other regions. (Note: COUNT(1) and GROUP BY 1 both refer to aggregation on the first column, which is the region classification).

Image for post
Image for post
SELECT state, COUNT(CASE WHEN year = 'junior' THEN 1 ELSE NULL END) AS junior_count, COUNT(CASE WHEN year = 'senior' THEN 1 ELSE NULL END) AS senior_count, COUNT(1) AS total_players_by_state
FROM ncaa_players
GROUP BY state
ORDER BY total_players_by_state DESC

Display the number of juniors and seniors by state, as well as the total number of players in each state. Order the output results such that states with the most players come first.

Image for post
Image for post

References:

mode.com

Written by

M.S.E Data Science @Johns Hopkins University with a B.S. in Applied Mathematics. Previously @MongoDB, current Data Science Intern @EA

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store