SQL Intermediate Cheatsheet — Aggregate Functions
The intermediate SQL tutorial will go over 2 main areas: aggregate functions (aggregate data using summary statistics) and joins (combine 2 or more tables). I will cover various aggregate functions in this section, ranging from COUNT and SUM to GROUP BY and CASE WHEN. In the next part, I will also go over the various types of joins, such as INNER JOIN and LEFT JOIN.
These cheatsheets focus on first presenting the theories and then applying them to real-world examples to illustrate how the concepts work in practice.
Aggregate Functions
SQL is excellent at aggregating data similar to how you might in an Excel pivot table. Aggregate functions are used all the time, so it’s definitely important to get comfortable with them.
The specific functions I will cover are COUNT, SUM, DISTINCT, GROUP BY, HAVING, and CASE. There are also other basic aggregate functions such as MIN/MAX and AVG. Specifically, MIN(price), MAX(price), and AVG(price) will output the minimum, maximum, and average price from a given table.
COUNT
COUNT is the easiest aggregate function, which counts the number of rows in a particular column. The COUNT function can be used on both numerical and non-numerical (i.e. text) values.
COUNT all rows
SELECT COUNT(*)
FROM table_name
COUNT(*) will output a single number which equals the total number of rows in the entire table. (Note: using COUNT(1) has the same effect as COUNT(*) )
COUNT individual columns
SELECT COUNT(user_ID)
FROM table_name
The query above returns the number of rows in which user_ID is not null.
COUNT DISTINCT values
SELECT COUNT(DISTINCT user_ID)
FROM table_name
This returns the total unique values of the user_ID column.
SUM
SUM totals the values in a given column, Unlike COUNT, SUM can only be applied to numerical columns. Another key difference between the two is the presence of null values. As SUM treats nulls as 0, you don’t need to worry as much about null values here as you would with COUNT.
The query below will calculate the sum of the volume column from a table with Apple stock prices.
SELECT SUM(volume)
FROM apple_stock_price
Note that aggregators like SUM only aggregate vertically. Calculation across rows can be done using simple arithmetic.
GROUP BY
GROUP BY comes in handy when you want to aggregate only a subset of the table by separating the data into groups, which can be aggregated independently. On the other hand, functions like COUNT and SUM have the commonality of aggregating across entire tables.
GROUP BY is often used with the aforementioned aggregate functions: COUNT, SUM, AVG, MIN, MAX.
SELECT year, month, SUM(volume)
FROM apple_stock_price
GROUP BY year, month
ORDER BY year, month
The above query will output the total number of Apple shares traded (grouped by) each month, sorted in (ascending) chronological order.
SELECT product, COUNT(*)
FROM transactions
GROUP BY product
The query above will calculate the number of transactions for (grouped by) each product.
HAVING
HAVING is used to filter aggregate columns, which is similar to the WHERE clause. Note that HAVING always comes after GROUP BY (filter aggregate functions), whereas WHERE comes before GROUP BY.
SELECT year, month, MAX(price)
FROM apple_stock_price
GROUP BY year, month
HAVING MAX(price) > 400
Here we’re finding every month in which Apple stock worked its way over $400.
SELECT product, AVG(price)
FROM transactions
GROUP BY product
HAVING AVG(price) <= 200
This query above will output which product had an average price of $200 or less.
CASE WHEN
The CASE WHEN statement is SQL’s version of if/then logic. Let’s take a closer look at the syntax:
SELECT name, school_year, CASE WHEN school_year IN ('senior', 'junior') THEN 'yes' ELSE NULL END AS is_senior
FROM ncaa_players
The query above can be interpreted as the following:
- CASE WHEN checks each row to see if school_year is ‘junior’ or ‘senior.
- For a given row: if the conditional statement is true, ‘yes’ gets printed in the newly created is_senior column. If false, there will be a null value in the is_senior column.
- Meanwhile, SQL will output all values in the name and school_year columns.
Here are two specific examples that illustrate the power of CASE WHEN.
SELECT CASE WHEN state IN ('CA', 'WA', 'OR') THEN 'west coast' WHEN state = 'TX' THEN 'texas' ELSE 'other' END AS region, COUNT(1) AS player_count
FROM ncaa_players
WHERE weight > 200
GROUP BY 1
Count the number of players weighing more than 200 pounds for West Coast, Texas, and the other regions. (Note: COUNT(1) and GROUP BY 1 both refer to aggregation on the first column, which is the region classification).
SELECT state, COUNT(CASE WHEN year = 'junior' THEN 1 ELSE NULL END) AS junior_count, COUNT(CASE WHEN year = 'senior' THEN 1 ELSE NULL END) AS senior_count, COUNT(1) AS total_players_by_state
FROM ncaa_players
GROUP BY state
ORDER BY total_players_by_state DESC
Display the number of juniors and seniors by state, as well as the total number of players in each state. Order the output results such that states with the most players come first.
References:
mode.com