SQL Advanced Cheatsheet — Subqueries
As we begin trying to answer more in-depth questions about the data, our SQL queries will become more complex as well. In the beginner cheatsheet, I covered the basic structures of SQL in terms of querying from one table. In the intermediate portion, I explained aggregate functions and joins which are used to aggregate data using summary statistics and combine multiple tables.
In my cheatsheets, I first present the theories and then apply them to real-world examples to illustrate how the concepts work in practice.
Introduction
The advanced SQL cheatsheet will cover two main areas: subqueries and window functions. This section will focus on subqueries, also known as inner queries or nested queries.
A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved. Essentially, a subquery is a query within another SQL query and can be embedded within the WHERE and the FROM clause. It can also be used with the SELECT, INSERT, UPDATE, and DELETE statements along with comparison (i.e. =, >, <) and logical operators (i.e. BETWEEN, IN).
I’ll go over 5 specific examples of using subqueries in 3 different scenarios to answer more complex questions. Note that you should always include an alias (another name) for the subquery.
Aggregate Multiple Stages
Let’s say we wanted to compute how many incidents, on average, happen on a given day of the week in a given month. For example, the average number of incidents on a Friday in January.
SELECT sub.month, sub.day_of_week, AVG(sub.incident_count) AS avg_incidents
FROM
(SELECT day_of_week, date, COUNT(incident_ID) AS incident_count
FROM
GROUP BY 1, 2
) sub
GROUP BY 1, 2
ORDER BY 1, 2
- The subquery counts the total number of incidents for each day:
- Based on the above results, we then calculate the average number of incidents happening on a given day in a given month:
Another example is if we were to compute the percentage of orders that have a known customer address. Suppose our given data is as follows:
orders: ID | customer_ID | amount
customers: ID | address | amount
SELECT COUNT(CASE WHEN is_shippable = 'YES' THEN 1 ELSE NULL END) *
1.00 / COUNT(*) * 100 AS percent_shippable
FROM
(SELECT o.ID, CASE WHEN c.address IS NULL THEN 'NO' ELSE 'YES'
END AS is_shippable
FROM orders o
LEFT JOIN customers c
ON o.customer_ID = c.ID) sub
Let’s try to dissect this query above:
- The subquery returns the order ID and another boolean feature which evaluates whether if a given row has a null value for customer address. This is achieved by joining the two tables on the same customer ID. Why LEFT JOIN? Because we want to display all of the orders and subsequently calculate the percentage of orders with non-null customer addresses.
- In the SELECT clause, we count the number of rows that have a known (‘yes’) customer address, and divide it by the total number to get the percentage of shippable orders. Note that the numerator is multiplied by 1.00 for floating number division, since COUNT returns an integer.
Subqueries in Conditional Logic
Suppose we wanted to return all of the entries from the top 5 earliest dates in the data.
SELECT *
FROM crime_incidents_table
WHERE date IN (SELECT date
FROM crime_incidents_table
ORDER BY date
LIMIT 5)
We wrap the subquery as a condition in the WHERE clause to only include the entries that satisfy our “top 5 earliest dates” criterion. Note that you should not include an alias when writing a subquery in a conditional statement.
Joins Using Subqueries
SELECT *
FROM crime_incidents_table incidents
JOIN (SELECT date
FROM crime_incidents_table
ORDER BY date
LIMIT 5) sub
ON incidents.date = sub.date
This query will also return all of the entries from the top 5 earliest dates, the same results as above. The subquery includes only the top 5 earliest date entries, which is then joined to the main table.
Looking at the same data, let’s say we are now interested in finding all rows from the three categories with the fewest crime incidents reported.
SELECT main.*, sub.count_by_category AS total_incidents_by_category
FROM crime_incidents_table main
JOIN (SELECT category, COUNT(*) AS count_by_category
FROM crime_incidents_table
GROUP BY 1
ORDER BY 2
LIMIT 3) sub
ON sub.category = main.category
- The subquery returns the three categories with the least incidents per category:
- These three categories are then used as the common identifier to be joined with the main table:
The output is correct! We’ve successfully queried the relevant information regarding the three categories with the least number of incidents per category.
References:
mode.com