SQL Advanced Cheatsheet — Subqueries

As we begin trying to answer more in-depth questions about the data, our SQL queries will become more complex as well. In the beginner cheatsheet, I covered the basic structures of SQL in terms of querying from one table. In the intermediate portion, I explained aggregate functions and joins which are used to aggregate data using summary statistics and combine multiple tables.

In my cheatsheets, I first present the theories and then apply them to real-world examples to illustrate how the concepts work in practice.

Introduction

The advanced SQL cheatsheet will cover two main areas: subqueries and window functions. This section will focus on subqueries, also known as inner queries or nested queries.

A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved. Essentially, a subquery is a query within another SQL query and can be embedded within the WHERE and the FROM clause. It can also be used with the SELECT, INSERT, UPDATE, and DELETE statements along with comparison (i.e. =, >, <) and logical operators (i.e. BETWEEN, IN).

I’ll go over 5 specific examples of using subqueries in 3 different scenarios to answer more complex questions. Note that you should always include an alias (another name) for the subquery.

Aggregate Multiple Stages

Let’s say we wanted to compute how many incidents, on average, happen on a given day of the week in a given month. For example, the average number of incidents on a Friday in January.

  • The subquery counts the total number of incidents for each day:
  • Based on the above results, we then calculate the average number of incidents happening on a given day in a given month:

Another example is if we were to compute the percentage of orders that have a known customer address. Suppose our given data is as follows:

orders: ID | customer_ID | amount

customers: ID | address | amount

Let’s try to dissect this query above:

  • The subquery returns the order ID and another boolean feature which evaluates whether if a given row has a null value for customer address. This is achieved by joining the two tables on the same customer ID. Why LEFT JOIN? Because we want to display all of the orders and subsequently calculate the percentage of orders with non-null customer addresses.
  • In the SELECT clause, we count the number of rows that have a known (‘yes’) customer address, and divide it by the total number to get the percentage of shippable orders. Note that the numerator is multiplied by 1.00 for floating number division, since COUNT returns an integer.

Subqueries in Conditional Logic

Suppose we wanted to return all of the entries from the top 5 earliest dates in the data.

We wrap the subquery as a condition in the WHERE clause to only include the entries that satisfy our “top 5 earliest dates” criterion. Note that you should not include an alias when writing a subquery in a conditional statement.

Joins Using Subqueries

This query will also return all of the entries from the top 5 earliest dates, the same results as above. The subquery includes only the top 5 earliest date entries, which is then joined to the main table.

Looking at the same data, let’s say we are now interested in finding all rows from the three categories with the fewest crime incidents reported.

  • The subquery returns the three categories with the least incidents per category:
  • These three categories are then used as the common identifier to be joined with the main table:

The output is correct! We’ve successfully queried the relevant information regarding the three categories with the least number of incidents per category.

References:

mode.com

M.S.E Data Science @Johns Hopkins University with a B.S. in Applied Mathematics. Previously @MongoDB, current Data Science Intern @EA

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store