sql: writing legible queries using ctes

2022-07-07

 | 

~6 min read

 | 

1099 words

NB: This note is based on my experience using Mode analytics to run analysis on an Amazon Redshift SQL database. If there are language distinctions, start by looking at your version of SQL.

A CTE is a common way to refer to a Common Table Expression. But what is a CTE? I like to think of them as abstracted subqueries.

To understand how CTEs might be useful, let’s look at the same query two different ways. Before we can do that though, we need to set up the example. Using Mode’s tutorial database, we’ll look at the dunder_mifflin_paper_sales table.

Let’s start with what’s included in the table:

SELECT
  column_name,
  data_type
FROM
  INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = 'dunder_mifflin_paper_sales'

This returns the 60 columns and their data types. 1

Now, to demonstrate the power of CTEs, let’s say that we want to see the total quantity and percent of sales by account. We can imagine that if the list of orders looks like:

order_id quantity account_id
1 10 a
2 5 a
3 15 b
4 2 c
5 8 c
6 20 c

Then what we want is:

acct_qty account_id pct
15 a .25
15 b .25
30 c .5

There are likely many ways to achieve this, but one is through the use of CTEs.

Let’s start with just getting the quantities by account:

SELECT
  d.account_id AS acct,
  sum (d.quantity) AS acct_qty
FROM
  tutorial.dunder_mifflin_paper_sales AS d
GROUP BY
  1

And the total quantity would be:

SELECT
  sum (d.quantity) AS qty
FROM
  tutorial.dunder_mifflin_paper_sales AS d

But how do we bring them together and add the extra detail of the percent at the account level? One way would be to add a subquery. There are numerous problems with subqueries, however. Most obviously is the duplication of logic throughout the query. In many cases, we want to use the same piece of logic, but with a subquery, we’d be forced to rewrite it in multiple spots. CTEs might look more complicated, but once you get used to them, I think they make queries much easier to read and reason through.

Before we write out the full query, let’s look at some standard syntax for CTEs.

-- To get started with CTEs, we need to define them.
-- To do this, we use the keyword WITH
-- Then we provide the CTE a name.
-- Just as
with qty_by_account AS ()
-- if we want to do multiple CTEs, add a comma to separate
, total_qty as ()
-- then when you're done we are able to proceed with our standard query

And now, let’s put this into practice with a real example:

-- Defining our CTEs
WITH account_data AS (
  SELECT
    d.account_id AS acct,
    sum (d.quantity) AS acct_qty
  FROM
    tutorial.dunder_mifflin_paper_sales AS d
  GROUP BY
    1
),
all_acct_qty AS (
  SELECT
    sum(acct_qty) AS qty
  FROM
    account_data
)

-- Using our CTEs in our standard query
SELECT
  a.*,
  t.qty,
  1.0 * SUM(a.acct_qty) / sum(t.qty) AS pct
FROM
  account_data AS a
  CROSS JOIN all_acct_qty AS t
GROUP BY
  a.acct,
  a.acct_qty,
  t.qty
ORDER BY
  2 DESC

While this is slightly verbose, each piece is understandable on its own. The first CTE, account_data, gets a sum of the quantity of paper sold at the account level. The second CTE, all_acct_qty, summarizes further to get the full quantity of paper products sold by Dunder Mifflin in the data set.

In our SELECT statement we then reference these table expressions as if they were tables.

SELECT
-- ...
FROM
  account_data AS a
  CROSS JOIN all_acct_qty AS t

CTEs provide a lovely way to simplify potentially complicated queries. They work by teasing out pieces into smaller queries which are then composed on their own. There’s definitely a learning curve to them, though I’d argue the hardest part of using them is that they force you to think about a problem differently. What used to be one bigger step now makes more sense as smaller ones. These smaller pieces are then composed.

It’s nice and elegant!

Footnotes

  • 1 The full list of columns in our table.

    column_name data_type
    order_id text
    purchased_at timestamp without time zone
    status text
    cancelled_at timestamp without time zone
    returned_at text
    product_id text
    product_name text
    price double precision
    discount double precision
    shipping_cost double precision
    quantity double precision
    business_size double precision
    payment_cycle text
    account_id text
    account_name text
    account_manager text
    days_to_close double precision
    shipping_mode text
    shipping_address text
    shipping_city text
    shipping_state text
    shipping_zip double precision
    shipping_region text
    shipping_latitude double precision
    shipping_longitude double precision
    days_to_ship double precision
    reviewed_at timestamp without time zone
    rating double precision
    index double precision
    review text
    order_id text
    purchased_at timestamp without time zone
    status text
    cancelled_at timestamp without time zone
    returned_at text
    product_id text
    product_name text
    price double precision
    discount double precision
    shipping_cost double precision
    quantity double precision
    business_size double precision
    payment_cycle text
    account_id text
    account_name text
    account_manager text
    days_to_close double precision
    shipping_mode text
    shipping_address text
    shipping_city text
    shipping_state text
    shipping_zip double precision
    shipping_region text
    shipping_latitude double precision
    shipping_longitude double precision
    days_to_ship double precision
    reviewed_at timestamp without time zone
    rating double precision
    index double precision
    review text

Related Posts
  • SQL: How To Append A Generated Value To Each Row


  • Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!