Select distinct in database

This is all my investigation about the differences I encountered running an ActiveRecord query, that includes a SELECT DISTINCT statement, from one database to another.

I had the surprise to have two different results running the same query in SQLite and PostgreSQL.

This article is about my findings and how to play with SQL to have the same results in the two databases.

☀️ PostgreSQL

First thing first.

Rails app

I created a rails app with models like so:

class User < ApplicationRecord
  has_many :comments
end

class Comment < ApplicationRecord
  belongs_to :user
end

Schema

ActiveRecord::Schema.define(version: 2018_11_27_145214) do

  enable_extension "plpgsql"

  create_table "comments", force: :cascade do |t|
    t.bigint "user_id"
    t.integer "rating"
    t.index ["user_id"], name: "index_comments_on_user_id"
  end

  create_table "users", force: :cascade do |t|
    t.string "name"
  end

  add_foreign_key "comments", "users"
end

Database configuration

I used a postgresql adapter for the databse (in database.yml)

default: &default
  adapter: postgresql
  encoding: unicode
  pool: 5
  host: localhost

development:
  <<: *default
  database: users_comments

Seed

I then seeded with some records:

user1 = User.create(name: 'Ola')
user2 = User.create(name: 'Hello')
user3 = User.create(name: 'Wow')
user4 = User.create(name: 'Bye')

comment1 = Comment.create(user: user1, rating: 1)
comment2 = Comment.create(user: user2, rating: 5)
comment3 = Comment.create(user: user2, rating: 3)
comment3 = Comment.create(user: user3, rating: 5)
comment4 = Comment.create(user: user1, rating: 2)

SQL views

SELECT * FROM users

id name
1 Ola
2 Hello
3 Wow
4 Bye

SELECT * FROM comments

id user_id rating
1 1 1
2 2 2
3 2 3
4 3 5
5 1 5

Bye does not have any comments so it should not appear again in the problem investigation.


Implentation of queries in User model

What I wanted to do is:

  1. create a method to rank the user by comment rating: order_comment_rating
  2. create a method to select unique users with comments: with_comments
  3. chain the two methods above with_comments_order_comment_rating

The problem being mainly to write the third method because of the .distinct that may appear in the second method.

Let’s deep into it a little later.


1. .order_comment_rating

Ordering records by comment rating

First, I created a method .order_comment_rating in the User model.

  def self.order_comment_rating
    joins(:comments).merge(Comment.order(:rating))
  end

Let’s split the method to understand what it does.

Step 1: Join comments

First we need to connect the users and comments tables.

>> User.joins(:comments)
SELECT "users".*
FROM "users"
INNER JOIN "comments" ON "comments"."user_id" = "users"."id";

This query does this:

id name
1 Ola
2 Hello
2 Hello
3 Wow
1 Ola

This joins returns 5 records, but we only have 3 users. In fact, it returns all the users but some of them appears twice. We have to imagine that there is the comments table on the right, where each one of the comments is connected.

Users Hello and Ola appear twice as they both wrote two comments.

Step 2: Order comments by rating

>> Comment.order(:rating)
SELECT comments.*
FROM comments
ORDER BY comments.rating ASC;

With this query we get all the comments from the database, ordered by rating, so we have 5 records in output.

Apply both

Then we can easily chain both to have our result of users ordered by rating:

>>  joins(:comments).merge(Comment.order(:rating))
SELECT users.*
FROM users
INNER JOIN comments ON comments.user_id = users.id
ORDER BY comments.rating ASC
id user_id rating
1 Ola 1
1 Ola 2
2 Hello 3
2 Hello 5
3 Wow 5

2. .with_comments

List users with comments

Next, I implemented the .with_comments method in the User model to return the users who wrote comments.

def self.with_comments
  joins(:comments).distinct
end

As we saw earlier joins(:comments) return all users (with duplicates) for all the comments they wrote.

Contrary to the first method, here I applied a distinct so duplicates of users are removed.

SELECT DISTINCT users.*
FROM users
INNER JOIN comments ON comments.user_id = users.id;

So we end up with 3 records returned.

id name
1 Ola
2 Hello
3 Wow

3. .with_comments_order_comment_rating

This is where the troubles arise 🙂

Well, the first reflex would be to chain the two methods above: .with_comments and .order_comment_rating to have only unique users who got comments ordered by rating, like so:

>> User.with_comments.order_comment_rating

Reading the definition of the each method in the User model, the query above is the same as:

>> User.joins(:comments).distinct.merge(Comment.order(:rating))

If we do this, this will fail ! 🤕

Traceback (most recent call last):
ActiveRecord::StatementInvalid (PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list)
LINE 1: ..." ON "comments"."user_id" = "users"."id" ORDER BY "comments"...
                                                             ^
: SELECT DISTINCT "users".* FROM "users" INNER JOIN "comments" ON "comments"."user_id" = "users"."id" ORDER BY "comments"."rating" ASC LIMIT $1

There is ways to handle this ! But first, let’s see what happens when changing the database configuration.

❌ Remember well the failing query from the Traceback:

SELECT DISTINCT "users".*
FROM "users"
INNER JOIN "comments" ON "comments"."user_id" = "users"."id"
ORDER BY "comments"."rating" ASC LIMIT $1

What’s coming was a huge discovery to me !


☀️ SQLite3

I changed the database configuration in database.yml to use the adapter sqlite3. The app remained unchanged.

default: &default
  adapter: sqlite3
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000

development:
  <<: *default
  database: db/development.sqlite3

I opened a rails console and wrote the exact same query as before.

>> User.with_comments.order_comment_rating

As with PostgreSQL, I was expecting a ActiveRecord::StatementInvalid error to come up. But I had 3 records returned and no error.

1|2018-11-17 12:53:44.516126|2018-11-17 12:53:44.516126|Ola|1
2|2018-11-17 12:53:44.518606|2018-11-17 12:53:44.518606|Hello|2
3|2018-11-17 12:53:44.521313|2018-11-17 12:53:44.521313|Wow|3
id name
1 Ola
2 Hello
3 Wow

The SQL generated was exactly the same than the one that just crashed with PostgreSQL. 🕵🏻‍♀️ (see error message in the Traceback)

SELECT DISTINCT users.*
FROM users
INNER JOIN comments ON comments.user_id = users.id
ORDER BY comments.rating ASC;

SQLite does not have any troubles handling the DISTINCT in queries.

Here is the query plan in SQLite console:

>> EXPLAIN QUERY PLAN SELECT DISTINCT users.* FROM users INNER JOIN comments ON comments.user_id = users.id ORDER BY comments.rating ASC;
5|0|0|SCAN TABLE comments
7|0|0|SEARCH TABLE users USING INTEGER PRIMARY KEY (rowid=?)
22|0|0|USE TEMP B-TREE FOR DISTINCT
23|0|0|USE TEMP B-TREE FOR ORDER BY

☀️ Database differences & investigation

Database behaviors

SQL language and Query Processor

Things important to note is SQL (Structured Query Language) is, well, a language, and its execution is determined by the Query Processor, also called the Query Executor.

From Microsoft documentation:

The Query Processor accepts SQL syntax, determines how to execute that syntax and executes the chosen plan.

The query processor chooses:

  • Which, if any, indexes to use
  • The order in which joins are executed
  • The order in which constraints such as WHERE clauses are applied
  • Which algorithms are likely to lead to the best performance, based on costing information derived from statistics

So depending on the Query Processor chosen, the results of query can differ. This is what happened in my case.


Let’s get back to business! Why does the query that works in SQLite does not work in PostgreSQL ?

🕵🏻‍♀️ PostgreSQL lead 1

⚒ I first tried the solution offered by the error message:

PG::InvalidColumnReference:
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

To summarize, when we join the users and comments tables together, we get this:

id name id user_id rating
1 Ola 5 1 1
2 Hello 2 2 5
2 Hello 3 2 3
3 Wow 4 3 5
1 Ola 1 1 2

Now, we want the list of distinct users, ordered by their respectivecomments.rating.

But, taking user Hello for example, where should it goes ? As we have to only pick one row to represent this user, should it come second, because one row is rating the comment at 2, or should it go last because it is rated 5 on another comment ?

As the database system can’t make that decision for us, we are restricted to ordering only by columns from the SELECT clause.

Let’s look at the sqlite3 query to understand why it can not work in PostgreSQL.

SELECT DISTINCT users.*
FROM users
INNER JOIN comments ON comments.user_id = users.id
ORDER BY comments.rating ASC;
  1. The issue with the logical ordering is that in SQLite, comments.rating isn’t include in the SELECT clause, only the users table is
  2. DISTINCT happens before ORDER BY, so the column we would like to sort on, comments, is gone
  3. the ORDER BY results is undefined

SQLite is accepting the query even though it’s logically undefined.

So changing the sqlite query to include the comments table in the SELECT clause is accepted by PostgreSQL.

SELECT DISTINCT users.*, comments.rating
FROM users
INNER JOIN comments ON comments.user_id = users.id
ORDER BY comments.rating ASC;
id name rating
1 Ola 1
2 Ola 2
2 Hello 3
3 Hello 5
1 Wow 5

❌ We didn’t get the error but the problem now is that we have duplicate users. We would expect to only have 3 users returned.

Query Plan:

Unique  (cost=159.69..176.69 rows=1700 width=44)
  ->  Sort  (cost=159.69..163.94 rows=1700 width=44)
        Sort Key: comments.rating, users.id, users.name
        ->  Hash Join  (cost=37.00..68.47 rows=1700 width=44)
              Hash Cond: (comments.user_id = users.id)
              ->  Seq Scan on comments  (cost=0.00..27.00 rows=1700 width=12)
              ->  Hash  (cost=22.00..22.00 rows=1200 width=40)
                    ->  Seq Scan on users  (cost=0.00..22.00 rows=1200 width=40)

🕵🏻‍♀️ PostgreSQL lead 2

⚒ I then implemented this solution after doing a course about Advanced Query on Upcase - Thoughbot.

I tried to handle the ActiveRecord::StatementInvalid by doing a subquery.

>> User.from(User.with_comments, :users).order_comment_rating

Let’s split this to understand what happens.

The first query uses the .with_comments method and wrap it into a subquery. Then it says to select all users from this subquery.

>> User.from(User.with_comments, :users)
SELECT users.*
FROM (
	SELECT DISTINCT users.* FROM users
	INNER JOIN comments ON comments.user_id = users.id
	) users;

It returns exactly 3 users as the User.with_comments query did, but we haven’t ordered by rating yet. So:

>> User.from(User.with_comments, :users).order_comment_rating
SELECT users.*
FROM (
	SELECT DISTINCT users.* FROM users
	INNER JOIN comments ON comments.user_id = users.id
	) users
INNER JOIN comments ON comments.user_id = users.id
ORDER BY comments.rating ASC;
id name rating
1 Ola 1
2 Ola 2
2 Hello 3
3 Hello 5
1 Wow 5

❌ No error, but it returns 5 records listing the users ordered by comment rating. The problem remaining that we have duplicate users.

Query Plan

Sort  (cost=1072.38..1097.88 rows=10200 width=44)
  Sort Key: comments.rating
  ->  Merge Join  (cost=162.50..393.25 rows=10200 width=44)
        Merge Cond: (comments.user_id = users.id)
        ->  Index Scan using index_comments_on_user_id on comments  (cost=0.15..73.65 rows=1700 width=12)
        ->  Sort  (cost=162.35..165.35 rows=1200 width=40)
              Sort Key: users.id
              ->  HashAggregate  (cost=76.97..88.97 rows=1200 width=40)
                    Group Key: users.id, users.name
                    ->  Hash Join  (cost=37.00..68.47 rows=1700 width=40)
                          Hash Cond: (comments_1.user_id = users.id)
                          ->  Seq Scan on comments comments_1  (cost=0.00..27.00 rows=1700 width=8)
                          ->  Hash  (cost=22.00..22.00 rows=1200 width=40)
                                ->  Seq Scan on users  (cost=0.00..22.00 rows=1200 width=40)

🕵🏻‍♀️ PostgreSQL lead 3

⚒ I then discussed about it on Twitter, and had help from Stephen Frost, who told me about CROSS JOIN LATERAL to solve this problem.

The CROSS JOIN LATERAL iterates over each of the results and runs subquery giving an access to that record. That can be a help for us !

From PostgreSQL documentation:

Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.) …

… When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set’s values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s).

SELECT users.*, comments.rating
FROM users CROSS JOIN LATERAL (
	SELECT comments.rating FROM comments
	WHERE users.id  = comments.user_id
	ORDER BY comments.rating ASC limit 1
	) AS comments;

The limit 1 is a way to select one user once the users have been ordered by comments.

✅ Here we have our result without any duplicates 💃🏻

id name rating
1 Ola 1
2 Hello 3
3 Wow 5

Query Plan

Unique  (cost=159.69..176.69 rows=1700 width=44)
  ->  Sort  (cost=159.69..163.94 rows=1700 width=44)
        Sort Key: comments.rating, users.id, users.name
        ->  Hash Join  (cost=37.00..68.47 rows=1700 width=44)
              Hash Cond: (comments.user_id = users.id)
              ->  Seq Scan on comments  (cost=0.00..27.00 rows=1700 width=12)
              ->  Hash  (cost=22.00..22.00 rows=1200 width=40)
                    ->  Seq Scan on users  (cost=0.00..22.00 rows=1200 width=40)

🕵🏻‍♀️ Conclusion

  • Be carefully with database configuration: it has an impact on SQL queries as they won’t be executed by the same query processor.
  • What SQLite can handle easily, PostgreSQL needs a complex query for a SELECT DISTINCT clause.
  • The next step would be to investigate query plans on each query executor.

Next step

  • I would like to investigate how to implement the last query in ActiveRecord.
  • I would love to illustrate this with SQL join diagrams and learn more about query plans.

To go further


https://use-the-index-luke.com/fr/sql/plans-dexecution/postgresql/operations

Updated: