Tips, Tricks, and Hacks
Programming should not really be thought of as an amalgamation of different tips, tricks, and hacks. But also, I would be wrong to deny the power of tips, tricks, and hacks. They are an important piece of the pie. And every once in a while you are just missing one crucial piece of information to get you where you need to be.
The frontend world in particular seems to have an affinity for these nuggets of knowledge (a la CSS-tricks). I also see a lot of helpful tips and tricks with terminal, bash aliases, git, certain hotkey combinations, etc.
Today, however, I am coming to you today from the SQL world. Hopefully this trick will be as helpful to you as it has been to me.
Once in a blue moon there arises some business logic that seems like it might be impossible for SQL to solve—or, perhaps more accurately, the solutions that are coming to mind just seem to be too complex, or messy, and you don’t want to deal with them.
I was in this situation not too long ago, and the end result of that situation ended up in me adding a new tool to my SQL toolbelt. And it was so simple that I felt like people needed to have easier access to it!
SELECT DISTINCT ON
Here’s how it goes.
I have a database of
dogs and each dog has a
breed, and one or many
weights which are stored in the
dog_weight’s table. Each weight has a
value_in_pounds and a
(I’m guessing that if you clicked on and are reading a SQL blog post, you probably also want to see the schema in SQL)
CREATE TABLE dog (
CREATE TABLE dog_weight (
The one last detail is that there can only be one
best_in_show weight per
With this schema, my goal is to select what can be thought of as the “best-in-show weight” for each dog breed. BUT, some
breed’s have never won best-in-show, so they don’t have a best-in-show weight—instead they have one or more rows which have
best_in_show set to
FALSE. And in this case, we actually want to return the highest
weight for that
breed. So really, I’m trying to select the “best-in-show weight, or the highest weight on record for that dog”. So how do we do this?
SELECT dog.id, dog.breed, dog_weight.value_in_pounds FROM dog INNER JOIN dog_weight ON dog.id = dog_weight.dog_id WHERE dog_weight.best_in_show = true;
This query is sort of almost correct, but will not return the breeds which have not won best-in-show (which is really where the complexity comes from, and returning the heaviest dog of that breed in that circumstance).
As always in programming, and in this case with SQL, there’s many different paths that could arrive at the correct query. But one of the simplest paths is with this query:
SELECT DISTINCT ON (dog.breed) dog.id, dog.breed, dog_weight.value_in_pounds FROM dog INNER JOIN dog_weight ON dog.id = dog_weight.dog_id ORDER BY dog.breed DESC, dog_weight.best_in_show DESC, dog_weight.value_in_pounds DESC;
What’s really happening here is that the results are being ordered in a very specific way, such that the first result in each grouping of dog breeds is the one we want to select.
They are ordered first by their
breed (which groups them together by breed), and next by
best_in_show (which puts the first result for each window of breeds as the dog with
best_in_show, if it’s present), and finally by their weight (putting the heaviest dog first as the result for that dog breed, in the case that
best_in_show is not present). Finally, it selects
DISTINCT ON (dog.breed), which selects only the first result for each dog breed, which is exactly what we are aiming to return.
The PostgreSQL docs can explain it better than I can:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY.. Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example:
SELECT DISTINCT ON (location) location, time, report
ORDER BY location, time DESC;
retrieves the most recent weather report for each location. But if we had not used ORDER BY to force descending order of time values for each location, we’d have gotten a report from an unpredictable time for each location.
Ordering your results and using
SELECT DISTINCT ON can be a very useful way to reduce a complex query. I think one of the most interesting takeaways from this is that although we normally think of ordering results as being useful in, perhaps, finding the most expensive “something” or most recently created “other thing”, ordering can also be used as a tool to group results.
Hopefully you find use for this variation of
SELECT DISTINCT in your SQL life.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.