Instantly share code, notes, and snippets.

Embed
What would you like to do?
Ordering a query result set by an arbitrary list in PostgreSQL

I'm hunting for the best solution on how to handle keeping large sets of DB records "sorted" in a performant manner.

Problem Description

Most of us have work on projects at some point where we have needed to have ordered lists of objects. Whether it be a to-do list sorted by priority, or a list of documents that a user can sort in whatever order they want.

A traditional approach for this on a Rails project is to use something like the acts_as_list gem, or something similar. These systems typically add some sort of "postion" or "sort order" column to each record, which is then used when querying out the records in a traditional order by position SQL query.

This approach seems to work fine for smaller datasets, but can be hard to manage on large data sets with hundreds (or thousands) of records needing to be sorted. Changing the sort position of even a single object will require updating every single record in the database that is in the same sort group. This requires potentially thousands of writes to the database for every sort order update.

Possible Solutions

One typical alternative approach is to store a single, serialized column that contains the full sort order (for instance a serialized array column that contains the ordered list of record ids). Using this approach, we are now required to query the records out from the database using this specified order.

Ideally, it would be nice if the database just used the order of the in clause so a query of select * from foo where id in (67,23,1362,24) returned the records in that order, but I'm not aware of a database that does this.

MySQL has a built-in ORDER BY Field() clause where you could do something like ORDER BY FIELD(67,23,1362,24)

PostgreSQL doesn't seem to have any similar functions. There are two somewhat widely discussed solutions that I have come across.

  1. Using the PostgreSQL CASE clause like the following:
    SELECT * FROM foo ORDER BY
    CASE
      WHEN id=67 THEN 1
      WHEN id=23 THEN 2
      WHEN id=1362 THEN 3
      WHEN id=24 THEN 4
    END
  1. Using a custom JOIN table like the following:
    SELECT * FROM foo
    JOIN (VALUES (67,1), (23,2), (1362,3), (24,4)) as x(id, ordering) ON foo.id = x.id
    ORDER BY x.ordering

I'm currently using solution #2 as it was easier to implement as a module that I can include in any AR::Base class and have it work well with other scopes.

I'm wondering if there is an obvious solution I'm just missing, or any alternatives or alternate implementations of the db schema to begin with that might make this a non-issue.

What are the best ways to keep a large set of data sorted according to user specified order with PostgreSQL?

Sidenote

For this particular use case that I'm dealing with, we allow sorting more than one object at a time. The objects (images) are laid out in a grid on the UI and the user can select a single image to reorder via drag and drop, or they can select multiple images to be reordered via drag and drop at the same time. This makes using something like the acts_as_list#insert_at method less useful. I suppose we could call #insert_at or manually #update_attribute(:position, N) one at a time for each item that was moved in position, but I'd like to do it all in one fell swoop if possible.

@shoe

This comment has been minimized.

shoe commented Sep 2, 2012

I think it depends a lot on the size of your list. For example, if N < 100, you'll get lots of reasonable behavior out of a serialized sort array. You can even do the sorting in ruby without much problem. You do end up reading and writing the list a lot, but it's small, so who cares?

OTOH, if N = 100000, you have some not-so-nice behaviors from a serialized array. Now the single reads and writes are larger, parse times are slower, etc. If you want to do things like get items 30000-30099 to show on a page, it's gonna get painful.

However, if you have large N, you can still get great performance out of an indexed column, while still keeping the number of records updated equal to the number of records being moved. You just have to use a sort column with a datatype with a continuous sort function, like float or string (not integer). Then, there are various convenient ways to update the sort column to reorder, depending on how much integrity you want to enforce.

Note: if you consider the pathological case, with very large datasets, and with extremely bad reordering operations (e.g. swapping two adjacent items 2^^64 times) you can run into problems, either with collisions in the sort key (float), or ballooning space requirement (string). In practice, this isn't really a concern, and the sort column can be renormalized periodically anyway to avoid such issues.

@david

This comment has been minimized.

david commented Sep 2, 2012

Not sure if this is what @shoe means, but I think you can also use an integer as long as indexes aren't consecutive (i.e. treated more like weights)? With sufficient space between weights, you can move objects without updating the other records.

E.g., say you have (Record 1, 100), (Record 2, 200), (Record 3, 300). If you want to move Record 3 up, you'd update Record 3's weight to 250, leaving room for more resorting. I guess at some point the sort column's weights would have to be reset. As usual, that depends on context.

@david

This comment has been minimized.

david commented Sep 2, 2012

Err, that would be 150, not 250.

@grin

This comment has been minimized.

grin commented Sep 2, 2012

@shoe

This comment has been minimized.

shoe commented Sep 2, 2012

yep, that's the technique I was describing. I see that they even rebalance synchronously, if needed. I can see the advantage of using a spaced-out integer rather than string (storage), but I wonder if there is any advantage of using integer over float or float over integer. Behavior of integers is much easier to model. Otherwise, I suppose they are pretty similar. You have to choose your spacing a priori in either case.

@shoe

This comment has been minimized.

shoe commented Sep 2, 2012

@grin nice find!

@cpjolicoeur

This comment has been minimized.

Owner

cpjolicoeur commented Sep 2, 2012

So, it seems the ideal solution is to use the original methodology of keeping the "sort position" on each individual record and not using a serialized field containing the sort order.

@shoe

This comment has been minimized.

shoe commented Sep 3, 2012

it depends on N, and on access patterns, but in general, the sort column on the record is the more robust solution.

@diogenes

This comment has been minimized.

diogenes commented Sep 3, 2012

@cpjolicoeur What about to create a custom function in PostgreSQL? Maybe something like this: http://wiki.postgresql.org/wiki/Array_Index

Then you can do:

SELECT * FROM foo WHERE id IN (3, 20, 16, 8, 14)
ORDER BY array_idx(ARRAY[3, 20, 16, 8, 14], foo.id)

Using this approach, will be easy to implement a generic scope to do this kind of sorting. Something like this:

scope :ordered_by_rank, lambda { |rank, column = :id|
  order("array_idx(ARRAY[#{Array(rank).join(', ')}], #{column})")
}
@agrberg

This comment has been minimized.

agrberg commented Sep 4, 2012

@diogenes Your approach is interesting but I'm not familiar enough with custom functions in PostgreSQL to know how it would be different from the case statement example Craig listed as a potential solution.

@cpjolicoeur

This comment has been minimized.

Owner

cpjolicoeur commented Sep 4, 2012

@diogenes, I had thought of duplicating the ORDER BY FIELD() feature of MySQL as a postgres function, and there are several examples online of such functions, but this app is currently hosted on Heroku and I dont have permissions to create functions

@leckylao

This comment has been minimized.

leckylao commented Jul 16, 2014

Hi @cpjolicoeur, love the discussion and loving and using the custom join approach. It works like charm. Thank you very much.

Just want to share my use-case here that I have items(id, owner_id) and a list of owners(1, 3, 15, 4) and ordering items in the order of the list of owners.

owners = [1, 3, 15, 4]
values = ''
owners.each_with_index do |owner, index|
  values += ", " if index > 0
  values += %((#{owner}, #{index}))
end
@items = Item.joins("join (VALUES #{values}) as x(id, ordering) on items.owner_id = x.id").order("x.ordering").page params[:page]
@g-ilham

This comment has been minimized.

g-ilham commented Nov 5, 2014

sort in mysql:

    > ids = [11,31,29]
    => [11, 31, 29]
    > User.where(id: ids).order("field(id, #{ids.join(',')})")

in postgres:

    def self.order_by_ids(ids)
      order_by = ["case"]
      ids.each_with_index.map do |id, index|
        order_by << "WHEN id='#{id}' THEN #{index}"
      end
      order_by << "end"
      order(order_by.join(" "))
    end

    User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id) 
    #=> [3,2,1]
@alexbel

This comment has been minimized.

alexbel commented Nov 6, 2014

@g-ilham Perfect solution!

@robuye

This comment has been minimized.

robuye commented Nov 13, 2014

given you have intarray extension (http://www.postgresql.org/docs/9.1/static/intarray.html) it is possible to do:

def order_by_ids(ids)
  order("idx(ARRAY#{ids}, id")
end


irb(main):062:0* ids
=> [1, 2, 3, 4, 5, 9, 8, 7, 6]
irb(main):063:0> User.where("id < 10").order("idx(ARRAY#{ids}, id)").map(&:id)
  User Load (1.7ms)  SELECT "users".* FROM "users"  WHERE (id < 10)  ORDER BY idx(ARRAY[1, 2, 3, 4, 5, 9, 8, 7, 6], id)
=> [1, 2, 3, 4, 5, 9, 8, 7]
@JacobEvelyn

This comment has been minimized.

JacobEvelyn commented Feb 27, 2015

@leckylao thanks for your example; I have a very similar use case and it works perfectly!

Does anyone know if any of these approaches are typically more performant than others (obviously individual data sets and database setups matter a lot; I'm more curious about obvious order-of-magnitude algorithmic differences)? My gut tells me that the idx and array_idx approaches will be slower because a database would naively be doing an O(n^2) search to find the indices, but I don't know how that might compare to the overhead of, say, another database join, or a ton of CASE statements, etc.

@murbanski

This comment has been minimized.

murbanski commented Mar 9, 2015

ORDER BY ID=10 DESC, ID=2 DESC, ID=56 DESC, ID=40  DESC

Works perfectly.

@JacobEvelyn

This comment has been minimized.

JacobEvelyn commented Mar 13, 2015

Thanks so much for the tip @murbanski! That's awesome, and works in both Postgres and SQLite (I didn't get a chance to test MySQL). We just bundled this up into a gem for easier use: https://github.com/panorama-ed/order_as_specified

Edit: Just tested MySQL and it works as well.

@deryagin

This comment has been minimized.

deryagin commented May 21, 2015

Works in Postgres:

ORDER BY (ID=10, ID=2, ID=56, ID=40) DESC
@DanielVartanov

This comment has been minimized.

DanielVartanov commented Oct 28, 2015

Works in Postgres:

ORDER BY position(id::text in '10, 20, 56, 40')

@vasilakisfil

This comment has been minimized.

vasilakisfil commented Nov 10, 2015

Which one is faster? Any db-specialist who would like to give us a hint? :)

@gingerlime

This comment has been minimized.

gingerlime commented Nov 11, 2015

thanks @DanielVartanov ! 👍 seems really neat. It's easy to use in active record scopes like this:

class Something < ActiveRecrd::Base

  scope :for_ids_with_order, ->(ids) {
    order = sanitize_sql_array(
      ["position(id::text in ?)", ids.join(',')]
    )
    where(:id => ids).order(order)
  }    
end

# usage:
Something.for_ids_with_order([1, 3, 2])

also posted on this SO answer

@semanticart

This comment has been minimized.

semanticart commented Jun 23, 2016

The answer put forward by @gingerlime uses text comparison, so it gets confused if you have ids where one is a substring of another.

Consider ids [1234, 55, 3, 1, 2]. The sorted results will have 55 last because 3, 1, and 2 match a substring of 1234.

A join with ORDINALITY seems to work fine, though.

@nhontpnv

This comment has been minimized.

nhontpnv commented Jul 26, 2016

@murbanski thank you so much, I have to sign up just for thanking you.

@igtulm

This comment has been minimized.

igtulm commented Sep 14, 2017

@cpjolicoeur I found the solution below very nice for me. E.g. you use PL/PGSQL and pass IN-values and array for array_position from single parameter. What do you think about it?

SELECT * FROM foo WHERE id IN (3, 1, 2) ORDER BY array_position(ARRAY[3, 1, 2], id);

It works since PostgreSQL 9.5.

@kevinzwhuang

This comment has been minimized.

kevinzwhuang commented Oct 6, 2017

@cpjolicoeur @igtulm I also found array_position to be a great solution. In my research (using EXPLAIN), I found that using array_position has been the least expensive method.

ORDER BY array_position(ARRAY[1, 2, 3]::int[], table_name.id::int)

Note: I'm casting id to int because primary keys starting with Rails 5 are bigint
Note 2: Also casting the array into int[] to ensure it works even if I'm providing a list of strings in Rails

To build off the AR scope integration that @gingerlime proposed, a scope could be written like so:

class User < ApplicationRecord

  scope :with_id_order, ->(ids) {
    order = sanitize_sql_array(["array_position(ARRAY[?]::int[], users.id::int)", ids])
    order(order)
  }

end

# Usage
User.with_id_order([5, 10, 20])
# Users 5, 10, 20, 1, 2, 3, 4...etc

# Restrict by the ids given
user_ids = [5, 10, 20]
User.where(id: user_ids).with_id_order(user_ids)
# Users 5, 10, 20

This scope gives a bit more flexibility in that it allows you to order with a given array of prioritized ids while querying for other users too.

@msacket

This comment has been minimized.

msacket commented Jan 13, 2018

I'm a bit late to this ballgame, but here is another solution using PostgreSQL that allows for user-definable ordering of lists. I've written about it here. There is also a php implementation.

CREATE TABLE fruit ( "id" serial NOT NULL, "name" text NOT NULL, PRIMARY KEY ("id") ); 
INSERT INTO fruit (name) VALUES ('Apple');
INSERT INTO fruit (name) VALUES ('Orange'); 
INSERT INTO fruit (name) VALUES ('Pear');
CREATE TABLE user_preferences (fk_user int4, fruits text[]);
INSERT INTO user_preferences (fk_user, fruits) VALUES (1, ARRAY['Pear','Orange','Apple'])
SELECT 
	fruit.id, 
	fruit.name 
FROM ( 
	SELECT 
		s.position, 
		p.fruits[s.position] AS name 
	FROM user_preferences p, 
	generate_series(1, array_length(p.fruits, 1)) AS s(position) 
	WHERE p.fk_user = 1 
) x 
JOIN fruit ON (fruit.name=x.name) 
ORDER BY x.position asc;
+----+--------+
| id | name   | 
+----+--------+
| 3  | Pear   |
| 2  | Orange |
| 1  | Apple  |
+----+--------+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment