@jcoglan
Created July 2, 2021 14:21

How does CouchDB sort strings?

Create documents:

$ acurl -X PUT "$host/encoding/a" -d '{ "cp": "U+F925", "data": "chr: 拉" }'
{"ok":true,"id":"a","rev":"1-5f4eb3548d021ed8d02b3e2622783b52"}

$ acurl -X PUT "$host/encoding/b" -d '{ "cp": "U+1F631", "data": "chr: 😱" }'
{"ok":true,"id":"b","rev":"1-853e9ee71d55c2312d9f2c5fc6956ab6"}

Check which encoding the data is returned in:

$ acurl -X GET "$host/encoding/_all_docs?include_docs=true" \
    | jq -r '.rows | .[] | .doc.data' \
    | hexdump -C

00000000  6e 75 6c 6c 0a 63 68 72  3a 20 ef a4 a5 0a 63 68  |null.chr: ....ch|
00000010  72 3a 20 f0 9f 98 b1 0a                           |r: .....|
00000018

The first character (拉) is ef a4 a5 and the second (😱) is f0 9f 98 b1, so the data is returned as UTF-8. Based on both their codepoints and their UTF-8 bytes, we'd expect 拉 to sort before 😱.
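As a quick cross-check of the hexdump reading, the sketch below decodes the two byte sequences as UTF-8 and prints the resulting codepoints. The helper name `decode_utf8` is made up for this example; the byte strings are copied from the dump above.

```python
# Byte sequences copied from the hexdump above, keyed by docid.
observed = {"a": "ef a4 a5", "b": "f0 9f 98 b1"}


def decode_utf8(hex_bytes):
    """Decode space-separated hex bytes as UTF-8; return codepoints as U+XXXX."""
    # bytes.decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]


for docid, hex_bytes in observed.items():
    print(docid, decode_utf8(hex_bytes))
```

Each sequence decodes to exactly one codepoint, matching the `cp` field stored in the corresponding document.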

In UTF-16BE, 拉 is f9 25, and 😱 is d8 3d de 31, so 😱 would sort before 拉.

In UTF-16LE, 拉 is 25 f9, and 😱 is 3d d8 31 de, so 😱 would sort after 拉.
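The UTF-16 bytes quoted above can be derived by hand. 拉 fits in a single 16-bit code unit, while 😱 lies above U+FFFF and needs a surrogate pair. A minimal sketch of that arithmetic follows; the function names are illustrative, and it assumes the input is a valid codepoint outside the surrogate range.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units for one codepoint (assumes cp is not a surrogate)."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000              # 20-bit value split across two units
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


def utf16_bytes(cp, byteorder):
    """Serialize the code units with the given byte order ('big' or 'little')."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in utf16_code_units(cp))


for cp in (0xF925, 0x1F631):
    be = utf16_bytes(cp, "big").hex(" ")
    le = utf16_bytes(cp, "little").hex(" ")
    print(f"U+{cp:04X}  BE: {be}  LE: {le}")
```

The high surrogate D83D is numerically smaller than F925, which is why 😱 sorts first in big-endian order even though its codepoint is larger.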

Two documents can only separate UTF-16BE from the other candidates. To also tell UTF-16LE apart from codepoint or UTF-8 order we need a third example: U+F946, 牢.

$ acurl -X PUT "$host/encoding/c" -d '{ "cp": "U+F946", "data": "chr: 牢" }'
{"ok":true,"id":"c","rev":"1-dd58f21b9fedf2322abd2a8485f56321"}

Full example set:

docid | char | codepoint | UTF-8       | UTF-16BE    | UTF-16LE
------+------+-----------+-------------+-------------+------------
a     | 拉   | U+F925    | EF A4 A5    | F9 25       | 25 F9
b     | 😱   | U+1F631   | F0 9F 98 B1 | D8 3D DE 31 | 3D D8 31 DE
c     | 牢   | U+F946    | EF A5 86    | F9 46       | 46 F9

We'd expect the following sort orders of the documents (by docid) under each representation:

  • codepoint: A, C, B
  • UTF-8: A, C, B
  • UTF-16BE: B, A, C
  • UTF-16LE: A, B, C
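These predicted orders can be reproduced by sorting the three characters on each representation. This is only a sketch of the expectation, not of what CouchDB does internally. The characters are written as escapes so that an editor cannot silently normalize the compatibility ideographs; `predicted_order` is a name made up for this example.

```python
# Characters keyed by docid, written as escapes to avoid normalization.
docs = {"A": "\uf925", "B": "\U0001f631", "C": "\uf946"}

# One sort key per candidate representation.
keys = {
    "codepoint": lambda s: [ord(ch) for ch in s],
    "UTF-8": lambda s: s.encode("utf-8"),
    "UTF-16BE": lambda s: s.encode("utf-16-be"),
    "UTF-16LE": lambda s: s.encode("utf-16-le"),
}


def predicted_order(key):
    """Return docids sorted by the given key applied to each document's character."""
    return sorted(docs, key=lambda docid: key(docs[docid]))


for name, key in keys.items():
    print(f"{name}: {', '.join(predicted_order(key))}")
```

Codepoint order and UTF-8 byte order always agree, so this experiment cannot tell those two apart; it can only separate them from the two UTF-16 byte orders.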

Create index:

$ acurl -X POST "$host/encoding/_index" \
    -d '{ "ddoc": "by-data", "type": "json", "index": { "fields": ["data"] } }'

{"result":"created","id":"_design/by-data","name":"1f39708131deed6f5649d8c9447ae53729ceb7ef"}

Mango query, sort by data:

$ acurl -X POST "$host/encoding/_find" \
    -d '{ "selector": {}, "sort": [{ "data": "asc" }] }'

{"docs":[
{"_id":"b","_rev":"1-853e9ee71d55c2312d9f2c5fc6956ab6","cp":"U+1F631","data":"chr: 😱"},
{"_id":"a","_rev":"1-5f4eb3548d021ed8d02b3e2622783b52","cp":"U+F925","data":"chr: 拉"},
{"_id":"c","_rev":"1-dd58f21b9fedf2322abd2a8485f56321","cp":"U+F946","data":"chr: 牢"}
],
"bookmark": "g1AAAAA_eJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozJoMkOGASOSAhkDhHckaRlcL7pW1ZWQARIRGp"}

The order is B, A, C, which is consistent with UTF-16BE byte order and with none of the other candidates.
