I have a PostgreSQL table converted from a MongoDB collection. The layout for the PostgreSQL table is:
```sql
create table "objects" (
    "data" jsonb not null check ("data" ? '_key')
);

create index "idx__objects__key__score" on "objects" (("data"->>'_key') asc, (("data"->>'score')::numeric) desc);
create unique index "uniq__objects__key" on "objects" (("data"->>'_key')) where not ("data" ? 'score');
create unique index "uniq__objects__key__value" on "objects" (("data"->>'_key') asc, ("data"->>'value') desc);
create index "idx__objects__expireAt" on "objects" ((("data"->>'expireAt')::numeric) asc) where "data" ? 'expireAt';
```
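For reference, a sketch of the kind of lookup these indexes are presumably meant to serve (identifiers are from the DDL above; the query shape is my assumption about how the application reads a `zset`, with `'somekey'` as a placeholder key):

```sql
-- Read one zset in descending-score order (as ZREVRANGE would),
-- which idx__objects__key__score can satisfy without a sort.
select "data"->>'value'                as "value",
       ("data"->>'score')::numeric    as "score"
from "objects"
where "data"->>'_key' = 'somekey'
order by ("data"->>'score')::numeric desc;
```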
The data in the MongoDB collection is formatted in one of five ways:

- A `_key` field (string) and a `value` field (string). This corresponds to the Redis `string` type.
- A `_key` field (string) and a `members` field (array of strings, unique, order not important). This corresponds to the Redis `set` type.
- A `_key` field (string) and an `array` field (array of strings). This corresponds to the Redis `list` type.
- A `_key` field (string), a `value` field (string), and a `score` field (number). This corresponds to the Redis `zset` type.
- A `_key` field (string) and neither a `value` nor a `score` field, but any number of other string fields (or number fields, though it is acceptable to replace a number with its string representation when converting). This corresponds to the Redis `hash` type.
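To make the five shapes concrete, hypothetical example rows (all keys and values invented for illustration) might look like:

```sql
-- string
insert into "objects" ("data") values ('{"_key": "greeting", "value": "hello"}');
-- set
insert into "objects" ("data") values ('{"_key": "tags", "members": ["a", "b", "c"]}');
-- list
insert into "objects" ("data") values ('{"_key": "queue", "array": ["first", "second"]}');
-- zset (one row per member; same _key repeats across rows)
insert into "objects" ("data") values ('{"_key": "board", "value": "alice", "score": 42}');
insert into "objects" ("data") values ('{"_key": "board", "value": "bob",   "score": 42}');
-- hash
insert into "objects" ("data") values ('{"_key": "user:1", "name": "alice", "age": "30"}');
```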
Apart from the `zset` type, `_key` must be unique. This means that a given `_key` will always refer to a specific one of the five types of data, and that there are never duplicate entries for the same `_key` for four of the five types. Additionally, `zset` entries always have a unique `_key` + `value` pair, although the `score` may be the same between different `zset` entries.
Additionally, each row in the table may contain a field named `expireAt`, which is a number representing the number of milliseconds after the Unix epoch at which the row should be deleted. This field is not considered when applying the rules in the list above. The inconsistency this causes with `zset` values is currently unsolved in the MongoDB version, so we follow the Redis rules where possible and delete all elements of a `zset` when it expires. I do not think `zset` data is ever assigned an expiration in the application that generates this data, but it would be good to handle this gracefully if practical.
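Assuming expiry is enforced by a periodic sweep (the question doesn't say how deletion actually happens, so this is only a guess at the shape), the partial `expireAt` index would support something like:

```sql
-- Delete every row whose expireAt has passed. Because
-- idx__objects__expireAt is partial on ("data" ? 'expireAt'),
-- this is a range scan over only the rows that carry the field.
delete from "objects"
where "data" ? 'expireAt'
  and ("data"->>'expireAt')::numeric <= extract(epoch from now()) * 1000;
```

Note that `extract(epoch from now())` yields seconds, so it is multiplied by 1000 to match the millisecond convention described above.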
I could leave this as-is, but query performance is terrible because PostgreSQL can't derive any useful statistics from this mess of data. It is also very hard to create a constraint that validates a single-column table acting like five different tables without melting the face of anyone who reads it.
Because fixing the entire database to use a sensible relational layout is outside the scope of this project, the data will need to be retrievable in the same format as the original.
I would consider the data set I will be testing with to be medium-sized. The `objects` table is 7159 MiB, with an empty `expireAt` index, a 1465 MiB `key__score` index, a 305 MiB `key` index, and a 1435 MiB `key__value` index. There are between 32 million and 33 million rows in the `objects` table.