This test application demonstrates an issue with Cosmos DB where the request unit cost (RU) for a SQL query on id
is unexpectedly high and increases linearly with the number of documents in the collection. This issue only occurs when the collection is partitioned and the partition key is not id
.
Cosmos DB provides an API to retrieve a document by ID and this operation will always have a RU cost that is "low". This cost essentially depends on the size of the document and as the number of documents increases this cost will not increase.
So even though there is an API to retrieve a document by ID in some cases it is still useful to perform SQL queries that involves the id
property. The IDs may encode the type of the document allowing for retrieval of specific document types by using queries like
SELECT * FROM root WHERE STARTSWITH(root.id, "Foo")
The Microsoft Azure Storage Explorer also uses SQL to retrieve a document by ID:
SELECT * FROM c WHERE c.id = "Foobar"
The issue demonstrated by this test application can easily make the above sample queries require so many request units that the query fails. When this happens Microsoft Azure Storage Explorer will simply not display any results and you may belive that you have entered an incorrect ID when in fact the document exists.
The test application is configured to use the emulator but can easily be modified to use a specific Cosmos DB instance.
The application creates the database Test
and in this database two document collections, Fast
and Slow
, are created. To speed things up the offer throughput for these collections are set to 10000. Note that this may incur a cost as long as these document collections exists in Azure. The test application will delete the database when the test is done so the actual cost will be quite small.
The only difference between the two document collections are the partition key. In Slow
the partition key is defined as /key
and in Fast
it is defined as /id
.
The test application creates 5000 documents in each collection. A single document is then retrieved by ID from each collection using the following SQL:
SELECT * FROM root WHERE root.id = "..."
The request unit cost of this query is then reported.
In the Fast
document collection the request unit cost is less than 3. This is the collection where the id
property is the partition key. Changing the number of documents in the collection to something like 10000 will not change this cost.
In the Slow
document collection the request unit cost is close to 90. This is the collection where the key
property is the partition key. Doubling the number of documents in the collection will almost double the cost. With enough documents in the collection the cost to retrieve a single document by ID will be so high that the system trying to retrieve the document will become very slow which often can lead to failures in distributed systems (timeouts etc.).
I have been in contact with Cosmos DB support and they have confirmed the issue. Fortunately, they have made some changes to how indexing is performed and a Cosmos DB instance using the new indexing strategy no longer suffers from this problem. Unfortunately, the new indexing strategy is not yet available by default so I had to ask Cosmos DB support to upgrade my Cosmos DB instances to use the new strategy (this will incur some downtime). Hopefully, the new strategy will be the default for all new Cosmos DB instances in the future.
Without knowing the exact details on how Cosmos DB is implemented I believe that executing a SQL query on id
in a document collection with a partition key that is not id
will result in a scan of all documents leading to a request unit cost that increases linearly with the total size of the documents in the collection.
If the query involves other properties that are indexed the number of documents to scan can be greatly reduced bringing the request unit cost down. It is not possible to configure an index on id
that will remove the need to perform the scan.