Azure Cosmos DB Indexing Experiments
Azure Cosmos DB’s pricing by “Request Units” (RUs) effectively means that the more you optimize your data structures and queries, the more money you save. While this is also true for databases like SQL Server, the link between optimization and pricing there is more indirect than in Cosmos, where you pay for a specific throughput capacity.
To that end, it is important to understand how to structure your data to take advantage of cost savings with Cosmos. I think early horror stories of cost issues were partially due to the immature technology (you can see some examples of how recent improvements reduced RUs for common operations) and perhaps partially due to poor data design.
For me, it has been important to actually test different scenarios in order to understand how to structure data for the best cost and performance, cut through some of those early horror stories, and avoid making the same mistakes.
Because of the document-oriented nature of the database, it is common to end up with embedded references to external documents to reduce the number of operations required to display a record.
Consider the following structure:
```js
{
  orderId: 12934,
  itemsRef: [
    { itemId: 3842, itemName: "Paper Clips (100 ct)" },
    { itemId: 1569, itemName: "Staples (500 ct)" }
  ]
}
```
The main document is an Order which contains references to Items. By embedding some key information into the root Order document, we can reduce the cost of loading the Order.
But if we want to now find all Orders which contain a specific Item, we have the following options:
- Keep track of Item to Order mapping on the Item – bad idea since it means the document size has no bound.
- Keep track of Item to Order mapping in some other storage structure – not a bad idea and could be very performant, but requires multiple systems and increases complexity.
- Query across Orders using the itemsRef property – consumes our RUs and performance depends on many factors including data and index design.
Option 3 is what we are interested in, but the question is what’s the best way to represent this in Cosmos. One option is as indicated above. A second option is like so:
```js
{
  orderId: 12934,
  itemsRef: [
    "3842|Paper Clips (100 ct)",
    "1569|Staples (500 ct)"
  ]
}
```
In this design, we encode the values as a single string. This may be more performant depending on how the indexer behind the scenes handles strings versus arrays of objects. Of course, there are tradeoffs: in the first case, we can query by itemId even without knowing the name, and if the name changes we are still OK. In the second scenario, we always have to keep an “original name” or “static name” which we use specifically for looking up the object as a reference.
So the question is: how will Cosmos behave? What can we expect?
To find out, I wrote a simple console app to create 1,000,000 documents with a randomly selected number of embedded references (between 3 and 8) from a set of 50,000 possible data points. Each document looks like so:
```json
{
  "id": "b0c7d016-381e-42d0-b5f1-4c5ac970734f",
  "partition": "25",
  "index": 15,
  "usersString": [
    "6385f1f2-0c16-4f07-9cc4-f8591b0ea059|Zoe Bond",
    "1eb49261-0dbd-49ab-a319-e56bfe89650e|Colten Rodriguez",
    "01edc111-3cf6-4cbf-ac08-e357659ee6c0|Andy Lindsey"
  ],
  "usersRef": [
    {
      "id": "6385f1f2-0c16-4f07-9cc4-f8591b0ea059",
      "label": "Zoe Bond"
    },
    {
      "id": "1eb49261-0dbd-49ab-a319-e56bfe89650e",
      "label": "Colten Rodriguez"
    },
    {
      "id": "01edc111-3cf6-4cbf-ac08-e357659ee6c0",
      "label": "Andy Lindsey"
    }
  ],
  "_rid": "VcwpAIvh+NABAAAAAAAABA==",
  "_self": "dbs/VcwpAA==/colls/VcwpAIvh+NA=/docs/VcwpAIvh+NABAAAAAAAABA==/",
  "_etag": "\"00000000-0000-0000-18b8-1353aa9a01d6\"",
  "_attachments": "attachments/",
  "_ts": 1587568193
}
```
The questions we want to answer are:
- Is it more cost effective to query by usersString?
- Is there any additional cost to querying by usersRef?
- Is there any difference in performance based on how the index strategy is specified?
Let’s query using usersString with the default indexing policy. We execute the following query:
```sql
SELECT * FROM c
WHERE ARRAY_CONTAINS(c.usersString, '01edc111-3cf6-4cbf-ac08-e357659ee6c0|Andy Lindsey')
```
Then we query using usersRef, again with the default indexing policy:
```sql
SELECT * FROM c
WHERE ARRAY_CONTAINS(c.usersRef, {'id': '01edc111-3cf6-4cbf-ac08-e357659ee6c0'}, true)
```
Cosmos’ ARRAY_CONTAINS allows partial matches of JSON objects when its third parameter is set to true, so we can effectively pass it a prototype of the entity we are looking for in the embedded reference.
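If you want to compare the two queries from code rather than from the portal’s query stats, a sketch like this sums the RU charge reported on each response page (endpoint, key, and database/container names are placeholder assumptions; .NET SDK v3):

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Minimal sketch (endpoint, key, and database/container names are assumptions):
// run each query and total the RU charge across all response pages.
var client = new CosmosClient("<endpoint>", "<key>");
Container container = client.GetContainer("experiments", "orders");

var queries = new[]
{
    "SELECT * FROM c WHERE ARRAY_CONTAINS(c.usersString, '01edc111-3cf6-4cbf-ac08-e357659ee6c0|Andy Lindsey')",
    "SELECT * FROM c WHERE ARRAY_CONTAINS(c.usersRef, {'id': '01edc111-3cf6-4cbf-ac08-e357659ee6c0'}, true)"
};

foreach (var sql in queries)
{
    double totalRu = 0;
    var iterator = container.GetItemQueryIterator<dynamic>(new QueryDefinition(sql));
    while (iterator.HasMoreResults)
    {
        FeedResponse<dynamic> page = await iterator.ReadNextAsync();
        totalRu += page.RequestCharge;   // RU charge is reported per page
    }
    Console.WriteLine($"Total RUs: {totalRu}");
}
```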
Let’s look at the results:
Surprisingly, the RUs are exactly the same. However, the compute load of the JSON matching is noticeably higher than the compute time of the string match. As best I can tell, the system function execution time of the string match is around 0.02 ms (the stats occasionally show 0.02 ms instead of 0). This is not bad at all, even in a compute-sensitive runtime environment like Azure Functions (which has a memory-time price component).
One question I had was whether the index policy could be tuned to improve the performance. According to a Stack Overflow post from 2018, it was possible to change index performance by using hash versus range indexes for specific cases of array searches (but this was right around the time that the Cosmos team changed how indexing worked). I customized the index policy:
```json
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/usersString/[]/?"
    },
    {
      "path": "/usersRef/[]/id/?"
    }
  ],
  "excludedPaths": [
    {
      "path": "/*"
    },
    {
      "path": "/\"_etag\"/?"
    }
  ]
}
```
So rather than using the default “index everything” policy, we specifically designate the indexes to create.
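The policy can be applied from the portal, but for completeness, updating it from code would look roughly like this sketch (endpoint, key, and database/container names are placeholder assumptions; .NET SDK v3):

```csharp
using Microsoft.Azure.Cosmos;

// Minimal sketch (endpoint, key, and database/container names are assumptions):
// replace the container's indexing policy so only the two reference paths are indexed.
var client = new CosmosClient("<endpoint>", "<key>");
Container container = client.GetContainer("experiments", "orders");

ContainerResponse current = await container.ReadContainerAsync();
ContainerProperties properties = current.Resource;

properties.IndexingPolicy = new IndexingPolicy
{
    IndexingMode = IndexingMode.Consistent,
    Automatic = true
};
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/usersString/[]/?" });
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/usersRef/[]/id/?" });
properties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });
properties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/\"_etag\"/?" });

await container.ReplaceContainerAsync(properties);
```

Note that after replacing the policy, the index transformation runs in the background, so measurements should be taken only after it completes.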
It turns out that there’s no difference in performance.
The conclusion is that it makes more sense to use an object reference and partial JSON matches than string matches. There is an additional compute cost associated with this approach, but I think it’s worth it for ease of use and a better application design.