@smithsz
Last active July 4, 2016 21:26
The Cloudant Changes Feed

#A Changes Feed Example

##Create a new database:

curl -XPUT https://samsmith.cloudant.com/new_database

By default, this creates an n=3, q=4 database.

This means the database is split into 4 shard ranges (q=4), and each shard range is stored three times (n=3).

You can see how any database is sharded via the API:

curl -XGET https://samsmith.cloudant.com/new_database/_shards

{
  "shards": {
    "00000000-3fffffff": [
      "dbcore@db1.mead.cloudant.net",
      "dbcore@db2.mead.cloudant.net",
      "dbcore@db3.mead.cloudant.net"
    ],
    "40000000-7fffffff": [
      "dbcore@db1.mead.cloudant.net",
      "dbcore@db2.mead.cloudant.net",
      "dbcore@db3.mead.cloudant.net"
    ],
    "80000000-bfffffff": [
      "dbcore@db1.mead.cloudant.net",
      "dbcore@db2.mead.cloudant.net",
      "dbcore@db3.mead.cloudant.net"
    ],
    "c0000000-ffffffff": [
      "dbcore@db1.mead.cloudant.net",
      "dbcore@db2.mead.cloudant.net",
      "dbcore@db3.mead.cloudant.net"
    ]
  }
}
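The four range boundaries above follow from dividing the 32-bit hash space into q equal parts. A minimal sketch of that arithmetic, assuming q divides 2^32 evenly (the function name `shard_ranges` is made up for illustration and is not part of the Cloudant API):

```python
def shard_ranges(q):
    """Split the 32-bit hash space into q equal, contiguous ranges.

    Assumes q divides 2**32 evenly (for example, a power of two).
    """
    size = 2**32 // q
    ranges = []
    for i in range(q):
        start = i * size
        end = start + size - 1
        # Format both ends as zero-padded 8-digit hex, as in the _shards output.
        ranges.append(f"{start:08x}-{end:08x}")
    return ranges


print(shard_ranges(4))
```

For q=4 this reproduces the four keys shown in the `_shards` response.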

##Add some docs

We now add 10 documents to this database: doc1, doc2, …, doc10.

You can see which shard range is holding a particular document via the API.

Let's see which range is holding doc1:

curl -XGET https://samsmith.cloudant.com/new_database/_shards/doc1

{
  "range": "c0000000-ffffffff",
  "nodes": [
    "dbcore@db1.mead.cloudant.net",
    "dbcore@db2.mead.cloudant.net",
    "dbcore@db3.mead.cloudant.net"
  ]
}
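Conceptually, a document lands in a range because its ID is hashed to a 32-bit number and the number falls inside exactly one range. The sketch below illustrates the idea only. It assumes CRC32 over the UTF-8 bytes of the ID as the hash, which may not match what Cloudant actually does, so its answer for doc1 need not agree with the API response above. The `_shards/{docid}` endpoint remains the authoritative source.

```python
import zlib

RANGES = [
    "00000000-3fffffff",
    "40000000-7fffffff",
    "80000000-bfffffff",
    "c0000000-ffffffff",
]


def range_for(doc_id, ranges=RANGES):
    """Return the shard range whose bounds contain the hash of doc_id."""
    # ASSUMPTION: CRC32 stands in for the real hash function.
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value in Python 3
    for shard_range in ranges:
        start, end = (int(part, 16) for part in shard_range.split("-"))
        if start <= h <= end:
            return shard_range
    raise ValueError("ranges do not cover the full 32-bit hash space")


print(range_for("doc1"))
```

The point is that placement is a pure function of the document ID: the same ID always maps to the same range, regardless of which node answers.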

##Querying _changes

curl -XGET https://samsmith.cloudant.com/new_database/_changes

Our first query returns this sequence of updates:

seq: 1-XXXX		id: doc3
seq: 2-XXXX		id: doc4
seq: 3-XXXX		id: doc2
seq: 4-XXXX		id: doc7
seq: 5-XXXX		id: doc8
seq: 6-XXXX		id: doc6
seq: 7-XXXX		id: doc10
seq: 8-XXXX		id: doc1
seq: 9-XXXX		id: doc5
seq: 10-XXXX	id: doc9

However, our second query returns this sequence of updates:

seq: 1-XXXX		id: doc3
seq: 2-XXXX		id: doc1
seq: 3-XXXX		id: doc4
seq: 4-XXXX		id: doc2
seq: 5-XXXX		id: doc7
seq: 6-XXXX		id: doc8
seq: 7-XXXX		id: doc5
seq: 8-XXXX		id: doc6
seq: 9-XXXX		id: doc9
seq: 10-XXXX	id: doc10

You’ll notice that the ordering is not consistent between the two queries.

To see why, we need to know which documents are held by each of the 4 shard ranges:

shard range: 00000000-3fffffff	-	holds docs: doc3, doc7
shard range: 40000000-7fffffff	-	holds docs: doc2, doc6, doc10
shard range: 80000000-bfffffff	-	holds docs: doc4, doc8
shard range: c0000000-ffffffff	-	holds docs: doc1, doc5, doc9

With this in mind, let’s look back at the two differing _changes results from earlier.

Although the overall ordering differs, doc3 always has a lower update seq than doc7. Similarly, doc6 always has a lower update seq than doc10. This is because these docs are from the same shard range and the update histories of the 3 shard copies were identical.

If the shard copies of a particular range had different update histories, even this partial ordering would not hold.

This fact isn’t especially useful on its own, but it is important for understanding how the changes result is generated.
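The partial ordering can be checked mechanically against the two query results and the shard table shown earlier. The sketch below assumes that the order in which each range's documents are listed in the table is the order in which they were written to that range. The helper name `respects_shard_order` is made up for illustration.

```python
# Documents held by each shard range, copied from the table above.
SHARDS = {
    "00000000-3fffffff": ["doc3", "doc7"],
    "40000000-7fffffff": ["doc2", "doc6", "doc10"],
    "80000000-bfffffff": ["doc4", "doc8"],
    "c0000000-ffffffff": ["doc1", "doc5", "doc9"],
}

# Document order returned by the first and second _changes queries.
FIRST = ["doc3", "doc4", "doc2", "doc7", "doc8",
         "doc6", "doc10", "doc1", "doc5", "doc9"]
SECOND = ["doc3", "doc1", "doc4", "doc2", "doc7",
          "doc8", "doc5", "doc6", "doc9", "doc10"]


def respects_shard_order(feed, shards):
    """True if docs from the same shard range appear in their per-shard order."""
    position = {doc_id: i for i, doc_id in enumerate(feed)}
    for docs in shards.values():
        positions = [position[doc_id] for doc_id in docs]
        if positions != sorted(positions):
            return False
    return True


print(FIRST == SECOND)                       # the overall order differs
print(respects_shard_order(FIRST, SHARDS))   # per-shard order holds
print(respects_shard_order(SECOND, SHARDS))  # per-shard order holds
```

Both feeds contain the same ten documents and both pass the check, even though the interleaving across ranges differs.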

Our full changes history so far...

curl -XGET https://samsmith.cloudant.com/new_database/_changes

{
  "results": [
    {
      "seq": "1-g1AAAAEEeJzLYWBgYMlgTmGQSUlKzi9KdUhJMtTLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGcyJjLlCAPSXZNDXN3JiAXuIMT3IAkkn1IPMTGYjTkscCJBkagBRQ134StR2AaAPZlgUA91NUQw",
      "id": "doc3",
      "changes": [
        {
          "rev": "1-967a00dff5e02add41819138abb3284d"
        }
      ]
    },
    ...
    {
      "seq": "9-g1AAAAFseJzLYWBgYMlgTmGQSUlKzi9KdUhJMtTLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGUyJTLlCAPSXZNDXN3Jg43UkOQDKpHmwAM9SARHPL5GRTAwLaiTM_jwVIMjQAKaAV-5HsMDQyMbBIoqIdByB2gP3BDLYj2TTRLMkslYARWQBehXKh",
      "id": "doc9",
      "changes": [
        {
          "rev": "1-967a00dff5e02add41819138abb3284d"
        }
      ]
    },
    {
      "seq": "10-g1AAAAEueJzLYWBgYMlgTmGQSUlKzi9KdUhJMtbLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGUyJTLlCAPcXA1NAizRxVtxEO3UkOQDKpHmwAcyIz2IBkSwPTJHNDAtqJc10eC5BkaABSQCv2IxyZZmGQYppqSoohByCGgH0Kcai5cYqloUlyFgCF0Vwn",
      "id": "doc10",
      "changes": [
        {
          "rev": "1-967a00dff5e02add41819138abb3284d"
        }
      ]
    }
  ],
  "last_seq": "10-g1AAAAEPeJzLYWBgYMlgTmGQSUlKzi9KdUhJMtbLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGUyJTLlCAPcXA1NAizRxVtxEO3UkOQDKpHmoAM9iAZEsD0yRzQ-Ksz2MBkgwNQApoxn6EK9IsDFJMU01JMeQAxBAkl5gbp1gamiRnAQAublE9",
  "pending": 0
}

Each update seq, once decoded, tells us several things:

  • Which shard copies were used to create the changes result.
  • The highest seq the client has seen from each of the shard ranges.

For example, the update seq of the last change (doc10) decodes to the following:

10-g1AAAAEueJzLYWBgYMlgTmGQSUlKzi9KdUhJMtbLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGUyJTLlCAPcXA1NAizRxVtxEO3UkOQDKpHmwAcyIz2IBkSwPTJHNDAtqJc10eC5BkaABSQCv2IxyZZmGQYppqSoohByCGgH0Kcai5cYqloUlyFgCF0Vwn

[
  {'dbcore@db3.mead.cloudant.net', '00000000-3fffffff', 2},
  {'dbcore@db2.mead.cloudant.net', '40000000-7fffffff', 3},
  {'dbcore@db2.mead.cloudant.net', '80000000-bfffffff', 3},
  {'dbcore@db3.mead.cloudant.net', 'c0000000-ffffffff', 3}
]

The above tells us that for the shard range 00000000-3fffffff, we’ve chosen to stream changes from the copy on node db3.mead. It also says the client has seen all changes up to seq 2 from this shard.

Similarly, for the shard range 40000000-7fffffff, we’ve chosen the copy on db2.mead, and the client has seen all changes up to seq 3.

...and so on.
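One way to read the decoded vector is as a lookup table from shard range to the copy being streamed and the progress made on it. The sketch below simply rewrites the decoded list above as Python tuples. The variable and function names are made up for illustration, and clients should still treat the encoded seq string itself as opaque.

```python
# The decoded vector from the text, rewritten as Python tuples:
# (node holding the shard copy, shard range, highest seq seen from that copy)
DECODED = [
    ("dbcore@db3.mead.cloudant.net", "00000000-3fffffff", 2),
    ("dbcore@db2.mead.cloudant.net", "40000000-7fffffff", 3),
    ("dbcore@db2.mead.cloudant.net", "80000000-bfffffff", 3),
    ("dbcore@db3.mead.cloudant.net", "c0000000-ffffffff", 3),
]


def progress_by_range(decoded):
    """Index the vector by shard range: range -> (node, highest seq seen)."""
    return {shard_range: (node, seq) for node, shard_range, seq in decoded}


progress = progress_by_range(DECODED)
node, seq = progress["00000000-3fffffff"]
print(f"streaming from {node}, seen up to seq {seq}")
```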

We can pass any update seq into a new _changes query as the since parameter. This ensures that the changes are gathered from the same set of internal shard copies (if available). The result will show all changes that we have not already seen (we'll see later why it might also include changes we have already seen).

Let's add doc11 and doc12 and query _changes using our last_seq as the since parameter:
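The effect of since can be pictured as a per-range filter: for each shard range, return only the updates above the seq the client has already seen. The sketch below is a toy model of that idea, not Cloudant's implementation. The logs, the `new_doc` entry, and the function name `unseen_changes` are all invented for illustration.

```python
def unseen_changes(shard_logs, seen):
    """Return (range, seq, doc id) for every update above the client's mark."""
    result = []
    for shard_range, log in shard_logs.items():
        high_water = seen.get(shard_range, 0)  # 0 means nothing seen yet
        result.extend(
            (shard_range, seq, doc_id)
            for seq, doc_id in log
            if seq > high_water
        )
    return result


# Toy per-range update logs: (shard-local seq, doc id). Hypothetical data.
logs = {
    "00000000-3fffffff": [(1, "doc3"), (2, "doc7"), (3, "new_doc")],
    "40000000-7fffffff": [(1, "doc2"), (2, "doc6"), (3, "doc10")],
}
# What the client's since value says it has seen from each range.
seen = {"00000000-3fffffff": 2, "40000000-7fffffff": 3}

print(unseen_changes(logs, seen))
```

Only the update the client has not yet seen comes back. A client with an empty `seen` mapping would get every update from every range.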

curl "https://samsmith.cloudant.com/new_database/_changes?since=10-g1AAAAEPeJzLYWBgYMlgTmGQSUlKzi9KdUhJMtbLTU1M0UvOyS9NScwr0ctLLckBqmJKZEiy____f1YGUyJTLlCAPcXA1NAizRxVtxEO3UkOQDKpHmoAM9iAZEsD0yRzQ-Ksz2MBkgwNQApoxn6EK9IsDFJMU01JMeQAxBAkl5gbp1gamiRnAQAublE9"
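When building this URL in code rather than by hand, it is safer to let a library encode the since value than to paste it into the query string. A minimal sketch using only the Python standard library follows. The helper `changes_url` is made up for illustration, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def changes_url(base_url, database, since=None):
    """Build a _changes URL, percent-encoding the since value if given."""
    url = f"{base_url}/{database}/_changes"
    if since is not None:
        url += "?" + urlencode({"since": since})
    return url


# Shortened placeholder; a real seq is the full string returned as last_seq.
last_seq = "10-XXXX"
url = changes_url("https://samsmith.cloudant.com", "new_database", last_seq)
print(url)

# Round-trip check: decoding the query string gives back the original value.
print(parse_qs(urlsplit(url).query)["since"][0] == last_seq)
```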

And we get…

seq: 11-XXXX	id: doc11
seq: 12-XXXX	id: doc12

However, because doc11 and doc12 landed in different shard ranges, their relative order is not fixed. When we query again, the ordering flips:

seq: 11-XXXX	id: doc12
seq: 12-XXXX	id: doc11

The point I wish to stress here is that you will always see every change.

I mentioned that using the since parameter in your queries gathers changes from the same set of internal shards. However, what if a node is down and the shard we want is not available?

I don't want to get into the weeds here, but suffice it to say that we choose a substitute shard copy and stream changes from that. The tricky bit is when the substitute has a different update history. We then have to find a suitable seq from which to begin streaming so that no changes are missed. This often reintroduces changes that the client has already seen. Again, the single guarantee that we make here is that the client will see every change at least once. I can go into further detail here if required.
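Because the guarantee is at least once rather than exactly once, a consumer of the feed may want to make its processing tolerant of replays. One possible approach, sketched below, is to remember which (doc id, rev) pairs have been handled and skip repeats. The function name, the in-memory set, and the rev values are hypothetical. The row shape follows the _changes output shown earlier.

```python
def apply_changes(changes, processed, handler):
    """Call handler once per (doc id, rev), even if the feed replays a row."""
    for change in changes:
        for rev_entry in change["changes"]:
            key = (change["id"], rev_entry["rev"])
            if key in processed:
                continue  # already handled in an earlier batch
            handler(change["id"], rev_entry["rev"])
            processed.add(key)


handled = []
processed = set()


def record(doc_id, rev):
    handled.append((doc_id, rev))


# Hypothetical batches: the second one replays doc12 after a shard substitution.
first_batch = [
    {"id": "doc11", "changes": [{"rev": "1-aaa"}]},
    {"id": "doc12", "changes": [{"rev": "1-bbb"}]},
]
replayed_batch = [
    {"id": "doc12", "changes": [{"rev": "1-bbb"}]},
    {"id": "doc13", "changes": [{"rev": "1-ccc"}]},
]

apply_changes(first_batch, processed, record)
apply_changes(replayed_batch, processed, record)
print(handled)
```

In a real consumer the `processed` set would need to be persisted or bounded in some way. An in-memory set is used here only to keep the sketch short.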
