Gist by @cscotta, created November 6, 2011.
cscotta@ordasity:~/Desktop$ scala -cp mongo-2.7.0.jar Repro.scala
Inserting canary...
Inserting test data...
Paging through records...
Spotted the canary!
Updating canary object...
Spotted the canary!
Whoops, shipped the same order multiple times!
cscotta@ordasity:~/Desktop$
import com.mongodb._
import java.util.UUID
// Connect to Mongo
val mongo = new Mongo("localhost", 27017)
val db = mongo.getDB("repro_databoxor")
val collection = db.getCollection("repro")
var canarySightings = 0
// Insert our "canary" object.
println("Inserting canary...")
val canary = new BasicDBObject()
canary.put("name", "canary")
canary.put("value", "value")
collection.insert(canary)
// Insert 100,000 other objects.
println("Inserting test data...")
for (i <- 1 to 100000) {
  val doc = new BasicDBObject()
  doc.put("name", UUID.randomUUID.toString)
  doc.put("value", UUID.randomUUID.toString)
  collection.insert(doc)
}
// The function we'll call to operate on records returned from the DB.
def shipOrderToCustomer(doc: DBObject) {
  if (doc.get("name") == "canary") {
    canarySightings += 1
    println("Spotted the canary!")
    if (canarySightings > 1) println("Whoops, shipped the same order multiple times!")
  }
}
// In one thread (or process or machine, etc.), read through records and act on them.
val reader = new Thread(new Runnable {
  def run = {
    println("Paging through records...")
    val cursor = collection.find()
    while (cursor.hasNext)
      shipOrderToCustomer(cursor.next())
  }
})
// In another thread (or process, machine, etc.), update one of the records.
// The update grows the document (~36KB of UUIDs), so MongoDB moves it to the
// end of the collection, and a natural-order cursor that already returned it
// will encounter it a second time.
val updater = new Thread(new Runnable {
  def run = {
    Thread.sleep(1000)
    println("Updating canary object...")
    val query = new BasicDBObject()
    query.put("name", "canary")
    val newDoc = new BasicDBObject()
    newDoc.put("name", "canary")
    var value = ""
    for (i <- 1 to 1000) value += UUID.randomUUID.toString
    newDoc.put("value", value)
    collection.update(query, newDoc)
  }
})
reader.start
updater.start
updater.join
reader.join
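For comparison, the duplicate does not appear if the cursor requests snapshot mode. A minimal sketch of that variant of the read loop, using DBCursor.snapshot() from the same 2.x Java driver (the comments below measure why this is expensive on a large collection):

// Snapshot mode traverses the _id index, so a document that grows and is
// relocated mid-scan is not returned a second time.
val snapshotCursor = collection.find().snapshot()
while (snapshotCursor.hasNext)
  shipOrderToCustomer(snapshotCursor.next())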
@cscotta (author) commented Nov 6, 2011:
Just did a quick test to verify this behavior -- pretty sure that's correct.

I added an index in the console with db.repro.ensureIndex({"name":1}), inserted more documents to bring the collection to 6.7MM records, then timed a couple of queries to determine whether the server-side behavior differed greatly and to verify that the index is not being hit.

For a query which has one result (a document with {"name": "canary"}), here are the times that I see:

val res = collection.find(query)
while (res.hasNext) println(res.next)
=> 2ms

val res = collection.find(query).snapshot
while (res.hasNext) println(res.next)
=> 7927ms (during which the mongod process was pegged at 100% on an 8-core system)
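A wall-clock wrapper along these lines is enough to reproduce the comparison (the time helper here is an illustrative sketch, not part of the original test):

def time[A](label: String)(block: => A): A = {
  val start = System.currentTimeMillis
  val result = block
  println(label + ": " + (System.currentTimeMillis - start) + "ms")
  result
}

val query = new BasicDBObject("name", "canary")

time("find") {
  val res = collection.find(query)
  while (res.hasNext) res.next
}

time("find.snapshot") {
  val res = collection.find(query).snapshot
  while (res.hasNext) res.next
}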

It seems that this limitation makes the snapshot isolation feature very expensive to use, requiring the DB to scan the collection rather than just the one (or few) documents that actually match the query. In a large collection, even a well-indexed one, and especially one that does not fit entirely in memory, the cost of avoiding duplicate results is impractically high; the daemon pegs a core while serving a single-result query that one would expect to be satisfied by an index. One mitigation strategy would be to cache result _ids on the client side as they are retrieved and de-dupe manually (see the sketch below), though this requires a lot of memory for large recordsets.
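A rough sketch of that client-side workaround (hypothetical; note that the set of seen _ids grows with every distinct result):

import scala.collection.mutable

// De-dupe on the client by tracking the _ids already processed, so a
// document that is relocated and returned twice is only acted on once.
val seen = mutable.HashSet[AnyRef]()
val cursor = collection.find()
while (cursor.hasNext) {
  val doc = cursor.next()
  if (seen.add(doc.get("_id")))  // add returns false if the _id was already present
    shipOrderToCustomer(doc)
}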

Based on that, I'm not sure that I'd agree that disabled cursor snapshotting is "merely a default." The default behavior results in a retrieval pattern that should not occur, and the "correct" behavior is essentially too expensive to use (even with properly indexed collections) for many cases. It's great to be aware of these limitations (which are well-documented on 10gen's site), but unfortunate that scenarios like this are difficult to envision when planning and selecting a database, only manifesting as serious issues long down the road without terribly practical solutions due to curious limitations of the engine.

Thanks again for replying quickly, Brendan. I think it's important that folks are aware of Catch-22s like these. It's easy to get caught up in how clean and simple MongoDB's API looks, the flexibility of its query language, and how easy it is to get up and going. Just hurts pretty bad to run into trouble like this in production long after the glory days are over without much of a remedy. Evaluating a database is a very serious decision and not one to be made lightly.

Unfortunately, it seems that in many cases where Mongo seemed like an ideal choice at first, unexpected gotchas like this can pop up and leave you high and dry. It's difficult for me to imagine having read all of the docs on 10gen's site and then envisioned an application performing an update mid-query that grows a document, moves it to the end of the collection, and causes it to appear again in the resultset, with the snapshot option designed to prevent this being unusable on a large collection. I probably would've seen that cursors can request snapshot isolation, but likely wouldn't have made the connection that using it prevents indexes from being hit. Best to chalk that one (and all the rest) up to hindsight, I suppose.
