@cscotta
Created November 6, 2011 19:40
cscotta@ordasity:~/Desktop$ scala -cp mongo-2.7.0.jar Repro.scala
Inserting canary...
Inserting test data...
Paging through records...
Spotted the canary!
Updating canary object...
Spotted the canary!
Whoops, shipped the same order multiple times!
cscotta@ordasity:~/Desktop$
// Repro.scala
import com.mongodb._
import java.util.UUID

// Connect to Mongo.
val mongo = new Mongo("localhost", 27017)
val db = mongo.getDB("repro_databoxor")
val collection = db.getCollection("repro")
var canarySightings = 0

// Insert our "canary" object.
println("Inserting canary...")
val canary = new BasicDBObject()
canary.put("name", "canary")
canary.put("value", "value")
collection.insert(canary)

// Insert 100,000 other objects.
println("Inserting test data...")
for (i <- 1 to 100000) {
  val doc = new BasicDBObject()
  doc.put("name", UUID.randomUUID.toString)
  doc.put("value", UUID.randomUUID.toString)
  collection.insert(doc)
}

// The function we'll call to operate on records returned from the DB.
def shipOrderToCustomer(doc: DBObject) {
  if (doc.get("name") == "canary") {
    canarySightings += 1
    println("Spotted the canary!")
    if (canarySightings > 1) println("Whoops, shipped the same order multiple times!")
  }
}

// In one thread (or process or machine, etc.), read through records and act on them.
val reader = new Thread(new Runnable {
  def run = {
    println("Paging through records...")
    val cursor = collection.find()
    while (cursor.hasNext)
      shipOrderToCustomer(cursor.next())
  }
})

// In another thread (or process, machine, etc.), update one of the records,
// growing it so that it no longer fits in its original on-disk location.
val updater = new Thread(new Runnable {
  def run = {
    Thread.sleep(1000)
    println("Updating canary object...")
    val query = new BasicDBObject()
    query.put("name", "canary")
    val newDoc = new BasicDBObject()
    newDoc.put("name", "canary")
    var value = ""
    for (i <- 1 to 1000) value += UUID.randomUUID.toString
    newDoc.put("value", value)
    collection.update(query, newDoc)
  }
})

reader.start
updater.start
updater.join
reader.join
@bwmcadams

For what it's worth, you'll probably be much happier using Casbah (the scala driver) instead of raw Java from Scala; things such as proper iterators and support for native Scala types make life much easier.
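For instance, the connect-and-query portion of the repro might look roughly like this in Casbah (a sketch based on the 2.x API; exact names may differ across versions):

import com.mongodb.casbah.Imports._

// Sketch: the same connection and query flow using Casbah's Scala idioms.
val conn = MongoConnection("localhost", 27017)
val coll = conn("repro_databoxor")("repro")

// MongoDBObject builds queries from Scala pairs, and Casbah's cursors
// behave as native Scala iterators.
coll.find(MongoDBObject("name" -> "canary")).foreach(doc => println(doc))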

As to your issue w/ seeing the same record multiple times, it looks likely to be a cursor issue. By default, query results are not snapshotted (like most things, it is merely a default); your update is likely causing the document to exceed its allocated space in the on-disk file and be relocated to the end of the collection, which could cause you to see the same record twice on a previously opened cursor (because it moved from its original location and the cursor passes over it again).

An 11 character change will fix this problem. See my fork of your gist for an example of the fix in action.
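Concretely, the change is presumably just enabling snapshot mode on the reader's cursor (".snapshot()" happens to be eleven characters):

// The reader loop from the repro with snapshot mode enabled, so a document
// relocated mid-scan is not handed back a second time.
val cursor = collection.find().snapshot()
while (cursor.hasNext)
  shipOrderToCustomer(cursor.next())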

@cscotta
Author

cscotta commented Nov 6, 2011

Thanks for replying, Brendan. I grabbed the Java driver because it's the one I was more familiar with; Casbah looks nice.

You're right -- enabling cursor snapshotting prevents the updated document from appearing twice after being moved to the end of the collection due to growth in the document's size. However, it looks like the docs ("How to do Snapshotted Queries in the Mongo Database") state that:

"Because snapshot mode traverses the _id index, it may not be used with sorting or explicit hints. It also cannot use any other index for the query."

So, with cursor snapshot turned on, it looks like the query planner can't actually use an index -- seems like it would just have to resort to a full scan of the collection to satisfy the query / getMore, then. Is that correct?

– Scott

@cscotta
Copy link
Author

cscotta commented Nov 6, 2011

Just did a quick test to verify this behavior -- pretty sure that's correct.

I added an index in the console with db.repro.ensureIndex({"name":1}), inserted more documents to expand the collection size to 6.7MM records, then timed a couple queries to determine whether or not the server-side behavior differed greatly / to verify that the index is not being hit.

For a query which has one result (a document with {"name": "canary"}), here are the times that I see:

val res = collection.find(query)
while (res.hasNext) println(res.next)
=> 2ms

val res = collection.find(query).snapshot
while (res.hasNext) println(res.next)
=> 7927ms (during which the mongod process was pegged at 100% on an 8-core system)
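(The plans themselves can presumably be confirmed with the Java driver's DBCursor.explain(), e.g.:)

// Sketch: ask the server for the plan behind each cursor. The plain query
// should report the name index; the snapshotted one, an _id traversal.
println(collection.find(query).explain().get("cursor"))
println(collection.find(query).snapshot().explain().get("cursor"))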

It seems that this limitation makes the snapshot isolation feature very expensive to use, requiring the DB to scan the collection rather than just the one (or few) documents that actually match the query. In a large collection, even if well-indexed, and especially if the collection does not fit entirely in memory, it seems that the cost of avoiding the retrieval of duplicate results is impractically high -- especially with the daemon pegging a core while serving a query with one result which one would expect could be satisfied by an index. I suppose one mitigation strategy would involve caching result _ids on the client side when retrieving them and de-duping manually, though this would require a lot of memory for large recordsets.
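Roughly, that client-side de-duping might look like this (a sketch; it trades O(result set) memory on the client for at-most-once handling):

import scala.collection.mutable

// Sketch: remember each _id already handled on this cursor and skip any
// document the cursor returns a second time after on-disk relocation.
val seen = mutable.Set.empty[AnyRef]
val cursor = collection.find()
while (cursor.hasNext) {
  val doc = cursor.next()
  if (seen.add(doc.get("_id")))  // add returns false for an _id we've seen
    shipOrderToCustomer(doc)
}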

Based on that, I'm not sure I'd agree that disabled cursor snapshotting is "merely a default." The default behavior results in a retrieval pattern that should not occur, and the "correct" behavior is essentially too expensive to use for many cases, even with properly indexed collections. It's great to be aware of these limitations (which are well-documented on 10gen's site), but it's unfortunate that scenarios like this are difficult to envision when planning and selecting a database; they only manifest as serious issues long down the road, without terribly practical solutions, due to curious limitations of the engine.

Thanks again for replying quickly, Brendan. I think it's important that folks are aware of Catch-22s like these. It's easy to get caught up in how clean and simple MongoDB's API looks, the flexibility of its query language, and how easy it is to get up and going. Just hurts pretty bad to run into trouble like this in production long after the glory days are over without much of a remedy. Evaluating a database is a very serious decision and not one to be made lightly.

Unfortunately, it seems that in many cases where Mongo seemed like an ideal choice at first, unexpected gotchas like this can pop up and leave you high and dry. It's difficult for me to imagine a scenario in which I could've read all of the docs on 10gen's site and then envisioned an application performing a mid-query update that grows a document, moves it to the end of the collection, and causes it to appear again in the resultset -- with the snapshot option designed to prevent this being unusable on a large collection. I probably would've seen that cursors can request snapshot isolation, but likely wouldn't have made the connection that using it would prevent indexes from being hit. Best to chalk that one (and all the rest) up to hindsight, I suppose.
