
M201 Notes

What are Indexes?

Think of an index as a set of key/value pairs, where each key is the value of the field we indexed on and the value points to the document itself. You can have multiple indexes on the same collection. Like the index at the back of a book, index keys are stored in sorted order, so we don't need to look at every single index entry. MongoDB stores this information in a B-tree. Without an index, each new document added to the collection means one more comparison for a query to make, but with an index, each new document does not add a new comparison to the lookup. As a result, querying with an index is faster: O(log n) instead of O(n).

Indexes are not without overhead, however. Whenever a document is inserted, changed, or deleted, the B-tree needs to be updated. You shouldn't have so many indexes that they slow down your CRUD operations; have enough indexes to make your queries fast, no more and no less.
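
A rough sketch of the basic commands involved (fieldName is a placeholder; fieldName_1 is the default name MongoDB generates for that index):

db.collection.createIndex({ fieldName: 1 })  // build an ascending single-field index
db.collection.getIndexes()                   // list the collection's indexes (the default _id index is always present)
db.collection.dropIndex("fieldName_1")       // drop an index once its write overhead outweighs its benefit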

Performance

MongoDB is a document database that can take advantage of multiple CPU cores, and performance improves with the number of cores available. However, only one CPU core can be used for a given update operation, because a write lock is placed on the document being updated and released only when the update is complete.

In terms of disk usage, MongoDB recommends a RAID 10 configuration. RAID 10 is a combination of RAID 0 and RAID 1: RAID 0 is striping (data is split across multiple disks) and RAID 1 is mirroring (data is copied to multiple disks). This is a good configuration because it provides both redundancy and performance.

Persistence on Disk

There are different storage engines in MongoDB, but those are out of scope.

mongod --dbpath /data/db --storageEngine wiredTiger

To start a mongod instance with a different database path, use the --dbpath option. The --fork option will run the mongod instance in the background. The --logpath option will log the output to a file. The --logappend option will append to the log file instead of overwriting it.

mongod --dbpath /data/db --fork --logpath /var/log/mongodb.log

The --directoryperdb option creates a separate folder for each database, changing the default structure of the /data/db directory: one folder per database. Inside the folder for a database we get the data files for each collection and each index, as long as it's a user-created database and not admin, local, or config. When a collection is updated, so are its index files.

mongod --dbpath /data/db --directoryperdb

Data is written to RAM first and persisted to disk afterwards. An insert, for example, won't be written to disk until a checkpoint unless a write concern requires it. A checkpoint is a point in time at which the data in memory is flushed to disk; by default a checkpoint happens every 60 seconds.

db.collection.insertOne({name: "John"}, {writeConcern: {w: 3}})

Journaling

Journaling is an essential part of persistence that acts as a failsafe against data corruption caused by an unexpected shutdown, power loss, or otherwise incomplete file writes. The journal can help restore the state of the database by writing ahead to disk a log of each write operation. The journal has its own file structure that records individual write operations. Waiting for journal acknowledgement, however, has some performance impact.

db.collection.insertOne({name: "John"}, {writeConcern: {w: 1, j: true}})

Single Field Indexes

A single field index is an index that is created on a single field. Here is an index created on the ssn field. Running a query before the index is created will take longer because we need to scan the entire collection (COLLSCAN). After the index is created, the query will be faster because we can use the index to find the document.

db.collection.createIndex({ssn: 1})

If a document in the collection doesn't have a value for ssn, its key in the index will be null.
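
A quick sketch of that behavior (the inserted document and query below are hypothetical, assuming the ssn index from above):

db.people.insertOne({ first_name: "NoSsn" })              // hypothetical document with no ssn field
db.people.explain("executionStats").find({ ssn: null })   // matches documents where ssn is null or missing; the plan can still use the ssn_1 index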

Explainable Object

Using the .explain() method we can see the details of how the query was run. In this example, we searched for a specific SSN after creating an index on the ssn field. We can see that the query planner used the index to find the document, as denoted by the IXSCAN value.

db.collection.find({ssn: "123-45-6789"}).explain()

Info in return object:

  • `executionStats` - shows the number of documents examined, the number of documents returned, and the total execution time.
  • `nReturned` - the number of documents returned by the query.
  • `totalKeysExamined` - the number of index keys examined by the query.

Return (before index)

totalDocsExamined: 500000
nReturned: 1

Return (after index)

totalDocsExamined: 1 
nReturned: 1

We only need to examine one document (after we created the index) because the index is used to find the document. The index key is the value of the field that we indexed on. The index value is the document itself.

Dot notation when using Indexes

We can use dot notation to index fields inside subdocuments. Let's check out an example. First we insert a document with a nested field.

db.examples.insertOne({ _id: 1, subdoc: { indexedField : "value", otherField: "value" } } )

Then we insert another document, this one with a different value for indexedField:

db.examples.insertOne({ _id: 2, subdoc: { indexedField : "wrongValue", otherField: "value" } } )

Then we create an index on the nested field.

db.examples.createIndex({ "subdoc.indexedField": 1 })

Now let's query the document, but use an explain to see the execution stats:

db.examples.explain("executionStats").find({ "subdoc.indexedField": "value" })

Output:

 winningPlan: {
      stage: 'FETCH',
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { 'subdoc.indexedField': 1 },
        indexName: 'subdoc.indexedField_1',
        isMultiKey: false,
        multiKeyPaths: { 'subdoc.indexedField': [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { 'subdoc.indexedField': [ '["value", "value"]' ] }
      }
    },
    rejectedPlans: []
  },
  executionStats: {
    executionSuccess: true,
    nReturned: 1,
    executionTimeMillis: 6,
    totalKeysExamined: 1,
    totalDocsExamined: 1,
    executionStages: {
      stage: 'FETCH',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 2,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      docsExamined: 1,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 2,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 1,
        keyPattern: { 'subdoc.indexedField': 1 },
        indexName: 'subdoc.indexedField_1',
        isMultiKey: false,
        multiKeyPaths: { 'subdoc.indexedField': [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { 'subdoc.indexedField': [ '["value", "value"]' ] },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 0,
        dupsDropped: 0
      }
    }
  },

Notice that the winning plan uses an IXSCAN on subdoc.indexedField_1, and that totalKeysExamined and totalDocsExamined are both 1 for the single document returned.

Important: You should never index a field whose value is an entire subdocument, because using that index would require querying on the entire subdocument. It is better to use dot notation to index a specific field inside the subdocument. Indexing on more than one field should be done with a compound index (covered later).
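
As a quick illustration using the examples collection from above (this reflects MongoDB's exact-match semantics when querying on an embedded document):

db.examples.find({ subdoc: { indexedField: "value", otherField: "value" } })  // whole-subdocument query: matches _id: 1 only with an exact, field-order-sensitive copy
db.examples.find({ subdoc: { indexedField: "value" } })                       // matches nothing, because the subdocument is not an exact match
db.examples.find({ "subdoc.indexedField": "value" })                          // dot notation targets just the field we care about and can use subdoc.indexedField_1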

Using single-field indexes to query ranges of values

We can query for documents whose indexed value falls within a specific range. For example, to find all social security numbers greater than one value but less than another, we can use our index:

db.people.explain("executionStats").find({ ssn: { $gt: "555-00-0000", $lt: "556-00-0000" } })

This query returns the following:

 winningPlan: {
      stage: 'FETCH',
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { ssn: 1 },
        indexName: 'ssn_1',
        isMultiKey: false,
        multiKeyPaths: { ssn: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { ssn: [ '("555-00-0000", "556-00-0000")' ] }
      }
    },
    rejectedPlans: []
  },
  executionStats: {
    executionSuccess: true,
    nReturned: 49, # 49 documents returned, what we expected
    executionTimeMillis: 23, # fast!
    totalKeysExamined: 49, #important
    totalDocsExamined: 49, #important
    executionStages: {
      stage: 'FETCH',
      nReturned: 49, # important line
      executionTimeMillisEstimate: 0,
      works: 50,
      advanced: 49,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      docsExamined: 49,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 49,
        executionTimeMillisEstimate: 0,
        works: 50,
        advanced: 49,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 1,
        keyPattern: { ssn: 1 },
        indexName: 'ssn_1',
        isMultiKey: false,
        multiKeyPaths: { ssn: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { ssn: [ '("555-00-0000", "556-00-0000")' ] },
        keysExamined: 49,
        seeks: 1,
        dupsTested: 0,
        dupsDropped: 0
      }
    }
  },
  command: {
    find: 'people',
    filter: { ssn: { '$gt': '555-00-0000', '$lt': '556-00-0000' } },
    '$db': 'm201'
  },

Only 49 documents needed to be scanned to find the documents that matched the query. This is because the index is sorted and we can use the index to find the range of documents that match the query.

Querying on a set of values

We can query on a set of values for an indexed field. For example, we can find three different social security numbers using the following query:

db.people.explain("executionStats").find({ ssn: { $in: ["001-29-9184", "177-45-0950", "265-67-9973"] } })

Returns:

winningPlan: {
      stage: 'FETCH',
      inputStage: {
        stage: 'IXSCAN',
      ...
executionStats: {
    executionSuccess: true,
    nReturned: 3,
    executionTimeMillis: 1,
    totalKeysExamined: 6, # Not ideal, but still fast 
    totalDocsExamined: 3,
    ...
}

The query returned 3 documents, but 6 keys were examined. With an $in query, the index is searched for each value separately, and the cursor examines an extra key around each value while seeking between the bounds, which is why totalKeysExamined can exceed nReturned. It is not ideal, but it is still very fast.

We can also find multiple values and combine that query with other fields in the document, even if the other fields are not indexed. For example, we can find all documents that have a specific social security number and a specific first name:

db.people.explain("executionStats").find({ ssn: { $in: ["001-29-9184", "177-45-0950", "265-67-9973"] }, last_name: "H" })

What happens here is that the index is used to find the documents matching the ssn predicate, and those documents are then filtered to find the ones that match the last_name field. A compound index can make the same query much faster, but that is covered later; a rough sketch follows.
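
As a hedged preview of that idea, an index on both fields lets the index alone narrow the candidates to documents matching both predicates:

db.people.createIndex({ ssn: 1, last_name: 1 })  // hypothetical compound index; the order of the keys matters
db.people.explain("executionStats").find({ ssn: { $in: ["001-29-9184", "177-45-0950", "265-67-9973"] }, last_name: "H" })
// totalDocsExamined should now only count documents that match both ssn and last_name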

Summary

  • Single-field indexes are by far the simplest indexes to create and use in MongoDB
  • Syntax is db.collection.createIndex({ field: <direction> })
  • Keys from only one field
  • Can find a single value for the indexed field
  • Can find a range of values
  • Can use dot notation to index on a field in a subdocument (db.collection.createIndex({ "subdoc.field": <direction> }))
  • Can be used to find several distinct values in a single query

Explain Method

The explain method is a very useful tool for understanding how MongoDB executes a query. It can be used to understand how indexes are being used, sometimes without even having to run the query. It can answer questions like:

  • Is your query using the index you expect?
  • Is your query using an index to provide a sort?
  • Is your query using an index to provide the projection?
  • How selective is the index?
  • What part of your "plan" is the slowest or most expensive computationally?

Adding the explain method to the end of a query is quick and easy, but it comes with some limitations, one of them being that you can't reuse the same call across more than one query.

Explainable Objects

Explainable objects help keep your code DRY: they can be reused to run explain on multiple queries. Here is a simple example:

const explainable = db.people.explain()
explainable.find({ "address.city": "New York" })
explainable.find({ "address.city": "Northville" })

The above example returns the explain output without actually running the queries. This is useful when you want to validate that your command works before actually running it, for example on a server that is in production or under heavy load; you can save yourself some CPU cycles by running explain first. Not executing the query is the default behavior of the explain method.

Running the following is the same as passing no arguments to the explain method:

const explainable = db.people.explain("queryPlanner")

There is another mode called allPlansExecution that is very verbose. Like executionStats, it actually runs the query, whereas queryPlanner does not.

mongosh
use m201
const explainable = db.people.explain()                      // normal explain ("queryPlanner")
const expRun = db.people.explain("executionStats")           // explain with execution stats
const expRunVerbose = db.people.explain("allPlansExecution") // execution stats plus verbose output (runs the query)

We created 3 explainable objects. The first one is the default explain method. The second one is the explain method with execution stats. The third one is the explain method with execution stats and verbose output.

If we run the following query:

expRun.find({ "last_name": "Johnson", "address.state": "New York" })

A collection scan would be used and over 50k documents would need to be examined. Not great. If we then create an index on last_name, the execution time is much faster. Ideally, totalDocsExamined should be as close as possible to nReturned.
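
A minimal sketch of that step, using the expRun explainable object from above:

db.people.createIndex({ last_name: 1 })   // index the equality field used by the query
expRun.find({ "last_name": "Johnson", "address.state": "New York" })
// the winning plan should now be an IXSCAN on last_name_1 followed by a FETCH that filters on address.state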

Rejected plans

The rejectedPlans array in the explain output lists the plans the query planner considered but did not choose; running explain with the allPlansExecution argument additionally reports execution stats for those plans. This is useful for understanding why a plan was not chosen.

db.people.createIndex({ "address.state":1, "last_name": 1}) # at this point we would have one index for state and last name, and another for last name

Adding this index causes our query to report a rejected plan in the explain output. If we add a sort, we can see some more information in the execution stages.

// run a sort query
var res = db.people.find({"last_name":"Johnson", "address.state":"New York"}).sort({"birthday":1}).explain("executionStats")

// checkout the execution stages (doing an in-memory sort)
res.executionStats.executionStages

In the output, we can see that the index is first used to find the documents that match the query, then a fetch is done, and finally an in-memory sort, because the sort on birthday is not supported by an index. This shows up as a SORT stage (with a SORT_KEY_GENERATOR input stage) in the execution stages. If the in-memory sort would use more than 32 MB of memory, the server aborts the query rather than perform the sort.
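
One way to avoid the in-memory sort, sketched here on top of the fields used above, is to include the sort field after the equality fields in a compound index so the index can return documents already in sorted order:

db.people.createIndex({ "address.state": 1, "last_name": 1, birthday: 1 })  // equality fields first, then the sort field
db.people.find({ "last_name": "Johnson", "address.state": "New York" }).sort({ birthday: 1 }).explain("executionStats")
// with this index the plan should no longer need an in-memory SORT stage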

Running explain on a sharded cluster
