Skip to content

Instantly share code, notes, and snippets.

@leommoore
Last active November 9, 2020 18:11
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save leommoore/392e0a45b3eef903d45b to your computer and use it in GitHub Desktop.
Save leommoore/392e0a45b3eef903d45b to your computer and use it in GitHub Desktop.
MongoDB 3.2.x Reporting

MongoDB 3.2.x Reporting

The find() operator creates a cursor to the data. The client can then parse through the data and do calculations on the data for reports purposes. However, with large data sets this is not practical. It makes far more sense for the server to do the calculations and return the result. This can save a considerable amount of unnecessary network traffic. You can however run the count on the cursor without much overhead.

db.mycollection.find().count()

Mongo has two other methods for reporting on data:

  • Aggregation
  • Map-Reduce

Aggregation

The aggregation command takes a series of operators and processes them in a sequence, passing the documents from one operator to the next in a pipeline.

db.mycollection.aggregate(operator1, operator2, .., operatorN)

or

db.runCommand({aggregate: 'mycollection',operator1, operator2, .., operatorN)

Operators

The aggregation command has a number of operators which can be used to process the data. It is quite possible that any of these operators may appear multiple times in an aggregation command.

$match

The $match operator selects only the records which match the specified criteria. As a general rule, you should try to match as early as possible to reduce the documents being processed by supsequent operators.

db.mycollection.aggregation({$match: {salePrice: {$gt: 100}}}, .....)

You can match on and element or any array element.

db.sales.aggregation($match: {vendor: 'Acme Stores'}})

$group

The $group operator is similar to the group by operator in sql in that it will group documents by a criteria. The $group operator is the mainstay of aggregation. The most important part of the $group operator is the _id component. The _id is similar to the group by operator in sql. The main difference is that there must be an _id field and it cannot be called anything else, but you can group by any field or combination of fields.

$sum

A simple $group could be the count of Sales Transactions.

db.sales.aggregate({$group: {_id: 'all', count: {$sum: 1}}})
{ "_id" : "all", "count" : 525 }

In this case, rather then giving _id a value from the data we are specifying a constant. This means that all documents will be grouped together. Also, since we are asking it to sum a value of 1 for each document this this means effectively that this is a count of all documents.

If I wanted to do a total of the all sales, I will have to specify a value in the data to be summed. To access the value in a field you denote it with a $.

db.sales.aggregate({$group: {_id: 'all', total: {$sum: '$receiptTotal'}}})
{ "_id" : "all", "total" : 10259.20999999999 }
$max, $min, $avg

$max, $min and $avg will find the max, min or average in a specified field.

db.sales.aggregate({$group: {_id: 'all', max: {$max: '$receiptTotal'}}})
{ "_id" : "all", "max" : 305 }

db.sales.aggregate({$group: {_id: 'all', min: {$min: '$receiptTotal'}}})
{ "_id" : "all", "min" : 0.49 }

db.sales.aggregate({$group: {_id: 'all', avg: {$avg: '$receiptTotal'}}})
{ "_id" : "all", "avg" : 19.54135238095236 }

You can combine all these together too:

db.sales.aggregate({
$group: {_id: 'all', 
salesCount: {$sum: 1},
salesTotal: {$sum: '$receiptTotal'}, 
avg: {$avg: '$receiptTotal'}, 
avg: {$avg: '$receiptTotal'}, 
max: {$max: '$receiptTotal'},
min: {$min: '$receiptTotal'}
}})

{ "_id" : "all", "salesCount" : 525, "salesTotal" : 10259.20999999999, "avg" : 19.54135238095236, "min" : 0.49, "max" : 305 }

You can use the _id to group the data by different criteria. For example if we would like to group it by store:

db.receipts.aggregate(
{$group: {_id: '$store', 
salesCount: {$sum: 1},
salesTotal: {$sum: "$receiptTotal"},
avg: {$avg: "$receiptTotal"},
min: {$min: "$receiptTotal"},
max: {$max: "$receiptTotal"}
}})

{ "_id" : "London", "salesCount" : 2, "salesTotal" : 66.6, "avg" : 33.3, "min" : 33.3, "max" : 33.3 }
{ "_id" : "Paris", "salesCount" : 1, "salesTotal" : 31.85, "avg" : 31.85, "min" : 31.85, "max" : 31.85 }
{ "_id" : "New York", "salesCount" : 1, "salesTotal" : 46.81, "avg" : 46.81, "min" : 46.81, "max" : 46.81 }
{ "_id" : "Rome", "salesCount" : 1, "salesTotal" : 15, "avg" : 15, "min" : 15, "max" : 15 }
{ "_id" : "Tokyo", "salesCount" : 1, "salesTotal" : 31, "avg" : 31, "min" : 31, "max" : 31 }
{ "_id" : "Rio", "salesCount" : 3, "salesTotal" : 24.63, "avg" : 8.21, "min" : 3, "max" : 13.63 }
{ "_id" : "Berlin", "salesCount" : 4, "salesTotal" : 48.51, "avg" : 12.12, "min" : 2.29, "max" : 21.97 }
$addToSet

You can also use the $addToSet operator to create a set of data from the aggregated group. The $addToSet operator will only add distinct items to the set so there are no duplicates.

db.products.aggregate({$group: {_id: '$vendorName', products: {$addToSet: "$productName"}}})

{ 
    "_id" : "Acme Stores", 
    "products" : [ "Corn Flakes", "Beans", "Apples" ] 
},
{
    "_id" : "Super Discount Market",
    "products" : ["Cooking Sauce", "Pasta", "Goldfish Food", "Milk", "Coffee", "Bread"]
}
$push

The $push operator operates in the same way except that it will add all items even if they are duplicated.

db.products.aggregate({$group: {_id: '$vendorName', products: {$push: "$productName"}}})

{ 
    "_id" : "Acme Stores", 
    "products" : [ "Corn Flakes", "Beans", "Apples", "Corn Flakes", "Corn Flakes" ] 
},
{
    "_id" : "Super Discount Market",
    "products" : ["Cooking Sauce", "Pasta", "Goldfish Food", "Milk", "Coffee", "Bread", "Pasta", "Milk"]
}
$first

The $first operator will return the items in the first document for the group criteria.

db.products.aggregate({$group: {_id: '$vendorName', products: {$first: "$productName"}}})

{ "_id" : "Acme Stores",  "products" : "Corn Flakes"},
{ "_id" : "Super Discount Market", "products" : "Cooking Sauce"}
$last

The $last operator will return the items in the first document for the group criteria.

db.products.aggregate({$group: {_id: '$vendorName', products: {$last: "$productName"}}})

{ "_id" : "Acme Stores",  "products" : "Corn Flakes"},
{ "_id" : "Super Discount Market", "products" : "Milk"}

Note: For Acme Stores, Corn Flakes is in both the first and last documents for the group.

If you want to use $first and $last to make sense you normally have to sort it first. The default sort is by _id.

$sort

The $sort operator sorts the documents by the specified fields. You may want to sort a number of times in the pipeline to get the documents in the order you want.

db.sales.aggregation({$sort: {transactionDate: 1}}....)
db.sales.aggregation({$sort: {transactionDate: -1}}....)

A sort order of 1 is ascending and a sort order of -1 is descending. A $sort operator can use an index if it is be before a $group,$project or $unwind operator. This is because once you are not using the original document then the index is invaid.

$limit

The $limit operator does exactly what it says it does, it limits the number of documents to the number specified. An example would be where you want to get only the top 10 customers.

db.sales.aggregation({$limit: 10}, .....)

Like $match, $limit can be used to reduce the number of documents that are passed onto the next stage. For example, maybe you just want the 10 largest customers.

$skip

The $skip operator ignores the specified (non-negative) number of documents and only returns the documents after that point.

db.sales.aggregation({$skip: 100}, .....)

$unwind

Sometimes documents may contain an array which you need to break into it's component parts to analyse. The $unwind operator allows you to break a specified array into its individual parts as seperate documents. For example if the following documents are in a collection:

{_id: 1, createdBy: "Jack", tags: ["Movies", "Sci-Fi"]} 
{_id: 2, createdBy: "Jane", tags: ["Sports"]}

The using the unwind operator:

db.sales.aggregation({$unwind: "$tags"})

This will create a seperate document for each array entry in tags.

{_id: 1, createdBy: "Jack", tags: "Movies"} 
{_id: 1, createdBy: "Jack", tags: "Sci-Fi"} 
{_id: 2, createdBy: "Jane", tags: "Sports"}    

If there is no data in the array then no documents are output. IF the field specified is not an array then it will throw and error.

$project

The $project operator allows you to select only the fields you require from the document. Like $match it should be done as early as possible to reduce the size of the documents being passed to the next operator in the pipeline. It can always be used again later in the pipeline to shape the result. It can also be used to create new fields, for example you could add two fields together to make a new field. The ````$projectoperator works like the field selection in thefind``` command. True is represented by 1 and False is represented by 0.

db.sales.aggregation({$project: {saleAmount: 1}})

{
    "result": [
        {"_id": 1, "saleAmount": 23.99},
        {"_id": 2, "saleAmount": 1.99},
        {"_id": 3, "saleAmount": 5.54},
        {"_id": 4, "saleAmount": 6.50},
    ]
}

Like the field selection in the find query the $project will automatically return the _id unless explicity told not to. For example:

db.sales.aggregation({$project: {"_id": 0, "saleAmount": 1}})

{
    "result": [
        {"saleAmount": 23.99},
        {"saleAmount": 1.99},
        {"saleAmount": 5.54},
        {"saleAmount": 6.50},
    ]
}

You can also use $project to rename the field name. For example:

db.sales.aggregation({$project: {"_id": 0, "price": "$saleAmount"}})

{
    "result": [
        {"price": 23.99},
        {"price": 1.99},
        {"price": 5.54},
        {"price": 6.50},
    ]
}
Arithmetic Operators

When using the $project operator you can also use $add, $subtract, $multiply, $divide and $mod.

{$project: {
        _id: 0,
        lineTotal: {
            $multiply: [
                $saleAmount, 
                $quantity
            ]
        }
    }
}

Note: Add and Multiply can have any number of arguments, whereas divide, subtract and mod can only have two. Instead of a field you can also use a value, so you could multiply the $salesAmount by 1.

String Operators

There are a number of functions available to maniplate Strings.

$concat

You can use the ```concat```` operator to join strings together. For example:

{$project: {
    _id: 0,
    address: {
            $concat: [
                $address1, ", ",
                $address2, ", ",
                $city, ", ",
                "Ireland", ", ",
                "$postcode"
            ]
        }
    }
}

This will give us:

Royal Dublin Society, Ballsbridge, Dublin, Ireland, D04 AK83
$substr

The $substr operator allows you to extract a piece of text from a string. You specify the index and then the number of characters you require.

{$project: {
    _id: 0,
    postcode: $postcode,
    district: {
            $substr: [
                "$postcode",
                0,
                3
            ]
        }
    }
}

This will return the first three characters of the postcode.

{
    postcode: D04 AK83,
    district: D04
}
Other String Operators
 {$toLower: str}
 {$toUpper: str}
 {$substr: [str, offset, count]}
 {$substr: [s1, s2, s3...]}
 {$strcasecmp: [s1, s2]} //Case Insensitive
 {$cmp: [s1, s2]}        //Case Sensitive
Date Operators

You can also manipulate dates using the following functions:

{$project: {
    _id: 0,
    salesDate: 1,
    dayOfYear: {$dayOfYear: "$salesDate"},
    dayOfMonth: {$dayOfMonth: "$salesDate"},
    dayOfWeek: {$dayOfWeek: "$salesDate"},
    year: {$year: "$salesDate"},
    month: {$month: "$salesDate"},
    week: {$month: "$salesDate"},
    hour: {$hour: "$salesDate"},
    minute: {$minute: "$salesDate"},
    second: {$second: "$salesDate"},
    millisecond: {$millisecond: "$salesDate"}
}}

This gives us:

{ 
    "salesDate" : ISODate("2014-03-28T13:25:00Z"), 
    "dayOfYear" : 87, 
    "dayOfMonth" : 28, 
    "dayOfWeek" : 6, 
    "year" : 2014, 
    "month" : 3, 
    "week" : 3, 
    "hour" : 13, 
    "minute" : 25, 
    "second" : 0, 
    "millisecond" : 0 
}
Conditional Functions

There are a number of conditionals you can use.

$cond

You can use the $cond operator to do a conditional check and return a value for true or false.

{$project: {
    _id: 0,
    $cond: [
        $eq: {country: "USA"},
        "Rockin in the USA!",
        "Rockin somewhere else!"
    ]
}}

Or a more useful example to calculate the Quarter:

var qtrCalc = {
    "$cond": [
        {
            "$gt": [
                {
                    "$month": "$salesDate"
                },
                9
            ]
        },
        "Q4",
        {
            "$cond": [
                {
                    "$gt": [
                        {
                            "$month": "$salesDate"
                        },
                        6
                    ]
                },
                "Q3",
                {
                    "$cond": [
                        {
                            "$gt": [
                                {
                                    "$month": "$salesDate"
                                },
                                3
                            ]
                        },
                        "Q2",
                        "Q1"
                    ]
                }
            ]
        }
    ]
}

db.sales.aggregate{$project: {q: qtrCalc, _id: 0, d: "$salesDate"}})
$ifNull

The $ifNull function provides a useful way to get around missing fields by defining a default value. If for example the quantity was missing from the sales order you could set the default to 1 is the qualtity field is null.

{
    "$project": {
        "_id": 0,
        "quanity": {
            "$ifNull": [
                "$quantity",
                1
            ]
        }
    }
}

$geoNear

The $geoNear operator is a specialized operator which allows you to query the documents on the basis of their geographical location stored in a specialized field by supplying a geograpical point.

Other Operators

Boolean
{$and: [e1, e2, e3..]}
{$or: [e1, e2, e3..]}
{$not: expr}
Comparison

There are also a number of comparison operators such as $eq $ne $lt $gt $lte $gte

{$eq: [v1, v2]}

There is also a straight comparison operator which returns -1 if v1<v2, 0 if v1=v2 and 1 if v1>v2. You can also compare strings. In this case the strings are case sensitive.

{$cmp: [v1, v2]}

Performance Guidlines

The guiding issue for Mongo is memory usage. Mongo has a hard limit of 10% of memory for any aggregation. If the memory usage goes over that then the aggregation will fail. This is to protect the normal operations of the database. Therefore to the get the best performace you should try to reduce the documents being looked at and the field data being passed from each stage of the pipeline.

  • $match early - Limit the documents
  • $project necessary fields only - Get rid of unnecessary fields to free up memory space
  • Use Indexes - $match, $sort, $skip, $limit can benefit from using indexes.
  • $sort early - Before $group, $project, $unwind so that you can get the benefit of indexes. Remember once the documents are modified in any way then the index can no longer be used.

There is always a conflict between doing $project before $sort to reduce the data. The best way is to test it and see which order works best for you data.

Map Reduce

Map Reduce is a mechanism popularized by Google for querying large data sets. See http://mapreduce.org/ for more information on MapReduce. The canonical example application of MapReduce is a process to count the appearances of each different word (like Google's indexing) in a set of documents.

There are two fundamental pieces of a MapReduce query:

"Map" step:

The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

###"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve.

In Mongo it looks like:

db.mycollection.mapReduce(myMap, myReduce, options)

function myMap(){...}
function myReduce(k,v){...}
var options: {out: {inline: 1}};

function myMap() {
    emit(key, value)
}

function myReduce(key, values) {
    //crunch values together, then..
    return reducedValue
}

Within the map function the current document can be accessed using this. You can call emit as often as you like when processing a document (for example when you are unwinding an array) or not at if the search condition is not met (for example the sales price is 0).

function myMap(){
    var total = this.price + this.shipping;
    var category = this.category.name;
    if (total>10) {
        emit(category,total);
    }
}

The reduce function takes a key (eg. product category, year, month etc) for which you want to group the data and a series of values. It then performs whatever grouping function you require (ie sum, count etc) and returns the result. The beauty of this is that the reduce function can take as input either the output from a map or a reduce function. This means that you can split the load across multiple server and combine the results.

function myReduce(key, values){
    var reducedValue = Array.sum(values);
    return reducedValue;
}

Mongo provides an optional finalize function which gets called once per key value when all the processing is finished. This gives you an opportunity to output the result in a form that you require.

function myFinalize(key, reducedValue) {
    return key + ' has ' + reducedValue;
}

Note: reducedValue is a single value.

MongoDB Example

This is a simple example to determine of the average of the odd and even numbers in a set.

Setup the Data:

> for(i=0; i<10; i++){db.test.save({'number':i});}

View the result

> db.test.find();

{ "_id" : ObjectId("521af7716436f79c39eb2089"), "number" : 0 }
{ "_id" : ObjectId("521af7716436f79c39eb208a"), "number" : 1 }
{ "_id" : ObjectId("521af7716436f79c39eb208b"), "number" : 2 }
{ "_id" : ObjectId("521af7716436f79c39eb208c"), "number" : 3 }
{ "_id" : ObjectId("521af7716436f79c39eb208d"), "number" : 4 }
{ "_id" : ObjectId("521af7716436f79c39eb208e"), "number" : 5 }
{ "_id" : ObjectId("521af7716436f79c39eb208f"), "number" : 6 }
{ "_id" : ObjectId("521af7716436f79c39eb2090"), "number" : 7 }
{ "_id" : ObjectId("521af7716436f79c39eb2091"), "number" : 8 }
{ "_id" : ObjectId("521af7716436f79c39eb2092"), "number" : 9 }

Create the Map Function

This runs on each record in the collection and returns a single result for each record.

var mapFunction = function() {
    var out = "odd";
    if (this.number % 2 == 0) {out = "even";}
    emit(out, 1);
}

Create the Reduce Function

var reduceFunction =  function(ids, number) {
    return Array.sum(number);
};

MapReduce

This function is already defined in MongoDB.

db.test.mapReduce(mapFunction,reduceFunction, {"out": {"inline": 1}});

{
    "results" : [
        {
            "_id" : "even",
            "value" : 5
        },
        {
            "_id" : "odd",
            "value" : 5
        }
    ],
    "timeMillis" : 831,
    "counts" : {
        "input" : 10,
        "emit" : 10,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1
}

Output Options

As seen before you can specify the output as going to the screen:

{out: {inline: 1}}

You can also output to a collection:

{out: "mycollection_out"}

This can be very useful when you want to store summary data for reporting. For example you could calculate the summary data each hour to save on processing load. Note that by default the output collection is overwritten each time. There are other options to merge data. The previous option is a shortened version of:

{out: {replace: "mycollection_out"}}

To merge the data you would use:

{out: {replace: "mycollection_out"}}

This will replace only those entries that match the key. This can be very useful if you want to calculate a summary each evening of the days transactions. In this case we rather than doing the work on all the data you just want to limit it to doing the work on todays data so you need to use a query to limit the data input.

First lets add some dates:

db.test.update({number: {$gte: 4}},{$set: {dt: ISODate("2016-03-01")}},{multi: 1})
db.test.update({number: {$lt: 4}},{$set: {dt: ISODate("2016-01-01")}},{multi: 1})
db.test.find()
 
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a2"), "number" : 0, "dt" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a3"), "number" : 1, "dt" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a4"), "number" : 2, "dt" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a5"), "number" : 3, "dt" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a6"), "number" : 4, "dt" : ISODate("2016-03-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a7"), "number" : 5, "dt" : ISODate("2016-03-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a8"), "number" : 6, "dt" : ISODate("2016-03-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9a9"), "number" : 7, "dt" : ISODate("2016-03-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9aa"), "number" : 8, "dt" : ISODate("2016-03-01T00:00:00Z") }
{ "_id" : ObjectId("56f260c3260cd83acf7ea9ab"), "number" : 9, "dt" : ISODate("2016-03-01T00:00:00Z") }

Lets get a

var options = {out: {merge: "results"}, query: {dt: {$gte: ISODate("2016-02-01")}}}
db.test.mapReduce(mapFunction,reduceFunction, options);

db.results.find()

{ "_id" : "even", "value" : 3 }
{ "_id" : "odd", "value" : 3 }

This is the correct answer as there are 6 entries from 5-9 inclusive split equally between even and odd. To feed the results into another reduce function which would have the effect of merging the data you should use:

{out: {reduce: "mycollection_out"}}

You can optional keep the output data in json instead of bson by specifying:

{jsMode: true, out: {...}}

You can also store your data in a sharded collection

{out: {reduce: 'mycollection', sharded: true}}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment