@scarnecchia
Last active January 19, 2020 20:44
MongoDB's tools for subsetting document collections are not working.

This properly subsets a collection of documents in MongoDB Compass Community:

{ 'created_at': { $gte: 'Sat Dec 21 00:00:00 +0000 2019', $lte: 'Thu 31 Dec 05:00:00 +0000 2019'}}

The same query in Python 3.6 does not; it returns the entire collection:

df = pd.DataFrame(list(collection.find({ 'created_at': { '$gte': 'Sat Dec 21 00:00:00 +0000 2019', '$lte': 'Thu 31 Dec 05:00:00 +0000 2019' } })))

Converting the date to an ISODate does not work either, nor does using '$match' in the aggregation pipeline.
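The bounds in both queries are plain strings, so if created_at is also stored as a string, $gte and $lte compare the values character by character rather than chronologically. The sketch below illustrates that mismatch in plain Python, with no database involved. The sample values and the strptime format are assumptions based on the date strings shown above, and parse_created_at is a hypothetical helper.

```python
from datetime import datetime

# Assumed layout of the created_at strings, e.g. 'Sat Dec 21 00:00:00 +0000 2019'.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Hypothetical helper: parse a created_at string into an aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


lower_bound = "Sat Dec 21 00:00:00 +0000 2019"
# Made-up sample value that is chronologically BEFORE the lower bound.
too_early = "Tue Jan 01 00:00:00 +0000 2019"

# String comparison starts with the weekday name, so 'Tue' sorts after 'Sat'
# and the early document looks as if it satisfies the lower bound.
passes_as_string = too_early >= lower_bound

# Comparing parsed datetimes gives the chronological answer instead.
passes_as_date = parse_created_at(too_early) >= parse_created_at(lower_bound)

print(passes_as_string, passes_as_date)
```

If the two results disagree for values in the collection, the range filter is operating on strings and the field needs to be converted before filtering.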

scarnecchia commented Jan 19, 2020

Solution: created_at and timestamp_ms are stored as strings. The solution I've hit on is to use $dateFromString in the $project pipeline stage to convert created_at to an ISO date, then use $match to filter on that date. (I still can't get it to filter on both bounds of the date range, which is likely user error, but this works for my purposes for now.)

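import datetime as dt

import pandas as pd

# `collection` is the same collection handle used in the question above.

# Stage 1: keep the fields of interest and parse the created_at string into a date.
projection = {
    '$project': {
        '_id': 0,
        'id': 1,
        'created_at': {
            '$dateFromString': {
                'dateString': '$created_at'
            }
        },
        'user': 1,
        'entities': 1,
        'lang': 1,
        'text': 1,
        'retweeted_status': 1
    }
}

# Stage 2: keep documents created on or before 2020-01-01 00:00:00 UTC.
match = {
    '$match': {
        'created_at': {
            '$lte': dt.datetime(2020, 1, 1, 0, 0, 0, tzinfo=dt.timezone.utc)
        }
    }
}

# The projection must run first so that $match compares dates, not strings.
cursor = collection.aggregate([projection, match])
df = pd.DataFrame(list(cursor))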
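For filtering on both bounds, a minimal sketch follows; it has not been run against the original collection. It puts $gte and $lte in the same operator document under a single 'created_at' key. One possible cause of a one-sided filter (a guess, not something confirmed above) is writing 'created_at' twice in a Python dict literal, where the second key silently replaces the first. build_date_range_pipeline is a hypothetical helper, and $addFields is swapped in for $project here only to keep the example short.

```python
import datetime as dt


def build_date_range_pipeline(start, end):
    """Hypothetical helper: return aggregation stages that parse created_at
    and keep documents with start <= created_at <= end."""
    # Overwrite created_at with a parsed date; all other fields pass through.
    convert = {
        '$addFields': {
            'created_at': {'$dateFromString': {'dateString': '$created_at'}}
        }
    }
    # Both bounds live in ONE operator document under ONE 'created_at' key.
    match = {
        '$match': {
            'created_at': {'$gte': start, '$lte': end}
        }
    }
    return [convert, match]


# Example bounds, loosely based on the dates in the question.
start = dt.datetime(2019, 12, 21, tzinfo=dt.timezone.utc)
end = dt.datetime(2020, 1, 1, tzinfo=dt.timezone.utc)
pipeline = build_date_range_pipeline(start, end)
print(pipeline)
```

The resulting list can be passed to collection.aggregate(pipeline) in place of [projection, match], and the cursor loaded into a DataFrame as before.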
