@dylankilkenny
Last active July 21, 2020 22:57
import json

import requests


def getPushshiftData(after, sub):
    # Request up to 1000 submissions from `sub` created after the `after` timestamp
    url = 'https://api.pushshift.io/reddit/search/submission?size=1000&after='+str(after)+'&subreddit='+str(sub)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']


# List of post IDs
post_ids = []
# Subreddit to query
sub = 'btc'
# Unix timestamp of the date to crawl from.
# 2018/04/01
after = "1522618956"

data = getPushshiftData(after, sub)

# Runs until all posts have been gathered,
# from the 'after' date up until today's date
while len(data) > 0:
    for submission in data:
        post_ids.append(submission["id"])
    # Call getPushshiftData() with the created date of the last submission
    data = getPushshiftData(sub=sub, after=data[-1]['created_utc'])

obj = {}
obj['sub'] = sub
obj['id'] = post_ids

# Save to JSON for later use
with open("submissions.json", "w") as jsonFile:
    json.dump(obj, jsonFile)
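
The script hard-codes its starting point as a Unix timestamp. As a rough sketch of how such a value relates to a calendar date, the snippet below converts in both directions using only the standard library. The helper names `to_unix_timestamp` and `from_unix_timestamp` are hypothetical and not part of the gist, and the conversion assumes the timestamp is interpreted as UTC.

```python
from datetime import datetime, timezone


def to_unix_timestamp(year, month, day):
    # Hypothetical helper: midnight UTC on the given date, as whole seconds
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def from_unix_timestamp(ts):
    # Hypothetical helper: accepts an int or a numeric string, returns an aware UTC datetime
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)


start_of_day = to_unix_timestamp(2018, 4, 1)
crawl_start = from_unix_timestamp("1522618956")

print(start_of_day)             # 1522540800
print(crawl_start.isoformat())  # 2018-04-01T21:42:36+00:00
```

Under that assumption, the value used in the script falls late on 2018/04/01 rather than at midnight, so passing `start_of_day` instead would also pick up posts from earlier that day.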
@novice95

Hey,
I was just going through your code. Could you please tell me what the size parameter does in the url on line 6 of the code above?

@pvwalke

pvwalke commented Feb 2, 2020

> Hey,
> I was just going through your code. Could you please tell me what the size parameter does in the url on line 6 of the code above?

Size is the "limit of returned entries", that is, the maximum number of entries a single request returns.
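
To make that concrete, here is a minimal sketch of where `size` sits in the query string next to `after` and `subreddit`. The helper `build_search_url` is hypothetical and not part of the gist. It only builds the URL and sends no request, and the value 100 is an arbitrary example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_search_url(after, sub, size=1000):
    # Hypothetical helper: assemble the same query string the gist builds by hand.
    # `size` caps how many entries one request asks for.
    query = urlencode({"size": size, "after": after, "subreddit": sub})
    return f"{BASE_URL}?{query}"


url = build_search_url(after="1522618956", sub="btc", size=100)
params = parse_qs(urlsplit(url).query)

print(url)
print(params["size"])  # ['100']
```

Because each request is capped this way, the gist keeps calling the function with the `created_utc` of the last submission it received until an empty list comes back.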
