@plallin · Created March 14, 2017 21:55
"""
The account was scraped using the following Twitter library: https://github.com/sixohsix/twitter
I prefer this library over Tweepy as it is much "closer" to the API (less abstraction).
The way the following script works is basically as follows:
- First, I get all of the user's friend (I get max 200 friend per call) and store their the friends' names in a list
- Once I have the list of all the user's friends, I iterate through it to get the friends of the friends
- In a second script, I count the number of times a particular username occurs, and then return the sorted list of
friends (the most followed ones).
- That's it, more or less
AltShiftX needed his personal account scraped (not his account AltShiftX). The account @AltShiftX can be scraped in less
than 45 calls which takes about 30 minutes. He follows ~420 people on his personal account, so it took longer.
I made some calculations and the minimum time it could take (that is, if all his friends followed < 200 people) is
7.5 hours. The maximum time it could take (that is if all your friends followed > 1000 people as I'm only getting a list
up to a 1000) is about 33.5. I ran the script from my Raspberry Pi starting Friday 3/4pm and the script was finished
on Saturday circa 1pm.
There are way to improve this: if I had used friend/ids instead of friends/list, I could have scraped 5,000 friends per
call, so for example, 420 friends take 3 friends/lists calls (200+200+20) but a single friend/ids call. The disadvantage
is that friends/list give you the full details of the accounts scraped, while ids only return their id number. However,
You could then use users/lookup to get the full details of up to 100 ids in 1 call, so it would be pretty fast to get
a decent amount of details as 15 calls will get you 1500 (100 * 15) user details, which is a lot :-)
"""
import time
import json
from twitter import *
import datetime
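
# A minimal sketch of the friends/ids + users/lookup improvement described in
# the docstring. This is an illustration only (the script below does not use
# it): the helper name is made up, and it assumes the sixohsix library's
# generic URL mapping, where t.friends.ids calls GET friends/ids and
# t.users.lookup calls GET users/lookup.
def fetch_friends_via_ids(t, screen_name):
    """Fetch all friend IDs (up to 5,000 per call), then hydrate them
    in batches of 100 via users/lookup."""
    ids = []
    cursor = -1  # same cursor convention as friends/list: 0 means no next page
    while cursor != 0:
        page = t.friends.ids(screen_name=screen_name, cursor=cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    users = []
    for i in range(0, len(ids), 100):  # users/lookup accepts up to 100 IDs per call
        batch = ",".join(str(user_id) for user_id in ids[i:i + 100])
        users.extend(t.users.lookup(user_id=batch))
    return users
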
print("Starting to scrap! " + str(datetime.datetime.now()))
with open("config.json") as data_file: # just a config file I have to save my keys
data = json.load(data_file)
data = data["MyAccountName"] # Account whose keys I will use (I have a couple of accounts)
CONSUMER_KEY = data["consumer_key"]
SECRET_CONSUMER_KEY = data["secret_consumer_key"]
ACCESS_TOKEN = data["access_token"]
SECRET_ACCESS_TOKEN = data["secret_access_token"]
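
# For reference, config.json is assumed to look something like this (the key
# names are taken from the lookups above; "MyAccountName" and the "..." values
# are placeholders):
# {
#     "MyAccountName": {
#         "consumer_key": "...",
#         "secret_consumer_key": "...",
#         "access_token": "...",
#         "secret_access_token": "..."
#     }
# }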
t = Twitter(auth=OAuth(ACCESS_TOKEN, SECRET_ACCESS_TOKEN, CONSUMER_KEY, SECRET_CONSUMER_KEY))
account_to_be_scraped = "Insert account to be scraped here"
file_out = "recommendations.txt"  # the friends of friends will be appended to this file
friends = []  # holds the list of friends of account_to_be_scraped
next_page_loc = -1  # for cursor purposes; a cursor of 0 means there is no next page
friends_list = t.friends.list(screen_name=account_to_be_scraped,
                              count=200,
                              skip_status=True)

while next_page_loc != 0:
    next_page_loc = friends_list['next_cursor']  # get location of the next page
    for friend in friends_list['users']:
        friends.append(friend['screen_name'])
    if next_page_loc == 0:
        break  # we reached the end of the user's friend list
    else:
        friends_list = t.friends.list(screen_name=account_to_be_scraped,
                                      cursor=next_page_loc,
                                      count=200,
                                      skip_status=True)
with open(file_out, "a") as out:
for friend in friends:
out.write(friend + "\n")

for friend in friends:
    friend_friends_list = t.friends.list(screen_name=friend,
                                         count=200,
                                         skip_status=True)
    count = 0  # for friends following a large number of people; we only scrape the first 1,000 (5 * 200)
    while count < 5:
        next_page_loc = friend_friends_list['next_cursor']
        for follow in friend_friends_list['users']:
            with open(file_out, "a") as out:
                out.write(follow['screen_name'] + "\n")
        remaining_calls = t.application.rate_limit_status(resources="friends")["resources"]["friends"]["/friends/list"]["remaining"]
        if remaining_calls <= 1:
            print("Got sleepy while scraping {}'s data...".format(friend))
            time.sleep(60 * 15)  # wait 15 minutes for the API limit to replenish
            remaining_calls = t.application.rate_limit_status(resources="friends")["resources"]["friends"]["/friends/list"]["remaining"]
            print("Sleep over! I now have {} calls left".format(remaining_calls))
        if next_page_loc == 0:
            break  # end of this friend's list
        else:
            friend_friends_list = t.friends.list(screen_name=friend,
                                                 cursor=next_page_loc,
                                                 count=200,
                                                 skip_status=True)
        count += 1

print("Over and out :-) " + str(datetime.datetime.now()))