@Ningensei848
Last active September 1, 2022 08:16
Retrieve tweets/mentions/likes in bulk
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "bulk_get_tweets_ja.ipynb",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyNhVYt/U6H/BOHgHeSRHI+V",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Ningensei848/e6de072a4612879d4ac5487ca84c26b7/bulk_get_tweets_ja.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# ツイート/メンション/いいねをまとめて取得する\n",
"\n",
"## これはなに?\n",
"\n",
"[Ningensei848/SATwi](https://github.com/Ningensei848/SATwi) においては、`dailyUpdate.py` が定期的に少しずつデータを取得していた。\n",
"一方で、一瞬で大量にデータを取得したいという需要もある。\n",
"\n",
"この Notebook では、`dailyUpdate.py` で課していた `start_time` の制約をなくし、Twitter API v2 における各エンドポイントの仕様上の上限までツイートを取得できるようにした。\n",
"\n",
"- tweet ... 3200 件\n",
"- mention ... 800 件\n",
"- like ... 7500 件\n",
"\n",
"ただし、like については `start_time` 制約ではなく、75 req / 15 min という Rate limit に由来するものである。もし実行中に上限に達してしまった場合、エラーの旨が表示されてそれ以上データ取得できなくなる。\n",
"\n",
"対応としては、`targetList.txt` をもとにした複数人実行ではなく、一人ずつ `UNIQUE_TARGET_ID` に指定して集めるというアプローチが堅実だろう。\n",
"\n",
"ただ、これについては 7500 件以上存在していても集めきれないという問題も残る(今後の課題)\n",
"\n",
"## 使い方\n",
"\n",
"〈認証情報の定義〉というコードセル内に、実行に必要な各種変数を入力するだけでよい。\n",
"\n",
"- GITHUB_USERNAME: github でのユーザ名を入力\n",
"- GITHUB_EMAIL: github でのメールアドレスを入力\n",
"- REPOSITORY_NAME: [SATwi](https://github.com/Ningensei848/SATwi) をインポートして作成した自前のプライベートリポジトリ名を入力\n",
"- GITHUB_TOKEN: 少なくとも `repo` の権限を付与した PAT を入力\n",
"- BEARER_TOKEN: Twitter の App Access Token a.k.a. `BEARER_TOKEN` を入力\n",
"\n",
"集めたいデータの種類について、チェックボックスにチェックすることで、そのデータを取得するように指定できる(チェックされない場合には、そのデータは取得されない)。\n",
"\n",
"- ENABLE_TWEETS: ツイートを集めたい場合は、チェックする\n",
"- ENABLE_MENTION: メンションを集めたい場合は、チェックする\n",
"- ENABLE_LIKED_TWEETS: いいね を集めたい場合は、チェックする\n",
"\n",
"基本的には自前のプライベートリポジトリ内にある `targetList.txt` に書かれた ID についてデータを集めるが、`UNIQUE_TARGET_ID` に別途 ID を指定することで、そのアカウントだけを対象としてデータを集めることができる。\n",
"\n",
"いいね の取得制限を避けたいときなどに利用すると良い。\n",
"\n",
"## 注意事項\n",
"\n",
"この notebook には、必然的に認証情報が含まれることになる。\n",
"\n",
"各種変数を入力して実行した後には、第三者に再度共有しないほうがいいだろう。\n",
"もし、**二次配布を希望する場合には、認証情報を消去してから行なうこと**。\n",
"\n",
"(製作者は責任取れませんのであしからず)\n"
],
"metadata": {
"id": "wQaVVBug2Mi3"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "q3jlkDxZ1lvm"
},
"outputs": [],
"source": [
"# @title 認証情報の定義\n",
"\n",
"# git clone https://github.com/username/repo.git\n",
"# Username: your_username\n",
"# Password: your_token\n",
"\n",
"# @markdown #### GitHub の情報を入力 → cf. [アクセストークンを発行する](https://github.com/settings/tokens)\n",
"GITHUB_USERNAME = \"Ningensei848\" # @param {\"type\": \"string\"}\n",
"REPOSITORY_NAME = \"SATwi-imported-private\" # @param {\"type\": \"string\"}\n",
"GITHUB_EMAIL = \"k.kubokawa@klis.tsukuba.ac.jp\" # @param {\"type\": \"string\"}\n",
"GITHUB_TOKEN = \"ghp_prjbnsahH84URhskgfanj__this_is_dummy_token__xxx\" # @param {\"type\": \"string\"}\n",
"OWNER_AND_REPO = f\"{GITHUB_USERNAME}/{REPOSITORY_NAME}\"\n",
"\n",
"# @markdown #### Twitter から `BEARER_TOKEN` の情報を入力 → cf. [Twitter Developer Portal](https://developer.twitter.com/en/portal/dashboard)\n",
"BEARER_TOKEN = \"AAAAAAAAAAAAAAAAAAAAAQzBcQEAAklswqDYprlQo8jG8gsdfDENI2y9_this_is_dummy_token__xxx\" # @param {\"type\": \"string\"}\n",
"\n",
"# @markdown ### ※上記の入力内容は**秘匿情報**である;他者に公開しないよう細心の注意を!\n",
"\n",
"# @markdown ---\n",
"\n",
"# @markdown #### オプション設定\n",
"# @markdown ###### `tweets` (対象が発信したツイート)を集めたい場合は以下にチェック\n",
"ENABLE_TWEETS = False #@param {type:\"boolean\"}\n",
"# @markdown ###### 集めたいツイートの規模を指定(最大 3200 まで)\n",
"MAX_RESULTS_TWEET = 100 # @param {type:\"integer\"}\n",
"\n",
"# @markdown ###### `mention` (メンションされたツイート)を集めたい場合は以下にチェック\n",
"ENABLE_MENTION = False #@param {type:\"boolean\"}\n",
"# @markdown ###### 集めたいメンションの規模を指定(最大 800 まで)\n",
"MAX_RESULTS_MENTION = 100 # @param {type:\"integer\"}\n",
"\n",
"# @markdown ###### `liked_tweets` (いいねしたツイート)を集めたい場合は以下にチェック\n",
"ENABLE_LIKED_TWEETS = False #@param {type:\"boolean\"}\n",
"# @markdown ###### 集めたい「いいね」の規模を指定(最大 7500 まで)\n",
"MAX_RESULTS_LIKE = 100 # @param {type:\"integer\"}\n",
"\n",
"# @markdown ---\n",
"\n",
"# @markdown #### `targetList.txt` を無視して、**特定の一人について情報を集めたい場合**、以下にユーザ ID を入力\n",
"UNIQUE_TARGET_ID = 0 #@param {type:\"integer\"}\n",
"\n"
]
},
{
"cell_type": "code",
"source": [
"# @title 必要な外部ライブラリ群を pip でインストール\n",
"# @markdown > WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:\n",
"\n",
"# @markdown などと注意されるが、ランタイムを再起動すれば実行環境はリセットできるはずなので無視\n",
"\n",
"%pip install --upgrade pip\n",
"%pip install --upgrade python-dotenv requests \n",
"%pip install requests-oauthlib tqdm commentjson\n"
],
"metadata": {
"cellView": "form",
"id": "55u2g5QMZcQS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title GitHub から個々人の SATwi リポジトリを clone してくる\n",
"\n",
"import subprocess\n",
"\n",
"%cd \"/content\"\n",
"proc = [\"git\", \"clone\", f\"https://{GITHUB_TOKEN}@github.com/{GITHUB_USERNAME}/{REPOSITORY_NAME}.git\"]\n",
"_ = subprocess.run(proc, encoding=\"utf-8\", stdout=subprocess.PIPE)\n",
"\n",
"%cd \"/content/$REPOSITORY_NAME\"\n"
],
"metadata": {
"cellView": "form",
"id": "3GZPMyspXF3Q"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 必要なライブラリ群をインポート\n",
"\n",
"import os\n",
"import re\n",
"import time\n",
"import json\n",
"import urllib.request\n",
"from datetime import datetime, timedelta, timezone\n",
"from pathlib import Path\n",
"\n",
"import requests\n",
"import commentjson\n",
"from tqdm import tqdm\n",
"# @markdown > Import \"script.lib\" could not be resolved(reportMissingImports)\n",
"\n",
"# @markdown などと下線が引かれるが、問題なくインストール出来るはずなので無視\n",
"from script.lib import createTimelinesUrl, saveAsJSON\n"
],
"metadata": {
"cellView": "form",
"id": "_MDJNTogZTvg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 関数定義:gitCommit()\n",
"# @markdown git コマンドを定義して、関数としてひとまとめにする\n",
"\n",
"def makeCommands():\n",
" dt = datetime.now(timezone(timedelta(hours=9))).strftime(\"%Y-%m-%d %H:%M:%S\")\n",
" git_config_name = [\"git\", \"config\", \"--local\", \"user.name\", GITHUB_USERNAME]\n",
" git_config_email = [\"git\", \"config\", \"--local\", \"user.email\", GITHUB_EMAIL]\n",
" git_add = [\"git\", \"add\", \".\"]\n",
" git_commit = [\"git\", \"commit\", \"-m\", f\"Update: at {dt}\"]\n",
" git_pull = [\"git\", \"pull\", \"--rebase\"]\n",
" git_gc = [\"git\", \"gc\", \"--prune=all\"]\n",
" git_push = [\"git\", \"push\"]\n",
"\n",
" return [git_config_name, git_config_email, git_add, git_commit,git_pull, git_gc, git_push]\n",
"\n",
"\n",
"def gitCommit():\n",
" for proc in makeCommands():\n",
" res = subprocess.run(proc, encoding=\"utf-8\", capture_output=True, text=True)\n",
" if len(res.stderr):\n",
" print(res.stderr)\n"
],
"metadata": {
"cellView": "form",
"id": "5Qg2bQay57N6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 関数定義 isPrivate()\n",
"# @markdown リポジトリの公開範囲を調べる\n",
"\n",
"def isPrivate():\n",
"\n",
" url = f\"https://api.github.com/repos/{OWNER_AND_REPO}\"\n",
" req = urllib.request.Request(url)\n",
" req.headers = {\"Accept\": \"application/vnd.github+json\", \"Authorization\": f\"token {GITHUB_TOKEN}\"}\n",
"\n",
" res = urllib.request.urlopen(req)\n",
" content = json.loads(res.read().decode(\"utf-8\"))\n",
" return content[\"private\"]\n"
],
"metadata": {
"cellView": "form",
"id": "XRsoDef9VLdA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 関数定義:connectEndpoint()\n",
"\n",
"def bearerOAuth(r):\n",
" \"\"\"\n",
" Method required by bearer token authentication.\n",
" \"\"\"\n",
"\n",
" r.headers[\"Authorization\"] = f\"Bearer {BEARER_TOKEN}\"\n",
" r.headers[\"User-Agent\"] = \"v2UserTweetsPython\"\n",
" return r\n",
"\n",
"\n",
"def connectEndpoint(url, params):\n",
" response = requests.request(\"GET\", url, auth=bearerOAuth, params=params)\n",
" # print(response.status_code)\n",
" if response.status_code != 200:\n",
" raise Exception(\"Request returned an error: {} {}\".format(response.status_code, response.text))\n",
" return response.json()\n"
],
"metadata": {
"cellView": "form",
"id": "nT-CSrqz2k4_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 関数定義:getParams()\n",
"# @markdown `queryParameters.json` から必要なパラメータを読み込む\n",
"\n",
"def convertListToStr(list_):\n",
" return \",\".join(list_) if type(list_) is list else str(list_)\n",
"\n",
"\n",
"def getParams(pagination_token: str = None):\n",
" filepath = cwd / \"queryParameters.json\"\n",
" config = commentjson.loads(filepath.read_text())\n",
" param_fields = [\n",
" \"expansions\",\n",
" \"tweet.fields\",\n",
" \"media.fields\",\n",
" \"place.fields\",\n",
" \"poll.fields\",\n",
" ]\n",
" param_dict = {k: convertListToStr(v) for k, v in config.items() if k in param_fields}\n",
" param_dict.update(\n",
" {\n",
" \"max_results\": 100,\n",
" # \"start_time\": START_TIME.isoformat(timespec=\"seconds\") + \"Z\",\n",
" \"pagination_token\": pagination_token,\n",
" }\n",
" )\n",
"\n",
" return param_dict\n"
],
"metadata": {
"cellView": "form",
"id": "IdIXty9tpTZi"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 関数定義:procedure()\n",
"# @markdown 繰り返し使われる処理をまとめたもの\n",
"\n",
"def procedure(user_id: int, result_count: int, endpoint=\"tweets\", next_token=None):\n",
" url = createTimelinesUrl(user_id, endpoint)\n",
" params = getParams() if next_token is None else getParams(next_token)\n",
"\n",
" if result_count == 0:\n",
" return\n",
" elif result_count - 100 < 0:\n",
" params[\"max_results\"] = 100 - result_count\n",
" result_count = 0\n",
" else:\n",
" result_count -= 100\n",
"\n",
" try:\n",
" json_response = connectEndpoint(url, params)\n",
" except Exception as e:\n",
" print('-' * 80 + '\\n\\tERROR at connectEndpoint(url, params)\\n' + '-' * 80)\n",
" print(e)\n",
" print('-' * 80 + '\\n\\n')\n",
" return\n",
"\n",
" if \"data\" not in json_response:\n",
" print(f\"`data` not found. user_id is {user_id} and endpoint is {endpoint}\")\n",
" return\n",
" else:\n",
" saveAsJSON(user_id, endpoint, json_response)\n",
"\n",
" if \"meta\" in json_response and \"next_token\" in json_response[\"meta\"]:\n",
" pagination_token = json_response[\"meta\"][\"next_token\"]\n",
" time.sleep(3) # wait 3 seconds\n",
" procedure(user_id, result_count, endpoint, pagination_token)\n",
"\n",
" return\n"
],
"metadata": {
"cellView": "form",
"id": "mGusuT6Ulq4-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title main()\n",
"# @markdown `targetList.txt` から対象の ID 一覧を読み出して実行\n",
"\n",
"# @markdown ただし、`UNIQUE_TARGET_ID` が指定されている場合は、その一人だけを対象に取る\n",
"\n",
"# 正規表現\n",
"pattern_user_id = re.compile(r\"\\d+\")\n",
"\n",
"cwd = Path.cwd()\n",
"source = cwd / \"targetList.txt\"\n",
"\n",
"for id in source.read_text().split(\"\\n\"):\n",
" print(id)\n",
"\n",
"target_id_list = [\n",
" int(pattern_user_id.match(id)[0])\n",
" for id in source.read_text().split(\"\\n\")\n",
" if len(id) > 0 and pattern_user_id.match(id) is not None\n",
"]\n",
"\n",
"# 特定の一人についてデータを集める場合、`targetList.txt` は無視する\n",
"if UNIQUE_TARGET_ID:\n",
" print(\"But target(s) in targetList.txt above is ignored.\")\n",
" print(f\"We collecting data about {UNIQUE_TARGET_ID}\")\n",
" target_id_list = [ UNIQUE_TARGET_ID ]\n",
"\n",
"for user_id in tqdm(target_id_list):\n",
" if ENABLE_TWEETS:\n",
" print(f\"\\nNow we are currently collecting {user_id}'s Tweets ...\\n\")\n",
" procedure(user_id, MAX_RESULTS_TWEET, endpoint=\"tweets\")\n",
" if ENABLE_MENTION:\n",
" print(f\"\\nNow we are currently collecting {user_id}'s Mentions ...\\n\")\n",
" procedure(user_id, MAX_RESULTS_MENTION, endpoint=\"mentions\")\n",
" if ENABLE_LIKED_TWEETS:\n",
" print(f\"\\nNow we are currently collecting {user_id}'s Liked Tweets ...\\n\")\n",
" procedure(user_id, MAX_RESULTS_LIKE, endpoint=\"liked_tweets\")\n"
],
"metadata": {
"cellView": "form",
"id": "IjrD3FxUj8On"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 最後に...\n",
"# @markdown リポジトリにプッシュして完成(コメントアウトされない限り、コミットは実行されない)\n",
"\n",
"# if isPrivate():\n",
"# gitCommit()\n",
"# else:\n",
"# print(\"リポジトリの公開範囲が Private ではありません!\")\n",
"# print(\"このまま Push すると Twitter の規約違反であり、\")\n",
"# print(\"さらに【著作権侵害】に抵触します!!\")\n",
"# print(\"---------------------------------------------------------\")\n",
"# print(\"データを保存したい場合、今すぐ公開設定を変更してください\")\n"
],
"metadata": {
"cellView": "form",
"id": "jOH_Z0uZLgWY"
},
"execution_count": null,
"outputs": []
}
]
}