{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# twut walkthrough\n",
"\n",
"How to get here?\n",
"\n",
"We'll assume you have the [Anaconda distribution](https://www.anaconda.com/) installed, or at least Python 3.7+ and Jupyter Notebooks.\n",
"\n",
"\n",
"```\n",
"$ git clone https://github.com/archivesunleashed/twut.git\n",
"$ cd twut\n",
"$ mvn clean install\n",
"\n",
"$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook /path/to/spark-3.0.0-preview-bin-hadoop2.7/bin/pyspark --py-files /path/to/twut/target/twut.zip --driver-class-path /path/to/twut/target/twut-0.0.1-SNAPSHOT-fatjar.jar --jars /path/to/twut/target/twut-0.0.1-SNAPSHOT-fatjar.jar\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's import `twut`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from twut import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, to use `twut` we will need to load in some line-oriented JSON twitter data as a DataFrame. We have three example resources included in the repo that come from the Twitter Sample API using the [`sample`](https://github.com/docnow/twarc#sample) command in [`twarc`](https://github.com/docnow/twarc). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"/home/nruest/Projects/au/twut/src/test/resources/500-sample.jsonl\"\n",
"df = spark.read.json(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've loaded up 500 tweets from the Sample API to work with, and we can access them in a DataFrame using the variable `df`.\n"
]
},
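{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, \"line-oriented JSON\" means one tweet object per line. The cell below is a minimal plain-Python sketch of that layout, not how `twut` or Spark read the file. The two sample lines are hypothetical, heavily trimmed tweets, and the `entities.hashtags[].text` path assumes the standard Twitter API v1.1 tweet format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Hypothetical, heavily trimmed tweets: one JSON object per line.\n",
"sample_lines = [\n",
"    '{\"id_str\": \"1\", \"entities\": {\"hashtags\": [{\"text\": \"FakeNews\"}]}}',\n",
"    '{\"id_str\": \"2\", \"entities\": {\"hashtags\": []}}',\n",
"]\n",
"\n",
"# Parse each line on its own, then collect the text of every hashtag.\n",
"tweets = [json.loads(line) for line in sample_lines]\n",
"tags = [tag[\"text\"] for tweet in tweets for tag in tweet[\"entities\"][\"hashtags\"]]\n",
"print(tags)"
]
},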
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the hashtags, and assign the hashtags DataFrame that `twut` will create for us to variable, `hashtags`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"hashtags = SelectTweet.hashtags(df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------+\n",
"|hashtags |\n",
"+------------------------+\n",
"|安元江口と夜あそび |\n",
"|DavidoDidntCum |\n",
"|DEMoniocratas |\n",
"|もっとホットなクリスマス|\n",
"|Tenerife |\n",
"|مساءالخير |\n",
"|FakeNews |\n",
"|CyberMonday |\n",
"|INEC |\n",
"|killarney |\n",
"+------------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"hashtags.show(10, False)"
]
},
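{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible next step, the hashtags can be ranked by frequency with plain PySpark. This is a sketch rather than a `twut` feature. It assumes the DataFrame above has a single column named `hashtags`, as the output suggests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"# Count how often each hashtag occurs and list the most frequent first.\n",
"# Assumes the column is named `hashtags`, as in the output above.\n",
"hashtag_counts = hashtags.groupBy(\"hashtags\").count().orderBy(F.desc(\"count\"))\n",
"hashtag_counts.show(10, False)"
]
},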
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about images?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"image_urls = SelectTweet.imageUrls(df)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------------------------------+\n",
"|image_url |\n",
"+-----------------------------------------------+\n",
"|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|\n",
"|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|\n",
"|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|\n",
"|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|\n",
"|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|\n",
"|https://pbs.twimg.com/media/EKwsNH-UYAAJuxZ.jpg|\n",
"|https://pbs.twimg.com/media/EKyZ3k2VUAEMltk.jpg|\n",
"|https://pbs.twimg.com/media/EKxI3nPVUAEkhee.jpg|\n",
"|https://pbs.twimg.com/media/EKyaQk0WsAAvsyP.jpg|\n",
"|https://pbs.twimg.com/media/EKyat3IWoAAJgq2.jpg|\n",
"+-----------------------------------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"image_urls.show(10, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about filtering out retweets?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"230"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"no_retweets = FilterTweet.removeRetweets(df)\n",
"no_retweets.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do the users look like?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"|favourites_count|followers_count|friends_count|id_str |location |name |screen_name |statuses_count|verified|\n",
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"|8302 |101 |133 |1027887558032732161|nct🌱 |车美 |M_chemei |3720 |false |\n",
"|2552 |73 |218 |2548066344 |null |ひーこ☆禿げても愛せ |heeko_gr_029 |15830 |false |\n",
"|4305 |1715 |98 |715850628 |0179.Kuwait♡دار جابر |Danahdenou |aldanah_94 |74967 |false |\n",
"|1870 |337 |53 |1081163420748046337|null |翔 |yoyoyopisannn |10702 |false |\n",
"|1544 |246 |240 |703120446 |Rio de Janeiro, Brasil|vilixo |vinismachadoo |16273 |false |\n",
"|2331 |91 |83 |973424490934714368 |日本 山口 |イサオ(^^)最近ディスクにハマル🎵|isao777sp2 |2137 |false |\n",
"|34258 |366 |562 |716598636247777281 |Johore, Malaysia |kimî |kimeowmy |46484 |false |\n",
"|0 |24 |7 |2587221716 |液晶の裏側 |貞子ちゃんbot |sadako_okadas |67549 |false |\n",
"|115 |123 |149 |1221632856 |TANJUNGPINANG |Dian Ramadita |dian05ramadita|631 |false |\n",
"|28 |5 |141 |1051802857610072064|Bayern, Deutschland |matias |matyas_0385 |32 |false |\n",
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"users = SelectTweet.userInfo(no_retweets)\n",
"users.show(10, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}