Generate a dump of sentiment data using UCSD's database of Amazon reviews
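import gzip
import json
import subprocess
import sys

import pandas as pd

if __name__ == '__main__':
    # Number of qualifying reviews to collect, passed as the first CLI argument.
    target_count = int(sys.argv[1])
    data = list()
    # Each line of the gzipped file is one JSON-encoded review.
    with gzip.open("Downloads/Books_5.json.gz", "rt", encoding="utf-8") as reviews:
        for review_json in reviews:
            if len(data) >= target_count:
                break
            review = json.loads(review_json)
            # Keep only five-star reviews.
            if review['overall'] != 5.0:
                continue
            # Some reviews carry no text; skip them instead of raising KeyError.
            reviewText = review.get('reviewText')
            if not reviewText:
                continue
            command = [
                "gcloud",
                "ml",
                "language",
                "analyze-sentiment",
                "--content",
                reviewText
            ]
            output = subprocess.run(command, capture_output=True)
            # Skip reviews that the gcloud CLI failed to analyze.
            if output.returncode != 0:
                continue
            annotations = json.loads(output.stdout)
            document_sentiment = annotations['documentSentiment']
            score = document_sentiment['score']
            # Discard reviews whose sentiment score falls below 0.3.
            if score < 0.3:
                continue
            # Approximate the token count by splitting on single spaces.
            num_tokens = len(reviewText.split(" "))
            magnitude = document_sentiment['magnitude']
            data.append((num_tokens, score, magnitude, reviewText))
    df = pd.DataFrame(data, columns=["Tokens", "Score", "Magnitude", "Review Text"])
    df.to_csv("data.csv")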

akainth015 commented Aug 29, 2020

The data used by this script comes from the following paper. A write-up will follow later, along with the source code for Concept.

Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
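The filtering rules in the script (five-star rating, sentiment score of at least 0.3) can be checked without calling gcloud by pulling them into a pure function. The following is only a sketch: `extract_row`, `MIN_SCORE`, and the sample inputs are hypothetical names and values introduced for illustration, and the annotation dictionary assumes the same `documentSentiment` shape that the script reads.

```python
import json

# Hypothetical constant mirroring the 0.3 cutoff used in the script.
MIN_SCORE = 0.3


def extract_row(review, annotations, min_score=MIN_SCORE):
    """Return (tokens, score, magnitude, text), or None if the review is filtered out.

    `review` is one parsed line of the dataset; `annotations` is the parsed
    JSON that the script reads from the gcloud command's standard output.
    """
    text = review.get("reviewText")
    # Keep only five-star reviews that actually contain text.
    if review.get("overall") != 5.0 or not text:
        return None
    sentiment = annotations["documentSentiment"]
    # Drop reviews whose sentiment score is below the cutoff.
    if sentiment["score"] < min_score:
        return None
    # Same space-based token approximation as the script.
    return (len(text.split(" ")), sentiment["score"], sentiment["magnitude"], text)


# Made-up sample inputs, not taken from the dataset.
sample_review = json.loads('{"overall": 5.0, "reviewText": "Loved every page"}')
sample_annotations = {"documentSentiment": {"score": 0.9, "magnitude": 0.9}}
print(extract_row(sample_review, sample_annotations))
```

Keeping the rules in one function means the thresholds can be adjusted and tested on stored gcloud responses, without paying for a new API call per review.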
