Generate a dump of sentiment data using UCSD's database of Amazon reviews
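import gzip
import json
import subprocess
import sys

import pandas as pd

if __name__ == '__main__':
    # Stream the gzip-compressed reviews; the file holds one JSON object per line.
    reviews = gzip.open("Downloads/Books_5.json.gz", "r")
    data = list()
    # The first command-line argument is the number of rows to collect.
    while len(data) < int(sys.argv[1]):
        review_json = next(reviews)
        review = json.loads(review_json)
        # Skip five-star reviews.
        if review['overall'] != 5.0:
            reviewText = review['reviewText']
            command = [
                # (the command's arguments are missing from the source)
            ]
            # Run the sentiment command and parse the JSON it prints to stdout.
            output = subprocess.run(command, capture_output=True)
            annotations = json.loads(output.stdout)
            document_sentiment = annotations['documentSentiment']
            score = document_sentiment['score']
            # Keep only reviews whose sentiment score is below 0.3.
            if score < 0.3:
                num_tokens = len(reviewText.split(" "))
                magnitude = document_sentiment['magnitude']
                data.append((num_tokens, score, magnitude, reviewText))
    df = pd.DataFrame(data, columns=["Tokens", "Score", "Magnitude", "Review Text"])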
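
The loop above can also be written as small functions that are testable without the external sentiment command. The following is a minimal sketch, not the author's code: `iter_reviews`, `to_row`, `collect_rows`, and the `analyze` callback are hypothetical names introduced here, and `analyze` is assumed to return a dict shaped like the `annotations` object in the script. Skipping reviews that have no `reviewText` field is an added defensive assumption.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def iter_reviews(path):
    """Yield one review dict per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the archive in text mode, so every line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def to_row(review_text, annotations, max_score=0.3):
    """Return (tokens, score, magnitude, text), or None when score >= max_score."""
    sentiment = annotations["documentSentiment"]
    if sentiment["score"] >= max_score:
        return None
    # Same token count as the script: split on single spaces.
    num_tokens = len(review_text.split(" "))
    return (num_tokens, sentiment["score"], sentiment["magnitude"], review_text)


def collect_rows(path, analyze, limit):
    """Collect up to `limit` rows from reviews that are not rated 5.0.

    `analyze` is a hypothetical callback standing in for the sentiment command.
    """
    candidates = (
        review for review in iter_reviews(path)
        # Assumption: reviews without a reviewText field are skipped.
        if review["overall"] != 5.0 and "reviewText" in review
    )
    rows = (to_row(r["reviewText"], analyze(r["reviewText"])) for r in candidates)
    # islice stops at `limit` rows, or earlier if the file runs out.
    return list(islice((row for row in rows if row is not None), limit))


if __name__ == "__main__":
    # Tiny made-up sample so the sketch runs without the real dataset.
    samples = [
        {"overall": 5.0, "reviewText": "loved it"},
        {"overall": 2.0, "reviewText": "really bad book"},
    ]
    fd, sample_path = tempfile.mkstemp(suffix=".json.gz")
    os.close(fd)
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        for sample in samples:
            out.write(json.dumps(sample) + "\n")

    def fake_analyze(text):
        # Canned response; a real run would call the sentiment service instead.
        return {"documentSentiment": {"score": -0.5, "magnitude": 1.2}}

    print(collect_rows(sample_path, fake_analyze, limit=10))
    os.remove(sample_path)
```

The returned list of tuples can be passed to `pd.DataFrame` with the same four column names as in the script. Unlike a bare `next(reviews)`, this version returns whatever it has collected when the file ends before `limit` rows are found.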

akainth015 commented Aug 29, 2020

The data used for this script is sourced from the following paper. A write-up will be created later on, as well as the source code for Concept.

Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
