Generate a dump of sentiment data using UCSD's database of Amazon reviews
import gzip
import json
import subprocess
import sys

import pandas as pd

if __name__ == '__main__':
    # Stream the gzip-compressed review file; each line holds one JSON record
    reviews = gzip.open("Downloads/Books_5.json.gz", "r")
    data = list()
    # Keep collecting until the count given as the first CLI argument is reached
    while len(data) < int(sys.argv[1]):
        review_json = next(reviews)
        review = json.loads(review_json)
        # Skip 5-star reviews
        if review['overall'] != 5.0:
            reviewText = review['reviewText']
            command = [
            output = subprocess.run(command, capture_output=True)
            # The command prints its sentiment annotations as JSON on stdout
            annotations = json.loads(output.stdout)
            document_sentiment = annotations['documentSentiment']
            score = document_sentiment['score']
            # Keep only reviews with a low sentiment score
            if score < 0.3:
                num_tokens = len(reviewText.split(" "))
                magnitude = document_sentiment['magnitude']
                data.append((num_tokens, score, magnitude, reviewText))
    df = pd.DataFrame(data, columns=["Tokens", "Score", "Magnitude", "Review Text"])
akainth015 commented Aug 29, 2020

The data used by this script comes from the following paper. A write-up, along with the source code for Concept, will follow later.

Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
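
For anyone who wants to try the selection logic without the external sentiment command, here is a minimal sketch of the same filter as a standalone function. It is an illustration, not the script itself: `collect_negative_reviews`, the `analyze` callable, and the `limit` and `max_score` parameters are hypothetical names. `analyze` is assumed to return a dict with the same `documentSentiment` shape the script reads. Unlike the script, this sketch skips records that have no `reviewText` and stops quietly when the input runs out.

```python
import json
from typing import Callable, Iterable, List, Tuple

# (token count, sentiment score, sentiment magnitude, review text)
Row = Tuple[int, float, float, str]


def collect_negative_reviews(
    lines: Iterable[str],
    analyze: Callable[[str], dict],
    limit: int,
    max_score: float = 0.3,
) -> List[Row]:
    """Collect up to `limit` non-5-star reviews scoring below `max_score`.

    `analyze` is a stand-in (assumption) for whatever produces the
    {'documentSentiment': {'score': ..., 'magnitude': ...}} annotations.
    """
    rows: List[Row] = []
    for line in lines:
        if len(rows) >= limit:
            break
        review = json.loads(line)
        text = review.get("reviewText")
        # Skip 5-star reviews, and (an addition here) records without text
        if review.get("overall") == 5.0 or not text:
            continue
        sentiment = analyze(text)["documentSentiment"]
        if sentiment["score"] < max_score:
            num_tokens = len(text.split(" "))
            rows.append((num_tokens, sentiment["score"], sentiment["magnitude"], text))
    return rows
```

Passing an open handle such as `gzip.open("Downloads/Books_5.json.gz", "rt")` as `lines` streams the file one record at a time, and the returned rows can go straight into `pd.DataFrame` with the same four column names the script uses.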

