@nicflores
Last active March 8, 2019 21:59

Data Science on JSON using SlamData REFORM and RStudio or Jupyter Notebooks

Obstacles in doing/learning Data Science

Whether you're a professional data scientist or studying to become a data scientist, you'll likely need to work with a JSON dataset. JSON isn't easy to work with. It’s not tabular, and you can't just push it to a SQL database -- at least not without a “bit” of work.

Tabularizing JSON data usually requires writing Python scripts. And because any two JSON datasets are likely to be structured very differently, the code you write for one is unlikely to be reusable for the next.
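To illustrate why such scripts tend not to be reusable, here is a minimal sketch of a one-off flattener. The record shape (a game with nested `home` and `away` objects) and all field names are hypothetical, invented for illustration; they are not the actual schema of any dataset mentioned in this post.

```python
import json

# Hypothetical record shape, invented for illustration only.
SAMPLE = """
[
  {"game_id": 1,
   "home": {"name": "A", "score": 101},
   "away": {"name": "B", "score": 99}},
  {"game_id": 2,
   "home": {"name": "C", "score": 88},
   "away": {"name": "A", "score": 90}}
]
"""


def tabularize(games):
    """Flatten each nested game record into one flat row (a dict)."""
    rows = []
    for game in games:
        rows.append({
            "game_id": game["game_id"],
            # Every nested key path is hard-coded to this one layout.
            "home_name": game["home"]["name"],
            "home_score": game["home"]["score"],
            "away_name": game["away"]["name"],
            "away_score": game["away"]["score"],
        })
    return rows


rows = tabularize(json.loads(SAMPLE))
for row in rows:
    print(row)
```

Because each key path is written out by hand, a dataset with different nesting would need a different script, which is the kind of throwaway work this post is about.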

As a professional data scientist, you have to factor the time spent tabularizing your JSON into your report delivery. You may have the option to use your company's resources to deliver a tabular form of your complicated JSON. Either way, somewhere along your workflow, someone is spending time and money writing throwaway Python code to tabularize JSON.

If you are learning data science, then you've likely bumped into blogs such as this one. The article spends approximately 80% of its content showing how to explore JSON data using Linux commands and Python code. It isn't until the last few paragraphs that the article dives into answering questions about actually doing something with the data.

What if you didn't have to spend your time writing one-off Python scripts to tabularize your JSON data? You might deliver insight more quickly, or more frequently, or maybe your company would save money on ETL resources.

Imagine if data science students didn't have to spend time converting JSON data into a table. Instead, they could spend time learning to ask and answer questions about their data.

Imagine if students of Berkeley's Data 8 course spent zero time tabularizing complicated JSON datasets. Rather, they could spend time applying their skills in a class project involving complicated JSON datasets not previously considered due to the overhead of tabularizing.

In this short blog post, I'll show you how to quickly tabularize a 700MB NBA JSON dataset I found online, nbagames.json, using SlamData REFORM.

I'm an engineer at SlamData and I'm very proud of the engineering feat my colleagues have accomplished. I'm here to show you how you can benefit from our creation, SlamData REFORM.

Access your JSON data with ease

SlamData REFORM can read your JSON data from multiple sources. In this tutorial, we'll be reading a JSON file from an S3 bucket. Download nbagames.json and upload it to an S3 bucket. A quick Google search will turn up plenty of instructions on how to do this.

SlamData REFORM can readily read your data from an API, local file, Azure, etc. Check out [our instructions](https://slamdata.com/user-guides/) for these scenarios.

Tabularize your JSON data using SlamData REFORM

In a 2-minute video, I'm going to show you how to create a table from the NBA JSON data I mentioned above. First, I will connect to my S3 bucket containing the nbagames.json data. Then, I will point REFORM to the nbagames.json file and create some columns.

Create Table with SlamData REFORM

And that's it! I have just created a table from the 700MB JSON file without needing to explore my data using Linux commands, and especially without having to write any Python.

Importing the SlamData REFORM table

Using RStudio

The short video below demonstrates how quickly you can import into RStudio the table that SlamData REFORM created from the NBA JSON data. Like many other tools in this realm, RStudio can read data from a URL, display the columns we selected, and produce a simple bar plot.

Import REFORM table into RStudio

Using Jupyter Notebook

Of course, if you like doing data science using Python, you certainly can! My point is that with SlamData REFORM there's no need to write Python to tabularize your JSON. Below, I've included an example of how to get started with a Jupyter Notebook, pandas, and Python.

Import REFORM table into Jupyter Notebook
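As a rough sketch of what the import step might look like in a notebook cell, the code below parses a tabular export and counts rows per value of one column, which is the kind of input a simple bar plot needs. It assumes the table is served as CSV from a URL; the export format, the column names, and the sample data are all assumptions made for illustration, not REFORM's documented output. It uses only the standard library so it runs anywhere.

```python
import csv
import io
import urllib.request
from collections import Counter


def read_table(text):
    """Parse CSV text into a list of row dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_table(url):
    """Download CSV text from a URL (assumed export format) and parse it."""
    with urllib.request.urlopen(url) as response:
        return read_table(response.read().decode("utf-8"))


def count_by(rows, column):
    """Count rows per distinct value of one column (bar-plot input)."""
    return Counter(row[column] for row in rows)


# Stand-in data with hypothetical columns; in practice you would call
# fetch_table() with the URL of your own exported table.
SAMPLE_CSV = "team,points\nA,101\nB,99\nA,90\n"

rows = read_table(SAMPLE_CSV)
counts = count_by(rows, "team")
print(counts.most_common())
```

If pandas is installed, `pandas.read_csv` accepts a URL directly and returns a DataFrame, so the loading step can shrink to a single call.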

Conclusion

Yes! It’s actually this easy to get started analyzing JSON data. Just connect SlamData REFORM to your data source, select your columns, and import the resulting table into your favorite data science tool.

SlamData REFORM has other amazing uses. Would you like to stream data into Amazon Redshift, tabularize data stuck in MongoDB, or front your data API with REFORM? Let us know; we are happy to help.

To get started with REFORM, get in touch with our sales team.

SlamData REFORM is also available on [AWS Marketplace](https://aws.amazon.com/marketplace/pp/B07N4B9N7Z).

Get started analyzing your complex JSON and skip the Python scripts with SlamData REFORM!

