@davidrichards
Last active June 23, 2016 20:07

Introducing Dimensional

The organizers of the Utah County Data Science Meetup all have data needs. Surprise. And we spend a lot of time sourcing, cleaning, and integrating that data into our projects (writing high-quality ETL scripts). What if, asks the illustrious Jeff Potter, we could share?

Share, you ask?

Yes, share data, share scripts.

How do I know that what I have is useful to other people?

I don't. But if I collect crime statistics by zip code from public data, I can share that I have a transform script and/or data for that. The same goes for all kinds of data, such as demographics, income, or disease...and that's just the people data.

Isn't this stuff generally available? I mean data.gov, the ACS, NCES, FBI, CDC...everyone's publishing already.

Yes, and if there are missing values, outliers, different granularities...

Wait, wait...you're saying you want to take data and decide for other people how to handle missing values? Outliers? Different analysis takes different treatment, right?

Right, but if I decided to fill missing values with some strategy, and I disclose how I did that, you can just take my work and use the data in the same format. Or, if you don't like that, you can clone (or fork, we haven't agreed on terminology yet) my script, adjust it to your needs and publish your better way if you want.
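
As a rough illustration of what a disclosed fill strategy could look like, the sketch below fills the gaps in one column with the column median and hands back a note saying so. The function name `fill_missing_with_median`, the column names, and the sample rows are all invented for this example; none of it is part of dimensional.

```python
import statistics


def fill_missing_with_median(rows, column):
    """Fill empty values in `column` with the median of the present values.

    Returns new rows plus a short disclosure note, so anyone reusing the
    data can see which strategy was applied. The input rows are not modified.
    """
    present = [float(r[column]) for r in rows if r[column] not in ("", None)]
    median = statistics.median(present)
    filled = []
    count = 0
    for r in rows:
        new_row = dict(r)  # copy, so the caller's data stays untouched
        if new_row[column] in ("", None):
            new_row[column] = median
            count += 1
        else:
            new_row[column] = float(new_row[column])
        filled.append(new_row)
    note = f"{column}: filled {count} missing value(s) with median {median}"
    return filled, note


# Hypothetical sample: crime counts by zip code, one value missing.
rows = [
    {"zip": "84601", "crimes": "120"},
    {"zip": "84604", "crimes": ""},
    {"zip": "84606", "crimes": "80"},
]
filled, note = fill_missing_with_median(rows, "crimes")
```

Someone who prefers a different treatment (dropping the row, say) only has to swap out this one function and publish the variant with its own note.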

Do I have to publish?

Absolutely not. But if you have high-quality transform scripts for accessing interesting data, please do.

OK, you are talking about sharing ETL scripts. Where do I host these scripts, and how do I register them?

Well, the idea right now is you register your scripts with our system and we'll maintain a curated list of available data and scripts. Do that like this:

give some examples

Got it. You're registering scripts, but you mentioned data. How does that work?

If you have data that's small enough to be shared in its current state, you can additionally share the transformed data.

here's how

How do I know this is safe?

We use checksums and versioning to know that what you intended was delivered. Additionally, we verify in a safe environment that your scripts produce one file and have no side effects.
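
A checksum comparison of this kind can be sketched in a few lines. This is only an assumption about how such a check might work: the text above does not say which hash is used, so SHA-256 is a stand-in, and the helper names and the file name `crime_by_zip.py` are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_checksum):
    """True if the local file hashes to the checksum the author published."""
    return sha256_of_file(path) == published_checksum


# Demo with a throwaway file standing in for a registered script.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "crime_by_zip.py")
    with open(script, "w") as handle:
        handle.write("print('transform')\n")
    published = sha256_of_file(script)  # what the author would publish
    intact = matches_published(script, published)

    with open(script, "a") as handle:
        handle.write("# tampered\n")  # any change alters the digest
    tampered = matches_published(script, published)
```

Storing one checksum per published version would let a client confirm both that a download is intact and that it is the version it asked for.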

How do I know the resulting data is safe?

We don't, exactly. We use some scripts to guard against issues like SQL injection, and we test that the scripts have no side effects in a database. For CSV, we check that the data is well-formed. We track who submits to us and their reputation, and we can ban users who do malicious things. We haven't had problems like this yet.

How do I know the quality of the data I'm receiving?

Scripts can be reviewed, either with a rating or with a detailed written review, so you can check what other users made of a script before relying on its data.
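
For the CSV case, a well-formedness check might look roughly like the sketch below. What counts as "well-formed" is not spelled out above, so this assumes a narrow definition: the text parses, it has a header row, and every row has as many fields as the header. The function name `csv_problems` is made up for the example.

```python
import csv
import io


def csv_problems(text):
    """Return a list of problems found in CSV text; an empty list means it
    passed these (deliberately narrow) checks."""
    try:
        # strict=True makes the parser raise csv.Error on malformed input
        rows = list(csv.reader(io.StringIO(text, newline=""), strict=True))
    except csv.Error as exc:
        return [f"parse error: {exc}"]
    if not rows:
        return ["file is empty"]
    width = len(rows[0])  # the header row sets the expected field count
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(
                f"row {number}: expected {width} fields, got {len(row)}"
            )
    return problems


good = "zip,crimes\n84601,120\n84604,80\n"
ragged = "zip,crimes\n84601,120,extra\n"
good_result = csv_problems(good)
ragged_result = csv_problems(ragged)
```

A check like this says nothing about whether the values themselves are sensible; it only catches structural damage.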

How do I discover data that meets my needs?

You can use

dimensional search <criteria>
give more examples

Also, when you submit a script or data, you are asked some specific questions to make the data more discoverable.

How do I reuse data that I've received? Say I have several projects. Do I need to download the data every time?

No. dimensional will keep track of what you have loaded and the version of it:

dimensional list
dimensional list <criteria>
dimensional
  help
    generic help
    list of most-common applications
    specific help with each application name (e.g. dimensional help import)
  import
    name (required)
    --columns, -c: a list of columns to include
    --version, -v
    --location, -l (especially important for bootstrapping locally, or from gists)
  fork | clone
    name (required)
    new name (?)
    --version, -v
  register | push
    register a new script or data package
    name
    --location, -l (default: current directory, name.py)
  pr
    maybe this is just git-based instead, e.g. Homebrew
  remove
    remove the local copy of the data
    name (required)
  update
    update the versions of installed scripts
    name (optional): limit to a specific script
  upgrade
    upgrade the dimensional system
  info
    return the info on a particular script
    name (required)
  search
    search for scripts meeting some criteria
  list
    list the installed scripts
  archive
    store the transformed data somewhere and make it accessible to other people directly
  verify
    check a script
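
To get a feel for how part of this outline could map onto a command-line parser, here is a minimal sketch using Python's `argparse`. It covers only a few of the subcommands and is not an implementation of dimensional: the way `--columns` takes a space-separated list, the shape of the `criteria` arguments, and the help strings are all assumptions layered on the outline above.

```python
import argparse


def build_parser():
    """Build a parser for a handful of the subcommands in the outline."""
    parser = argparse.ArgumentParser(prog="dimensional")
    commands = parser.add_subparsers(dest="command", required=True)

    imp = commands.add_parser("import", help="import a script's data")
    imp.add_argument("name")
    # Assumption: columns are given as a space-separated list.
    imp.add_argument("--columns", "-c", nargs="+", help="columns to include")
    imp.add_argument("--version", "-v")
    imp.add_argument("--location", "-l")

    search = commands.add_parser("search", help="search for scripts")
    search.add_argument("criteria", nargs="+")

    lst = commands.add_parser("list", help="list the installed scripts")
    lst.add_argument("criteria", nargs="*")

    remove = commands.add_parser("remove", help="remove the local copy")
    remove.add_argument("name")

    update = commands.add_parser("update", help="update installed scripts")
    # Optional name: limit the update to a single script.
    update.add_argument("name", nargs="?")

    return parser


parser = build_parser()
import_args = parser.parse_args(
    ["import", "crime_by_zip", "-c", "zip", "crimes", "-v", "1.2"]
)
update_args = parser.parse_args(["update"])
```

The remaining commands (fork, register, pr, upgrade, info, archive, verify) would follow the same pattern once their arguments are settled.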