The organizers of the Utah County Data Science Meetup all have data needs. Surprise. And we spend a lot of time sourcing, cleaning, and integrating that data into our projects (writing high-quality ETL scripts). What if, asks the illustrious Jeff Potter, we could share?
Share, you ask?
Yes, share data, share scripts.
How do I know that what I have is useful to other people?
I don't. But if I collect crime statistics by zip code from public data, I can share that I have a transform script and/or data for that. The same goes for all kinds of data: demographics, income, disease...and that's just the people data.
Isn't this stuff generally available? I mean data.gov, the ACS, NCES, FBI, CDC...everyone's publishing already.
Yes, and if there are missing values, outliers, different granularities...
Wait, wait...you're saying you want to take data and decide for other people how to handle missing values? Outliers? Different analysis takes different treatment, right?
Right, but if I decided to fill missing values with some strategy, and I disclose how I did that, you can just take my work and use the data in the same format. Or, if you don't like that, you can clone (or fork, we haven't agreed on terminology yet) my script, adjust it to your needs and publish your better way if you want.
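As a sketch of what a disclosed fill strategy might look like (the function and strategy names here are illustrative, not part of any agreed format):

```python
from statistics import median

def fill_missing(values, strategy="median"):
    """Fill None entries using a named, disclosed strategy so that
    anyone reusing the data knows exactly how gaps were handled."""
    present = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(present)  # robust to outliers
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

Publishing the strategy name alongside the data lets a downstream user decide whether to accept it, or fork the script and swap the strategy out.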
Do I have to publish?
Absolutely not. But if you have high quality transform scripts for accessing interesting data, please do.
OK, you're talking about sharing ETL scripts. Where do I host these scripts, and how do I register them?
Well, the idea right now is you register your scripts with our system and we'll maintain a curated list of available data and scripts. The exact syntax is still being settled, but it looks something like this:
dimensional register <script>
Got it. You're registering scripts, but you mentioned data. How does that work?
If you have data that's small enough to be shared in its current state, you can additionally share the transformed data. Tentatively (the flag is a placeholder; the final syntax may differ):
dimensional register <script> --data <file>
How do I know this is safe?
We use checksums and versioning to know that what you intended was delivered. Additionally, we verify in a safe environment that your scripts produce one file and have no side effects.
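The verification on the receiving side is standard checksum comparison; a minimal sketch (the function names are ours, not a published API):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, streaming in chunks
    so large datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    """Check a downloaded file against the checksum the publisher registered."""
    return sha256_of(path) == expected
```

Combined with versioning, this means a consumer can pin a dataset to an exact checksum and know when the publisher has changed it.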
How do I know the resulting data is safe?
We don't, exactly. We run static checks for issues like SQL injection, and we test that the scripts have no side effects in a database. For CSV, we check that the data is well-formed. We also track who is submitting and maintain a reputation for each submitter, and we can ban users who do malicious things. We haven't had problems like this yet.
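The CSV well-formedness check can be as simple as confirming that every row parses and has the same number of fields as the header (a sketch of the idea, not the actual validator):

```python
import csv

def is_well_formed_csv(path: str) -> bool:
    """Rough well-formedness check: the file is non-empty, every row
    parses, and every row has the same width as the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        try:
            header = next(reader)
        except StopIteration:
            return False  # empty file
        width = len(header)
        return all(len(row) == width for row in reader)
```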
How do I know of the quality of the data I'm receiving?
Each script can be reviewed by other users, either with a quick rating or a detailed written review.
How do I discover data that meets my needs?
You can use
dimensional search <criteria>
For example (the criteria syntax is still being worked out), something like
dimensional search "crime zip code"
would turn up the crime-statistics script mentioned earlier.
Also, when you submit a script or data, you are asked some specific questions to make the data more discoverable.
How do I reuse data that I've received? Say I have several projects; do I need to download the data every time?
No. dimensional will keep track of what you have loaded and the version of it:
dimensional list
dimensional list <criteria>