The organizers of the Utah County Data Science Meetup all have data needs. Surprise. And we spend a lot of time sourcing, cleaning, and integrating that data into our projects (writing high-quality ETL scripts). What if, asks the illustrious Jeff Potter, we could share?
Share, you ask?
Yes, share data, share scripts.
How do I know that what I have is useful to other people?
I don't. But if I collect crime statistics by zip code from public data, I can share that I have a transform script and/or data for that. The same goes for all kinds of data: demographics, income, disease...and that's just the people data.
Isn't this stuff generally available? I mean data.gov, the ACS, NCES, FBI, CDC...everyone's publishing already.
Yes, and if there are missing values, outliers, different granularities...
Wait, wait...you're saying you want to take data and decide for other people how to handle missing values? Outliers? Different analysis takes different treatment, right?
Right, but if I decided to fill missing values with some strategy, and I disclose how I did that, you can just take my work and use the data in the same format. Or, if you don't like that, you can clone (or fork, we haven't agreed on terminology yet) my script, adjust it to your needs and publish your better way if you want.
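As a sketch of what a disclosed fill strategy might look like (the function and strategy names here are illustrative, not part of any agreed format):

```python
from statistics import median

def fill_missing(values, strategy="median"):
    """Fill None entries using a named, disclosed strategy so that
    anyone reusing the data knows exactly how gaps were handled."""
    present = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(present)  # robust to outliers
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

Publishing the strategy name alongside the data lets a downstream user decide whether to accept it, or fork the script and swap the strategy out.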
Do I have to publish?
Absolutely not. But if you have high quality transform scripts for accessing interesting data, please do.
OK, you're talking about sharing ETL scripts. Where do I host these scripts, and how do I register them?
Well, the idea right now is you register your scripts with our system and we'll maintain a curated list of available data and scripts. The exact syntax is still being settled, but it looks something like this:
dimensional register <script>
Got it. You're registering scripts, but you mentioned data. How does that work?
If you have data that's small enough to be shared in its current state, you can additionally share the transformed data. Tentatively (the flag is a placeholder; the final syntax may differ):
dimensional register <script> --data <file>
How do I know this is safe?
We use checksums and versioning to know that what you intended was delivered. Additionally, we verify in a safe environment that your scripts produce one file and have no side effects.
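The verification on the receiving side is standard checksum comparison; a minimal sketch (the function names are ours, not a published API):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, streaming in chunks
    so large datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    """Check a downloaded file against the checksum the publisher registered."""
    return sha256_of(path) == expected
```

Combined with versioning, this means a consumer can pin a dataset to an exact checksum and know when the publisher has changed it.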
How do I know the resulting data is safe?
We don't, exactly. We run static checks for issues like SQL injection, and we test that the scripts have no side effects in a database. For CSV, we check that the data is well-formed. We also track who is submitting and maintain a reputation for each submitter, and we can ban users who do malicious things. We haven't had problems like this yet.
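The CSV well-formedness check can be as simple as confirming that every row parses and has the same number of fields as the header (a sketch of the idea, not the actual validator):

```python
import csv

def is_well_formed_csv(path: str) -> bool:
    """Rough well-formedness check: the file is non-empty, every row
    parses, and every row has the same width as the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        try:
            header = next(reader)
        except StopIteration:
            return False  # empty file
        width = len(header)
        return all(len(row) == width for row in reader)
```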
How do I know of the quality of the data I'm receiving?
Each script can be reviewed by other users, either with a quick rating or a detailed written review.
How do I discover data that meets my needs?
You can use
dimensional search <criteria>
For example (the criteria syntax is still being worked out), something like
dimensional search "crime zip code"
would turn up the crime-statistics script mentioned earlier.
Also, when you submit a script or data, you are asked some specific questions to make the data more discoverable.
How do I reuse data that I've received? Say I have several projects; do I need to download the data every time?
No. dimensional will keep track of what you have loaded and the version of it:
dimensional list
dimensional list <criteria>