@jpotts18
Last active June 23, 2016 22:28
class Dataset(object):
    name = None
    source_url = None
    processing_notes = None  # What modifications were made to the original data set? Outliers, imputation, etc.
    license = None  # How the data may be used: CC, Apache, etc.
    columns = []  # Shared class-level default: assign a new list per instance, do not append to it

    def __str__(self):
        cols = "\n".join('{} -\t{} - \t{}'.format(c.name, c.data_type, c.description)
                         for c in self.columns)
        # The observation count is read from the first column; a data set with no columns has none
        num_observations = len(self.columns[0].data) if self.columns else 0
        return ('Name: {}\nSource Url: {}\nLicense: {}\n'
                'Num. Observations: {}\nNum. Columns: {}\n\n{}').format(
                    self.name, self.source_url, self.license,
                    num_observations, len(self.columns), cols)


class Column(object):
    name = None
    data_type = None
    description = None  # I think it is very important to know how the data is calculated
    data = []  # Shared class-level default: assign a new list per instance, do not append to it


class PrimaryKey(Column):
    pass


# Named zip_code rather than zip to avoid shadowing the built-in zip()
zip_code = PrimaryKey()
zip_code.data = ['77379', '84064', '84003']
zip_code.data_type = str
zip_code.name = 'zipcodes'
zip_code.description = 'A 5 digit postal code used in the US'

city = Column()
city.data = ['Spring', 'Provo', 'Highland']
city.data_type = str
city.name = 'City'
city.description = 'Name of city'

state = Column()
state.data = ['TX', 'UT', 'UT']
state.data_type = str
state.name = 'State'
state.description = 'Name of State'

data_set = Dataset()
data_set.name = 'Places I have lived'
data_set.columns = [zip_code, state, city]
data_set.license = 'Creative Commons Share-Alike'
data_set.processing_notes = 'I removed cities that I haven\'t lived in'
data_set.source_url = 'jpotts18.github.io/datasets/zips.csv'

print(data_set)

# Output (Python 3; tabs shown as spaces):
# Name: Places I have lived
# Source Url: jpotts18.github.io/datasets/zips.csv
# License: Creative Commons Share-Alike
# Num. Observations: 3
# Num. Columns: 3
#
# zipcodes - <class 'str'> - A 5 digit postal code used in the US
# State - <class 'str'> - Name of State
# City - <class 'str'> - Name of city
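
The observation count above is read from the first column only, so it quietly assumes every column holds the same number of values. Here is a minimal sketch of a check for that assumption. The helper `check_equal_lengths` and the simplified stand-in `Column` class are hypothetical and not part of the original gist.

```python
class Column(object):
    """Simplified stand-in for the gist's Column: a name and a list of values."""

    def __init__(self, name, data):
        self.name = name
        self.data = data


def check_equal_lengths(columns):
    """Return the shared row count, or raise ValueError if the columns disagree.

    Hypothetical helper: the original Dataset does not perform this check.
    """
    lengths = [(c.name, len(c.data)) for c in columns]
    distinct = set(n for _, n in lengths)
    if len(distinct) > 1:
        raise ValueError('Columns differ in length: {}'.format(lengths))
    # A data set with no columns has zero observations.
    return lengths[0][1] if lengths else 0


columns = [
    Column('zipcodes', ['77379', '84064', '84003']),
    Column('State', ['TX', 'UT', 'UT']),
    Column('City', ['Spring', 'Provo', 'Highland']),
]
row_count = check_equal_lengths(columns)
```

Such a check could be called before printing a data set, so that a mismatch is reported as an error instead of producing a misleading count.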
@rosenbrockc

I like it. As for Column.description, I had a few thoughts:

  1. Maybe include a source_url at the Column level as well, since the data will probably be pulled from several places to form the Dataset.
  2. Another attribute, called transformation or something similar, would be appropriate. It would hold a pipeline string like the one used with the executable version of petl: .cut('foo', 'baz').convert('baz', float).selectgt('baz', 0.5).head().data() | petl.

That is, unless I am missing where this class fits in the overall picture. It might also be nice to have an attribute that defines how "clean" the column is, though we need to discuss how to quantify that. I'll open an issue on the main repo.
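
As a rough sketch of the two suggestions above, a column could carry its own provenance. The attribute names `source_url` and `transformation`, the `provenance` method, and the example values are all assumptions made for illustration. The pipeline is stored as plain text and never executed, so petl itself is not needed.

```python
class Column(object):
    """Sketch of a Column extended with per-column provenance (hypothetical)."""

    def __init__(self, name, data_type, description, data,
                 source_url=None, transformation=None):
        self.name = name
        self.data_type = data_type
        self.description = description
        self.data = list(data)  # copy, so columns never share one list
        # Assumed attribute: where this particular column was obtained.
        self.source_url = source_url
        # Assumed attribute: the pipeline that produced the column, kept as text.
        self.transformation = transformation

    def provenance(self):
        """Summarise where the column came from and how it was derived."""
        return '{}: source={}, transformation={}'.format(
            self.name, self.source_url, self.transformation)


# Made-up example values, reusing the pipeline string from the comment above.
baz = Column(
    name='baz',
    data_type=float,
    description='Example column kept only where baz > 0.5',
    data=[0.7, 0.9],
    source_url='example.com/raw.csv',
    transformation=".cut('foo', 'baz').convert('baz', float).selectgt('baz', 0.5)",
)
summary = baz.provenance()
```

A "clean" score could be added in the same way once there is agreement on how to quantify it.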
