@microhello
Created October 22, 2014 11:19
DataQuality
###Data Quality###
1. Definition
http://en.wikipedia.org/wiki/Data_quality
>References:
>[1] http://www.slideshare.net/OpenDataSupport/open-data-quality-29248578
>[2] http://www.slideshare.net/dba_alex/data-quality-overview?related=1
>[3] http://www-01.ibm.com/software/data/quality/
---
2. Tools
- DataCleaner (~50 MB)
http://sourceforge.net/projects/datacleaner/
http://datacleaner.org/resources/docs/3.7/html_single/
- Talend Open Studio for Data Quality (~500 MB)
http://talend.dreamhosters.com/top/release/V5.5.1/TOS_DQ-r118616-V5.5.1.zip
----------
3. Datasets
[orderdb](orderdb)
[airport.csv](airport.csv)
[journal in MongoDB](journal)
---
4. Scenarios
- High-level view
> **Simple Statistics**: orders
> **Date/Time Analysis**: orders
> **Yearly and Monthly Distribution**: orders
- Completeness
> **Completeness, Null Check**: customers
- Exception values
> **Value Distribution**: customers.country / customers.city
- Exception patterns
> **Pattern Finder**: customers.postalcode / customers.phone
- Referential integrity
> **Referential Integrity**: products.productcode -> orderdetails.productcode
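
To make the checks above concrete, here is a minimal Python sketch of the same ideas. It only illustrates what each check computes; it is not how DataCleaner implements its analyzers. The sample rows are invented, the helper names are hypothetical, and the `a`/`9` pattern notation is an assumption rather than the tool's exact output format.

```python
import re
from collections import Counter

# Invented sample rows; the real orderdb tables hold more rows and columns.
customers = [
    {"country": "USA", "city": "NYC", "postalcode": "10022", "phone": "2125557818"},
    {"country": "France", "city": None, "postalcode": "44000", "phone": "40.32.2555"},
    {"country": "USA", "city": "", "postalcode": None, "phone": "6505551386"},
]
products = [{"productcode": "S10_1678"}, {"productcode": "S10_1949"}]
orderdetails = [
    {"ordernumber": 10100, "productcode": "S10_1678"},
    {"ordernumber": 10101, "productcode": "S99_0000"},  # no matching product
]


def null_counts(rows, columns):
    """Completeness: count missing (None or blank) values per column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                counts[col] += 1
    return counts


def value_distribution(rows, column):
    """Value distribution: how often each distinct value occurs."""
    return Counter(row[column] for row in rows)


def pattern_of(value):
    """Reduce a value to its shape: letters become 'a', digits become '9'."""
    shape = re.sub(r"[A-Za-z]", "a", value)
    return re.sub(r"[0-9]", "9", shape)


def pattern_counts(rows, column):
    """Pattern finder: count the shapes seen in a column, skipping missing values."""
    return Counter(pattern_of(row[column]) for row in rows if row[column])


def orphan_keys(child_rows, parent_rows, key):
    """Referential integrity: child keys that have no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


print(null_counts(customers, ["city", "postalcode", "phone"]))
print(value_distribution(customers, "country"))
print(pattern_counts(customers, "phone"))
print(orphan_keys(orderdetails, products, "productcode"))
```

Rare shapes in the pattern counts and any keys returned by `orphan_keys` are the rows worth inspecting by hand.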
---
5. Other Features
- Data source support for CSV files, MongoDB, etc.
- Custom regexes and a regex marketplace
- JavaScript transformations, custom extensions, and an extension marketplace
- Console and HTTP interfaces
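
As a rough illustration of what a custom regex check does, the Python sketch below lists the values that fail a rule. The postal code rule and the `failing_values` helper are assumptions made for the example and are not taken from DataCleaner or its regex marketplace.

```python
import re

# Hypothetical rule: five digits, optionally followed by a dash and four digits.
POSTAL_CODE = re.compile(r"\d{5}(-\d{4})?")


def failing_values(values, pattern):
    """Return the values that do not match the pattern in full."""
    return [v for v in values if pattern.fullmatch(v) is None]


print(failing_values(["10022", "44000-1234", "WX1 6LT"], POSTAL_CODE))
```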
---
6. Tips
- Out of memory: raise the JVM maximum heap size when starting the tool
>java -Xmx2048m -jar Datacleaner.jar
- Export to HTML
>The exported HTML file may be large
>The exported HTML references JS and CSS hosted online
---
7. Talend Open Studio for Data Quality Overview