#Data Quality
##1.Definition
See http://en.wikipedia.org/wiki/Data_quality

References:
- [1] http://www.slideshare.net/OpenDataSupport/open-data-quality-29248578
- [2] http://www.slideshare.net/dba_alex/data-quality-overview?related=1
- [3] http://www-01.ibm.com/software/data/quality/
##2.Tools
- DataCleaner (~50 MB): http://sourceforge.net/projects/datacleaner/ (docs: http://datacleaner.org/resources/docs/3.7/html_single/)
- Talend Open Studio for Data Quality (~500 MB): http://talend.dreamhosters.com/top/release/V5.5.1/TOS_DQ-r118616-V5.5.1.zip
##3.Dataset
- orderdb
- airport.csv
- journal (in MongoDB)
##4.Scenarios
- High-level view
  - Simple Statistics: orders
  - Date/Time Analysis: orders
  - Yearly and Monthly Distribution: orders
- Completeness
  - Completeness and Null Check: customers
- Exceptional values
  - Value Distribution: customers.country, customers.city
- Exceptional patterns
  - Pattern Finder: customers.postalcode, customers.phone
- Referential integrity
  - Referential Integrity: products.productcode -> orderdetails.productcode
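
To make the checks above concrete, here is a minimal sketch in plain Python. It is not how DataCleaner implements its analyzers. The function names are hypothetical, the sample rows are invented for illustration, and only the column names (country, city, postalcode, phone, productcode) come from the scenarios listed above. The pattern step is a simplified per-character stand-in for a pattern finder.

```python
import re
from collections import Counter


def null_counts(rows, columns):
    """Completeness: count missing (None or blank) values per column."""
    return {
        col: sum(
            1 for row in rows
            if row.get(col) is None or str(row[col]).strip() == ""
        )
        for col in columns
    }


def value_distribution(rows, column):
    """Value distribution: frequency of each distinct value, most common first."""
    return Counter(row.get(column) for row in rows).most_common()


def to_pattern(value):
    """Simplified pattern: letters become 'a', digits become '9', the rest is kept."""
    pattern = re.sub(r"[A-Za-z]", "a", value)
    return re.sub(r"[0-9]", "9", pattern)


def pattern_distribution(rows, column):
    """Pattern finder stand-in: frequency of each pattern among non-empty values."""
    return Counter(
        to_pattern(row[column]) for row in rows if row.get(column)
    ).most_common()


def orphan_keys(child_rows, child_col, parent_rows, parent_col):
    """Referential integrity: child values that have no matching parent value."""
    parent_keys = {row[parent_col] for row in parent_rows}
    return sorted({row[child_col] for row in child_rows} - parent_keys)


# Invented sample rows (assumption: not taken from the real orderdb dataset).
customers = [
    {"country": "USA", "city": "NYC", "postalcode": "10022", "phone": "2125551500"},
    {"country": "Germany", "city": "Berlin", "postalcode": "10115", "phone": None},
    {"country": "USA", "city": "", "postalcode": "WX1 6LT", "phone": "(171) 555-2282"},
]
products = [{"productcode": "S10_1678"}, {"productcode": "S10_1949"}]
orderdetails = [{"productcode": "S10_1678"}, {"productcode": "S99_0000"}]

print(null_counts(customers, ["city", "phone"]))
print(value_distribution(customers, "country"))
print(pattern_distribution(customers, "postalcode"))
print(orphan_keys(orderdetails, "productcode", products, "productcode"))
```

In this sketch a rare pattern such as `aa9 9aa` among mostly `99999` postal codes, or a product code in orderdetails that is absent from products, is the kind of outlier the corresponding scenario is meant to surface.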
##5.Other Features
- Data source support for CSV files, MongoDB, etc.
- Custom regexes and a regex marketplace
- JavaScript transformations, custom extensions, and an extension marketplace
- Console and HTTP interfaces
##6.Tips
- Out of memory: start DataCleaner with a larger JVM heap, e.g.
  `java -Xmx2048m -jar Datacleaner.jar`
- Export to HTML: the exported HTML file may be large, and it loads its JavaScript and CSS from online sources.
##7.Talend Open Studio For Data Quality Overview