@fivesmallq
Forked from microhello/DataQuality
Last active August 29, 2015 14:07

#Data Quality

##1.Definition

http://en.wikipedia.org/wiki/Data_quality

Reference:


##2.Tools


##3.Dataset

  • orderdb
  • airport.csv
  • journal (in MongoDB)


##4.Scenarios

  • High Level view

Simple Statistic: orders

DateTime Analysis: orders

Yearly, Monthly Distribution: orders
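A yearly and monthly distribution can be sketched in a few lines of Python. This is only an illustration of the idea, not how the tool computes it; the `order_dates` sample and both function names are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical sample of order dates; in practice these would be read
# from the date column of the orders table.
order_dates = [
    date(2003, 1, 6),
    date(2003, 1, 9),
    date(2003, 2, 11),
    date(2004, 1, 2),
]

def yearly_distribution(dates):
    """Count orders per calendar year."""
    return Counter(d.year for d in dates)

def monthly_distribution(dates):
    """Count orders per (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)

print(yearly_distribution(order_dates))
print(monthly_distribution(order_dates))
```

A year or month with a count far below its neighbours is usually the first hint of missing or misdated orders.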

  • Completeness

Completeness, Null Check: customers
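A null check boils down to counting missing values per column. A minimal sketch, assuming rows are plain dictionaries; the sample rows and the `null_counts` helper are hypothetical, and treating blank strings as missing is an assumption of this example.

```python
def null_counts(rows, columns):
    """Return, per column, how many rows have a missing value.

    A value counts as missing when it is None or a blank string.
    """
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or str(value).strip() == "":
                counts[column] += 1
    return counts

# Hypothetical customers rows, for illustration only.
customers = [
    {"customername": "Customer A", "city": "Nantes", "postalcode": "44000"},
    {"customername": "Customer B", "city": "Las Vegas", "postalcode": None},
    {"customername": "Customer C", "city": "", "postalcode": "  "},
]

print(null_counts(customers, ["customername", "city", "postalcode"]))
```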

  • Exception Values

Value Distribution: customers.country / customers.city
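The idea behind a value distribution is to count how often each distinct value occurs and then look at the rare ones, which are often misspellings or variants. A rough sketch; the sample rows, `value_distribution`, `rare_values`, and the `max_count` threshold are all assumptions of this example.

```python
from collections import Counter

def value_distribution(rows, column):
    """Return (value, count) pairs for a column, most frequent first."""
    counts = Counter(row.get(column) for row in rows)
    return counts.most_common()

def rare_values(rows, column, max_count=1):
    """Return values that occur at most max_count times."""
    return [value for value, n in value_distribution(rows, column) if n <= max_count]

# Hypothetical customers rows, for illustration only.
customers = [
    {"country": "USA"},
    {"country": "USA"},
    {"country": "France"},
    {"country": "U.S.A."},
]

print(value_distribution(customers, "country"))
print(rare_values(customers, "country"))
```

Here `U.S.A.` shows up as a rare value next to the frequent `USA`, which is the kind of exception this check is meant to surface. A legitimately rare value such as `France` shows up too, so the result still needs a manual look.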

  • Exception Pattern

Pattern Finder: customers.postalcode / customers.phone
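Pattern finding can be approximated by reducing each value to a character-class pattern and counting the patterns. This is a simplified sketch, not the tool's actual algorithm; the token choice (`9` for digits, `A` and `a` for upper and lower case letters) and the function names are assumptions of this example.

```python
from collections import Counter

def to_pattern(value):
    """Map each character to a class token; punctuation and spaces are kept."""
    tokens = []
    for ch in value:
        if ch.isdigit():
            tokens.append("9")
        elif ch.isalpha():
            tokens.append("A" if ch.isupper() else "a")
        else:
            tokens.append(ch)
    return "".join(tokens)

def pattern_counts(values):
    """Count how many values share each pattern, skipping missing values."""
    return Counter(to_pattern(v) for v in values if v is not None)

# Hypothetical postal codes, for illustration only.
postalcodes = ["44000", "75016", "EC2 5NT", "N/A", None]

print(pattern_counts(postalcodes).most_common())
```

Patterns that occur only once or twice, such as the one produced by `N/A`, are the candidates to inspect.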

  • Referential Integrity

Referential Integrity: products.productcode -> orderdetails.productcode
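One way to read this check: every `productcode` used in orderdetails should also exist in products. That reading of the arrow is an assumption. A minimal sketch with made-up rows and a hypothetical `orphaned_keys` helper:

```python
def orphaned_keys(child_rows, child_column, parent_rows, parent_column):
    """Return child key values that have no matching parent key, sorted."""
    parent_keys = {row[parent_column] for row in parent_rows}
    orphans = {
        row[child_column]
        for row in child_rows
        if row[child_column] not in parent_keys
    }
    return sorted(orphans)

# Hypothetical rows, for illustration only.
products = [{"productcode": "S10_1678"}, {"productcode": "S10_1949"}]
orderdetails = [
    {"ordernumber": 1, "productcode": "S10_1678"},
    {"ordernumber": 2, "productcode": "S10_1949"},
    {"ordernumber": 3, "productcode": "S99_0000"},
]

print(orphaned_keys(orderdetails, "productcode", products, "productcode"))
```

An empty result means the reference holds for the sampled rows; any returned code points at order lines that reference a product that does not exist.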


##5.Other Features

  • CSV file, MongoDB, and other datasource support
  • custom regexes and a regex marketplace
  • JavaScript transforms, custom extensions, and an extension marketplace
  • console and HTTP interfaces

##6.Tips

  • Out of Memory

java -Xmx2048m -jar Datacleaner.jar

  • Export to HTML

The exported HTML file may be large. The HTML loads its JS and CSS from online sources.


##7.Talend Open Studio For Data Quality Overview
