Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data[1] (a minimal sketch of such summaries appears after the list below). The purpose of these statistics may be to:

  1. Find out whether existing data can be easily used for other purposes
  2. Improve the ability to search the data by tagging it with keywords and descriptions, or by assigning it to a category
  3. Assess data quality, including whether the data conforms to particular standards or patterns[2]
  4. Assess the risk involved in integrating data in new applications, including the challenges of joins
  5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
  6. Assess whether known metadata accurately describes the actual values in the source database
  7. Understand data challenges early in any data-intensive project, so that late surprises are avoided; finding data problems late in the project can lead to delays and cost overruns
  8. Gain an enterprise-wide view of all data, for uses such as master data management (where key data is needed) or data governance (for improving data quality)
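
To make the kind of summaries listed above concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `profile` function, the sample rows, and the column names are illustrative, not taken from any particular profiling tool); it computes null counts, distinct counts, top values, and candidate keys for tabular data held as a list of dicts:

```python
from collections import Counter

def profile(rows):
    """Per-column summaries: null count, distinct count, top values,
    and a simple candidate-key test (illustrative sketch)."""
    total = len(rows)
    report = {}
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        counts = Counter(non_null)
        report[col] = {
            "null_count": total - len(non_null),
            "distinct_count": len(counts),
            "top_values": counts.most_common(3),
            # All values distinct and none null: a candidate key (purpose 5).
            "candidate_key": len(counts) == total,
        }
    return report

rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": None},
]
for col, stats in profile(rows).items():
    print(col, stats)
```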

Data profiling is especially useful for assessing data quality: it can verify that each stage of data cleaning or transformation has been performed correctly and in compliance with requirements.
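
One way to act on that, sketched below under the assumption that quality requirements can be written as per-row predicates (the rule names and columns here are hypothetical), is to run explicit rule checks after each transformation step and report every violation:

```python
import re

def check_quality(rows, rules):
    """Return (rule_name, row_index) pairs for every violated rule."""
    violations = []
    for i, row in enumerate(rows):
        for name, predicate in rules.items():
            if not predicate(row):
                violations.append((name, i))
    return violations

rules = {
    # The key column must be non-null after the transformation.
    "id_not_null": lambda r: r.get("id") is not None,
    # Country codes must match a two-letter uppercase pattern.
    "country_pattern": lambda r: r.get("country") is None
        or re.fullmatch(r"[A-Z]{2}", r["country"]) is not None,
}

cleaned = [{"id": 1, "country": "US"}, {"id": None, "country": "usa"}]
print(check_quality(cleaned, rules))
# -> [('id_not_null', 1), ('country_pattern', 1)]
```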
