@LeenSrouji
Last active April 18, 2022 07:27

15 Useful Open-Source Python Libraries for Data Quality


photo by Carlos Muza on Unsplash



Whether you're using data for business analysis or for building machine-learning models, poor data can hold you back and cost you a lot of time spent discovering and fixing issues.

In this article I have gathered the top open-source Python libraries to help you improve data quality in your day-to-day work.

Data Profiling And Assessment

Exploratory Analytics

1. Pandas Profiling [Github]

A library that generates profiling reports from pandas DataFrames.
Key features:

  • Data profiling (missing and unique values, ...)
  • Data distributions and histograms
  • Quantile and descriptive statistics (mean, standard deviation, Q1, ...)
  • Type inference
  • Interactions and correlations
  • Creates an HTML report
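
To make the contents of such a report concrete, here is a minimal standard-library sketch of the kind of per-column summary a profile contains (missing and unique counts, mean, standard deviation, quartiles). It is not the Pandas Profiling API; `profile_column` and the sample `ages` column are hypothetical names used only for illustration.

```python
import statistics


def profile_column(values):
    """Summarise one column: missing/unique counts plus basic numeric statistics."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    # Numeric statistics need at least two present values, all of them numbers.
    if len(present) >= 2 and all(isinstance(v, (int, float)) for v in present):
        q1, median, q3 = statistics.quantiles(present, n=4)
        summary.update(
            mean=statistics.mean(present),
            std=statistics.stdev(present),
            q1=q1,
            median=median,
            q3=q3,
        )
    return summary


ages = [23, 35, None, 41, 35, 29]  # hypothetical sample column
profile = profile_column(ages)
print(profile)
```

A real profiling library adds type inference, correlations and rendering on top of summaries like this one.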


2. Great Expectations [Github]

A shared, open standard for data quality. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Key features:

  • Built from Expectation constructs, which are data assertions
  • Declarative data tests, for example:
    • Expect this table to have between x and y rows
    • Expect missing values not to exceed 20%
    • Expect the date column format to be MM-DD-YYYY
  • More out-of-the-box constructs: uniqueness, outliers and more
  • Support for writing custom expectations
  • Automated data profiling
  • Rendering the tests into human-readable forms and docs
  • Integration with many tools and systems (pandas, Jupyter Notebooks, Spark, MySQL, Databricks, ...)
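
The declarative tests listed above can be sketched in plain Python as follows. This is not the Great Expectations API: the three `expect_*` helpers and the sample rows are hypothetical and only mirror the example expectations from the list.

```python
from datetime import datetime


def expect_row_count_between(rows, low, high):
    """Pass when the table has between `low` and `high` rows (inclusive)."""
    return low <= len(rows) <= high


def expect_missing_at_most(rows, column, max_fraction):
    """Pass when the share of missing values in `column` is at most `max_fraction`."""
    if not rows:
        return True  # an empty table has no missing values
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows) <= max_fraction


def expect_date_format(rows, column, fmt="%m-%d-%Y"):
    """Pass when every non-missing value in `column` parses as MM-DD-YYYY."""
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            continue  # missing values are covered by a separate expectation
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True


rows = [  # hypothetical sample data
    {"id": 1, "email": "a@example.com", "signup": "01-15-2022"},
    {"id": 2, "email": None, "signup": "02-30-2022"},  # February 30th does not exist
    {"id": 3, "email": "c@example.com", "signup": "03-01-2022"},
]

results = {
    "row_count_between_1_and_10": expect_row_count_between(rows, 1, 10),
    "email_missing_at_most_20_percent": expect_missing_at_most(rows, "email", 0.20),
    "signup_is_mm_dd_yyyy": expect_date_format(rows, "signup"),
}
print(results)
```

Keeping each expectation as a small named check is what makes the results easy to render into readable documentation.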


3. SodaSQL [Github]

An open-source command-line tool that executes SQL queries, based on tests you define, against datasets in different data sources (such as Snowflake, PostgreSQL, Athena, ...) to find invalid or missing data. In addition, it can collect defined metrics such as min, max, avg, stddev and many more.

Key features:

  • Custom SQL tests
  • Defining tests in YAML format for each table
  • Can integrate with data orchestration tools
  • Connects to and scans your datasets
  • Identifies column formats (such as email, date, phone number, ...)
  • JSON test results
Predictive Analytics

4. Ydata [Github]

A library for assessing data quality throughout the multiple stages of data pipeline development. It helps capture a holistic view of the data by looking at it from multiple dimensions:

  • Missing Values
  • Duplicates
  • Data Drifts and Outlier detection
  • Data Relations and Correlations

In addition, it provides an integration with Great Expectations, which runs data assertions, allowing you to validate and profile your data and automate report creation.

5. DeepChecks [Github]

A Python package for validating machine learning models and data with minimal effort. It includes checks related to various types of issues, such as:

  • Model performance
  • Data integrity:
    • Mixed Nulls
    • Duplicates
    • Strings Mismatch and Comparison
    • Frequency changes
    • Special characters and more
  • Distribution drifts
  • Feature importance and more

Although Deepchecks was built for machine learning, we can benefit from its data integrity and drift detection checks for our data testing. We can use the SingleDatasetIntegrity suite, or specific checks from the other suites, treating the train and test sets as the reference and current dataframes.

Moreover, Deepchecks provides the option to write customized suites and checks. The results are visualized nicely, either as a table or as a Plotly graph.
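
As one illustration of a data integrity check, the sketch below looks for "mixed nulls", that is, a column that spells a missing value in more than one way. It is a plain-Python approximation of the idea, not the Deepchecks implementation; `NULL_LIKE` and `null_spellings` are hypothetical names.

```python
# Strings that commonly stand in for "no value" (an assumed, non-exhaustive list).
NULL_LIKE = {"", "na", "n/a", "nan", "none", "null"}


def null_spellings(values):
    """Return the distinct representations of a missing value found in a column."""
    found = set()
    for value in values:
        if value is None:
            found.add(None)
        elif isinstance(value, str) and value.strip().lower() in NULL_LIKE:
            found.add(value)
    return found


city = ["Haifa", None, "N/A", "null", "Paris", ""]  # hypothetical sample column
spellings = null_spellings(city)
has_mixed_nulls = len(spellings) > 1
print(spellings, has_mixed_nulls)
```

A column that passes this check can still contain nulls; the check only flags inconsistent encodings of them.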


6. Evidently AI [Github]

A tool to analyze and monitor machine learning models. Key features:

  • Data distributions
  • Data Drift
  • Building custom dashboards
  • Model health and performance
  • Integration with Grafana and Prometheus
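
Data drift detection compares a reference sample with a current one. A minimal sketch of one common statistic, the two-sample Kolmogorov-Smirnov distance, is shown below in plain Python. It is not Evidently's API, and the 0.3 threshold and sample values are arbitrary assumptions for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    ref, cur = sorted(reference), sorted(current)

    def cdf(sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    return max(abs(cdf(ref, x) - cdf(cur, x)) for x in ref + cur)


reference = list(range(10))            # hypothetical feature values at training time
current = [x + 5 for x in reference]   # the same feature later, shifted upwards

distance = ks_statistic(reference, current)
drifted = distance > 0.3  # assumed threshold, tune per feature
print(distance, drifted)
```

The statistic is 0 for identical samples and approaches 1 as the two distributions stop overlapping.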


7. Alibi Detect [Github]

A library dedicated to machine learning, focused on outlier, adversarial and drift detection. Key features:

  • Drift and outlier detection for tabular data, text, images and time series
  • Covers both online and offline detectors
  • TensorFlow and PyTorch backends are supported for drift detection
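
For intuition about outlier detection on tabular data, here is a small sketch of one classic rule, Tukey's interquartile-range fences. It is far simpler than the detectors Alibi Detect ships and does not use its API; `iqr_outliers` and the sample latencies are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


latencies_ms = [10, 12, 11, 13, 12, 11, 95]  # hypothetical measurements
outliers = iqr_outliers(latencies_ms)
print(outliers)
```

The multiplier `k=1.5` is the conventional default; a larger value flags only more extreme points.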

Data Cleaning and Formatting:

1. Scrubadub [Github]

Identifies and removes PII (personally identifiable information) from free text, such as names, phone numbers, addresses, credit-card numbers and many more. In addition, you can implement customized detectors.

  import scrubadub

  text = "My cat can be contacted on example@example.com, or 1800 555-5555"
  scrubadub.clean(text)
  # >> 'My cat can be contacted on {{EMAIL}}, or {{PHONE}}'

2. Arrow [Github]

Arrow provides a sensible and human-friendly approach to creating, manipulating, formatting and converting dates, times and timestamps.

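  import arrow

  utc = arrow.utcnow()
  pacific = utc.to('US/Pacific')
  # dehumanize() shifts the time by a human-readable offset and keeps the timezone
  past = pacific.dehumanize("2 days ago")
  print(past)
  # >> 2022-01-09T02:11:11.939887-08:00
  print(past.humanize(locale="ar"))
  # >> منذ يومين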

3. Beautifier [Github]

A library to clean up URL patterns and emails. It also helps you to:

  • Check email validity
  • Parse emails by domain and username
  • Parse URLs by domain and parameters
  • Clean Unicode, special characters and unnecessary redirection patterns from URLs
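
The parsing tasks in this list can be approximated with the standard library alone, as sketched below. This does not use Beautifier's API; `parse_email`, `parse_url` and the deliberately simple `EMAIL_RE` pattern are hypothetical, and the pattern is a rough validity check rather than a full address validator.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Simplified pattern: something@domain.tld (assumed to be good enough for a sketch).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def parse_email(address):
    """Split a plausible email address into username and domain, else return None."""
    if not EMAIL_RE.match(address):
        return None
    username, domain = address.rsplit("@", 1)
    return {"username": username, "domain": domain}


def parse_url(url):
    """Split a URL into its domain, path and query parameters."""
    parts = urlsplit(url)
    return {"domain": parts.netloc, "path": parts.path, "params": parse_qs(parts.query)}


email = parse_email("leen@example.com")
url = parse_url("https://example.com/search?q=data+quality&page=2")
print(email)
print(url)
```

Note that `parse_qs` returns a list per parameter, since a query string may repeat a key.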

4. Ftfy [Github]

Ftfy stands for 'fixes text for you', and this is exactly what this library does:

  • Fixes text with bad Unicode
  • Fixes line breaks
  • Unescapes HTML to plain text
  • Detects likely mojibake
  • Provides an explanation of what happened to the text
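
Mojibake typically appears when UTF-8 bytes are decoded with the wrong codec. The standard-library sketch below reproduces and then reverses one such round trip. It illustrates the failure mode only; it is not how ftfy works internally, since a library has to decide heuristically when such a repair is safe.

```python
original = "doesn\u2019t"  # "doesn’t" with a curly apostrophe

# Encoding as UTF-8 and decoding as Windows-1252 turns one character into three.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the two steps restores the text, provided no bytes were lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

Blindly applying the reverse step to text that was never garbled would raise an error or corrupt it, which is why automatic detection matters.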

    import ftfy

    ftfy.fix_text('The Mona Lisa doesnâ€™t have eyebrows.')
    # >> "The Mona Lisa doesn't have eyebrows."
    

5. Dora [Github]

Exploratory data analysis toolkit for Python.
Key features:

  • Data cleaning (null values, category to ordinal, removing columns, transformations on columns)
  • Feature selection and extraction
  • Visualization (plotting features)
  • Partitioning data for model validation
  • Versioning transformations of data

Note that all data should be numerical in order to plot it and to use most of the functions.

6. DataCleaner [Github]

A tool for automatically cleaning data sets and making them ready for analysis.
Key features:

  • Dropping rows with missing values
  • Replacing missing values
  • Encoding non-numerical variables
  • Works with pandas DataFrames
  • Can be used from the command line or in a script
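
Two of the cleaning steps above, replacing missing values and encoding non-numerical variables, can be sketched in plain Python. This is not DataCleaner's API; using the median as the fill value and alphabetical order for the codes are assumptions made for the example.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


def encode_categories(values):
    """Map each distinct label to an integer code, in alphabetical order."""
    codes = {label: code for code, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


prices = fill_missing_with_median([3, None, 7, 5])
colours, mapping = encode_categories(["red", "blue", "red", "green"])
print(prices)
print(colours, mapping)
```

Returning the mapping alongside the codes makes it possible to decode the column again later.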

Tables Preview:

1. Tabulate [Github]

Tabulate lets you print small, nice-looking tables with just one function call.

  • Making tables more readable
  • Formatting a table as HTML or other formats


2. PrettyPandas [Github]

PrettyPandas helps you create report-quality tables with a simple API, making tables more readable by:

  • Adding summary rows and columns
  • Number formatting for currencies and percentages
  • Styling background gradient


Summary

In recent years, many open-source projects have emerged around data quality. We have mentioned only a few libraries here, and there are many more to explore. The growing number of open-source projects in this area indicates that data quality plays an increasingly important role for businesses. As more and more data is produced and processed, organizations need to find ways to manage its quality. Failing to do so exposes businesses to bad decision making, reputational damage and compliance risk.

Do you agree? Please share your thoughts and comment.

I hope this helps.

If you like this content, follow me on Medium or visit me on LinkedIn.
