LeenSrouji/open_source_python_data_libraries.md Secret

15 Useful Open-Source Data Quality Python Libraries


Photo by Carlos Muza on Unsplash


Whether you're using data for business analysis or for building machine-learning models, poor data can hold you back, and discovering and fixing issues can consume a lot of your time.
In this article, I have gathered the top open-source Python libraries to help you improve data quality in your day-to-day work.
Data Profiling And Assessment

Exploratory Analytics

1. Pandas Profiling [Github]

A library that generates profiling reports from pandas DataFrames.
Key features:

Data profiling (missing and unique values, ...)
Data distributions and histograms
Quantile and descriptive statistics (mean, standard deviation, Q1, ...)
Type inference
Interactions and correlations
Generates the report in HTML format
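
To make the listed statistics concrete, here is a minimal sketch of the kind of per-column summary a profiling report contains. It uses only the Python standard library and is not the Pandas Profiling API; the `profile_column` helper and the sample `ages` column are hypothetical names made up for illustration.

```python
import statistics

def profile_column(values):
    """Summarise one column; None marks a missing value (an assumption of this sketch)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    # Descriptive statistics only make sense for numeric columns with 2+ values.
    numeric = all(isinstance(v, (int, float)) for v in present)
    if numeric and len(present) >= 2:
        q1, median, q3 = statistics.quantiles(present, n=4)
        summary.update(
            mean=statistics.mean(present),
            std=statistics.stdev(present),
            q1=q1,
            median=median,
            q3=q3,
        )
    return summary

ages = [23, 35, None, 41, 35, 29]  # hypothetical sample column
report = profile_column(ages)
print(report)
```

A real profiling library adds type inference, histograms, and correlations on top of summaries like this one and renders everything as a report.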


2. Great Expectations [Github]

A shared, open standard for data quality.
It helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Key features:

Built from Expectation constructs, which are data assertions
Declarative data tests, for example:
- Expect this table to have between x and y rows
- Expect missing values not to exceed 20%
- Expect the date column format to be MM-DD-YYYY
- More out-of-the-box constructs: uniqueness, outliers, and more
- Support for writing custom expectations
Automated data profiling
Renders the tests into human-readable forms and docs
Integration with many tools and systems, such as pandas, Jupyter Notebooks, Spark, MySQL, Databricks, ...
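
The three example tests above can be sketched in plain Python to show what a declarative data assertion boils down to. This is an illustration of the idea only, not the Great Expectations API; the function names, the sample `rows`, and the thresholds are all hypothetical.

```python
from datetime import datetime

def expect_row_count_between(rows, low, high):
    """Pass when the table has between low and high rows (inclusive)."""
    return low <= len(rows) <= high

def expect_missing_ratio_at_most(rows, column, max_ratio):
    """Pass when the share of missing (None) values does not exceed max_ratio."""
    if not rows:
        return True
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows) <= max_ratio

def expect_date_format(rows, column, fmt="%m-%d-%Y"):
    """Pass when every non-missing value parses with the given format."""
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True

# Hypothetical sample table: one missing email, one date in the wrong format.
rows = [
    {"id": 1, "signup": "01-15-2022", "email": "a@example.com"},
    {"id": 2, "signup": "02-03-2022", "email": None},
    {"id": 3, "signup": "2022-03-01", "email": "c@example.com"},
    {"id": 4, "signup": "03-20-2022", "email": "d@example.com"},
    {"id": 5, "signup": "04-11-2022", "email": "e@example.com"},
]

results = {
    "row count between 1 and 10": expect_row_count_between(rows, 1, 10),
    "missing emails <= 20%": expect_missing_ratio_at_most(rows, "email", 0.20),
    "signup is MM-DD-YYYY": expect_date_format(rows, "signup"),
}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Keeping each check as a small named assertion is what makes the results easy to render into readable documentation later.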


3. SodaSQL [Github]

An open-source command-line tool that executes SQL queries, based on defined input, to run tests on datasets in different data sources (such as Snowflake, PostgreSQL, Athena, ...) and find invalid or missing data.
In addition, it can collect defined metrics such as min, max, avg, stddev, and many more.
Key features:

Custom SQL tests
Tests defined in YAML format for each table
Can integrate with data orchestration tools
Connects to and scans your datasets
Identifies column formats (such as email, date, phone numbers, ...)
Test results in JSON
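
The underlying idea, collecting metrics with SQL and testing them against thresholds, can be sketched with the standard-library `sqlite3` module. This does not use Soda SQL or its YAML format; the `orders` table, the metric names, and the thresholds are assumptions made up for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read result columns by name
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "a@example.com", 10.0),
        (2, None, 25.5),
        (3, "not-an-email", 7.5),
        (4, "d@example.com", 40.0),
    ],
)

# One scan query collects the metrics that the tests are written against.
scan_sql = """
    SELECT COUNT(*) AS row_count,
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_email,
           SUM(CASE WHEN email IS NOT NULL
                     AND email NOT LIKE '%_@_%._%' THEN 1 ELSE 0 END) AS invalid_email,
           MIN(amount) AS min_amount,
           MAX(amount) AS max_amount,
           AVG(amount) AS avg_amount
    FROM orders
"""
metrics = dict(conn.execute(scan_sql).fetchone())

# Each test is a named boolean evaluated on the collected metrics.
tests = {
    "missing_email <= 1": metrics["missing_email"] <= 1,
    "invalid_email == 0": metrics["invalid_email"] == 0,
    "min_amount >= 0": metrics["min_amount"] >= 0,
}
print(json.dumps({"metrics": metrics, "tests": tests}, indent=2))
```

The `LIKE` pattern is a deliberately loose email check; a real scanner would apply stricter format rules.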

Predictive Analytics

4. Ydata [Github]

A library for assessing data quality throughout the multiple stages of data pipeline development.
It helps capture a holistic view of the data by looking at it from multiple dimensions:

Missing values
Duplicates
Data drift and outlier detection
Data relations and correlations

In addition, it integrates with Great Expectations, which runs data assertions that let you validate and profile your data and automate report creation.

5. Deepchecks [Github]

A Python package for validating machine learning models and data with minimal effort.
It includes checks related to various types of issues, such as:

Model performance
Data integrity:
- Mixed nulls
- Duplicates
- String mismatch and comparison
- Frequency changes
- Special characters and more
Distribution drifts
Feature importance and more

Although Deepchecks was built for machine learning, its data integrity and drift detection checks are also useful for general data testing.
You can use the SingleDatasetIntegrity suite, or specific checks from the other suites, treating the train and test sets as the reference and current dataframes.
Moreover, Deepchecks lets you write custom suites and checks.
The results are visualized either as a table or as a Plotly graph.


6. Evidently AI [Github]

A tool to analyze and monitor machine learning models.
Key features:

Data distributions
Data Drift
Building custom dashboards
Model health and performance
Integration with Grafana and Prometheus


7. Alibi Detect [Github]

A library dedicated to machine learning, focused on outlier, adversarial, and drift detection.
Key features:

Drift and outlier detection for tabular data, text, images, and time series
Covers both online and offline detectors
TensorFlow and PyTorch backends are supported for drift detection
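
For intuition about what a drift detector measures, one common statistic for a single numeric feature is the two-sample Kolmogorov-Smirnov distance between a reference sample and a current sample. The sketch below computes it in plain Python; it is not the Alibi Detect API, and the sample values and the 0.5 alert threshold are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    ref, cur = sorted(reference), sorted(current)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in sorted(set(ref) | set(cur)):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap

reference = [1, 2, 3, 4, 5]      # e.g. feature values seen at training time
slightly_shifted = [3, 4, 5, 6, 7]
fully_shifted = [6, 7, 8, 9, 10]

THRESHOLD = 0.5  # hypothetical alert level, chosen only for this example
for name, sample in [("slight", slightly_shifted), ("full", fully_shifted)]:
    gap = ks_statistic(reference, sample)
    print(name, round(gap, 3), "drift" if gap > THRESHOLD else "ok")
```

Dedicated libraries go further by turning the statistic into a p-value, correcting for multiple features, and supporting non-tabular data.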

Data Cleaning and Formatting:

1. Scrubadub [Github]

Identifies and removes PII (personally identifiable information), such as names, phone numbers, addresses, credit-card numbers, and more, from free text.
In addition, you can implement customized detectors.
  import scrubadub

  text = "My cat can be contacted on example@example.com, or 1800 555-5555"
  scrubadub.clean(text)
  >> 'My cat can be contacted on {{EMAIL}}, or {{PHONE}}'

2. Arrow [Github]

Arrow provides a sensible and human-friendly approach to creating, manipulating, formatting, and converting dates, times, and timestamps.
  import arrow

  utc = arrow.utcnow()                    # current time in UTC
  local = utc.to('US/Pacific')            # convert to another time zone
  past = local.dehumanize("2 days ago")   # parse a relative phrase
  print(past)
  >> 2022-01-09T02:11:11.939887-08:00     (example output; depends on when you run it)
  print(past.humanize(locale="ar"))
  >> منذ يومين

3. Beautifier [Github]

A library for cleaning up URL patterns and emails.
It also helps you to:

Check email validity
Parse emails by domain and username
Parse URLs by domain and parameters
Clean Unicode, special characters, and unnecessary redirection patterns from URLs
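
To show what parsing by domain, username, and parameters produces, here is a rough equivalent built on the standard-library `urllib.parse` module. It is not Beautifier's API; `split_email` and `split_url` are hypothetical helpers, and the sketch does no validation or cleanup.

```python
from urllib.parse import parse_qs, urlparse

def split_email(address):
    """Split an address at the last '@' into username and domain."""
    username, _, domain = address.rpartition("@")
    return {"username": username, "domain": domain}

def split_url(url):
    """Break a URL into its domain, path, and decoded query parameters."""
    parts = urlparse(url)
    return {
        "domain": parts.netloc,
        "path": parts.path,
        "parameters": parse_qs(parts.query),
    }

email_parts = split_email("jane.doe@example.com")
url_parts = split_url("https://example.com/search?q=data+quality&page=2")
print(email_parts)
print(url_parts)
```

Note that `parse_qs` returns a list for every parameter, since a key can appear more than once in a query string.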

4. Ftfy [Github]

Ftfy stands for 'fixes text for you', and this is exactly what the library does:

Fixes text with bad Unicode
Fixes line breaks
Unescapes HTML to plain text
Detects likely mojibake
Provides an explanation that shows what happened to the text

  import ftfy

  ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
  >> "The Mona Lisa doesn't have eyebrows."


5. Dora [Github]

Exploratory data analysis toolkit for Python.
 Key features:

Data cleaning (null values, category to ordinal, removing columns, transformations on columns)
Feature selection and extraction
Visualization (plotting features)
Partitioning data for model validation
Versioning transformations of data
Requires numerical data for plotting and for most of the functions

6. DataCleaner [Github]

A tool for automatically cleaning data sets and making them ready for analysis.
 Key features:

Dropping rows with missing values
Replacing missing values
Encoding non-numerical variables
Works with pandas DataFrames
Can be used from the command line or in a script
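
As a rough illustration of the replace-and-encode steps listed above, the sketch below cleans a single column in plain Python. It is not the DataCleaner API. The fill strategy (median for numeric columns, most frequent value otherwise) and the alphabetical integer encoding are assumptions of this example, and `clean_column` is a hypothetical helper.

```python
import statistics

def clean_column(values):
    """Fill missing (None) values, then encode non-numerical values as integers."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    numeric = all(isinstance(v, (int, float)) for v in present)
    # Assumed strategy: median for numbers, most frequent value for the rest.
    fill = statistics.median(present) if numeric else statistics.mode(present)
    filled = [fill if v is None else v for v in values]
    if numeric:
        return filled
    # Map each distinct label to an integer code, in alphabetical order.
    codes = {label: code for code, label in enumerate(sorted(set(filled)))}
    return [codes[v] for v in filled]

prices = clean_column([1, None, 3])
colours = clean_column(["red", None, "blue", "red"])
print(prices)
print(colours)
```

Integer codes imply an ordering between categories, which suits tree-based models better than linear ones.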

Tables Preview:

1. Tabulate [Github]

Tabulate lets you print small, nice-looking tables with just one function call.

Makes tables more readable
Formats a table as HTML or other formats


2. PrettyPandas [Github]

PrettyPandas helps you create report-quality tables with a simple API.
It makes tables more readable by:

Adding summary rows and columns
Number formatting for currency and percentages
Styling with a background gradient

 
Summary

In recent years, many open-source projects have emerged around data quality. We have covered only a few libraries here, and there are many more to explore.
The growing number of open-source projects in this space indicates that data quality plays an increasingly important role for businesses.
As more and more data is produced and processed, organizations need to find ways to manage its quality. Failing to do so exposes businesses to bad decision-making, reputational damage, and compliance risk.
Do you agree? Please share your thoughts in the comments.
I hope this helps.