@MantejGill
MantejGill / distrust_measures.csv
Created December 22, 2023 07:40
Distrust measures
Distrust Measure,Metric,Description
Data Quality,Completeness,Measures the proportion of missing data in a dataset. A dataset with a low percentage of missing data is considered to be of higher quality.
,Validity,Measures whether the data in a dataset conforms to a set of predefined rules or constraints.
,Consistency,Measures whether the data in a dataset is consistent with other data sources.
,Timeliness,Measures how recent the data in a dataset is. A dataset with more recent data is considered to be of higher quality.
,Uniqueness,Measures whether the data in a dataset is unique or duplicated.
,Accuracy,Measures the degree to which the data in a dataset is free from errors or inaccuracies.
,Precision and Recall,"Evaluates the performance of a model. Precision measures the proportion of true positive predictions out of all positive predictions, and recall measures the proportion of true positive predictions out of all actual positive cases."
,F1-Score,"The harmonic mean of precision and recall."
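The metrics described in the rows above can be sketched in a few lines of Python. This is a minimal illustration, not code from the gist: the function names are hypothetical, missing values are assumed to be represented as `None`, labels are assumed to be binary (1 = positive, 0 = negative), and completeness is reported as the share of present values, that is, one minus the proportion of missing data.

```python
def completeness(values):
    """Share of entries that are present (assumes missing entries are None)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Share of non-missing entries that are distinct."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return len(set(present)) / len(present)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

For example, `completeness([1, None, 3, 4])` gives 0.75, and a prediction with 2 true positives, 1 false positive, and 1 false negative gives precision, recall, and F1 all equal to 2/3.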
MantejGill / decentralized_datasets.csv
Created December 21, 2023 09:08
Decentralized Datasets
Tools,URL
Ocean Protocol,https://oceanprotocol.com/
Datum,https://datum.org/
Enigma,https://enigma.com/our-data
DataBrokerDAO,https://www.databroker.global/
DeBlock,https://deblock.io/
MantejGill / smpc.csv
Created December 21, 2023 09:04
Secure Multi-Party Computation (SMPC)
Tools,URL
PySyft,https://github.com/OpenMined/PySyft
Enigma,https://www.enigma.com/
Microsoft SEAL,https://github.com/Microsoft/SEAL
TF Encrypted,https://www.tf-encrypted.org/
CrypTen,https://crypten.org/
MPyC,https://mpyc.org/
FairScale,https://fairscale.ai/
MantejGill / data_management_and_governance_platforms.csv
Created December 21, 2023 09:01
Data Management and Governance Platforms
Tools,URL
Collibra,https://www.collibra.com/us/en
Informatica MDM,https://www.informatica.com/in/products/master-data-management.html
SAP Master Data Governance,https://www.sap.com/products/technology-platform/master-data-governance.html
Alation,https://www.alation.com/
Talend MDM,https://www.talend.com/resources/what-is-master-data-management/
MantejGill / dataset_anonymization.csv
Created December 21, 2023 08:52
Dataset Anonymization
Tools,URL
DataWig,https://datawig.readthedocs.io/en/latest/
Faker,https://faker.readthedocs.io/en/master/
ARX,https://arx.deidentifier.org/anonymization-tool/
DataSunrise Data Masking,https://www.datasunrise.com/data-masking/
Informatica Data Masking,https://www.informatica.com/blogs/informatica-data-masking-solution-a-data-security-product-dynamic-data-masking-for-structured-data-masking.html
Delphix Dynamic Data Platform,https://www.delphix.com/platform/masking
Solix Data Masking,https://www.solix.com/data-management-solutions/data-masking/
MantejGill / federated_learning.csv
Created December 21, 2023 08:48
Federated Learning
Tools,URL
TensorFlow Federated (TFF),https://www.tensorflow.org/federated
PySyft,https://github.com/OpenMined/PySyft
FATE,https://fate.fedai.org/
MantejGill / differential_privacy.csv
Created December 21, 2023 08:34
Differential Privacy
Tools,URL
DataFly,https://datafly.online/
DP-Lib,https://www.microsoft.com/en-us/ai/ai-lab-differential-privacy
TensorFlow Privacy,https://github.com/tensorflow/privacy
OpenDP,https://opendp.org/
Rdp,https://cran.r-project.org/web/packages/RDP/index.html
PyDP,https://github.com/OpenMined/PyDP
PyTorch’s Opacus,https://opacus.ai/
SecretFlow,https://github.com/secretflow/secretflow
IBM’s Differential Privacy Library,https://github.com/IBM/differential-privacy-library
MantejGill / data_bias.csv
Created December 9, 2023 06:22
Tools to find bias in Data
Tool,Description
AI Fairness 360,An open-source toolkit from IBM for detecting and mitigating bias in machine learning models.
What-If Tool,A tool that lets users test different scenarios within their data to check how changes affect the results of a machine learning model.
TCAV (Testing with Concept Activation Vectors),"A tool developed by Google to scan algorithmic models for common biases, such as race, gender, and location."
FairML,An open-source Python toolbox for auditing machine learning predictive models to detect bias.
MantejGill / dataprofilingtools.csv
Last active December 8, 2023 08:33
Data Profiling Tools
Name,Description
Dataedo,"A data profiling tool with a data catalog feature, allowing users to browse minimum, maximum, average, and median values, as well as see top values and other statistics."
Atlan,"A data management platform that provides data profiling capabilities, including data types, length, recurring patterns, and data quality assessment."
Boltic,"A free data profiling tool that offers features like data validation, data transformation, and data cleaning."
Aggregate Profiler,"An open-source data profiling tool that provides profiling, filtering, governance, similarity checks, data enrichment, and real-time alerting for data issues or changes."
IBM InfoSphere Information Analyzer,"A data analysis tool that helps organizations discover relationships, patterns, and trends in their data."
Informatica Data Explorer,"A data exploration tool that allows users to visualize, analyze, and clean data."
Melissa,A data quality tool that helps organizations identify and correct data quality issues.
Qua
Dataset,Type,Description,Link
BenchMD,Medical Modalities,"The BenchMD benchmark consists of 19 real-world medical datasets across 7 medical modalities, including X-ray, CT, MRI, ultrasound, fundus, OCT, and pathology",https://www.rajpurkarlab.hms.harvard.edu/benchmd
ImageNet,Image Classification,"The ImageNet dataset is a large-scale image classification dataset with over 1.2 million images in 1,000 categories",https://www.image-net.org/
COCO,Object Detection,"The COCO dataset is a large-scale object detection, segmentation, and captioning dataset with over 330,000 images and 2.5 million object instances labeled across 80 object categories",https://cocodataset.org/#home
GLUE,Natural Language Processing,"The GLUE benchmark is a collection of nine natural language understanding tasks, including sentiment analysis, question answering, and textual entailment",https://gluebenchmark.com/
Tencent-MVSE,Video Similarity,"The Tencent-MVSE dataset is a large-scale benchmark dataset for multi-modal video similarity evalu