Gist: jeffvestal/563b70fdf745a678a0aa927710a96785

Main Components

  • NER model for Location/Org/People identification
  • Regex patterns in a script processor to identify PII that has a standard structure (here, SSNs and phone numbers).

Ingest Pipeline

PUT _ingest/pipeline/pii_script-redact
{
  "description": "PII redacting ingest pipeline",
  "processors": [
    {
      "set": {
        "field": "redacted",
        "value": "{{{message}}}"
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "String msg = ctx['message']; \n        for (item in ctx['ml']['inference']['entities']) {\n          msg = msg.replace(item['entity'], '<' + item['class_name'] + '>')\n        }\n        ctx['redacted']=msg",
        "if": "return ctx['ml']['inference']['entities'].isEmpty() == false",
        "tag": "ner_redact",
        "description": "Redact NER entities"
      }
    },
    {
      "script": {
        "source": "String fieldValue = ctx['redacted'];\n\nPattern pattern_ssn = /(\\d{3}-?\\d{2}-?\\d{4})/;\nPattern pattern_phone = /(\\d{3}?-?\\d{3}-?\\d{4})/;\n\nList patterns = new ArrayList();\npatterns.add(pattern_ssn);\npatterns.add(pattern_phone);\n\nfor (p in patterns) {\n    Matcher matcher = p.matcher(ctx['redacted']);\n    ctx['redacted'] = matcher.replaceAll(\"<redacted>\");\n}",
        "tag": "regex_redact",
        "description": "Redact regex patterns"
      }
    },
    {
      "remove": {
        "field": [
          "message",
          "ml"
        ]
      }
    }
  ]
}
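The `ner_redact` script in the pipeline is easiest to reason about outside Elasticsearch. The following is a minimal Python sketch of the same replacement loop, not the pipeline itself. It assumes each inference entity carries an `entity` string and a `class_name` label, as the Painless source reads them; the function name and the sample entity list are hypothetical and hand-written for illustration.

```python
def redact_entities(message, entities):
    """Replace every occurrence of each entity string with <CLASS_NAME>."""
    redacted = message
    for item in entities:
        # Mirrors msg.replace(item['entity'], '<' + item['class_name'] + '>')
        redacted = redacted.replace(item["entity"], "<" + item["class_name"] + ">")
    return redacted


# Hypothetical entity list, shaped like the fields the Painless script reads.
sample_entities = [
    {"entity": "Bruce Wayne", "class_name": "PER"},
    {"entity": "Gotham", "class_name": "LOC"},
]
sample_message = "Hello, my name is Bruce Wayne and I live in Gotham."

print(redact_entities(sample_message, sample_entities))
# Hello, my name is <PER> and I live in <LOC>.
```

Because the replacement is a plain substring match, an entity string that also appears inside a longer word would be replaced there too.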

NER Model From Huggingface

dslim/bert-base-NER

Load with eland

eland on GitHub

Regex Script - readable

String fieldValue = ctx['redacted'];

Pattern pattern_ssn = /(\d{3}-?\d{2}-?\d{4})/;
Pattern pattern_phone = /(\d{3}?-?\d{3}-?\d{4})/;

List patterns = new ArrayList();
patterns.add(pattern_ssn);
patterns.add(pattern_phone);

for (p in patterns) {
    Matcher matcher = p.matcher(ctx['redacted']);
    ctx['redacted'] = matcher.replaceAll("<redacted>");
}
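To try the two patterns without a cluster, the loop above can be mirrored in Python. This is a rough sketch that assumes Python's `re` module treats these particular patterns the same way as the Java regex engine behind Painless; the function name `redact_patterns` is hypothetical.

```python
import re

# Same patterns as the Painless script, applied in the same order.
PATTERN_SSN = re.compile(r"(\d{3}-?\d{2}-?\d{4})")
PATTERN_PHONE = re.compile(r"(\d{3}?-?\d{3}-?\d{4})")


def redact_patterns(text):
    """Apply each pattern in turn, replacing every match with <redacted>."""
    for pattern in (PATTERN_SSN, PATTERN_PHONE):
        text = pattern.sub("<redacted>", text)
    return text


print(redact_patterns("My SSN is 123-45-6789 and you can reach me at 312-456-7890"))
# My SSN is <redacted> and you can reach me at <redacted>
```

Two behaviors are worth knowing when adapting the patterns. The hyphens are optional and the SSN pattern runs first, so an unhyphenated ten-digit number such as `3124567890` loses its first nine digits to the SSN pattern and ends up as `<redacted>0`. The `?` in `\d{3}?` only makes the quantifier lazy and does not make the area code optional, so a seven-digit number such as `456-7890` is left untouched.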

Test

Input doc for ingest pipeline

[
  {
    "_source": {
      "message": "Hello, my name is Bruce Wayne and I live in Gotham. My SSN is 123-45-6789 and you can reach me at 312-456-7890"
    }
  }
]
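The array above is the list of documents to run through the pipeline. One way to test it is Elasticsearch's simulate pipeline API, which expects that list under a `docs` key. The sketch below only builds the request and prints it; it assumes the pipeline ID from the `PUT` request earlier, and the helper name `build_simulate_request` is hypothetical. Sending the request (from Kibana Dev Tools or any HTTP client) is left out.

```python
import json

PIPELINE_ID = "pii_script-redact"  # ID used in the PUT _ingest/pipeline request

docs = [
    {
        "_source": {
            "message": (
                "Hello, my name is Bruce Wayne and I live in Gotham. "
                "My SSN is 123-45-6789 and you can reach me at 312-456-7890"
            )
        }
    }
]


def build_simulate_request(pipeline_id, docs):
    """Return the HTTP method, path, and JSON body for a simulate call."""
    path = f"/_ingest/pipeline/{pipeline_id}/_simulate"
    body = {"docs": docs}
    return "POST", path, body


method, path, body = build_simulate_request(PIPELINE_ID, docs)
print(method, path)
print(json.dumps(body, indent=2))
```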

Example Output

    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "redacted": "Hello, my name is <PER> and I live in <LOC>. My SSN is <redacted> and you can reach me at <redacted>"
        },
        "_ingest": {
          "timestamp": "2023-01-19T15:26:45.161817903Z"
        }
      }
    }