Peter Titov

How to remove PII from your Elastic data in 3 easy steps

Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. With Elastic's intuitive ML interface and parsing capabilities, sensitive data can easily be redacted from unstructured text.


Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or another field where data is sensitive, PII may inadvertently be captured and stored. Structured logs make it easy to quickly identify, remove, and protect sensitive data fields; but what about unstructured messages? Or perhaps call center transcriptions?

Elasticsearch, with its long experience in machine learning, offers various options for bringing in custom models, such as large language models (LLMs), and also provides models of its own. These models will help implement PII redaction.

If you would like to learn more about natural language processing, machine learning, and Elastic, be sure to check out the related articles on the Elastic blog.

In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.

Specifically, we’ll walk through setting up a named entity recognition (NER) model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined in an ingest pipeline, where we can use Elastic’s machine learning and data transformation capabilities to remove sensitive information from your data.

Loading the trained model

Before we begin, we must load our NER model into our Elasticsearch cluster. This is easily accomplished with Docker and the Elastic Eland client. From a command line, let’s clone the Eland client with git:

git clone https://github.com/elastic/eland.git

Navigate into the recently downloaded client:

cd eland/

Now let’s build the client:

docker build -t elastic/eland .

From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace your username, password, es-cluster-hostname, and esport.

If you’re using Elastic Cloud or have signed certificates, simply run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --hub-model-id dslim/bert-base-NER --task-type ner --start

If you’re using self-signed certificates, run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start

From here, you’ll see the Eland client in action, downloading the trained model from Hugging Face and automatically deploying it into your cluster!
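Before moving on, you can confirm from DevTools that the model is in the cluster and that its deployment has started. A minimal check, assuming the model ID Eland assigned matches the one we reference later in the pipeline (dslim__bert-base-ner):

GET _ml/trained_models/dslim__bert-base-ner/_stats

In the response, look for a deployment_stats section reporting "state": "started"; if the model is still allocating, wait a moment and run the request again.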

Next, synchronize your newly loaded trained model: on the Machine Learning Overview page in Kibana, click the blue “Synchronize your jobs and trained models” hyperlink.

Now click the Synchronize button.

That’s it! Congratulations, you just loaded your first trained model into Elastic!

Create the redact processor and ingest pipeline

From DevTools, let’s configure the redact processor along with our inference processor to take advantage of the trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll focus on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.

PUT _ingest/pipeline/redact
{
  "processors": [
    {
      "set": {
        "field": "redacted",
        "value": "{{{message}}}"
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '<' + item['class_name'] + '>')\r\n                }\r\n                ctx['redacted']=msg"
      }
    },
    {
      "redact": {
        "field": "redacted",
        "patterns": [
          "%{EMAILADDRESS:EMAIL}",
          "%{IP:IP_ADDRESS}",
          "%{CREDIT_CARD:CREDIT_CARD}",
          "%{SSN:SSN}",
          "%{PHONE:PHONE}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": "\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
          "SSN": "\d{3}-\d{2}-\d{4}",
          "PHONE": "\d{3}-\d{3}-\d{4}"
        }
      }
    },
    {
      "remove": {
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "pii_script-redact"
      }
    }
  ]
}

OK, but what does each processor really do? Let’s walk through each one in detail:

  1. The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.

  2. The INFERENCE processor runs the NER model we loaded against the message field to identify names, locations, and organizations.

  3. The SCRIPT processor then replaces the entities detected in the message field with their class names and writes the result to the redacted field.

  4. Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).

  5. The REMOVE processor deletes the extraneous ml.* fields so they are not indexed; note that we’ll add “message” to this processor once we validate that data is being redacted properly.

  6. The ON_FAILURE / SET processor captures any errors just in case we have them.
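Before heading to the UI in the next section, you can also exercise the pipeline directly from DevTools with the _simulate API. Here is a minimal sketch using a made-up message; the document structure mirrors the sample we’ll test below:

POST _ingest/pipeline/redact/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "John Smith's email address is jsmith123@email.com and his SSN is 942-00-1243."
      }
    }
  ]
}

The response returns each transformed document, so you can iterate on the Grok patterns and script without leaving the console.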

Slice your PII

Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.

Click on the Manage button, and then click Edit.

Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.

[
  {
    "_source": {
      "message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2"
    }
  }
]

Simply press the Run the pipeline button to see the transformed document in the output.
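The exact entities flagged will depend on what the NER model detects in your text, but for the sample document above the result should look roughly like this (the original message field is still present at this stage):

{
  "message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2",
  "redacted": "<PER> lives at 123 Main St. <LOC>, <LOC>. His email address is <EMAIL> and his phone number is <PHONE>.  I found his social security number, it is <SSN>. Oh btw, his credit card is <CREDIT_CARD> and his gateway IP is <IP_ADDRESS>"
}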

What’s next?

After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.
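For reference, after that change the REMOVE processor in the pipeline would look something like this:

    {
      "remove": {
        "field": [
          "ml",
          "message"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }

Run the pipeline test (or _simulate) again and the output should contain only the redacted field, with the raw message dropped before indexing.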

Conclusion

With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.
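In practice, a straightforward way to apply this is to make redact the default pipeline of an index, so every incoming document passes through it on ingest. A minimal sketch, assuming a hypothetical index named my-logs:

PUT my-logs/_settings
{
  "index.default_pipeline": "redact"
}

Alternatively, you can pass ?pipeline=redact on individual index or bulk requests if you only want to redact specific write paths.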

Here’s a quick recap of what we covered:

  • Loading a pre-trained named entity recognition model into an Elastic cluster
  • Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion
  • Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your data.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
