Connector sync rules

Use connector sync rules to help control which documents are synced between the third-party data source and Elasticsearch. Define sync rules in the Kibana UI for each connector index, under the Sync rules tab for the index.

Sync rules apply to native connectors and connector clients. Sync rules do not apply to Workplace Search connectors, although similar features are available in Workplace Search. See the Workplace Search documentation.

There are two types of sync rules:

  • Basic rules - these rules are represented in a table-like view.
  • Advanced rules - these rules cover complex query-and-filter scenarios that cannot be expressed with basic rules. Advanced rules are defined through a source-specific DSL JSON snippet.
Sync rules tab

General filtering

Before discussing sync rules, it helps to establish some general concepts of data filtering.

Filtering

This diagram illustrates that data filtering can occur in several different processes and locations. First, data might be filtered at its source. We call this "remote filtering", as the process limiting the data runs external to Elastic. Next comes "integration filtering": data filtering that occurs in the process that acts as a bridge between the data’s source and Elasticsearch (its final destination). Filtering that takes place in the Enterprise Search connectors is an example of integration filtering. Finally, Elasticsearch itself can filter data right before persistence through its ingest pipelines.

This guide will not focus much on ingest pipeline filtering. However, sync rules can be used to influence both remote and integration filtering.

At this time, basic sync rules are the only way to control integration filtering for connectors. However, remote filtering covers a far broader topic than connectors alone could influence. For best results, work closely with the owners and maintainers of your data source to ensure that your source data is well organized and the source is optimized for the types of queries the connectors will issue to it.

Sync rules overview

Often, your data lake has far more data than you want exposed to the end user. For example, you may want to search a product catalog, but not include vendor contact information, even if the two are co-located for business purposes.

The optimal time to filter data is early in the data pipeline, for two reasons:

  • Performance: It’s more efficient to send a query to the backing data source than to obtain all the data and then filter it in the connector. It’s faster to send a smaller dataset over a network and to process it on the connector side.
  • Security: The query-time filtering is applied on the data source side, so the data is not sent over the network and into the connector, which can limit the exposure of your data.

In a perfect world, all filtering would be done as remote filtering.

In practice, however, this is not always possible. Some sources do not allow robust remote filtering. Others do, but require special setup (building indexes on specific fields, tweaking settings) that may require attention from other members of your business.

With this in mind, sync rules were designed to influence both remote filtering and integration filtering. Your goal should be to do as much remote filtering as possible, but integration filtering is a perfectly viable fallback. By definition, remote filtering is applied before the data is obtained from a third-party source. Integration filtering is applied after the data is obtained from a third-party source, but before it is ingested into the Elasticsearch index.

All sync rules occur on a given document before any ingest pipelines are run on that same document. Therefore, you could use your ingest pipelines for any processing that must occur after integration filtering has occurred.
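
The ordering of the three filtering phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `sync` function, the `catalog` data, and the three callables are hypothetical names invented for this sketch and are not part of any connector API.

```python
from typing import Callable, Iterable, Iterator


def sync(
    fetch_remote: Callable[[], Iterable[dict]],
    integration_filter: Callable[[dict], bool],
    ingest_pipeline: Callable[[dict], dict],
) -> Iterator[dict]:
    """Yield documents in the order the three phases touch them."""
    # Remote filtering: the source only returns what its own query selects.
    for doc in fetch_remote():
        # Integration filtering: the connector drops documents it already fetched.
        if not integration_filter(doc):
            continue
        # Ingest pipeline: runs last, only on documents that survived.
        yield ingest_pipeline(doc)


# Hypothetical source data: products co-located with vendor contacts.
catalog = [
    {"type": "product", "name": "Kettle", "in_stock": True},
    {"type": "product", "name": "Toaster", "in_stock": False},
    {"type": "vendor_contact", "name": "ACME"},
]


def fetch_products() -> Iterator[dict]:
    # Stands in for a query that the data source itself executes.
    return (d for d in catalog if d["type"] == "product")


result = list(
    sync(
        fetch_products,
        lambda d: d["in_stock"],                     # integration filter
        lambda d: {**d, "name": d["name"].lower()},  # ingest-time processing
    )
)
```

In this sketch the vendor contact never leaves the source, the out-of-stock product is fetched and then dropped by the connector, and only the remaining document reaches the ingest step.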

Basic rules

Each basic rule has one of two "policies": include or exclude. Include rules are used to include the documents that "match" the specified condition. Exclude rules are used to exclude the documents that "match" the specified condition.

A "match" is determined based on a condition defined by a combination of "field", "rule", and "value".

The Field column should be used to define which field on a given document should be considered.

The following rules are available in the Rule column:

  • equals - The field value is equal to the specified value.
  • starts_with - The field value starts with the specified (string) value.
  • ends_with - The field value ends with the specified (string) value.
  • contains - The field value includes the specified (string) value.
  • regex - The field value matches the specified regular expression.
  • > - The field value is greater than the specified value.
  • < - The field value is less than the specified value.

Finally, the Value column is dependent on:

  • the data type in the specified "field"
  • which "rule" was selected.

For example, a value of [A-Z]{2} might make sense for a regex rule, but much less so for a > rule. Similarly, you probably wouldn’t have a value of espresso when operating on an ip_address field, but perhaps you would for a beverage field.
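
To make the field/rule/value combination concrete, here is a minimal Python sketch of how a single condition could be evaluated. It is not the connector's implementation: the `BasicRule` class and `matches` function are hypothetical, and details such as string coercion and partial (rather than full) regex matching are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class BasicRule:
    policy: str  # "include" or "exclude"
    field: str   # which document field to inspect
    rule: str    # one of the names from the Rule column
    value: Any   # what the field value is compared against


def matches(rule: BasicRule, document: dict) -> bool:
    """Return True if the document satisfies the rule's condition."""
    if rule.field not in document:
        return False  # assumption: a missing field never matches
    actual = document[rule.field]
    if rule.rule == "equals":
        return actual == rule.value
    if rule.rule == "starts_with":
        return str(actual).startswith(str(rule.value))
    if rule.rule == "ends_with":
        return str(actual).endswith(str(rule.value))
    if rule.rule == "contains":
        return str(rule.value) in str(actual)
    if rule.rule == "regex":
        # assumption: the pattern may match anywhere in the value
        return re.search(str(rule.value), str(actual)) is not None
    if rule.rule == ">":
        return actual > rule.value
    if rule.rule == "<":
        return actual < rule.value
    raise ValueError(f"unknown rule: {rule.rule}")


doc = {"state": "MA", "id": 1200, "beverage": "espresso"}
state_is_ma = matches(BasicRule("include", "state", "equals", "MA"), doc)
id_above_1000 = matches(BasicRule("exclude", "id", ">", 1000), doc)
```

Note that in this sketch a `>` or `<` comparison between incompatible types (for example a number and a string) raises a `TypeError`, which mirrors the point that the value has to suit both the field's data type and the selected rule.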

Basic rules examples

Example 1

Include only documents that have a state field with the value MA. This is a case-sensitive match.

Simple equals rule

Example 2

Exclude all documents that have an ID field with the value greater than 1000.

Simple greater than rule

Example 3

Exclude all documents that have a state field that matches a specified regex.

Simple regex rule

Performance implications

  • If you’re relying solely on basic rules in the integration filtering phase, the connector will fetch all the data from the data source.
  • For data sources without automatic pagination or similar optimisations, fetching all the data can lead to memory issues, for example when a dataset is too big to fit in memory at once.

The native MongoDB connector provided by Elastic uses pagination and therefore has optimised performance. Keep in mind that custom, community-built connector clients may not have these performance optimisations.

The following diagrams show the concept of pagination. A huge data set may not fit into the memory of a connector instance. If you break this data set up into smaller chunks, they’ll fit into memory one after another.

Extract whole dataset at once

vs

Pagination
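
The difference between the two approaches can be sketched in Python as follows. This is a generic illustration, not the code of any Elastic connector: `fetch_all`, `fetch_pages`, and the in-memory `source` list are hypothetical stand-ins for calls to a real data source.

```python
from typing import Iterator, Sequence


def fetch_all(source: Sequence[dict]) -> list:
    """Load the whole dataset at once: memory use grows with the dataset."""
    return list(source)


def fetch_pages(source: Sequence[dict], page_size: int) -> Iterator[list]:
    """Yield the dataset in chunks: only one page is held in memory at a time."""
    for start in range(0, len(source), page_size):
        yield list(source[start:start + page_size])


# Ten small documents stand in for a dataset too large to load in one go.
source = [{"id": i} for i in range(10)]

page_lengths = [len(page) for page in fetch_pages(source, page_size=4)]
```

Because `fetch_pages` is a generator, the caller can process and discard each page before the next one is requested, which is what keeps peak memory bounded by the page size rather than by the dataset size.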

Basic rules in the remote filtering phase

Because remote filtering happens at data source query time, it is highly specific to the data source.

If the connector cannot determine how to combine one or more basic rules into a single query to the data source, those basic rules will not be used in remote filtering, but will instead be applied in integration filtering. If you observe this happening and want to tune performance, consider using advanced rules to fine-tune your remote filtering.

Advanced rules

Advanced rules override any remote filtering query that could have been inferred from the basic rules. If an advanced rule is defined, any defined basic rules will be used exclusively for integration filtering.

Advanced rules are only used in remote filtering. You can think of advanced rules as a language-agnostic way to represent queries to the data source. Therefore, these rules are highly source-specific.

Each connector supporting advanced rules provides its own DSL to specify rules. Refer to the documentation for each connector for details.
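
The division of labour described in this section can be summarised in a short Python sketch. Everything here is hypothetical and exists only for illustration: `plan_filtering`, `infer_nothing`, and the sample rules are invented names, and a real connector decides this internally in its own way.

```python
from typing import Any, Callable, Optional, Tuple


def plan_filtering(
    basic_rules: list,
    advanced_rule: Optional[Any],
    infer_remote_query: Callable[[list], Tuple[Any, list]],
) -> dict:
    """Decide which rules feed remote filtering and which feed integration filtering."""
    if advanced_rule is not None:
        # The advanced rule replaces any query inferred from basic rules,
        # so every basic rule is left for integration filtering.
        return {"remote": advanced_rule, "integration": list(basic_rules)}
    # No advanced rule: try to infer a remote query from the basic rules.
    # Rules that could not be used stay in integration filtering.
    query, unused = infer_remote_query(basic_rules)
    return {"remote": query, "integration": list(unused)}


def infer_nothing(rules: list) -> Tuple[Any, list]:
    # Hypothetical source that cannot translate any basic rule into a query.
    return None, rules


# Hypothetical rules, written in an illustrative shape only.
basic = [{"policy": "exclude", "field": "price", "rule": ">", "value": 1000}]
advanced = [{"$match": {"bedrooms": {"$gte": 3}}}]

with_advanced = plan_filtering(basic, advanced, infer_nothing)
without_advanced = plan_filtering(basic, None, infer_nothing)
```

In both calls the basic rule ends up in integration filtering, but only the first call sends a query to the source, which is why an advanced rule is the tool to reach for when tuning remote filtering.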

Interplay between basic rules and advanced rules

You can also use basic rules and advanced rules together for filtering a data set.

The following diagram provides an overview of the order in which advanced rules, basic rules, and pipeline filtering are applied to your documents:

Sync Rules: What is applied when?

Example

In the following example, we want to filter a data set of apartments so that it only contains apartments with specific properties. We’ll use basic and advanced rules throughout the example.

A sample apartment looks like this in JSON format:

    {
        "id": 1234,
        "bedrooms": 3,
        "price": 1500,
        "address": {
            "street": "Street 123",
            "government_area": "Area",
            "country_information": {
                "country_code": "PT",
                "country": "Portugal"
            }
        }
    }

The target data set should fulfill the following conditions:

  • Every apartment should have at least three bedrooms
  • The apartments should not be more expensive than 1000/month
  • The apartment with id 1234 should be included without considering the first two conditions
  • Each apartment should be located in either Portugal or Spain

Basic rules

To create a new basic rule, navigate to the Sync Rules tab and select Draft new sync rules:

Draft new rules

Afterwards, press the Save and validate draft button to validate these rules. Note that saved rules are in a draft state. They won’t be executed in the next sync unless they are applied.

Save and validate draft

After a successful validation you can apply your rules so they’ll be executed in the next sync.

The following conditions can be covered by basic rules:

  1. The apartment with id 1234 should be included regardless of the bedroom and price conditions
  2. Every apartment should have at least three bedrooms
  3. The apartments should not be more expensive than 1000/month
Save and validate draft

Remember that order matters for basic rules. You may get different results for a different ordering.
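
A small Python sketch can show why ordering changes the outcome. It rests on two assumptions that are not stated in this guide: that rules are checked from top to bottom with the first matching rule deciding the result, and that documents matching no rule are kept. The rule tuples below are one possible encoding of the three conditions, and `should_sync` is a hypothetical helper, not a connector function.

```python
import operator

# Comparison functions for the rule names used in this sketch.
OPERATORS = {"equals": operator.eq, ">": operator.gt, "<": operator.lt}

# (policy, field, rule, value), listed top to bottom.
RULES = [
    ("include", "id", "equals", 1234),  # always keep apartment 1234
    ("exclude", "bedrooms", "<", 3),    # drop fewer than three bedrooms
    ("exclude", "price", ">", 1000),    # drop more expensive than 1000
]


def should_sync(apartment: dict, rules: list) -> bool:
    """Assumed semantics: the first matching rule decides the outcome."""
    for policy, field, rule, value in rules:
        if field in apartment and OPERATORS[rule](apartment[field], value):
            return policy == "include"
    return True  # assumed default: keep documents that no rule matched


apartments = [
    {"id": 1234, "bedrooms": 3, "price": 1500},
    {"id": 1, "bedrooms": 2, "price": 800},
    {"id": 2, "bedrooms": 4, "price": 900},
    {"id": 3, "bedrooms": 3, "price": 1200},
]

kept = [a["id"] for a in apartments if should_sync(a, RULES)]

# Move the include rule to the bottom and evaluate again.
reordered = RULES[1:] + RULES[:1]
kept_reordered = [a["id"] for a in apartments if should_sync(a, reordered)]
```

With the include rule first, apartment 1234 is kept despite its price. With the include rule last, the price rule matches first and apartment 1234 is dropped, so `kept` and `kept_reordered` differ.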

The remaining condition, "Each apartment should be located in either Portugal or Spain", is covered by advanced rules.

Advanced rules

The last condition can be implemented by leveraging advanced rules.

You want to include only apartments that are located in "Portugal" or "Spain". We need to use advanced rules here because we’re dealing with deeply nested objects.

Let’s assume that the apartment data is stored inside a MongoDB instance. For MongoDB, we support aggregation pipelines in our advanced rules, among other things. An aggregation pipeline that selects only those properties that are located in Portugal or Spain would look like this:

    [
      {
        "$match": {
          "$or": [
            {
              "address.country_information.country": "Portugal"
            },
            {
              "address.country_information.country": "Spain"
            }
          ]
        }
      }
    ]
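
To see which documents this pipeline would select without a running database, the following Python sketch emulates this one pipeline shape locally. It is an illustration only: `get_path` and `matches_any` are hypothetical helpers that handle nothing beyond an `$or` of equality checks on dotted paths through nested objects, and they do not reproduce how MongoDB evaluates queries in general (arrays, for instance, are ignored).

```python
from functools import reduce


def get_path(document: dict, dotted_path: str):
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    def step(node, key):
        return node.get(key) if isinstance(node, dict) else None
    return reduce(step, dotted_path.split("."), document)


def matches_any(document: dict, conditions: list) -> bool:
    """True if at least one condition holds; each maps a dotted path to a value."""
    return any(
        all(get_path(document, path) == expected for path, expected in cond.items())
        for cond in conditions
    )


pipeline = [
    {
        "$match": {
            "$or": [
                {"address.country_information.country": "Portugal"},
                {"address.country_information.country": "Spain"},
            ]
        }
    }
]
conditions = pipeline[0]["$match"]["$or"]

# Hypothetical sample documents, trimmed to the fields that matter here.
apartments = [
    {"id": 1234, "address": {"country_information": {"country": "Portugal"}}},
    {"id": 2, "address": {"country_information": {"country": "Germany"}}},
    {"id": 3, "address": {"country_information": {"country": "Spain"}}},
    {"id": 4},
]

selected = [a["id"] for a in apartments if matches_any(a, conditions)]
```

The dotted keys are what let the rule reach into the nested `country_information` object, which is the part that a flat field/rule/value condition does not express.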

To create these advanced rules, navigate again to the sync rules creation dialog and select the Advanced rules tab. You can now paste your aggregation pipeline into the input field under aggregate.pipeline:

Paste aggregation pipeline

After a successful validation, you can apply them as you already did for the basic rules. This view shows you the applied sync rules, which will be executed in the next sync:

Advanced rules appeared

After a successful sync you can expand the sync details to see which rules were applied:

Applied rules during sync