Sparse vector field type

edit

A sparse_vector field can index features and weights so that they can later be used to query documents in queries with a sparse_vector. This field can also be used with a legacy text_expansion query.

sparse_vector is the field type that should be used with ELSER mappings.

resp = client.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "text.tokens": {
                "type": "sparse_vector"
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index',
  body: {
    mappings: {
      properties: {
        'text.tokens' => {
          type: 'sparse_vector'
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index",
  mappings: {
    properties: {
      "text.tokens": {
        type: "sparse_vector",
      },
    },
  },
});
console.log(response);
PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector"
      }
    }
  }
}

See semantic search with ELSER for a complete example on adding documents to a sparse_vector mapped field using ELSER.

Multi-value sparse vectors

edit

When passing in arrays of values for sparse vectors the max value for similarly named features is selected.

The paper Adapting Learned Sparse Retrieval for Long Documents (https://arxiv.org/pdf/2305.18494.pdf) discusses this in more detail. In summary, research findings support representation aggregation typically outperforming score aggregation.

For instances where you want to have overlapping feature names use should store them separately or use nested fields.

Below is an example of passing in a document with overlapping feature names. Consider that in this example two categories exist for positive sentiment and negative sentiment. However, for the purposes of retrieval we also want the overall impact rather than specific sentiment. In the example impact is stored as a multi-value sparse vector and only the max values of overlapping names are stored. More specifically the final GET query here returns a _score of ~1.2 (which is the max(impact.delicious[0], impact.delicious[1]) and is approximate because we have a relative error of 0.4% as explained below)

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard"
            },
            "impact": {
                "type": "sparse_vector"
            },
            "positive": {
                "type": "sparse_vector"
            },
            "negative": {
                "type": "sparse_vector"
            }
        }
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    document={
        "text": "I had some terribly delicious carrots.",
        "impact": [
            {
                "I": 0.55,
                "had": 0.4,
                "some": 0.28,
                "terribly": 0.01,
                "delicious": 1.2,
                "carrots": 0.8
            },
            {
                "I": 0.54,
                "had": 0.4,
                "some": 0.28,
                "terribly": 2.01,
                "delicious": 0.02,
                "carrots": 0.4
            }
        ],
        "positive": {
            "I": 0.55,
            "had": 0.4,
            "some": 0.28,
            "terribly": 0.01,
            "delicious": 1.2,
            "carrots": 0.8
        },
        "negative": {
            "I": 0.54,
            "had": 0.4,
            "some": 0.28,
            "terribly": 2.01,
            "delicious": 0.02,
            "carrots": 0.4
        }
    },
)
print(resp1)

resp2 = client.search(
    index="my-index-000001",
    query={
        "term": {
            "impact": {
                "value": "delicious"
            }
        }
    },
)
print(resp2)
const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        analyzer: "standard",
      },
      impact: {
        type: "sparse_vector",
      },
      positive: {
        type: "sparse_vector",
      },
      negative: {
        type: "sparse_vector",
      },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  document: {
    text: "I had some terribly delicious carrots.",
    impact: [
      {
        I: 0.55,
        had: 0.4,
        some: 0.28,
        terribly: 0.01,
        delicious: 1.2,
        carrots: 0.8,
      },
      {
        I: 0.54,
        had: 0.4,
        some: 0.28,
        terribly: 2.01,
        delicious: 0.02,
        carrots: 0.4,
      },
    ],
    positive: {
      I: 0.55,
      had: 0.4,
      some: 0.28,
      terribly: 0.01,
      delicious: 1.2,
      carrots: 0.8,
    },
    negative: {
      I: 0.54,
      had: 0.4,
      some: 0.28,
      terribly: 2.01,
      delicious: 0.02,
      carrots: 0.4,
    },
  },
});
console.log(response1);

const response2 = await client.search({
  index: "my-index-000001",
  query: {
    term: {
      impact: {
        value: "delicious",
      },
    },
  },
});
console.log(response2);
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "standard"
      },
      "impact": {
        "type": "sparse_vector"
      },
      "positive": {
        "type": "sparse_vector"
      },
      "negative": {
        "type": "sparse_vector"
      }
    }
  }
}

POST my-index-000001/_doc
{
    "text": "I had some terribly delicious carrots.",
    "impact": [{"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
               {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}],
    "positive": {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
    "negative": {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
}

GET my-index-000001/_search
{
  "query": {
    "term": {
      "impact": {
         "value": "delicious"
      }
    }
  }
}

sparse_vector fields can not be included in indices that were created on Elasticsearch versions between 8.0 and 8.10

sparse_vector fields only support strictly positive values. Negative values will be rejected.

sparse_vector fields do not support analyzers, querying, sorting or aggregating. They may only be used within specialized queries. The recommended query to use on these fields are sparse_vector queries. They may also be used within legacy text_expansion queries.

sparse_vector fields only preserve 9 significant bits for the precision, which translates to a relative error of about 0.4%.