Securing GDPR Personal Data with Access Controls
Editor's Note (August 3, 2021): This post uses deprecated features. Please reference the map custom regions with reverse geocoding documentation for current instructions.
As described in our GDPR white paper, preventing unauthorized access to personal data is a key requirement of GDPR. In this post, we will provide an overview of Elasticsearch security features (powered by X-Pack) and then show how these can be used to implement appropriate access controls on your Elasticsearch data.
Before diving into how we secure and control access to data in Elasticsearch, let us look at what types of data GDPR requires us to protect.
What data do we need to protect?
GDPR uses the term 'Personal Data' to define any information relating to an identified or identifiable natural person ("data subject"). An identifiable person is defined as a natural person who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person [Chapter 1, Article 4(1)].
In our recent blog post covering 'pseudonymization' of data, we discussed two different categories of Personal Data. The first is 'Direct Identifiers,' which can be used to identify users on their own or through cross-referencing with publicly available data. Examples are social security numbers, e-mail addresses, and potentially phone numbers. To make data containing this type of identifier more accessible and subject to less stringent security requirements, a good option is to pseudonymize it. This means that the identifier is replaced by a different unique identifier that can be derived from the original data, e.g., through a cryptographic hash function. The link between these two identifiers can be stored in an Identity Store, e.g., an external database or a separate Elasticsearch index, where it can be subject to considerably more stringent access controls than the pseudonymized data set. As both identifiers are still unique, the ability to link and group data during search or analysis is not diminished.
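As a rough illustration of deriving a pseudonym with a cryptographic hash function, the Python sketch below uses a keyed HMAC-SHA256. The keyed variant, the `SECRET_KEY` value, the function names, and the `key`/`value` layout of the mapping entry are assumptions made for this example, not a description of how any particular pipeline does it.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be loaded from a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable, unique pseudonym from a direct identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def identity_store_entry(value: str) -> dict:
    """Build the pseudonym-to-original mapping kept in the identity store."""
    return {"key": pseudonymize(value), "value": value}


entry = identity_store_entry("jane.doe@example.com")
print(entry["key"])
```

Because the same input always yields the same pseudonym, documents can still be linked and grouped on the pseudonymized field, while reversing it requires access to the stored mapping.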
The second type of identifier is 'Indirect Identifiers.' In isolation, these are generally insufficient to identify users, but may lead to identification when combined with other Indirect Identifiers, data points, or publicly available data. Examples of this type of identifier are age, birth date, occupation, location, and gender. Although some users may choose to apply pseudonymization to these fields, it may not always be practical, as they may carry information required for search or analysis where being able to group by a unique value is not sufficient, e.g., when searching for data by age range or proximity to a specific location.
Protecting data through encryption
Part of securing Personal Data is ensuring data is encrypted in transit between nodes in the Elasticsearch cluster as well as between the cluster itself and clients that have access to the data. X-Pack security features support encryption across all connections to, from, and within the cluster via Transport Layer Security (TLS). The cluster can also be configured to require nodes to authenticate using certificates as they join the cluster in order to prevent rogue nodes from joining the cluster and gaining access to sensitive data.
Elasticsearch also supports deployment on systems that have system-wide, disk-based encryption at rest configured. This makes it less likely for an unauthorized person accessing the underlying file system to access the cleartext.
Authenticating users
One of the fundamental requirements of data access controls is to authenticate users before they are allowed to access the system. X-Pack provides a standalone authentication mechanism that enables quick password-protection of a cluster and supports management of users. This is easy to get started with, and suitable for a wide range of use cases.
Larger organizations often have external authentication mechanisms where users are centrally managed, and X-Pack security features include out-of-the-box support for integration with authentication systems such as LDAP, Active Directory, PKI and SAML. It is also possible to add support for custom systems or solutions by extending X-Pack with a custom plugin.
In order to further tie down access to the Elasticsearch cluster, X-Pack also supports IP filtering, allowing specific IP addresses or subnets to be white- or black-listed.
Controlling access to data
To meet GDPR requirements, authenticating users is in itself not sufficient; we must also ensure that each user has the correct level of access to the data. This is where role and attribute based access controls come in, which is the main focus of this blog post. We will take a couple of sample data sets and show how they can be secured in different ways.
For the first dataset, which we will refer to as dataset A, we will look at securing documents containing the identifier mapping from the pseudonymization process, which are stored in a central 'identity_store' index. These contain a simple mapping between two unique identifiers and no fields that we need to be able to filter on, with each document containing only two fields as follows:
{
  "key": "6be0f12c7026124f637097b7af98dfe82711e7982648ef5c2f2cf51167ed17d0",
  "value": "86.58.0.0"
}
Any user needing to reverse an identifier will therefore need access to the entire index, which means we can control access solely at the index level.
The second data set we will use, dataset B, contains sample order item documents. The number of fields has been reduced significantly, and a sample document can be seen below.
{
  "geoip": {
    "country_iso_code": "GB",
    "location": {
      "lat": 52.4768,
      "lon": -1.9341
    }
  },
  "quantity": 1,
  "created_on": "2016-11-15T12:25:55+00:00",
  "customer_gender": "FEMALE",
  "customer_age": 31,
  "sku": "PI911NA30-C11",
  "customer_id": 46,
  "ip": "6be0f12c7026124f637097b7af98dfe82711e7982648ef5c2f2cf51167ed17d0",
  "user": "81c52b4457b4966544ec582f4e1e6d2e72ec7091ebe68172b2d4dc634998719c",
  "price": 59.99
}
In this document, the two Direct Identifiers, `ip` and `user`, have been pseudonymized. There are also three Indirect Identifiers: gender (`customer_gender`), age (`customer_age`), and location (`geoip.location`). These are all required for analysis in their current form and therefore cannot be replaced with unique identifiers. It is expected that not all users requiring access to the order item documents will need access to these particular, sensitive fields.
Role-Based Access Controls
Data access control in X-Pack is built around users and roles. A role grants a configurable level of access to a specific set of data. This set of data can be defined through a combination of index name patterns and queries. Each user can be assigned any number of roles, and the union of the privileges granted through these roles determines what the user is able to see and do in the Elasticsearch cluster.
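To make the "union of privileges" idea concrete, here is a minimal Python sketch of a simplified role model. It is only an illustration: the `ROLES` table, the role names in it, and the `effective_privileges` helper are hypothetical, and shell-style matching from `fnmatch` merely stands in for index name patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical, simplified role model: index name pattern -> set of privileges.
ROLES = {
    "identity_store_readonly": {"identity_store": {"read"}},
    "identity_store_write": {"identity_store": {"index"}},
    "order_items_read": {"order_items-*": {"read"}},
}


def effective_privileges(user_roles, index_name, roles=ROLES):
    """Return the union of privileges the user's roles grant on one index."""
    granted = set()
    for role_name in user_roles:
        for pattern, privileges in roles.get(role_name, {}).items():
            if fnmatchcase(index_name, pattern):
                granted |= privileges
    return granted


print(effective_privileges(["order_items_read"], "order_items-2018"))
```

A user holding both identity store roles in this model ends up with `read` and `index` on that index, while a user holding only `order_items_read` gets nothing there.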
All commands required to reproduce this example are available in this gist. They can be run through Kibana Console, which is available under the `Dev Tools` application in the Kibana menu.
Securing the identity store
For dataset A discussed above, we need to be able to grant read-only access to the `identity_store` index as a whole. This can be done through the following role definition.
PUT _xpack/security/role/identity_store_readonly
{
  "indices": [
    {
      "names": ["identity_store"],
      "privileges": ["read"]
    }
  ]
}
Here, the pattern specified in the `names` field matches only a single index, and the privileges granted are limited to `read`. We can now assign this role only to users that require this level of access to the most sensitive data, and be sure they cannot tamper with it because their access is read-only.
For users that need to be able to create new documents and update existing documents, we can create a separate role with write and update privileges, but no read access. This is covered by the `index` privilege:
PUT _xpack/security/role/identity_store_write
{
  "indices": [
    {
      "names": ["identity_store"],
      "privileges": ["index"]
    }
  ]
}
A full list of available privileges can be found in the X-Pack security documentation. The gist linked to earlier contains the role definitions as well as example Kibana users linked to these roles.
Securing order item data
The next data we need to secure is dataset B, which contains the order item data. In this example we will assume these are stored in an index per year that starts with the prefix `order_items`, e.g. `order_items-2018`. These indices contain orders from all countries.
Because access to this data set must be granted on a country-by-country basis, granting access to the whole index as in the previous example will not work. Instead, we need to define a role per country and make sure that each role exposes only documents originating in the correct country. We can achieve this through document-level security.
Document-level security allows us to add a query to the role, so that only documents matching the query are visible through the role. We can create a role providing access to order items from France like this:
PUT _xpack/security/role/order_items-fr-rbac-full
{
  "indices": [
    {
      "names": ["order_items-*"],
      "privileges": ["read"],
      "query": {
        "term": {
          "geoip.country_iso_code": "FR"
        }
      }
    }
  ]
}
Here we use a simple term query that matches the country iso code to a fixed value, but much more complex queries can be used, making this a very efficient and flexible feature. In these roles we have specified `names` as `order_items-*`, which will match all yearly indices.
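The effect of the role query can be pictured as a filter applied to every search a user runs. The Python sketch below imitates that with an in-memory list; it is a stand-in for the behavior, not how Elasticsearch evaluates the query, and the helper names and the second sample document are made up for illustration.

```python
def get_path(doc, dotted_path):
    """Resolve a dotted field path such as 'geoip.country_iso_code'."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def visible_documents(docs, field, value):
    """Keep only documents whose field equals the value (a term-query stand-in)."""
    return [doc for doc in docs if get_path(doc, field) == value]


orders = [
    {"sku": "PI911NA30-C11", "geoip": {"country_iso_code": "GB"}},
    {"sku": "HYPOTHETICAL-1", "geoip": {"country_iso_code": "FR"}},
]
french_orders = visible_documents(orders, "geoip.country_iso_code", "FR")
print([doc["sku"] for doc in french_orders])
```

A user holding only the French role would see the second order and nothing else, regardless of the query they submit.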
We have given these roles the suffix `-full` as there is no limitation on the fields that users with these roles can see. When we discussed the requirements around this type of data earlier, we mentioned that some users need access to these documents but should not be able to see the fields that contain Indirect Identifiers. We need to create separate roles for these users, and can use field-level security to achieve our goal.
Field-level security makes it possible to only make certain fields visible through roles. It is also possible to explicitly exclude, rather than include, fields. Roles with field-level security are always read-only, as it would not be safe to make changes to documents where not all existing fields can be seen.
For this example, we will copy and enhance the role just created by granting access to all fields and then excluding the fields we do not want our users to have access to:
PUT _xpack/security/role/order_items-fr-rbac-restricted
{
  "indices": [
    {
      "names": ["order_items-*"],
      "privileges": ["read"],
      "query": {
        "term": {
          "geoip.country_iso_code": "FR"
        }
      },
      "field_security": {
        "grant": ["*"],
        "except": ["geoip.location.*", "customer_gender", "customer_age"]
      }
    }
  ]
}
The suffix `-restricted` has been added to these roles to indicate that not all fields are available. It is worth noting that roles are additive, so if a user has access to all fields in a set of indices through one role, that will override any restrictions on those indices from other roles.
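The grant/except logic and the additive behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `flatten` and `visible_fields` are hypothetical helpers, and `fnmatch` wildcards only approximate the field patterns used in the role.

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field names, e.g. geoip.location.lat."""
    flat = {}
    for name, value in doc.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        else:
            flat[path] = value
    return flat


def visible_fields(doc, grant, except_=()):
    """Return the fields matched by `grant` and not matched by `except_`."""
    return {
        path: value
        for path, value in flatten(doc).items()
        if any(fnmatchcase(path, g) for g in grant)
        and not any(fnmatchcase(path, e) for e in except_)
    }


order = {
    "geoip": {"country_iso_code": "GB", "location": {"lat": 52.4768, "lon": -1.9341}},
    "customer_gender": "FEMALE",
    "customer_age": 31,
    "sku": "PI911NA30-C11",
    "price": 59.99,
}
restricted = visible_fields(
    order, ["*"], ["geoip.location.*", "customer_gender", "customer_age"]
)
full = visible_fields(order, ["*"])
combined = {**restricted, **full}  # roles are additive: the union wins
print(sorted(restricted))
```

Here `restricted` keeps only the country code, SKU, and price, while `combined` equals `full`, which is why a `-full` and a `-restricted` role for the same indices should not be assigned to the same user.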
Assigning roles to users
Every user can be associated with any number of roles, so the pattern of creating one role per level of granularity works. We can assign roles to users as in the example below:
PUT _xpack/security/user/rbac1
{
  "username": "rbac1",
  "password": "testtest",
  "roles": ["kibana_user", "order_items-fr-rbac-restricted", "order_items-gb-rbac-restricted"],
  "full_name": "RBAC 1",
  "email": "rbac1@example.com"
}
This gives the user `rbac1` restricted access to orders originating in France and Great Britain.
As the number of countries in the system grows, so will the number of roles with the current system. This is generally not a problem from a performance perspective, but can eventually result in a very large number of roles, which can become cumbersome to administer and maintain.
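To get a feel for how quickly the role count grows, the following Python sketch generates the full and restricted role bodies for a list of countries, following the naming and structure of the roles above. The `country_roles` helper and the chosen country list are illustrative assumptions; the script only builds the request bodies and does not talk to a cluster.

```python
import json

SENSITIVE_FIELDS = ["geoip.location.*", "customer_gender", "customer_age"]


def country_roles(country_code):
    """Return {role_name: role_body} for one country's full and restricted roles."""
    base = {
        "names": ["order_items-*"],
        "privileges": ["read"],
        "query": {"term": {"geoip.country_iso_code": country_code}},
    }
    # Copy the base entry and add field-level security for the restricted role.
    restricted = dict(base, field_security={"grant": ["*"], "except": SENSITIVE_FIELDS})
    suffix = country_code.lower()
    return {
        f"order_items-{suffix}-rbac-full": {"indices": [base]},
        f"order_items-{suffix}-rbac-restricted": {"indices": [restricted]},
    }


roles = {}
for code in ["FR", "GB", "DE"]:  # example country list
    roles.update(country_roles(code))

print(len(roles))
print(json.dumps(roles["order_items-fr-rbac-restricted"], indent=2))
```

Every additional country adds two more entries to `roles`, each of which has to be created and kept in sync.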
At this point, an option may be to reduce the granularity by managing access by regions instead of individual countries. While this works for a lot of use cases, it is not always possible, and that is when we may start looking into using a form of role-based access controls based on user attributes in order to reduce the number of roles that need to be managed.
Attribute-based access controls
Where standard roles with document-level security usually rely on static queries, attribute-based access controls make use of the fact that users can be assigned metadata and that this metadata can be accessed and used in queries through search templates. Templated queries used with document-level security can be a very flexible and powerful tool.
In this post, we will look at a simple example that shows how we can control the countries a user has access to in a flexible way through user attributes, using a very small number of roles. A separate blog post describes how this works in greater detail and provides a more complex example.
In this example we will create just two roles: one for full access and one for the same restricted access as in the previous example. As we want to drive access based on user metadata, we add a metadata field named `visible_countries` to each user. This contains a list of all the country ISO codes the user is allowed to access. An example can be seen below:
PUT _xpack/security/user/abac1
{
  "username": "abac1",
  "password": "testtest",
  "roles": ["kibana_user", "order_items-abac-restricted"],
  "full_name": "ABAC 1",
  "email": "abac1@example.com",
  "metadata": {
    "visible_countries": ["GB", "FR"]
  }
}
How can we now create a templated query that limits access to data originating in countries based on this? In this case we know that in order to be visible, the `geoip.country_iso_code` field must match one of the countries in the user metadata field. We can express this as follows for the role covering restricted access:
PUT _xpack/security/role/order_items-abac-restricted
{
  "indices": [
    {
      "names": ["order_items-*"],
      "privileges": ["read"],
      "query": {
        "template": {
          "source": "{\"terms\":{\"geoip.country_iso_code\":{{#toJson}}_user.metadata.visible_countries{{/toJson}}}}"
        }
      },
      "field_security": {
        "grant": ["*"],
        "except": ["geoip.location.*", "customer_gender", "customer_age"]
      }
    }
  ]
}
If we separate out and show the structure behind the template string, we get the query below.
{
  "terms": {
    "geoip.country_iso_code": {{#toJson}}_user.metadata.visible_countries{{/toJson}}
  }
}
This is a simple terms query where the list of terms to compare the documents to is extracted and formatted by the `{{#toJson}}_user.metadata.visible_countries{{/toJson}}` part of the template. Any document with the `geoip.country_iso_code` field set to one of these values matches the query and is therefore visible to the user.
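To see what the template resolves to for a given user, the Python sketch below substitutes the `toJson` section by hand. It is not a Mustache engine and does not reflect how Elasticsearch renders templates internally; `render_role_query` is a hypothetical helper that handles only this one placeholder.

```python
import json

TEMPLATE = (
    '{"terms":{"geoip.country_iso_code":'
    "{{#toJson}}_user.metadata.visible_countries{{/toJson}}}}"
)


def render_role_query(template, user):
    """Replace the single toJson section with the user's country list as JSON."""
    placeholder = "{{#toJson}}_user.metadata.visible_countries{{/toJson}}"
    countries = user["metadata"]["visible_countries"]
    return json.loads(template.replace(placeholder, json.dumps(countries)))


abac1 = {"username": "abac1", "metadata": {"visible_countries": ["GB", "FR"]}}
print(json.dumps(render_role_query(TEMPLATE, abac1)))
```

For the user `abac1` defined earlier, the result is a terms query on `geoip.country_iso_code` with the values `GB` and `FR`. Changing the user's metadata changes the rendered query without touching the role.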
Using this method we are now able to control access to countries at a very granular level using only two roles. If we wanted to scale this out to cover all 28 countries in the European Union, we would not need to create any new roles beyond the two we have. The standard role-based approach would, by contrast, require 56 roles to cover the same granularity: a full and a restricted role for each of the 28 countries.
If we wanted to enable support for regions in parallel to individual countries, we could simply add more metadata to users and events. The current template would need to be modified and would end up a bit more complex, but the number of roles would not necessarily need to increase.
Conclusions
In this blog post, we discussed different types of Personal Data and how X-Pack security features can be used to secure it and grant the appropriate level of access to each user. With simple examples we demonstrated how flexible and powerful the role-based access control features within X-Pack are, and how they can be used by both small and large organizations to implement a variety of access policies that fit their needs.