Locale changes in Elasticsearch 8.16 and JDK 23


With the upcoming release of JDK 23, there are some significant changes in locale information that will affect Elasticsearch and how you ingest and format datetime data. Firstly, a bit of background.

What is a locale?

Every time a Java program needs to parse or format a date that uses textual strings (for example, ‘Tuesday 16th July’), it needs to consult an internal set of tables containing information on what strings it should use for the day-of-week and month-of-year fields, among others. This information depends on the language that is being used (English, French, Arabic, etc.) and in some cases the specific country or region that is being used.

It’s not just dates that are affected — everything from number formats, calendars, and time formats to the names of every timezone and every other locale is in these tables. In particular, this also includes information used to calculate week-dates - dates counting weeks since the start of the year, rather than calendar months. All this information is packaged up into a locale for that language.

How does Elasticsearch use locale information?

Elasticsearch runs on the JDK. This means we use the locale information that is provided by the JDK. Every time you have a date mapper that parses textual dates, or week-dates, the internal JDK locale tables are used to map those formats to data structures representing the corresponding date information for the locale that you have specified (or the default root locale, if not otherwise specified).

In Java versions 7 and before, the JDK used its own internal locale tables, created by Sun and Oracle, for all locale information used by the JDK. In JDK 8, released in 2014, Oracle added the CLDR locale database provided by the Unicode Consortium alongside the internal JDK database, and in JDK 9 made it the default locale database. There are a significant number of changes between the CLDR database and the original JDK database (henceforth known as the COMPAT database), and so at the time Elasticsearch continued using the COMPAT database to maintain data and index compatibility.

So what is changing?

JDK 23 completely removes the COMPAT database that Elasticsearch is currently using, leaving CLDR as the only option for locale data. This means that Elasticsearch has to switch to the CLDR locale database whenever it runs on JDK 23 or above.

There are two aspects of the locale database that are changing in CLDR - text field values, and week-date calculations.

Firstly, the strings used to represent various text fields in a date are changing for many locales - the differences are minor, but wide ranging. Here are some examples:

| Field | COMPAT | CLDR |
|---|---|---|
| English period-of-day | AM, PM | in the morning, in the afternoon, in the evening |
| English quarter names | 1, 2, 3, 4 | Q1, Q2, Q3, Q4 |
| German short day-of-week names | So, Mo, Di, Mi… | So., Mo., Di., Mi. … |
| French narrow era names | B, A | av. J.-C., ap. J.-C. |
| Portuguese long day-of-week names | Domingo, Segunda-feira, Terça-feira… | domingo, segunda-feira, terça-feira… |

This means that if you are using the date format string EEE d MMM yyyy with the de locale, on JDK 22 this would accept the text Mi 4 Dez 2024; on JDK 23 it would only accept Mi. 4 Dez. 2024 (note the extra dots).
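To see which of the two spellings your own JDK accepts, a small standalone check along the following lines may help. It is only a sketch: the class and method names are made up for illustration, and it exercises the plain JDK rather than Elasticsearch itself. Keep in mind that a standalone program on JDK 9 to 22 uses CLDR by default, so it needs to be started with `-Djava.locale.providers=COMPAT` to reproduce the old behavior.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper class, not part of Elasticsearch.
class GermanDateCheck {
    // The format string and locale from the example above.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("EEE d MMM yyyy", Locale.GERMAN);

    /** Returns true if the running JDK's locale data accepts the given text. */
    static boolean accepts(String text) {
        try {
            LocalDate.parse(text, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        // What this JDK produces for Wednesday 4 December 2024.
        System.out.println("Formats as: " + FORMAT.format(LocalDate.of(2024, 12, 4)));
        // Which of the two input spellings this JDK accepts.
        System.out.println("Accepts 'Mi 4 Dez 2024': " + accepts("Mi 4 Dez 2024"));
        System.out.println("Accepts 'Mi. 4 Dez. 2024': " + accepts("Mi. 4 Dez. 2024"));
    }
}
```

Running the same file on the old and the new JDK side by side gives a quick view of what changes for this one format.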

Secondly, the underlying data used to calculate week dates is changing. These are dates, usually of the form 2024-W34-2, counting the number of weeks since the start of the year, rather than calendar months and days. But years don’t normally start on the first day of the week; if the 1st January is a Friday, is that the first week of that year, or part of the last week of the previous year? In order to know this, the locale provides information on how many days need to be in a week for it to count as a week, and which day is the start of a new week.

In COMPAT, these values vary by locale: the first day of the week is generally either Sunday or Monday, and the minimum number of days in a week is either 1 or 4. In CLDR, every locale instead uses Sunday as the first day of the week and a minimum of 1 day in a week. This applies to all custom date formats using the Y, W, or w specifiers.

The built-in week formats (week_date, weekyear_week_date, etc.) have always used, and will continue to use, the ISO week-date definition of first day of week Monday, 4 days minimum in a week, regardless of the underlying locale database and JDK version.
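The effect of the two parameters can be illustrated with `java.time.temporal.WeekFields`, which lets you state the first day of the week and the minimum days explicitly instead of taking them from a locale. The sketch below (class and method names are illustrative only) applies the ISO rules and the "Sunday, 1 day" rules to a 1st January that falls on a Friday, the case described above.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.WeekFields;
import java.util.Locale;

// Hypothetical helper class for illustration.
class WeekRules {
    /** Renders the week-year and week number of a date under the given rules. */
    static String weekOf(LocalDate date, WeekFields rules) {
        return String.format(Locale.ROOT, "%d-W%02d",
            date.get(rules.weekBasedYear()),
            date.get(rules.weekOfWeekBasedYear()));
    }

    public static void main(String[] args) {
        LocalDate newYear = LocalDate.of(2021, 1, 1); // a Friday

        // Monday first, 4 days minimum: only 3 days of this week fall in 2021,
        // so it still counts as the last week of 2020.
        System.out.println(weekOf(newYear, WeekFields.ISO)); // 2020-W53

        // Sunday first, 1 day minimum: any week containing 1st January is week 1.
        System.out.println(weekOf(newYear, WeekFields.of(DayOfWeek.SUNDAY, 1))); // 2021-W01
    }
}
```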

What does this mean for me as an Elasticsearch user?

This affects you if you use custom date formatters using textual or week-date field specifiers. Otherwise, you are not affected. Elasticsearch from v8.15.2 will log deprecation warnings, visible in Kibana, if you are using date format specifiers that might change on upgrading to JDK 23.

Elasticsearch will continue to be shipped with JDK 22 for all remaining v7.17.x and v8.15.x releases and will use the COMPAT locale database. Versions of Elasticsearch from v7.17.25 and v8.15.2 will support running on JDK 23 as a custom JDK, and will use the CLDR database when they do.

If you run Elasticsearch versions v7.17.24 or v8.15.1, or earlier, on JDK 23 or above, it will have no locale information at all. Elasticsearch will try to load the COMPAT database, which does not exist on JDK 23, and it will then default to the root locale only (which is basic English). This is likely to lead to some odd behavior, especially if you use non-English locales.

Starting with Elasticsearch version 8.16.0, Elasticsearch will be shipped with JDK 23 and use the CLDR locale database by default. This means that if you ingest or output dates using textual strings, the exact strings that are used and accepted by Elasticsearch could change. If you ingest or output data using custom week dates, the week dates are likely to change. Not only does this affect data ingested now, but it could also affect data that has already been ingested into Elasticsearch on a previous JDK version.

To reduce the impact of the most wide-ranging change, which is to the root locale, in v8.16.0 the default locale of date fields and date processors will change from the root locale to en. The en locale is identical between COMPAT and CLDR apart from long era names and quarter names.

If you do not want to adapt to this change now, you can continue to run any version of Elasticsearch v7 or v8 on JDK 22 or below, and Elasticsearch will use the COMPAT locale database present in those versions.

Starting with Elasticsearch v9, Elasticsearch will use the CLDR locale database regardless of the JDK version it is running on.

Note that once JDK 23 is released, JDK 22 will become unsupported by Oracle, and any future bugs and CVEs will not be fixed on that version. JDK 21 is the current long-term support version of Java, and all v7 and v8 versions of Elasticsearch will use the COMPAT database if run on JDK 21.

To use a custom JDK with Elasticsearch, follow these instructions. Note that this is not possible when running from a prebuilt docker image, or on Elastic Cloud.

How do I handle changes to strings?

Firstly, test Elasticsearch on JDK 23 with your input data to check if you are actually affected by this. This change can cause Elasticsearch to reject previously valid date fields as invalid data. This is most likely if you have custom date formats using B, G, E, O, L, M, Q, Z, a, c, e, q, v, or z field specifiers. If one of these specifiers is used with the COMPAT locale database, Elasticsearch v8.15.2 and above will write the message "Date format [<format>] contains textual field specifiers that could change in JDK 23" to the Elasticsearch log, and will return it as a warning header on relevant queries.
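If you keep your custom formats in configuration or source control, a rough scan for these specifiers can help you shortlist candidates before testing. The sketch below is not the check Elasticsearch performs; it is an approximation based on the letters listed above, with two assumptions of its own: text inside single quotes is treated as a literal, and L, M, Q, q, c, and e are only flagged when repeated three or more times, since the `DateTimeFormatter` documentation describes shorter runs of those letters as numeric. It is meant for custom pattern strings only, not for the names of built-in formats.

```java
// Hypothetical helper class; an approximation, not Elasticsearch's own detection logic.
class TextualSpecifierCheck {
    // Letters from the list above that are flagged at any length (deliberately conservative).
    private static final String ALWAYS_FLAGGED = "BGEOZavz";
    // Letters that are numeric below three repetitions and textual from three upwards.
    private static final String TEXT_WHEN_LONG = "LMQqce";

    /** Returns true if a custom pattern appears to contain a textual field specifier. */
    static boolean mightBeAffected(String pattern) {
        boolean inLiteral = false;
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c == '\'') {          // single quotes toggle literal text
                inLiteral = !inLiteral;
                i++;
                continue;
            }
            int run = 1;              // length of the run of identical characters
            while (i + run < pattern.length() && pattern.charAt(i + run) == c) {
                run++;
            }
            if (!inLiteral) {
                if (ALWAYS_FLAGGED.indexOf(c) >= 0) return true;
                if (TEXT_WHEN_LONG.indexOf(c) >= 0 && run >= 3) return true;
            }
            i += run;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(mightBeAffected("EEE d MMM yyyy"));        // true
        System.out.println(mightBeAffected("yyyy-MM-dd'T'HH:mm:ss")); // false
    }
}
```

Treat a positive result as a prompt to test that format on JDK 23, not as proof that its behavior changes.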

If you are affected, you can choose to run Elasticsearch on JDK 22 and below for the remainder of the v8 releases. Or you can modify your input data to account for the differences in strings — this can be done as part of an ingest pipeline or by modifying your data at source before it gets to Elasticsearch.

To determine the new strings that are accepted for your particular date formatter, you can create a DateTimeFormatter with your custom date format in a standalone Java project running on JDK 23, and test what it outputs for various ZonedDateTime objects, or use the Calendar.getDisplayNames method to get all the accepted strings for a particular locale.
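As a starting point for the `Calendar.getDisplayNames` approach, the following sketch prints the short day-of-week and month names that the running JDK holds for the German locale. The class name, the `names` helper, and the choice of `Locale.GERMAN` are illustrative assumptions; swap in the locale and fields that your formats rely on.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration.
class AcceptedNames {
    /** Returns the display names of a Calendar field, ordered by field value. */
    static Map<Integer, String> names(int field, int style, Locale locale) {
        Map<String, Integer> raw =
            Calendar.getInstance(locale).getDisplayNames(field, style, locale);
        Map<Integer, String> byValue = new TreeMap<>();
        if (raw != null) {            // null means no names exist for this field and style
            raw.forEach((name, value) -> byValue.put(value, name));
        }
        return byValue;
    }

    public static void main(String[] args) {
        Locale locale = Locale.GERMAN;
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Short days:   "
            + names(Calendar.DAY_OF_WEEK, Calendar.SHORT_FORMAT, locale).values());
        System.out.println("Short months: "
            + names(Calendar.MONTH, Calendar.SHORT_FORMAT, locale).values());
    }
}
```

Comparing the output from JDK 22 started with `-Djava.locale.providers=COMPAT` against the output from JDK 23 shows which strings differ for that locale.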

If you are affected by string format changes, you also need to handle reindexing existing data using the old strings — you will need to specify a script during reindexing to change the old strings into new ones, something like the following:

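// Reindex script: rewrite old strings into the ones the new locale data accepts.
// The replacements are placeholders; substitute the strings for your own locale and format.
String updateDate(String date) {
  return date
    .replace("Monday", "Mon")
    .replace("Tuesday", "Tue")
    .replace("Wednesday", "Wed")
    .replace("Thursday", "Thu")
    .replace("Friday", "Fri");
}
ctx._source.my_date_field = updateDate(ctx._source.my_date_field);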

Unfortunately, how you handle this change depends on your exact situation and which date formats you are using.

How do I handle changes to week dates?

If you are using custom week formats, with the Y, W, or w specifiers, the dates those formats produce could change. You will need to change to use one of the built-in formats that use the ISO week-date definition, modify your dates on ingest, output, and reindexing using custom scripts as above, or adapt your integration code to calculate week dates in the same way as the CLDR database (Sunday first day of week, 1 day minimum in a week).
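If you choose to convert week dates in your own integration code, one way to do it is to resolve the week date to a calendar date under the old rules and then re-express that date under the new ones. The sketch below does this with explicit `WeekFields` instances so that it does not depend on the locale database of the JDK it runs on. The class, the method names, and the `YYYY-Www-d` output layout are assumptions for illustration; the old rules are taken to be ISO here, so replace them with whatever your locale used under COMPAT.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.WeekFields;
import java.util.Locale;

// Hypothetical helper class for illustration.
class WeekDateConverter {
    /**
     * Resolves a week-year, week number, and day-of-week to a calendar date.
     * dayOfWeek is 1-7, counted from the first day of the week in the given rules.
     */
    static LocalDate toDate(int weekYear, int week, int dayOfWeek, WeekFields rules) {
        return LocalDate.of(weekYear, 7, 1)   // mid-year date, always inside weekYear
            .with(rules.weekOfWeekBasedYear(), week)
            .with(rules.dayOfWeek(), dayOfWeek);
    }

    /** Renders a calendar date as a week date under the given rules. */
    static String toWeekDate(LocalDate date, WeekFields rules) {
        return String.format(Locale.ROOT, "%04d-W%02d-%d",
            date.get(rules.weekBasedYear()),
            date.get(rules.weekOfWeekBasedYear()),
            date.get(rules.dayOfWeek()));
    }

    public static void main(String[] args) {
        WeekFields oldRules = WeekFields.ISO;                        // Monday, 4 days
        WeekFields newRules = WeekFields.of(DayOfWeek.SUNDAY, 1);    // Sunday, 1 day

        LocalDate date = toDate(2022, 52, 7, oldRules);
        System.out.println(date);                       // 2023-01-01
        System.out.println(toWeekDate(date, newRules)); // 2023-W01-1
    }
}
```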

In particular, if you are using the Y specifier as part of a calendar date format, you are probably using it erroneously. Joda-Time uses Y to represent year-of-era, but the JDK uses Y to represent week-years. You need to modify your format to use y instead, or change to a built-in format.
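The difference between the two letters only shows up around the turn of the year, which is why the mistake often goes unnoticed. A minimal sketch of the effect (the class name is illustrative, and `Locale.US` is chosen only to make the week rules explicit):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class for illustration.
class WeekYearPitfall {
    static String format(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2024, 12, 30); // a Monday

        // y is the calendar year.
        System.out.println(format("yyyy-MM-dd", date)); // 2024-12-30

        // Y is the week-year: this date already belongs to week 1 of 2025.
        System.out.println(format("YYYY-MM-dd", date)); // 2025-12-30
    }
}
```

For most of the year both patterns print the same text, so a test suite needs dates close to 1st January to catch a stray Y.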

To reiterate:

  • Elasticsearch versions 7.17.24, 8.15.1, and earlier will not have access to any locale data if they are run on JDK 23 or above.

  • All remaining 7.17.x and 8.15.x patch releases will continue to ship with JDK 22 and use the COMPAT locale database by default. They will support running on JDK 23, and will use the CLDR locale database in that situation.

  • 8.16.0 and above will ship with JDK 23 and will use the CLDR locale database by default. If Elasticsearch versions 8.16.0 and above are run on JDK 22 or below, they will use the COMPAT locale database instead.

  • 8.16.0 will change the default locale of date fields and date processors to en.

  • Elasticsearch v9, when it is released, will use the CLDR locale database regardless of the JDK version it runs on.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.