ElasticSearch Reindex Performance Recommendations

Purpose

The purpose of the document is to illustrate performance changes that can be implemented to improve reindex performance. A git patch is attached to this page for inspiration for the application changes. The main bottleneck exists at the supported search engine. More efficient processing in supported search engine will decrease reindex duration significantly.

Changes

Routing

Elasticsearch/Opensearch allows routing a document a document to a shard. mod-search has default routing on the tenant id. With a default shard count of 4, this means that all records for a particular tenant will be routed to a single shard out of the four. With Elasticsearch/Opensearch trying to have a balanced cluster, primary shards are placed at least at each data node. This means that only one shard => data node is used during the index process rather than all data nodes. Recommendation is to remove routing by tenant id and just use default routing on _id or some other attribute that will ensure balanced distribution in the Elasticsearch/Opensearch cluster.

GZIP

Testing with a large dataset show potential large documents that need to travel through the network. Since there is still CPU to spare on the mod-search instances and Elasticsearch/Opensearch nodes, it is prudent to employ some compression to minimize space usage.

SMILE

SMILE is a data format that is best expressed as compacted binary JSON. It is one of the formats supported by Elasticsearch's bulk api, other than JSON. Using SMILE in concert with GZIP can be a boon; link.

Unfortunately, there is a bug that has been acknowledged by AWS Support. Essentially, when using AWS OpenSearch, the response returned by the bulk api is malformed which is causing errors in mod-search. The issue does not occur with local instances of OpenSearch or ElasticSearch, only instances in AWS Infrastructure. Due to this, performance gains could not be determined. When the issue is resolved by AWS, this can be revisited. 

Kafka Fetch Size Upon Poll

AWS recommends at least 5MB worth of data passed into the bulk api allow efficient processing of data. mod-search currently send less than 800KB in general. It is such a small size because mod-search is limited to 50 records when retrieving instance Ids from Kafka. The poll record count was reduced to 50 due to a downstream issue of mod-search querying mod-inventory-storage for the latest instance, the query(via CQL) appends IDs to the URI of the HTTP call[MSEARCH-75]. Querying more than 50 records throws an HTTP exception of having an HTTP call with a URI that is too long.

A change was made to increase the poll record count back to its previous value of 200 but query mod-inventory-storage in batches of 50. This resolved the downstream issue while retaining a high poll record count. Code used is in the attached patch.

ES Refresh Interval

Elasticsearch/Opensearch refreshes new indexed documents to be available for search every second. A configuration can delay or disable this refresh process until indexing is done.

ES Replica Count

Elasticsearch/Opensearch replicates changes on a shard for high availability. Disabling this replication can help speed up the indexing process. Replication can be enabled by increasing replica count from 0 to its default of 2.

Performance Results

The following tests were run on a large dataset which has about 9 million instances. The re-index occurring here is only for instance, contributor & subject. Values in red denotes a change from the previous configuration.


Config 1Config 2Config 3Config 4Config 5Config 6Config 7Config 8
AWS ES Instance Class

r6g.large.search

r6g.2xlarge.search

r6g.2xlarge.search

r6g.2xlarge.search

r6g.2xlarge.search

r6g.2xlarge.search

r6g.2xlarge.search

r6g.2xlarge.search

Routing

YesYesYesYesYesYesYesYes
GZIPNoNoNoNoNoNoNoNo
SMILENoNoNoNoNoNoNoNo
mod-search instance count44444888
KAFKA_EVENTS_CONCURRENCY22422244

KAFKA_CONTRIBUTOR_CONCURRENCY

11211122
Kafka fetch size50505050505050200
ES Refresh Interval1s1s1s-1-1-1-1-1
ES Number of replicas22220000
Duration

13 hrs 30 mins

6 hrs 20 mins

6hrs 30 mins

6 hrs2 hrs 30 mins2 hrs1 hr 45 mins1 hr 15 mins
ES CPU Average

99%

50%51%30%34%50%55%58%
mod-search CPU Average12%25%30%25%70%55%64%75%

Recommendations for FOLIO Dev Team

  • Use default routing for indexing and searching. This is allow optimum use of resources.
  • Increase Kafka fetch size to add more data during a single bulk request. Check attached patch for inspiration on fixing URI-Too Long error on mod-inventory-storage.
  • Allow number of shards to be configurable, possibly for each tenant.  The primary shards are 52GB and growing, ideal is 10-30GB.
    • Can be extended to just any index setting possible.
  • Introduce better monitoring that will allow an operator to determine when a reindex is complete.
    • If above is completed, it would be possible to allow configuration that will disable refresh interval and replica count during reindex and re-enable when done.
  • Investigate GZIP + SMILE. If implemented, it should be configurable. Some code is in the attached patch.
  • At the end of a reindex, the indexes storing contributor and subject data experience a lot of document deletions compared to document creation. As a first time load, deletion ought to be minimal. Consider computing interim results in Postgres and then uploading the "final" document into elasticsearch.

    GET /_cat/indices?v&s=store.size:desc
    health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    green  open   mtes_instance_fs00001034         GFpif_XjTh-1PJTZac1_Hw   4   0    8196944         5951     51.4gb         51.4gb
    green  open   mtes_contributor_fs00001034      yFhS12Y4SviL170UFQZZlw   4   0    4739266      1126590     10.3gb         10.3gb
    green  open   mtes_instance_subject_fs00001034 pdD9z85SQZOW8uHMgMXvBw   4   0    3824725     13681561      2.2gb          2.2gb
    green  open   .kibana_1                        47dox0rZTD-8gUtV-xwM1w   1   1          6            1     46.4kb         18.3kb

How to change number_of_replicas and refresh_interval values of ES/OpenSearch

Using cmd and curl utility you should send request to ES/OpenSearch endpoint.

Be sure that you can connect to ES/OpenSearch endpoint.

curl -u "USERNAME:PASSWORD" --location --request GET "https://{ENDPOINT_URL}/_cat/indices" to get all indeces of ES/OpenSearch


UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!

To see the current configuration you should change type of request to GET.

curl -u "USERNAME:PASSWORD"  --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_instance_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
    "index": {
        "number_of_replicas": "2",
        "refresh_interval": "1s"
    }
}'

UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!

To see the current configuration you should change type of request to GET.

curl -u "USERNAME:PASSWORD"  --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_contributor_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
    "index": {
        "number_of_replicas": "2",
        "refresh_interval": "1s"
    }
}'

UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!

To see the current configuration you should change type of request to GET.

curl -u "USERNAME:PASSWORD"  --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_authority_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
    "index": {
        "number_of_replicas": "2",
        "refresh_interval": "1s"
    }
}'

UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!

To see the current configuration you should change type of request to GET.

curl -u "USERNAME:PASSWORD"  --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_instance_subject_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
    "index": {
        "number_of_replicas": "2",
        "refresh_interval": "1s"
    }
}'


After reindex process refresh_interval should be set to 1s, and also for high availability replication can be enabled by increasing replica count from 0 to its default of 2.


ES documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk