Lotus bugfest - Reindex resolution


Short description of the solution:

In MSK, set log.retention.minutes to 1 and wait until the cluster has applied this configuration. Once the cluster is running with the new configuration, change the setting back to its original value and wait again until it is applied. This step is needed to reset all the topics.
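The retention change can be scripted. The following is a rough sketch only: it assumes the AWS CLI is used instead of the MSK console, and the two ARNs, the <current-version> and the <new-revision> values are placeholders to be filled in by the operator. The script prints the commands and does not run them. The real properties file must contain every property of the current configuration revision, not just the two lines shown here.

```shell
#!/usr/bin/env bash
# Sketch: lower log.retention.minutes to 1, then restore it to 480.
# Assumption: the change is applied with the AWS CLI; both ARNs are placeholders.
set -euo pipefail

CONFIG_ARN="${CONFIG_ARN:-<msk-configuration-arn>}"
CLUSTER_ARN="${CLUSTER_ARN:-<msk-cluster-arn>}"

render_properties() {
  # $1 = retention in minutes. Only the two properties named in these notes
  # are rendered; add the rest of the cluster's properties before real use.
  printf 'auto.create.topics.enable=true\nlog.retention.minutes=%s\n' "$1"
}

print_apply_steps() {
  # Prints (does not execute) the AWS CLI calls for one retention value.
  local minutes="$1"
  local file="retention-${minutes}.properties"
  echo "# contents of ${file}:"
  render_properties "$minutes" | sed 's/^/#   /'
  echo "aws kafka update-configuration --arn ${CONFIG_ARN} --server-properties fileb://${file}"
  echo "aws kafka update-cluster-configuration --cluster-arn ${CLUSTER_ARN} --current-version <current-version> --configuration-info Arn=${CONFIG_ARN},Revision=<new-revision>"
}

print_apply_steps 1     # step 1: short retention, then wait for the cluster update
print_apply_steps 480   # step 2: restore the original value, then wait again
```

Each printed pair of commands corresponds to one "change and wait" round described above; the wait itself still has to be done by watching the cluster state.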

After this, connect to Kafka and fix the number of partitions for the topics: we changed them from num.partitions=1 to num.partitions=50, as described in the module configurations.
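The partition change can be sketched as a small wrapper around kafka-topics.sh. The bootstrap server is the one from the example command on this page. The KAFKA_BIN location and the topic names passed on the command line are assumptions, since the exact topic names are not listed here. By default the script only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: raise the partition count of existing topics to 50.
# Assumptions: KAFKA_BIN points at the Kafka CLI scripts; topic names are
# supplied by the operator.
set -euo pipefail

BOOTSTRAP="${BOOTSTRAP:-b-1.tenant-lotus.km9byk.c1.kafka.us-east-1.amazonaws.com:9092}"
KAFKA_BIN="${KAFKA_BIN:-.}"
PARTITIONS="${PARTITIONS:-50}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

alter_partitions() {
  # Kafka can only increase a topic's partition count, never decrease it.
  local topic
  for topic in "$@"; do
    local cmd=("$KAFKA_BIN/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP"
               --alter --topic "$topic" --partitions "$PARTITIONS")
    if [ "$DRY_RUN" = "1" ]; then
      echo "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

# Run only when topic names are given on the command line.
if [ "$#" -gt 0 ]; then
  alter_partitions "$@"
fi
```

Saved under a hypothetical name such as alter-partitions.sh, it would be run from the jumphost as `DRY_RUN=0 ./alter-partitions.sh <topic> [<topic> ...]` after reviewing the dry-run output.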

It is also a good idea to add resources for mod-search, mod-inventory-storage, and okapi:

  • mod-search 2x400cpu;
  • okapi 3x600cpu;
  • mod-inventory-storage 2x500cpu.


Example of a Kafka command for watching the mod-search consumer group:

watch "./kafka-consumer-groups.sh --bootstrap-server b-1.tenant-lotus.km9byk.c1.kafka.us-east-1.amazonaws.com:9092 --group lbf-mod-search-events-group --describe | grep inventory.instance"


Full description of the situation and solution:

This issue is related to Lotus bugfest and the ticket: asdasd

Before that, the FSE team created the new Kafka cluster tenant-lotus for the bugfest-lotus environment, and the Kitfox team reconfigured the lbf cluster to use the new Kafka cluster.

The problem was that the re-index got stuck at 5M records instead of the 8M expected after the following request:

POST [OKAPI_URL]/search/index/inventory/reindex
x-okapi-tenant: [tenant]
x-okapi-token: [JWT_TOKEN]

{
  "recreateIndex": true,
  "resourceName": "instance"
}
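For reference, the same request could be sent with curl roughly as follows. This is a sketch: the environment variable names are hypothetical, and the Content-Type header is an assumption that is not part of the request shown above.

```shell
#!/usr/bin/env bash
# Sketch: trigger the inventory re-index through Okapi with curl.
# Assumptions: OKAPI_URL, OKAPI_TENANT and OKAPI_TOKEN are set by the operator.
set -euo pipefail

reindex_body() {
  # $1 = recreateIndex (true/false), $2 = resourceName
  printf '{"recreateIndex": %s, "resourceName": "%s"}\n' "$1" "$2"
}

start_reindex() {
  : "${OKAPI_URL:?set OKAPI_URL}"
  : "${OKAPI_TENANT:?set OKAPI_TENANT}"
  : "${OKAPI_TOKEN:?set OKAPI_TOKEN}"
  curl --silent --show-error --fail \
    --request POST "$OKAPI_URL/search/index/inventory/reindex" \
    --header "Content-Type: application/json" \
    --header "x-okapi-tenant: $OKAPI_TENANT" \
    --header "x-okapi-token: $OKAPI_TOKEN" \
    --data "$(reindex_body true instance)"
}

# The request is sent only when the script is called with --run.
if [ "${1:-}" = "--run" ]; then
  start_reindex
fi
```

Without --run the script does nothing, which makes it safe to source and inspect first.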

It ran for more than 8 hours, and after that time only about half of the records had been published.

The investigation showed that the Kafka cluster had the following configuration:

auto.create.topics.enable=true

log.retention.minutes=480

We then checked the Kafka host and found that the topics, which should have been created with the right configuration, had been created with default settings because the cluster has auto.create.topics.enable=true. Instead of num.partitions=50, as described in the module configurations (mod-search, mod-inventory-storage), the topics got the default num.partitions=1. With a single partition per topic, a re-index of more than 8M records cannot be executed successfully.
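A check like this can be sketched with kafka-topics.sh --describe. The sketch assumes that the describe output contains a "PartitionCount:" field on the topic summary line, which should be verified against the Kafka version in use. KAFKA_BIN and the topic names given on the command line are assumptions as well.

```shell
#!/usr/bin/env bash
# Sketch: compare a topic's partition count with the expected value (50).
# Assumptions: KAFKA_BIN points at the Kafka CLI scripts; topic names are
# supplied by the operator.
set -euo pipefail

BOOTSTRAP="${BOOTSTRAP:-b-1.tenant-lotus.km9byk.c1.kafka.us-east-1.amazonaws.com:9092}"
KAFKA_BIN="${KAFKA_BIN:-.}"
EXPECTED="${EXPECTED:-50}"

partition_count() {
  # Reads `kafka-topics.sh --describe` output on stdin and prints the number
  # that follows "PartitionCount:" (with or without a space after the colon).
  awk 'match($0, /PartitionCount: *[0-9]+/) {
         s = substr($0, RSTART, RLENGTH)
         sub(/[^0-9]+/, "", s)
         print s
       }'
}

check_topic() {
  local topic="$1" count
  count="$("$KAFKA_BIN/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP" \
            --describe --topic "$topic" | partition_count)"
  if [ "$count" = "$EXPECTED" ]; then
    echo "$topic: OK ($count partitions)"
  else
    echo "$topic: expected $EXPECTED partitions, found ${count:-none}"
  fi
}

# Run only when topic names are given on the command line.
if [ "$#" -gt 0 ]; then
  for t in "$@"; do
    check_topic "$t"
  done
fi
```

Running it against the auto-created topics before and after the fix gives a quick confirmation that the partition count really changed.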

To fix this, we went to MSK, set log.retention.minutes to 1, and waited until the cluster had applied the configuration. Once the cluster is running with the new configuration, change the setting back to its original value and wait again; this is needed to reset all the topics. Then connect to the jumphost in the target account and manually reconfigure the number of partitions for the affected topics. In addition, change the configuration of mod-search, mod-inventory-storage, and okapi by adding resources and increasing the number of containers, because these modules do not have enough resources to work well:

  • mod-search 2x400cpu;
  • okapi 3x600cpu;
  • mod-inventory-storage 2x500cpu.

If you have any questions about working with Kafka, you can contact Pavel Filippov and Oleksii Kuzminov directly.