MODQM-217 Spike: Determine a plan to address persistent 500 errors upon editing a quickMARC record


Objectives

  • which code change(s) may be a contributing factor
  • how can we prevent 500 errors
  • Kafka
  • what additional Karate tests are needed 
  • how can we anticipate these issues rather than having customers identify them
  • if the issue is related to multi-tenant versus single tenant 
  • if the issue is related to how the data is loaded (via data import or directly to storage)
  • if the issue is related to how the data is updated (via data import, inventory, or MARC authority app UI or directly to storage)  

Analysis

500 error causes

Issue: MODQM-169, MODQM-170
Cause: Kafka topics configuration
Description: Kafka consumers try to connect to existing consumer groups that have different assignment strategies. As a result, mod-quick-marc does not receive any confirmation of a successful or failed update. mod-quick-marc has a 30-second timeout for receiving the confirmation; if the timeout is exceeded, it responds with a 500 error.

Issue: MODQM-191, MODQM-192
Cause: Related modules are down
Description: The update process starts in SRM, which sends a Kafka event to SRS and then to Inventory. If SRS or Inventory is down, the process cannot finish. Consequently, the confirmation timeout in mod-quick-marc is exceeded.

Issue: MODQM-197
Cause: Kafka topic does not exist
Description: Kafka consumers cannot connect and receive messages because the topics do not exist. Consequently, the confirmation timeout in mod-quick-marc is exceeded.

Issue: MODQM-212, MODQM-213, MODSOURMAN-731
Cause: IDs of records are not consistent
Description: When an SRS record is first created, it has two IDs (record ID and source record ID), and they are equal. After the first update, the source record ID changes. mod-quick-marc expects to receive an event containing an ID equal to the record ID, but it receives a different ID because the initial ID was changed. Consequently, the confirmation timeout in mod-quick-marc is exceeded.

Issue: MODQM-218
Cause: Optimistic locking response changed
Description: When mod-quick-marc receives an optimistic locking error from mod-inventory, it expects a JSON structure. The response was changed to a plain string, which causes an error during message processing.

Findings:
  • the issue is NOT related to multi-tenant versus single tenant 
  • the issue is NOT related to how the data is loaded (via data import or directly to storage)
  • the issue is NOT related to how the data is updated (via data import, inventory, or MARC authority app UI)  

The issue could be related to updating data directly in storage, in cases where the inventory-storage record ID is not consistent with, or not linked to, the source-storage record ID.
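Most of the causes above share one failure mode: mod-quick-marc blocks waiting for a confirmation event and converts any missed confirmation (missing topic, consumer-group mismatch, downstream module down, ID mismatch) into a 500. A minimal JDK-only sketch of that pattern; the method and class names are illustrative, not the actual mod-quick-marc code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncToSyncSketch {

    // Simulates waiting for a Kafka confirmation with a timeout.
    // If the confirmation never arrives, the wait times out and the
    // caller maps it to HTTP 500 regardless of the underlying cause.
    static int updateRecord(CompletableFuture<String> confirmation, long timeoutMs) {
        try {
            confirmation.get(timeoutMs, TimeUnit.MILLISECONDS);
            return 202; // confirmation received in time
        } catch (TimeoutException e) {
            return 500; // no confirmation within the timeout
        } catch (Exception e) {
            return 500; // any other failure is also a 500
        }
    }

    public static void main(String[] args) {
        // Confirmation that never completes: timeout, so 500
        System.out.println(updateRecord(new CompletableFuture<>(), 100));
        // Confirmation that arrives immediately: success
        System.out.println(updateRecord(CompletableFuture.completedFuture("ok"), 100));
    }
}
```

This is why every distinct root cause in the table surfaces identically to the user: the timeout is the only error path.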

Main problems

  1. The quickMarc update flow is implemented very similarly to the data-import flow, but it has not been updated with the latest data-import improvements
  2. The quickMarc update flow uses Kafka settings that differ from those of data-import
  3. The async-to-sync approach used in quickMarc has no error handling (any problem results in a 500 error)

Plan

Migrate update flow to data-import

  1. Create default update profiles for each record type (which could be hidden)
  2. Reuse the code base already implemented for derive MARC bib and create MARC holdings for data-import job initialization
  3. Use ReplyingKafkaTemplate instead of the combination of DeferredResult and a cache
    1. Configure the template to receive events from the DI_COMPLETED and DI_ERROR topics
    2. Modify the data-import payload to always populate the correlationId Kafka header if it exists in the initial event
    3. Start the data-import record processing by sending DI_RAW_RECORDS_CHUNK_READ instead of using the POST /jobExecutions/{jobExecutionId}/records endpoint
    4. Specify a timeout for receiving the confirmation (1 min?)
    5. If the timeout is exceeded, use a combination of the GET /metadata-provider/jobLogEntries/{jobExecutionId} and GET /metadata-provider/jobLogEntries/{jobExecutionId}/records/{recordId} endpoints to get the job status and the error message if it failed
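Step 3 could be sketched with Spring Kafka roughly as follows. This is a configuration sketch under assumptions, not the actual mod-quick-marc code: the bean wiring, the consumer group id, and the String payload types are all illustrative (only the topic names come from the plan above).

```java
// Sketch only: bean wiring, group id, and payload types are assumptions.
import java.time.Duration;

import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;

@Configuration
public class QuickMarcKafkaConfig {

  @Bean
  public ReplyingKafkaTemplate<String, String, String> replyingKafkaTemplate(
      ProducerFactory<String, String> pf,
      ConcurrentKafkaListenerContainerFactory<String, String> factory) {
    // Step 3.1: receive confirmation events from both outcome topics.
    ConcurrentMessageListenerContainer<String, String> replies =
        factory.createContainer("DI_COMPLETED", "DI_ERROR");
    replies.getContainerProperties().setGroupId("mod-quick-marc-replies"); // assumed name
    ReplyingKafkaTemplate<String, String, String> template =
        new ReplyingKafkaTemplate<>(pf, replies);
    // Step 3.4: timeout for confirmation receiving.
    template.setDefaultReplyTimeout(Duration.ofMinutes(1));
    return template;
  }
}

// Step 3.3: start processing by publishing the initial event and waiting for
// the correlated reply. ReplyingKafkaTemplate matches request and reply via
// the kafka_correlationId header, which is why step 3.2 requires data-import
// to propagate that header into the DI_COMPLETED / DI_ERROR events.
class QuickMarcUpdateClient {
  private final ReplyingKafkaTemplate<String, String, String> template;

  QuickMarcUpdateClient(ReplyingKafkaTemplate<String, String, String> template) {
    this.template = template;
  }

  RequestReplyFuture<String, String, String> startUpdate(String payload) {
    // On TimeoutException from the returned future, fall back to the
    // jobLogEntries endpoints (step 3.5) to recover the job status.
    return template.sendAndReceive(
        new ProducerRecord<>("DI_RAW_RECORDS_CHUNK_READ", payload));
  }
}
```

The main gain over the DeferredResult-plus-cache approach is that request/reply correlation, timeout handling, and consumer lifecycle are managed by the template rather than hand-rolled.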

Async-to-sync or status ping approach?   

async-to-sync

The async-to-sync approach waits until the update confirmation is received and only then responds to the UI.

status ping

The status ping approach responds to the UI immediately with the status of the process; the UI then polls the status endpoint until the status changes to COMPLETED or ERROR.

All actions in quickMarc should be moved to one of these approaches.
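The status ping approach can be illustrated with a small JDK-only sketch. The endpoint paths in the comments and all class, method, and status names are hypothetical, chosen only to show the shape of the flow:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusPingSketch {
    enum Status { IN_PROGRESS, COMPLETED, ERROR }

    // In-memory stand-in for the status store behind the ping endpoint.
    static final Map<String, Status> statuses = new ConcurrentHashMap<>();

    // Hypothetical "start update" endpoint: kick off the async work and
    // return IN_PROGRESS immediately instead of blocking on Kafka.
    static Status startUpdate(String recordId) {
        statuses.put(recordId, Status.IN_PROGRESS);
        return statuses.get(recordId);
    }

    // Kafka listener side: record the outcome when the confirmation
    // event (DI_COMPLETED / DI_ERROR) eventually arrives.
    static void onConfirmation(String recordId, boolean success) {
        statuses.put(recordId, success ? Status.COMPLETED : Status.ERROR);
    }

    // Hypothetical status endpoint: the UI polls this until the status
    // leaves IN_PROGRESS.
    static Status pollStatus(String recordId) {
        return statuses.getOrDefault(recordId, Status.ERROR);
    }

    public static void main(String[] args) {
        startUpdate("rec-1");
        System.out.println(pollStatus("rec-1")); // IN_PROGRESS
        onConfirmation("rec-1", true);
        System.out.println(pollStatus("rec-1")); // COMPLETED
    }
}
```

Unlike async-to-sync, a slow confirmation here never produces a 500; the UI simply keeps seeing IN_PROGRESS until the outcome is known.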


Pros
  • All processes in quickMarc are consistent and easy to maintain
  • Kafka configuration is inherited from the data-import Kafka configuration
  • All future improvements to data-import will also benefit quickMarc

Cons
  • All future bugs in data-import will also affect quickMarc
  • A little performance degradation

Some features could be easily implemented in the future based on these changes:

  • possibility to use custom data-import profiles
  • MARC records bulk-edit

Prevention plan

  1. Fix Karate tests for mod-quick-marc to make it possible to rely on them. Possible solution: use mod-srs and mod-inventory-storage endpoints to populate records instead of using data-import.
  2. Continuously update Karate, module integration, and unit tests each time a new edge-case bug is found and fixed.