Technical approach for updating MARC Bib fields controlled by related Authority records

Overview

The problem can be stated concisely as: "Find a solution to update MARC Bib fields controlled by related Authority records, without significant impact on performance."

High performance is considered an Architecturally Significant Requirement for this technical approach.

The solution touches the following modules: mod-source-record-storage, mod-inventory-storage, mod-entities-links.

Related Jira issue: ARCH-23





Current Data Flow Diagram


Solutions

Option 1

Brief overview

Start the links update upon completion of the authorities import (at the mod-inventory-storage level).

Pros / Cons

  • source-record-storage
    • Cons: extra logic and computational effort to consume and handle messages from the entities-links module one by one (per authority), update the corresponding bib records, and notify inventory storage about instance updates
  • entities-links
    • Cons: needs to determine and filter authority records where the specific fields (1XX or 010 $a) are not updated, to exclude them from the further process flow [NEGATIVE IMPACT: excessive computations on the mod-entities-links side]
  • inventory-storage
    • Pros: proceeds with one generic message on authority update for both mod-search and mod-entities-links; entity (not link) consistency between the SRS and inventory modules
  • Others / All
    • Pros: the import process can be divided into two phases - entities update (Phase 1) and links update (Phase 2). The links update can thus be considered a separate process, with no impact on Phase 1 (the actual import of entities)

Sequence Diagrams

High Level Flow

Low Level Design


Message Bodies

inventory.authority

{
  "old": {...},   // the entity state before the update or delete
  "new": {...},   // the entity state after the update or create
  "type": "UPDATE|DELETE|CREATE|DELETE_ALL", // type of the event
  "tenant": "diku",       // tenant id
  "ts": "1664881538352"   // timestamp of the event
}

links.authority.instance

{
  "jobId": "...",        // TBD: field to track the context within which the update of linked bib records is performed
  "authorityId": "...",  // updated authority id
  "instanceIds": [],     // string list of instance ids linked to the updated authority
  "tenant": "diku",      // tenant id
  "updatedAuthorityFields": {
    "<Tag>": {           // e.g. "100"
      "subfields": [
        {
          "a": "Mozart, Wolfgang Amadeus,"
        },
        {
          "d": "1756-1791."
        }
        ...
      ]
    },
    ...
  }
}

srs.authority.bib

{
  "old": {...},      // the entity state before the update
  "new": {...},      // the entity state after the update
  "type": "UPDATE",  // type of the event
  "tenant": "diku",  // tenant id
  "jobId": "...",    // TBD: field to track the context within which the update of linked bib records is performed
  "ts": "1664881538352" // timestamp of the event
}

inventory.authority.bib.stats

{
  "jobId": "...",     // TBD: field to track the context within which the update of linked bib records is performed
  "tenant": "diku",   // tenant id
  "success": {        // instance ids that were successfully updated
    "instanceIds": [] // string list of ids
  },
  "fail": {
    "instanceIdsCause": [ // list of failed instance ids mapped to the failure cause
      {
        "id": "...",      // id of the instance that failed to update
        "cause": "..."    // optional field that indicates the cause of the failure
      },
      ...
    ]
  }
}

Implied Changes per module

  • mod-entities-links (2 sprints)
    • consuming and processing authority messages coming from mod-inventory-storage
    • preparing and sending messages to mod-srs with the authority id, linked instance ids, and the actual changes made within the authority specific fields (1XX or 010 $a) - see the listener sketch after this list
  • mod-srs (1 sprint)
    • consume and process messages from mod-entities-links (links.authority.instance)
    • prepare and send a message to inventory for each updated bib record (batching may be considered)
  • mod-inventory (1-2 sprints)
    • consume and process messages from mod-srs (srs.authority.bib), then update the corresponding instances in mod-inventory-storage
    • send statistics on completion of each instance update operation (success or failure) (inventory.authority.bib.stats messages)
  • Reporting mechanism (TBD how it is represented) - part of mod-entities-links
    • consume and handle messages (inventory.authority.bib.stats)
  • mod-quick-marc - no changes
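
A minimal sketch of the mod-entities-links consumer/producer described above, assuming Spring Kafka (as used by Spring-based FOLIO modules). The class, service, and consumer group names (AuthorityEventListener, InstanceLinkService, mod-entities-links-authority-group) are illustrative assumptions, not the actual implementation:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.List;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Hypothetical lookup of instance ids linked to an authority (not the real mod-entities-links API)
interface InstanceLinkService {
  List<String> findLinkedInstanceIds(String authorityId);
}

@Component
public class AuthorityEventListener {

  private final InstanceLinkService linkService;
  private final KafkaTemplate<String, String> kafka; // producer for links.authority.instance
  private final ObjectMapper mapper = new ObjectMapper();

  public AuthorityEventListener(InstanceLinkService linkService, KafkaTemplate<String, String> kafka) {
    this.linkService = linkService;
    this.kafka = kafka;
  }

  @KafkaListener(topics = "inventory.authority", groupId = "mod-entities-links-authority-group")
  public void handleAuthorityEvent(String payload) throws JsonProcessingException {
    JsonNode event = mapper.readTree(payload);
    if (!"UPDATE".equals(event.path("type").asText())) {
      return; // only authority updates can change controlled bib fields
    }
    // Filter out authorities where neither 1XX nor 010 $a changed (see the Cons above)
    ObjectNode changedFields = diffControlledFields(event.path("old"), event.path("new"));
    if (changedFields.isEmpty()) {
      return;
    }
    String authorityId = event.path("new").path("id").asText();
    List<String> instanceIds = linkService.findLinkedInstanceIds(authorityId);

    ObjectNode message = mapper.createObjectNode();
    message.put("authorityId", authorityId);
    message.set("instanceIds", mapper.valueToTree(instanceIds));
    message.set("updatedAuthorityFields", changedFields);
    message.put("tenant", event.path("tenant").asText());
    kafka.send("links.authority.instance", authorityId, message.toString());
  }

  private ObjectNode diffControlledFields(JsonNode oldState, JsonNode newState) {
    // Placeholder: real logic would compare 1XX / 010 $a using the authority mapping rules
    return mapper.createObjectNode();
  }
}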

Current specifics

  • Since the authority update event published from mod-inventory-storage is currently consumed by mod-search only, the approach of having two consumer groups for the same event needs to be thought through (mod-entities-links being the second consumer here) - see the sketch after this list
  • mod-source-record-storage has to handle extra messages about controlled bib record updates coming from mod-entities-links (increased resource consumption)
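
A minimal illustration of the two-consumer-group approach mentioned above, assuming plain Kafka consumers; the group ids are assumptions. Because each module subscribes with its own group.id, Kafka delivers every inventory.authority event to both mod-search and mod-entities-links independently:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AuthorityTopicConsumers {

  // Builds a consumer with its own consumer group; each group receives the full event stream
  static KafkaConsumer<String, String> consumerFor(String groupId) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(List.of("inventory.authority"));
    return consumer;
  }

  public static void main(String[] args) {
    // Assumed group names: both groups independently read the same inventory.authority topic
    KafkaConsumer<String, String> searchConsumer = consumerFor("mod-search-authority-group");
    KafkaConsumer<String, String> linksConsumer = consumerFor("mod-entities-links-authority-group");
    searchConsumer.poll(Duration.ofMillis(100));
    linksConsumer.poll(Duration.ofMillis(100));
  }
}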

Concerns

  • It has not been specified yet how to track the context within which the update of links is performed. One option is to pass a job id as a parameter identifying the run within which the update of authorities and controlled bibs is done
  • Communication between mod-entities-links and mod-srm for obtaining mapping rules is not defined (presumably mapping rules should be fetched per each new import job). Authority mapping rules (as they are not allowed to change at runtime) are fetched once and stored in the mod-entities-links cache, with a check for each incoming authority with action UPDATE; if the mapping rules are not in the cache, they are pulled from mod-srm
  • A decision on reusing existing events versus implementing new events for controlled MARC bib record updates has not been made (a mini POC would be useful)
  • The reporting mechanism is described vaguely, without specific details
  • In case of updating a large number of bib records (e.g. an edge case of 40,000 authorities with 2,500 controlled bib records each = 100M bib records to update), with the existing baseline architecture completing the entire update in a reasonable time is not feasible, because of excessive communication between mod-source-record-storage and mod-inventory-storage when bib records are updated one by one. A possible optimization is to process in batches and put a concise change delta-set into a compressed Kafka message (see the producer sketch after this list)
  • mod-entities-links has not passed TC yet
  • According to the design, mod-entities-links is aware of mod-srm (for obtaining mapping rules) and mod-srs (for obtaining the authority entity)
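
A minimal sketch of the batching/compression optimization from the last concern above; the payload shape is illustrative and the topic name is reused from Option 1, both are assumptions rather than a final design:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchedDeltaProducer {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    // Compress messages so a concise delta-set covering many bib records stays small on the wire
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
    props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // One message carries the change delta for a whole batch of linked bib records,
      // instead of one message per bib record (payload shape is illustrative only)
      String deltaSet = "{\"authorityId\":\"<id>\",\"instanceIds\":[\"<id1>\",\"<id2>\"],"
          + "\"updatedAuthorityFields\":{\"100\":{\"subfields\":[{\"a\":\"...\"}]}}}";
      producer.send(new ProducerRecord<>("links.authority.instance", "<authority-id>", deltaSet));
    }
  }
}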


Reporting

The reporting should be implemented using the job table, which is described in Linking MARC bib fields to MARC authority headings/references.

Phase 2. Calculation of the anticipated time for updating bibs

Calculation of the count of affected bibs

The list of changed authority ids should be retrieved from mod-srm. Then the count of affected bibs should be calculated via a REST call to mod-entities-links with the list of authority_ids. In mod-entities-links there should be an SQL query to the job table where authority_id is in the list of changed authorities (a hedged sketch is given below).
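
A hedged sketch of that calculation inside mod-entities-links, assuming Spring's JdbcTemplate; the table name (job_link) and column names are assumptions for illustration, not the actual schema:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class AffectedBibsCounter {

  private final JdbcTemplate jdbc;

  public AffectedBibsCounter(JdbcTemplate jdbc) {
    this.jdbc = jdbc;
  }

  // Counts distinct linked bib/instance records for the changed authorities.
  // Table and column names are illustrative assumptions, not the actual schema.
  public long countAffectedBibs(List<String> changedAuthorityIds) {
    if (changedAuthorityIds.isEmpty()) {
      return 0L;
    }
    String placeholders = String.join(",", changedAuthorityIds.stream().map(id -> "?").toList());
    String sql = "SELECT COUNT(DISTINCT bib_record_id) FROM job_link "
        + "WHERE authority_id IN (" + placeholders + ")";
    Long count = jdbc.queryForObject(sql, Long.class, changedAuthorityIds.toArray());
    return count == null ? 0L : count;
  }
}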

Prediction of the time

While the job execution is in progress, the elapsed time (the difference between the current time and the job start time) divided by the number of processed records gives the average time per record; multiplied by the total number of records, it gives the anticipated overall duration (a sketch of the calculation is given below).
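
A minimal sketch of that estimate (simple linear extrapolation; the class and method names are illustrative):

import java.time.Duration;
import java.time.Instant;

public final class JobTimeEstimator {

  // Linear extrapolation: elapsed / processed gives the average time per record;
  // multiplied by the total record count it gives the anticipated overall duration.
  public static Duration estimateTotalDuration(Instant jobStarted, Instant now,
                                               long processedRecords, long totalRecords) {
    if (processedRecords <= 0) {
      return Duration.ZERO; // nothing processed yet, no meaningful prediction
    }
    Duration elapsed = Duration.between(jobStarted, now);
    long estimatedMillis = elapsed.toMillis() * totalRecords / processedRecords;
    return Duration.ofMillis(estimatedMillis);
  }

  public static Duration estimateRemaining(Instant jobStarted, Instant now,
                                           long processedRecords, long totalRecords) {
    Duration total = estimateTotalDuration(jobStarted, now, processedRecords, totalRecords);
    return total.minus(Duration.between(jobStarted, now));
  }
}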

Defined Flows

Authority records

8. Update MARC Authority - Technical Designs and Decisions - FOLIO Wiki

9. Delete MARC Authority - Technical Designs and Decisions - FOLIO Wiki

6. Create MARC Authority - Technical Designs and Decisions - FOLIO Wiki

Update Marc bib records

5. Update MARC_BIB (Overlay) - Technical Designs and Decisions - FOLIO Wiki

2. Update Marc Bibs, update Instance, Holdings, Items - Technical Designs and Decisions - FOLIO Wiki

Delivery Plan

  • Phase 1 - implementation aligned with existing FOLIO standards (messaging) and the baseline architecture. This introduces performance trade-offs (especially in the communication between mod-srs and mod-inventory on updating each single controlled bib record). Initial reporting is included. Initially, time anticipation is done by directing users to the Confluence page with measurements for X authorities and the update of their linked bibs [expected by Orchid]
  • Phase 2 - a preprocessing component for counting the affected linked bib records and the anticipated time to perform the overall update. Start of performance optimizations [expected by Poppy]
  • Phase 3 - performance optimizations (including consideration of baseline architecture changes to meet the expectation of updating up to 100M bib records in a reasonable time) [expected by Queen Anne's Lace]

LOE

With Option 1, Phase 1 rough estimates for the update logic are:

  • mod-inventory (2 sprints - low-level details are still unclear and will be discovered during implementation, which could lead to extra clarification sessions to rectify decisions - #1)
  • entities-links (2 sprints - #1)
  • mod-source-record-storage (2 sprints - #1)
  • evaluate and assess the impact on performance, with accompanying tunings/fixes (1 sprint)
  • embed evaluation and report generation logic on top of the solution (1 or 2 sprints - depends on the readiness and maturity of the current reporting feature)

3 - 5 Sprints


Phase 3. Performance optimization approach

Assumptions

During the research, the following performance concerns were identified:

  • Triggering a REST API call and publishing a single Kafka event for each record update. This can lead to massive redundancy in Kafka event production and thus to extra latency [NEGATIVE PERFORMANCE IMPACT]
  • Updating ES indexes is not done in bulk (see the Elasticsearch bulk API for optimizations; a sketch is given after this list) [NEGATIVE PERFORMANCE IMPACT]
  • Lack of performance testing - hard to predict actual system capability
  • Scalability of components is not well defined (or unknown yet)
  • The mod-entities-links module is not yet part of the data flow
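
For the Elasticsearch point above, a minimal illustration of the _bulk API: several index operations are sent in one request instead of one request per document (the index name, ids, and document bodies are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsBulkIndexExample {

  public static void main(String[] args) throws Exception {
    // Two index operations in a single _bulk request instead of two separate index calls;
    // the index name and document bodies are illustrative only.
    String ndjson =
        "{\"index\":{\"_index\":\"instance\",\"_id\":\"inst-1\"}}\n"
      + "{\"title\":\"Updated instance 1\"}\n"
      + "{\"index\":{\"_index\":\"instance\",\"_id\":\"inst-2\"}}\n"
      + "{\"title\":\"Updated instance 2\"}\n";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/_bulk"))
        .header("Content-Type", "application/x-ndjson")
        .POST(HttpRequest.BodyPublishers.ofString(ndjson))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // per-item results; errors are reported individually
  }
}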

Implementation

When the "Phase 1" is implemented, the performance tests of simultaneous execution of data-import and changing of a linked authority should be conducted. The results should show the degradation of the data-omport for such case. It worth to be mentioned, that the parallel business action that is executed in the same modules (e.g. SRS and mod-inventory-storage) and in the same database most likely would affect the performance of data-import.

However, while optimizing MARC authority linking, the team can also improve data import performance (as a side effect), so the overall performance could be retained.

In order to optimize writes to the database, and hence increase the speed of Kafka message consumption, batch processing of messages should be implemented. In this case the writes to the DB should also be done in batches (e.g. in one database transaction). This should be implemented once for both the data-import bib records flow and the bib records flow triggered by authority changes. The same approach is implemented in mod-search and could be leveraged (a sketch is given below).
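
A minimal sketch of that batching idea, assuming Spring Kafka batch listening and a single-transaction batch write; the class name, group id, table name, and payload mapping are illustrative assumptions, not the actual SRS or mod-inventory-storage code:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class LinkedBibBatchConsumer {

  private final JdbcTemplate jdbc;

  public LinkedBibBatchConsumer(JdbcTemplate jdbc) {
    this.jdbc = jdbc;
  }

  // Requires a batch-enabled listener container (e.g. spring.kafka.listener.type=batch);
  // the whole batch of messages is written to the database in one transaction
  @KafkaListener(topics = "links.authority.instance", groupId = "srs-linked-bib-group")
  @Transactional
  public void handleBatch(List<String> payloads) {
    List<Object[]> rows = payloads.stream()
        .map(LinkedBibBatchConsumer::toRow) // illustrative mapping from payload to (content, record id)
        .toList();
    jdbc.batchUpdate("UPDATE marc_records SET content = ?::jsonb WHERE id = ?::uuid", rows);
  }

  private static Object[] toRow(String payload) {
    // Placeholder: real code would parse the delta message and build the updated record content
    return new Object[] {payload, "00000000-0000-0000-0000-000000000000"};
  }
}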

If these changes do not fix all the performance problems, the number of instances of SRS and mod-inventory-storage should be adjusted, and additional resources provided for the database (e.g. vertical scaling), according to the performance test results.

In mod-inventory-storage, this method should be changed to use one database transaction: https://github.com/folio-org/mod-inventory-storage/blob/13127d8ed0f5bc07c4441327301354bb8839d9a1/src/main/java/org/folio/rest/impl/InstanceStorageBatchAPI.java#L129

Exception Handling for bulk operations

If a batch database update/insert fails, the messages should be processed one by one (e.g. one message in one database transaction). Such cases should obviously be logged (see the fallback sketch below).
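
A sketch of the fallback described above, assuming a batch writer that can fail as a whole and a per-message retry path (all names are illustrative):

import java.util.List;
import java.util.function.Consumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BatchWithFallback {

  private static final Logger log = LoggerFactory.getLogger(BatchWithFallback.class);

  // Tries the whole batch in one transaction; on failure falls back to processing
  // the messages one by one so a single bad record does not block the rest of the batch.
  public static <T> void process(List<T> batch, Consumer<List<T>> batchWriter, Consumer<T> singleWriter) {
    try {
      batchWriter.accept(batch);
    } catch (RuntimeException batchFailure) {
      log.warn("Batch update failed ({}), falling back to per-message processing", batchFailure.getMessage());
      for (T message : batch) {
        try {
          singleWriter.accept(message);
        } catch (RuntimeException messageFailure) {
          log.error("Failed to process message {}", message, messageFailure);
        }
      }
    }
  }
}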

Measurements [WIP]

PTF - Data Import Reports - Folio Development Teams - FOLIO Wiki

TBD

! Need to develop a performance tool for creating the authority update data set.

! Need to specify the reporting mechanism for updating links per authority, to display 1) final reports and 2) current progress.

! Need to define a calculation mechanism for the preprocessing phase, to be able to show users the number of bib records affected and the anticipated time for the whole authorities update operation.



Rationale

At the PO and dev team (Spitfire) meeting it was decided to proceed with the solution from Option 1. It looks reasonable with a two-phase update (1st - authority update, 2nd - links update), as the solution does not impact current data import performance; the links update is built on top and can be considered a separate subtask that is measured and accounted for differently from the import of entities (authorities update).