Overview of re-indexing feature
Inventory storage currently provides a special interface to pull all existing instances, or more precisely their instance ids, from its database: /instance-storage/reindex
The interface has three methods:
- POST /instance-storage/reindex – start pulling instance records; the pulling is handled by a dedicated job whose id is returned in the response
- GET /instance-storage/reindex/{id} – obtain job details by id
- DELETE /instance-storage/reindex/{id} – cancel the specified job by id
It was initially designed to support a full rebuild of the Instance index in ElasticSearch, which is why it is named "reindex". At the moment the only known client of these endpoints is mod-search, which initiates the pulling of instance ids by calling the POST /instance-storage/reindex endpoint and then consumes REINDEX events from the inventory.instance Kafka topic. A detailed description of this process can be found in the section "The event processing on example of mod-search" of SPIKE: [MODINREACH-78]
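For illustration, a minimal sketch of how a client might drive this flow over HTTP (the Okapi host, tenant and token handling are placeholders or omitted; this is not the actual mod-search implementation):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReindexClientSketch {

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // Start a reindex job; the response contains the job, including its id
    HttpRequest start = HttpRequest.newBuilder()
        .uri(URI.create("http://okapi:9130/instance-storage/reindex")) // placeholder Okapi URL
        .header("X-Okapi-Tenant", "diku")                              // placeholder tenant
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    String job = client.send(start, HttpResponse.BodyHandlers.ofString()).body();
    System.out.println("Started reindex job: " + job);

    // The job id extracted from the response can then be used to poll
    // GET /instance-storage/reindex/{id} for progress, or to cancel via
    // DELETE /instance-storage/reindex/{id}; the instance ids themselves
    // arrive as REINDEX events on the inventory.instance Kafka topic.
  }
}
```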
Problem statement
Folio is integrating with an external system called INN-Reach, which enables participating libraries to offer materials to other libraries in a consortial borrowing system. Patrons of participating libraries can directly request and borrow materials from other participating libraries through the union catalog. Libraries participating in an INN-Reach consortium first need to contribute records (items, instances) to the central union catalog. This process is called initial record contribution.
D2IR Record Contribution
The architectural vision of the record contribution flow to be implemented in Folio can be found on this page: D2IR Record Contribution flow
Similarly to ElasticSearch index re-building, initial contribution involves all existing instance records. It was proposed to enumerate all instances and items existing in inventory via the REINDEX functionality of mod-inventory-storage. However, several limitations prevent re-indexing from being used as-is for initial record contribution.
Re-indexing limitations
- a single topic (inventory.instance) is used for different types of events
  - both regular changes (CREATE/UPDATE/DELETE) to instance records and re-indexing events are posted to the same topic. This mixes concerns and forces additional filtering to be implemented to separate the processing of different event types. The topic can also be flooded with millions of events that are irrelevant to consumers interested only in regular instance record changes
- the current re-indexing interface is client-oriented, meaning it serves only the needs of mod-search
  - another module cannot call the same interface to initiate instance record iteration, because doing so would trigger unwanted index re-building
- the interface name (/reindex) and the event type (REINDEX) are purpose-specific
- simultaneous execution of several jobs is not possible because there is no way to distinguish events produced by different jobs
Proposed solution
The proposed solution consists of mandatory changes (phase 1) and optional changes (phase 2); the latter are nice to have and can be postponed to a later date.
The majority of the changes should be made in mod-inventory-storage and include the following:
- introduce a new "Instance Iteration" API interface – (phase 1)
- introduce a new "Iteration" domain event with a flexible event type – (phase 1)
- rename the underlying business service(s), utility class(es), and data structure(s) from "Reindex**" to "Iteration**" – (phase 2)
mod-search is expected to eventually use the Iteration API interface instead of the Reindex interface. However, to minimize the impact on mod-search and to allow it to migrate gradually to the new interface, it is proposed to keep the existing interface for now and make the mod-search changes in phase 2.
Changes in mod-inventory-storage
Introduce new "Instance Iteration" API interface
The interface will provide functionality similar to what the Reindex interface provides at the moment, with minor changes to naming and payloads.
Interface URL:
/instance-storage/instances/iteration
Methods:
| Method | URL | Purpose | Schemas |
|---|---|---|---|
| POST | /instance-storage/instances/iteration | Initiate iteration of instance records | Request schema, Response schema |
| GET | /instance-storage/instances/iteration/{jobId} | Get iteration job by its id | Response schema |
| DELETE | /instance-storage/instances/iteration/{jobId} | Cancel iteration job with the specified id | |
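A minimal usage sketch, assuming the POST request schema carries the event type and target topic described below (the field names eventType and topicName, the topic, and the Okapi/tenant values are illustrative, not taken from the actual schema):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IterationClientSketch {

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // Hypothetical request body: event type and target topic for the iteration job
    String body = """
        {
          "eventType": "CONTRIBUTE",
          "topicName": "inventory.instance-contribution"
        }""";

    HttpRequest start = HttpRequest.newBuilder()
        .uri(URI.create("http://okapi:9130/instance-storage/instances/iteration"))
        .header("X-Okapi-Tenant", "diku")             // placeholder tenant
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    String job = client.send(start, HttpResponse.BodyHandlers.ofString()).body();

    // The returned job id is then used with
    // GET /instance-storage/instances/iteration/{jobId} to track progress and
    // DELETE /instance-storage/instances/iteration/{jobId} to cancel the job.
    System.out.println(job);
  }
}
```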
Running multiple iterations
There is no restriction on running multiple iterations.
If a client of the interface can handle simultaneous iterations, or needs to be able to trigger them, this is possible. Otherwise it is up to the client to prevent such cases: knowing the job id of a running process, the client can refrain from starting another one.
Support new "Iteration" domain event
An iteration job will produce and publish to Kafka new events with the following content:
{ "type": <event type>, "tenant": <tenant>, "jobId": <UUID of job> }
- event type – unlike the "Reindex" domain event, the "Iteration" event will have a flexible event type provided by the client of the interface when triggering the iteration (POST /instance-storage/instances/iteration). If no event type is provided in the POST request, the "ITERATE" type will be used as the default value.
- job id – UUID of the job that produced the event. This new attribute allows the client to verify that an event belongs to the job it started and is interested in; unexpected/unwanted events can be filtered out.
Alternatively, the job id can be transferred in an event header, and it appears this is already the case according to this code snippet:
REINDEX_JOB_ID_HEADER = "reindex-job-id"
This needs to be double-checked.
All events related to a particular job should be published to the topic specified in the POST /instance-storage/instances/iteration payload. This differs from the Reindex job, which always publishes to the inventory.instance topic.
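A sketch of how a consumer might filter iteration events by job id, assuming the job id is carried in an event header as well as in the payload (the header name, topic, consumer group and bootstrap servers are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class IterationEventFilterSketch {

  private static final String JOB_ID_HEADER = "iteration-job-id"; // hypothetical header name

  public static void main(String[] args) {
    String expectedJobId = args[0]; // id returned by POST /instance-storage/instances/iteration

    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka:9092");  // placeholder
    props.put("group.id", "initial-contribution"); // placeholder consumer group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("inventory.instance-contribution")); // placeholder topic

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          Header header = record.headers().lastHeader(JOB_ID_HEADER);
          String jobId = header == null ? null : new String(header.value());

          // Skip events produced by other (possibly concurrent) iteration jobs;
          // the payload { "type": ..., "tenant": ..., "jobId": ... } could be
          // parsed and checked instead if only the body carries the job id.
          if (!expectedJobId.equals(jobId)) {
            continue;
          }
          System.out.println(record.value()); // process the iteration event
        }
      }
    }
  }
}
```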
Rename business service(s), utility class(es), data structure(s) from "Reindex**" to "Iteration**"
These changes are not required to support the new Iteration interface and can be postponed.
The list of affected classes:
- ReindexService
- ReindexJobRepository
- ReindexJobRunner
- ReindexJob (generated from RAML)
Affected table:
reindex_job
Changes in mod-search
Migrate to "Instance iteration" API interface
These changes are not required to support the new Iteration interface and can be postponed.
Once the Iteration interface is in place, mod-search can deprecate its usage of the Reindex interface and switch to the Iteration interface. This will mostly affect the following classes:
- IndexService
- InstanceStorageClient
- KafkaMessageListener
3 Comments
Marc Johnson
I was asked to provide my thoughts on this by Dmytro Tkachenko
I support the idea of using the same mechanism for both Elastic Search based search and Inn Reach synchronisation (given that we've chosen to use general messages for this purpose already and so they intentionally use the same messages).
One of the side effects of the re-index API being implemented in mod-inventory-storage was that it leaked the search domain into that module. I think it would be preferable not to do that, which this approach would achieve.
How would mod-search be transitioned from the old API to the new API?
Alternative Approach
If I step back from this proposal slightly, I'd like to propose an alternative approach to this challenge.
The approach taken for the search-to-inventory integration, which this proposal generalises, introduces some trade-offs:
An alternative approach could be to make the Kafka topics that mod-inventory-storage publishes to retain the messages forever.
This would allow any client to re-process those messages whenever it needs to, without involving mod-inventory-storage and without affecting any other module's synchronisation process.
Given that the messages published by mod-inventory-storage are snapshots of the whole state of the record (rather than specific events), this might be made more space- and processing-efficient by using log compaction, so that only the latest snapshot for each record is kept.
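For reference, keeping messages and compacting them to the latest snapshot per record key is plain Kafka topic configuration; a minimal sketch using the Kafka AdminClient (topic name, partition/replication counts and connection settings are placeholders):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSketch {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka:9092"); // placeholder

    try (AdminClient admin = AdminClient.create(props)) {
      // With log compaction, Kafka keeps at least the latest value per record key
      // indefinitely, so a client can re-read the current state of every record
      // at any time without asking mod-inventory-storage to republish it.
      NewTopic topic = new NewTopic("inventory.instance", 50, (short) 3) // placeholder sizing
          .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

      admin.createTopics(List.of(topic)).all().get();
    }
  }
}
```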
I believe this approach alleviates the 3 limitations of the re-index API stated in the proposal:
Raman Auramau
Marc Johnson - thank you for this comment, it's a fresh idea which seems to be viable for our general database change subscription case. A few additional questions:
I now feel this is a very promising option that requires additional investigation and, probably, a POC. What are your thoughts? What is your vision on the priority of this analysis and POC?
Meanwhile, I think we need to allow the Volaris team to go ahead now with the described proposal in order not to block the delivery of the business feature. This proposal follows the already existing approach and requires the minimum possible effort to implement the requested behavior.
Dmytro Tkachenko Mikhail Fokanov Brooks Travis for overall awareness.
Marc Johnson
Raman Auramau Thank you for responding to my feedback
Yes, we might need a way to initially populate a topic. I would think that would be an internal mechanism triggered upon a module upgrade.
My suggestion was based upon the current contents of the topics. I believe you and Mikhail Fokanov have outlined a standard for topic structures in a separate document.
What aspect of performance are you referring to?
My suggestion was offered specifically because Dmytro Tkachenko asked for my feedback, in the context of this proposal and work, not as a general endeavour or a PoC.
I believe the decision you shared (quoted below) to go ahead with the original proposal is, in effect, a decision not to prioritise exploring this suggestion.