
MODINREACH-80

Overview of re-indexing feature

Inventory storage currently provides a special interface to pull all existing instances, more precisely instance ids, from its database: /instance-storage/reindex

The interface has three methods:

  • POST /instance-storage/reindex – start a new re-index job
  • GET /instance-storage/reindex/{id} – get a re-index job by its id
  • DELETE /instance-storage/reindex/{id} – cancel a re-index job

It was initially designed to support a full rebuild of the Instance index in Elasticsearch, which is why the name is "reindex". At the moment the only known client of these endpoints is mod-search, which initiates the pulling of instance ids by making a call to the POST /instance-storage/reindex endpoint and then consumes REINDEX events from the inventory.instance Kafka topic. A detailed description of this process can be found in The event processing on example of mod-search section of SPIKE: [MODINREACH-78]
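For illustration only, a minimal sketch of how a client such as mod-search could trigger re-indexing over HTTP. The host, port, and Okapi tenant value are illustrative assumptions, an empty request body is assumed, and error handling and the remaining Okapi headers are omitted:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReindexTriggerSketch {
  public static void main(String[] args) throws Exception {
    // Start a new re-index job; mod-inventory-storage then publishes REINDEX events
    // to the inventory.instance topic for the existing instance records.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://mod-inventory-storage:8081/instance-storage/reindex"))
        .header("X-Okapi-Tenant", "diku")           // illustrative tenant
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}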

Problem statement

Folio is integrating with an external system called INN-Reach, which enables participating libraries to offer materials to other libraries in a consortial borrowing system. Patrons of participating libraries can directly request and borrow materials from other participating libraries through the union catalog. Libraries participating in an INN-Reach consortium first need to contribute records (items, instances) to the central union catalog. This process is called Initial record contribution.

D2IR Record Contribution

The architectural vision of the Record contribution flow to be implemented in Folio can be found on this page: D2IR Record Contribution flow

Similar to Elasticsearch index re-building, Initial contribution involves all existing instance records. It was proposed to enumerate all instances and items existing in inventory via the REINDEX functionality of mod-inventory-storage. However, there are some limitations that prevent using re-indexing as-is for Initial record contribution.

Re-indexing limitations

  1. a single topic (inventory.instance) is used for different types of events
    1. both regular changes (like CREATE/UPDATE/DELETE) to instance records and re-indexing events are posted into the same topic. This mixes concerns, which in turn requires additional filtering to be implemented to separate the processing of different types of events. The topic can also be overloaded with millions of events that are not relevant to consumers who are only interested in regular instance record changes.
  2. the current re-indexing interface is client oriented, meaning it serves only the purpose of mod-search
    1. another module cannot call the same interface and initiate instance record re-iteration because it would cause unwanted index re-building
    2. the interface name (/reindex) and event type (REINDEX) are purpose specific
  3. simultaneous execution of several jobs is not possible because there is no way to distinguish events produced by different jobs

Proposed solution

The proposed solution consists of mandatory changes (phase 1) and optional changes (phase 2); the latter are nice to have but can be postponed to a later date.

The majority of changes should be done in mod-inventory-storage and include the following:

  • introduce new "Instance Iteration" API interface – (phase 1)
  • introduce new "Iteration" domain event with flexible domain type – (phase 1)

  • rename underlying business service(s), utility class(es), data structure(s) from "Reindex**" to "Iteration**" – (phase 2)

mod-search is supposed to eventually use the Iteration API interface instead of the Reindex interface. But to minimize the impact on mod-search and to allow it to gradually migrate to the new interface, it is proposed to keep the existing interface for now and make the changes in phase 2.

Changes in mod-inventory-storage

Introduce new "Instance Iteration" API interface

The interface will provide functionality similar to what the Reindex interface provides at the moment, with minor changes to naming and payloads.

Interface URL:

/instance-storage/instances/iteration

Methods:

POST

/instance-storage/instances/iteration

  • initiate iteration of instance records
Request schema:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Start new instance iteration job",
    "type": "object",
    "properties":
    {
        "eventType":
        {
            "description": "Type of events to be published",
            "type": "string"
        },
        "topicName":
        {
            "description": "Name of Kafka topic to publish events to",
            "type": "string"
        }
    },
    "additionalProperties": false
}
  • eventType is optional. The default type – "ITERATE" – will be used if omitted
  • topicName defines the name of the Kafka topic where Iteration events will be published by the new job

Response schema:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Get job by id response",
    "type": "object",
    "properties":
    {
        "id":
        {
            "description": "Job id",
            "type": "string",
            "$schema": "http://json-schema.org/draft-04/schema#",
            "id": "uuid.schema",
            "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[1-5][a-fA-F0-9]{3}-[89abAB][a-fA-F0-9]{3}-[a-fA-F0-9]{12}$"
        },
        "published":
        {
            "description": "Number of records that was published so far",
            "type": "integer",
            "minimum": 0,
            "default": 0
        },
        "jobStatus":
        {
            "description": "Overall job status",
            "type": "string",
            "enum":
            [
                "In progress",
                "Id publishing failed",
                "Ids published",
                "Pending cancel",
                "Id publishing cancelled"
            ]
        },
        "submittedDate":
        {
            "description": "Timestamp when the job has been submitted",
            "type": "string",
            "format": "date-time"
        }
    },
    "additionalProperties": false
}
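For illustration only, a sketch of starting an iteration job with the request payload described above and reading the job id from a response following this schema. The host, tenant, and topic name are illustrative assumptions, and Jackson is used for parsing:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class StartIterationSketch {
  public static void main(String[] args) throws Exception {
    // Request body following the request schema above; the topic name is illustrative.
    String body = "{ \"eventType\": \"ITERATE\", \"topicName\": \"inventory.iteration\" }";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://mod-inventory-storage:8081/instance-storage/instances/iteration"))
        .header("X-Okapi-Tenant", "diku")           // illustrative tenant
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // The response is expected to follow the job schema above (id, published, jobStatus, submittedDate).
    JsonNode job = new ObjectMapper().readTree(response.body());
    System.out.println("Started iteration job " + job.path("id").asText());
  }
}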


GET

/instance-storage/instances/iteration/{jobId}

  • get iteration job by its id
Response schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Get job by id response",
  "type": "object",
  "properties": {
    "id": {
      "description": "Job id",
      "type": "string",
      "$schema": "http://json-schema.org/draft-04/schema#",
      "id": "uuid.schema",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[1-5][a-fA-F0-9]{3}-[89abAB][a-fA-F0-9]{3}-[a-fA-F0-9]{12}$"
    },
    "published": {
      "description": "Number of records that was published so far",
      "type": "integer",
      "minimum": 0,
      "default": 0
    },
    "jobStatus": {
      "description": "Overall job status",
      "type": "string",
      "enum": [
        "In progress",
        "Id publishing failed",
        "Ids published",
        "Pending cancel",
        "Id publishing cancelled"
      ]
    },
    "submittedDate": {
      "description": "Timestamp when the job has been submitted",
      "type": "string",
      "format": "date-time"
    }
  },
  "additionalProperties": false
}
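A possible way to monitor a running job with the GET method above, sketched with illustrative host and tenant values; the terminal statuses are taken from the jobStatus enum in the schema:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class IterationJobMonitorSketch {

  // Statuses after which the job will not make further progress.
  private static final Set<String> TERMINAL_STATUSES =
      Set.of("Ids published", "Id publishing failed", "Id publishing cancelled");

  public static void main(String[] args) throws Exception {
    String jobId = args[0]; // id returned by the POST method
    HttpClient client = HttpClient.newHttpClient();
    ObjectMapper mapper = new ObjectMapper();

    while (true) {
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create("http://mod-inventory-storage:8081/instance-storage/instances/iteration/" + jobId))
          .header("X-Okapi-Tenant", "diku") // illustrative tenant
          .GET()
          .build();

      JsonNode job = mapper.readTree(
          client.send(request, HttpResponse.BodyHandlers.ofString()).body());

      System.out.println(job.path("jobStatus").asText() + ", published so far: " + job.path("published").asLong());

      if (TERMINAL_STATUSES.contains(job.path("jobStatus").asText())) {
        break;
      }
      Thread.sleep(5_000); // poll every 5 seconds
    }
  }
}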


DELETE

/instance-storage/instances/iteration/{jobId}

  • cancel iteration job with specified id


Running multiple iterations

There is no restriction on running multiple iterations.

If a client of the interface can handle, or needs to be able to trigger, simultaneous iterations, then it is possible. Otherwise it is up to the client to prevent such cases: by knowing the job id of a running iteration, the client can refrain from starting another one.
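A possible client-side guard, sketched as a fragment under the assumption that the client persists the id of the job it last started; getIterationJob and startIteration are hypothetical helpers wrapping the GET and POST methods above:

// Fragment: refuse to start a new iteration while the previously started job is still running.
// getIterationJob and startIteration are hypothetical wrappers around GET/POST .../instances/iteration.
public String startIterationIfNotRunning(String lastKnownJobId) throws Exception {
  if (lastKnownJobId != null) {
    JsonNode job = getIterationJob(lastKnownJobId);
    if ("In progress".equals(job.path("jobStatus").asText())) {
      throw new IllegalStateException("Iteration job " + lastKnownJobId + " is still in progress");
    }
  }
  JsonNode newJob = startIteration();
  return newJob.path("id").asText(); // persist this id to guard the next call
}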

Support new "Iteration" domain event

An iteration job will produce and publish new events to Kafka with the following content:

Iteration event
{
  "type": <event type>,
  "tenant": <tenant>,
  "jobId": <UUID of job>
}
  • event type – unlike the "Reindex" domain event, the "Iteration" event will have a flexible event type provided by the client of the interface when triggering the iteration (POST /instance-storage/instances/iteration method). If the event type is not provided in the POST method, the "ITERATE" type will be used as the default value.
  • job id – UUID of the job which produced this event. This new attribute will allow the client to verify that an event belongs to the job it has started and is interested in; unexpected/unwanted events can be filtered out.

Alternatively, the job id can be transferred in an event header, and it looks like this is already the case, according to this code snippet:

REINDEX_JOB_ID_HEADER = "reindex-job-id"

Need to double check this.

All events related to a particular job should be published to the topic specified in the POST /instance-storage/instances/iteration payload. This is different from the Reindex job, which always publishes to the inventory.instance topic.
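For illustration only, a sketch of how a consumer (e.g. mod-inn-reach) could read Iteration events from the topic it passed in the POST payload and keep only the events produced by its own job. The bootstrap servers, group id, and topic name are illustrative assumptions, and the instance id is assumed to be carried in the record key, as for other inventory domain events:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IterationEventConsumerSketch {
  public static void main(String[] args) throws Exception {
    String expectedJobId = args[0]; // id returned by POST /instance-storage/instances/iteration

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // illustrative
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "mod-inn-reach-iteration");   // illustrative
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    ObjectMapper mapper = new ObjectMapper();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("inventory.iteration"));                   // topic passed in the POST payload
      while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
          JsonNode event = mapper.readTree(rec.value());
          // Keep only events that belong to the job this client started; drop everything else.
          if (expectedJobId.equals(event.path("jobId").asText())) {
            String instanceId = rec.key(); // assumption: instance id is the record key
            System.out.println("Process instance " + instanceId);
          }
        }
      }
    }
  }
}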

Rename business service(s), utility class(es), data structure(s) from "Reindex**" to "Iteration**"

These changes are not required to support the new Iteration interface and can be postponed.

The list of affected classes:

  • ReindexService
  • ReindexJobRepository

  • ReindexJobRunner
  • ReindexJob (generated from RAML)

Affected table:

  • reindex_job

Migrate to "Instance iteration" API interface

These changes are not required to support the new Iteration interface and can be postponed.

Once the Iteration interface is in place, mod-search can deprecate usage of the Reindex interface and switch to the Iteration interface. This will mostly affect the following classes:

  • IndexService
  • InstanceStorageClient
  • KafkaMessageListener

List of related USs



3 Comments

  1. I was asked to provide my thoughts on this by Dmytro Tkachenko


    I support the idea of using the same mechanism for both Elasticsearch-based search and INN-Reach synchronisation (given that we've chosen to use general messages for this purpose already, and so they intentionally use the same messages).

    One of the side effects of the re-index API being implemented in mod-inventory-storage was that it leaked the search domain into that module. I think it would be preferable not to do that, which this approach would achieve.

    How would mod-search be transitioned from the old API to the new API?

    Alternative Approach

    If I step back from this proposal slightly, I'd like to propose an alternative approach to this challenge.

    The approach taken for the search-to-inventory integration, which this approach generalises, introduces some trade-offs:

    • The consuming module must be aware of the publishing module, in the sense that it relies on an API that it provides
      • This creates a coupling beyond the Kafka topic name and message contracts
      • This also means that only a single module can be the producer
    • The re-indexing / iteration process places load upon mod-inventory-storage
    • The consuming module needs two mechanisms for synchronising its state (via Kafka and via the iteration API)


    An alternative approach could be to make the Kafka topics that mod-inventory-storage publishes to retain the messages forever.

    This would allow any client to re-process those messages whenever it needs to, without involving mod-inventory-storage and without affecting any other module's synchronisation process.

    Given that the messages published by mod-inventory-storage are snapshots of the whole state of the record (rather than specific events), this might be made more space- and processing-efficient by using log compaction, so that only the latest snapshot for each record is kept.

    I believe this approach alleviates the 3 limitations of the re-index API stated in the proposal:

    • single topic - in this approach the existing topic separation is preserved (and no special message types are needed)
    • interface is client oriented - no interface is needed, this approach is as client-ignorant as the originally published messages
    • simultaneous execution - Kafka naturally supports the simultaneous independent consumption of topics, no specific mechanism is needed
    1. Marc Johnson - thank you for this comment; it's a fresh idea which seems to be viable for our general database change subscription case. A few additional questions:

      • Do you think a mechanism for re-iteration / re-load is still required (e.g. for initially filling the topic, or as part of a failure recovery process)?
      • What should be stored in such topic - all records for instances and items separately, or as a kind of generalized view (1 instance + all related holdings + all related items), or just inventory record IDs?
      • Since log compaction is an internal Kafka process which can happen somewhen, do we have any specific expectations regarding performance?
      • Probably, more questions will appear during deeper analysis.

      I feel now this is a very promising option requiring additional investigation and, probably, a POC. What are your thoughts? What is your vision on the priority of this analysis and POC?

      Meanwhile, I think we need to allow the Volaris team to go ahead now with the described proposal in order not to block the delivery of the business feature. This proposal follows the already existing approach and needs the minimum possible effort to implement the requested behavior.

      Dmytro Tkachenko Mikhail Fokanov Brooks Travis for overall awareness.

  2. Raman Auramau Thank you for responding to my feedback

     * Do you think a mechanism for re-iteration / re-load is still required (e.g. for initially filling the topic, or as part of a failure recovery process)?

    Yes, we might need a way to initially populate a topic. I would think that would be an internal mechanism triggered upon a module upgrade.


    * What should be stored in such topic - all records for instances and items separately, or as a kind of generalized view (1 instance + all related holdings + all related items), or just inventory record IDs?

    My suggestion was based upon the current contents of the topics. I believe you and Mikhail Fokanov have outlined a standard for topic structures in a separate document.


    * Since log compaction is an internal Kafka process which can happen "somewhen", do we have any specific expectations regarding performance?

    What aspect of performance are you referring to?

    I feel now this is a very promising option requiring additional investigation and, probably, a POC. What are your thoughts? What is your vision on the priority of this analysis and POC?

    My suggestion was offered specifically because Dmytro Tkachenko  asked for my feedback, in the context of this proposal and work, not as a general endeavour or a PoC.

    I believe the decision you shared (quoted below) to go ahead with the original proposal is, in effect, a decision not to prioritise exploring this suggestion.

    Meanwhile, I think we need to allow the Volaris team to go ahead now with the described proposal in order not to block the delivery of the business feature. This proposal follows the already existing approach and needs the minimum possible effort to implement the requested behavior.