SPIKE: [MODINREACH-80] Record Contribution: Analyze Re-Index job implementation usage in mod-inventory-storage


Overview of re-indexing feature

Inventory storage currently provides a special interface to pull all existing instances (more precisely, instance ids) from its database: /instance-storage/reindex

The interface has three methods: POST to start a re-index job, GET to retrieve a job by its id, and DELETE to cancel a job (the same set of methods the new Iteration interface below provides).

It was initially designed to support a full Instance index rebuild in ElasticSearch, which is why the name is "reindex". At the moment the only known client of these endpoints is mod-search, which initiates the pulling of instance ids by making a call to the POST /instance-storage/reindex endpoint and then consumes REINDEX events from the inventory.instance Kafka topic. A detailed description of this process can be found in the "The event processing on example of mod-search" section of SPIKE: [MODINREACH-78]
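For illustration, a minimal sketch of what starting the current re-index job might look like from a client module (the Okapi URL and tenant are placeholders; the endpoint is assumed to accept a bodiless POST):

// Minimal sketch of triggering the existing re-index job; not mod-search's actual code.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReindexClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://okapi:9130/instance-storage/reindex")) // placeholder URL
            .header("X-Okapi-Tenant", "diku")                              // placeholder tenant
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        // The response is expected to describe the created re-index job, including its id.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}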

Problem statement

Folio is integrating with an external system called INN-Reach, which enables participating libraries to offer materials to other libraries in a consortial borrowing system. Patrons of participating libraries can directly request and borrow materials from other participating libraries through the union catalog. Libraries participating in an INN-Reach consortium first need to contribute records (items, instances) to the central union catalog. This process is called Initial record contribution.

D2IR Record Contribution

The architectural vision of the Record contribution flow to be implemented in Folio can be found on this page: D2IR Record Contribution flow

Similarly to ElasticSearch index re-building, Initial contribution involves all existing instance records. It was proposed to enumerate all instances and items existing in inventory via the REINDEX functionality of mod-inventory-storage. However, there are limitations that prevent using re-indexing as-is for Initial record contribution.

Re-indexing limitations

  1. single topic (inventory.instance) is used for different types of events
    1. both regular change events (CREATE/UPDATE/DELETE) for instance records and re-indexing events are posted to the same topic. This mixes concerns and in turn forces consumers to implement additional filtering to separate the processing of different types of events (see the sketch after this list). The topic can also be flooded with millions of events that are irrelevant to consumers interested only in regular instance record changes.
  2. the current re-indexing interface is client-oriented, meaning it serves only the needs of mod-search
    1. another module cannot call the same interface to initiate instance record iteration, because that would trigger an unwanted index rebuild
    2. the interface name (/reindex) and event type (REINDEX) are purpose-specific
  3. simultaneous execution of several jobs is not possible, because there is no way to distinguish events produced by different jobs
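To illustrate limitation 1, here is a sketch of the kind of type-based filtering a consumer of inventory.instance currently has to implement (the "type" field name follows the domain-event structure described above; Jackson is assumed for JSON parsing):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class InstanceEventFilter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Returns true only for regular record changes, discarding re-indexing traffic. */
    public static boolean isRegularChange(String eventJson) throws Exception {
        JsonNode event = MAPPER.readTree(eventJson);
        // REINDEX events share the topic with CREATE/UPDATE/DELETE events and must be
        // skipped by consumers that only care about regular instance changes.
        return !"REINDEX".equals(event.path("type").asText());
    }
}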

Proposed solution

The proposed solution consists of mandatory changes (phase 1) and optional changes (phase 2); the latter are nice to have but can be postponed to a later date.

The majority of the changes should be made in mod-inventory-storage and include the following:

  • introduce new "Instance Iteration" API interface – (phase 1)
  • introduce new "Iteration" domain event with flexible domain type – (phase 1)

  • rename underlying business service(s), utility class(es), data structure(s) from "Reindex**" to "Iteration**" – (phase 2)

mod-search is expected to eventually use the Iteration API interface instead of the Reindex interface. But to minimize the impact on mod-search and to allow it to migrate gradually to the new interface, it is proposed to keep the existing interface for now and make those changes in phase 2.

Changes in mod-inventory-storage

Introduce new "Instance Iteration" API interface

The interface will provide functionality similar to the current Reindex interface, with minor changes to naming and payloads.

Interface URL:

/instance-storage/instances/iteration

Methods:

POST

/instance-storage/instances/iteration

  •  initiate iteration of instance records
Request schema
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Start new instance iteration job",
    "type": "object",
    "properties":
    {
        "eventType":
        {
            "description": "Type of events to be published",
            "type": "string"
        },
        "topicName":
        {
            "description": "Name of Kafka topic to publish events to",
            "type": "string"
        }
    },
    "additionalProperties": false
}
  • eventType is optional. The default type – "ITERATE" – will be used if omitted
  • topicName defines the name of the Kafka topic the new job will publish iteration events to
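For illustration, a sketch of starting an iteration with a custom event type and topic (the event type "CONTRIBUTE" and the topic name are hypothetical examples, not part of this proposal; the Okapi URL and tenant are placeholders):

// Sketch of a client starting a new iteration job.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IterationClientSketch {
    public static void main(String[] args) throws Exception {
        String body = """
            {
              "eventType": "CONTRIBUTE",
              "topicName": "inventory.instance-contribution"
            }""";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://okapi:9130/instance-storage/instances/iteration"))
            .header("X-Okapi-Tenant", "diku") // placeholder tenant
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        // The response is expected to match the job schema below, including the generated job id.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

Response schema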
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Get job by id response",
    "type": "object",
    "properties":
    {
        "id":
        {
            "description": "Job id",
            "type": "string",
            "$schema": "http://json-schema.org/draft-04/schema#",
            "id": "uuid.schema",
            "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[1-5][a-fA-F0-9]{3}-[89abAB][a-fA-F0-9]{3}-[a-fA-F0-9]{12}$"
        },
        "published":
        {
            "description": "Number of records that was published so far",
            "type": "integer",
            "minimum": 0,
            "default": 0
        },
        "jobStatus":
        {
            "description": "Overall job status",
            "type": "string",
            "enum":
            [
                "In progress",
                "Id publishing failed",
                "Ids published",
                "Pending cancel",
                "Id publishing cancelled"
            ]
        },
        "submittedDate":
        {
            "description": "Timestamp when the job has been submitted",
            "type": "string",
            "format": "date-time"
        }
    },
    "additionalProperties": false
}


GET

/instance-storage/instances/iteration/{jobId}

  • get iteration job by its id
Response schema: identical to the POST response schema above (the iteration job object).


DELETE

/instance-storage/instances/iteration/{jobId}

  • cancel iteration job with specified id
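A sketch of how a client might check a job's progress and cancel it (URL and tenant are placeholders; the status values come from the job schema above):

// Sketch: retrieve an iteration job by id and cancel it on demand.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IterationJobMonitorSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final String BASE =
        "http://okapi:9130/instance-storage/instances/iteration"; // placeholder URL

    static String getJob(String jobId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(BASE + "/" + jobId))
            .header("X-Okapi-Tenant", "diku") // placeholder tenant
            .GET()
            .build();
        // The body matches the job schema: id, published, jobStatus, submittedDate.
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    static void cancelJob(String jobId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(BASE + "/" + jobId))
            .header("X-Okapi-Tenant", "diku") // placeholder tenant
            .DELETE()
            .build();
        // Cancellation is expected to move the job to "Pending cancel" and
        // eventually to "Id publishing cancelled".
        CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
    }
}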


Running multiple iterations

There is no restriction on running multiple iterations simultaneously.

If a client of the interface can handle simultaneous iterations, or needs to be able to trigger them, this is possible. Otherwise, it is up to the client to prevent such cases: by remembering the job id of a running iteration, the client can refuse to start another one, as sketched below.
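For example, a client that wants at most one iteration at a time could remember the id of the job it started and consult the job's status (via GET) before starting another one. A minimal sketch, assuming the client keeps the job id in memory:

// Sketch: allow a new iteration only when the previously started job is no longer active.
public class SingleIterationGuard {
    private String runningJobId; // id of the iteration job this client started, if any

    public synchronized boolean canStartNewIteration(String currentJobStatus) {
        // "In progress" and "Pending cancel" mean the previous job is still active.
        boolean active = runningJobId != null
            && ("In progress".equals(currentJobStatus) || "Pending cancel".equals(currentJobStatus));
        return !active;
    }

    public synchronized void rememberStartedJob(String jobId) {
        this.runningJobId = jobId;
    }
}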

Support new "Iteration" domain event

The iteration job will produce new events with the following content and publish them to Kafka:

Iteration event
{
  "type": <event type>,
  "tenant": <tenant>,
  "jobId": <UUID of job>
}
  • event type – unlike the "Reindex" domain event, the "Iteration" event will have a flexible event type provided by the client of the interface when the iteration is triggered (POST /instance-storage/instances/iteration method). If the event type is not provided in the POST method, the "ITERATE" type will be used as the default value.
  • job id – UUID of the job which produced this event. This new attribute allows the client to verify that an event belongs to the job it started and is interested in; unexpected/unwanted events can be filtered out (see the sketch below).
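A sketch of the client-side check the new job id attribute enables (the "jobId" field name follows the event structure above; Jackson is assumed for JSON parsing):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class IterationEventFilter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Accepts only events produced by the job this client started. */
    public static boolean belongsToMyJob(String eventJson, String myJobId) throws Exception {
        JsonNode event = MAPPER.readTree(eventJson);
        // Events produced by other (e.g. concurrent) iteration jobs are filtered out.
        return myJobId.equals(event.path("jobId").asText());
    }
}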

Alternatively, the job id can be transferred in an event header, and it looks like this is already the case, judging by this code snippet:

REINDEX_JOB_ID_HEADER = "reindex-job-id"

(question) Need to double-check this.

All events related to a particular job should be published to the topic specified in the POST /instance-storage/instances/iteration method payload. This differs from the Reindex job, which always publishes to the inventory.instance topic.

Rename business service(s), utility class(es), data structure(s) from "Reindex**" to "Iteration**"

These changes are not required to support the new Iteration interface and can be postponed.

The list of affected classes:

  • ReindexService
  • ReindexJobRepository

  • ReindexJobRunner
  • ReindexJob (generated from RAML)

Affected table:

  • reindex_job

Migrate to "Instance iteration" API interface

These changes are not required to support the new Iteration interface and can be postponed.

Once the Iteration interface is in place, mod-search can deprecate its usage of the Reindex interface and switch to the Iteration interface. This will mostly affect the following classes:

  • IndexService
  • InstanceStorageClient
  • KafkaMessageListener

List of related user stories (USs)
