Data consistency options

Related JIRA issue: ARCH-8

Problem statement

In the current implementation of business transactions, FOLIO does not provide a roll-back mechanism. This leads to an inconsistent state of the system if any participant of a business transaction fails to do its job.

Example:

Data import

According to the data import flow diagram, the following entities are modified during the business transaction:

  1. source record in mod-source-record-storage
  2. instance in mod-inventory-storage
  3. (optional) item in mod-inventory-storage
  4. (optional) holding in mod-inventory-storage

If any of steps 2-4 fails, the source record remains changed, leaving the system in an inconsistent state. If the failure happens at step 3 or 4, steps 1 and 2 already need to be rolled back to restore consistency.
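A minimal sketch of the failure mode (the client interfaces and method names below are hypothetical, for illustration only): each storage call commits independently in its own module, so a failure at a later step leaves the earlier writes in place.

```java
// Hypothetical client interfaces -- each call is an independent HTTP request
// that commits immediately in the target module's database.
interface SrsClient { void saveSourceRecord(String recordJson); }
interface InventoryClient {
    void saveInstance(String instanceJson);
    void saveHolding(String holdingJson);
}

class ImportFlow {
    private final SrsClient srs;
    private final InventoryClient inventory;

    ImportFlow(SrsClient srs, InventoryClient inventory) {
        this.srs = srs;
        this.inventory = inventory;
    }

    void importRecord(String record, String instance, String holding) {
        srs.saveSourceRecord(record);      // step 1: persisted immediately
        inventory.saveInstance(instance);  // step 2: persisted immediately
        // If this call fails, steps 1 and 2 stay committed and nothing
        // reverts them -- the system is left inconsistent.
        inventory.saveHolding(holding);
    }
}
```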

The same issues exist in other business processes such as circulation (Vladimir Shalaev to add a link to a page with the currently discovered data consistency issues).

Possible solutions

Distributed transactions

All participants of a business process are configured to use a distributed transaction controller to atomically either commit or roll back the transaction.

This option is not considered viable since it requires the following:

  1. all interactions are synchronous
  2. all participants must be available during the transaction

In practice, distributed transactions can only be implemented in homogeneous environments (i.e. all databases are of the same type).

This solution decreases the overall availability of the system and removes the option of asynchronous communication.

SAGA pattern

This pattern is described in detail here: https://microservices.io/patterns/data/saga.html

In general, there are two options to implement a saga and two options to implement the "roll-back" mechanism. They can be combined in any way.

Choreography-based SAGA vs Orchestration-based SAGA

The main difference between the solutions is the existence of an "orchestrator": a module that controls the business transaction's flow. The orchestrator decides whether the transaction goes forward, needs a retry of some step, or requires a rollback.

This module could either be created specifically to maintain one business process, or there could be a single orchestrator for all business transactions. Creating a one-for-all module increases coupling and should not be considered in the FOLIO case.

At the moment, Data Import implements the choreography-based approach:

  • each involved module processes the whole payload of the business process (the module itself traverses the import profile, which defines whether more steps are required)
  • each involved module has hard-coded knowledge of what needs to be done to continue the business process (Kafka topic names and related behavior)

This increases the coupling of the system. The example of the choreography-based pattern in the FOLIO project is Data Import: despite the idea of Data Import being an optional component of the system, core modules (i.e. mod-inventory) still have a hard-coded dependency on the Data Import profile traversal library and on the knowledge of Data Import Kafka topics.


The alternative is the orchestration-based approach:

  • SAGA control responsibility is moved to mod-data-import
  • All involved modules only provide CRUD APIs to work with business entities, and know nothing about being involved in a larger business process (mod-srm, mod-srs, mod-inventory, mod-inventory-storage, etc)
  • Process control (process start, cancel, progress tracking) is moved to the coordinator

These changes decrease the coupling of the system, allow Data Import to become a truly "pluggable" piece of functionality, and simplify the code base of the involved modules.

The drawback of orchestration is lowered availability of the system: the business process is unavailable if the controller is down or not working properly (this seems hypothetical, since if any involved module is down, overloaded, etc., the process will be stuck anyway).
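A minimal sketch of this shape (interfaces and names are assumptions, not existing FOLIO APIs): the modules expose plain CRUD endpoints, while the flow, progress tracking, and cancellation live only in the coordinator.

```java
// Hypothetical sketch: involved modules expose only CRUD APIs and know
// nothing about the larger business process; the coordinator owns the flow.
interface CrudClient {
    String create(String entityJson); // returns the id of the created entity
}

class ImportCoordinator {
    private final CrudClient srs;        // e.g. mod-source-record-storage
    private final CrudClient inventory;  // e.g. mod-inventory-storage

    ImportCoordinator(CrudClient srs, CrudClient inventory) {
        this.srs = srs;
        this.inventory = inventory;
    }

    // Process control (start, progress tracking, cancellation) lives here,
    // not in the storage modules.
    void runImport(String sourceRecord, String instance) {
        String recordId = srs.create(sourceRecord);
        String instanceId = inventory.create(instance);
        trackProgress(recordId, instanceId);
    }

    private void trackProgress(String recordId, String instanceId) {
        // hypothetical progress bookkeeping for the business process
    }
}
```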


This potentially applies to other business processes, not only Data Import, though the viability of a particular approach should be specifically assessed against the requirements of each business process.

Data consistency

Currently the FOLIO project has a set of issues related to data consistency which need to be addressed.

One of the existing issues is described in the "Problem statement" section.

The FOLIO API approach does not support any kind of data isolation or data rollback. All changes done to business entities are applied and become available immediately. This leads to a group of problems:

  • A business transaction cannot be reverted, completely or partially, until compensating transactions or a roll-back mechanism are introduced
  • Modifications are available immediately and can be read by other modules (and stored on their side). If a transaction gets rolled back, this derived data is not necessarily reverted as well, which also leads to data inconsistency

Transaction reverting

There are two options to implement transaction reverting:

  1. Rollback
  2. Compensation transaction

Rollback

A rollback is a procedure which completely reverts a business object to a previous consistent state. If concurrent modifications are allowed in the system, the rollback procedure will also revert all changes done after the "save point", including those made by other transactions.
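As an in-memory sketch (assuming the module keeps a full copy of the entity at the save point): restoring the snapshot wholesale also discards any concurrent changes made after the save point, which is exactly the limitation described above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal rollback sketch: the save point is a full copy of the entity.
class SnapshotStore {
    private final Map<String, String> entities = new HashMap<>();   // id -> current JSON
    private final Map<String, String> savePoints = new HashMap<>(); // id -> snapshot JSON

    void beginTransaction(String entityId) {
        savePoints.put(entityId, entities.get(entityId)); // remember the save point
    }

    void update(String entityId, String newJson) {
        entities.put(entityId, newJson);
    }

    void rollback(String entityId) {
        // Restores the snapshot wholesale: any change applied after the
        // save point (including by other transactions) is reverted too.
        entities.put(entityId, savePoints.remove(entityId));
    }
}
```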

Compensation transaction

A compensation transaction is a modification of a business entity meant to revert only the changed fields, without modifying any fields that were untouched in the original transaction. It might behave better in a concurrent environment, but it does not solve all the issues (see the sketch after this list):

  • the merge mechanism might be unable to resolve a conflict (i.e. if transactions done after the "save point" modified the same fields)
  • the changes done by subsequent transactions might not have been possible if the state of the original entity had not changed (example below)
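A minimal sketch of such a compensating update (the field-map representation is an assumption): only the fields written by the original transaction are reverted, and a conflict is raised when the current value no longer matches what that transaction wrote.

```java
import java.util.Map;

// Minimal compensation sketch: revert only the fields the original
// transaction changed, leaving everything else untouched.
class Compensator {
    /**
     * @param entity    current mutable state of the entity (field -> value)
     * @param oldValues field values before the original transaction
     * @param newValues field values written by the original transaction
     */
    void compensate(Map<String, String> entity,
                    Map<String, String> oldValues,
                    Map<String, String> newValues) {
        for (String field : newValues.keySet()) {
            String current = entity.get(field);
            if (!newValues.get(field).equals(current)) {
                // A later transaction modified the same field: the merge
                // mechanism cannot resolve this automatically.
                throw new IllegalStateException("Conflict on field " + field);
            }
            entity.put(field, oldValues.get(field)); // revert this field only
        }
    }
}
```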


Example

There are two transactions going in parallel for a book item:

  1. Book return (changes the state of the book to "available to borrow")
  2. Book lease (checks whether the book's status is "available to borrow")

If the first transaction is reverted for any reason, the result might become inconsistent: the same book was "borrowed" from the system by two users.

At first glance the solution could be "modify the book state as the last step", but the problem is that this still does not guarantee transaction completion, and there might be more than one non-atomic modification to business entities.

Isolation

This example applies to both the rollback and compensation options. To avoid this situation, there are two options available:

  1. Transaction data isolation
  2. Modification barriers (i.e. pessimistic locking)

These options do not exclude each other; properly implementing the first one would probably require the second one.

Transaction data isolation

All modifications done within a business transaction become available only after the business transaction is finished (committed).

This applies to parallel transactions only: changes applied in a transaction must be visible only to the current transaction until it is committed.


This introduces a new step in business flows: the transaction commit. This step might also break or get lost, in which case the changes will remain unavailable and the involved entities might stay locked for modifications.

This situation is an edge case and is meant to be resolved manually. A manual solution means providing a tool to monitor transaction status, notify personnel, and let an operator complete or roll back stuck transactions after investigation.

There is also the option of automatic transaction commit/rollback by timeout. It complicates the system, can still fail to complete, and might require a manual fix; given its complexity versus its benefits, this option should not be considered.

Modification barriers

Modification barriers are provided by a mechanism which allows only one transaction to modify an entity until that transaction is either committed or reverted. This is usually called a "pessimistic lock".

It can be implemented in several ways:

Explicit locking

A table is introduced to define whether an entity is available for modifications or is locked. This approach still requires an "old" copy of the entity to be available for read purposes (see the "Transaction data isolation" block).
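A sketch of this variant over JDBC, assuming a PostgreSQL database and a hypothetical entity_lock table: inserting the lock row either succeeds (lock acquired) or does nothing (another transaction holds the lock).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Explicit locking sketch over a hypothetical table:
//   CREATE TABLE entity_lock (entity_id TEXT PRIMARY KEY, transaction_id TEXT);
class ExplicitLock {
    private final Connection db;

    ExplicitLock(Connection db) { this.db = db; }

    /** Returns true if the lock was acquired, false if the entity is already locked. */
    boolean tryLock(String entityId, String transactionId) throws SQLException {
        String sql = "INSERT INTO entity_lock (entity_id, transaction_id) "
                   + "VALUES (?, ?) ON CONFLICT DO NOTHING";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, entityId);
            ps.setString(2, transactionId);
            return ps.executeUpdate() == 1; // 0 rows inserted -> already locked
        }
    }

    /** Called on commit or revert to make the entity modifiable again. */
    void unlock(String entityId, String transactionId) throws SQLException {
        String sql = "DELETE FROM entity_lock WHERE entity_id = ? AND transaction_id = ?";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, entityId);
            ps.setString(2, transactionId);
            ps.executeUpdate();
        }
    }
}
```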

Pending versions (implicit locking)

Entity storage is organized so that it can store multiple versions of an entity and provide a pointer to the "committed" version of the entity. All "read" requests always read committed versions (they could read pending versions if required, see "dirty read", but this case is not covered in this document).

When a modification is to be applied to an entity, it is stored as a "pending version" and the pointer is not adjusted. The absence of a "pending version" (only zero or one "pending" versions should be allowed) defines whether an entity is available for modifications.

When the transaction is committed/reverted, the pointer is adjusted to the "pending version" or the pending version is removed, accordingly.

This solution addresses both the isolation and the barrier issues.
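A minimal in-memory sketch of this scheme (the real implementation would live in the module's database): reads follow the committed pointer, a write creates the single allowed pending version, commit moves the pointer, and revert drops the pending version.

```java
import java.util.HashMap;
import java.util.Map;

// Pending-version sketch: at most one uncommitted version per entity.
class VersionedStore {
    private final Map<String, String> committed = new HashMap<>(); // id -> committed version
    private final Map<String, String> pending = new HashMap<>();   // id -> pending version

    /** Reads always return the committed version. */
    String read(String entityId) { return committed.get(entityId); }

    /** The presence of a pending version acts as the modification barrier. */
    boolean tryWrite(String entityId, String newVersion) {
        return pending.putIfAbsent(entityId, newVersion) == null;
    }

    /** Commit: the pointer is moved to the pending version. */
    void commit(String entityId) {
        String v = pending.remove(entityId);
        if (v != null) committed.put(entityId, v);
    }

    /** Revert: the pending version is simply dropped. */
    void revert(String entityId) { pending.remove(entityId); }
}
```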

Data Import

The current implementation of Data Import does not include a process reverting mechanism, as it was not stated in the initial requirements.

As per the requirements update in the JIRA issues linked to this page, a rollback procedure and a data consistency rework will be required.

The current implementation of Data Import follows the choreography-based approach:

  • there is no dedicated process coordinator
  • the payload of the process contains a "profile" which defines the next steps required to continue the process
  • payloads are processed by "executor" modules (i.e. mod-inventory for most scenarios)
  • mod-inventory contains a dependency on the Data Import library to traverse the profile and perform the required actions

Data import profile

The Data Import process is driven by an entity called a "profile".

A profile is a hierarchical structure of objects:

  • matchers
  • actions

More details can be found here: Data Import

Each profile node is processed by a corresponding module (example: instances, holdings, and items are matched in mod-inventory, while MARC matches are done in mod-source-record-manager). The same applies to actions.

After a "match" statement is processed, the result is added to the payload and a message with the updated payload is published to the corresponding Kafka topic (example: DI_INVENTORY_INSTANCE_MATCHED). If no entities matched, a message is published to the corresponding "not matched" topic (example: DI_INVENTORY_HOLDING_NOT_MATCHED).

Those topics are consumed by the modules, and the next step of the profile is processed. Detailed flows are described here: Flow descriptions
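For illustration, a hedged sketch of this hand-off using the plain Kafka producer API (the real modules use FOLIO's Kafka integration and tenant-qualified topic names; the payload shape is simplified here):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: after a match step, the updated payload is published to the topic
// consumed by the module responsible for the next profile step.
class MatchResultPublisher {
    private final KafkaProducer<String, String> producer;

    MatchResultPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    void publishInstanceMatched(String jobId, String updatedPayloadJson) {
        producer.send(new ProducerRecord<>("DI_INVENTORY_INSTANCE_MATCHED",
                jobId, updatedPayloadJson));
    }

    void publishHoldingNotMatched(String jobId, String updatedPayloadJson) {
        producer.send(new ProducerRecord<>("DI_INVENTORY_HOLDING_NOT_MATCHED",
                jobId, updatedPayloadJson));
    }
}
```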

Rollback control options

Both solutions (choreography and orchestration) can provide rollback options.

To link all changes to a particular transaction, each transaction receives a unique identifier at its very beginning. This identifier is added as an immutable part of the process payload.

Choreography

  • When a module is unable to apply changes, it publishes a message to a "rollback" topic. There could also be a set of retry-able errors, so the module can decide to make a few retries before initiating a rollback. The message contains the error description and the transaction identifier.
  • Only one rollback topic is required (per business process and tenant).
  • Every module involved in a particular business transaction (i.e. having listeners on the specific Kafka topics related to that business transaction) listens to the "rollback" topic.
  • If a rollback message is received, the changes done by that business transaction are reverted (either way: rollback or compensation).

There is also an option to create rollback topics "per transition", so every particular action can have its own rollback topic.

This mechanism can be monitored, and alerts can be generated when a rollback is initiated. However, since there is no single module that knows all transaction steps, the rollback process cannot be easily controlled and observed.

DI_ERROR messages could be used to trigger the rollback mechanism in the modules, and DI_COMPLETE could be used as the "commit" message; a sketch of such a listener follows.
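A minimal sketch of the choreography-based rollback listener, assuming the plain Kafka consumer API, a shared per-process rollback topic named DI_ROLLBACK (hypothetical), and a module-specific revert routine:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: every involved module runs a listener like this next to its
// normal event consumers. Topic name and payload shape are assumptions.
class RollbackListener {
    private final KafkaConsumer<String, String> consumer;

    RollbackListener(Properties kafkaProps) {
        this.consumer = new KafkaConsumer<>(kafkaProps);
        consumer.subscribe(List.of("DI_ROLLBACK")); // one topic per process and tenant
    }

    void pollLoop() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The transaction identifier travels as the message key here;
                // the value carries the error description.
                revertLocalChanges(record.key());
            }
        }
    }

    private void revertLocalChanges(String transactionId) {
        // Module-specific: find every entity this module changed under
        // transactionId and revert it (rollback or compensation).
    }
}
```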

Orchestration

The major difference from the choreography-based rollback is that the orchestrator service decides whether a rollback is needed.

  • The orchestrator calls every module to perform the required actions and interprets the result (whether it is a match or an action)
  • If a module fails to perform an action, the orchestrator decides whether the action can be retried or the whole transaction requires a rollback
  • When a rollback is initiated, the orchestrator calls module APIs to perform the action rollback, specifying the transaction identifier. This can be done either synchronously or asynchronously (direct API calls or a "rollback" queue)
  • The orchestrator waits for all rollbacks to be completed and returns the status of the transaction

This approach allows the system to track the rollback process completely: the orchestrator knows exactly which actions were done and need to be reverted.
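A minimal sketch of this control flow (the SagaStep interface and its method names are assumptions, not an existing FOLIO API): the orchestrator records every completed step and, on failure, invokes the rollback of those steps in reverse order, passing the transaction identifier.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical step abstraction: each module action exposes "do" and "undo".
interface SagaStep {
    void execute(String transactionId) throws Exception;
    void rollback(String transactionId); // calls the module's rollback API
}

class SagaOrchestrator {
    /** Returns true if the transaction committed, false if it was rolled back. */
    boolean run(String transactionId, List<SagaStep> steps) {
        Deque<SagaStep> done = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute(transactionId); // retries could be attempted here
                done.push(step);
            } catch (Exception e) {
                // The orchestrator knows exactly which actions completed,
                // so the rollback is fully observable and runs in reverse order.
                while (!done.isEmpty()) {
                    done.pop().rollback(transactionId);
                }
                return false;
            }
        }
        return true;
    }
}
```

In the Data Import case, each SagaStep would wrap one module call (e.g. creating an instance in mod-inventory-storage), with the rollback delegating to that module's revert API.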

Rollback implementation options

Both compensation and rollback are possible to implement, and each module can use either of them.

Both the old and new states (or versions, in the rollback case) are stored in the module's persistent storage.

Logic to clean up rollback data needs to be implemented in every module (example for DI: clean up the old state / shift the current version on the 'DI_COMPLETE' event).

Since every transaction requires data isolation, there is no viable option to store the "old state" externally (i.e. in the controller).

Isolation

The implementation of change isolation is the same (or very similar) for both solutions. Internal implementations of data access must also use the transaction identifier to respect change visibility.

Conclusion and thoughts