Automate linking in quickMARC app

Introduction

This document outlines the design of a backend feature that will allow users to automatically validate, update, or create links between MARC bib fields and an authority record when editing a MARC bib record.

More details are available in the feature [UXPROD-3874] and the spike story [MODELINKS-79].

Requirements

Functional Requirements

  1. The API must allow finding all applicable MARC authorities to control the MARC bib record based on the $0 subfield of the bib's fields.
  2. The API must provide only suggestions for links and must not save any new data; saving will be performed by the user.
  3. The API must allow saving the MARC bib record even if linking a MARC bib field to a MARC authority record was unsuccessful.
  4. The API must allow sending a MARC record for links assignment with already existing links.
  5. The API must allow sending a MARC record for links assignment with unsaved links and unsaved changes to the bib record.
  6. The API must report which fields were linked successfully and which fields applicable for linking failed to link.
  7. The API must provide different types of failures for failed fields.

Non-Functional Requirements

  1. Automated linking should take no longer than ~2 seconds.

Architecture

An API endpoint will be implemented in mod-quick-marc to provide the UI with suggested links for the record. mod-quick-marc will work as a proxy module that calls a newly created mod-entities-links endpoint. All main linking logic will be implemented in mod-entities-links. mod-search will be used as a search service for finding applicable MARC authorities. mod-source-record-storage will be used for fetching the data needed to construct controllable fields in the MARC bib record. All interaction between modules happens via HTTP.

Component Diagram Source
@startuml
skinparam componentStyle rectangle

[User Interface]


package "Backend" {
  [Okapi] --> [mod-entities-links]
  [mod-entities-links] ..> [Okapi]
  [Okapi] --> [mod-search]
  [mod-search] ..> [Okapi]
  [Okapi] --> [mod-quick-marc]
  [mod-quick-marc] ..> [Okapi]
  [Okapi] --> [mod-source-record-storage]
  [mod-source-record-storage] ..> [Okapi]
}

[User Interface] --> [Okapi]
[Okapi] ..> [User Interface]

database "PostgreSql" {
  [entities-links]
  [source-record-storage]
}

database "OpenSearch/ElasticSearch" {
  [authorities]
}

[mod-entities-links] <--> [entities-links]
[mod-source-record-storage] <--> [source-record-storage]
[mod-search] <--> [authorities]

@enduml

Data Flow and Processing

Sequence Diagram Source
@startuml
title QuickMarc Autolinking 

participant UI as ui
participant "quick-marc" as qm
participant "entities-links" as el
participant "search" as ms
participant "source-record-storage" as rs

autonumber
ui -> qm ++ : request record\nwith links
qm -> qm: convert
qm -> el ++ : request assign links
el -> el : get linking rules
el -> el : extract $0s
el -> ms ++ : search authorities (without count)
ms --> el -- : authorities
el -> rs ++ : get records by external ids
rs --> el -- : source records
el -> el : set links data
el --> qm -- : record with links
qm -> qm -- : convert
qm --> ui : record with links
@enduml
  1. The UI sends a request to the backend API.
  2. mod-quick-marc receives the request and converts the record into an SRS-like format.
  3. mod-quick-marc sends a request to the mod-entities-links API.
  4. mod-entities-links receives the request and fetches the linking rules from the database through a cache.
  5. $0 subfield values are extracted from the MARC bib fields that are applicable for linking according to the linking rules.
  6. The $0 values are used to search for authorities in mod-search. The current mod-search endpoint also calculates the number of already existing links in the instance index; this should be omitted to speed up the process. TBD: the authorities' naturalId already exists in the internal database table 'authority_data'. Should this data be used before doing a search in mod-search?
  7. mod-entities-links receives a collection of authority records and prepares a request to the mod-source-record-storage.
  8. mod-entities-links sends a request to the mod-source-record-storage bulk endpoint.
  9. mod-entities-links receives a collection of authority source records.
  10. mod-entities-links analyzes the results, prepares data for links according to the linking rules, and sets the constructed links into the record.
  11. mod-quick-marc receives the record with links.
  12. mod-quick-marc converts the record into the appropriate format.
  13. UI receives the record with suggested links.
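
The linking part of the steps above (steps 5, 6 and 10) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the mod-entities-links implementation: the set of linkable tags, the function names, and the search_authorities callback (a stand-in for the mod-search call) are hypothetical, and the steps that fetch source records and build controlled subfields are reduced to setting $9 and the link status. The status handling follows this document: "NEW" for new suggestions, "ACTUAL" for links that existed or were fixed, and "ERROR" with cause code 101 or 102 otherwise.

```python
from collections import defaultdict

# Hypothetical stand-in for the linking rules (illustrative values only).
LINKABLE_BIB_TAGS = {"100", "110", "600"}

NOT_FOUND = "101"       # applicable authority was not found
MULTIPLE_FOUND = "102"  # 2 or more applicable authorities were found


def extract_zero_values(field_body):
    """Return all $0 values from one SRS-like field body."""
    return [sub["0"] for sub in field_body.get("subfields", []) if "0" in sub]


def suggest_links(record, search_authorities):
    """Set linkStatus on every linkable field of an SRS-like record, in place.

    search_authorities is an assumed callback: it takes a set of naturalIds
    and returns a list of {"id": ..., "naturalId": ...} dicts.
    """
    linkable = [
        body
        for field in record["fields"]
        for tag, body in field.items()
        if tag in LINKABLE_BIB_TAGS and isinstance(body, dict)
    ]

    # One bulk search for all $0 values of the record.
    natural_ids = {v for body in linkable for v in extract_zero_values(body)}
    by_natural_id = defaultdict(list)
    for authority in search_authorities(natural_ids):
        by_natural_id[authority["naturalId"]].append(authority)

    for body in linkable:
        matches = {a["id"] for v in extract_zero_values(body) for a in by_natural_id[v]}
        if len(matches) == 1:
            # Exactly one authority: write its id into $9 and mark the link.
            subfields = [s for s in body.get("subfields", []) if "9" not in s]
            subfields.append({"9": next(iter(matches))})
            body["subfields"] = subfields
            body["linkStatus"] = "ACTUAL" if "linkStatus" in body else "NEW"
            body.pop("errorStatusCode", None)
        else:
            body["linkStatus"] = "ERROR"
            body["errorStatusCode"] = MULTIPLE_FOUND if matches else NOT_FOUND
    return record
```

Passing the search as a callback keeps the sketch free of HTTP details, so the decision logic can be exercised with an in-memory list of authorities.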

API Design

mod-quick-marc

POST /records-editor/links/suggestion

This endpoint will be used to find and provide UI with valid links for a record. The request will include a JSON payload with the record data:

Request body
{
  "marcFormat": "BIBLIOGRAPHIC",
  "leader": "01587ccm a2200361   4500",
  "fields": [
    {
      "tag": "001",
      "content": "393893"
    },
    {
      "tag": "100",
      "content": "$a 393893 $b test $0 n1234567890 $9 312da284-a8fd-4c84-ae90-927539d6df93",
      "indicators": [
        "1",
        "2"
      ],
      "link": {
        "authorityId": "312da284-a8fd-4c84-ae90-927539d6df93",
        "authorityNaturalId": "n1234567890",
        "linkingRuleId": 1,
        "status": "ACTUAL"
      }
    },
    {
      "tag": "100",
      "content": "$a 393893 $b test $0 n1234567890 $9 312da284-a8fd-4c84-ae90-927539d6df93",
      "indicators": [
        "1",
        "2"
      ],
      "link": {
        "authorityId": "312da284-a8fd-4c84-ae90-927539d6df93",
        "authorityNaturalId": "n1234567890",
        "linkingRuleId": 1,
        "status": "ERROR"
      }
    },
    {
      "tag": "600",
      "content": "$a 393893 $b test",
      "indicators": [
        "1",
        "2"
      ]
    }
  ]
}

The response will include suggested links with the status "NEW"; corrected data and the status "ACTUAL" for links that had the status "ERROR"; and links with the status "ERROR" and a cause code for fields where a link could not be assigned.

Response body
{
  "marcFormat": "BIBLIOGRAPHIC",
  "leader": "01587ccm a2200361   4500",
  "fields": [
    {
      "tag": "001",
      "content": "393893"
    },
    {
      "tag": "100",
      "content": "$a 393893 $b test $0 n1234567890 $9 312da284-a8fd-4c84-ae90-927539d6df93",
      "indicators": [
        "1",
        "2"
      ],
      "link": {
        "authorityId": "312da284-a8fd-4c84-ae90-927539d6df93",
        "authorityNaturalId": "n1234567890",
        "linkingRuleId": 1,
        "status": "ACTUAL"
      }
    },
    {
      "tag": "110",
      "content": "$a 393893 $b updated $0 n1234567890 $9 312da284-a8fd-4c84-ae90-927539d6df93",
      "indicators": [
        "1",
        "2"
      ],
      "link": {
        "authorityId": "312da284-a8fd-4c84-ae90-927539d6df93",
        "authorityNaturalId": "n1234567890",
        "linkingRuleId": 1,
        "status": "NEW"
      }
    },
    {
      "tag": "600",
      "content": "$a 393893 $b test",
      "indicators": [
        "1",
        "2"
      ],
      "link": {
        "status": "ERROR",
        "errorCauseCode": "101"
      }
    }
  ]
}
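
A consumer of this response could group the fields by link status to report which fields were linked and which failed. The following Python sketch assumes only the field shape shown in the response body above; the function name summarize_links and the shape of the summary are hypothetical.

```python
def summarize_links(record):
    """Group the field tags of a quickMARC-style record by link status."""
    summary = {"NEW": [], "ACTUAL": [], "ERROR": []}
    for field in record.get("fields", []):
        link = field.get("link")
        if not link:
            continue  # the field carries no link information
        entry = {"tag": field["tag"]}
        if link.get("status") == "ERROR":
            # Keep the cause code so the failure type can be reported.
            entry["errorCauseCode"] = link.get("errorCauseCode")
        summary.setdefault(link.get("status"), []).append(entry)
    return summary
```

Fields without a "link" object, such as the 001 control field, are skipped rather than reported as failures.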

Error cause types:

Error cause code | Description
101 | applicable authority was not found
102 | 2 or more applicable authorities were found
103 | auto linking feature is disabled
TBD
Request body
{
  "records": [
    {
      "fields": [
        {
          "001": "393893"
        },
        {
          "100": {
            "ind1": "/",
            "ind2": "/",
            "subfields": [
              {
                "a": "Mozart, Wolfgang Amadeus,"
              },
              {
                "d": "1756-1791."
              },
              {
                "0": "12345"
              },
              {
                "9": "b9a5f035-de63-4e2c-92c2-07240c88b817"
              }
            ],
            "linkStatus": "ACTUAL"
          }
        },
        {
          "110": {
            "ind1": "/",
            "ind2": "/",
            "subfields": [
              {
                "a": "Mozart"
              }
            ]
          }
        }
      ],
      "leader": "01706ccm a2200361   4500"
    }
  ]
}

The response will include suggested links with the status "NEW"; corrected data and the status "ACTUAL" for links that had the status "ERROR"; and links with the status "ERROR" and a cause code for fields where a link could not be assigned.

Response body
{
  "records": [
    {
      "fields": [
        {
          "001": "393893"
        },
        {
          "100": {
            "ind1": "/",
            "ind2": "/",
            "subfields": [
              {
                "a": "Mozart, Wolfgang Amadeus,"
              },
              {
                "d": "1756-1791."
              },
              {
                "0": "12345"
              },
              {
                "9": "b9a5f035-de63-4e2c-92c2-07240c88b817"
              }
            ],
            "linkStatus": "ACTUAL"
          }
        },
        {
          "110": {
            "ind1": "/",
            "ind2": "/",
            "subfields": [
              {
                "a": "Mozart"
              },
              {
                "0": "12345"
              },
              {
                "9": "b9a5f035-de63-4e2c-92c2-07240c88b817"
              }
            ],
            "linkStatus": "NEW"
          }
        },
        {
          "130": {
            "ind1": "/",
            "ind2": "/",
            "subfields": [
              {
                "a": "Mozart"
              }
            ],
            "linkStatus": "ERROR",
            "errorStatusCode": "101"
          }
        }
      ],
      "leader": "01706ccm a2200361   4500"
    }
  ]
}
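
The conversion that mod-quick-marc performs between its own field format and the SRS-like format shown above could look roughly like the Python sketch below. It is an assumption-laden illustration, not the module's converter: the function names are hypothetical, a field without "indicators" is treated as a control field, and subfield values are assumed not to contain a literal "$".

```python
import re

# "$a value $b value": a subfield code followed by its value.
SUBFIELD = re.compile(r"\$([a-z0-9])\s*([^$]*)")


def content_to_subfields(content):
    """Split a quickMARC content string into SRS-like subfield objects."""
    return [{code: value.strip()} for code, value in SUBFIELD.findall(content)]


def to_srs_field(field):
    """Convert one quickMARC-style field into the SRS-like shape."""
    tag = field["tag"]
    if "indicators" not in field:
        return {tag: field["content"]}  # control field such as 001
    ind1, ind2 = field["indicators"]
    body = {
        "ind1": ind1,
        "ind2": ind2,
        "subfields": content_to_subfields(field["content"]),
    }
    status = field.get("link", {}).get("status")
    if status:
        body["linkStatus"] = status
    return {tag: body}
```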

mod-source-record-storage

POST /source-storage/batch/parsed-records/fetch
Request body
{
  "conditions": {
    "ids": [
      "312da284-a8fd-4c84-ae90-927539d6df93",
      "934fee76-89e5-4046-89f0-d812e5368e1c"
    ],
    "idType": "EXTERNAL"
  },
  "data": {
    "fieldsRange": "010,100-199"
  },
  "recordType": "MARC_AUTHORITY"
}

The response will include the collection of records matching the conditions. Each record will contain all IDs related to it and only the fields covered by the fieldsRange field.

Response body
{
  "records": [
    {
      "id": "c56b70ce-4ef6-47ef-8bc3-c470bafa0b8c",
      "externalIdsHolder": {
        "authorityId": "b9a5f035-de63-4e2c-92c2-07240c89b817"
      },
      "recordType": "MARC_AUTHORITY",
      "recordState": "ACTUAL",
      "parsedRecord": {
        "id": "c9db5d7a-e1d4-11e8-9f32-f2801f1b9fd1",
        "content": {
          "fields": [
            {
              "010": {
                "ind1": " ",
                "ind2": " ",
                "subfields": [
                  {
                    "a": "2001000234"
                  }
                ]
              }
            },
            {
              "100": {
                "ind1": "/",
                "ind2": "/",
                "subfields": [
                  {
                    "a": "Mozart, Wolfgang Amadeus"
                  },
                  {
                    "d": "1756-1791"
                  }
                ]
              }
            },
            {
              "110": {
                "ind1": "1",
                "ind2": "0",
                "subfields": [
                  {
                    "a": "Works"
                  }
                ]
              }
            }
          ],
          "leader": "01706ccm a2200361   4500"
        }
      }
    }
  ],
  "totalRecords": 1
}
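
The effect of fieldsRange on the returned fields can be illustrated with a short Python sketch. The syntax is inferred from the single example "010,100-199" (comma-separated tags and inclusive tag ranges); that reading and the function names are assumptions, and the sketch says nothing about how mod-source-record-storage would implement the filter.

```python
def parse_fields_range(fields_range):
    """Turn "010,100-199" into a list of inclusive (low, high) tag bounds."""
    bounds = []
    for part in fields_range.split(","):
        low, _, high = part.strip().partition("-")
        bounds.append((low, high or low))  # a single tag is a one-tag range
    return bounds


def filter_fields(fields, fields_range):
    """Keep only the SRS-like fields whose tag falls inside fields_range."""
    bounds = parse_fields_range(fields_range)
    kept = []
    for field in fields:
        tag = next(iter(field))  # each field object has exactly one key: its tag
        # Three-character, zero-padded tags compare correctly as strings.
        if any(low <= tag <= high for low, high in bounds):
            kept.append(field)
    return kept
```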

mod-search

GET /search/authorities

New query parameter to add:

Parameter | Type | Note
includeNumberOfTitles | boolean (default = true) | If false, the search for the number of linked instances is not performed
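
A caller such as mod-entities-links could build the search request with counting switched off roughly as in the Python sketch below. The function name is hypothetical, and the CQL expression on naturalId passed in a "query" parameter is an assumption about how the search would be phrased; only the path and the includeNumberOfTitles parameter come from this document.

```python
from urllib.parse import urlencode


def build_authority_search_path(natural_ids, include_number_of_titles=False):
    """Build the path and query string for one bulk authority search."""
    # Assumed CQL form: naturalId==("n1" or "n2")
    terms = " or ".join(f'"{nid}"' for nid in sorted(natural_ids))
    params = {
        "query": f"naturalId==({terms})",
        "includeNumberOfTitles": str(include_number_of_titles).lower(),
    }
    return "/search/authorities?" + urlencode(params)
```

Searching for all $0 values in one request keeps the number of HTTP calls per record constant, which matters for the ~2 second target.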


Performance

Considerations

  1. Using mod-search to search by naturalId, instead of searching in mod-source-record-storage, should decrease response time when the system holds more than 1M records. (mod-search will also be needed for possible future requirements to link automatically not only by $0 but by other data.)
  2. Disabling the counting of linked instances in the mod-search authority request should decrease response time.
  3. Returning only the required fields in the mod-source-record-storage response will decrease the amount of data transferred via HTTP. Whether this is necessary should be tested, since the extra processing could itself decrease performance. There are 2 options: get the record as jsonb from the marc_records table and retain only the needed fields, or construct the record field by field from the marc_indexers partitioned table.
  4. Using the mod-search and mod-source-record-storage bulk endpoints will decrease response time.

Testing

Performance testing must be done in an environment with:

  • > 1M authority records
  • > 1M MARC-based instance records
  • Prepared MARC bib records that have >50 fields applicable for linking, all with $0 values that match authorities existing in the system.

Tests are needed for:

  • 1 request/sec
  • 10 requests/sec
  • 100 requests/sec
  • 1000 requests/sec
Based on the testing results, performance improvements may be suggested if required.