Search in Source Record Storage

Roles:

  • Solution Architect
  • Java Lead: Igor Gorchakov
  • UI Lead: no UI planned
  • Product Owner: Jenn Colt (for this feature); Ann-Marie Breaux (Deactivated) (for SRS)

Related Jira issues: MODSOURCE-215, UXPROD-2791

Introduction

The existing Source Record Storage (SRS) solution for FOLIO has a limited set of search capabilities. Several changes have already been made to SRS storage to improve data structures and to speed up queries with criteria based on identifiers (IDs) and metadata fields. However, those changes did not affect how MARC records themselves are represented and stored. As a consequence, any search request with criteria based on MARC fields is inefficient, especially on large volumes of data.

Problem statement

Current state

In the existing solution, MARC records are stored as JSON documents in JSONB columns. Possible indexing approaches involve the creation of rather complex indexes, but in most cases the efficiency of such indexes is poor. This is especially evident for large amounts of data.

Currently, search requests to SRS can be divided into several categories:

  • Single-record search by internal IDs (fields that are not part of the MARC record).
    This type of request can be served very efficiently because the required indexes are already in place.
  • Requests from “Data Import” for MARC matching operations.
    Usually, such requests contain one or more exact-match conditions based on values of MARC fields. There are also cases when values must be preprocessed in some way, such as skipping prefixes and/or suffixes or evaluating substrings for a specified offset and length. Such requests are expected to return zero or one record. There is no efficient solution for such cases.
  • Requests with complex criteria based on values of MARC fields that can return quite large result sets.
    Such requests are expected to complete within one minute. There is no efficient solution for such cases either.

Desired state

SRS is expected to support all types of requests listed in the previous section and to comply with the following requirements:

  • Functional requirements
    1. Support exact-match search based on MARC fields that hold unique values, for example, a search for a particular value in the 001 field. Other fields that can be used are, for example, 999, 010, 020, 022, and 035.
      1. Support additional conditions for the exact-match search. These conditions could be applied to any MARC field and could use expressions such as ‘=’, ‘>’, ‘<’, ‘contains’, ‘begins with’, ‘ends with’.
    2. Support search for a list of MARC records that conform to some criteria combined with the conjunctions ‘AND’ and ‘OR’. Possible expressions for a criterion could be:
      1. ‘=’, ‘>’, ‘<’
      2. ‘contains’, ‘begins with’, ‘ends with’
      3. A value of a field has a specified character at a specified position
      4. A value of a field has a specified substring starting from a specified position
    3. Negative conditions (NOT) must be supported.
    4. Support search for a list of MARC records that have specified values at specified positions in the MARC leader.
  • Non-functional requirements
    1. The search must be consistent (in ACID terms): once changes to SRS are committed by user/process ‘A’, they must be immediately available, without delay, to all other users/processes querying SRS.
    2. The execution time for a request that retrieves a single MARC record should be no longer than half a second. This strict requirement arises because quite large files must be processed by Data Import within a reasonable time frame.
    3. The expected execution time for requests that retrieve multiple records must be discussed further because it highly depends on the business flows that will use such requests. The initial expectation is no longer than one minute, but this raises additional questions, such as how the final data should be made available to the user, especially when rather large result sets (thousands or millions of rows) are returned by the underlying persistent storage.

Proposed solution

Analysis of the current solution

MARC records are stored as JSON documents in Source Record Storage. Since JSON is an open format, it allows working with loosely structured data, of which MARC is an example. With JSON it is quite simple to store, process, and represent data. The main disadvantage of this approach, however, is search. The main concerns are the volume of data that must be indexed on the one hand and the inability to create simple and efficient indexes on the other. These two concerns must be addressed first.
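
To illustrate the difficulty, the sketch below shows how a lookup by a single MARC subfield value has to be expressed against the current JSON representation (assuming the marc_records_lb table and the ‘fields’/‘subfields’ layout referenced later in this document; the 035 $a value is a hypothetical example). Every predicate is buried inside nested JSON, so no simple, selective index can be defined for it.

-- Sketch only: find records whose 035 $a equals a given value by unnesting
-- the JSONB content; each record's 'fields' array has to be exploded at query time.
select r.id
from marc_records_lb r,
     jsonb_array_elements(r.content -> 'fields') f,
     jsonb_each(f.value) fld,
     jsonb_array_elements(fld.value -> 'subfields') sf,
     jsonb_each_text(sf.value) sub
where fld.key = '035'
  and sub.key = 'a'
  and sub.value = '(OCoLC)12345';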

Rationale for the database approach (PostgreSQL)

The requirement for consistent search is significant and cannot be ignored. Since the PostgreSQL database is used as the persistent storage for the FOLIO platform and this cannot be changed, no additional storage can be used for indexing and search, because that would lead to eventual consistency, which is not acceptable. Therefore, the only way to achieve consistent search is to use PostgreSQL indexing capabilities.

Solution with partitioned tables

The MARC 21 format defines a generalized structure for records. Each record begins with a leader, which is a fixed field containing information for the processing of the record. Following the leader is the directory, which is an index to the location of the variable fields (control and data) within the record. The fields following the directory are all variable fields. The first variable field is the control number field, which contains an ASCII graphic character string uniquely associated with the record by the organization transmitting the record. Following the control number field are the rest of the control fields, which contain information useful or required for the processing of the record. Following the control fields are data fields, which contain general data. A field terminator (FT), ASCII control character 1E(hex), is used to terminate the directory and each variable field within the record. A record terminator (RT), ASCII control character 1D(hex), is used as the final character of the record, following the field terminator of the last data field. These elements of the record are described in more detail in the MARC 21 specification.

Taking into account the structure of MARC records, it makes sense to store different parts of a MARC record separately. The intention is to add additional tables, each storing a smaller volume of data in a format suitable for simple indexing, without changing the existing tables and structures. These new tables will be used for indexing and searching purposes only. At the same time, a unified structure must be used for those tables to simplify query generation and data access.

PostgreSQL supports a powerful feature: table partitioning. Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:

  • Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
  • When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of the sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
  • Seldom-used data can be stored in separate partitions and migrated to cheaper and slower storage media.

So the idea is to partition a table by “Field no” and, at the same time, change the table structure to use only simple types.

The MARC leader has a very strict structure, defined at https://www.loc.gov/marc/bibliographic/bdleader.html. So, it makes sense to store every part of the leader value separately to support search conditions addressed to the leader and to keep them simple. This means that a dedicated table should be created to store leader values.

Partitioned table


marc_indexers

field_no    varchar(3)
ind1        varchar(1)
ind2        varchar(1)
subfield_no varchar(1)
value       text
marc_id     uuid

Create table SQL statement

create table marc_indexers
(
    field_no    varchar(3) constraint nn_marc_indexers_field_no not null,
    ind1        varchar(1) constraint nn_marc_indexers_ind1 not null,
    ind2        varchar(1) constraint nn_marc_indexers_ind2 not null,
    subfield_no varchar(1) constraint nn_marc_indexers_subfield_no not null,
    value       text,
    marc_id     uuid constraint nn_marc_indexers_marc_id not null
) partition by list (field_no);

Create partitions PL/pgSQL block

do
$$
    declare
        ind integer;
        suffix text;
    begin
        for ind in 0 .. 999
            loop
                suffix = lpad(ind::text, 3, '0');
                execute 'create table marc_indexers_' || suffix || ' partition of marc_indexers for values in ('''|| suffix ||''');';
            end loop;
        create table marc_indexers_cat partition of marc_indexers for values in ('cat');
        create table marc_indexers_own partition of marc_indexers for values in ('own');
        create table marc_indexers_lkr partition of marc_indexers for values in ('lkr');
    end;
$$;


The main concern here is that, apart from the standard field numbers and the leader, some libraries can use their own identifiers as field_no (‘cat’, ‘own’, and ‘lkr’ in the block above). So there must be a solution to handle such cases; one possible option is sketched below.
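
One possible way to handle this (a sketch, assuming PostgreSQL 11 or later) is to add a DEFAULT partition that catches any field_no value not covered by a pre-created partition:

-- Sketch: route non-standard field numbers that have no dedicated partition
-- into a catch-all DEFAULT partition.
create table marc_indexers_default partition of marc_indexers default;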

Leader table

marc_indexers_leader
p_00_04 varchar(5)
p_05    varchar(1)
p_06    varchar(1)
p_07    varchar(1)
p_08    varchar(1)
p_09    varchar(1)
p_10    varchar(1)
p_11    varchar(1)
p_12_16 varchar(5)
p_17    varchar(1)
p_18    varchar(1)
p_19    varchar(1)
p_20    varchar(1)
p_21    varchar(1)
p_22    varchar(1)
marc_id uuid

Create table SQL statement

create table marc_indexers_leader
(
    p_00_04 varchar(5) constraint nn_marc_indexers_leader_p_00_04 not null,
    p_05    varchar(1) constraint nn_marc_indexers_leader_p_05 not null,
    p_06    varchar(1) constraint nn_marc_indexers_leader_p_06 not null,
    p_07    varchar(1) constraint nn_marc_indexers_leader_p_07 not null,
    p_08    varchar(1) constraint nn_marc_indexers_leader_p_08 not null,
    p_09    varchar(1) constraint nn_marc_indexers_leader_p_09 not null,
    p_10    varchar(1) constraint nn_marc_indexers_leader_p_10 not null,
    p_11    varchar(1) constraint nn_marc_indexers_leader_p_11 not null,
    p_12_16 varchar(5) constraint nn_marc_indexers_leader_p_12_16 not null,
    p_17    varchar(1) constraint nn_marc_indexers_leader_p_17 not null,
    p_18    varchar(1) constraint nn_marc_indexers_leader_p_18 not null,
    p_19    varchar(1) constraint nn_marc_indexers_leader_p_19 not null,
    p_20    varchar(1) constraint nn_marc_indexers_leader_p_20 not null,
    p_21    varchar(1) constraint nn_marc_indexers_leader_p_21 not null,
    p_22    varchar(1) constraint nn_marc_indexers_leader_p_22 not null,
    marc_id uuid       constraint nn_marc_indexers_leader_marc_id not null
);

Filling in data

There are two events on which data must be added or changed in these new tables.

The first one is the initial data loading once this feature is added. It can be done with a single PL/pgSQL block that parses and inserts data for all existing MARC records.

do
$$
    <<fill_marc_indexers_simple_block>>
    declare
        rec record;
    begin
        for rec in (select id, content from marc_records_lb x)
            loop
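                -- Explode the record's JSONB content: take the 'fields' array, split
                -- each field object into field number, indicators, and subfields, and
                -- insert one marc_indexers row per subfield (control fields without
                -- subfields get subfield_no '0' and the whole field value).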
                insert into marc_indexers (field_no, ind1, ind2, subfield_no, value, marc_id)
                    (with vals as (select value
                                   from jsonb_array_elements((
                                       select value
                                       From jsonb_each(rec.content) x
                                       where key = 'fields')) y),
                          fields as (select x.key as field_no, x.value as field_value
                                     From vals,
                                          jsonb_each(vals.value) x),
                          fields_subfields as (
                              select field_no,
                                     trim(field_value ->> 'ind1'::text) ind1,
                                     trim(field_value ->> 'ind2')       ind2,
                                     field_value -> 'subfields'         subfields,
                                     field_value
                              From fields),
                          marc_raw as (
                              select fs.field_no, fs.ind1, fs.ind2, fs.field_value, null::text subfield_no, null::text subfield_value
                              from fields_subfields fs
                              where subfields is null
                              union all
                              select fs.field_no, fs.ind1, fs.ind2, fs.field_value, subfs.key::text subfield_no, subfs.value::text subfield_value
                              From fields_subfields fs,
                                   jsonb_array_elements(fs.subfields) sx,
                                   jsonb_each(sx.value) subfs
                              where subfields is not null),
                          marc as (
                              select m.field_no,
                                     CASE WHEN ind1 IS NULL or ind1 = '' THEN '#' ELSE ind1 END                            as ind1,
                                     CASE WHEN ind2 IS NULL or ind2 = '' THEN '#' ELSE ind2 END                            as ind2,
                                     CASE WHEN subfield_no IS NULL or trim(subfield_no) = '' THEN '0' ELSE subfield_no END as subfield_no,
                                     trim(both '"' from coalesce(subfield_value, field_value::text))                       as value
                              From marc_raw m)
                     select distinct lower(field_no) field_no, ind1, ind2, subfield_no, value, rec.id marc_id
                     from marc);
                --
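                -- Extract the leader string, lowercase it, replace spaces with '#',
                -- and split it into the positional columns of marc_indexers_leader.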
                insert into marc_indexers_leader(p_00_04, p_05, p_06, p_07, p_08, p_09, p_10, p_11, p_12_16, p_17, p_18, p_19, p_20, p_21, p_22, marc_id)
                    (select substring(value from 1 for 5)  p_00_04,
                            substring(value from 6 for 1)  p_05,
                            substring(value from 7 for 1)  p_06,
                            substring(value from 8 for 1)  p_07,
                            substring(value from 9 for 1)  p_08,
                            substring(value from 10 for 1) p_09,
                            substring(value from 11 for 1) p_10,
                            substring(value from 12 for 1) p_11,
                            substring(value from 13 for 5) p_12_16,
                            substring(value from 18 for 1) p_17,
                            substring(value from 19 for 1) p_18,
                            substring(value from 20 for 1) p_19,
                            substring(value from 21 for 1) p_20,
                            substring(value from 22 for 1) p_21,
                            substring(value from 23 for 1) p_22,
                            marc_id
                     from (select replace(lower(trim(both '"' from value::text)), ' ', '#') as value,
                                  rec.id                                                       marc_id
                             From jsonb_each(rec.content) x
                            where key = 'leader') y);
            end loop;
    end fill_marc_indexers_simple_block
$$;

The second one is when data is changed in the original table where MARC records are stored. A database trigger must be used to track all changes consistently and reflect them in the new tables.

create or replace function fill_in_marc_indexers(p_marc_id uuid, p_marc_content jsonb)
    returns void
as
$fill_in_marc_indexers$
begin
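    -- Same parsing logic as in the initial-load block above, applied to a single
    -- record identified by p_marc_id with content p_marc_content.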
    insert into marc_indexers (field_no, ind1, ind2, subfield_no, value, marc_id)
        (with vals as (select value
                       from jsonb_array_elements((
                           select value
                           From jsonb_each(p_marc_content) x
                           where key = 'fields')) y),
              fields as (select x.key as field_no, x.value as field_value
                         From vals,
                              jsonb_each(vals.value) x),
              fields_subfields as (
                  select field_no,
                         trim(field_value ->> 'ind1'::text) ind1,
                         trim(field_value ->> 'ind2')       ind2,
                         field_value -> 'subfields'         subfields,
                         field_value
                  From fields),
              marc_raw as (
                  select fs.field_no, fs.ind1, fs.ind2, fs.field_value, null::text subfield_no, null::text subfield_value
                  from fields_subfields fs
                  where subfields is null
                  union all
                  select fs.field_no, fs.ind1, fs.ind2, fs.field_value, subfs.key::text subfield_no, subfs.value::text subfield_value
                  From fields_subfields fs,
                       jsonb_array_elements(fs.subfields) sx,
                       jsonb_each(sx.value) subfs
                  where subfields is not null),
              marc as (
                  select m.field_no,
                         CASE WHEN ind1 IS NULL or ind1 = '' THEN '#' ELSE ind1 END                            as ind1,
                         CASE WHEN ind2 IS NULL or ind2 = '' THEN '#' ELSE ind2 END                            as ind2,
                         CASE WHEN subfield_no IS NULL or trim(subfield_no) = '' THEN '0' ELSE subfield_no END as subfield_no,
                         trim(both '"' from coalesce(subfield_value, field_value::text))                       as value
                  From marc_raw m)
         select distinct lower(field_no) field_no, ind1, ind2, subfield_no, value, p_marc_id marc_id
         from marc);
    --
    insert into marc_indexers_leader(p_00_04, p_05, p_06, p_07, p_08, p_09, p_10, p_11, p_12_16, p_17,
                                                                    p_18, p_19, p_20, p_21, p_22, marc_id)
        (select substring(value from 1 for 5)  p_00_04,
                substring(value from 6 for 1)  p_05,
                substring(value from 7 for 1)  p_06,
                substring(value from 8 for 1)  p_07,
                substring(value from 9 for 1)  p_08,
                substring(value from 10 for 1) p_09,
                substring(value from 11 for 1) p_10,
                substring(value from 12 for 1) p_11,
                substring(value from 13 for 5) p_12_16,
                substring(value from 18 for 1) p_17,
                substring(value from 19 for 1) p_18,
                substring(value from 20 for 1) p_19,
                substring(value from 21 for 1) p_20,
                substring(value from 22 for 1) p_21,
                substring(value from 23 for 1) p_22,
                marc_id
         from (   select replace(lower(trim(both '"' from value::text)), ' ', '#') as value,
                         p_marc_id                                                    marc_id
                  From jsonb_each(p_marc_content) x
                  where key = 'leader') y);
end;
$fill_in_marc_indexers$ language plpgsql;

create or replace function insert_marc_indexers()
    returns trigger
as
$insert_marc_indexers$
begin
    if (TG_OP = 'UPDATE') then
        delete from marc_indexers where marc_id = NEW.id;
        delete from marc_indexers_leader where marc_id = NEW.id;
    end if;

    perform fill_in_marc_indexers(NEW.id, NEW.content);
    return NEW;
end;
$insert_marc_indexers$ language plpgsql;

create or replace function delete_marc_indexers()
    returns trigger
as
$delete_marc_indexers$
begin
    if (TG_OP = 'DELETE') then
        delete from marc_indexers where marc_id = OLD.id;
        delete from marc_indexers_leader where marc_id = OLD.id;
    end if;
    return OLD;
end;
$delete_marc_indexers$ language plpgsql;

DROP TRIGGER IF EXISTS process_marc_records_lb_delete_trigger ON marc_records_lb CASCADE;
DROP TRIGGER IF EXISTS process_marc_records_lb_insert_update_trigger ON marc_records_lb CASCADE;

create trigger process_marc_records_lb_delete_trigger
before delete on marc_records_lb for each row execute function delete_marc_indexers();

create trigger process_marc_records_lb_insert_update_trigger
after insert or update on marc_records_lb for each row execute function insert_marc_indexers();

Indexing data

Since every “Field no” has its own defined format, separate indexes can be created for the partitions that contain significant volumes of data. Examples of indexes:

create index idx_marc_indexers_marc_id on marc_indexers(marc_id);
create index idx_marc_indexers_leader_marc_id on marc_indexers_leader(marc_id);

create index idx_marc_indexers_001_vals on marc_indexers_001 using gin(regexp_match(value, '\W*((?:[a-zA-Z]*)|(?:[0-9]*))\W*((?:[a-zA-Z]*)|(?:[0-9]*))\W*((?:[a-zA-Z]*)|(?:[0-9]*))\W*') array_ops);
create index idx_marc_indexers_035_vals on marc_indexers_035 using gin(regexp_match(value, '\(*([a-zA-Z]*?)\)*(\w+)') array_ops);
create index idx_marc_indexers_005_vals on marc_indexers_005(value);
create index idx_marc_indexers_999_vals on marc_indexers_999(value);

Based on test runs, it is not necessary to create indexes for partitions with fewer than 1.5 million rows, because even sequential scans complete in less than half a second on such partitions.
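
A simple way to decide which partitions deserve a dedicated index is to look at their row counts. The sketch below relies on the planner statistics in pg_stat_user_tables (run ANALYZE first if the estimates may be stale):

-- Sketch: list marc_indexers partitions by estimated row count to decide
-- where dedicated value indexes are worthwhile.
select relname, n_live_tup
from pg_stat_user_tables
where relname like 'marc_indexers_%'
order by n_live_tup desc;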

Querying data

The SQL query to retrieve data can be generated on the fly based on the criteria provided. Since partition naming is fixed and each partition can be treated as a separate table, it is quite easy to generate SQL queries like

select m.content
from marc_indexers_005 i005
         inner join marc_indexers_338 i338 on i005.marc_id = i338.marc_id
         inner join marc_records_lb m on i338.marc_id = m.id
where i005.value between ? and ?
  and i338.subfield_no = ?
  and i338.value = ?;

This query serves a request like: “As batch processing staff, I want to be able to search certain MARC fixed fields as dates, for export purposes. For instance, I want to search for a week's worth of records based on the first eight characters of the 005, and for tag 338 with a subfield ‘a’ that equals "volume".”
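
A search on leader positions (functional requirement 4) can be served the same way through the dedicated leader table; a minimal sketch:

-- Sketch: retrieve records whose leader has type of record 'a' (position 06)
-- and bibliographic level 'm' (position 07).
select m.content
from marc_indexers_leader l
         inner join marc_records_lb m on l.marc_id = m.id
where l.p_06 = 'a'
  and l.p_07 = 'm';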

Metadata dictionary

It is worth having some sort of metadata dictionary to store information about the indexes created for partitions, as well as the expressions that could be used to generate conditions for the WHERE clause. A possible shape is sketched below.
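
The table and column names in this sketch are illustrative only and are not part of the current schema:

-- Sketch: one row per field_no describing whether a dedicated index exists for the
-- corresponding partition and which expression to use in generated WHERE clauses.
create table marc_indexers_metadata
(
    field_no         varchar(3) primary key,
    indexed          boolean    not null default false,
    index_expression text
);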

REST API

Based on the requirements, only two new endpoints should be created. The first one retrieves a single MARC record based on an exact-match search. The second one retrieves records that conform to the provided criteria. Since criteria expressions can be quite complex, it makes sense to use POST requests with the criteria provided as JSON in the request body. Streaming of the HTTP response should also be supported for the second endpoint to allow retrieving huge volumes of data. Another advantage of streaming is that sorting is not required for such queries, which also improves performance.

Entities returned by these endpoints must be MARC records without any additional metadata or internal fields.