Jira links

ARCH-30

Overview

Reporting is needed that lets a cataloger see updates to linked Bibliographic fields and Authorities over a period of time. See Phase 2: Reporting updates - Technical Designs and Decisions - FOLIO Wiki.

As noted in the related document (Technical approach for update MARC Bib fields controlled by related Authority records - Technical Designs and Decisions - FOLIO Wiki), entities are updated in two phases: first the entity itself is updated, then its links. Hence there are two types of reports, each retrieved by specified filter criteria.

Scope

This document covers implementation details of the interaction with the Reporting mechanism to retrieve insights on updated authorities and the linkage of bibliographic record fields for a specific date range.

  • Interaction with the Reporting mechanism via API: pushing statistic data (and its structure) and requesting a report by specified filter criteria
  • Filter criteria to form the proper report on a user's request
  • Interaction with mod-exporter to generate a .csv report, and the potential changes required
  • Interaction of mod-exporter with mod-entities-links to retrieve data for the report
  • Variations of output (.csv file or an error when processing the request)
  • Data flow overview

Solution

Flyover diagram

Discussed questions

  1. Format of input statistics data for authorities and linked bibs
  2. Success / error result payload on each step
  3. Actual component details for the Reporting mechanism
  4. Determine interaction via new API of mod-data-export and the Reporting mechanism
  5. Determine retention policy for reports generated and stored in cloud storage
  6. Determine sufficient / insufficient filtering criteria
  7. Define actual changeset on mod-data-export and the Reporting mechanism
  8. API to check the state of .csv report generation by exportJobId


Implementation details

Reporting mechanism overview

The Reporting mechanism is an abstraction that can be implemented, for instance, by the Metadata Provider of the mod-source-record-manager API. Under the hood, journal records are stored in a Postgres DB (journal table).

The current API allows fetching journal records by jobExecutionId.

Option 1: extend the API to fetch journal records by filter criteria (type, date range, status, action, heading fields updated or retained).

Concern: statistic data of linked entities would be aggregated separately and retrieved from a module with a different role.

The recommendation is to consider the alternative of storing linkage stats data within mod-entities-links.

Alternative: build the reporting mechanism functionality on the side of mod-entities-links.

The rationale is that the data on bibliographic records linked to authorities, and the actual mapping, reside in the DB of mod-entities-links. Therefore, saving and retrieving stats data at the actual source should reduce performance and storage costs.

Input Data

Schema of reporting stats data on authority and linked bibliographic records updates.


For bulk updates of bibliographic records, writes to the Reporting mechanism are performed in batches to reduce I/O cost on the database side when persisting new records.
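The batching of writes could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the names chunked, write_in_batches and persist_batch, and the default batch size of 100, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def write_in_batches(
    records: Iterable[dict],
    persist_batch: Callable[[List[dict]], None],  # hypothetical: one DB round trip per call
    batch_size: int = 100,  # assumed value, to be tuned by performance tests
) -> int:
    """Persist records batch by batch and return the number of write calls made."""
    calls = 0
    for batch in chunked(records, batch_size):
        persist_batch(batch)
        calls += 1
    return calls
```

With 250 records and a batch size of 100, persist_batch is called three times instead of 250, which is the I/O saving the paragraph above refers to.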

Database layer

The database layer is represented by two tables.

links_stats_authorities - stores records of events that modify authorities (authority update, creation, deletion). It must be initialized with the data that already exists in instance_link.

(NOTE: Avoid duplicating data on potential re-initializations of the module.)

It contains status, error cause, entity id, date created, action type, data import job id, and authority heading (see the fields from the table described in the related issue).

links_stats_bibs - stores records of events that modify the linkage of instances (links of linked bibliographic records added or changed (updated / removed)).
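The note about re-initialization amounts to making the initial load idempotent. A minimal in-memory sketch of that idea is shown here; the function name initialize_stats and the field names id and authorityId are hypothetical, and a plain dict stands in for the stats table.

```python
from typing import Dict, Iterable


def initialize_stats(store: Dict[str, dict], existing_links: Iterable[dict]) -> int:
    """Seed the stats store from existing link rows.

    Rows whose id is already present are skipped, so running the
    initialization again adds nothing. Returns the number of rows added.
    """
    added = 0
    for link in existing_links:
        key = link["id"]  # hypothetical unique key of the source link row
        if key not in store:
            store[key] = {"entity_id": link["authorityId"]}
            added += 1
    return added
```

In PostgreSQL the same effect is commonly achieved with a unique constraint on the key combined with INSERT ... ON CONFLICT DO NOTHING.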

This ER diagram represents the data model for the module.

  • authority_data - this table contains the data of authority records stored in the mod-source-record-storage that are or were referenced by the inventory instances. The fields in this table contain only the authority data needed for audit trail and reporting.
  • instance_data - this table contains the data of instance records that have references to the authority records in mod-source-record-storage. The fields in this table contain only the instance data needed for audit trail and reporting.
  • instance_authority_link -  this table is an association between instances and authorities representing many-to-many relations between these entities.      
  • authority_data_stats - this table acts as an audit trail for authority_data and contains changes applied to the authority_data records.
  • instance_data_stats - this table contains records that reflect changes applied to instance records as a consequence of an authority record update.

Reports generation

To expose report generation, changes are required in mod-data-export-spring and mod-data-export-worker so that a new export Job can be created to retrieve authority control reporting data.

After receiving a data export Job request via REST API, mod-data-export-spring stores the Job request and sends export commands to mod-data-export-worker via Kafka.

The mod-data-export-worker module retrieves data from other Folio modules via their REST API and adds it to CSV file parts. Once all required data is retrieved, the worker uploads the file parts to the Folio Object storage.

Once the file is uploaded, the module generates a download URL and sends it back to mod-data-export-spring via Kafka. 
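The worker's retrieve-and-append step described above can be sketched as follows. This is an illustration only: fetch_page is a hypothetical callable standing in for a paged GET /links/stats request (offset, limit), the column set is assumed, and the upload to object storage is left out.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical column set; the actual report columns are not fixed by this page.
COLUMNS = ["type", "entity_id", "status", "errorCause", "actionType", "ts"]


def iter_stats(fetch_page: Callable[[int, int], List[Dict]], limit: int) -> Iterator[Dict]:
    """Page through stats records with offset/limit until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            return
        offset += limit


def build_csv_parts(
    fetch_page: Callable[[int, int], List[Dict]],
    limit: int = 500,
    rows_per_part: int = 1000,
) -> List[str]:
    """Return CSV file parts as strings; only the first part carries the header row."""
    parts: List[str] = []
    buffer, writer, rows = None, None, 0
    for record in iter_stats(fetch_page, limit):
        if buffer is None:  # start a new file part
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
            if not parts:
                writer.writeheader()
        writer.writerow(record)  # missing columns are written as empty cells
        rows += 1
        if rows == rows_per_part:
            parts.append(buffer.getvalue())
            buffer, rows = None, 0
    if buffer is not None:  # flush the last, partially filled part
        parts.append(buffer.getvalue())
    return parts
```

Concatenating the parts in order yields one valid CSV file, which is why the header is written only once.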


Efforts: mod-data-export-worker requires changes to support interaction with mod-entities-links.

Addressed Points

  • Support reporting that allows a cataloger to know 
    • Which Authority headings (1XX) have changed over a period of time
      By interacting with /links/stats API (mod-entities-links) it is feasible to query by params startDate, endDate, updatedByTag
    • Which Linked Bibliographic fields failed to update (including reason why) when linked Authority 1XX/010 $a updated over a period of time
      By interacting with /links/stats API (mod-entities-links) it is feasible to query by params startDate, endDate, updatedByTag=1XX,010$a , status=error, type=bib
    • When an authority heading (1XX) is not linked to any bib field over a period of time
      ! This is NOT feasible by interacting only with the mod-entities-links schema: only linked authorities are stored there, so it is not possible to know which authorities exist outside of mod-entities-links. The proposal is therefore to use mod-search to find such authority records. The prerequisite is the MSEARCH-485 spike, because one more field, representing the number of instance records linked to a particular authority record, must be added to the index structure.
  • These reports may be accessible from Inventory app or Authority app or Export Manager or Data export (depends on technical discussion) 
    The universal point of access is supposed to be mod-data-export-spring, which interacts with mod-entities-links (Reporting), while mod-data-export-worker handles retention of .csv reports in cloud storage.

  • These reports may be available as a csv export 
    Feasible using the functionality described in Data export by using Spring Batch (aka Export Manager) - Technical Designs and Decisions - FOLIO Wiki
  • In addition Authority app may have additional facets/filters to allow a user
    • To filter Authority records based on whether the record is linked to a bib field/record or not. The prerequisite is the MSEARCH-485 spike, because one more field, representing the number of instance records linked to a particular authority record, must be added to the index structure.


Reporting mechanism API

POST /links/stats (alternative: consuming via Kafka inventory.authority.bib.stats topic)

request body or Events structure

in stats
{
    "type": "authority", /* or "instance"; indicates the type of entity */
    "entity_id": "<UUID>", /* id of the instance or authority */
    "diJobId": "<UUID>", /* data import job id. Nullable, as updates can be made manually outside the DI process */
    "linkedAuthorityId": "<UUID>", /* id of the authority within which the linked bibs are updated */
    "authorityStatsId": "<UUID>", /* id of the record from the links_stats_authorities table. Used to correlate bibs stats with authority stats */
    "status": "fail", /* or "success"; result of the action (e.g. update, delete) */
    "errorCause": "failed to update authority - Illegal subfield $y of tag 1XX value", /* describes the reason for the error on the action */
    "actionType": "UPDATE|DELETE|CREATE", /* type of the event */
    "linkageOn": "010$a", /* indicates linking on a specific field and subfield for type "instance". (Alternatives to be considered) */
    "tenant": "diku", /* tenant id */
    "ts": "1625832841003" /* timestamp of the event */
}
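A producer of this payload could build and validate it along the following lines. This is a hedged sketch: the field names are taken from the structure above, while the class name StatsEvent, the to_json method and the choice of which fields are optional (apart from diJobId, which the structure calls nullable) are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

ENTITY_TYPES = {"authority", "instance"}
STATUSES = {"success", "fail"}
ACTION_TYPES = {"UPDATE", "DELETE", "CREATE"}


@dataclass
class StatsEvent:
    type: str
    entity_id: str
    status: str
    actionType: str
    tenant: str
    ts: str
    diJobId: Optional[str] = None  # nullable: manual updates have no data import job
    linkedAuthorityId: Optional[str] = None  # assumed optional
    authorityStatsId: Optional[str] = None  # assumed optional
    errorCause: Optional[str] = None  # assumed optional
    linkageOn: Optional[str] = None  # assumed optional

    def __post_init__(self) -> None:
        # Reject values outside the enumerations listed in the structure above.
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown type: {self.type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.actionType not in ACTION_TYPES:
            raise ValueError(f"unknown actionType: {self.actionType}")

    def to_json(self) -> str:
        """Serialize to the request body / event structure; unset fields become null."""
        return json.dumps(asdict(self))
```

Validating at construction time keeps malformed events out of both the REST path and the Kafka alternative, since both would serialize the same object.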



GET /links/stats

parameter name   values                  description
type             authority|instance      Entity type to retrieve. If omitted, both authorities and instances stats are returned.
status           success|fail            Status of the finished action (update, create or remove). If omitted, any status is returned.
startDate        date format
endDate          date format
actionType       create|update|delete    Action performed on linkage for bibs (in case of type instance) or on authorities.
linkedOn         string (formatted)      Linkage by specific fields and subfields. Can be set as follows: 1XX,010$a,any,none. If omitted, all not linked entities should be returned.
offset           integer
limit            integer

Examples:

Fetch first 500 unlinked authorities by specified date range:

GET /links/stats?limit=500&type=authority&linkedOn=none&startDate=today&endDate=today

Get all bibliographic records that failed to update for a specific date range:

GET /links/stats?type=instance&status=fail&startDate=05/07/24&endDate=11/07/24

Get all authorities whose heading was changed over a period of time:

GET /links/stats?type=authority&linkedOn=1XX&actionType=update&startDate=05/07/24&endDate=11/07/24
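Requests like the ones above could be assembled by a small helper on the client side. The sketch below is an assumption about how a caller might do this: stats_query and ALLOWED_PARAMS are hypothetical names, and the parameter list simply mirrors the GET /links/stats table. Note that urlencode percent-encodes characters such as / and $, so 010$a is sent as 010%24a.

```python
from urllib.parse import urlencode

# Parameter names taken from the GET /links/stats table.
ALLOWED_PARAMS = (
    "type", "status", "startDate", "endDate",
    "actionType", "linkedOn", "offset", "limit",
)


def stats_query(**params) -> str:
    """Build a relative GET /links/stats URL from keyword filters.

    Unknown parameter names are rejected; parameters set to None are
    left out, which the API treats as "no filter on this field".
    """
    unknown = set(params) - set(ALLOWED_PARAMS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    present = {name: value for name, value in params.items() if value is not None}
    query = urlencode(present)
    return "/links/stats" + ("?" + query if query else "")
```

For example, stats_query(type="instance", status="fail") returns /links/stats?type=instance&status=fail.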

Delivery Plan

  1. mod-entities-links modifications (accept stats and query stats API)
  2. mod-data-export-worker modifications (interaction with mod-entities-links to retrieve report data by query)
  3. mod-data-export-spring modifications (to create export job for authority control reporting data)
  4. Apply changes to mod-search, ui-quick-marc and mod-quick-marc according to the research results
  5. Write performance tests for generating multiple reports. Research and document bottlenecks
  6. Apply optimizations after perf analysis (if required)

LOE

  • mod-entities-links
    • implement ingestion and storing of statistic data (2 sprints)
    • implement filter query API to fetch corresponding stats records (1 sprint)
  • mod-data-export-worker
    • implement integration with mod-entities-links API to fetch data (1 sprint)
    • implement functionality to store and provision .csv file report (1 sprint)
  • mod-data-export-spring
    • implement API to get report by linked data query (1 sprint)
  • Scope not yet defined (about 3 sprints; covers modifications to mod-search and the corresponding UI changes)

Rationale

To address the requirements for authority control reporting, it was decided to ingest and store linking statistic data within mod-entities-links and to provide a corresponding API for obtaining that data by extensible filter criteria. This is expected to keep the data consistent and to give better report generation performance from the actual data, without the need to interact with another module, such as mod-source-record-manager, to aggregate data that is not specific to it.




3 Comments

  1. From Pavlo's review, the following fields need to be added:

    • original_heading
    • new_heading
    • authority_lccn
    • authority_source_file_name
    • link_status
    • updater_name

  2. Yet to be defined:

    • Facets/filters functionality from the functional requirement description

    1. Blind headings report. Requirement "When an authority heading (1XX) is not linked to any bib field over a period of time".
       1.1 Could you describe the expectations for the period of time in more detail?
       1.2 Does it mean that if an authority record was linked to a bib a week ago and the user requests this report for today, the user will see this authority in the report?
           A. This report will only return authority headings not linked to a bib record at the time the user selects the Blind headings report option.

    2. Requirement "Which Authority headings (1XX) have changed over a period of time".
       2.1 If an authority heading was changed 3 times, will we see 3 rows in the report? A. Yes.
       2.2 What if only the LCCN changed; what will the row in the report look like? A. Let's not worry about LCCN (010 $a) updates for this report. We only worry about 1XX updates.
       2.3 Could it be that an identifier is not an LCCN? A. Yes. It can be 001 or 010 $a, based on Authority source file logic.

    3. Requirement "Which Linked Bibliographic fields failed to update (including reason why) when linked Authority 1XX/010 $a updated over a period of time".
       3.1 Could we have Instance ID instead of HRID? A. Yes, use the Inventory UUID.
       3.2 Title - is it the Instance title? A. Yes, and it should be included in the report.