Spike: Make records with bad data available to the harvester

As established in the previous spike, once batch processing has failed, each record of the batch is processed separately, and the errors are collected (logged) along with the bad data. This spike investigates how to make the bad data and errors available to the harvester.

There are at least three possible ways to show errors and bad data to the harvester: include the errors directly in the response, save them to the DB and show them on request on the UI side, or save them to S3 and give the harvester a link.

Add errors to the response

Errors can be included in the response after the last record, at the end of the <ListRecords> tag:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:oai-identifier="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <responseDate>2023-03-15T09:13:19Z</responseDate>
    <request verb="ListRecords" metadataPrefix="marc21_withholdings">http://folio.org/oai</request>
    <ListRecords>
        <record>
            <header>
                <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
                <datestamp>2023-03-14T12:52:38Z</datestamp>
                <setSpec>all</setSpec>
            </header>
            <metadata>
                <marc:record>
                    <marc:leader>17171cjm a2201609 a 4500</marc:leader>                 
                    
                   ...

                    <marc:datafield tag="999" ind1="f" ind2="f">
                        <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
                    </marc:datafield>
                </marc:record>
            </metadata>
        </record>
        <errors>
            <error>Statistical code ID not found: {UUID}</error>
            <error>Error log</error>
            ...
        </errors>
    </ListRecords>
</OAI-PMH>

In this case, the errors are appended to <ListRecords> in an additional <errors> tag and listed one by one without any reference to the instance ID. Optionally, the instance ID can be included in the <error> tag.
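The appending step described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the module's actual implementation. The function name `append_errors` and the `instanceId` attribute are assumptions made for this sketch, and the <errors> and <error> elements come from the proposal above rather than from the standard OAI-PMH schema.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
# Serialize OAI-PMH elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", OAI_NS)


def append_errors(response_xml, errors):
    """Append one <errors> element at the end of <ListRecords>.

    `errors` is a list of (instance_id, message) pairs; instance_id may be
    None when the error should be listed without a reference to an instance.
    """
    root = ET.fromstring(response_xml)
    list_records = root.find(f"{{{OAI_NS}}}ListRecords")
    errors_el = ET.SubElement(list_records, f"{{{OAI_NS}}}errors")
    for instance_id, message in errors:
        error_el = ET.SubElement(errors_el, f"{{{OAI_NS}}}error")
        if instance_id is not None:
            # Hypothetical attribute name; the proposal only says the
            # instance ID can optionally be included in <error>.
            error_el.set("instanceId", instance_id)
        error_el.text = message
    return ET.tostring(root, encoding="unicode")


# Reduced response used only for illustration.
RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record/></ListRecords>"
    "</OAI-PMH>"
)

result = append_errors(
    RESPONSE,
    [
        ("f9f85e47-4673-452b-9ab5-a7f06ec12a7a", "Statistical code ID not found: {UUID}"),
        (None, "Error log"),
    ],
)
```

Because all errors are written in one place, they have to be held until the last record of the response has been built.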

Alternatively, errors can be included inside each record, right after its <metadata> tag:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:oai-identifier="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <responseDate>2023-03-15T09:13:19Z</responseDate>
    <request verb="ListRecords" metadataPrefix="marc21_withholdings">http://folio.org/oai</request>
    <ListRecords>
        <record>
            <header>
                <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
                <datestamp>2023-03-14T12:52:38Z</datestamp>
                <setSpec>all</setSpec>
            </header>
            <metadata>
                <marc:record>
                    <marc:leader>17171cjm a2201609 a 4500</marc:leader>                 
                    
                   ...

                    <marc:datafield tag="999" ind1="f" ind2="f">
                        <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
                    </marc:datafield>
                </marc:record>
            </metadata>
            <errors>
                <error>Statistical code ID not found: {UUID}</error>
                <error>Some other error</error>
                ...
            </errors>
        </record>
        <record>
            <header>
                <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
                <datestamp>2023-03-14T12:52:38Z</datestamp>
                <setSpec>all</setSpec>
            </header>
            <metadata>
                <marc:record>
                    <marc:leader>17171cjm a2201609 a 4500</marc:leader>                 
                    
                   ...

                    <marc:datafield tag="999" ind1="f" ind2="f">
                        <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
                    </marc:datafield>
                </marc:record>
            </metadata>
            <errors>
                <error>The following control character cannot be parsed: {character}</error>
                <error>Some other error</error>
                ...
            </errors>
        </record>
        ...
    </ListRecords>
</OAI-PMH>

In this case, each error is bound to a specific instance, and this relation is clearly visible to the user.

This approach requires keeping each error locally or in memory until the response is built, which may affect performance when there are many errors and a lot of bad data. In addition, every record in the batch has to be processed and validated separately, which increases the response time. However, it is unlikely that a single instance would contain that much bad data or that many error logs.
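The per-record variant can be sketched as follows. This is only an illustration under stated assumptions: `find_errors` is a hypothetical validator that reports control characters in the leader, the records are plain dictionaries, the identifiers are placeholders, and the MARC conversion is elided.

```python
import xml.etree.ElementTree as ET


def find_errors(raw):
    """Hypothetical check: report each control character found in the leader."""
    return [
        f"The following control character cannot be parsed: {ch!r}"
        for ch in raw["leader"]
        if ord(ch) < 0x20
    ]


def build_list_records(batch):
    """Process every record separately and attach its errors after <metadata>."""
    list_records = ET.Element("ListRecords")
    for raw in batch:
        record = ET.SubElement(list_records, "record")
        header = ET.SubElement(record, "header")
        ET.SubElement(header, "identifier").text = raw["identifier"]
        ET.SubElement(record, "metadata")  # MARC conversion elided in this sketch
        messages = find_errors(raw)
        if messages:
            # <errors> is added only for records that actually have errors.
            errors = ET.SubElement(record, "errors")
            for message in messages:
                ET.SubElement(errors, "error").text = message
    return list_records


# Placeholder identifiers; the second leader contains a control character.
batch = [
    {"identifier": "oai:folio.org:diku/record-1", "leader": "17171cjm a2201609 a 4500"},
    {"identifier": "oai:folio.org:diku/record-2", "leader": "17171cjm\x01a2201609 a 4500"},
]
list_records = build_list_records(batch)
```

Here the errors of one record are held only while that record is being built, so memory use depends on the number of errors per instance rather than per response.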

Advantages:

  • Errors are always shown to the user immediately, which may be useful in some cases
  • No work is needed on the UI side

Disadvantages:

  • The response can grow significantly if there are many errors
  • Keeping the errors until the response is built requires additional memory and affects performance

Save errors into DB and display them on UI side

This approach requires adding a new table to the oai-pmh schema, or reusing an existing one, to store the errors. It also requires a separate endpoint to retrieve the error records from the DB and a new UI page to display them. This UI page may look like the following:

[Figure: mockup of the UI page that displays the errors]

Advantages:

  • A separate thread can be used to save the errors, minimizing the impact on performance
  • Errors are shown to the user only on request

Disadvantages:

  • Requires work on the UI side
  • Requires implementing a new endpoint on the back-end side
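The DB approach, including the separate thread mentioned among the advantages, can be sketched as follows. This is a minimal illustration, not the module's design: SQLite stands in for the real database, and the `ErrorStore` class, the `errors` table, and its columns are assumptions made for this sketch. `find_by_request` plays the role of the query behind the new endpoint.

```python
import queue
import sqlite3
import threading


class ErrorStore:
    """Saves errors on a background thread and returns them on request."""

    def __init__(self, path=":memory:"):
        # One shared connection guarded by a lock, so both threads can use it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS errors ("
            "request_id TEXT NOT NULL, instance_id TEXT, message TEXT NOT NULL)"
        )
        self._conn.commit()
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def save(self, request_id, instance_id, message):
        """Called by the harvesting code; returns without waiting for the DB."""
        self._queue.put((request_id, instance_id, message))

    def _drain(self):
        # Background worker: writes queued errors one by one.
        while True:
            item = self._queue.get()
            try:
                with self._lock:
                    self._conn.execute("INSERT INTO errors VALUES (?, ?, ?)", item)
                    self._conn.commit()
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued error has been written."""
        self._queue.join()

    def find_by_request(self, request_id):
        """Return (instance_id, message) rows for one harvest request."""
        with self._lock:
            return self._conn.execute(
                "SELECT instance_id, message FROM errors "
                "WHERE request_id = ? ORDER BY rowid",
                (request_id,),
            ).fetchall()


store = ErrorStore()
store.save("request-1", "f9f85e47-4673-452b-9ab5-a7f06ec12a7a",
           "Statistical code ID not found: {UUID}")
store.save("request-2", None, "Error log")
store.flush()
rows = store.find_by_request("request-1")
```

The queue decouples record processing from DB writes, so a slow insert delays only the background thread and not the response.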

Save errors into S3

In this approach, each error is written to a local file as it occurs, and the file is stored in S3 at the end of the harvest. A link to the S3 file lets the user access the error logs and bad data as soon as the harvest is done, for example through the UI or directly via the link. The link can be appended to the response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:oai-identifier="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <responseDate>2023-03-15T09:13:19Z</responseDate>
    <request verb="ListRecords" metadataPrefix="marc21_withholdings">http://folio.org/oai</request>
    <ListRecords>
        <record>
            <header>
                <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
                <datestamp>2023-03-14T12:52:38Z</datestamp>
                <setSpec>all</setSpec>
            </header>
            <metadata>
                <marc:record>
                    <marc:leader>17171cjm a2201609 a 4500</marc:leader>                 
                    
                   ...

                    <marc:datafield tag="999" ind1="f" ind2="f">
                        <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
                    </marc:datafield>
                </marc:record>
            </metadata>
        </record>
        <record>
            <header>
                <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
                <datestamp>2023-03-14T12:52:38Z</datestamp>
                <setSpec>all</setSpec>
            </header>
            <metadata>
                <marc:record>
                    <marc:leader>17171cjm a2201609 a 4500</marc:leader>                 
                    
                   ...

                    <marc:datafield tag="999" ind1="f" ind2="f">
                        <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
                    </marc:datafield>
                </marc:record>
            </metadata>
        </record>
        ...
    </ListRecords>
    <errors>
        {link to S3 file storage}
    </errors>
</OAI-PMH>

Advantages:

  • Minimal impact on performance
  • Error logs are accessible immediately after harvesting

Disadvantages:

  • UI work may be needed if using the link directly is not suitable for the user
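The S3 approach can be sketched as follows. This is a minimal illustration under stated assumptions: the `ErrorLogFile` class is hypothetical, and the upload step is injected as a callable so that the sketch runs without an S3 client. `fake_upload` and the `s3://example-bucket/` link are placeholders, and a real implementation would call an actual S3 client there.

```python
import os
import tempfile


class ErrorLogFile:
    """Collects errors in a local file and uploads it when the harvest ends."""

    def __init__(self, request_id, upload):
        self._request_id = request_id
        # Hypothetical callable: upload(local_path, key) -> link to the stored file.
        self._upload = upload
        fd, self._path = tempfile.mkstemp(prefix="oai-errors-", suffix=".log")
        self._file = os.fdopen(fd, "w", encoding="utf-8")
        self.count = 0

    def add(self, instance_id, message):
        """Write one error line to the local file as soon as it occurs."""
        self._file.write(f"{instance_id}\t{message}\n")
        self.count += 1

    def finish(self):
        """Close the file, upload it if it has errors, and delete the local copy.

        Returns the link produced by the uploader, or None if nothing was logged.
        """
        self._file.close()
        try:
            if self.count == 0:
                return None
            return self._upload(self._path, f"{self._request_id}.log")
        finally:
            os.remove(self._path)


uploaded = {}


def fake_upload(local_path, key):
    """Stand-in for an S3 upload: keeps the content in memory, returns a placeholder link."""
    with open(local_path, encoding="utf-8") as f:
        uploaded[key] = f.read()
    return f"s3://example-bucket/{key}"


log = ErrorLogFile("request-1", fake_upload)
log.add("f9f85e47-4673-452b-9ab5-a7f06ec12a7a", "Statistical code ID not found: {UUID}")
link = log.finish()
```

Only the returned link needs to be placed in the response, so the response size does not depend on the number of errors.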