Recommended Maximum File Sizes and Configuration

Please work with your systems office or hosting provider to ensure these configurations are in place. These recommendations will be re-evaluated after further performance, reliability, and stability work.

For additional information, please see the following:

  • MARC Bibs: Performance Testing Data Import (Nolana)
  • Folijet - Morning Glory Snapshot Performance testing
  • MARC Bibs: Performance Rancher testing (Lotus)
  • MARC Bibs: Performance Testing Data Import (Juniper/Kiwi)
  • MARC Bibs: Performance Testing Data Import (Iris)
  • MARC Bibs: Profiles used for PTF Testing
  • MARC Bibs: PTF Test results
  • Folijet - Orchid Performance Testing Results

Maximum File Size (Morning Glory): no background activity, no concurrent import jobs, single tenant

  • CREATE (SRS MARC, Instances, Holdings, Items): 100,000 MARC Bib records (details)
  • UPDATE (SRS MARC, Instances, Holdings, Items): 50,000 MARC Bib records
  • For results with background activities (check-in/check-out), concurrent jobs, and multi-tenant setups, please refer to the results from PTF

Import Statistics with recommended values (no background activity, no concurrent import jobs, single tenant)


MARC Records | Orchid (2 instances) | Morning Glory (2 instances) | Morning Glory (1 instance) | Lotus (2 instances)  | Juniper and Kiwi (2 instances)
             | CREATE / UPDATE      | CREATE / UPDATE             | CREATE / UPDATE            | CREATE / UPDATE      | CREATE / UPDATE
5,000        | 5 min / 7 min        | 11 min / 13 min             | 16 min / -                 | 8 min / 13 min       | 15 minutes / 20-30 minutes
10,000       | 14 min / 16 min      | 22 min / 31 min             | 40 min / -                 | 19 min / 25 min      | - / -
30,000       | 46 min / -           | 1h 31min / 1h 4min          | - / -                      | 45 min / 1h 36min    | - / -
50,000       | 54 min / 59 min      | 1h 42min / 1h 34min         | 2h 17min / -               | 1h 21min / 1h 51min  | 2h 30min+ / 11+ hours
100,000      | 3h 37min (ongoing investigation) / - | 2h 20min / 2h 49min | 3h 21min / -         | 3h 10min / 4h 10min  | - / -

Maximum File Sizes (Juniper and Kiwi)

  • CREATE Import (SRS MARC, Instances, Holdings, Items): 50,000 MARC records max
  • UPDATE Import: 5,000 MARC records max

Import Statistics, with background activity

  • With 5 users performing check-in/check-out background activity
  • And concurrent imports by different tenants:

    • Check-in/check-out time increases by 50%-100%, depending on the number of concurrent users.
    • Data import takes 2x longer to complete.
MARC Records | CREATE Duration | UPDATE Duration
1,000        | 10 minutes      | 10 minutes
5,000        | 30 minutes      | 40-60 minutes (depending on complexity of profile)
25,000       | 2-3 hours       | 8+ hours
50,000       | 5+ hours        | 22+ hours

Import Statistics, no background activity, no concurrent import jobs

MARC Records | CREATE Duration | UPDATE Duration
1,000        | 5 minutes       | 5 minutes
5,000        | 15 minutes      | 20-30 minutes (depending on complexity of profile)
25,000       | 60-80 minutes   | 4+ hours
50,000       | 150+ minutes    | 11+ hours
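For capacity planning, the durations above can be turned into a rough sustained-throughput figure. A minimal sketch, using the 50,000-record CREATE row (150+ minutes) and integer arithmetic:

```shell
# Back-of-envelope CREATE throughput estimate (hypothetical helper, not part
# of FOLIO): 50,000 records in 150 minutes, per the table above.
records=50000
minutes=150
rate=$((records / minutes))
echo "sustained CREATE throughput: ~${rate} records/min"
```

Since UPDATE jobs run several times slower than CREATE jobs at the same file size, the same arithmetic on the UPDATE column gives a much lower planning figure.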

Note: For Lotus, a 500,000-record CREATE took 15h 37min (details)

Key Settings and Configurations


mod-data-import

Properties related to file upload that should be set via mod-configuration are described in the documentation: https://github.com/folio-org/mod-data-import#module-properties-to-set-up-at-mod-configuration

System property that can be adjusted        | Default value
file.processing.marc.raw.buffer.chunk.size  | 50
file.processing.marc.json.buffer.chunk.size | 50
file.processing.marc.xml.buffer.chunk.size  | 10
file.processing.edifact.buffer.chunk.size   | 10

For releases prior to Kiwi it is recommended to lower the file.processing.buffer.chunk.size property in order to prevent mod-source-record-storage from crashing with an OOM error during an Update import of 5,000 records; lowering it further allows an Update import of 10,000 records.
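The chunk-size settings are JVM system properties, so one way to apply them is through the module's Java options. A hypothetical launch fragment (not an official script; values shown are the documented defaults):

```shell
# Hypothetical deployment fragment for mod-data-import: pass the chunk-size
# system properties from the table above via JAVA_OPTS. Adjust the values
# per the guidance for your release.
JAVA_OPTS="-Dfile.processing.marc.raw.buffer.chunk.size=50 \
-Dfile.processing.marc.json.buffer.chunk.size=50 \
-Dfile.processing.marc.xml.buffer.chunk.size=10 \
-Dfile.processing.edifact.buffer.chunk.size=10"
echo "$JAVA_OPTS"
```

How JAVA_OPTS reaches the JVM depends on your container image and orchestration; consult your hosting provider.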

mod-source-record-manager 

System property that can be adjusted                         | Default value | Comment
srm.kafka.RawMarcChunkConsumer.instancesNumber               | 1             |
srm.kafka.StoredMarcChunkConsumer.instancesNumber            | 1             |
srm.kafka.DataImportConsumersVerticle.instancesNumber        | 1             |
srm.kafka.DataImportJournalConsumersVerticle.instancesNumber | 1             |
srm.kafka.RawChunksKafkaHandler.maxDistributionNum           | 100           |
srm.kafka.CreatedRecordsKafkaHandler.maxDistributionNum      | 100           |
srm.kafka.DataImportConsumer.loadLimit                       | 5             |
security.protocol                                            | PLAINTEXT     |
ssl.protocol                                                 | TLSv1.2       |
ssl.key.password                                             | -             |
ssl.keystore.location                                        | -             |
ssl.keystore.password                                        | -             |
ssl.keystore.type                                            | JKS           |
ssl.truststore.location                                      | -             |
ssl.truststore.password                                      | -             |
ssl.truststore.type                                          | JKS           |
di.flow.control.max.simultaneous.records                     | 50            | Defines how many records can be processed by the system simultaneously. (Morning Glory)
di.flow.control.records.threshold                            | 25            | Defines how many records from the previous batch must be processed before new records are sent to the pipeline. (Morning Glory)
di.flow.control.enable                                       | true          | Allows single record imports to be processed while larger file imports are being processed. (Morning Glory)
di.flow.control.reset.state.interval                         | PT5M          | Time between resets of the flow control state; by default it triggers every 5 minutes. (Morning Glory)
kafka.producer.batch.size                                    | 16*1024       | The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This setting controls the default batch size in bytes. (Nolana)
kafka.producer.linger.ms                                     | 0             | The producer groups together any records that arrive between request transmissions into a single batched request. Normally this occurs only under load, when records arrive faster than they can be sent out. In some circumstances, however, the client may want to reduce the number of requests even under moderate load; this setting accomplishes that by adding a small amount of artificial delay: rather than immediately sending out a record, the producer will wait for up to the given delay so that other records can be sent and the sends batched together. (Nolana)
kafka.producer.request.timeout.ms                            | 30000         | Controls the maximum amount of time the client will wait for the response to a request. If the response is not received before the timeout elapses, the client will resend the request if necessary, or fail the request if retries are exhausted. (Nolana)
kafka.producer.delivery.timeout.ms                           | 120000        | An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record may be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. Must be equal to or larger than kafka.producer.linger.ms + kafka.producer.request.timeout.ms. (Nolana)
kafka.producer.retry.backoff.ms                              | 100           | Time to wait before attempting to retry a failed request to a given topic partition. (Nolana)
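The delivery-timeout constraint is easy to violate when tuning linger.ms and request.timeout.ms independently. A minimal sketch of a pre-deployment sanity check, using the documented default values:

```shell
# Sanity check (sketch) for the documented Kafka producer constraint:
#   kafka.producer.delivery.timeout.ms >=
#     kafka.producer.linger.ms + kafka.producer.request.timeout.ms
linger_ms=0                # kafka.producer.linger.ms
request_timeout_ms=30000   # kafka.producer.request.timeout.ms
delivery_timeout_ms=120000 # kafka.producer.delivery.timeout.ms
if [ "$delivery_timeout_ms" -ge "$((linger_ms + request_timeout_ms))" ]; then
  echo "producer timeout settings: consistent"
else
  echo "producer timeout settings: delivery.timeout.ms is too small" >&2
fi
```

With the defaults the check passes (120000 >= 0 + 30000); rerun it with your own values whenever any of the three properties is changed.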


mod-source-record-storage

System property that can be adjusted                | Default value
srs.kafka.ParsedMarcChunkConsumer.instancesNumber   | 1
srs.kafka.DataImportConsumer.instancesNumber        | 1
srs.kafka.ParsedRecordChunksKafkaHandler.maxDistributionNum | 100
srs.kafka.DataImportConsumer.loadLimit              | 5
srs.kafka.DataImportConsumerVerticle.maxDistributionNum | 100
srs.kafka.ParsedMarcChunkConsumer.loadLimit         | 5
security.protocol                                   | PLAINTEXT
ssl.protocol                                        | TLSv1.2
ssl.key.password                                    | -
ssl.keystore.location                               | -
ssl.keystore.password                               | -
ssl.keystore.type                                   | JKS
ssl.truststore.location                             | -
ssl.truststore.password                             | -
ssl.truststore.type                                 | JKS


mod-inventory

System property that can be adjusted                                   | Default value
inventory.kafka.DataImportConsumerVerticle.instancesNumber             | 3
inventory.kafka.MarcBibInstanceHridSetConsumerVerticle.instancesNumber | 3
inventory.kafka.DataImportConsumer.loadLimit                           | 5
inventory.kafka.DataImportConsumerVerticle.maxDistributionNumber       | 100
inventory.kafka.MarcBibInstanceHridSetConsumer.loadLimit               | 5
security.protocol                                                      | PLAINTEXT
ssl.protocol                                                           | TLSv1.2
ssl.key.password                                                       | -
ssl.keystore.location                                                  | -
ssl.keystore.password                                                  | -
ssl.keystore.type                                                      | JKS
ssl.truststore.location                                                | -
ssl.truststore.password                                                | -
ssl.truststore.type                                                    | JKS


mod-invoice

System property that can be adjusted                               | Default value | Comment
mod.invoice.kafka.DataImportConsumerVerticle.instancesNumber       | 1             |
mod.invoice.kafka.DataImportConsumer.loadLimit                     | 5             |
mod.invoice.kafka.DataImportConsumerVerticle.maxDistributionNumber | 100           |
dataimport.consumer.verticle.mandatory                             | false         | Should be set to true in order to fail the module at start-up if data import Kafka consumer creation failed
security.protocol                                                  | PLAINTEXT     |
ssl.protocol                                                       | TLSv1.2       |
ssl.key.password                                                   | -             |
ssl.keystore.location                                              | -             |
ssl.keystore.password                                              | -             |
ssl.keystore.type                                                  | JKS           |
ssl.truststore.location                                            | -             |
ssl.truststore.password                                            | -             |
ssl.truststore.type                                                | JKS           |

More information on configuring modules involved in Data Import process can be found at this link.

  • Kafka (MSK):
    • auto.create.topics.enable = true
    • log.retention.minutes = 70-300
    • Broker’s disk space: 300 GB
    • 4 brokers, replication factor = 3, DI topics partition = 2
    • Version 2.7 is 30% faster than version 1.6

JVM Settings 

module                    | running instances | CPU  | memory | memoryReservation | maxMetaspaceSize | Xmx
mod-data-import           | 1                 | 256  | 2048   | 1844              | 512m             | 1292m
mod-source-record-manager | 2                 | 1024 | 2048   | 1844              | 800m             | 1024m
mod-source-record-storage | 2                 | 1024 | 1536   | 1440              | 512m             | 1024m
mod-di-converter-storage  | 2                 | 128  | 1024   | 896               | 128m             | 768m
mod-inventory             | 2                 | 1024 | 2880   | 2592              | 512m             | 1814m
mod-inventory-storage     | 2                 | 1024 | 2208   | 1952              | 512m             | 1440m
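The maxMetaspaceSize and Xmx columns translate directly into JVM flags. A hypothetical fragment for mod-inventory (how the options variable is consumed depends on your container image):

```shell
# Hypothetical JVM options for mod-inventory, taken from the Xmx and
# maxMetaspaceSize columns of the table above.
JAVA_OPTIONS="-Xmx1814m -XX:MaxMetaspaceSize=512m"
echo "$JAVA_OPTIONS"
```

Note that Xmx plus metaspace must fit within the container's memory limit (2880 for mod-inventory above), leaving headroom for thread stacks and off-heap buffers.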

Key Improvements Delivered for Orchid:

  • Added indexes for the 035 and 010 MARC fields
  • Tested with an increased database connection pool (DB_MAXPOOLSIZE) for mod-source-record-manager and mod-source-record-storage. The default is 15; under high loads (>5,000 records for update) it should be increased to 30 for both modules.

    A new DB_CONNECTION_TIMEOUT environment variable is also provided and should be set to 40 for mod-source-record-storage:

    {
      "name": "DB_CONNECTION_TIMEOUT",
      "value": "40"
    }
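The same tuning can be sketched as plain container environment variables (variable names as used above; DB_MAXPOOLSIZE raised from the default of 15 to 30 for high-load updates):

```shell
# Sketch: pool-size and connection-timeout tuning for
# mod-source-record-manager / mod-source-record-storage as env variables.
export DB_MAXPOOLSIZE=30
export DB_CONNECTION_TIMEOUT=40
echo "DB_MAXPOOLSIZE=${DB_MAXPOOLSIZE} DB_CONNECTION_TIMEOUT=${DB_CONNECTION_TIMEOUT}"
```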

Key Improvements Delivered for Morning Glory:

  • Implemented Flow control mechanism that allows OCLC single record imports to be prioritized over imports of large files
  • Alleviated eventloop blocking during batch save of records
  • Reduced conversion of parsed content into a marc4j record
  • Improved performance of sql query for retrieving log entries
  • Fixed race conditions during mapping parameter initialization
  • Removed JobExecutionCache
  • Optimised functionality to create Instance records

Key Improvements Delivered for Kiwi/Lotus:

  • Performance
    • Improve speed and number of records for CREATEs and UPDATEs
    • Reduce Kafka message size: messages on 6 Data Import topics exceeded 200 KB each
    • Use less CPU for (de)serialization
    • Significantly improve speed of loading UI Landing page
    • Improve and optimize slow DB queries
  • Remove the events_cache topic and replace it with DB deduplication solutions (it caused spikes in mod-inventory and the brokers, leading to instability and unpredictable outcomes)
  • Improve resiliency and error handling to prevent import job from getting stuck

Key Improvements Delivered as of Iris Hotfix 3 and Juniper:

  • No more accidental creation of duplicate records
  • Ability to consistently make repeated updates on existing records
  • Multiple instances of the Data Import landing page now consume 10% CPU, instead of maxing out at 100%
  • When idle, mod-source-record-manager consumes 50% CPU instead of 90%
  • Vertical scaling improves performance, especially with Iris HF2 Data Import modules