Overview

The goal of the test is to assess the performance of check-in/check-out (CICO), data import, bulk edit, and instance reindexing scenarios after decreasing the number of partitions in Kafka topics from 10 or 50 down to 2.

The topics setup is listed in the Appendix, under Topics setup.

Ticket: PERF-400

Summary

Comparison of the load test results showed no significant degradation in CICO response times with the decreased number of partitions.
Resource consumption of the server, database, and Kafka instances also did not change with the decreased number of partitions.
Data Import durations are within about +/- 10% of the baseline results on average.
Bulk Edits job durations are within seconds of the baseline results, except for Items 10k records, which finished 3 min 31 s faster.
Reindexing was faster and consumed less CPU with 2 partitions in every topic than with 10 or 50 partitions in several search topics.

Test Runs

Test #	Test Conditions	Duration	Load generator size (recommended)	Load generator Memory(GiB) (recommended)	Notes
1.	Check-in/Check-out with 8, 20, 25 users	30 min	t3.medium	3	ncp3 environment; topics setup
2.	Data import with 5K, 25K, 50K, 100K Create imports	x			

Results

Check In, Check Out

Response Times (CICO)

10/50 Partitions

2 Partitions

Response time comparison CICO + DI

Transaction	Response time, 95th percentile (10/50 Partitions)	Response time, 95th percentile (2 Partitions)	Degradation, s	Degradation, %
Check-in Controller 8us	0.555 s	0.576 s	0.021 s	4%
Check-out Controller 8us	0.899 s	0.897 s	-0.002 s	0%
Check-in Controller 20us	0.554 s	0.592 s	0.038 s	7%
Check-out Controller 20us	0.852 s	0.860 s	0.008 s	1%
Check-in Controller 25us	0.553 s	0.589 s	0.036 s	7%
Check-out Controller 25us	0.834 s	0.894 s	0.060 s	7%
Data import 5K	2 min 8 s	2 min 24 s	16 s	13%
Data import 25K	10 min 41 s	11 min 27 s	46 s	7%
Data import 50K	21 min 11 s	19 min 16 s	-115 s	-9%
Data import 100K	42 min 35 s	40 min 24 s	-131 s	-5%
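
The degradation columns above can be reproduced from the two measured values. The following is a minimal sketch, assuming degradation is defined as the 2-partition value minus the 10/50-partition value, with the percentage taken relative to the 10/50-partition baseline. The helper names `to_seconds` and `degradation` are hypothetical and not part of the test tooling.

```python
import re

def to_seconds(text: str) -> float:
    """Convert values such as '0.555 s', '2 min 8 s' or '2m 24 s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(min|m|s)", text):
        total += float(value) * (60 if unit in ("min", "m") else 1)
    return total

def degradation(baseline: str, verification: str) -> tuple[float, float]:
    """Return (difference in seconds, difference in % of the baseline)."""
    base = to_seconds(baseline)
    delta = to_seconds(verification) - base
    return delta, 100 * delta / base

# Data import 50K row: 21 min 11 s with 10/50 partitions, 19 min 16 s with 2
delta, pct = degradation("21 min 11 s", "19 min 16 s")
print(f"{delta:+.0f} s, {pct:+.1f}%")  # prints: -115 s, -9.0%
```

A negative result means the run with 2 partitions was faster than the baseline.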

Instance CPU Utilization (CICO)

10/50 Partitions

2 Partitions

Service CPU Utilization (CICO)

10/50 Partitions

2 Partitions

Memory Utilization (CICO)

10/50 Partitions

mod-inventory-storage memory usage increased from 57 to 65 during the test. This behaviour was also reproduced for the tests with 2 partitions.

2 Partitions

RDS CPU Utilization (CICO)

10/50 Partitions

2 Partitions

There is a 5% increase in CPU utilization for the 25-user test, but this behaviour was not reproduced during retesting. It may have been caused by external factors.

RDS DB connections (CICO)

10/50 Partitions

2 Partitions

Kafka CPU load (CICO)

10/50 Partitions

2 Partitions

RDS DB connections (DI)

2 Partitions

Kafka CPU (DI)

10/50 Partitions

2 Partitions

Database Load (DI)

2 Partitions

Bulk Edits

Jobs Duration comparison

Transaction	Job duration (10/50 Partitions)	Job duration (2 Partitions)	Degradation, s	Degradation, %
Users 1000 records	43 s	44 s	1 s	2%
Users 2500 records	1 min 49 s	1 min 45 s	-4 s	-4%
Items 1000 records	3 min 8 s	2 min 49 s	-19 s	-10%
Items 10k records	22 min 44 s	19 min 13 s	-3 min 31 s	-15%
Holdings 1000 records	1 min 52 s	1 min 51 s	-1 s	-0.8%
Holdings 10k records	11 min 14 s	10 min 46 s	-28 s	-4%
Holdings 10k + Items 10k + Users 2500	10 min 39 s / 19 min 10 s / 1 min 42 s	10 min 31 s / 18 min 56 s / 1 min 40 s	-8 s / -14 s / -2 s	-1% / -1% / -2%
Holdings 1000 + Items 1000 + Users 1000	1 min 47 s / 2 min 43 s / 42 s	1 min 44 s / 2 min 41 s / 41 s	-3 s / -2 s / -1 s	-3% / -1% / -2%

Instance CPU Utilization (Bulk Edit)

Service CPU Utilization (Bulk Edit)

Memory Utilization (Bulk Edit)

RDS CPU Utilization (Bulk Edit)

RDS DB connections (Bulk Edit)

Kafka CPU (Bulk Edit)

Database Load (Bulk Edit)

Reindexing

Reindexing of instances with the flag recreateIndex = true

Duration Comparison

	10/50 Partitions	2 Partitions
Reindexing 1	14hr 30m (1/29 4:30 UTC - 1/29 19:00 UTC)
Reindexing 2		11hr 20m (2/3 21:30 UTC - 2/4 8:50 UTC)
Reindexing 3		11hr (2/4 18:30 UTC - 2/5 5:30 UTC)
Reindexing 4	13hr 45m (2/19 17:30 UTC - 2/20 7:15 UTC)
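
The durations in the table follow directly from the UTC start and end times. The sketch below recomputes them; because the table gives only month and day, the year is an assumption (2023 is used as a placeholder), and the function name `run_duration` is hypothetical.

```python
from datetime import datetime

ASSUMED_YEAR = 2023  # assumption: the table lists month/day only

def run_duration(start: str, end: str, year: int = ASSUMED_YEAR) -> str:
    """Return the elapsed time between two 'M/D H:MM' UTC timestamps."""
    fmt = "%Y %m/%d %H:%M"
    begin = datetime.strptime(f"{year} {start}", fmt)
    finish = datetime.strptime(f"{year} {end}", fmt)
    minutes = int((finish - begin).total_seconds() // 60)
    return f"{minutes // 60}hr {minutes % 60}m"

runs = {
    "Reindexing 1 (10/50 partitions)": ("1/29 4:30", "1/29 19:00"),
    "Reindexing 2 (2 partitions)": ("2/3 21:30", "2/4 8:50"),
    "Reindexing 3 (2 partitions)": ("2/4 18:30", "2/5 5:30"),
    "Reindexing 4 (10/50 partitions)": ("2/19 17:30", "2/20 7:15"),
}
for name, (start, end) in runs.items():
    print(name, run_duration(start, end))
```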

Reindexing 1

OpenSearch Graphs (Reindexing 1 with 10/50 Partitions)

Indexing Data Rate graph 1 shows a spike of up to 126K/min lasting about 8 hours, followed by a tail of another 6+ hours.

Indexing Data Rate graph 2 shows the tail end of the reindexing where the indexing rate drops to below 5K/min and drags out until 19:00.

Service CPU Utilization (Reindexing 10/50 Partitions)

The main services, mod-inventory-storage and mod-search, show little if any activity after 12:30 (through 19:00), when the indexing rate dropped from 60K+/min to a few operations/min.

Database CPU Utilization (Reindexing 10/50 Partitions)

The database CPU utilization graph also shows the same story. Note that the time here is +5 UTC

MSK Cluster CPU Utilizations (Reindexing 10/50 Partitions)

Little activity is visible: after the initial short-lived spikes, the CPU shows only a bump for the entire duration.

Reindexing 2

OpenSearch Graphs (Reindexing 2 w/2 Partitions)

Compared to Reindexing 1, where topics had 10 or 50 partitions each, Reindexing 2 (all topics with 2 partitions) had a shorter burst of high-rate indexing, between 32K and 65K operations per minute, which is about half the rate of Reindexing 1 (between 64K and 120K). The initial high-rate burst lasted about 4 hours compared to 9 hours, and the long tail of lower indexing rate lasted about 8 hours compared to 6.5 hours in Reindexing 1.

Service CPU Utilization (Reindexing 2 w/2 Partitions)

Service CPU utilization is also lower for mod-search and mod-inventory-storage (well below 50%) compared to Reindexing 1 (around 30%-50%).

Database CPU Utilization (Reindexing 2 w/2 Partitions)

Database CPU utilization is also lower in Reindexing 2 during the initial 4-hour burst: about 5% on average compared to 10-13% on average in Reindexing 1.

MSK Cluster CPU Utilizations (Reindexing 2 w/2 Partitions)

The MSK cluster CPU utilization graph also does not show a noticeable bump in utilization during the initial 4-hour burst.

Reindexing 3

OpenSearch Graphs (Reindexing 3 w/2 Partitions)

Reindexing 3, with all topics having 2 partitions, has the same indexing rate pattern as Reindexing 2: the initial high-rate burst lasted about 4 hours, followed by a 7-hour tail.

Service CPU Utilization (Reindexing 3 w/2 Partitions)

Reindexing 3's service CPU utilization graph also shows the same behavior as in Reindexing 2, with the main modules' CPU utilization well below 50%.

Database CPU Utilization (Reindexing 3 w/2 Partitions)

Reindexing 3's DB CPU utilization graph also shows the same behavior as in Reindexing 2, with DB CPU utilization around 7%.

MSK Cluster CPU Utilization (Reindexing 3 w/2 Partitions)

Reindexing 3's MSK cluster CPU utilization graph also shows the same behavior as in Reindexing 2, with a little bump throughout the reindexing.

Appendix

Infrastructure

PTF environment: ncp3

10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
2 db.r6.xlarge database instances: one writer and one reader
MSK ptf-kakfa-3 [ kafka configurations]
- 4 kafka.m5.2xlarge brokers in 2 zones
- Apache Kafka version 2.8.0
- EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3

Modules memory and CPU parameters:

Modules	Version	Task Definition	Running Tasks	CPU	Memory (Soft/Hard limits)	MaxMetaspaceSize	Xmx
mod-data-import	2.6.2	4	1	256	1844/2048	512	1292
mod-data-import-cs	1.15.1	1	2	128	896/1024	128	768
mod-source-record-storage	5.5.2	4	2	1024	1440/1536	512	908
mod-source-record-manager	3.5.6	4	2	1024	3688/4096	512	2048
mod-inventory	19.0.2	7	2	1024	2592/2880	512	1814
mod-inventory-storage	25.0.3	3	2	1024	1952/2208	512	1440
mod-quick-marc	2.5.0	3	1	128	2176/2288	512	1664
okapi	4.14.7	1	3	1024	1440/1684	512	922
mod-feesfines	18.1.1	3	2	128	896/1024	128	768
mod-patron-blocks	1.7.1	4	2	1024	896/1024	128	768
mod-pubsub	2.7.0	4	2	1024	1440/1536	512	922
mod-authtoken	2.12.0	3	2	512	1152/1440	128	922
mod-circulation-storage	15.0.2	3	2	1024	1440/1536	512	896
mod-circulation	23.3.2	3	2	1024	896/1024	128	768
mod-configuration	5.9.0	3	2	128	896/1024	128	768
mod-users	19.0.0	4	2	128	896/1024	128	768
mod-remote-storage	1.7.1	3	2	128	1692/1872	512	1178

Topics setup

Topic	Partitions number (Baseline)	Partitions number (Verification)
ncp3.fs09000000.circulation.check-in	10	2
ncp3.fs09000000.circulation.loan	10	2
ncp3.fs09000000.circulation.request	10	2
ncp3.fs09000000.data-export.job.command	50	2
ncp3.fs09000000.data-export.job.update	50	2
ncp3.fs09000000.inventory.async-migration	50	2
ncp3.fs09000000.inventory.authority	50	2
ncp3.fs09000000.inventory.bound-with	50	2
ncp3.fs09000000.inventory.holdings-record	50	2
ncp3.fs09000000.inventory.instance	50	2
ncp3.fs09000000.inventory.instance-contribution	50	2
ncp3.fs09000000.inventory.item	50	2
ncp3.fs09000000.search.instance-contributor	50	2
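
To give a sense of the scale of the change, the sketch below totals the partitions for the topics listed in the table above. It covers only those 13 topics, so the figures are not a count for the whole cluster. The dictionary name `BASELINE` and the shortened topic keys (without the `ncp3.fs09000000.` prefix) are conventions of this example only.

```python
# Baseline partition counts, keyed by topic name without the
# "ncp3.fs09000000." prefix (shortened here for readability).
BASELINE = {
    "circulation.check-in": 10,
    "circulation.loan": 10,
    "circulation.request": 10,
    "data-export.job.command": 50,
    "data-export.job.update": 50,
    "inventory.async-migration": 50,
    "inventory.authority": 50,
    "inventory.bound-with": 50,
    "inventory.holdings-record": 50,
    "inventory.instance": 50,
    "inventory.instance-contribution": 50,
    "inventory.item": 50,
    "search.instance-contributor": 50,
}
VERIFICATION_PARTITIONS = 2  # every listed topic has 2 partitions in the verification runs

baseline_total = sum(BASELINE.values())
verification_total = VERIFICATION_PARTITIONS * len(BASELINE)
print(f"{len(BASELINE)} topics: {baseline_total} -> {verification_total} partitions")
# prints: 13 topics: 530 -> 26 partitions
```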

Methodology/Approach

1. Run the commands needed to return the database to its initial state before each test run, then wait several minutes before starting the test.
2. Conduct CICO load tests with different numbers of users, plus data import.
3. Change the number of partitions from 10/50 to 2 for all necessary topics.
4. Repeat the tests.
5. Compare the test results.
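
This report does not describe how the partition counts were changed. Kafka does not allow the partition count of an existing topic to be reduced, so one possible approach is to delete each topic and recreate it with 2 partitions. The sketch below only generates the corresponding `kafka-topics.sh` commands and does not run them. The function name `recreate_commands` and the bootstrap address are hypothetical. The replication factor of 3 matches `default.replication.factor` from the Infrastructure section. Deleting a topic discards its messages, and with `auto.create.topics.enable=true` a client may recreate a deleted topic with the broker's default partition count before the explicit create runs.

```python
def recreate_commands(
    topic: str,
    partitions: int = 2,
    replication_factor: int = 3,  # matches default.replication.factor above
    bootstrap: str = "localhost:9092",  # hypothetical broker address
) -> list[str]:
    """Build kafka-topics.sh commands that delete and recreate one topic."""
    base = f"kafka-topics.sh --bootstrap-server {bootstrap}"
    return [
        f"{base} --delete --topic {topic}",
        f"{base} --create --topic {topic} "
        f"--partitions {partitions} --replication-factor {replication_factor}",
    ]

# Print the commands for two of the topics from the Topics setup table
for name in ("ncp3.fs09000000.circulation.check-in",
             "ncp3.fs09000000.inventory.instance"):
    for command in recreate_commands(name):
        print(command)
```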

Grafana dashboard

CICO tests, 10/50 partitions: http://carrier-io.int.folio.ebsco.com/grafana/d/elIt9zCnz/jmeter-performance-test-copy?orgId=1&var-percentile=95&var-test_type=baseline&var-test=circulation_checkInCheckOut_nolana&var-env=int&var-grouping=1s&var-low_limit=250&var-high_limit=750&var-db_name=jmeter&var-sampler_type=All&from=1673881085512&to=1673890405928

CICO tests, 2 partitions: http://carrier-io.int.folio.ebsco.com/grafana/d/elIt9zCnz/jmeter-performance-test-copy?orgId=1&var-percentile=95&var-test_type=baseline&var-test=circulation_checkInCheckOut_nolana&var-env=int&var-grouping=1s&var-low_limit=250&var-high_limit=750&var-db_name=jmeter&var-sampler_type=All&from=1674032625543&to=1674047619658

Please note that dashboards expire 6 weeks after the test run.
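
Since the dashboards expire, it can help to record the time window each link covers. The `from` and `to` query parameters in the URLs above are Unix timestamps in milliseconds, and the sketch below decodes them to UTC. The function name `dashboard_window` is hypothetical, and the example URL is the first link above with the other query parameters trimmed for brevity.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

def dashboard_window(url: str) -> tuple[datetime, datetime]:
    """Return the (start, end) UTC times encoded in a Grafana dashboard URL."""
    query = parse_qs(urlparse(url).query)
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])
    # Drop the millisecond part; second precision is enough for a test window
    start = datetime.fromtimestamp(start_ms // 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(end_ms // 1000, tz=timezone.utc)
    return start, end

# First link above (CICO tests, 10/50 partitions), other parameters trimmed
url = ("http://carrier-io.int.folio.ebsco.com/grafana/d/elIt9zCnz/"
       "jmeter-performance-test-copy?orgId=1"
       "&from=1673881085512&to=1673890405928")
start, end = dashboard_window(url)
print(start.isoformat(), "to", end.isoformat())
# prints: 2023-01-16T14:58:05+00:00 to 2023-01-16T17:33:25+00:00
```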