Users need to perform full-text search over instances, items, holdings. The following requirements should be fulfilled:
- Assumption: For over 20+ millions of records search request should be executed in less than 500ms
- Multi-tenant search (e.g. for consortia) should allow to share metadata view permission from one tenant to another
- Search should be efficient in terms of multi-tenancy, namely the amount of data in each tenant shouldn't affect request execution time for other tenants
- Result counts should be precise for any query and amount of data
- Auto-complete should be based on titles. The response time should be less than 100ms
- Facets (e.g. how many Print-books are found for the search request) should be precise
- Assumption: Rich full-text functionality which will be beneficial for user, should be provided, namely:
- Stemming for words (e.g. find record with term "books" for query "book" )
- Stop-words (e.g. and. or for English language) should not affect relevancy
- Relevancy scoring should be based on TF-IDF frequencies, in order to provide the most relevant records at the top
- Didyoumean for input string spelling correction should be implemented and showed as tip if there is more significantly more relevant query
- All language depended features should support certain predefined list of languages
- There should be support for CQL queries from input string
Purpose of this page
The purpose of this page is to capture good ideas that have been discussed regarding introduction of a search engine, so that they would not be lost. In addition to that, since these are highly relevant to the possible blueprint item they could provide the beginning of that conversation and therefore this page will be linked to that list.
Main points of the data flow:
- The architecture guarantees fault-tolerance, as even if search microservice is down or cannot process the requests at the concrete moment of time the message will be consumed later when search microservice will be up
- The architecture doesn't slow down the performance of inventory, as sending of Kafka messages costs virtually nothing
- The at least one strategy is enough in such case, as the process of indexing in idempotent
- When the data is modified in the inventory service id of the modified resource and tenantId are sent in the Kafka message to the Kafka topic for this type of resources
In terms of overall architecture it is reasonable to have golden source (e.g. central reliable source of data) for each type of data. For now both inventory and SRS are sources of metadata information. Their roles should be clearly separated, as inventory has additional information (items, holdings), but SRS have actual marc files. Possibly, inventory could be the golden source of metadata information and SRS is storage for corresponding marc files.
In order to implement all the requirements, the new module should be implemented in Folio. Primarily, its responsibiliy should be to index instances, holdings and items, which are stored in inventory microservice.
The data store for it should be Elasticsearch, as it satisfies all the mentioned requirements and is purely linear scalable and therefore multi-tenancy friendly (see corresponding section below).
CQRS principle should be implemented: inventory should still be used for any write queries and get all by date range queries and should still be the golden source of this information. Contrariwise all full-text requests from UI should be processed only by this module for these type of data. Other data from various components could be also indexed and searched by this microservice, but every particular case should be considered separately, in terms of data size (e.g. more than 1 mln) and search type (e.g. only for full-text search). No transactional calls should be performed against search microservice as Elasticsearch is not ACID and have eventual consistency.
CQRS in such case doesn't mean that it will be the same API on okapi side for both reads and writes. In order to keep it simple UI will send query requests to search microservice and data modification requests to inventory microservice.
For providing of updates from other microservices to search microservice Apache Kafka should be used (see corresponding section below). On any change of data (e.g. instances, holdings, items are created/updated/deleted) in inventory storage, it should send Kafka message containing id of the modified resource and tenantId to the Kafka topic for this type of resources (e.g. inventory.instance.ids topic). Search microservice subscribes to the topic and when the message is received, search makes a GET request inventory for the data. If there would be requirements specific representation of data in Search or indexing instances along with its holding information in one index, it can be implemented by means of using database views while preparing response and embedding this information in the instances in Elasticsearch index.
The existing schema (json) from inventory can be used as the golden source of metadata for search microservice in order to build scalable mappings for indexes. However, in such case it should be expanded because there should be additional attributes which are relevant only for Elasticsearch.
Searching over linked entities fields
In order to search over linked entities fields (e.g. search for item barcode and find related instance) this fields should be embedded into the instances. If related entity is updated the instance should be reindexed. "Partially updating" the instance with the only new data (item/hodling) is impossible due to impossibility to maintain consistency of the data (considering multiple indexer module instances) in such case. For every change inside the inventory-storage (e.g. item is changed), the flow should be the following:
- Message with id and type should be sent to Kafka-topic.
- Indexer module should receive the message and find all related indexed entities ids (e.g. instance ids)
- These ids (instance ids) along with types should be sent as messages to the Kafka-topic
- Relations could be described in metadata json files
- Indexer module could be a part of search module or a separate module.
- There could be different topics for different kind of entities
All tenant data should be stored in following indices: instance, holding, item, one index per each entity type. Every index should have 32 shards (partitions), routing is done by tenantId, so the request for every tenant will go to one shard. In case of hash collisions, some tenants can share same shard. For each tenant aliases should be created for each index. So, for tenant with id = uuid1 and instance index, instance_uuid1_resource_search alias should be created which contain routing parameter and filter by tenantId field inside. All search and indexing operations should be done via aliases, search never uses index directly. So, when we search over some tenant, search is done over tenant alias like instance_uuid1_resource_search. Since alias contains routing and filtering parameters inside, search is done on tenant shard only (and hence only on node containing this shard), then filtering parameter is applied for case when multiple tenants are present in one shard. Every shard is replicated to 3 nodes, so if some node goes down, data is still available for searching and updating.
This approach provides horizontal scalability for search. Data is spread over cluster nodes, search queries are performed on different nodes. If tenant number becomes too big, we just add nodes to cluster and data will be rebalanced over all nodes.
Apache Kafka configuration
Existing Apache Kafka cluster, which is used for mod PubSub can be used. For every resource type there should be separate Kafka topic. In case of multiple search microservices are deployed, the same consumer group should be set for all of them. Id of the record should be used as partition key of the message.
The proposed solution provides out-of-the-box solution for ES aliases handling cross-tenant searching (for consortia) in case when permission of viewing tenant metadata is granted from one tenant to another. In order to search over several tenants their ES aliases should be used for request as string concatenation joined by comma.
Aliases should be automatically created for all tenants and should have filter for "tenant-id" and "routing" parameter set to tenant-id in it. For every entity _routing field should have value of tenant_id.
Elasticsearch in such case will use appropriate routing parameter (and therefore shards) and filters. If there is extremely huge tenant, separate index could be created for it and alias could be switched to it without restart, downtime or any code manipulations.
Reindex in case of data structure changes
In some cases indexes should be recreated and data should be reindexed by sending all ids of existing records from inventory to the Kafka topic. It shouldn't be ordinary operation and in best case scenario shouldn't happen in production.
No need to reindex data:
- New column is added and there were no record in database with these information
- Some column is deleted (if the aggregated fields are not implemented)
Reindex should be performed:
- The existing data, which was excluded from indexing should be indexed
- Some column is renamed or its data-type is changed