Problem statement
Users need to perform full-text search over instances, items, and holdings. The following requirements should be fulfilled:
- Assumption: for more than 20 million records, a search request should be executed in less than 500 ms
- Multi-tenant search (e.g. for consortia) should allow sharing metadata view permissions from one tenant to another
- Search should be efficient in terms of multi-tenancy, namely the amount of data in each tenant shouldn't affect request execution time for other tenants
- Result counts should be precise for any query and amount of data
- Auto-complete should be based on titles. The response time should be less than 100ms
- Facets (e.g. how many Print-books are found for the search request) should be precise
- Assumption: rich full-text functionality that is beneficial for users should be provided, namely:
- Stemming (e.g. find a record containing the term "books" for the query "book")
- Stop words (e.g. "and", "or" for English) should not affect relevancy
- Relevancy scoring should be based on TF-IDF frequencies, in order to return the most relevant records at the top
- "Did you mean" spelling correction for the input string should be implemented and shown as a tip when there is a significantly more relevant query
- All language-dependent features should support a predefined list of languages
- There should be support for CQL queries from input string
Purpose of this page
The purpose of this page is to capture good ideas that have been discussed regarding the introduction of a search engine, so that they are not lost. In addition, since these ideas are highly relevant to the possible blueprint item, they could provide the beginning of that conversation, and therefore this page will be linked to that list.
Proposed solution
Overall architecture
Main points of the data flow:
- The architecture guarantees fault tolerance: even if the search microservice is down or cannot process requests at a given moment, the message will be consumed later once the service is back up
- The architecture doesn't directly affect the performance of create/update/delete operations in Inventory, as sending Kafka messages costs virtually nothing
- The at-least-once delivery strategy is sufficient in this case, as the indexing process is idempotent
- When data is modified in the inventory service, the id of the modified resource and the tenantId are sent in a Kafka message to the topic for that resource type
Note:
In terms of overall architecture it is reasonable to have a golden source (i.e. a central, reliable source of data) for each type of data. For now both inventory and SRS are sources of metadata. Their roles should be clearly separated: inventory has additional information (items, holdings), while SRS has the actual MARC files. Possibly, inventory could be the golden source of metadata and SRS the storage for the corresponding MARC files.
Demo 2020-10-20
https://youtu.be/DW_P93DjZ94 Benevolence team presents the Elastic Search FOLIO modules developed for Shanghai Library from 3:00 - 18:43.
Some documents about this development are available on the Benevolence Projects page in the "Elastic Search" row.
Search microservice
In order to implement all the requirements, a new module should be implemented in FOLIO. Its primary responsibility should be to index instances, holdings and items, which are stored in the inventory microservice.
The data store for it should be Elasticsearch, as it satisfies all the mentioned requirements and is linearly scalable, and therefore multi-tenancy friendly (see the corresponding section below).
The CQRS principle should be applied: inventory should still be used for all write queries and get-all-by-date-range queries, and should remain the golden source of this information. Conversely, all full-text requests from the UI should be processed only by this module for these types of data. Other data from various components could also be indexed and searched by this microservice, but every particular case should be considered separately in terms of data size (e.g. more than 1 mln records) and search type (e.g. only for full-text search). No transactional calls should be performed against the search microservice, as Elasticsearch is not ACID-compliant and is eventually consistent.
CQRS in this case doesn't mean that there will be the same API on the Okapi side for both reads and writes. To keep it simple, the UI will send query requests to the search microservice and data modification requests to the inventory microservice.
Apache Kafka should be used to deliver updates from other microservices to the search microservice (see the corresponding section below). On any change of data (e.g. instances, holdings, items are created/updated/deleted) in inventory storage, it should send a Kafka message containing the id of the modified resource and the tenantId to the Kafka topic for this type of resource (e.g. the inventory.instance.ids topic). The search microservice subscribes to the topic, and when a message is received, it makes a GET request to inventory for the data. If there are requirements for a search-specific representation of the data, or for indexing instances along with their holdings information in one index, this can be implemented by using database views while preparing the response and embedding this information in the instances in the Elasticsearch index.
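The consumer-side flow described above can be sketched as follows. This is a minimal illustration, not a confirmed implementation: the endpoint path, Okapi URL, and helper names are assumptions; only the message shape (resource id plus tenantId) and the per-tenant index naming come from this page.

```python
# Sketch of the search-side flow: a Kafka message carries only the resource id
# and tenantId; the search module then fetches the full record from inventory
# and indexes it into the tenant's Elasticsearch index. Names are illustrative.

def inventory_fetch_url(okapi_url: str, resource_type: str, resource_id: str) -> str:
    """Build the GET request URL for fetching the changed record (assumed path)."""
    return f"{okapi_url}/{resource_type}-storage/{resource_type}s/{resource_id}"

def index_target(tenant_id: str, resource_type: str) -> str:
    """Per-tenant index naming, e.g. diku_instance (see the multi-tenancy section)."""
    return f"{tenant_id}_{resource_type}"

def handle_message(message: dict) -> dict:
    """Turn a consumed Kafka message into fetch/index instructions."""
    return {
        "fetch": inventory_fetch_url("http://okapi:9130", message["type"], message["id"]),
        "index": index_target(message["tenant"], message["type"]),
        "doc_id": message["id"],
    }

result = handle_message({"id": "abc-123", "tenant": "diku", "type": "instance"})
```

The actual GET request and the Elasticsearch indexing call are omitted here; the point is that the message itself stays small and the search module pulls the authoritative data from inventory, keeping indexing idempotent.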
The existing schema (JSON) from inventory can be used as the golden source of metadata for the search microservice in order to build scalable mappings for indexes. However, in that case it should be extended, because there will be additional attributes which are relevant only for Elasticsearch.
The Elasticsearch High Level REST Client should be used for making the requests: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high.html.
Inventory domain events
Inventory should send notifications whenever there is a change to a domain entity (instance/holding/item). For documentation of the architectural pattern please see: https://microservices.io/patterns/data/domain-event.html .
The pattern means that every time an instance/item is created/updated/removed, a message is posted to a Kafka topic:
- inventory.instance - for instances
- inventory.item - for items
- inventory.holdings-record - for holdings records
The event payload has following structure:
{
  "old": {...}, // the instance/item before update or delete
  "new": {...}, // the instance/item after update or create
  "type": "UPDATE|DELETE|CREATE|DELETE_ALL", // type of the event
  "tenant": "diku" // tenant name
}
The X-Okapi-Url and X-Okapi-Tenant headers are copied from the request to the Kafka message.
The Kafka partition key for all the events is the instance id (for items it is retrieved from the associated holdings record).
Domain events for items
The new and old records also include an instanceId property, at the same level as the other item properties defined in the schema:
{
  "instanceId": "<the instance id>",
  // all other properties defined in the schema
}
Domain events for delete all APIs
There are delete-all APIs for items, instances and holdings records. For such APIs a special domain event is issued:
- Partition key: 00000000-0000-0000-0000-000000000000
- Event payload:
{
  "type": "DELETE_ALL",
  "tenant": "<the tenant name>"
}
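The event envelope and partition-key rules above can be sketched together. The zero UUID for DELETE_ALL and the payload shape come from this page; the helper names are illustrative, not an actual API.

```python
# Sketch of the domain event envelope and partition-key rules described above.
# DELETE_ALL events use the zero UUID as partition key; all other events use
# the instance id (for items it is retrieved from the associated holdings record).

ZERO_UUID = "00000000-0000-0000-0000-000000000000"

def domain_event(old, new, event_type, tenant):
    """Build the standard event payload (old/new/type/tenant)."""
    return {"old": old, "new": new, "type": event_type, "tenant": tenant}

def partition_key(event, instance_id=None):
    """Select the Kafka partition key for an event."""
    if event["type"] == "DELETE_ALL":
        return ZERO_UUID
    return instance_id

update = domain_event({"id": "i1", "title": "A"}, {"id": "i1", "title": "B"}, "UPDATE", "diku")
wipe = {"type": "DELETE_ALL", "tenant": "diku"}
```

Using the instance id as partition key keeps all events for one instance on one partition, so a single consumer sees them in order; the fixed zero UUID for DELETE_ALL makes that special event easy to recognize.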
Searching over linked entities fields
In order to search over the fields of linked entities (e.g. search for an item barcode and find the related instance), these fields should be embedded into the instances. If a related entity is updated, the instance should be reindexed. "Partially updating" the instance with only the new data (item/holding) is not possible, because consistency of the data cannot be maintained in that case (considering multiple indexer module instances). For every change inside inventory-storage (e.g. an item is changed), the flow should be the following:
- A message with the id and type should be sent to the Kafka topic
- The indexer module should receive the message and find the ids of all related indexed entities (e.g. instance ids)
- These ids (instance ids), along with types, should be sent as messages to the Kafka topic
Notes:
- Relations could be described in metadata json files
- Indexer module could be a part of search module or a separate module.
- There could be different topics for different kinds of entities
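The second step of the flow above can be sketched as a pure function: given an item domain event, determine which instances must be reindexed. The handling of an item moving between instances (both old and new instance affected) is an assumption that follows from the old/new event payload, not something stated explicitly on this page.

```python
# Sketch of the indexer step: from an item domain event, collect the ids of
# all instances that must be reindexed. When an item moves between instances,
# both the old and the new instance are affected and both get reindexed.

def affected_instance_ids(event: dict) -> set:
    """Return the set of instance ids referenced by the old and new records."""
    ids = set()
    for record in (event.get("old"), event.get("new")):
        if record and record.get("instanceId"):
            ids.add(record["instanceId"])
    return ids

# An item moved from instance inst-A to instance inst-B:
moved = {
    "old": {"id": "item1", "instanceId": "inst-A"},
    "new": {"id": "item1", "instanceId": "inst-B"},
    "type": "UPDATE",
    "tenant": "diku",
}
```

Each returned instance id would then be sent as its own message to the Kafka topic, so reindexing stays idempotent per instance.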
Elasticsearch configuration
For Elasticsearch deployment there are the following options:
- AWS (or other cloud/on-premises) instances (e.g. AWS m5.xlarge with 500 Gb EBS type GP-3 for storage) with deployed Elasticsearch 7.10 docker container
- AWS Elasticsearch service
For all tenants, data should be stored in a tenantId_instance index. Every index should have 4 shards, with a replication factor of 2 set up for each shard. To make the system highly available for writing, the replication factor should be 3, so that if some node goes down, data is still available for searching and updating; but that would need more resources. Therefore a replication factor of 2 is a reasonable tradeoff.
This approach provides horizontal scalability for search. Data is spread over the cluster nodes, and search queries are executed on different nodes. If the number of tenants becomes too big, we just add nodes to the cluster and the data will be rebalanced across all nodes.
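The per-tenant index settings above can be sketched as the request body that would be sent on index creation. One assumption is labeled explicitly: "replication factor 2" is read here as two copies of each shard in total, i.e. number_of_replicas = 1 in Elasticsearch terms; if the page means two replicas in addition to the primary, the value would be 2 instead.

```python
# Sketch of the per-tenant index settings described above. Assumes
# "replication factor 2" means two copies of each shard in total
# (primary + 1 replica, i.e. number_of_replicas = 1 in ES terms).

def tenant_index_settings(tenant_id: str, resource_type: str = "instance") -> dict:
    """Build the index name and creation settings for one tenant."""
    return {
        "index_name": f"{tenant_id}_{resource_type}",
        "settings": {
            "number_of_shards": 4,       # fixed shard count per index
            "number_of_replicas": 1,     # 2 copies total; raise for HA writes
        },
    }

cfg = tenant_index_settings("diku")
```

In a real deployment this dict would be passed to the index-creation API (e.g. PUT /diku_instance with the settings as the body); it is shown as data here so the shard/replica numbers stay visible.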
Apache Kafka configuration
The existing Apache Kafka cluster, which is used for mod-pubsub, can be reused. For every resource type there should be a separate Kafka topic. If multiple instances of the search microservice are deployed, the same consumer group should be set for all of them. The id of the record should be used as the partition key of the message.
Multi-tenancy
The same Elasticsearch cluster should be used for all tenants, but a separate index should be created for each tenant. All requests must contain the standard OKAPI_TENANT header, which is used to select the appropriate tenant index for each query.
If mod-search is not enabled for a tenant, mod-search should skip the Kafka messages for it (otherwise, for a tenant that was never created, mod-search would wait forever and fail to process messages for other tenants).
NB: When the consortia cross-tenant search requirements are clear, the proposed solution could be enhanced with an out-of-the-box mechanism based on ES aliases for cross-tenant searching (for consortia), in the case when permission to view tenant metadata is granted from one tenant to another. In order to search over several tenants, their ES aliases should be used in the request as a comma-separated string. Elasticsearch will then use the appropriate routing parameter (and therefore shards) and filters. If there is an exceptionally large tenant, a separate index could be created for it and the alias switched to it without restart, downtime or any code changes. Aliases should be automatically created for all tenants and should have a filter on "tenant-id" and a "routing" parameter set to the tenant id. For every entity, the _routing field should have the value of the tenant id.
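The alias setup in the note above can be sketched as the alias action body that would be sent to Elasticsearch. The shared index name and the "tenant_id" field name are assumptions; the filter-plus-routing shape matches the standard ES alias API.

```python
# Sketch of the alias scheme from the note above: each tenant gets an alias
# over a shared index, with a tenant-id filter and routing set to the tenant
# id, so cross-tenant search is just a comma-joined list of alias names.

def tenant_alias_action(shared_index: str, tenant_id: str) -> dict:
    """Build one "add alias" action for the ES update-aliases API."""
    return {
        "add": {
            "index": shared_index,
            "alias": f"{tenant_id}_instance",
            "filter": {"term": {"tenant_id": tenant_id}},  # assumed field name
            "routing": tenant_id,  # queries hit only this tenant's shards
        }
    }

def cross_tenant_search_target(tenant_ids) -> str:
    """Aliases joined by comma, as used in a multi-tenant search request."""
    return ",".join(f"{t}_instance" for t in tenant_ids)

action = tenant_alias_action("instance_shared", "diku")
```

Because the alias carries both the filter and the routing, switching an exceptionally large tenant to its own dedicated index only requires repointing the alias; callers keep using the same alias name.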
Reindex in case of data structure changes
In some cases indexes should be recreated and the data reindexed by sending all ids of existing records from inventory to the Kafka topic. This should not be a routine operation and, in the best-case scenario, shouldn't happen in production.
No need to reindex data:
- A new field is added and no existing records in the database contain this information
- A field is deleted (if aggregated fields are not implemented)
Reindex should be performed:
- Existing data, which was previously excluded from indexing, should now be indexed
- A field is renamed or its data type is changed
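A full reindex as described above amounts to republishing all existing record ids to the Kafka topic. A minimal sketch, assuming a batch size of 100 (the "batch requests for 100 items" figure mentioned later in the comments) and an illustrative message shape:

```python
# Sketch of a full-reindex trigger: all record ids from inventory are
# republished to the Kafka topic in batches, after which the normal
# consumer flow re-fetches and re-indexes each record. Message shape
# and batch size are illustrative assumptions.

def reindex_messages(record_ids, tenant, resource_type="instance", batch_size=100):
    """Yield Kafka message payloads, each carrying up to batch_size ids."""
    for start in range(0, len(record_ids), batch_size):
        yield {
            "ids": record_ids[start:start + batch_size],
            "tenant": tenant,
            "type": resource_type,
        }

batches = list(reindex_messages([f"id-{n}" for n in range(250)], "diku"))
```

Because indexing is idempotent, replaying ids this way is safe even if some records were already up to date.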
Calling modules from asynchronous job
In order to make calls to other modules (e.g. find a campus by library_id), there should be a system user for the module with the corresponding permissions. This user should be created on module deployment (unless it already exists). The same strategy was used in mod-data-import.
Security
There should be authentication for Elasticsearch requests. The same approach that is used for database access should be used for Elasticsearch. The values should be environment variables referenced in application.yml.
Java library (maven dependency)
A Java library should be created to provide the ability to send Kafka messages for indexing when data is modified. A declarative approach could be leveraged: methods which are supposed to modify entities could carry corresponding Java annotations. The common message format (e.g. id and type fields) should be defined in this library.
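The proposed library is Java and annotation-based; the declarative idea can be illustrated with a Python decorator analog. All names here are illustrative, and the stand-in list takes the place of a real Kafka producer.

```python
# Python analog of the proposed declarative approach: methods that modify
# entities are annotated, and a common-format message (id, type, tenant) is
# sent after the modification succeeds. The real library would be a Java
# annotation; this decorator only illustrates the idea.

sent_messages = []  # stand-in for a Kafka producer

def publishes_index_event(resource_type):
    """Decorator marking a method as one that triggers an indexing message."""
    def decorator(func):
        def wrapper(entity, tenant):
            result = func(entity, tenant)  # perform the modification first
            sent_messages.append(
                {"id": entity["id"], "type": resource_type, "tenant": tenant}
            )
            return result
        return wrapper
    return decorator

@publishes_index_event("instance")
def update_instance(entity, tenant):
    return entity  # persistence logic would go here

update_instance({"id": "inst-1"}, "diku")
```

The key property is that the message format (id, type fields) lives in one shared place, so every annotated method emits the same envelope without duplicating serialization code.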
38 Comments
Marc Johnson
Mikhail Fokanov Who is the audience for this proposal, is it intended for peer feedback from the technical leads?
Mikhail Fokanov
Vince Bareau proposed adding a link to this page to the summary of ABI-011 at Folio Architectural Blueprint Strategic Changes.
Marc Johnson
Mikhail Fokanov
Does that mean this is a strategic proposal intended for the Technical Council (and not for peer review by the technical leads)?
cc: Craig McNally Jakub Skoczen
Mikhail Fokanov
My intent is to discuss this initiative as widely as possible. I told Vince Bareau that I have a good idea of how to address all the problems I've heard exist in Folio so far (e.g. accurate counts, multi-tenancy, etc. - see the problem statement section). I prepared the page. Vince said that it can be mentioned on the Folio Architectural Blueprint, as there is a corresponding row (ABI-011) in the table.
So, your and community comments would be greatly appreciated. Also we could schedule a meeting for this discussion.
Mikhail Fokanov
Marc Johnson , In order to clarify it, I added "Purpose of this page" section.
Brandon Tharp
From a SysOps perspective, I have several concerns with utilizing Elasticsearch. First and foremost are performance concerns. It takes a lot of memory to run Elasticsearch, and performance tuning it can be extremely challenging. Second, adding an additional dependency to run FOLIO is always a concern. Right now we have to run Postgres for the database, MinIO for object storage, and Kafka for event streaming. Adding these types of dependencies makes FOLIO more difficult and cumbersome to run. This raises the bar for libraries who want to use it. It also makes FOLIO upgrades more difficult by adding another layer that must be upgraded properly. Third, running, maintaining and updating Elasticsearch is extremely difficult. Having worked with Elasticsearch in the past, I can testify that upgrades are challenging, re-indexing data is time consuming and the user interface is not easy. Architecting an Elasticsearch solution that is performant, highly available and easy to run is a demanding challenge.
Theodor Tolstoy (One-Group.se)
Regarding your first concern, I believe we are heading for a lot more memory consumption anyway, trying to meet the requirements for better search in Inventory (FOLIO-2573).
For your third concern, in my experience, adding a lot of records to Postgres can be even worse if you have giant indexes set up. But I agree that it will be a demanding challenge.
Tod Olson
So there are three parts to this proposal:
The need to add a text search engine in order to address scale is expected. We (Chicago) saw this with Horizon and with OLE, RDBMSes can't be expected to meet our search needs at the scale of several million records. This seems uncontroversial.
The basic approach, keeping a queue of records to be indexed, is familiar and in our experience robust. In Horizon, while the queuing method was slightly different, it was effective. In OLE, the search engine is intertwined with primary storage in a more complex way and the result is not as robust. In Horizon, the one weakness we saw in the queuing approach was that when we did bulk updates of a very large number of records, at some point the indexing would fall behind and take too long to catch up. In those cases it was more effective to temporarily disable the queue and schedule a complete reindex. And of course a change to the indexing configuration required a complete reindex. We would do maintenance on this scale maybe a couple of times a year. The implementation needs to allow for the occasional complete reindex, and perhaps the ability to manually add a set of records to the queue for reindexing. In any event, the basic architectural approach of using a queue also seems uncontroversial.
The specific choice of search engine seems like the only point of controversy. As Brandon Tharp points out, the addition of a search engine has an effect on operations, as does the choice of engine. In a perfect world, maybe we could plug in our choice of search engine (or maybe choose between a couple of options, Elasticsearch and Solr, for example), but that does not seem realistic. It may be good to ask the SysOps SIG and the project DevOps for relevant experiences from the operations side.
Marc Johnson
Tod Olson I'm curious, please could you expand upon the different synchronisation approaches that Horizon and OLE took and how those affected the robustness. I'd be keen to understand what experience folks already have of doing this.
Tod Olson
Marc Johnson, this is all from memory. Dale Arntson can correct me if I have anything important wrong.
Horizon maintained a queue. (IIRC the scope of full-text searching was the bib records.) To keep the full-text indexes updated there were database triggers on the table(s) with bib data. Any create, update, or delete would add the record to a queue kept in a table. That table had just enough information for the indexer to look up the record and take an action, plus some troubleshooting data. I think it recorded the bib number, a code to indicate the basic operation (create/update or delete), and a timestamp. (I do not recall whether there was a uniqueness constraint on the bib number.)
The indexer was a separate process. It looked into the table with the queue, grabbed the first N rows, pulled the bib data, put it into the full-text indexes, and after success removed those rows from the table. As mentioned, there were some occasions where we would need to do a full reindex. As I recall, we would stop the indexer process and run the full indexer. That would zero out the queue, march through all of the bibs, and then start working on anything that had accumulated in the queue.
That queue table would also give us some insight into how the indexing was going. If it's backed up, stalled, etc.
OLE is more complicated. There was a design goal of a "docstore" to function as an abstraction layer to store and search documents, i.e. records, as a service to the business logic modules. The implementation took a different approach. After the first technology attempt did not scale, OLE switched to using MySQL to store the transactional data and Solr for text indexing. (This might be a place where Dale Arntson should chime in, as he was closely involved with some of the debugging.) I'm actually not entirely certain of the current state of the implementation, as I've been further away from OLE and more on FOLIO these last few years, and there have been efforts to make it more robust. The basic approach was to receive an update on the SOA bus, make the transaction in the database and send an update to Solr. Skipping a little history, the strategy was to rely on the near-real-time indexing capabilities in Solr, and assume that the indexes would be updated quickly enough that they could drive displays in OLE. It kind of worked, except when it didn't due to timing errors, failure to send the update, or...
The problem that emerged is that sometimes the data in Solr would be stale for some reason. The update message was not properly created, never arrived, the commit didn't happen in time, whatever. (Insert aside on the need for useful diagnostic logging.) So the database has a record in one state, while Solr has the record in some other state, with some fields different. A user of the system opens a record to complete some workflow. It could be placing an order, or updating an item record, or changing a loan record. The information in Solr would be used to populate the user's screen and to initiate the transaction. So the transaction would carry this older state of the record with it. As stated elsewhere, this caused much mischief and undermined trust in the system. Which is hard to regain.
This last paragraph is a little bit of a digression from Marc Johnson's question, but it's an important point.
Marc Johnson
Thanks Tod Olson for explaining further.
I haven't had chance to digest this properly yet, I'll follow up when I do.
Mikhail Fokanov
Tod Olson thanks a lot for sharing your experience and for such comprehensive explanation.
I do agree that ES data shouldn't be involved in any transactions. ES should be used only for searching and displaying the info. Speaking about outdated data on the UI due to a 3-5 second delay: it is a valid concern, and we should consider it. However, the same problem exists when the data is changed after a user has opened the instance view page.
Actually, I also have experience designing and implementing similar library software. It was a single-cloud SaaS solution for a lot of customers. That platform served queries in less than 500 ms for 1000 concurrent queries (most of them search and facet queries) over 200 million BIBFRAME instances. We faced almost the same problems (consistency etc.), but eventually we managed to resolve them all. There was no possibility of seeing stale data due to errors. A properly configured Kafka cluster guarantees delivery in any case; race conditions were avoided by partitioning the queue by id (each instance of the search microservice deals with indexing of certain bib records). Also, if an indexing request to ES failed due to an infrastructure issue, it was repeated until it succeeded.
There was a slight delay due to indexing time, but our POs were OK with it. In the case of Folio, if we use ES, we will need to consider the pros and cons of both options (using ES or Postgres for the single instance view).
Marc Johnson
Tod Olson Mikhail Fokanov
What is meant by transactions in this context? Would checking out of an item to a borrower or the processing of an imported MARC file be considered transactions?
Mikhail Fokanov
yes.
Marc Johnson
Does that mean that an Elastic Search based API should not be used during these operations?
If that is the case:
how would a client know that a particular API should or should not be used?
will the system need to retain its existing search APIs for use in these kinds of transactions?
Mikhail Fokanov
My statement "ES data shouldn't be involved in any transactions" means that ES search is not supposed to be used for any internal Folio operations (circulation, MARC import etc.). The new search module should be used only for searching from the UI. So all existing endpoints, which are used by other modules, should be retained and still used.
Marc Johnson
In effect, that means that contexts like inventory will be offering two separate search APIs. How will clients know which search API to choose?
Mikhail Fokanov
No, for searching from the UI there should be a different context (e.g. URL: search/instances), and the UI should be switched to it.
Marc Johnson
How will clients know which one they should use?
Does this mean that this approach will not reduce the need for PostgreSQL database indexes if they are used for CQL queries involved in transactional processes?
Mikhail Fokanov
Brandon Tharp I understand the concerns of SysOps, because every new technology in a stack implies additional effort for deployment and maintenance. However, some tools cannot be omitted from systems that have certain functionality. If there is a requirement for full-text search over a significant amount of data (considering accurate counts, and possibly facets and autocomplete), there is no way to do it without a dedicated tool (a search engine). I have extensive experience configuring Elasticsearch for a similar platform. We considered both Elasticsearch and Solr, and after our investigation we found that the functionality of Elasticsearch is broader and its sharding and routing are more robust. Also, we didn't find any significant drawbacks in its use compared to Solr.
Regarding deployment, there should be a solid "one-button" solution for deploying the whole platform to an on-premises cluster, and another button to deploy to a certain cloud (e.g. AWS). From my perspective, in both cases k8s is suitable. If that is the case, including Elasticsearch won't involve any extra effort for each on-premises deployment (only computing resources will be affected).
Tod Olson From my experience, I implemented some tactics for "bulk updates of a very large number of records", and in our case indexing was only a bit behind the main storage. We also implemented a kind of "blue-green" cluster reindex in order to make zero-downtime changes to the indexing configuration (as it was a cloud SaaS). The "ability to manually add a set of records to the queue for reindexing" was implemented as re-indexing by SQL query.
Theodor Tolstoy (One-Group.se)
Perhaps not relevant for the things the proposal wants to solve, but this also opens up for a lot of other things, right out of the box:
For example the ability to have proper sorting and searching for users of different nationalities:
There are other areas, like in inventory, where things that are hard to solve in the DB, could be very easy to solve in other ways, like allowing searches on both Valid and Invalid ISBNs, but ranking the invalid ones lower.
Tod Olson
Seems like this is also relevant:
Mike Gorrell
Note that there are additional use cases beyond Inventory including cross-module requirements that should be considered.
Marc Johnson
Mike Gorrell Mikhail Fokanov
I think that is a good question, is this proposal intended for searching within the boundaries of a single module or for cross-module searching?
Mikhail Fokanov
Mike Gorrell, Marc Johnson, Julian Ladisch The proposed new search microservice should be responsible for indexing and full-text searching of data: all kinds of entities for which there is a huge amount of data and accurate result counts (or some other full-text features, e.g. autocomplete, facets etc.) are mandatory.
Julian Ladisch Primarily, it should be used for staff searching (the Inventory screen), as it can provide exact counts and autocomplete, and is DoS-friendly (any query will be executed in less than 1 second). But it would also be suitable for a patron UI, if there is any.
Marc Johnson
Mikhail Fokanov
Does that mean this proposal is for a single, centralised, search service that is used by all modules needing these kinds of features?
Mikhail Fokanov
We should consider each particular case separately. But I suppose, that this microservice should be a general one.
Julian Ladisch
Is this about staff searching in the inventory app, or about patrons searching in some discovery system? Can the title be adjusted accordingly and, if it is staff searching in inventory, can this page be moved below Inventory?
Jakub Skoczen
Mikhail Fokanov
Mikhail Fokanov
Jakub Skoczen
Mikhail Fokanov
Btw, did you look at https://github.com/zombodb/zombodb?
Mikhail Fokanov
Jakub Skoczen
When you say "batch" requests do you mean requests with offset (paging) or requesting a list of UUIDs?
Mikhail Fokanov
By "actually there is really small overhead versus doing it (a lot of get requests) via http-streaming, especially if they are batch requests (e.g. for 100 items)", I meant that requests from search to inventory-storage should be "batch". As Kafka messages will contain only UUID of changed instance, this "batch" request will contain list of UUIDs.
Mikhail Fokanov
I took a look at zombodb. I see that for some use cases it could be a quick solution to the problem. But the first problem I noticed in applying it to our case is that we have no columns in tables, only a single jsonb column for the data, and zombodb automatically maps json/jsonb columns to dynamic nested objects; it is generally accepted that usage of nested objects is extremely bad for ES performance. It's likely that this issue could be somehow hacked around, but I think there are still a lot of limitations in zombodb (for example, the requirement for certain versions of Postgres and ES). I think it just isn't worth the hassle to have such a mediator between the UI and ES for searching, as we already have Apache Kafka and the implementation of a search microservice is a pretty simple job.
Tod Olson
Mikhail Fokanov I share that concern about serving individual record data from ES. This is how parts of OLE were implemented. The effect is that transactions and search indexing become entangled. Stale data is served up and subsequent transactions are attempted based on stale data, which leads to inconsistent states and data integrity issues. Our experience has been that these can be subtle issues that linger undiscovered and accumulate until they cause real trouble. The data in the index cannot be treated as the source of truth.
Mikhail Fokanov
"The data in the index cannot be treated as the source of truth."
You are completely right, ES isn't a real-time solution. I stated it in this document as: "No transactional calls should be performed against the search microservice, as Elasticsearch is not ACID-compliant and is eventually consistent." The indexing delay can be configured (index.refresh_interval, the default is 1 second), but it still cannot be completely avoided.
In my opinion, in most use cases this near-real-time data can be used (e.g. with a delay < 3 sec). But for things like circulation, ES definitely cannot be used at all.
That's why I say that using elastic for presentation of the pages should be considered after some spike story.
Marc Johnson
Jakub Skoczen Mikhail Fokanov
It seems to me that there are two different conversations going on in this proposal (as I believe Tod Olson has already stated above):
Could it be worthwhile separating out those two?
Tod Olson comment suggests that the first of these may have already been decided. Has FOLIO decided it is going to introduce a secondary representation of inventory information for the purposes of searching?
If that decision hasn't been made, whilst considering the general architectural trade-offs (like consistency, increased operational complexity etc.) and subsequent decisions about scope (like Tod Olson and Mike Gorrell have asked about which operations will use this), is it worthwhile getting into the details of how this could be implemented?
If we do want to discuss that, then there are lots of aspects (like the need for full rebuilding of the index) to it and lots of different approaches (there are at least three different kinds of messages we could use for synchronisation for example).
I personally think it would be useful to frame that conversation with explicit decisions about whether FOLIO is doing this and the scope of the initial work (e.g. only search in inventory).