
Problem statement

Users need to perform full-text search over instances, items, and holdings. The following requirements should be fulfilled:

  • Assumption: for 20+ million records, a search request should be executed in less than 500 ms
  • Multi-tenant search (e.g. for consortia) should allow sharing the metadata-view permission from one tenant to another
  • Search should be efficient in terms of multi-tenancy: the amount of data in one tenant shouldn't affect request execution time for other tenants
  • Result counts should be precise for any query and any amount of data
  • Auto-complete should be based on titles; the response time should be less than 100 ms
  • Facets (e.g. how many print books are found for the search request) should be precise
  • Assumption: rich full-text functionality that benefits users should be provided, namely:
    • Stemming (e.g. find a record containing the term "books" for the query "book")
    • Stop-words (e.g. "and", "or" for English) should not affect relevancy
    • Relevancy scoring should be based on TF-IDF frequencies, so that the most relevant records appear at the top
    • Did-you-mean spelling correction of the input string should be implemented and shown as a tip when a significantly more relevant query exists
    • All language-dependent features should support a predefined list of languages
  • There should be support for CQL queries built from the input string

Purpose of this page

The purpose of this page is to capture good ideas that have been discussed regarding the introduction of a search engine, so that they are not lost. In addition, since these ideas are highly relevant to the possible blueprint item, they could provide the beginning of that conversation, and therefore this page will be linked from that list.

Proposed solution

Overall architecture

Main points of the data flow:

  • The architecture guarantees fault tolerance: even if the search microservice is down or cannot process requests at a given moment, the messages will be consumed later, once the microservice is back up
  • The architecture doesn't slow down inventory, as sending Kafka messages costs virtually nothing
  • The at-least-once delivery strategy is sufficient here, because the indexing process is idempotent
  • When data is modified in the inventory service, the id of the modified resource and the tenantId are sent in a Kafka message to the Kafka topic for that resource type
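
The change event described above can be as small as the following sketch (the field names are illustrative, not a fixed FOLIO schema):

```python
import json

def build_change_message(resource_id: str, tenant_id: str) -> bytes:
    """Serialize the minimal change event: just the id of the modified
    resource and the tenantId, to be published to the per-resource-type
    Kafka topic. Field names here are assumptions for illustration."""
    return json.dumps({"id": resource_id, "tenantId": tenant_id}).encode("utf-8")
```

Because the payload carries no record body, producing it is cheap for inventory, and a consumer that replays the same message simply re-fetches and re-indexes the same record.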


In terms of overall architecture it is reasonable to have a golden source (i.e. a central, reliable source of data) for each type of data. For now, both inventory and SRS are sources of metadata. Their roles should be clearly separated: inventory has additional information (items, holdings), while SRS holds the actual MARC files. Possibly, inventory could be the golden source of metadata and SRS the storage for the corresponding MARC files.

Search microservice

In order to implement all the requirements, a new module should be implemented in Folio. Its primary responsibility should be to index instances, holdings, and items, which are stored in the inventory microservice.

The data store for it should be Elasticsearch, as it satisfies all the mentioned requirements and scales close to linearly, which makes it multi-tenancy friendly (see the corresponding section below).

The CQRS principle should be applied: inventory should still be used for all write queries and for "get all by date range" queries, and should remain the golden source of this information. Conversely, all full-text requests from the UI should be processed only by this module for these types of data. Data from other components could also be indexed and searched by this microservice, but each particular case should be considered separately, in terms of data size (e.g. more than 1 million records) and search type (e.g. full-text search only). No transactional calls should be performed against the search microservice, as Elasticsearch is not ACID-compliant and is only eventually consistent.

CQRS in this case doesn't mean a single API on the Okapi side for both reads and writes. To keep it simple, the UI will send query requests to the search microservice and data-modification requests to the inventory microservice.

To deliver updates from other microservices to the search microservice, Apache Kafka should be used (see the corresponding section below). On any change of data in inventory storage (e.g. instances, holdings, or items are created/updated/deleted), it should send a Kafka message containing the id of the modified resource and the tenantId to the Kafka topic for that resource type (e.g. the inventory.instance.ids topic). The search microservice subscribes to the topic and, when a message is received, makes a GET request to inventory for the data. If a search-specific representation of the data is required, or instances must be indexed along with their holdings information in one index, this can be implemented by using database views when preparing the response and embedding this information in the instances in the Elasticsearch index.
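
The consume-then-fetch-then-index step can be sketched as follows. The two callables are injected stubs standing in for the inventory HTTP client and the Elasticsearch client; the point is that re-processing the same event is harmless, which is why at-least-once delivery is enough:

```python
def handle_change_event(event: dict, fetch_from_inventory, write_to_index) -> str:
    """Consume one {id, tenantId} event: fetch the current state of the record
    from inventory, then upsert it into the index, or remove it from the index
    if inventory no longer has it. Idempotent by construction: indexing the
    same current state twice leaves the index unchanged."""
    record = fetch_from_inventory(event["tenantId"], event["id"])
    if record is None:
        write_to_index(event["tenantId"], event["id"], None)  # deleted upstream
        return "deleted"
    write_to_index(event["tenantId"], event["id"], record)    # idempotent upsert
    return "indexed"
```

This is only a shape sketch; the real module would add batching, retries on infrastructure errors, and error handling for partial failures.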

The existing (JSON) schema from inventory can be used as the golden source of metadata for the search microservice in order to build scalable mappings for the indexes. However, in that case it should be extended, because some additional attributes are relevant only for Elasticsearch.

Searching over linked entities fields

In order to search over the fields of linked entities (e.g. search by item barcode and find the related instance), these fields should be embedded into the instances. If a related entity is updated, the instance should be reindexed. "Partially updating" the instance with only the new data (item/holding) is not possible, because data consistency cannot be maintained that way (considering multiple indexer module instances). For every change inside inventory-storage (e.g. an item is changed), the flow should be the following:

  1. A message with the id and type should be sent to the Kafka topic.
  2. The indexer module should receive the message and find the ids of all related indexed entities (e.g. instance ids)
  3. These ids (instance ids), along with their types, should be sent as messages to the Kafka topic


  • Relations could be described in metadata JSON files
  • The indexer module could be part of the search module or a separate module.
  • There could be different topics for different kinds of entities
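
The fan-out step above can be sketched like this. Both callables are injected stubs (a relation lookup and a Kafka producer); instance events pass through unchanged, while item/holding events are translated into reindex events for their parent instances:

```python
def fan_out_to_instances(event: dict, find_instance_ids, publish) -> None:
    """Translate a change event for a child entity (item or holding) into
    reindex events for the parent instances that embed its fields.
    find_instance_ids(entity_type, entity_id) stands in for the relation
    lookup described in the metadata JSON files."""
    if event["type"] == "instance":
        publish(event)  # instances are indexed directly
        return
    for instance_id in find_instance_ids(event["type"], event["id"]):
        publish({"type": "instance", "id": instance_id,
                 "tenantId": event["tenantId"]})
```

Because the fan-out emits ordinary instance events, the downstream indexing path stays identical for direct and derived changes.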

Elasticsearch configuration

All tenant data should be stored in the following indices: instance, holding, item — one index per entity type. Every index should have 32 shards (partitions); routing is done by tenantId, so the requests for a given tenant go to one shard. In case of hash collisions, some tenants may share the same shard. For each tenant, aliases should be created for each index. So, for the tenant with id = uuid1 and the instance index, an instance_uuid1_resource_search alias should be created, which contains the routing parameter and a filter by the tenantId field. All search and indexing operations should be done via aliases; search never uses an index directly. So, when we search over some tenant, the search is done over the tenant alias, e.g. instance_uuid1_resource_search. Since the alias contains the routing and filtering parameters, the search is done on the tenant's shard only (and hence only on the node containing that shard); the filtering parameter is then applied for the case when multiple tenants share one shard. Every shard is replicated to 3 nodes, so if a node goes down, the data is still available for searching and updating.
This approach provides horizontal scalability for search. Data is spread over the cluster nodes, and search queries are performed on different nodes. If the number of tenants becomes too big, we just add nodes to the cluster and the data will be rebalanced over all nodes.
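
The alias construction described above can be sketched as a small helper (the name scheme and the routing/filter keys follow the text; treat the exact body shape as illustrative of Elasticsearch's alias definition, not a verified request):

```python
def tenant_alias(entity: str, tenant_id: str) -> tuple[str, dict]:
    """Build the per-tenant alias name and definition for one entity index.
    Routing pins all operations to the tenant's shard; the filter drops
    documents of other tenants that hash to the same shard."""
    name = f"{entity}_{tenant_id}_resource_search"
    body = {
        "routing": tenant_id,
        "filter": {"term": {"tenantId": tenant_id}},
    }
    return name, body
```

Creating such an alias for each of the three indices at tenant provisioning time means no query code ever needs to know which shard (or node) a tenant lives on.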

Apache Kafka configuration

The existing Apache Kafka cluster, which is used for mod-pubsub, can be used. For every resource type there should be a separate Kafka topic. If multiple instances of the search microservice are deployed, the same consumer group should be set for all of them. The id of the record should be used as the partition key of the message.
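
Keying by record id means every event for a given record lands on the same partition, so a single consumer in the group sees that record's updates in order. A sketch of the idea (Kafka's default partitioner actually uses murmur2 on the key; CRC32 stands in here purely for illustration):

```python
import zlib

def partition_for(record_id: str, num_partitions: int) -> int:
    """Deterministic partition choice from the message key: the same record id
    always maps to the same partition, preserving per-record ordering while
    different records spread across partitions for parallel consumption."""
    return zlib.crc32(record_id.encode("utf-8")) % num_partitions
```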


The proposed handling of ES aliases provides out-of-the-box cross-tenant searching (for consortia) for the case when permission to view tenant metadata is granted from one tenant to another. To search over several tenants, their ES aliases should be passed in the request as a comma-separated list.

Aliases should be created automatically for all tenants and should contain a filter on the tenantId field and a routing parameter set to the tenant id. For every entity, the _routing field should have the value of the tenant id.

Elasticsearch will then use the appropriate routing parameter (and therefore shards) and filters. If a tenant is extremely large, a separate index can be created for it and the alias switched over to it without restart, downtime, or any code changes.
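
The comma-joined search target for a consortial query can be sketched as (alias naming follows the scheme from the Elasticsearch configuration section):

```python
def cross_tenant_target(entity: str, tenant_ids: list[str]) -> str:
    """Join the per-tenant aliases of one entity index into the comma-separated
    target string for a single cross-tenant search request. Each alias carries
    its own routing and tenantId filter, so the combined request still touches
    only the shards of the listed tenants."""
    return ",".join(f"{tenant_id}".join([f"{entity}_", "_resource_search"])
                    for tenant_id in tenant_ids)
```

For example, searching instances across tenants uuid1 and uuid2 would target `instance_uuid1_resource_search,instance_uuid2_resource_search`.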

Reindex in case of data structure changes

In some cases indexes have to be recreated and the data reindexed by sending all ids of the existing records from inventory to the Kafka topic. This should not be a routine operation and, in the best-case scenario, should never happen in production.
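
The reindex mechanism is just a replay of every existing id onto the same topic that ordinary updates use, so one consumer code path serves both cases. A sketch, with the producer injected as a stub:

```python
def replay_all_ids(record_ids, tenant_id: str, publish) -> int:
    """Full reindex: emit one ordinary change event per existing record id.
    Because the consumer re-fetches current state from inventory for every
    event, concurrent user edits during the replay cannot produce stale
    index entries. record_ids would come from an inventory-storage query."""
    count = 0
    for record_id in record_ids:
        publish({"type": "instance", "id": record_id, "tenantId": tenant_id})
        count += 1
    return count
```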

No need to reindex data:

  • A new column is added and there are no records in the database with this information
  • A column is deleted (if aggregated fields are not implemented)

Reindex should be performed:

  • Existing data that was previously excluded from indexing should now be indexed
  • A column is renamed or its data type is changed


  1. Mikhail Fokanov Who is the audience for this proposal? Is it intended for peer feedback from the technical leads?

    1. Vince Bareau proposed to add link to this page to summary of ABI-011 at Folio Architectural Blueprint Strategic Changes.

      1. Mikhail Fokanov

        Vince Bareau proposed to add link to this page to summary of ABI-011 at Folio Architectural Blueprint Strategic Changes

        Does that mean this is a strategic proposal intended for the Technical Council (and not for peer review by the technical leads)?

        cc: Craig McNally Jakub Skoczen

        1. Does that mean this is a strategic proposal intended for the Technical Council (and not for peer review by the technical leads)?

          My intent is to discuss this initiative as widely as possible. I told Vince Bareau that I have a good idea of how to address all the problems I've heard exist in Folio so far (e.g. accurate counts, multi-tenancy, etc.; see the problem statement section). I prepared the page. Vince said that it can be mentioned on the Folio Architectural Blueprint, as there is a corresponding row (ABI-011) in the table.

          So, your comments and the community's would be greatly appreciated. Also, we could schedule a meeting for this discussion.

          1. Marc Johnson, in order to clarify this, I added the "Purpose of this page" section.

  2. From a SysOps perspective, I have several concerns with utilizing Elasticsearch. First and foremost are performance concerns. It takes a lot of memory to run Elasticsearch, and performance tuning Elasticsearch can be extremely challenging. Second, adding an additional dependency to run Folio is always a concern. Right now we have to run Postgres for the database, MinIO for object storage, and Kafka for event streaming. Adding these types of dependencies makes Folio more difficult and cumbersome to run. This raises the bar for libraries who want to use it. It also makes Folio upgrades more difficult by adding another layer that must be upgraded properly. Third, running, maintaining and updating Elasticsearch is extremely difficult. Having worked with Elasticsearch in the past, I can testify that upgrades are challenging, re-indexing data is time consuming and the user interface is not easy. Architecting an Elasticsearch solution that is performant, highly available and easy to run is a demanding challenge.

    1. Regarding your first concern, I believe we are heading for a lot more memory consumption anyway, trying to meet the requirements for better search in Inventory (FOLIO-2573).

      For your third concern, in my experience, adding a lot of records to Postgres can be even worse if you have giant indexes set up. But I agree that it will be a demanding challenge.

  3. So there are three parts to this proposal:

    1. the need for a text search engine that stands apart from the main data storage,
    2. a basic architectural approach, and 
    3. the specific choice of search engine.

    The need to add a text search engine in order to address scale is expected. We (Chicago) saw this with Horizon and with OLE, RDBMSes can't be expected to meet our search needs at the scale of several million records. This seems uncontroversial.

    The basic approach, keeping a queue of records to be indexed, is familiar and in our experience robust. In Horizon, while the queuing method was slightly different, it was effective. In OLE, the search engine is intertwined with primary storage in a more complex way and the result is not as robust. In Horizon, the one weakness we saw in the queuing approach was that when we did bulk updates of a very large number of records, at some point the indexing would fall behind and take too long to catch up. In those cases it was more effective to temporarily disable the queue and schedule a complete reindex. And of course a change to the indexing configuration required a complete reindex. We would do maintenance on this scale maybe a couple times a year. The implementation needs to allow for the occasional complete reindex. And perhaps the ability to manually add a set of records to the queue for reindexing. In any event, the basic architectural approach of using a queue also seems uncontroversial.

    The specific choice of search engine seems like the only point of controversy. As Brandon Tharp points out, the addition of a search engine has an effect on operations, as does the choice of engine. In a perfect world, maybe we could plug in our choice of search engine (or maybe choose between a couple of options, Elasticsearch and Solr, for example), but that does not seem realistic. Maybe good to ask the SysOps SIG and the project DevOps for relevant experiences from the operations side.

    1. Tod Olson I'm curious, please could you expand upon the different synchronisation approaches that Horizon and OLE took and how those affected the robustness. I'd be keen to understand what experience folks already have of doing this.

      1. Marc Johnson, this is all from memory. Dale Arntson can correct me if I have anything important wrong.

        Horizon maintained a queue. (IIRC the scope of full-text searching was the bib records.) To keep the full-text indexes updated there were database triggers on the table(s) with bib data. Any create, update, or delete would add the record to a queue kept in a table. That table had just enough information for the indexer to look up the record and take an action, plus some troubleshooting data. I think it recorded the bib number, a code to indicate the basic operation (create/update or delete), and a timestamp. (I do not recall whether there was a uniqueness constraint on the bib number.)

        The indexer was a separate process. It looked into the table with the queue, grabbed the first N rows, pulled the bib data, put it into the full-text indexes, and after success removed those rows from the table. As mentioned, there were some occasions where we would need to do a full reindex. As I recall, we would stop the indexer process and run the full indexer. That would zero out the queue, march through all of the bibs, and then start working on anything that had accumulated in the queue.

        That queue table would also give us some insight into how the indexing was going. If it's backed up, stalled, etc.

        OLE is more complicated. There was a design goal of a "docstore" to function as an abstraction layer to store and search documents, i.e. records, as a service to the business logic modules. The implementation took a different approach. After the first technology attempt did not scale, OLE switched to using MySQL to store the transactional data and Solr for text indexing. (This might be a place where Dale Arntson should chime in, as he was closely involved with some of the debugging.) I'm actually not entirely certain of the current state of the implementation, as I've been further away from OLE and more on FOLIO these last few years, and there have been efforts to make it more robust. The basic approach was to receive an update on the SOA bus, make the transaction in the database, and send an update to Solr. Skipping a little history, the strategy was to rely on the near-real-time indexing capabilities in Solr and assume that the indexes would be updated quickly enough that they could drive displays in OLE. It kind of worked, except when it didn't, due to timing errors, failure to send the update, or...

        The problem that emerged is that sometimes the data in Solr would be stale for some reason. The update message was not properly created, never arrived, the commit didn't happen in time, whatever. (Insert aside on the need for useful, diagnostic logging.) So the database has a record in one state. Solr has the record in some other state, with some fields different. A user of the system opens a record to complete some workflow. Could be placing an order, or updating an item record, or changing a loan record. The information in Solr would be used to populate the user's screen and to initiate the transaction. So the transaction would carry this older state of the record with it. As stated elsewhere, this caused much mischief and undermined trust in the system. Which is hard to regain.

        This last paragraph is a little bit of a digression from Marc Johnson's question, but it's an important point.

        1. Thanks Tod Olson for explaining further.

          I haven't had chance to digest this properly yet, I'll follow up when I do.

        2. Tod Olson thanks a lot for sharing your experience and for such comprehensive explanation. 

          "So the transaction would carry this older state of the record"

          I do agree that ES data shouldn't be involved in any transactions. ES should be used only for searching and displaying the info. Regarding outdated data in the UI due to a 3-5 second delay: it is a valid concern and we should consider it. However, the same problem exists when the data is changed after the user has opened the instance view page.

          Actually, I also have experience designing and implementing similar library software. It was a single cloud SaaS solution for a lot of customers. That platform served queries in less than 500 ms for 1000 concurrent queries (most of them search and facet queries) over 200 million BIBFRAME instances. We faced almost the same problems (consistency etc.), but eventually we managed to resolve them all. There was no possibility of seeing stale data due to errors. A properly configured Kafka cluster guarantees delivery in any case; race conditions were avoided by partitioning the queue by id (each instance of the search microservice deals with indexing a certain set of bib records). Also, if an indexing request to ES failed due to an infrastructure issue, it was retried until it succeeded.

          There was a slight delay due to indexing time, but our POs were OK with it. In the case of Folio, if we use ES, we will need to consider the pros and cons of both options (using ES or Postgres for the single-instance view).

  4. Brandon Tharp I understand the concerns of SysOps, because every new technology in a stack implies additional effort for deployment and maintenance. However, there are tools that cannot be omitted from a system that has to provide certain functionality. If there is a requirement for full-text search over a significant amount of data (considering accurate counts and possibly facets and autocomplete), there is no way to do it without a dedicated tool (a search engine). I have solid experience configuring Elasticsearch for a similar platform. We considered both Elasticsearch and Solr, and after our investigation we concluded that the functionality of Elasticsearch is broader and its sharding and routing are more robust. Also, we didn't find any significant drawbacks in its use compared to Solr.

    Regarding deployment, there should be a solid "one-button" solution for deploying the whole platform to an on-premises cluster, and another button to deploy to a given cloud (e.g. AWS). From my perspective, k8s is suitable in both cases. If that is the case, including Elasticsearch won't involve any extra effort for each on-premises deployment (only computing resources will be affected).

    Tod Olson From my experience: I implemented some tactics for "bulk updates of a very large number of records", and in our case indexing lagged only slightly behind the main storage. We also implemented a kind of "blue-green" cluster reindex in order to change the indexing configuration with zero downtime (as it was a cloud SaaS). The "ability to manually add a set of records to the queue for reindexing" was implemented as re-indexing by SQL query.

  5. Perhaps not relevant for the things the proposal wants to solve, but this also opens up a lot of other possibilities, right out of the box:
    For example the ability to have proper sorting and searching for users of different nationalities:

    • UXPROD-745
    • UISE-70
    • UISE-69
    • UISE-68

    There are other areas, like inventory, where things that are hard to solve in the DB could be very easy to solve in other ways, like allowing searches on both valid and invalid ISBNs, but ranking the invalid ones lower.

    1. Seems like this is also relevant:

  6. Note that there are additional use cases beyond Inventory including cross-module requirements that should be considered.

    1. Mike Gorrell  Mikhail Fokanov

      I think that is a good question, is this proposal intended for searching within the boundaries of a single module or for cross-module searching?

      1. Mike Gorrell Marc Johnson Julian Ladisch The proposed new search microservice should be responsible for indexing and full-text searching of data. It is intended for all kinds of entities for which there is a huge amount of data and accurate result counts (or other full-text features, e.g. autocomplete, facets, etc.) are mandatory.

        Julian Ladisch Primarily, it should be used for staff searching (the Inventory screen), as it can provide exact counts and autocomplete and is DoS-friendly (any query will be executed in less than 1 second). But it would also be suitable for a patron UI, if there is any.

        1. Mikhail Fokanov

          The proposed new search microservice should be responsible for indexing and full-text searching of data. It is intended for all kinds of entities for which there is a huge amount of data and accurate result counts (or other full-text features, e.g. autocomplete, facets, etc.) are mandatory.

          Does that mean this proposal is for a single, centralised, search service that is used by all modules needing these kinds of features?

          1. We should consider each particular case separately. But I suppose that this microservice should be a general one.

  7. Is this about staff searching in the inventory app, or about patrons searching in some discovery system? Can the title be adjusted accordingly and, if it is about staff searching in inventory, this page moved below Inventory?

  8. Mikhail Fokanov 

    1. Did you consider how the ES index would be initialised when data is loaded into the system for the first time?
    2. Would the data for a single record read (e.g GET /instances/123, GET /items/678, etc) be served to the client (e.g UI) from ES or PG?
      1. Yes, for this purpose there should be a reindex process (a method in inventory-storage) which will send all ids from the database to the indexing Kafka topic. The same reindex mechanism should be used if the index structure needs to change dramatically (see the "Reindex in case of data structure changes" section).
      2. It should be decided after a spike. I vote for serving it from ES, because then there will be no problems with scalability. Also, if Postgres is highly loaded (by some background process), it won't affect the user experience. But in that case we need to store all information about the instance/holding/item in ES. This info will be stored in the _source field of the ES documents and won't affect ES search performance.
  9. Mikhail Fokanov

    1. I don't think sending millions of IDs through Kafka and then executing millions of GET operations against mod-inventory-storage would scale. More likely we need a dedicated signal sent through Kafka to "reindex all" and then stream all records out of mod-inventory-storage using a streaming GET for bulk /instances, /holdings, etc.
    2. My concern about serving individual record data from ES is that it would increase the chances of serving stale data – the user will not see their own or other people's changes until the change has propagated down to ES. If we only use ES for searching and not retrieval, it is only discoverability that suffers.

    Btw, did you look at ZomboDB?

      1. Even though sending millions of IDs through Kafka and then executing millions of GET operations against mod-inventory-storage looks inefficient, there is actually very little overhead versus doing it via HTTP streaming, especially if the requests are batched (e.g. 100 items per request). The actual ES indexing query (even a bulk one) takes much more time than a bulk GET request to inventory, so the GET queries won't be a bottleneck. At the same time, leveraging Kafka messaging has certain advantages:
        1. The indexing process can be scaled. As the Kafka topic is partitioned, there can be several search microservice instances indexing in parallel.
        2. The indexing process can be recovered. With HTTP streaming, if the connection is broken or a network error occurs, millions of items have to be indexed all over again, because there is no way to recover and finish the process consistently unless an offset is used (which cannot be used for performance reasons).
        3. There shouldn't be any downtime for the system. In case of concurrent modification of data (e.g. a user changes something from the UI during the reindex process), there will be no problems with indexed-data consistency (e.g. race conditions).
        4. Exactly the same mechanism is used for ordinary indexing and for reindexing.
      2. Please see my answer to Tod. I do agree that in some cases serving data that is even 3 seconds out of date isn't acceptable; we need to identify these cases and only then make a decision. That's why I said we need a spike for it.
      1. When you say "batch" requests do you mean requests with offset (paging) or requesting a list of UUIDs?

        1. By "actually there is really small overhead versus doing it (a lot of GET requests) via HTTP streaming, especially if they are batch requests (e.g. for 100 items)", I meant that requests from search to inventory-storage should be "batch" requests. As the Kafka messages will contain only the UUID of the changed instance, this "batch" request will contain a list of UUIDs.
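
          A sketch of that batching (the helper is hypothetical; 100 is just the example size from this discussion, not a fixed limit):

          ```python
          def chunk_uuids(uuids, batch_size=100):
              """Group consumed record UUIDs into batches so that one GET
              request to inventory-storage covers up to batch_size records,
              instead of one HTTP round trip per Kafka message."""
              batch = []
              for uuid in uuids:
                  batch.append(uuid)
                  if len(batch) == batch_size:
                      yield batch
                      batch = []
              if batch:
                  yield batch  # trailing partial batch
          ```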

    1. I took a look at ZomboDB. I see that for some use cases it could be a quick solution to the problem. But the first problem I noticed in applying it to our case is that we have no columns in the tables, just a single jsonb column for the data, and ZomboDB automatically maps json/jsonb columns to dynamic nested objects; it is generally accepted that usage of nested objects is extremely bad for ES performance. It's likely that this issue could somehow be hacked around, but I think there are still a lot of limitations in ZomboDB (for example, it requires specific versions of Postgres and ES). I think it just isn't worth the hassle to have such a mediator between the UI and ES for searching, as we already have Apache Kafka and the implementation of a search microservice is a pretty simple job.

  10. Mikhail Fokanov I share that concern about serving individual record data from ES. This is how parts of OLE were implemented. The effect is that transactions and search indexing become entangled. Stale data is served up and subsequent transactions are attempted based on stale data, which leads to inconsistent states and data integrity issues. Our experience has been that these can be subtle issues that linger undiscovered and accumulate until they cause real trouble. The data in the index cannot be treated as the source of truth.

    1. "The data in the index cannot be treated as the source of truth."

      You are completely right; ES isn't a real-time solution. I stated it in this document as: "No transactional calls should be performed against the search microservice, as Elasticsearch is not ACID-compliant and is only eventually consistent." The indexing delay can be configured (index.refresh_interval, the default is 1 second), but it still cannot be completely avoided.

      In my opinion, in most use cases this near-real-time data can be used (e.g. with a delay < 3 sec). But for things like circulation, ES definitely cannot be used at all.

      That's why I say that using Elasticsearch for presenting the pages should be considered after a spike story.

  11. Jakub Skoczen Mikhail Fokanov

    It seems to me that there are two different conversations going on in this proposal (as I believe Tod Olson has already stated above):

    • Do we want to use a secondary representation of inventory information to improve searching (with the kinds of trade offs that Tod Olson and Brandon Tharp raise)?
      • What is the scope of this, is it only for inventory searching?
    • How could that be implemented?
      • What is the general architecture?
      • What are the technology choices?

    Could it be worthwhile separating out those two?

    Tod Olson's comment suggests that the first of these may have already been decided. Has FOLIO decided it is going to introduce a secondary representation of inventory information for the purposes of searching?

    The need to add a text search engine in order to address scale is expected. We (Chicago) saw this with Horizon and with OLE, RDBMSes can't be expected to meet our search needs at the scale of several million records. This seems uncontroversial.

    If that decision hasn't been made, whilst considering the general architectural trade-offs (like consistency, increased operational complexity etc.) and subsequent decisions about scope (like Tod Olson and Mike Gorrell have asked about which operations will use this), is it worthwhile getting into the details of how this could be implemented?

    If we do want to discuss that, then there are lots of aspects (like the need for full rebuilding of the index) to it and lots of different approaches (there are at least three different kinds of messages we could use for synchronisation for example).

    I personally think it would be useful to frame that conversation with explicit decisions about whether FOLIO is doing this and the scope of the initial work (e.g. only search in inventory).