Spike - investigate options for improving data export performance

The main idea of the improvement is to replace endpoint invocations of the mod-inventory-storage and mod-source-record-storage modules with direct calls to the DB. This approach can be applied to one of the resource-intensive parts of the data export flow, such as retrieving instances by IDs. In the current implementation the instances can be retrieved by IDs either from the Inventory or the SRS source. In both cases, endpoints are invoked through the HTTP client with a specific number of IDs passed per call (currently 50). The number of calls depends on the number of IDs in the input CSV file. With the new approach, the number of calls can be decreased significantly, because many more IDs (5000 in the tests) can be passed to a SQL query than to a module endpoint; this is especially relevant for GET requests, which are constrained by URL length limits.
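
For illustration, the IDs from the input file could be partitioned into batches before querying; a minimal sketch (the helper name is an assumption, the batch size of 5000 is the one used in the tests):

  private static final int DB_BATCH_SIZE = 5_000; // batch size used in the tests

  // Splits the full list of instance IDs into chunks, one SQL query per chunk
  // (requires java.util.ArrayList and java.util.List)
  private static List<List<String>> partitionIds(List<String> ids) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < ids.size(); i += DB_BATCH_SIZE) {
      batches.add(ids.subList(i, Math.min(i + DB_BATCH_SIZE, ids.size())));
    }
    return batches;
  }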

The query to Inventory can look like the following:

private static final String QUERY_INVENTORY_RECORDS = "SELECT jsonb FROM %s_mod_inventory_storage.instance WHERE%s;";

Where the first %s is the tenant ID, and the last %s is a condition consisting of the set of IDs. The final query can be built through the following method:

private static String buildQueryInventory(List<String> ids, String tenantId) {
    // Builds the "WHERE id = '...' OR id = '...'" condition from the batch of IDs
    StringBuilder sb = new StringBuilder();
    ids.forEach(id -> sb.append(" id = '").append(id).append("' OR "));
    sb.setLength(sb.length() - 4); // drop the trailing " OR "
    return String.format(QUERY_INVENTORY_RECORDS, tenantId, sb);
  }
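
For example (hypothetical IDs and tenant name), a batch of two IDs turns into a single statement:

  // buildQueryInventory(List.of("uuid-1", "uuid-2"), "diku") produces:
  // SELECT jsonb FROM diku_mod_inventory_storage.instance WHERE id = 'uuid-1' OR id = 'uuid-2';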

The query to SRS can be built in a similar way, but it is more complex in this case since it joins two tables:

  private static final String QUERY_MARC_RECORDS = "SELECT rec_lb.id, snapshot_id, record_type, content, rec_lb.order, " +
    "external_id, external_hrid, suppress_discovery, created_by_user_id, created_date, updated_by_user_id, updated_date\n" +
    "FROM %s_mod_source_record_storage.records_lb rec_lb\n" +
    "JOIN %s_mod_source_record_storage.marc_records_lb marc_rec_lb\n" +
    "ON   rec_lb.id = marc_rec_lb.id\n" +
    "WHERE%s";

Where the first two %s are both the tenant ID and the last %s is a condition consisting of the set of IDs (here, external_id from the records_lb table). The final query can be built through the following method:

  private static String buildQueryMarc(List<String> ids, String tenantId) {
    // Builds the "WHERE external_id = '...' OR external_id = '...'" condition
    StringBuilder sb = new StringBuilder();
    ids.forEach(id -> sb.append(" external_id = '").append(id).append("' OR "));
    sb.setLength(sb.length() - 4); // drop the trailing " OR "
    return String.format(QUERY_MARC_RECORDS, tenantId, tenantId, sb);
  }
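
As a possible hardening step (not part of the spike code), the same lookup could use a prepared statement with ANY(?) so the IDs are bound as a uuid[] parameter instead of being concatenated into the SQL text; this keeps the statement constant for any batch size and avoids SQL injection. A sketch, assuming a plain JDBC connection (requires java.sql.PreparedStatement, java.sql.ResultSet, java.sql.Connection, java.sql.SQLException):

  // Sketch: fetch the parsed record contents for a batch of external IDs
  private static List<String> getMarcContents(Connection conn, List<String> ids, String tenantId) throws SQLException {
    String sql = String.format(
      "SELECT content FROM %s_mod_source_record_storage.records_lb rec_lb " +
      "JOIN %s_mod_source_record_storage.marc_records_lb marc_rec_lb ON rec_lb.id = marc_rec_lb.id " +
      "WHERE external_id = ANY(?)", tenantId, tenantId);
    List<String> contents = new ArrayList<>();
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setArray(1, conn.createArrayOf("uuid", ids.toArray())); // bind the IDs as uuid[]
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          contents.add(rs.getString(1));
        }
      }
    }
    return contents;
  }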

The query for preceding titles can be expressed similarly:

private static final String QUERY_PRECEDING_TITLES = "SELECT jsonb ->> 'id', jsonb ->> 'title', jsonb ->> 'identifiers', succeedinginstanceid " +
    "FROM %s_mod_inventory_storage.preceding_succeeding_title WHERE%s;";

Where the first %s is the tenant ID and the last %s is a condition consisting of the set of IDs (here, succeedinginstanceid from the preceding_succeeding_title table). The final query can be built through the following method:

  private static String buildQueryPrecedingTitles(List<String> ids, String tenantId) {
    // Builds the "WHERE succeedinginstanceid = '...' OR ..." condition
    StringBuilder sb = new StringBuilder();
    ids.forEach(id -> sb.append(" succeedinginstanceid = '").append(id).append("' OR "));
    sb.setLength(sb.length() - 4); // drop the trailing " OR "
    return String.format(QUERY_PRECEDING_TITLES, tenantId, sb);
  }

The query for succeeding titles can be expressed similarly:

private static final String QUERY_SUCCEEDING_TITLES = "SELECT jsonb ->> 'id', jsonb ->> 'title', jsonb ->> 'identifiers', precedinginstanceid " +
    "FROM %s_mod_inventory_storage.preceding_succeeding_title WHERE%s;";

Where the first %s is the tenant ID and the last %s is a condition consisting of the set of IDs (here, precedinginstanceid from the preceding_succeeding_title table). The final query can be built through the following method:

  private static String buildQuerySucceedingTitles(List<String> ids, String tenantId) {
    // Builds the "WHERE precedinginstanceid = '...' OR ..." condition
    StringBuilder sb = new StringBuilder();
    ids.forEach(id -> sb.append(" precedinginstanceid = '").append(id).append("' OR "));
    sb.setLength(sb.length() - 4); // drop the trailing " OR "
    return String.format(QUERY_SUCCEEDING_TITLES, tenantId, sb);
  }

Changes for Inventory can be applied to the InventoryClient#getInstancesByIds method: instead of ClientUtil#getByIds, we can use the following method:

public Optional<JsonObject> getInventoryByIds(List<String> ids, String tenantId) {
    JsonObject jo = new JsonObject();
    // Connection, statement and result set are all closed by try-with-resources
    try (var conn = connect();
         var stmt = conn.createStatement();
         var rs = stmt.executeQuery(buildQueryInventory(ids, tenantId))) {
      JsonArray ja = new JsonArray();
      jo.put("instances", ja);
      while (rs.next()) {
        // Each row holds the whole instance as a jsonb value in the first column
        ja.add(new JsonObject(rs.getString(1)));
      }
    } catch (Exception e) {
      e.printStackTrace(); // test code only; the real implementation should log and propagate
      return Optional.empty();
    }
    return Optional.of(jo);
  }

*Please note: since the integration with PostgresClient from the domain-models-runtime library is still an open question, a plain JDBC connection was used for testing.
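
For completeness, the connect() helper used above can be a trivial DriverManager call; a sketch with placeholder connection parameters (requires java.sql.Connection, java.sql.DriverManager, java.sql.SQLException):

  // Plain JDBC connection used only for the spike tests; in the real implementation
  // this would presumably be replaced by PostgresClient or a pooled DataSource
  private static Connection connect() throws SQLException {
    // placeholder URL and credentials - substitute the module's actual DB config
    return DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/okapi_modules", "folio_admin", "folio_admin");
  }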

The same approach can be used for SRS in SourceRecordStorageClient#getRecordsByIds, applying the following method instead of the HTTP POST:

public Optional<JsonObject> getMarcByIds(List<String> ids, String tenantId) {
    JsonObject jo = new JsonObject();
    // Connection, statement and result set are all closed by try-with-resources
    try (var conn = connect();
         var stmt = conn.createStatement();
         var rs = stmt.executeQuery(buildQueryMarc(ids, tenantId))) {
      JsonArray ja = new JsonArray();
      jo.put("sourceRecords", ja);
      while (rs.next()) {
        // Map each joined records_lb/marc_records_lb row to the source record
        // JSON shape that the export flow expects from the SRS endpoint
        JsonObject json = new JsonObject();
        var recordId = rs.getString(1);
        json.put("recordId", recordId);
        json.put("snapshotId", rs.getString(2));
        json.put("recordType", rs.getString(3));
        // The "content" column holds the parsed MARC record as JSON
        JsonObject parsedRecJson = new JsonObject();
        parsedRecJson.put("id", recordId);
        parsedRecJson.put("content", new JsonObject(rs.getString(4)));
        json.put("parsedRecord", parsedRecJson);
        json.put("deleted", false);
        json.put("order", rs.getInt(5));
        JsonObject externalIdsHolderJson = new JsonObject();
        externalIdsHolderJson.put("instanceId", rs.getString(6));
        externalIdsHolderJson.put("instanceHrid", rs.getString(7));
        json.put("externalIdsHolder", externalIdsHolderJson);
        JsonObject additionalInfoJson = new JsonObject();
        additionalInfoJson.put("suppressDiscovery", rs.getBoolean(8));
        json.put("additionalInfo", additionalInfoJson);
        var createdByUserId = rs.getString(9);
        var createdDate = rs.getTimestamp(10);
        var updatedByUserId = rs.getString(11);
        var updatedDate = rs.getTimestamp(12);
        // Timestamps are converted to Instant so the JsonObject can encode them
        JsonObject metadataJson = new JsonObject();
        metadataJson.put("createdDate", createdDate == null ? null : createdDate.toInstant());
        metadataJson.put("createdByUserId", createdByUserId);
        metadataJson.put("updatedDate", updatedDate == null ? null : updatedDate.toInstant());
        metadataJson.put("updatedByUserId", updatedByUserId);
        json.put("metadata", metadataJson);
        ja.add(json);
      }
      jo.put("totalRecords", ja.size());
    } catch (Exception e) {
      e.printStackTrace(); // test code only
      return Optional.empty();
    }
    return Optional.of(jo);
  }

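A hypothetical usage, just to show the shape of the result (a "sourceRecords" array plus "totalRecords"); batchOfIds stands for one partition of the input IDs:

  Optional<JsonObject> marc = getMarcByIds(batchOfIds, params.getTenantId());
  marc.ifPresent(json -> System.out.println("fetched " + json.getInteger("totalRecords") + " source records"));
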
As for the preceding/succeeding titles, the InventoryClient#enrichInstancesByPrecedingSucceedingTitles method can be rewritten in the following way:

  private Optional<JsonObject> enrichInstancesByPrecedingSucceedingTitles(Optional<JsonObject> instanceStorageInstancesOpt, List<String> ids, OkapiConnectionParams params, int partitionSize) throws HttpClientException {
    if (instanceStorageInstancesOpt.isPresent()) {
      JsonObject instanceStorageInstances = instanceStorageInstancesOpt.get();
      var precedingTitlesOpt = instancesDao.getPrecedingTitles(ids, params.getTenantId());
      var succeedingTitlesOpt = instancesDao.getSucceedingTitles(ids, params.getTenantId());
      if (precedingTitlesOpt.isPresent()) {
        var precedingTitles = precedingTitlesOpt.get();
        var titleArr = precedingTitles.getJsonArray("precedingTitles");
        Map<String, JsonArray> map = new HashMap<>();
        titleArr.forEach(title -> {
          var titleJson = (JsonObject) title;
          var titleId = titleJson.getString("instanceId");
          titleJson.remove("instanceId");
          map.computeIfAbsent(titleId, k -> new JsonArray()).add(titleJson);
        });
        instanceStorageInstances.getJsonArray(INSTANCES).forEach(instance -> {
          ((JsonObject)instance).put(PRECEDING_TITLES, map.getOrDefault(((JsonObject)instance).getString("id"), new JsonArray()));
        });
      }
      if (succeedingTitlesOpt.isPresent()) {
        var succeedingTitles = succeedingTitlesOpt.get();
        var titleArr = succeedingTitles.getJsonArray("succeedingTitles");
        Map<String, JsonArray> map = new HashMap<>();
        titleArr.forEach(title -> {
          var titleJson = (JsonObject) title;
          var titleId = titleJson.getString("instanceId");
          titleJson.remove("instanceId");
          map.computeIfAbsent(titleId, k -> new JsonArray()).add(titleJson);
        });
        instanceStorageInstances.getJsonArray(INSTANCES).forEach(instance -> {
          ((JsonObject)instance).put(SUCCEEDING_TITLES, map.getOrDefault(((JsonObject)instance).getString("id"), new JsonArray()));
        });
      }
      return Optional.of(instanceStorageInstances);
    }
    return instanceStorageInstancesOpt;
  }
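
Since the two branches above differ only in the array name and the target property, they could be folded into one helper during the actual implementation; a possible sketch (the helper name is an assumption):

  // Groups titles by their instanceId and attaches them to the matching instances
  // under the given property (PRECEDING_TITLES or SUCCEEDING_TITLES)
  private void attachTitles(JsonObject instances, JsonObject titles, String titlesField, String instanceProperty) {
    Map<String, JsonArray> byInstanceId = new HashMap<>();
    titles.getJsonArray(titlesField).forEach(title -> {
      var titleJson = (JsonObject) title;
      var titleId = titleJson.getString("instanceId");
      titleJson.remove("instanceId");
      byInstanceId.computeIfAbsent(titleId, k -> new JsonArray()).add(titleJson);
    });
    instances.getJsonArray(INSTANCES).forEach(instance -> {
      var instanceJson = (JsonObject) instance;
      instanceJson.put(instanceProperty, byInstanceId.getOrDefault(instanceJson.getString("id"), new JsonArray()));
    });
  }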

Please note that all the methods above were created for testing purposes only and can be refactored during the actual implementation.

Results of the tests in the https://folio-testing-sprint-fs09000000.ci.folio.org environment (from master branch)


Step                                                                                  | Average time    | File size
Loading CSV file with 8M of IDs                                                       | 7 - 8 minutes   | 313MB
Execution time for 8M (7_919_373 from SRS, 260_025 from Inventory)                    | up to 13 hours  | -
Expected time of loading CSV file with 22M of IDs*                                    | 20 - 21 minutes | 860MB
Expected time for 22M records with the same proportion of SRS and Inventory records* | up to 34 hours  | -

Results of the tests in the https://folio-testing-sprint-fs09000000.ci.folio.org environment (NEW APPROACH)


Step                                                                                  | Average time    | File size
Loading CSV file with 8M of IDs                                                       | 7 - 8 minutes   | 313MB
Execution time for 8M (7_919_373 from SRS, 260_025 from Inventory)                    | up to 2.5 hours | -
Expected time of loading CSV file with 22M of IDs*                                    | 20 - 21 minutes | 860MB
Expected time for 22M records with the same proportion of SRS and Inventory records* | up to 8 hours   | -

*All the tests above were performed in an environment with up to 8M records; the expected times for 22M are based only on the assumption that growth remains linear beyond 8M, as it was before.

Open questions and possible solutions

1) Create views in the mod-data-export schema through Liquibase to query the other modules' schemas and set conditions dynamically (this currently fails with an insufficient-permissions error).

Possible solution

Direct queries are sufficient, so views are not actually needed. However, the permissions problem is still open.
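
If the spike moves forward, one typical PostgreSQL approach (assuming the name of the DB role that mod-data-export connects with is known) would be to grant that role read access to the other modules' schemas, e.g. GRANT USAGE ON SCHEMA diku_mod_inventory_storage TO <mod_data_export_role>; followed by GRANT SELECT ON ALL TABLES IN SCHEMA diku_mod_inventory_storage TO <mod_data_export_role>; the exact schema and role names depend on the tenant and the deployment.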


2) The output .mrc file at the end of the export appears to be too big and cannot be saved to MinIO.

Possible solution

Divide the file into parts and save them to MinIO one by one, as sketched below.
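
A minimal sketch of the splitting step, assuming a fixed part size (the splitIntoParts name is an assumption; the MinIO upload call is left as a placeholder since the client integration is outside this spike). Requires java.io.{IOException, InputStream, OutputStream}, java.nio.file.{Files, Path}, java.util.{ArrayList, List}:

  // Splits a large .mrc file into parts of at most partSizeBytes each so that every
  // part can be uploaded to MinIO as a separate object. Note: a real implementation
  // should split on MARC record boundaries (record terminator 0x1D) rather than at
  // arbitrary byte offsets, so that each part stays a valid MARC file.
  static List<Path> splitIntoParts(Path source, long partSizeBytes) throws IOException {
    List<Path> parts = new ArrayList<>();
    try (InputStream in = Files.newInputStream(source)) {
      byte[] buffer = new byte[8192];
      int read = in.read(buffer);
      for (int partNumber = 0; read > 0; partNumber++) {
        Path part = source.resolveSibling(source.getFileName() + ".part" + partNumber);
        try (OutputStream out = Files.newOutputStream(part)) {
          long written = 0;
          while (read > 0) {
            out.write(buffer, 0, read);
            written += read;
            read = in.read(buffer);
            if (written >= partSizeBytes) {
              break;
            }
          }
        }
        parts.add(part);
        // upload `part` to MinIO here (e.g. via the MinIO Java client),
        // then optionally delete the local copy to bound disk usage
      }
    }
    return parts;
  }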