DR-000005 - Platform agnostic object storage for Exports

Submitted Date	01 Apr 2020
Approved Date	24 Jun 2020
Status	ACCEPTED
Impact	MEDIUM

Overrides/Supersedes

This decision was migrated from the Tech Leads Decision Log as part of a consolidation process. The original decision record can be found here.

RFC

NA

Stakeholders

#sys-ops
#development

Contributors

Kruthi Vuppala

Approvers

This decision was made by the Tech Leads group prior to the adoption of current decision making processes within the FOLIO project.

Background/Context

The initial implementation of data export utilized AWS S3 object storage. Since FOLIO shouldn't rely on specific cloud vendors, we need to decide on a platform agnostic storage solution for storing the files generated by the data export process. The following technologies were considered:

MinIO
OpenIO
Databases

See also: MDEXP-19

Assumptions

NA

Constraints

The following considerations/requirements were taken into account:

- A process is already in place for storing the generated files in AWS S3, so a solution that requires minimal changes is preferred
- The solution must support both cloud and on-premises storage, since FOLIO can be expected to be hosted on various platforms
- Files must be easy to retrieve for download. Currently the module uses AWS S3 presigned URLs to stream files directly for download, instead of stressing the server or adding load on the UI
- File sizes can vary from a few KB to gigabytes, depending on the number of UUIDs being exported, so the solution must support a wide range of file sizes
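
To make the presigned-URL requirement concrete, the following is a minimal sketch of how a presigned GET URL is derived under AWS Signature Version 4, the scheme used by S3-compatible APIs. It uses only the Java standard library for illustration; in practice the module would obtain the URL from an SDK call. The class name, method names, host, object path, and credentials are all hypothetical, and the sketch assumes the object path needs no further URI encoding.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch only (hypothetical names); a real module would normally use an S3 SDK.
class PresignSketch {

  static byte[] hmac(byte[] key, String data) {
    try {
      Mac mac = Mac.getInstance("HmacSHA256");
      mac.init(new SecretKeySpec(key, "HmacSHA256"));
      return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  static String sha256Hex(String data) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      return hex(digest.digest(data.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  static String enc(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (java.io.UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    }
  }

  // amzDate uses the form yyyyMMdd'T'HHmmss'Z'; path is assumed to be URI-safe already.
  static String presignGet(String scheme, String host, String path, String region,
      String accessKey, String secretKey, String amzDate, int expiresSeconds) {
    String day = amzDate.substring(0, 8);
    String scope = day + "/" + region + "/s3/aws4_request";

    // Query parameters must appear in sorted order in the canonical request.
    String query = "X-Amz-Algorithm=AWS4-HMAC-SHA256"
        + "&X-Amz-Credential=" + enc(accessKey + "/" + scope)
        + "&X-Amz-Date=" + amzDate
        + "&X-Amz-Expires=" + expiresSeconds
        + "&X-Amz-SignedHeaders=host";

    // Only the host header is signed; the payload is left unsigned for presigned URLs.
    String canonicalRequest = "GET\n" + path + "\n" + query + "\n"
        + "host:" + host + "\n\n"
        + "host\n"
        + "UNSIGNED-PAYLOAD";

    String stringToSign = "AWS4-HMAC-SHA256\n" + amzDate + "\n" + scope + "\n"
        + sha256Hex(canonicalRequest);

    // Derive the signing key by chaining HMACs over date, region, service, and a fixed suffix.
    byte[] kDate = hmac(("AWS4" + secretKey).getBytes(StandardCharsets.UTF_8), day);
    byte[] kRegion = hmac(kDate, region);
    byte[] kService = hmac(kRegion, "s3");
    byte[] kSigning = hmac(kService, "aws4_request");

    String signature = hex(hmac(kSigning, stringToSign));
    return scheme + "://" + host + path + "?" + query + "&X-Amz-Signature=" + signature;
  }

  public static void main(String[] args) {
    // Placeholder endpoint and credentials, not real values.
    System.out.println(presignGet("http", "localhost:9000", "/exports/job-1.mrc",
        "us-east-1", "exampleAccessKey", "exampleSecretKey", "20200624T000000Z", 3600));
  }
}
```

Because the signature is carried in the query string, the browser can fetch the file directly from the object store until the URL expires, which is what keeps the download traffic off the module and the UI.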

MinIO

POC: https://github.com/KVupp/folio-export-aws/blob/master/src/main/java/org/folio/folio_export_aws/MinIo.java

Pros:

- Can be set up with multiple backend storage solutions: it acts as a gateway for storage solutions such as AWS S3, Azure Blob Storage, and HDFS, or it can be set up on a local persistent storage volume
- Provides an S3-compatible API
- Requires minimal changes to the current code, with the ability to switch between solutions easily via command-line parameters
- Hardware agnostic; works in a variety of physical and virtual/container environments
- A UI and CLI are also available for accessing the files
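
As a rough illustration of switching between storage backends through parameters, the sketch below resolves the object-store endpoint from `key=value` arguments. The parameter names `storage.endpoint` and `storage.region` and the class itself are hypothetical and are not the module's actual parameters; the point is only that an S3-compatible API lets the same client code target either AWS S3 or a MinIO server by changing the endpoint.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: parameter names are hypothetical, not taken from the module.
class StorageConfigSketch {

  // Collects key=value arguments into a map; anything without '=' is ignored.
  static Map<String, String> parseArgs(String[] args) {
    Map<String, String> params = new HashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (eq > 0) {
        params.put(arg.substring(0, eq), arg.substring(eq + 1));
      }
    }
    return params;
  }

  // An explicit endpoint (for example a MinIO server) wins;
  // otherwise fall back to the regional AWS S3 endpoint.
  static String resolveEndpoint(Map<String, String> params) {
    String endpoint = params.get("storage.endpoint");
    if (endpoint != null && !endpoint.isEmpty()) {
      return endpoint;
    }
    String region = params.getOrDefault("storage.region", "us-east-1");
    return "https://s3." + region + ".amazonaws.com";
  }

  public static void main(String[] args) {
    System.out.println(resolveEndpoint(parseArgs(args)));
  }
}
```

Running it with `storage.endpoint=http://localhost:9000` would point the client at a local MinIO instance, while omitting that argument keeps the AWS S3 default.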

Cons:

- Doesn't seem to support a credential provider chain like the AWS SDK, so the access key and secret key need to be passed to the module via parameters
- Doesn't have multi-file upload like AWS S3, but files can be uploaded one by one

Reference: https://docs.min.io/

OpenIO

Pros:

- S3-compatible API

Cons:

- No API support for fetching presigned URLs

Reference: https://www.openio.io/

Databases

Relational and non-relational databases both carry overhead when storing static files, whether as large objects or in chunks.

Costs are higher compared to object storage, and backing them up becomes more difficult as the file sizes grow.

Although a database can seem like a logical choice because all the data is kept in one place, its disadvantages outweigh the advantages.

Specifics related to PostgreSQL identified in a different spike are documented here: https://folio-org.atlassian.net/wiki/x/OQ4V

Other options considered:

- Ceph (https://ceph.io/): Combines object, file, and block storage, and has different daemons for monitoring, storage, and gateways. Ceph seems like a good option for larger, more complex object stores, but for a smaller store like data export it adds unnecessary complexity.
- S3Proxy (https://github.com/gaul/s3proxy): Another possible option, but there is not enough documentation, so it was not pursued further for maintainability reasons.

Rationale

MinIO is the preferred option because of its easy setup and the minimal code changes required for data export.

Decision

Data export will use MinIO for cloud-agnostic object storage

Implications

Pros
- Greater flexibility - not tied to a specific cloud vendor
- Backwards compatibility - data export still works with AWS S3
- Few code changes are required
Cons
- If not using AWS S3 (or a compatible service), hosting providers will need to set up and maintain additional infrastructure (MinIO)

Other Related Resources
