Spike: Investigate limiting document size on upload

Requirements: MODINVOICE-125

A solution for enforcing a size limit on invoice documents needs to be developed.

HTTP/REST API File Uploads

  1. Base64 encode the file and add processing overhead in both the server and the client for encoding/decoding.
  2. Send the file and metadata both in a multipart/form-data POST.
  3. Send the file first in a multipart/form-data POST, and return an ID to the client. The client then sends the metadata with the ID, and the server re-associates the file and the metadata.
  4. Send the metadata first, and return an ID to the client. The client then sends the file with the ID, and the server re-associates the file and the metadata.


Method 1 — Base64 encode the file and add processing in both the server and the client for encoding/decoding
  Advantages:
  • Simple to use
  • Simple control of document size (in a schema)
  Disadvantages:
  • Increases the data size by around 33%
  • Processing overhead in both the server and the client for encoding/decoding
  • In the particular FOLIO case this will not work, because the body of the request is read into memory for ALL PUT requests and for POST requests

Method 2a — Send the file and metadata both in a multipart/form-data POST
  Advantages:
  • Supported by RMB
  • Simple control of document size (in a raml)
  Disadvantages:
  • No possibility to send the metadata as validated JSON, only as primitive types
  • Schema changes
  • Originally planned to send the file and metadata as part of a composite request, but raml does not support a json type in properties [] (only built-in types)

Method 2b — Send the file and metadata both in an application/octet-stream POST
  Advantages:
  • Supported by RMB
  • Working approach, easy to implement
  Disadvantages:
  • Unclear how to control the uploaded file size

Method 3 — Send the file first in a multipart/form-data POST, and return an ID to the client. The client then sends the metadata with the ID, and the server re-associates the file and the metadata
  Advantages:
  • Supported by RMB
  • Simple control of document size (in a raml)
  Disadvantages:
  • Two separate requests, one with the file and one with the metadata
  • Logic changes on both the UI and BE sides
  • Schema changes

Method 4 — Send the metadata first, and return an ID to the client. The client then sends the file with the ID, and the server re-associates the file and the metadata

Limiting document size

For the particular FOLIO case, the following options are the most interesting.

Base64 encode the file (current implementation)

In this approach it is enough to introduce, in the schema, a limit on the Base64-string length, taking into account that the encoding overhead is about 33% of the file size (the encoded size is roughly 4/3 of the original).

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Object with base64 encoded file data",
  "type": "object",
  "properties": {
    "data": {
      "description": "Base64 encoded file data",
      "type": "string",
      "maxLength": 1.3 * DOCUMENT_SIZE_LIMIT
    }
  },
  "additionalProperties": false,
  "required": [
    "data"
  ]
}
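The ~33% figure follows from Base64 mapping every 3 input bytes to 4 output characters, with the last partial group padded up. A minimal stdlib check (the class and method names here are mine, not part of the module):

```java
import java.util.Base64;

// Demonstrates the Base64 size overhead used to derive the schema's maxLength.
public class Base64Overhead {

    // Encoded length for n input bytes: every 3-byte group becomes 4 characters,
    // and padding rounds the last partial group up.
    public static int encodedLength(int n) {
        return ((n + 2) / 3) * 4;
    }

    public static void main(String[] args) {
        byte[] tenBytes = new byte[10];
        String encoded = Base64.getEncoder().encodeToString(tenBytes);
        // 10 bytes encode to 16 characters: ((10 + 2) / 3) * 4
        System.out.println(encoded.length());
        // For large files the ratio approaches 4/3, i.e. ~33% overhead, so the
        // schema's maxLength should be at least ceil(DOCUMENT_SIZE_LIMIT * 4 / 3).
    }
}
```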


But! From the RMB code:

// IMPORTANT!!!
// the body of the request will be read into memory for ALL PUT requests
// and for POST requests with the content-types below ONLY!!!
// multipart, for example will not be read by the body handler as vertx saves
// multiparts and www-encoded to disk - hence multiparts will be handled differently
// see uploadHandler further down

This means that we cannot solve the OOM issue if the file size is larger than the module heap size.

Multipart/form-data POST

This approach provides precise file size control and better performance. However, it requires considerable effort to change the schemas and logic on both the client side and the service side. Nevertheless, it may be implemented in the future if file upload performance and unification of the file loading mechanisms become important.

types:
  invoice_document_file:
    type: file
    fileTypes: ['*/*']
    maxLength: DOCUMENT_SIZE_LIMIT
  fileUpload:
    properties:
      invoiceDocument:
        type: string
      invoiceDocumentFile:
        description: The file to be uploaded
        required: true
        type: invoice_document_file

................

/documents:
  displayName: Document
  description: Manage documents associated with invoice
  post:
    description: Post document attachment/link
    is: [validate]
    body:
      multipart/form-data:
        type: fileUpload


Upload by 2 requests

This solution can be implemented but requires big changes to the current implementation: we need to send one ordinary invoice document request with the metadata and another with the file, then combine the two to save the file.
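The re-association step can be sketched with a framework-free, in-memory stand-in for the real storage (all class and method names here are hypothetical; in the module the store would be the database, not a map):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the two-request flow: metadata first, file second,
// re-associated by the ID returned from the first request.
public class TwoStepUploadSketch {
    private final Map<String, String> pendingMetadata = new ConcurrentHashMap<>();
    private final Map<String, byte[]> storedFiles = new ConcurrentHashMap<>();

    // Request 1: accept validated metadata JSON, return an ID for the follow-up request.
    public String postMetadata(String metadataJson) {
        String id = UUID.randomUUID().toString();
        pendingMetadata.put(id, metadataJson);
        return id;
    }

    // Request 2: accept the raw file bytes and re-associate them by ID.
    public boolean postFile(String id, byte[] fileBytes) {
        String metadata = pendingMetadata.remove(id);
        if (metadata == null) {
            return false; // unknown or already-completed upload
        }
        storedFiles.put(id, fileBytes);
        return true;
    }

    public boolean isComplete(String id) {
        return storedFiles.containsKey(id);
    }
}
```

The sketch also shows why this option costs more: the server must track half-finished uploads and decide what to do with metadata whose file never arrives.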

Summary

Option 1 - not acceptable;

Option 2 - looks good but has implementation problems;

Option 3/4 - looks good but requires uploading in two requests.

UPD 04/14/2020: Since only application/octet-stream uploading is available in FOLIO for now, the previous approaches no longer look easy or effective. A PoC based on the application/octet-stream approach is presented below.

PoC based on application/octet-stream approach

The application/octet-stream mechanism allows controlling the request size in the following manner. For this approach we need to know the approximate size of the JSON request content: the metadata plus the Base64-encoded file size (the original invoice document size plus ~33%). For example, with a file limit of 10 MB and about 1 MB of other JSON content, the MAX_DOCUMENT_SIZE limit parameter should be approximately 14 MB. This is not very precise, but it should be enough for request size control.
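The sizing arithmetic above can be made explicit (a back-of-the-envelope sketch only; the class and method names are mine):

```java
// Rough sizing of the overall request limit, using the numbers from the text:
// a 10 MB file limit plus ~1 MB of other JSON content.
public class LimitSizing {

    // Base64-encoded size plus a metadata allowance.
    public static long maxRequestSize(long fileLimitBytes, long metadataAllowanceBytes) {
        long base64Size = ((fileLimitBytes + 2) / 3) * 4; // ~33% Base64 overhead
        return base64Size + metadataAllowanceBytes;
    }

    public static void main(String[] args) {
        long limit = maxRequestSize(10L * 1024 * 1024, 1L * 1024 * 1024);
        // ~13.3 MB encoded file + 1 MB metadata, i.e. roughly the
        // "approx. 14 MB" figure from the text.
        System.out.println(limit);
    }
}
```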

  1. Define the POST API in the raml-file using the application/octet-stream content type:

          post:
            description: Create a new <<resourcePathName|!singularize>> item.
            body:
              application/octet-stream:
  2. Refactor existing code for POST request processing:

      private byte[] requestBytesArray = new byte[0];
      private static final int MAX_DOCUMENT_SIZE = 350000000; // ~350 MB PoC limit

      @Validate
      @Stream
      @Override
      // This method is executed once per chunk of the stream, on the sole instance
      // of the public API interface implementation. The stream carries the entire
      // request body, including the Base64-encoded file.
      public void postInvoiceInvoicesDocumentsById(String id, String lang, InputStream stream, Map<String, String> okapiHeaders, Handler<AsyncResult<Response>> asyncResultHandler, Context vertxContext) {
        DocumentHelper documentHelper = new DocumentHelper(okapiHeaders, vertxContext, lang);
        try {
          // This code is executed once per chunk until RMB adds the "complete" header
          // to indicate end-of-stream.
          if (Objects.isNull(okapiHeaders.get("complete"))) {
            // requestBytesArray is null once the limit has been exceeded; skip any
            // further chunks in that case (this also avoids an NPE on .length)
            if (Objects.nonNull(requestBytesArray)) {
              if (requestBytesArray.length < MAX_DOCUMENT_SIZE) {
                // No oversize yet: append the chunk bytes to the accumulated array
                requestBytesArray = ArrayUtils.addAll(requestBytesArray, IOUtils.toByteArray(stream));
              } else {
                // Oversize detected: drop the accumulated bytes to prevent memory overloading
                requestBytesArray = null;
              }
            }
          } else {
            // This code is executed once, after all chunks have been processed
            if (Objects.isNull(requestBytesArray)) {
              // Complete with a "document is too large" error
              documentHelper.addProcessingError(DOCUMENT_IS_TOO_LARGE.toError());
              asyncResultHandler.handle(succeededFuture(documentHelper.buildErrorResponse(422)));
            } else {
              // No oversize: proceed with the ordinary logic
              InvoiceDocument entity = new JsonObject(new String(requestBytesArray, StandardCharsets.UTF_8)).mapTo(InvoiceDocument.class);
              if (!entity.getDocumentMetadata().getInvoiceId().equals(id)) {
                documentHelper.addProcessingError(MISMATCH_BETWEEN_ID_IN_PATH_AND_BODY.toError());
                asyncResultHandler.handle(succeededFuture(documentHelper.buildErrorResponse(422)));
              } else {
                documentHelper.createDocument(id, entity)
                  .thenAccept(document -> {
                    logInfo("Successfully created document with id={}", document);
                    asyncResultHandler.handle(succeededFuture(documentHelper.buildResponseWithLocation(String.format(DOCUMENTS_LOCATION_PREFIX, id, document.getDocumentMetadata().getId()), document)));
                  })
                  .exceptionally(t -> handleErrorResponse(asyncResultHandler, documentHelper, t));
              }
            }
          }
        } catch (Exception e) {
          handleErrorResponse(asyncResultHandler, documentHelper, e);
        }
      }
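The accumulation pattern above can be restated without any RMB or Vert.x dependencies, which makes it easy to unit-test. This is a framework-free sketch (class and method names are mine); it also replaces the repeated full-array copies of ArrayUtils.addAll with a growable buffer:

```java
import java.io.ByteArrayOutputStream;

// Chunks are buffered until a "complete" signal arrives; once the limit is
// crossed, the buffer is dropped and all later chunks are ignored.
public class ChunkAccumulator {
    private final int maxSize;
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private boolean oversize = false;

    public ChunkAccumulator(int maxSize) {
        this.maxSize = maxSize;
    }

    // Called once per stream chunk.
    public void onChunk(byte[] chunk) {
        if (oversize) {
            return; // already over the limit; discard further chunks
        }
        if (buffer.size() + chunk.length > maxSize) {
            oversize = true;
            buffer = null; // free the accumulated memory immediately
            return;
        }
        buffer.write(chunk, 0, chunk.length);
    }

    // Called once when the "complete" signal arrives; null means "too large".
    public byte[] onComplete() {
        return oversize ? null : buffer.toByteArray();
    }
}
```

Note that unlike the PoC, this variant checks the limit before appending, so a single huge chunk can never push the buffer past maxSize.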