
This page documents the techniques and configurations needed to allow reliable uploading of multi-gigabyte files over HTTP.

2020-09-05: This technique remains unused due to chronic time shortage. Old-school upload is used until further notice...

Basic flow

  • Only users listed as active "Teachers" are able to access the upload UI.
  • The vm.utu.fi/upload.html page features an upload element supporting both drag'n'drop and dialog-based file selection.
  • The upload mechanism is a client-controlled, server-passive, partitioned process using the HTML5 File API. In practice this means that:
    • The client sends the file in pieces (henceforth "chunks") and relies on the server's response statuses to determine whether each chunk was successfully delivered.
    • The server accepts any chunk, even if it has already been successfully received.
    • The client queries (GET request) whether a chunk has already been uploaded and skips such chunks. This is not a requirement, but simply how Flow.js works (resume support: the page can reload and the file transfer can be restarted).
    • The server blindly trusts the Flow.js data (total size, chunk sizes, number of chunks) and enforces no checks of its own. Again, this is how Flow.js has been designed; it does not really accept any control from the server.
    • The client considers the file upload a success once the last chunk has been accepted. In other words, Flow.js accepts no post-upload status information.
  • The server makes a fast space allocation (one null-byte write and file close), using the connection initialization parameters from the browser, to ensure that disk space will not accidentally run out during the transfer (see the first sketch after this list). (Note: will not work with ReiserFS.)
  • The client sends a checksum along with each chunk on every HTTP request. (THIS IS NOT IMPLEMENTED - TBA)
  • (TBA) The server calculates its own checksum from the received chunk and compares it to the checksum sent by the client. The OK or NOT OK response is set accordingly (see the second sketch after this list).
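
A minimal sketch of the fast space allocation described above, assuming the Flow.js flowTotalSize and flowIdentifier parameters are available from the first request. The directory and function names are illustrative, not the actual implementation:

import os

UPLOAD_DIR = "/var/www/vm.utu.fi/upload"   # assumed path; the real value comes from configuration

def preallocate(flow_identifier: str, flow_total_size: int) -> str:
    """Reserve the full file size up front: seek to the last offset,
    write a single null byte and close the file (sketch of the approach
    described above, not the actual routes.py code)."""
    path = os.path.join(UPLOAD_DIR, flow_identifier)
    with open(path, "wb") as f:
        if flow_total_size > 0:
            f.seek(flow_total_size - 1)
            f.write(b"\0")
    return path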
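
If and when the chunk checksums (the TBA items above) are implemented, the server-side comparison could look roughly like this. The chunkChecksum field name is hypothetical; Flow.js does not send one by default:

import hashlib

from flask import abort, request

def verify_chunk(chunk_bytes: bytes) -> None:
    """Compare the server-calculated SHA-256 of the received chunk against
    the checksum claimed by the client (hypothetical sketch)."""
    claimed = request.form.get("chunkChecksum")   # hypothetical field name
    if claimed is None:
        abort(400, "Missing chunk checksum")
    if hashlib.sha256(chunk_bytes).hexdigest() != claimed:
        abort(409, "Checksum mismatch - chunk must be resent")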

ISSUES

  • Parallel uploads are currently disabled. A lot of robustness testing is still needed...
  • The Flow.js HTML element for drag'n'drop and the file selection dialog does not seem to be configurable into a single-file-upload-only model. A custom element has been designed, and Flow.js is used only as a file upload module.

Cleanup of Interrupted and Unfinished Transfers

Because the server has no control over, or information about, the success or failure of a transfer, the only apparent way to clean out failed transfers is a background task that removes all unpublished files whose last-modified attribute is 24 hours old or older.
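
A minimal sketch of such a cleanup task, assuming the leftover chunks, .job and .error files all live in {UPLOAD_DIR}. The path and the exact age handling are illustrative:

import glob
import os
import time

UPLOAD_DIR = "/var/www/vm.utu.fi/upload"   # assumed path; the real value comes from configuration
MAX_AGE = 24 * 60 * 60                     # 24 hours, per the rule above

def cleanup_stale_uploads() -> None:
    """Remove unpublished upload artifacts not modified in 24 hours (sketch)."""
    cutoff = time.time() - MAX_AGE
    for path in glob.glob(os.path.join(UPLOAD_DIR, "*")):
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)

if __name__ == "__main__":
    cleanup_stale_uploads()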

Technical Notes

Old-school browser engine upload is often, for good reason, capped at only 1 MB (the Nginx default). It is not a good idea to remove this limit, or even to configure it so high that it supports all possible virtual machine image file sizes, because that leaves the whole server very vulnerable to a type of DoS attack in which resources are exhausted faster than they can be recovered.

Writing checks into the Flask application does not solve this issue, because anyone can initiate an upload to the API endpoint, and Nginx will receive up to the configured maximum number of bytes BEFORE invoking the Flask application with the completed request.

This solution, even with its chunk-based transmission scheme, is likely to need an increase to this limit (see below).

Nginx - 413 Request Entity Too Large

Apparently, the built-in maximum request size in Nginx is 1 MB.

To set the upload limit for all sites run by this instance of Nginx (DO NOT USE THIS), edit /etc/nginx/nginx.conf:

http {
    ...
    client_max_body_size 60M;
}

To set the upload limit for a path in a single site (USE THIS), edit /etc/nginx/sites-available/vm.utu.fi:

location / {
    include uwsgi_params;
    client_max_body_size 60M;
    uwsgi_pass unix:/run/uwsgi/app/vm.utu.fi/vm.utu.fi.socket;
}

Set the limit to TWICE the size of the uploaded chunks! (read below)

NOTE: The default configuration of Flow.js will result in the last chunk being up to 200% of the configured chunk size! This is by design, as the library was created for media uploads, which need either the beginning or the end of the media file to read its properties - it has a feature to send the head and tail of the file first, so that the server can analyze them while the rest of the bulk is still being transmitted. The head is always the size of the chunk, but in order to ensure that the last chunk is usable, they opted for a model where the last chunk equals the second-to-last and the last chunk bundled together.

This behavior can be configured out of the library, but because the value is entirely in the hands of the browser, it would be really stupid NOT to configure Nginx with 2x the size of the largest Flow.js chunk value.

Client-side library (Flow.js)

The version under development is based on Flow.js which, while lacking proper documentation, is mature and stable. This library can take care of the file transfer, but it is entirely unsuited for the next steps, which will have to be created specifically for this project (an initial search for a suitable dynamic REST API based form generator was deeply unsatisfactory).

Server-side processor

Flow.js GET and POST requests are both handled by the /api/file/flow endpoint (routes.py). Due to the sheer size of the uploads and the timeouts in HTTP(S) connections, the file is not assembled by the request handler when the final chunk has been uploaded. Instead, a .job file is written with the same flow ID as the uploaded chunks. The job file contains JSON with the values: owner, filename, size, chunks, flowid.
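
A rough sketch of such an endpoint, assuming the standard Flow.js parameters (flowChunkNumber, flowTotalChunks, flowIdentifier, flowFilename, flowTotalSize). The directory layout, chunk naming and the owner handling are illustrative, not the actual routes.py implementation:

import json
import os

from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = "/var/www/vm.utu.fi/upload"   # assumed path; the real value comes from configuration

def chunk_path(flow_id: str, number: int) -> str:
    return os.path.join(UPLOAD_DIR, f"{flow_id}.{number}")

@app.route("/api/file/flow", methods=["GET"])
def flow_test_chunk():
    """Flow.js resume support: 200 if the chunk already exists, 204 otherwise."""
    flow_id = request.args["flowIdentifier"]
    number = int(request.args["flowChunkNumber"])
    return ("", 200) if os.path.isfile(chunk_path(flow_id, number)) else ("", 204)

@app.route("/api/file/flow", methods=["POST"])
def flow_upload_chunk():
    """Store the chunk; when the last chunk arrives, write a .job file for the cron job."""
    flow_id = request.form["flowIdentifier"]
    number = int(request.form["flowChunkNumber"])
    total = int(request.form["flowTotalChunks"])
    request.files["file"].save(chunk_path(flow_id, number))
    if number == total:
        job = {
            "owner": "unknown",                      # illustrative; the real value comes from the session
            "filename": request.form["flowFilename"],
            "size": int(request.form["flowTotalSize"]),
            "chunks": total,
            "flowid": flow_id,
        }
        with open(os.path.join(UPLOAD_DIR, f"{flow_id}.job"), "w") as f:
            json.dump(job, f)
    return ("", 200)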

A cron job (flow-upload-processor.py) executes every minute (crontab: */1 * * * *) and scans {UPLOAD_DIR} for *.job files. It "tags" every one it finds by appending the PID to the .job suffix (.job -> .job.41231, for example). This prevents other possible (manual or other) invocations of this script from trying to process the same upload.

flow-upload-processor.py does three things:

  1. Creates the image file in {DOWNLOAD_DIR} and writes the chunk contents into it. The chunks and the .job file are then removed.
  2. If the suffix of the image file is .ova, .OVF extraction and parsing is attempted to provide better details for the database record (next step).
  3. A database record for the image file is created, containing all details that could be collected.

If the script fails while concatenating the image file from the chunks, the PID-tagged .job.{PID} file and the chunks remain in {UPLOAD_DIR}, but they will not be retried because the job file suffix is no longer strictly .job. For easier problem resolution, the script writes an .error file into the same directory with the associated error message. This is also useful for the progress feedback solution described below.
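
A rough sketch of this processing flow, assuming the chunks are stored as {flowid}.{chunkNumber} as in the endpoint sketch above. Paths and naming are illustrative, and the OVF parsing and database steps are omitted; this is not the actual flow-upload-processor.py:

import glob
import json
import os
import shutil

UPLOAD_DIR = "/var/www/vm.utu.fi/upload"       # assumed path
DOWNLOAD_DIR = "/var/www/vm.utu.fi/download"   # assumed path

def process_jobs() -> None:
    for job_file in glob.glob(os.path.join(UPLOAD_DIR, "*.job")):
        tagged = f"{job_file}.{os.getpid()}"   # .job -> .job.<PID>: no other invocation will pick it up
        os.rename(job_file, tagged)
        try:
            with open(tagged) as f:
                job = json.load(f)
            target = os.path.join(DOWNLOAD_DIR, job["filename"])
            with open(target, "wb") as image:
                for n in range(1, job["chunks"] + 1):
                    with open(os.path.join(UPLOAD_DIR, f"{job['flowid']}.{n}"), "rb") as chunk:
                        shutil.copyfileobj(chunk, image)
            # Success: remove the chunks and the tagged job file
            for n in range(1, job["chunks"] + 1):
                os.remove(os.path.join(UPLOAD_DIR, f"{job['flowid']}.{n}"))
            os.remove(tagged)
            # (.ova/.OVF parsing and the database record creation would follow here)
        except Exception as exc:
            # Leave the .job.<PID> file and the chunks in place; record the error for diagnostics
            with open(job_file[:-len(".job")] + ".error", "w") as err:
                err.write(str(exc))

if __name__ == "__main__":
    process_jobs()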

Progress feedback for user

The approach described above lacks any progress reporting to the user after the last chunk has been uploaded, which is why a separate solution is being created using SSE (Server-Sent Events).

TO-BE-ADDED-LATER
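
Until that documentation is written, the following is a purely illustrative sketch of what an SSE progress endpoint could look like in Flask, using the .job and .error files described above as the status source. The route and the event format are assumptions, not the actual design:

import json
import os
import time

from flask import Flask, Response

app = Flask(__name__)
UPLOAD_DIR = "/var/www/vm.utu.fi/upload"   # assumed path

@app.route("/api/file/flow/<flow_id>/events")   # hypothetical route, not in routes.py
def upload_events(flow_id: str):
    """Stream post-upload processing status as Server-Sent Events."""
    def stream():
        while True:
            if os.path.isfile(os.path.join(UPLOAD_DIR, f"{flow_id}.error")):
                status = "error"
            elif any(name.startswith(f"{flow_id}.job") for name in os.listdir(UPLOAD_DIR)):
                status = "processing"
            else:
                status = "done"
            yield f"data: {json.dumps({'status': status})}\n\n"
            if status != "processing":
                return
            time.sleep(2)
    return Response(stream(), mimetype="text/event-stream")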