Best practices for uploading large files

Hi all,

I'm looking for best practices for uploading large files (on the order of gigabytes).
The application's user base is in the thousands, and I don't want to block the servers if 100 users each need to upload a huge video file.

I already have some options on my plate, from Akamai and Azure Blob Storage to simple self-hosted solutions with Nginx or even IIS.

There's no need to act on the upload, and the users are distributed all over the world, so a cloud solution seems the most logical. The main requirement is to get the files from the users as quickly and reliably as possible.

Do you have any experience to share? Pitfalls? Edge cases to be aware of? Successful architectures?

Alexandre Simões (Manager / Technology Specialist) asked:

Rich Rumble (Security Samurai) commented:
You'll want to deduplicate the way Box, Dropbox and others do: hash the file before the upload to see if it already exists on the server, and if it does, skip the transfer and store a pointer to the existing file. The file names can differ; they have no bearing on the hash or the file size. The other part you'd want is compression before (or during) transmission. There are a variety of ways to do that, and you can probably find JavaScript libraries and utilities that help speed up the transfer. For downloads, use gzip transfer encoding to compress the files before sending them to the users, and the browser decompresses them for the user.
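A minimal sketch of the hash-before-upload check described above, assuming the server exposes the set of hashes it already holds (the function names here are illustrative, not any particular service's API):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so gigabyte files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def should_upload(path: str, known_hashes: set[str]) -> bool:
    """Client-side dedup check: skip the transfer if the server already has the bytes."""
    return file_sha256(path) not in known_hashes
```

In practice the client would send the digest to a dedup endpoint and only start the transfer if the server reports it unknown.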
btan (Exec Consultant) commented:
Cloud file storage, CDN, storage tiering, on-demand secure access and rate throttling are the areas I would focus on for such a use case. Uploading large files, especially video, can still be time-consuming, so below are some suggestions.

1. In an Azure context, differentiate between Block Blobs and Page Blobs. Choose Block Blobs mostly for streaming content: they are consumed in blocks, which is ideal for rendering in streaming solutions. Choose Page Blobs mostly for frequent-write workloads, such as VM image instances: a Page Blob lets you write to a specific range, so you don't need to rewrite the whole blob, which is time-consuming.
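The Block Blob flow is: stage each chunk under a base64 block ID, then commit the ordered ID list to assemble the blob. A sketch of that staging logic, with the actual network calls (the azure-storage-blob SDK's `stage_block` / `commit_block_list`) left as comments so the chunking itself is self-contained:

```python
import base64
import uuid

def stage_blocks(data: bytes, block_size: int = 4 * 1024 * 1024):
    """Split a payload into Block Blob-sized pieces, each with a base64 block ID.

    With the azure-storage-blob SDK you would call blob_client.stage_block(...)
    per piece, then blob_client.commit_block_list(...) to assemble the blob.
    """
    blocks = []
    for offset in range(0, len(data), block_size):
        block_id = base64.b64encode(uuid.uuid4().hex.encode()).decode()
        chunk = data[offset:offset + block_size]
        blocks.append((block_id, chunk))
        # blob_client.stage_block(block_id=block_id, data=chunk)   # network call
    # blob_client.commit_block_list([bid for bid, _ in blocks])    # finalize blob
    return blocks
```

Because blocks are staged independently, they can be uploaded in parallel and retried individually, which is exactly what you want for multi-gigabyte files.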

2. Consider creating and maintaining snapshots, especially if the content doesn't change much after upload. That doesn't mean the content is static per se, just that writes become less frequent. Snapshotting can also improve availability: a snapshot can be served to all users performing read-only operations, leaving the original blob for writes.

3. A CDN, as mentioned, is good for fast access because it leverages caching, serving cached content from the node nearest to the requesting user. This global reach is more attractive than an on-premise scheme using a WAN accelerator; the latter still helps, but a CDN typically comes with acceleration built in, and you can combine both for added value. Overall, a CDN reduces latency and increases availability by placing duplicates of the content closer to users. The CDN subscription does increase costs, but you will not be charged storage transaction costs for each user hitting the blob (if using Azure), since the client hits the CDN node rather than the storage account.

4. Consider a resumable data-transfer feature. Parallel uploads take time, and network latency kicks in with the limited bandwidth organisations usually have; resumable transfer lets a user resume an upload operation after a communication failure has interrupted the flow of data.

5. Secure access is also critical, so that not everyone is allowed to access everything with full permissions. This can indirectly complicate the storage strategy, and it is bad security practice not to adhere to the least-privilege principle; there have been incidents of cloud storage being breached and leaking data, which we must avoid. Enforce granular access, e.g. an owner access key known only to the authorised user and a shared access key for the authorised team. Permissions can then be set at the granularity of the blob, container, storage account, etc., depending on the provider's capabilities. It is also worth considering multi-factor authentication to verify the true user, since usernames and passwords tend to be weak, especially when users pick simple passwords for convenience. You can enforce password complexity, but at a minimum, guard the privileged admin accounts used for the cloud storage's remote administration.
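The shared-access idea boils down to signing "which resource, until when" with the account key, so a time-limited grant can be handed out without revealing the key itself. A toy sketch of that mechanism (this illustrates the concept behind Azure's shared access signatures; the real SAS format and parameters differ, and the key here is a demo stand-in):

```python
import hashlib
import hmac
import time

SECRET = b"account-key-demo"   # stand-in for the storage account key

def make_token(resource: str, expires_at: int, key: bytes = SECRET) -> str:
    """Sign 'resource|expiry' so the grant is tamper-evident without sharing the key."""
    msg = f"{resource}|{expires_at}".encode()
    sig = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return f"{resource}|{expires_at}|{sig}"

def check_token(token: str, now=None, key: bytes = SECRET) -> bool:
    """Accept only tokens with a valid signature that have not yet expired."""
    resource, expires_at, sig = token.rsplit("|", 2)
    msg = f"{resource}|{expires_at}".encode()
    good = hmac.compare_digest(sig, hmac.new(key, msg, hashlib.sha256).hexdigest())
    fresh = (now if now is not None else int(time.time())) < int(expires_at)
    return good and fresh
```

Anyone holding the token can use the resource until expiry, but cannot extend the window or point it at another resource without breaking the signature.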

This article on designing large-scale services (including tiering up the storage) is a good read for best practices. It sums up the approach to scale as: partition the load and compose it across multiple scale units, be that multiple VMs, databases, storage accounts, cloud services, or data centers.

David Johnson, CD, MVP (Owner) commented:
Upload to a cloud provider; they have the bandwidth and disk I/O speed to avoid blocking your cloud-based server. Video files are already compressed (unless users are sending raw AVIs), so gzip will probably add complexity for little or no reward; there may even be a penalty if the files are not compressible. Akamai is a huge cache, but I've received broken files from it in the past, usually when I've updated the RSS feed before the full content was cached by Akamai or CacheFly. What those two do is reduce your bandwidth by using a form of multicasting, and the cost savings can really add up when there are a lot of users trying to access the same data at the same time and the files are large. Geo-redundancy takes a bit of time (minutes, not hours), but it is still something to take into consideration. Asia and Australia are the primary problem areas, and the CDNs deal with them extremely well; Australia is a particular problem since it has limited pipes (getting better over time).

Data deduplication is only a factor when there is similar data and you want to save on storage costs. If thousands of users store Madonna's Ray of Light album, considerable savings are achievable. This is where the scale of the user base comes into play: the more users you have, the more likely duplicate content is being put into a centralized store. End-to-end encryption and encryption at rest, where the user holds the only decryption key, make the data pretty much incompressible, and the providers lose this economy of scale for storage, since each file (which is now really just a data stream) is unique because each is encrypted differently.

If you were referring to a local area network with limited bandwidth and disk IOPS (in comparison to a cloud provider, where 1,000 users is just a drop in the ocean), then 100 users uploading or downloading a huge file will fill your pipe and degrade service for others on the same LAN. In that case you have to do some traffic shaping to maintain quality of service for everyone.
btan (Exec Consultant) commented:
Adding on...

6. Drilling a bit into database storage: it is generally not healthy to store the file blob itself in a database table, compared to storing just a path or link to the place where the actual file can be retrieved. SQL Server, for example, has a FILESTREAM column type that does the latter. Furthermore, if what you store is a link or file path, it is more efficient to simply change that path when the file content is updated; the processing is far less resource-intensive than replacing the actual row contents, even partially, which is computationally heavy. Overall, storing the actual blob in a DB table can hinder your DB performance and will not improve file-retrieval performance. If you still need to, it is best to store large file blobs in a separate table and keep only a foreign-key reference to the blob in your main table; avoid duplicating it.
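A small SQLite sketch of that reference pattern: metadata rows point at a blob record that holds only the on-disk (or blob-storage) path and a content hash, so identical uploads reuse one blob row (table and function names are illustrative):

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    """Media rows keep only metadata; the bytes live on disk or in blob storage."""
    conn.executescript("""
        CREATE TABLE blobs (id INTEGER PRIMARY KEY, path TEXT NOT NULL,
                            sha256 TEXT UNIQUE);
        CREATE TABLE media (id INTEGER PRIMARY KEY, title TEXT,
                            blob_id INTEGER REFERENCES blobs(id));
    """)

def register_upload(conn, title: str, path: str, digest: str) -> int:
    """Reuse the existing blob row if the hash is already known (dedup by reference)."""
    row = conn.execute("SELECT id FROM blobs WHERE sha256 = ?", (digest,)).fetchone()
    blob_id = row[0] if row else conn.execute(
        "INSERT INTO blobs (path, sha256) VALUES (?, ?)", (path, digest)).lastrowid
    return conn.execute("INSERT INTO media (title, blob_id) VALUES (?, ?)",
                        (title, blob_id)).lastrowid
```

Updating the file then means rewriting one `path` value, not re-inserting gigabytes into the table.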

7. If there is a need for file conversion, cloud storage may not be the best place to do it offline, nor should users have to convert and re-upload themselves. It is optimal to have control and a consistent experience for such conversions, triggered by user request, in a secure and fast fashion. If interested, you can check out the that processes and converts the uploaded files according to your file-conversion instructions.
Alexandre Simões (Manager / Technology Specialist, Author) commented:
Hi guys, sorry for the delay.

Currently we're dealing with a lot of constraints, most of them more political than technical.
The solution we found uses chunked upload whenever the browser allows it, directly into the file share.
Streaming these files back to the client is a requirement we'll have to deal with later, as one of the requirements is to be able to tag the media files (pictures and videos).

For now, direct upload to the CDN is not possible because of security constraints... we'll have to revisit this later, especially for the streaming part.

Thank you very much for your inputs,