For the last little while I’ve been working on moving some of our services to Amazon Web Services, with my most recent work focusing on document storage. One of the more interesting problems has been creating zip files for documents stored in Amazon S3. Multi-file download is one of the most commonly used tools in our application, so the aim is to make this process as quick as possible and minimize the time users spend waiting for their download to begin. After experimenting with several approaches and tools, I came across a simple solution that obliterates all the others in compression speed, and even in simplicity of implementation.
Before we get to the good stuff, let’s step back and look at the typical approach to zipping files on S3. Given that S3 has no native support for doing this, the selected files must first be downloaded into an EC2 instance, and then compressed using your toolkit of choice. Our initial steps looked something like this:
- Submit a POST request to a web service to initiate the zip process, along with the S3 keys of the files to be compressed
- The request is then placed in a queue so that the next available EC2 server can process it. Initially we used SQS to queue the requests, but its complete lack of ordered queuing rendered it quite useless for our purposes, so we used a list in Redis to maintain an ordered queue of zip requests instead. Along with the queue entry, we generate a unique zip identifier that is returned to the client.
- Once the next available EC2 instance pops the entry off the queue, it begins downloading the files from S3 and building the final zip. Once the zip is created, the local files are cleaned up and the new zip file is pushed back to S3.
- The completed request is then recorded in Redis under the previously generated zip identifier
- During this time, the client has been continuously polling to check if the zip has completed (you could use long polling here as well). Once the server finds the zip identifier in Redis, it responds with the S3 download link and the user can proceed to download the file.
- Clean up the zip files in S3 after a day or so using a scheduled task or S3’s new object expiration option
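The queue-and-poll flow above can be sketched end to end. This is a minimal illustration rather than our production code: a `ConcurrentLinkedQueue` stands in for the ordered Redis list, an in-memory map stands in for the Redis lookup of completed zips, and names like `ZipRequest` and `submitZipRequest` are hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

public class ZipQueueSketch {
    // A zip request: the generated identifier plus the S3 keys to compress.
    record ZipRequest(String zipId, List<String> s3Keys) {}

    // Stand-in for the ordered Redis list of pending requests.
    static final Queue<ZipRequest> queue = new ConcurrentLinkedQueue<>();
    // Stand-in for the Redis lookup of completed zips: zipId -> S3 download link.
    static final Map<String, String> completed = new ConcurrentHashMap<>();

    // Web-service side: enqueue the keys and hand back a unique zip identifier.
    static String submitZipRequest(List<String> s3Keys) {
        String zipId = UUID.randomUUID().toString();
        queue.add(new ZipRequest(zipId, s3Keys));
        return zipId;
    }

    // Worker side (EC2): pop the next request, build the zip, record the result.
    static void workerStep() {
        ZipRequest req = queue.poll();
        if (req == null) return;
        // ...download req.s3Keys() from S3, build the zip, push it back to S3...
        completed.put(req.zipId(), "https://s3.amazonaws.com/bucket/zips/" + req.zipId() + ".zip");
    }

    // Polling endpoint: returns the download link once the worker has finished.
    static Optional<String> pollZip(String zipId) {
        return Optional.ofNullable(completed.get(zipId));
    }
}
```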
First off, the toolkit you use to build your zips is absolutely crucial, so invest in a good one. Since our application is written in Scala, I initially started with the native Java zip tools, which were painfully slow. I then moved to the Chilkat zip tools, which were an order of magnitude faster since they are written in C and accessed from Java over JNI. Chilkat has implementations for nearly every popular programming language, and at only $249 for a site-wide license it’s money well spent; it serves as the basis for this solution.
So now that we have a basic framework, how can we improve on this naïve approach? Upon inspecting the Chilkat API, I noticed the existence of a QuickAppend method, which appends one zip to another. I began wondering how the compression time would be affected if we pre-zipped each file in S3, in its destination directory structure, and then simply appended them all together to form the final zip. To my amazement, the difference in compression time was astonishing. Small zip files in the 100KB–300KB range saw a 2x–3x speed improvement, while those larger than 10MB saw a 10x–15x improvement. For example, a 14MB zip with 25 files varying in size from 100KB to 8MB took a mere 120ms to assemble into the final zip, while building the zip from scratch took over 1.5 seconds.
An additional benefit is that if your users store lots of files that compress well, there’s less data to download from S3 to EC2 to create the zip in the first place. The degree of compression of the original files also affects the speed of the QuickAppend operation: highly compressed files can achieve speed improvements of up to 25x. Most files in my tests were only moderately or lightly compressed, as they consisted of PDF and image documents.
The obvious downside of this approach is that you have to store two copies of each file, one in its original form, and another in compressed form. In our case the speed advantages outweigh the added storage costs.
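To make the append idea concrete, here is a sketch using only `java.util.zip`: entries from each pre-built zip are copied into a single output zip. One important caveat: Chilkat’s QuickAppend splices the already-compressed entry data from one archive into another, which is where the speed comes from; the stdlib sketch below decompresses and recompresses each entry, so it illustrates the shape of the operation rather than its performance.

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class ZipMerge {
    // Copy every entry of a pre-zipped file into the output zip.
    // (Chilkat's QuickAppend would splice the raw compressed bytes instead;
    // java.util.zip re-deflates each entry, so this is slower than the real thing.)
    static void appendZip(ZipOutputStream out, Path preZipped) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(preZipped))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                out.putNextEntry(new ZipEntry(entry.getName()));
                in.transferTo(out);
                out.closeEntry();
            }
        }
    }

    // Build the final zip by appending each pre-zipped file in turn.
    static void merge(Path target, Path... preZippedFiles) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            for (Path p : preZippedFiles) appendZip(out, p);
        }
    }
}
```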
The final architecture has to change somewhat, as we have to build services that zip each file at upload time and store the zipped copy in S3. In this case SQS is a viable solution, since the uploaded files don’t have to be processed in a strict sequence. If a user happens to download files immediately after uploading them, your zip-building code will also have to deal with the possibility of the pre-zipped file not being ready yet. The final zip implementation becomes quite trivial:
- Download the first pre-zipped file from S3 to your EC2 server
- Iterate subsequent pre-zipped files by downloading them and appending them to the first file
- If a pre-zipped file is not found, download the original and zip it, then append it
- Upload the completed zip back to S3
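The four steps above reduce to a short loop with a fallback for pre-zipped copies that aren’t ready yet. In this sketch a local directory stands in for the S3 bucket, and the bucket layout (`prezipped/<key>.zip` holding the pre-zipped copy, `originals/<key>` holding the raw file) is hypothetical; a real version would issue S3 GetObject calls, and would use Chilkat’s QuickAppend rather than `java.util.zip`, which recompresses on append.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

public class FinalZip {
    static void buildFinalZip(Path bucket, List<String> keys, Path target) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            for (String key : keys) {
                Path preZipped = bucket.resolve("prezipped").resolve(key + ".zip");
                if (Files.exists(preZipped)) {
                    // Append the pre-zipped entries. With Chilkat this would be a
                    // single QuickAppend call splicing the compressed data directly.
                    try (ZipInputStream in = new ZipInputStream(Files.newInputStream(preZipped))) {
                        ZipEntry e;
                        while ((e = in.getNextEntry()) != null) {
                            out.putNextEntry(new ZipEntry(e.getName()));
                            in.transferTo(out);
                            out.closeEntry();
                        }
                    }
                } else {
                    // Pre-zipped copy not ready yet: fetch the original and zip it on the fly.
                    out.putNextEntry(new ZipEntry(key));
                    Files.copy(bucket.resolve("originals").resolve(key), out);
                    out.closeEntry();
                }
            }
        }
        // ...upload the completed zip at `target` back to S3...
    }
}
```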
If you have any other techniques for zipping files on AWS, please share them in the comments.
If you liked this post, please follow me on Twitter or upvote it on Hacker News and Reddit.