Note: This originally appeared on my personal blog, which I no longer maintain due to time constraints, so I have moved it here for reference to the code.
The SaaS product that we are currently redeveloping needs to store images uploaded by the user before sending them off to a publishing site. We also want to extend this functionality to include video among the media types available to the user.
At the moment the system grabs a multi-part form post and stores the entire byte-array of the image in the database, along with the rest of the details of that piece of content. When the platform was originally built (2012), this may have sufficed; however, now that we have much larger image files and the need to store videos up to 1 GB in size, this approach is no longer ideal.
I decided to implement storage via AWS S3 – it’s nice and fast, has lots of locations around the world and some great capabilities as far as security and access control are concerned. I have had a little experience in the past integrating with the SDKs, but mostly from PHP – and this bad boy is built in Java (not my first language of choice, nor the one I am most versed in). I built a class that just wrapped up what I needed it to do: sending a file blob to an S3 bucket of my choosing in a region of my choosing, attaching a nice bunch of meta-data along the way. All was going well when I had the epiphany – why not move the S3 integration to the client-side? Then we wouldn’t have to deal with the upload bandwidth at all.
The benefits of moving to a front-end solution: it would be quicker for the end-user (the file would go straight to S3, rather than to our server and then on to S3); it would save bandwidth on our server (and ultimately money); and it would save on disk space requirements on our server – happy days.
Now, it would be remiss of me not to outline the drawbacks of this change. First, it would take more time to implement than the server-side S3 integration (though this is certainly outweighed by the benefits). The other, much bigger issue was that if we were to integrate via the front-end, our AWS credentials would be exposed for the world to see (a great little solution to this problem, next). All in all, I think the benefits greatly outweighed the drawbacks, so we were a go on the front-end integration.
As stated previously, one of the drawbacks of integrating with AWS from the front-end is that you need to expose your AWS credentials to the world in your JavaScript. So the solution was to do the following:
- Create a temporary bucket with a CORS configuration that allows only PUT requests, and only from our domain origins
- Create a final storage bucket with NO CORS configuration, meaning it cannot be uploaded to from browser JavaScript
- Add a 24-hour file expiry (lifecycle) policy to the temporary bucket (this will probably come down considerably after launch) – so files will automatically be deleted after this period of time
- Create an IAM user in AWS that has very limited capabilities – ONLY S3 access and ONLY the ability to upload (no delete, move, etc.) – this will be used by the JavaScript
- Create an IAM role in AWS that has much more powerful capabilities (still only S3 access, but more control over all aspects of S3) – this will be used by the Java backend
- Upload the file to the temporary bucket from javascript, using the credentials for the limited IAM user
- Send the details of the file (along with the rest of the form details) to our server
- Our server will move (or rather, copy) the file from one bucket to another (note: the file would never go to our server), using the more powerful IAM role
- Store the file details in the database along with the rest of the form data
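The client-side part of the steps above can be sketched roughly as follows. This is not the actual gist code – the bucket name, region, credential placeholders and function names are all my own illustrative assumptions – and it assumes the AWS SDK for JavaScript (v2) has been loaded globally via a script tag:

```javascript
// Sketch of the browser-side upload. Assumes the AWS SDK for JavaScript v2
// is loaded globally (e.g. via a <script> tag); all names are placeholders.

// Build a collision-resistant object key for the temporary bucket.
function buildObjectKey(fileName) {
  var safeName = fileName.replace(/[^a-zA-Z0-9._-]/g, '_');
  return 'uploads/' + Date.now() + '-' +
    Math.random().toString(36).slice(2, 10) + '-' + safeName;
}

// Upload the file straight to the temporary bucket using the limited
// (upload-only) IAM user's credentials.
function uploadToTempBucket(file, done) {
  var s3 = new AWS.S3({
    accessKeyId: 'LIMITED_USER_KEY',       // upload-only IAM user
    secretAccessKey: 'LIMITED_USER_SECRET',
    region: 'us-east-1'                    // placeholder region
  });
  var key = buildObjectKey(file.name);
  s3.putObject({
    Bucket: 'my-temp-upload-bucket',       // placeholder bucket name
    Key: key,
    Body: file,
    ContentType: file.type,
    Metadata: { 'original-name': file.name }
  }, function (err) {
    // Hand the key back so it can be posted to our server with the form.
    done(err, key);
  });
}
```

The key returned from `uploadToTempBucket` is what gets sent along with the rest of the form details, so the server knows which object to copy across to the final bucket.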
Obviously we would need to support content with both the old blob-storage images AND the new S3-URL-stored images, but that is a single check on the DB record, and it is only required in a couple of places in the rest of the codebase.
Using the limited IAM user and the multi-bucket system, alongside Amazon’s lifecycle file management in S3, means the worst that can happen is that someone maliciously uploads a really large file to our S3 bucket, and it gets deleted automatically after a short period of time (granted it would be 24 hours to begin with, but this will come down).
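For reference, the policy attached to the limited IAM user might look something like this (the bucket name is a placeholder, not our real one) – it grants `s3:PutObject` on the temporary bucket and nothing else, which is what caps the damage a leaked key can do:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-temp-upload-bucket/*"
    }
  ]
}
```

With no `s3:GetObject`, `s3:DeleteObject` or `s3:ListBucket` in the policy, the exposed credentials cannot read, list or remove anything – only add objects that the lifecycle rule will sweep away anyway.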
Here is a cut-back gist of the code – note that it is written as plain functions; in a future build of the platform this will all get wrapped up in a nice component within a larger framework, but for now we needed to get this going as fast as we could.
I will go through a more technical document for the actual code at a later date, but the code has a few comments if you want to follow along at home… 😉
https://gist.github.com/3cb2180f19fc59e482d2
Of course there can always be improvements, and the big one here will be to implement signed URLs, which would mean that I would no longer need to have the AWS secret in the front-end at all. I would still use the temp bucket, as that means the main bucket cannot be touched via JavaScript or CORS, adding a small amount of extra security.
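As a rough sketch of that improvement (shown in JavaScript for brevity, though in this platform it would live in the Java backend; the bucket, region and content-type whitelist are my own assumed placeholders, and it assumes the AWS SDK for JavaScript v2 is available as a global `AWS`): the server, which holds the credentials, hands the browser a short-lived presigned PUT URL, and the browser simply PUTs the file to that URL.

```javascript
// Sketch of the signed-URL improvement. Assumes the AWS SDK for JavaScript
// v2 is available as a global `AWS`; all names below are placeholders.

// Only hand out URLs for the media types the platform actually supports.
function isAllowedContentType(contentType) {
  var allowed = ['image/jpeg', 'image/png', 'image/gif', 'video/mp4'];
  return allowed.indexOf(contentType) !== -1;
}

// Server-side: generate a short-lived PUT URL for the temp bucket, so the
// browser never needs to see any AWS credentials at all.
function createUploadUrl(key, contentType) {
  if (!isAllowedContentType(contentType)) {
    throw new Error('Unsupported media type: ' + contentType);
  }
  var s3 = new AWS.S3({ region: 'us-east-1' }); // placeholder region
  return s3.getSignedUrl('putObject', {
    Bucket: 'my-temp-upload-bucket',            // placeholder bucket
    Key: key,
    Expires: 300,                               // URL valid for 5 minutes
    ContentType: contentType
  });
}
```

The browser then uploads with a plain `PUT` to the returned URL; because the URL is scoped to one key, one content type and a five-minute window, even an intercepted URL is of very limited use.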
I am sure I will think of more improvements – in fact this entire solution will probably look positively archaic in a few months! 😉