### THE SITUATION

We were recently involved in the addition of Tripwire Interactive's FPS Killing Floor 2 to PS Plus, a subscription service offering "free" games to in excess of 26 million gamers.

Prior to this, Killing Floor 2 had a core player base on PC, peaking at around 3,000 concurrent players daily.


### THE PROBLEM

Predicting the impact of granting access to millions of PS4 owners is as challenging as it is terrifying! It's fair to expect that a game of this quality will see a huge uptake; the hugely successful Rocket League is a great example of how much this can boost a game's popularity.

The traditional approach to handling a sudden increase in demand is to be over-prepared and over-provisioned. It's hard to think of many things more damaging than gamers simply not being able to play when they want to. This is hugely amplified during a launch, but can be disastrous in normal operation as well.

So how many players should you plan for? Whether you go for 10,000, 70,000 or 100,000, it's very rarely correct.

Evidently the result is going to be one of three things:

  • You get your estimates wrong. You thought you'd get far more players, which means you spent too much and reduced your profit.
  • You get your estimates wrong. You thought you'd have far fewer players, and you don't have enough capacity to serve them.
  • You get your estimates right. You should consider a career as a sorcerer (Please send us your CV).

### THE SOLUTION

Multiplay's Hybrid Scaling technology (Clanforge) uses both bare metal and multiple cloud providers to remove this guesswork. Typically we achieve this by doing the following:

  1. Use our best estimates and modelling to provision a level of bare metal hardware. In normal, day-to-day operation, this should handle most of your gameplay sessions.
  2. Configure multiple cloud providers (instance types, availability zones etc.) to be able to start up and provision capacity on demand.
  3. Set thresholds at which we trigger this. Essentially, we always try to maintain a pool of unused capacity; if we drop under this amount, VMs are requested without any human interaction (a simplified sketch of this logic follows this list).
  4. Monitor the initial period of the launch, around 2-3 days, and add more bare metal where needed. This achieves the sweet spot on cost, ensuring we only use as much cloud as we need for peaks and unexpected trends.
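
To make that threshold idea concrete, here is a minimal sketch of the kind of scale-up decision described in step 3. This is not Clanforge code; the names (`free_capacity`, `vms_to_request`) and the numbers are purely illustrative, and the real system tracks far more state (regions, providers, instance types and so on).

```python
# Hypothetical sketch of threshold-based scale-up; not actual Clanforge code.

BUFFER_TARGET = 200   # game-server slots we always want free (illustrative value)
VM_SLOT_COUNT = 16    # server instances one VM can host (illustrative value)


def free_capacity(total_slots: int, allocated_slots: int) -> int:
    """Slots that are online but not currently serving players."""
    return total_slots - allocated_slots


def vms_to_request(total_slots: int, allocated_slots: int) -> int:
    """If the free pool drops below the target buffer, request enough VMs
    to restore it; otherwise request nothing."""
    shortfall = BUFFER_TARGET - free_capacity(total_slots, allocated_slots)
    if shortfall <= 0:
        return 0
    # Round up so a partial shortfall still triggers a whole VM.
    return -(-shortfall // VM_SLOT_COUNT)


# Example: 1,000 slots online, 900 in use -> only 100 free against a buffer of 200,
# so we ask a cloud provider for ceil(100 / 16) = 7 new VMs.
print(vms_to_request(1000, 900))  # -> 7
```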

The intended result of this system is, to be frank, a boring launch regardless of the influx of players. We love boring launches. Boring launches for us mean that the players are happy.

In the moments where games are exploding and the stream of players seems endless, it's very easy to be blinded by the success. This could, without any bad intentions, lead to you having thousands of bare metal and virtual instances burning a huge hole in your pocket once the player count drops or the next shiny game comes along. Multiplay has experienced this first hand in the form of some hugely popular games whose lifespans were far shorter than they deserved.

To tackle this, Clanforge will wait for a set period of time for a VM to become empty and ensure that it's shut down as quickly as is safe to do so. We'll keep it around for another pre-determined time for re-use, after which it'll go away for good. This cycle of provisioning and deprovisioning is constant and highly effective, ensuring you only have capacity online for the time it's needed.
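
As an illustration only, the scale-down side of that cycle might look something like the sketch below. The state names, timings and the `VM` structure are assumptions made for the example, not Clanforge's actual implementation.

```python
# Hypothetical sketch of the scale-down lifecycle described above; illustrative only.
from dataclasses import dataclass
from typing import Optional

STANDBY_SECONDS = 60 * 60   # how long an empty VM is kept around for re-use (illustrative)


@dataclass
class VM:
    name: str
    player_count: int
    empty_since: Optional[float] = None   # when the VM last became empty


def next_action(vm: VM, now: float) -> str:
    """Decide what to do with a VM that has been marked for scale-down."""
    if vm.player_count > 0:
        vm.empty_since = None
        return "wait"              # players still in session; it isn't safe to stop yet
    if vm.empty_since is None:
        vm.empty_since = now       # the VM has just drained: start the re-use clock
    if now - vm.empty_since < STANDBY_SECONDS:
        return "hold-for-reuse"    # keep the empty VM around in case demand returns
    return "deprovision"           # unused for the whole standby window: release it for good
```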


### THE RESULT

In the case of the Killing Floor 2 launch, the first two days went extremely well; peak CCU was around 17,000 on day one and 50,000 on day two:

ALTTEXT

The team at Multiplay was very happy with this uptake and it was particularly exciting to see our platform take on another hugely popular game and work flawlessly.

ALTTEXT

The above graph shows our scaling for the first two days of this release: the total number of server instances (copies of the executable) running for one of our busier regions. The light green line shows the total number of "allocated" or in-use instances. The dark green (and mostly obscured) line shows the total number of started/running instances. The blue line is the total created capacity of server instances.

As you can see, as the day progresses we constantly create additional capacity (as shown by the yellow line at the bottom) and stay just ahead of the curve. Our blue "Total" line covers the line showing running servers for the majority of the time; this is because all created capacity is online and either in use or in a "hot standby" state. Typically we'll see these lines separate as capacity is shut down during the quieter times of day.

Our approach for handling this increase worked as planned. Initially, our cloud partners quickly provisioned new VMs when requested and we didn't have a single instance of players failing to find a game. Following this initial period, the trend from the first two days was entered into our modelling tool to give us the quantity of additional bare metal needed. Whilst our hybrid scaling platform takes care of the initial, unpredictable demand, we're quick to act to keep costs down.

ALTTEXT

This graph shows our split between Bare Metal, Google Cloud Platform and Amazon EC2. The majority of our burst capacity was provisioned into GCP, with EC2 serving Australia and South America.

Beyond the capacity provided by these cloud providers, we also gain huge amounts of resiliency from having both different companies and multiple availability zones within each region. One example from Killing Floor 2's use of the platform: we managed to, yet again, prove that cloud is not infinite. Whilst scaling very quickly on the second evening, we utilised all available capacity of one machine type within one AZ. The benefit of having multiple AZs configured for the region is clear in this example, as our system detected the shortage, disabled the AZ and scaled into an alternate one.
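
Purely to illustrate that failover pattern, a sketch of provider/AZ selection is given below. The candidate list, the `OutOfCapacity` error and the idea of a temporary cooldown on an exhausted AZ are all assumptions made for the example, not necessarily how Clanforge is implemented.

```python
# Hypothetical sketch of failing over across availability zones / providers
# when one runs out of capacity; illustrative only.
from typing import Dict, Tuple

# Ordered preference of (provider, availability zone) pairs for one region (illustrative).
CANDIDATES = [("gcp", "europe-west1-b"), ("gcp", "europe-west1-c"), ("aws", "eu-west-1a")]

# AZs temporarily disabled after a capacity error, mapped to when they re-enable.
disabled_until: Dict[Tuple[str, str], float] = {}


class OutOfCapacity(Exception):
    """Raised by a (hypothetical) provider client when an AZ has no capacity left."""


def provision_vm(candidate: Tuple[str, str]) -> str:
    """Placeholder for a real provider API call; may raise OutOfCapacity."""
    raise NotImplementedError


def provision_with_failover(now: float, cooldown: float = 600.0) -> str:
    """Try each AZ in preference order, skipping ones recently marked as exhausted."""
    for candidate in CANDIDATES:
        if disabled_until.get(candidate, 0.0) > now:
            continue                    # AZ is on cooldown after a recent capacity error
        try:
            return provision_vm(candidate)
        except OutOfCapacity:
            disabled_until[candidate] = now + cooldown   # disable it and try the next one
    raise RuntimeError("no capacity available in any configured AZ")
```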


In conclusion, it's very clear to see the power that a hybrid scaling solution offered in this case. We were able to deliver game servers to hundreds of thousands of players without them seeing any impact from, at times, a 2,500% increase in player numbers.

If you have any questions or would like to know more, please feel free to get in touch!
