Server issues during launch – why?

37, 1004, 3102, 90000… seemingly random and innocuous numbers. However, if you play video games, these numbers may be familiar to you. Let me explain – error 37 was what greeted many fans of Diablo 3 during its launch. Error 1004 was an issue experienced by some SimCity players when it launched this year, but the number was picked to help illustrate my point – the fact of the matter is that SimCity gave good old text error messages when it launched, or rather, attempted to launch. The last two… well, they’re the reason I’m writing this article, and they both belong to the most recent game to experience launch issues – Final Fantasy XIV.

Wait a second, you say – Final Fantasy XIV? But its release date is listed as August 27th, which isn’t even here yet! Well, not quite – a week ago the game went through its last beta test, an open beta designed to stress the servers, which also carried the perk of not including a server wipe. Yes, it was essentially open beta combined with early access. Issues were discovered, because that’s what a beta test is for, and then this weekend arrived, scheduled as the actual early access period; full launch is occurring on the aforementioned August 27th. Sounds like a wonderful plan, doesn’t it? Get a bunch of people in to load up the servers, make sure everything is okay, then unleash the game upon the world. Except it didn’t go so well, and many players have been unable to get into their accounts, experiencing the errors above or other general server load issues. (I’ve got to feel like there’s a tiny bit of irony there considering Final Fantasy XIV is actually in the process of re-launching after its initial launch presented a game that was met with low review scores and disappointed players.)

However, I’m not writing this to pick on FFXIV. In fact, like other games before it, I know it will weather the launch issues and eventually be judged based on the game it is, and I wish everyone at Square Enix the best with the game – so far I’ve enjoyed what I’ve been able to play of it, if that is worth anything. What I’ve really got to wonder is why companies still struggle to get their products launched smoothly. I question that because the number of players expecting to get into these games is a known quantity – in the age of digital sales and distribution, a company can look at sales records and instantly know what kind of population they’re going to have to support. Even adding a physical distribution method as FFXIV has done, there is still the fact that you must create an account and register a key to that account to get access to the game – another easy tracking point. If you know, almost exactly, the maximum number of people who will be knocking on the door come launch day, why haven’t you done everything you can to prepare?

I think there are two major factors that need to be considered in answering that question. First, the optimization of the software itself and understanding of load characteristics so you know how many servers you’re going to need to support your incoming player base. Second, once you’ve established those load numbers, there needs to be a balance struck between under-provisioning your service and causing load issues, and over-provisioning and causing a financial burden upon yourself with no tangible benefit. As a point of clarification here, when I say “server” I’m talking about a single physical or virtualized instance of hardware, whereas often in the online game space “server” applies to what also can be referred to as a “shard” – an instance of the game world, identical but separated from other instances of the game world, generally for the purpose of load management. To confuse the issue, a shard may actually be made up of multiple servers.

I feel like load testing, understanding load characteristics, and general server-side optimization should be well-known quantities at this point. In the early to mid 2000s I was on a team launching and operating some of the highest-traffic Canadian-based websites around, and we regularly ran into load problems that had to be dealt with. To be clear, some of these sites were games in and of themselves, and while an online game is still more complex than those, the techniques that went into our load profiling would be similar for an online game. Unfortunately, these activities often fall under the purview of quality assurance, and many in the industry will tell you these poor souls are asked to move mountains with too few people and not enough money in too little time. That may not be the case in any of the examples above, but I wonder if it had an impact.

Let’s say it didn’t, and perhaps a company has a precise understanding of exactly how many people they can squeeze into a server before things start to go sideways. With that out of the way, the discussion becomes “How many servers do we launch with?” Overshoot and you’ve spent time, effort, and money provisioning that hardware for no reason. Undershoot, and, well… error 37. This is where I think a couple of adjustments to launch day thinking could make all of the difference for future titles.
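To make the overshoot/undershoot trade-off a bit more concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name `servers_needed`, the idea of estimating peak load as a fixed fraction of registered accounts, the headroom percentage, and every number in the example are made up, not figures from any real launch.

```python
import math


def servers_needed(registered_accounts, peak_concurrency_ratio,
                   players_per_server, headroom=0.25):
    """Estimate how many servers to provision for launch day.

    All inputs are hypothetical planning numbers:
      registered_accounts    -- accounts with a registered key (the "known quantity")
      peak_concurrency_ratio -- assumed fraction of accounts online at peak (0..1)
      players_per_server     -- capacity per server, as measured by load testing
      headroom               -- extra safety margin on top of the expected peak
    """
    if players_per_server <= 0:
        raise ValueError("players_per_server must be positive")
    if not 0 <= peak_concurrency_ratio <= 1:
        raise ValueError("peak_concurrency_ratio must be between 0 and 1")

    expected_peak = registered_accounts * peak_concurrency_ratio
    # Round up: a fraction of a server still means one more machine.
    return math.ceil(expected_peak * (1 + headroom) / players_per_server)


if __name__ == "__main__":
    # Made-up example: 1,000,000 accounts, half online at peak,
    # 5,000 players per server, 25% headroom.
    print(servers_needed(1_000_000, 0.5, 5_000))  # prints 125
```

The interesting part is not the arithmetic but the inputs: the account count is known ahead of time, the per-server capacity should fall out of load testing, and the headroom is where the financial argument happens.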

Before discussing the financial reasons, it is important to point out and deal with one of the major gameplay reasons not to launch with too many shards (and therefore servers). Any new game is going to have an influx of people who want to check it out and see if they enjoy it, and obviously the concentration of these players is going to be highest at launch. At some point after launch these people will move on, which can leave a shard with a low population. In terms of an online game, this can be incredibly frustrating for those still playing on that shard, perhaps causing some of them to leave, creating a downward spiral effect. In my mind, the best answer here is simply to do away with shards altogether. They came about in a time when technology was not at a point that would support everyone on a single shard, and they were the simple answer. I no longer see a good technological reason for them, so let’s stop using them.

Having eliminated shards, the next step is having the appropriate amount of hardware on hand for launch. There is clearly a large financial incentive to not overspend on hardware that will sit idle shortly after launch. However, having worked with hardware vendors for years, I suspect it wouldn’t be hard to persuade a vendor to come to some kind of arrangement during the launch window in exchange for something as simple as publicity. Can you imagine millions of players experiencing a smooth Diablo 3 launch with a graphic during the loading screen that said “This smooth launch brought to you by Dell”? There’s actually some precedent for this – it’s well known that CCP uses IBM hardware in the environment upon which they run EVE Online (and presumably Dust 514 now as well). Maybe I’m completely wrong on this point, but I believe there’s potential on both sides of that agreement for some substantial benefit. Alternatively, there’s room to discuss offloading processing to scalable cloud services like AWS rather than hosting all of your own processing, although that’s something you have to be planning for pretty early on and is a whole different discussion.

Am I talking crazy talk over here? Are these completely unreasonable thoughts? I have no idea, because I don’t work in the gaming industry. I do have an awful lot of experience in making stuff work online though, and I think these are some steps in the right direction. If somebody has a launch in the future and wants to test me on it, I’m open to offers! ;)

3 comments
SatorriSynoptic

The question that has to be answered to get to the bottom of this is why each of those companies had the issues they did. Blizzard, EA, Square? These are TITANS, not rinky-dink indies. Presumably they have the resources to support substantial online facilities beyond any single project; I mean, I don't know about Square, but Blizzard and EA manage online metasystems.

Here's the thing I've seen, though, working for a consumer goods titan:

Lots of money and resources doesn't mean they get spent well, or that having teams specialized in every discipline means those teams get used effectively. Unfortunately, the more money a company makes, the more it starts to pinch pennies to try and squeeze up margins. The more teams you have, the more it becomes a challenge of communication to get the important messages to the people who actually make the decisions for directing efforts. My company has a massive infrastructure with project planning and operation tools to try and empower communication and make sure things are brought up early and visible to stakeholders (decision makers), and things still become a mess often enough.

I agree with you. Each company has invested in making persistent and substantial online operations, and Blizzard/EA are doing it across multiple franchises. They have no excuse for not investing in creating a way to smooth launches. We, as players, have watched this happen again and again. If they're smart, they've been collecting info and recognizing the trend in behavior, because the scale may change, but I'd wager the trend itself does not vary much. With a company that size, I'd imagine it would be worth investing in swing support. I am imagining they have massive centers where they rent infrastructure, so I wouldn't think it would be a hard/bad idea to negotiate a provision with their supplier to have a short-term hardware bump.

But, engineer-get-to-the-root'ing aside, hehe, there is really no reason they shouldn't be learning and improving on these things. I am most curious to see Blizzard's next major launch (Titan, perhaps? Though that feels still very far away).

dmarkd moderator

@SatorriSynoptic Good points - I had meant to include the fact that these are large companies making these mistakes, but forgot. That's what I get for writing this at 1am.

The only thing I can come up with is that as these are game companies they are focusing on the "game" portion of the launch and ignoring the "online" part. That's the only way I've been able to justify the attitude and results that have been demonstrated by these companies in the past while.

SatorriSynoptic

@dmarkd Sounds almost too gracious, especially in the face of EA/SimCity's insistence on "online gaming is the future." They deliberately modelled the game so that it could only be played online. I appreciate their design, their ambitions, and I like the doors it opens (I won't be one of the people clamoring that it MUST BE OFFLINE).
But if you're going to assert that, you would *think* that they would put some effort in to back it up.
That said, I usually default to the stance that these are big honking companies who I know have departments dedicated to their online infrastructure. So, if they end up having issues, I assume that there was something they didn't anticipate, or some monkey wrench they didn't see coming. Sometimes it is just oversights. I don't want to believe they're knowingly short-sheeting themselves, but then I can imagine a business case for that as well.
I suppose this is just another thing that could be sorted under the Golden Triangle. Cost, Quality, and Speed. Current business seems to favor fixing cost (market determined), and short-changing quality for speed. It is more important that it comes out at a particular time, and the quality stuff will be resolved or accepted after launch with only some attrition.
Basically, I'm just saying we should be running our own company and trying to hold up our own ideals about quality designs and implementation, because you know.... I'm sure no one else in the business started that way. ha ha ha