Skype's substantial period of downtime last week has been traced to overloaded servers triggering a bug in the most widespread version of the Windows Skype client, the company has reported on its blog. At the height of the problem, only a few hundred thousand users were showing up online; normally, the voice and video chat service boasts in excess of 20 million online users.
The initial problem was that on December 22nd the servers handling offline messaging became overloaded and slow to respond. Windows client version 5.0.0152, which about 50 percent of users were running, responded to this condition by crashing; about 40 percent of those clients failed. Skype uses a peer-to-peer network to transfer data. Some nodes on the network are chosen as "supernodes"; these supernodes carry out additional coordination duties, routing traffic and performing directory lookups. The widespread client failures caused 25 to 30 percent of the supernodes to fail, which in turn overloaded the remaining peer-to-peer network, a problem exacerbated by the large number of Windows users restarting their crashed clients. The result was widespread service outages.
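The snowballing described above can be illustrated with a toy model. This is only a sketch under invented assumptions, not a description of how Skype's network actually balances load: the function name `cascade`, the capacity figures, and the rule that traffic is spread evenly over the surviving supernodes are all hypothetical.

```python
def cascade(capacities, total_load):
    """Return how many supernodes survive once overloads stop spreading.

    Hypothetical model: total_load is shared evenly by the nodes still
    alive, and any node asked to carry more than its capacity fails.
    """
    alive = list(capacities)
    while alive:
        load_per_node = total_load / len(alive)
        survivors = [c for c in alive if c >= load_per_node]
        if len(survivors) == len(alive):
            break  # nobody is overloaded, so the network is stable
        alive = survivors  # failed nodes drop out; their load is re-shared
    return len(alive)


# Share of all clients that crashed, from the figures reported above:
# about 50 percent ran the affected version, and about 40 percent of those failed.
crashed_share = 0.5 * 0.4

# Ten made-up supernodes with made-up capacities, sharing 100 units of traffic.
capacities = [13, 13, 14, 14, 15, 15, 16, 16, 17, 17]
after_losing_two = cascade(capacities[2:], 100)    # 20 percent lost up front
after_losing_three = cascade(capacities[3:], 100)  # 30 percent lost up front
```

In this toy setup, losing two of the ten nodes leaves the remaining eight stable, while losing three pushes the survivors past their capacity in successive rounds until none are left. The threshold is an artifact of the invented numbers; the point is only that an evenly shared load can turn a partial loss into a total one.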
Though users running other versions of the Skype software—both older and newer—did not suffer from the initial crashing issue, the failure of the peer-to-peer network caused even them to lose service.
Skype eventually restored its network by bringing online a large number of additional high-capacity supernodes to handle the extra load and allow clients to connect properly.
The incident points to a certain kind of fragility of Skype's network. Any issue that causes widespread client crashes is liable to deplete the number of supernodes, and as this incident shows, that can have a catastrophic effect on the rest of the network.
The company was quite vague about how it hopes to prevent similar issues in the future. A more aggressive update policy would have worked wonders in this case—at the time of the problem, a newer client that didn't have the same crashing bug was available for Windows, so if this had been installed automatically, the widespread failures probably would not have occurred. Though the company says that it will re-evaluate its automatic update process, at the time of writing clients running the affected version are still not offered an automatic update.
Skype is by no means the first company to have a service outage after a supposedly robust network infrastructure suffered wholesale failures as a problem snowballed. Facebook went down for several hours in September after a problem in its fault-tolerant architecture caused repeated system failures, and Amazon's S3 cloud storage service has had outages, again due to errors propagating through a network. Though designed to be robust (Skype, for example, can easily tolerate the loss of a few supernodes as people quit the client), the distributed and decentralized nature of these networks seems at times to come with a tendency to distribute problems, greatly magnifying them. The impact of the Skype client crashes was felt by more than just users of version 5.0.0152.
Skype brought down by double whammy of overloaded servers, client bugs