Twitter has just explained the causes of — and remedies for — its multiple and massive failures over the past week.
The high number of errors and generally poor performance were reminiscent of the Great Twitter Outages of the summer of 2009. Once again, this summer’s problem has been one of scale: Twitter is growing so much and so quickly that the engineering team has struggled to keep up with the sheer volume of data going through the service’s internal network.
According to engineer Jean-Paul Cozzatti, this week’s Twitter issues stemmed from three critical mistakes by the engineering team:
- The team put two important, fast-growing, high-bandwidth components on the same segment of Twitter’s internal network.
- The network wasn’t being monitored the way it should have been.
- The internal network was also temporarily misconfigured.
Cozzatti went on to outline what Twitter is doing to keep the same mistakes from being repeated. He wrote that the company has doubled the capacity of its internal network, improved how that network is monitored and rebalanced its traffic.
“For much of 2009,” he wrote, “Twitter’s biggest challenge was coping with our unprecedented growth (a challenge we happily still face)… But as this week’s issues show, there is always room for improvement.
“Based on our experiences this week, we’re working with our hosting partner to deliver improvements on all three fronts. By bringing the monitoring of our internal network in line with the rest of the systems at Twitter, we’ll be able to grow our capacity well ahead of user growth.”
Did Twitter’s errors have any impact on your professional or personal life this week? And how do you think the service will hold up during the World Cup? As Twitter continues to grow in cultural importance and user adoption, will it ever be as reliable as the services run by other, larger tech companies?