Monday, October 12, 2009

DDoS Attack Hits Amazon Cloud Customer Hard

UPDATE: This story has been revised to clarify that the downtime was isolated to a single customer, and to include comments from Amazon Web Services.

(WEB HOST INDUSTRY REVIEW) -- As several sources have reported this week, including The Register, a customer of Amazon’s cloud computing service reported 19 hours of downtime as a result of a DDoS attack. The customer is Bitbucket (www.bitbucket.org), a web-based code hosting service that uses both EC2 and Amazon’s Elastic Block Store (EBS).

The disruption, as most of the reports have mentioned, is certain to raise questions about the resilience and stability of cloud hosting solutions, and to invite scrutiny of the quality of Amazon’s response to customer situations like this one.

In an email, Amazon Web Services public relations manager Kay Kinton writes, "The customer’s Amazon EC2 instance was receiving a very large amount of network traffic. This large flood of traffic exceeded the networking of the customer’s single Amazon EC2 instance and caused performance to degrade on all network operations for that instance (including access to the EBS volume). Once we properly diagnosed the problem, we worked with the customer to put measures in place to help mitigate the unwanted traffic they were receiving. We have continued to work with the customer to apply network filtering techniques which have kept their site functioning properly."

Bitbucket operator Jesper Nøhr spent a great deal of energy venting his frustrations about the situation via Twitter over the weekend and on Monday. He also published a detailed account of the outage timeline on the company’s blog.

He says in the post that, while he was advised not to reveal the source of the outage (according to The Register, because Amazon felt revealing the attack might give future attackers fuel for designing new means of attacking the service), he believed he owed it to customers to explain what went wrong.

“We were attacked. Bigtime,” he wrote. “We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn’t read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDOS. That’s nice.”
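The mechanism Nøhr describes, a flood and EBS reads competing for the same network link, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name, the proportional-drop model and the traffic figures are hypothetical and do not come from Bitbucket or Amazon.

```python
def ebs_share_mbps(link_capacity_mbps, flood_mbps, ebs_demand_mbps):
    """Estimate the throughput left for EBS reads on a shared network link.

    Hypothetical model: when the link is overloaded, traffic is dropped in
    proportion to what each source offers. Real networks behave in more
    complicated ways; this only shows the general effect.
    """
    offered = flood_mbps + ebs_demand_mbps
    if offered <= link_capacity_mbps:
        # The link is not saturated, so EBS gets everything it asks for.
        return ebs_demand_mbps
    # The link is saturated, so EBS receives only its proportional share.
    return link_capacity_mbps * ebs_demand_mbps / offered


# Illustrative figures only: a 1,000 Mbps link and 200 Mbps of EBS reads.
print(ebs_share_mbps(1000, 0, 200))     # no flood
print(ebs_share_mbps(1000, 9800, 200))  # large UDP flood on the same link
```

Under these assumed numbers, the flood leaves EBS with a small fraction of the throughput it needs, even though the storage volume itself is working normally.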

Nøhr’s blog post, like much of the discussion that has flowed from the incident, is more a look at Amazon’s response and a debate about the nature of customer service than an investigation into the technical issues at play. His post concludes by saying Bitbucket is taking measures to ensure an outage of this kind won’t happen again, and that those measures might include moving to a new hosting provider.

In the final analysis, he doesn’t come down too hard on Amazon, noting that the company devoted a lot of support attention to his cause after a frustrating (and exceedingly long) preliminary period of about 10 hours.

After the service was originally restored, Bitbucket reportedly suffered two further attacks, one on Sunday morning and one early Monday.

The technical question seems to be how an attack using external traffic was able to disrupt the connection between Bitbucket’s EC2 and EBS resources, which one might reasonably consider an “internal” network connection, and why Amazon didn’t have the means in place to more quickly identify the outage as the result of a DDoS attack.

Kinton points out that AWS may have to do more work to help educate customers on the availability of measures to mitigate the impact of unexpected spikes in traffic.

"We are working to make sure we learn from this and continue to improve the speed with which we and our customers can diagnose issues like this," she says. "We will also provide additional guidance to our customers on how they can better detect this sort of problem and use existing features of EC2 to run a highly scalable and available website. These features include taking advantage of EC2’s scale-out ability to run multiple instances in multiple availability zones and regions, the Elastic Load Balancing feature which helps users filter and balance traffic among their instances and the Auto-Scaling feature which allows users to scale in response to changes in customer traffic."

In the comment section of the Bitbucket blog post, Nøhr wrote that he wouldn’t answer many of the technical questions posted in the comments, but that the company’s system admins “are writing a follow-up post outlining some more meaty details with numbers and graphs.”

Source: http://www.thewhir.com/web-hosting-news/100609_Outage_Hits_Amazon_Cloud_Customer_Hard
