By Barry Eitel
SAN FRANCISCO (AA) – Amazon said Thursday human error was the likely culprit for a massive outage earlier in the week that slowed down or took offline thousands of Internet services and websites.
It appears a single programming misstep caused a cascade of events that resulted in intermittent outages for Amazon clients ranging from government websites to music streaming services.
One of the main data storage systems for Amazon Web Services (AWS), the company’s cloud computing platform for businesses, went offline Feb. 28.
The system, known as S3, is used by almost 150,000 websites.
Not all of the sites were taken down completely by the S3 outage, but many loaded considerably slower while the issue was being resolved. Others were apparently not affected.
Pinterest, Airbnb, Netflix, Slack, Buzzfeed and Spotify all use the S3 platform. Though many sites were back up within hours, the issue dragged on for 3.5 hours.
Amazon said an engineer accidentally entered a single erroneous command.
The engineer intended to take offline a small subset of servers for debugging, but the command instead took down a much larger group of servers.
Because many of the S3 servers require others to work properly, the mistake caused a waterfall of outages.
Amazon said it hopes to avoid a repeat by partitioning servers, like the index subsystem server where the error first occurred, in a way that keeps a single failure from ballooning.
“As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery,” the company said. “During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”