Google apologizes for cloud outage that was a 'comedy of errors'


Business Insider

Google executive chairman Eric Schmidt

Google is investing billions in its data centers and hiring salespeople like crazy to take on Amazon in the cloud computing industry. It just held a big conference in March to get the IT world excited about its cloud.

So it was not good that on Monday, the Google Compute Engine cloud service went down for 18 minutes at 7 p.m. Pacific for just about all of its customers everywhere.

Compute Engine is the cloud service that Google launched to compete head-to-head with Amazon Web Services' EC2. It lets companies rent space on Google's computers and access it over the internet.


While the world did not spin into an apocalyptic frenzy because of the outage (it didn't affect Google's regular services like search, Maps or Gmail), such a big outage was a black eye. Companies like Brightcove, DataStax, Evite, HTC and Zulily use Compute Engine, Google says.

More importantly, this is Google. Going dark for nearly 20 minutes just isn't supposed to happen. The company has systems and backup systems to prevent that.


So on Wednesday, Google published an apology and a lengthy explanation. It also offered to credit its customers with 10% to 25% of their monthly bill, more than the refund it promises in its service agreement.

The non-technical TL;DR version: Someone was doing a semi-routine update to the network and hit a bug. Then the automated failsafe software that should have caught the problem and automatically fixed it also hit a bug. Then the software went nuts and sent the wrong technical information across the whole network and boom, the network went down.

All told, that 18-minute outage caused Google to make "14 distinct engineering changes" to ensure its cloud won't go down like this again. And more changes are coming "as our engineering teams review the incident with other senior engineers across Google in the coming week."

The Google Cloud team says, "We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur."

But the whole thing still leaves a little egg on Google's face.
