Netflix attacks own network with "Chaos Monkey"

Sabre
DCAWD Founding Member
Posts: 21432
Joined: Wed Aug 11, 2004 8:00 pm
Location: Springfield, VA

Post by Sabre »

Ars Technica:
Failure is the last thing you want when running a huge network, particularly one that supports a multi-billion dollar business. But preventing failure requires practice and good planning—and that's why Netflix developed software that attacks its own network more than 1,000 times a week.

By forcing Netflix engineers to recover from small failures that customers won't notice, the company hopes to prevent major outages in its video streaming service. Netflix calls the software it built to automate the process of causing failure a "Chaos Monkey," and today announced the release of Chaos Monkey's source code onto GitHub under the Apache License.

"We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient," Netflix engineer Cory Bennett and executive Ariel Tseitlin wrote in the Netflix tech blog today.

Like many businesses, Netflix hosts its infrastructure on the Amazon Web Services cloud. This allows companies to build out huge clusters of servers and storage without operating their own data centers, but it doesn't insulate them from failure. Businesses that run infrastructure on Amazon have to think about what happens both when Amazon services suffer outages and when their own software causes downtime.

Netflix's Chaos Monkey is "a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact," Netflix explained. "The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption."

Specifically, the Chaos Monkey randomly terminates virtual machines Netflix operates in Amazon's Auto Scaling service. In the past year, Netflix says its Chaos Monkey "has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."
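For anyone curious what "randomly terminates" looks like in practice, here is a rough sketch of the selection logic only. It is not Netflix's code and does not call AWS: the groups are plain Python lists, and `pick_victims`, `terminate`, and the `probability` parameter are hypothetical names chosen for illustration.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Return {group_name: instance_id} for the groups hit this round.

    groups: dict mapping a group name to a list of instance ids.
    probability: chance (0..1) that a given group loses one instance.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice once per group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Simulate termination by removing each victim from its group."""
    for name, instance_id in victims.items():
        groups[name].remove(instance_id)
    return groups
```

A real tool would replace `terminate` with a call to the cloud provider's API and rely on the scaling group to bring up a replacement.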

The Auto Scaling technology on Amazon's cloud should detect the termination of an instance and automatically configure a new, identical one to replace it. But the Chaos Monkey's random attacks can still suss out problems, like a patch gone wrong or a traffic load balancer that's failing to route requests around offline instances. While Netflix uses the Chaos Monkey on Amazon, it's flexible enough that it can be installed on other public cloud networks. By default, it only runs during business hours, so people are around to clean up the Chaos Monkey's mess when it identifies a serious problem.
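The "business hours only" guard is easy to picture as a small check that runs before any termination. The sketch below is an assumption-laden illustration, not the tool's real configuration: the function name and the 9-to-17 weekday window are made up for the example, since the article does not give the actual hours.

```python
from datetime import datetime


def within_business_hours(now, open_hour=9, close_hour=17):
    """True on Monday-Friday from open_hour (inclusive) to close_hour (exclusive).

    The default hours are illustrative assumptions, not values from the article.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and open_hour <= now.hour < close_hour
```

A scheduler would call something like `within_business_hours(datetime.now())` and simply skip the round when it returns False.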

Amazon's cloud infrastructure is divided into data center regions (like the East Coast or West Coast), which in turn are divided into availability zones. Customers are more likely to survive Amazon failures if they build systems that can fail over across availability zones or regions. Building across regions is the most expensive option, but also the most resilient, as failures have occurred across multiple availability zones on numerous occasions.

Last year, customers like reddit, Foursquare, and Quora experienced first-hand what can happen when multiple availability zones are hit with the same problem. Just last month, a power outage followed by the failure of Amazon's primary, backup, and secondary backup power systems took down many virtual machines and storage volumes in Amazon's East Coast region. And yes, even Netflix was taken offline by another outage at the end of June.

As such, Netflix's error detection efforts have to go beyond the scale of individual virtual machines. Netflix detailed its Chaos Monkey one year ago in a blog post that also revealed plans for various other chaos-inducing "monkeys." There's a Latency Monkey that introduces artificial delays into Netflix's RESTful client-server communication layer to simulate service degradation, and a Conformity Monkey that shuts down instances that don't adhere to best practices. There's even a Chaos Gorilla that acts like a Chaos Monkey but simulates an outage of an entire Amazon availability zone.
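The Latency Monkey idea, injecting artificial delay into calls to see how callers cope, can be sketched as a simple wrapper. This is only a guess at the general technique, not how Netflix implements it: `with_injected_latency` and its parameters are hypothetical, and the `sleep` argument exists just so the delay can be swapped out in tests.

```python
import random
import time


def with_injected_latency(func, probability, delay_seconds, rng=None, sleep=time.sleep):
    """Wrap func so that some calls are delayed before they run.

    probability: chance (0..1) that a given call is delayed.
    delay_seconds: how long a delayed call waits.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # Delay a fraction of calls to mimic a degraded dependency.
        if rng.random() < probability:
            sleep(delay_seconds)
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a client call this way lets you check whether timeouts and fallbacks in the caller actually kick in.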

While the Chaos Monkey is available to anyone who wants it today, there's no word yet on when or whether any of Netflix's other monkeys will be released into the wild. A posting on GitHub describes the Chaos Monkey as the "first member" of Netflix's Simian Army.

I love the idea... but can also see how this could be used for evil :twisted: Good stuff!
Sabre (Julian)
complacent
DCAWD Founding Member
Posts: 11651
Joined: Sun Aug 29, 2004 8:00 pm
Location: near the rockies. very.

Re: Netflix attacks own network with "Chaos Monkey"

Post by complacent »

they've really set the standard for scaling on aws. iirc they built one of those tools after the first east coast aws outage.

it's been a while though, i may be confused :lol:
colin

ElZorro
DCAWD Founding Member
Posts: 5958
Joined: Thu Aug 12, 2004 8:00 pm
Location: USA! USA!

Re: Netflix attacks own network with "Chaos Monkey"

Post by ElZorro »

fun.

This idea has been around for a long time outside the industry: running exercises to test readiness and identify defects (in process, training, equipment, communications, etc.) has been standard practice in the military and first responder community forever.
Jason "El Zorro" Fox