Shared posts

14 Apr 06:57

This is how the new Widerøe aircraft became "the world's most efficient aircraft" - see video of the landing in Norway

by Per Erlien Dalløkken

The aircraft arrived at Flesland on Thursday afternoon.

26 Oct 13:13

Fixing Problems

'What was the original problem you were trying to fix?' 'Well, I noticed one of the tools I was using had an inefficiency that was wasting my time.'

31 Jul 13:29

Tracking down the Villains: Outlier Detection at Netflix

by Philip Fisher-Ogden
It’s 2 a.m. and half of our reliability team is online searching for the root cause of why Netflix streaming isn’t working. None of our systems are obviously broken, but something is amiss and we’re not seeing it. After an hour of searching we realize there is one rogue server in our farm causing the problem. We missed it amongst the thousands of other servers because we were looking for a clearly visible problem, not an insidious deviant.

In Netflix’s Marvel’s Daredevil, Matt Murdock uses his heightened senses to detect when a person’s actions are abnormal. This allows him to go beyond what others see to determine the non-obvious, like when someone is lying. Similar to this, we set out to build a system that could look beyond the obvious and find the subtle differences in servers that could be causing production problems. In this post we’ll describe our automated outlier detection and remediation for unhealthy servers that has saved us from countless hours of late-night heroics.

Shadows in the Glass

The Netflix service currently runs on tens of thousands of servers; typically less than one percent of those become unhealthy. For example, a server’s network performance might degrade and cause elevated request processing latency. The unhealthy server will respond to health checks and show normal system-level metrics but still be operating in a suboptimal state.

A slow or unhealthy server is worse than a down server because its effects can be small enough to stay within the tolerances of our monitoring system and be overlooked by an on-call engineer scanning through graphs, but still have a customer impact and drive calls to customer service. Somewhere out there a few unhealthy servers lurk among thousands of healthy ones.

[Figure: per-server NIWSErrors time series; one outlier is hard to see. Can you spot it?]
The purple line in the graph above has an error rate higher than the norm. All other servers have spikes but drop back down to zero, whereas the purple line consistently stays above all others. Would you be able to spot this as an outlier? Is there a way to use time series data to automatically find these outliers?

A very unhealthy server can easily be detected by a threshold alert. But threshold alerts require wide tolerances to account for spikes in the data. They also require periodic tuning to account for changes in access patterns and volume. A key step towards our goal of improving reliability is to automate the detection of servers that are operating in a degraded state but not bad enough to be detected by a threshold alert.

Finding a Rabbit in a Snowstorm

To solve this problem we use cluster analysis, which is an unsupervised machine learning technique. The goal of cluster analysis is to group objects in such a way that objects in the same cluster are more similar to each other than those in other clusters. The advantage of using an unsupervised technique is that we do not need to have labeled data, i.e., we do not need to create a training dataset that contains examples of outliers. While there are many different clustering algorithms, each with their own tradeoffs, we use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to determine which servers are not performing like the others.

How DBSCAN Works

DBSCAN is a clustering algorithm originally proposed in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. The technique iterates over a set of points and groups those lying in dense regions, with many nearby neighbors, into clusters, while marking points in lower-density regions as outliers. Conceptually, if a particular point belongs to a cluster it should be near lots of other points as measured by some distance function. For an excellent visual representation of this, see Naftali Harris’ blog post on visualizing DBSCAN clustering.
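The dense-region idea can be sketched in a few lines. The following is a minimal, illustrative DBSCAN over one-dimensional values (think per-server error rates); the numbers are made up and this is not Netflix's implementation — a production system would use a tuned library version:

```python
def dbscan(points, eps, min_pts):
    """Label each 1-D point with a cluster id, or -1 for noise/outliers."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        # all points within distance eps of point i (including i itself)
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # sparse region: mark as outlier
            continue
        labels[i] = cluster             # dense region: start a new cluster
        seeds = list(nbrs)
        while seeds:                    # expand the cluster outward
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j)) >= min_pts:
                seeds.extend(neighbors(j))  # only core points keep expanding
        cluster += 1
    return labels

# Made-up per-server error rates: five servers behave alike, one does not.
rates = [0.10, 0.12, 0.11, 0.09, 0.13, 0.95]
print(dbscan(rates, eps=0.05, min_pts=3))  # → [0, 0, 0, 0, 0, -1]
```

The five similar servers end up in one cluster; the isolated 0.95 receives the conventional noise label -1, which is exactly the "not performing like the others" signal.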


To use server outlier detection, a service owner specifies a metric which will be monitored for outliers. Using this metric we collect a window of data from Atlas, our primary time series telemetry platform. This window is then passed to the DBSCAN algorithm, which returns the set of servers considered outliers. For example, the image below shows the input into the DBSCAN algorithm; the red highlighted area is the current window of data.
In addition to specifying the metric to observe, a service owner specifies the minimum duration before a deviating server is considered an outlier. After detection, control is handed off to our alerting system that can take any number of actions including:

  • email or page a service owner
  • remove the server from service without terminating it
  • gather forensic data for investigation
  • terminate the server to allow the auto scaling group to replace it
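This hand-off might be sketched as follows (all names, ids, and thresholds here are illustrative, not Netflix's code): a server flagged by the clustering step is only escalated to the alerting system once it has been deviating for the configured minimum duration, and servers that recover are forgotten.

```python
MIN_OUTLIER_SECONDS = 300   # hypothetical owner-configured minimum duration

first_seen = {}  # server id -> timestamp when it first looked like an outlier

def escalate(outliers, now, actions):
    """outliers: set of server ids flagged this cycle; actions: callables
    such as paging the owner, quarantining, or terminating the server."""
    # forget servers that have recovered
    for server in list(first_seen):
        if server not in outliers:
            del first_seen[server]
    confirmed = []
    for server in outliers:
        first_seen.setdefault(server, now)
        if now - first_seen[server] >= MIN_OUTLIER_SECONDS:
            confirmed.append(server)
    for server in confirmed:
        for action in actions:          # hand off to the alerting system
            action(server)
    return confirmed

escalate({"i-0abc"}, now=0, actions=[])    # just flagged: not yet confirmed
escalate({"i-0abc"}, now=600, actions=[])  # still deviating: now escalated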

Parameter Selection

DBSCAN requires two input parameters for configuration: a distance measure and a minimum cluster size. However, service owners do not want to think about finding the right combination of parameters to make the algorithm effective in identifying outliers. We simplify this by having service owners define the current number of outliers, if there are any, at configuration time. Based on this knowledge, the distance and minimum cluster size parameters are selected using simulated annealing. This approach has been effective in reducing the complexity of setting up outlier detection and has facilitated adoption across multiple teams; service owners do not need to concern themselves with the details of the algorithm.
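Simulated annealing itself is a generic minimization loop: propose a neighboring parameter setting, always accept improvements, and occasionally accept regressions with a probability that shrinks as the "temperature" cools. Below is a minimal sketch with a toy objective, tuning a distance threshold until a crude outlier count (points with no neighbor within that distance) matches the owner-declared number; all values are illustrative, and the real objective and parameterization are not shown in the post:

```python
import math
import random

def anneal(score, state, neighbor, steps=2000, t0=1.0, cooling=0.995):
    """Generic simulated annealing: search for a state minimising score()."""
    rng = random.Random(42)             # fixed seed so runs are repeatable
    best, best_score = state, score(state)
    cur, cur_score, t = state, best_score, t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        cand_score = score(cand)
        # always accept improvements; sometimes accept regressions
        if cand_score <= cur_score or rng.random() < math.exp((cur_score - cand_score) / t):
            cur, cur_score = cand, cand_score
            if cur_score < best_score:
                best, best_score = cur, cur_score
        t *= cooling                    # lower the temperature each step
    return best, best_score

# Toy objective: choose eps so the crude outlier count matches the
# number of outliers the service owner declared at configuration time.
rates = [0.10, 0.11, 0.12, 0.09, 0.13, 0.95]
declared = 1

def outlier_count(eps):
    return sum(1 for i, p in enumerate(rates)
               if not any(abs(p - q) <= eps
                          for j, q in enumerate(rates) if j != i))

eps_best, err = anneal(lambda e: abs(outlier_count(e) - declared),
                       state=0.5,
                       neighbor=lambda e, rng: max(1e-3, e + rng.uniform(-0.1, 0.1)))
```

After annealing, `eps_best` is a threshold under which exactly one server looks isolated, matching the declared outlier count — which is the spirit of letting owners state "how many outliers exist now" instead of hand-tuning algorithm parameters.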

Into the Ring

To assess the effectiveness of our technique we evaluated results from a production service with outlier detection enabled. Using one week’s worth of data, we manually determined if a server should have been classified as an outlier and remediated. We then cross-referenced these servers with the results from our outlier detection system. From this, we were able to calculate a set of evaluation metrics including precision, recall, and f-score:

[Table: evaluation results (server count, precision, recall, f-score)]

These results illustrate that we cannot perfectly distill outliers in our environment but we can get close. An imperfect solution is entirely acceptable in our cloud environment because the cost of an individual mistake is relatively low. Erroneously terminating a server or pulling one out of service has little to no impact because it will be immediately replaced with a fresh server.  When using statistical solutions for auto remediation we must be comfortable knowing that the system will not be entirely accurate; an imperfect solution is preferable to no solution at all.
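Given a manually labelled week of data, these metrics reduce to simple set arithmetic over server ids (the ids below are hypothetical):

```python
def precision_recall_f1(predicted, actual):
    """predicted/actual: sets of server ids judged to be outliers."""
    tp = len(predicted & actual)                           # true positives
    precision = tp / len(predicted) if predicted else 0.0  # flagged and correct
    recall = tp / len(actual) if actual else 0.0           # real outliers caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                  # harmonic mean
    return precision, recall, f1

# Detector flagged a, b, c; manual review says the true outliers were b, c, d.
p, r, f = precision_recall_f1({"i-a", "i-b", "i-c"}, {"i-b", "i-c", "i-d"})
# p ≈ 0.67, r ≈ 0.67: one false positive, one missed outlier
```

A false positive here costs one needlessly recycled server, while a false negative costs a delayed detection — the asymmetry that makes an imperfect detector acceptable in this environment.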

The Ones We Leave Behind

Our current implementation is based on a mini-batch approach where we collect a window of data and use this to make a decision. Compared to a real-time approach, this has the drawback that outlier detection time is tightly coupled to window size: too small and you’re subject to noise, too big and your detection time suffers. Improved approaches could leverage advancements in real-time stream processing frameworks such as Mantis (Netflix's Event Stream Processing System) and Apache Spark Streaming. Furthermore, significant work has been conducted in the areas of data stream mining and online machine learning. We encourage anyone looking to implement such a system to consider using online techniques to minimize time to detect.

Parameter selection could be further improved with two additional services: a data tagger for compiling training datasets and a model server capable of scoring the performance of a model and retraining the model based on an appropriate dataset from the tagger. We’re currently tackling these problems to allow service owners to bootstrap their outlier detection by tagging data (a domain with which they are intimately familiar) and then computing the DBSCAN parameters (a domain that is likely foreign) using a Bayesian parameter selection technique to optimize the score of the parameters against the training dataset.

World on Fire

As Netflix’s cloud infrastructure increases in scale, automating operational decisions enables us to improve availability and reduce human intervention. Just as Daredevil uses his suit to amplify his fighting abilities, we can use machine learning and automated responses to enhance the effectiveness of our site reliability engineers and on-call developers. Server outlier detection is one example of such automation; others include Scryer and Hystrix. We are exploring additional areas to automate such as:

  • Analysis and tuning of service thresholds and timeouts
  • Automated canary analysis
  • Shifting traffic in response to region-wide outages
  • Automated performance tests that tune our autoscaling rules

These are just a few examples of steps towards building self-healing systems of immense scale. If you would like to join us in tackling these kinds of challenges, we are hiring!
23 Mar 16:54

Why fuel cell cars don't work - part 4

by mux
We have arrived at the final station of fuel cell cars. This is the end. We have seen how hydrogen is quite an annoying fuel to use in many respects and how other fuels have their share of drawbacks as well. We've gone over the technical details of a bunch of fuel cell types. I have even talked a bit about the economics of it all. Today I want to talk about what I think will be the future, and...
18 Jun 07:25

Do you need a PhD?

by Matt Welsh
Since I decamped from the academic world to industry, I am often asked (usually by first or second year graduate students) whether it's "worth it" to get a PhD in Computer Science if you're not planning a research career. After all, you certainly don't need a PhD to get a job at a place like Google (though it helps). Hell, many successful companies (Microsoft and Facebook among them) have been founded by people who never got their undergraduate degrees, let alone a PhD. So why go through the 5-to-10 year, grueling and painful process of getting a PhD when you can just get a job straight out of college (degree or not) and get on with your life, making the big bucks and working on stuff that matters?

Doing a PhD is certainly not for everybody, and I do not recommend it for most people. However, I am really glad I got my PhD rather than just getting a job after finishing my Bachelor's. The number one reason is that I learned a hell of a lot doing the PhD, and most of the things I learned I would never have been exposed to in a typical software engineering job. The process of doing a PhD trains you to do research: to read research papers, to run experiments, to write papers, to give talks. It also teaches you how to figure out what problem needs to be solved. You gain a very sophisticated technical background doing the PhD, and your work is subjected to the intense scrutiny of the academic peer-review process -- not to mention your thesis committee.

I think of the PhD a little like the Grand Tour, a tradition in the 16th and 17th centuries where youths would travel around Europe, getting a rich exposure to high society in France, Italy, and Germany, learning about art, architecture, language, literature, fencing, riding -- all of the essential liberal arts that a gentleman was expected to have experience with to be an influential member of society. Doing a PhD is similar: You get an intense exposure to every subfield of Computer Science, and have to become the world's leading expert in the area of your dissertation work. The top PhD programs set an incredibly high bar: a lot of coursework, teaching experience, qualifying exams, a thesis defense, and of course making a groundbreaking research contribution in your area. Having to go through this process gives you a tremendous amount of technical breadth and depth.

I do think that doing a PhD is useful for software engineers, especially those who are inclined to be technical leaders. There are many things you can only learn "on the job," but doing a PhD, and having to build your own compiler, or design a new operating system, or prove a complex distributed algorithm correct from scratch, is going to give you a much deeper understanding of complex Computer Science topics than following coding examples on StackOverflow.

Some important stuff I learned doing a PhD:

How to read and critique research papers. As a grad student (and a prof) you have to read thousands of research papers, extract their main ideas, critique the methods and presentation, and synthesize their contributions with your own research. As a result you are exposed to a wide range of CS topics, approaches for solving problems, sophisticated algorithms, and system designs. This is not just about gaining the knowledge in those papers (which is pretty important), but also about becoming conversant in the scientific literature.

How to write papers and give talks. Being fluent in technical communications is a really important skill for engineers. I've noticed a big gap between the software engineers I've worked with who have PhDs and those who don't in this regard. PhD-trained folks tend to give clear, well-organized talks and know how to write up their work and visualize the result of experiments. As a result they can be much more influential.

How to run experiments and interpret the results: I can't overstate how important this is. A systems-oriented PhD requires that you run a zillion measurements and present the results in a way that is both bullet-proof to peer-review criticism (in order to publish) and visually compelling. Every aspect of your methodology will be critiqued (by your advisor, your co-authors, your paper reviewers) and you will quickly learn how to run the right experiments, and do it right.

How to figure out what problem to work on: This is probably the most important aspect of PhD training. Doing a PhD will force you to cast away from shore and explore the boundary of human knowledge. (Matt Might's cartoon on this is a great visualization of this.) I think that at least 80% of making a scientific contribution is figuring out what problem to tackle: a problem that is at once interesting, open, and going to have impact if you solve it. There are lots of open problems that the research community is not interested in (c.f., writing an operating system kernel in Haskell). There are many interesting problems that have been solved over and over and over (c.f., filesystem block layout optimization; wireless multihop routing). There's a real trick to picking good problems, and developing a taste for it is a key skill if you want to become a technical leader.

So I think it's worth having a PhD, especially if you want to work on the hardest and most interesting problems. This is true whether you want a career in academia, a research lab, or a more traditional engineering role. But as my PhD advisor was fond of saying, "doing a PhD costs you a house." (In terms of the lost salary during the PhD years - these days it's probably more like several houses.)