The Past and Future of Systems Management

“I said that I’mma ride for my motherf*ckin' n***as
Most likely I’mma die with my finger on the trigger
I’ve been grindin outside all day with my n***as
And I ain’t goin' in unless I’m with my n***as
My n***as, my n***as
My n***as, my n***as (My muthaf*cking n***as!)
My n***as, my n***as (My n***as, my n***as)
My n***as, my n***as”

—YG, “My N***a”


Ten years ago I had a big problem. I was CEO of Opsware, a systems management software company, and we were losing a lot of deals to our major competitor, BladeLogic. We were losing because they had a better product. There were many reasons for that, starting with the fact that we had never designed Opsware for broad-based use; in a desperate move, we had yanked it out of our cloud computing business, Loudcloud, and begun offering it as a commercial product. As a result, Opsware worked okay for some customers, but it was not ready to go up against an excellent competitor. Naturally, nobody cared about our excuses, and our business was spiraling down the drain.

When I fully realized what was happening, I went to see my chief architect, Phil Liu, to tell him the bad news. He and I meticulously went through the details of why customers thought BladeLogic's product was better and why they were probably right. It was a come-to-Jesus moment, but Jesus wasn’t there, so it was up to Phil and me.

As Phil thought through the implications of re-architecting the product and how little time he had to do so, his face told the story. It said: “Oh fuck, this is going to be hard if not impossible and the company is going to die if we don’t figure it out. But we will figure it out or die trying.” When I saw that look, I thought: “My guy.”

After nine months of grueling effort by Phil and the team, we released our new product, code-named Darwin. As soon as it hit the market, our win rate went from below 50% to 80%. We had done it. It was a new day and we weren’t going to squander it. I will never forget that moment with Phil. When things go perfectly in a company, it’s sometimes difficult to differentiate among good employees, as everyone consistently beats their objectives. However, when things go horribly wrong, the greatest people distinguish themselves. Phil could have made many excuses and blamed many people, most of all me. But he would have none of that. Instead, he simply found his greatness.

Years later, HP acquired Opsware and Phil went on to design a system for monitoring and managing a leading, massive, billion-user application called Facebook. Because Phil intimately knew the traditional systems management market, he quickly realized that the traditional systems would not work for modern, massive, cloud-based architectures. In fact, they would not work properly for cloud-based architectures of any scale. A new system had to be developed. The reasons were several:

Traditional systems are server-centric: Even relatively modern systems management products like New Relic treat servers as sacred resources that must be kept alive, but Facebook loses servers every day and it doesn’t matter. Facebook doesn’t care about servers; it cares about services. Knowing when a cluster that provides, for example, an identity service is out of capacity is critical, but getting paged in the middle of the night because you lost one server in a cluster of 20 is asinine.

Developers are central to the business: Developers now play a much more critical role in the business and need an application intelligence solution the same way marketers need business intelligence. Developers need to answer questions like: Which APIs are being used most frequently, and by which customers? Is a sudden spike in latency for my top customers the result of an increase in load or of some upstream service using my APIs inefficiently? Which customers are taxing my infrastructure in the most expensive way? How should I think about scaling my systems if my user base were to double?

Monitoring is now an analytics problem: How many milliseconds should a packet take to travel from the database to the application server for the photo app? Don’t feel bad if you can’t answer that, because it would be a silly thing to know, yet monitoring systems have wanted people to know such things for years. How about a system that instead reports the moving average and flags anomalies, such as readings more than three standard deviations away from it?

Applications are now a collection of micro-services: These micro-services are often managed by separate teams, with all sorts of upstream and downstream dependencies. Having a solution that tracks all the relevant metrics across all the services fosters a much more collaborative environment where teams can communicate with one another (versus logs, where only the developer who wrote the app can really understand what's going on).

Time is money: Facebook is now a $10 billion company. At $10 billion a year spread over the 8,760 hours in a year, that means that if the site is down for an hour, that’s roughly $1.1 million. So, logged data and dashboards that aren’t real-time won’t cut it. Every second counts, and a proper system must enable you to see all of the data in real time.
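To make the "monitoring is an analytics problem" point concrete, here is a minimal sketch of the idea: instead of asking a human to know the right latency, track a trailing moving average and flag any reading that lands more than three standard deviations away from it. This is only an illustration under simple assumptions, not how SignalFx or Facebook actually do it. The function name `find_anomalies`, the window size, and the latency numbers are all made up for the example.

```python
from collections import deque
from math import sqrt


def find_anomalies(samples, window=20, threshold=3.0):
    """Return (index, value) pairs that deviate from the trailing moving
    average by more than `threshold` standard deviations.

    Hypothetical helper for illustration only.
    """
    recent = deque(maxlen=window)  # trailing window of earlier samples
    anomalies = []
    for i, value in enumerate(samples):
        # Only judge a sample once the window is full.
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((x - mean) ** 2 for x in recent) / window
            std = sqrt(variance)
            # Skip perfectly flat windows, where std is zero.
            if std > 0 and abs(value - mean) > threshold * std:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies


# Made-up latency readings in milliseconds, with one obvious spike.
latencies_ms = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 11.0] * 3
latencies_ms[25] = 40.0
print(find_anomalies(latencies_ms, window=10))
```

Nobody had to decide in advance what "normal" latency is; the baseline comes from the data itself. A real system would also have to deal with seasonality, noisy or sparse metrics, and spikes that contaminate the window, all of which this sketch ignores.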

After building a system that solves the above and serves Facebook amazingly well, Phil co-founded SignalFx with Karthik Rau, another stellar ex-Loudcloud employee and a legendary VMware executive. Together, they have built the systems management product of the future. The current product is amazing, but more importantly, Phil and Karthik will make sure that it remains the best product for a very long time.

I am so pleased to announce that Andreessen Horowitz is an investor in SignalFx.
