Post-mortem: Outages on 1/19/17 and 1/23/17

Skyliner was unavailable from 00:59 to 01:26 PST on January 19th and from 12:49 PST to 12:56 PST on January 23rd due to a memory leak in our application. During those times, customers were unable to deploy using Skyliner, but no customer infrastructure or applications were affected. The incident was fully resolved.

At 00:59 PST on 1/19/17, response times for skyliner.io began to rise, and by 01:06 requests were timing out. Our external monitoring triggered an alert, and by 01:13 we were actively responding. Service was fully restored at 01:26. Our hunch was that the outage was due to a memory leak of some sort, but we didn’t have enough forensic information to confirm this. Based on our information at that time, we took the following steps:

  • Improved our ability to enable SSH on application instances during an incident.
  • Enabled slow query logging on our database server.
  • Enabled full GC logging on our application instances.
  • Added periodic logging of key JVM metrics.
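The last of these, periodic logging of key JVM metrics, is available in-process through the standard java.lang.management API. A minimal sketch in Java (the class name and log format are ours, and it is illustrative rather than our actual code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapLogger {
    // Summarize current heap usage. In production this would run on a
    // scheduled executor and write to the application log.
    public static String heapSummary() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return String.format("heap used=%dMB committed=%dMB max=%dMB",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }

    public static void main(String[] args) {
        System.out.println(heapSummary());
    }
}
```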

At 12:25 PST on 1/23/17, response times for skyliner.io again began to rise, and by 12:49 PST we began receiving automated exception reports indicating heap exhaustion on some of our application instances. Our external monitoring triggered an alert at 12:54 PST. Given our working hypothesis, we rebooted some of the instances, and service was restored by 12:56 PST. Unlike the first outage, our customer communication here was poor: the outage wasn’t announced on Twitter until 13:36 PST. The GC logging indicated our application’s heap was exhausted, but we were unable to secure a full heap dump that would have allowed us to examine the heap’s contents in detail. To get a full heap dump, we took the following steps:

  • Resolved tooling conflicts with our container’s base image to allow us to run jmap.
  • Increased disk space on our application instances to allow us to capture a full JVM heap dump.
  • Secured a full heap dump from an instance that had been running overnight but did not yet have degraded performance.
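We used jmap for the dump itself; for completeness, the same kind of full dump can also be triggered from inside the process via HotSpot’s diagnostic MXBean. A minimal sketch, assuming a HotSpot JVM (the class name and output path are ours):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Write a heap dump in HPROF format, comparable to
    // `jmap -dump:format=b,file=heap.hprof <pid>` run against the process.
    // live = true dumps only reachable objects (forcing a full GC first);
    // live = false includes unreachable objects as well.
    public static void dumpHeap(String path, boolean live) throws Exception {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(path, live);
    }

    public static void main(String[] args) throws Exception {
        File out = new File("heap.hprof");
        out.delete(); // dumpHeap refuses to overwrite an existing file
        dumpHeap(out.getPath(), false);
        System.out.println("dump size: " + out.length() + " bytes");
    }
}
```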

Analysis of the heap dump identified an unbounded, non-evicting, strongly-referencing cache (clojure.core/memoize) in a third-party library (Amazonica), which cached AWS SDK client objects keyed by the credentials and configuration originally used to construct them. Skyliner is a multi-tenant application that uses short-lived, temporary AWS credentials to manage customer resources across all 14 public regions. As a result, the cache quickly accumulated hundreds of thousands of client objects, eventually exhausting the JVM’s heap and sending the garbage collector into a death spiral. By 09:35 PST on 1/24/17, our application was patched, deployed, and confirmed to have a stable heap size. We have also reported the problem to the library’s author and are working with them on a long-term fix.
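The failure mode is easy to reproduce in miniature. A sketch in Java of the same cache semantics (the memoize helper, key names, and client stand-in are ours, not Amazonica’s):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class UnboundedMemoize {
    // A strongly-referencing, never-evicting memoizer with the same semantics
    // as clojure.core/memoize: every distinct key is cached forever.
    static <K, V> Function<K, V> memoize(Function<K, V> f, Map<K, V> cache) {
        return key -> cache.computeIfAbsent(key, f);
    }

    // Simulate n calls, each arriving with freshly rotated temporary
    // credentials, against a memoized (hypothetical) client constructor.
    static int simulate(int n) {
        Map<String, Object> cache = new HashMap<>();
        Function<String, Object> clientFor = memoize(creds -> new Object(), cache);
        for (int i = 0; i < n; i++) {
            clientFor.apply("temp-credential-" + i);
        }
        return cache.size(); // every client is still strongly referenced
    }

    public static void main(String[] args) {
        System.out.println("cached clients: " + simulate(100_000)); // 100000, none evicted
    }
}
```

Because the credentials rotate, no key ever repeats, so the cache provides no reuse at all; it only accumulates.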

This morning, we established a timeline of the events and conducted our full post-mortem of the incident. We identified several areas of possible improvement and began implementation of the following:

  • Decrease time-to-detection by alerting on overall response latency.
  • Decrease time-to-detection by alerting on logged GC failures.
  • Decrease time-to-response by ensuring our ability to page team members using PagerDuty.
  • Improve our alerts for customer questions about outages, and make it easier for customers to find the channel for Skyliner problems in https://slack.skyliner.io.

Let us know if you have questions about any of the above.
