Advancing the Art of Internet Edge Outage Detection

Publication
to appear in ACM Internet Measurement Conference
Location
Boston, MA
Date

Abstract

Measuring the reliability of edge networks in the Internet is difficult due to the size and heterogeneity of networks, the rarity of outages, and the difficulty of finding vantage points that can accurately capture such events at scale. In this paper, we use logs from a major CDN, detailing hourly request counts from address blocks. We discovered that in many edge address blocks, devices collectively contact the CDN every hour over weeks and months. We establish that a sudden, temporary absence of these requests indicates a loss of Internet connectivity for those address blocks, events we call disruptions. We develop a disruption detection technique and present broad and detailed statistics on 1.5M disruption events over the course of a year. Our approach reveals that disruptions do not necessarily reflect actual service outages, but can be the result of prefix migrations. Major natural disasters are clearly represented in our data, as expected; however, a large share of detected disruptions correlate well with planned human intervention during scheduled maintenance intervals, and are thus unlikely to be caused by external factors. Cross-evaluating our results, we find that current state-of-the-art active outage detection over-estimates the occurrence of disruptions in some address blocks. Our observations of disruptions, service outages, and different causes for such events yield implications for the design of outage detection systems, as well as for policymakers seeking to establish reporting requirements for Internet services.
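
To make the core idea concrete, the following is a minimal sketch of how a temporary absence of requests could be flagged in one address block's hourly counts. It is not the paper's detection technique: the function name `find_disruptions`, the `min_baseline_hours` parameter, and the rule that any hour with zero requests counts as silent are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def find_disruptions(
    hourly_counts: List[int],
    min_baseline_hours: int = 24,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return (start_hour, end_hour_exclusive) for each run of silent hours.

    A run is reported only if it follows at least `min_baseline_hours`
    consecutive active hours and is itself followed by renewed activity,
    so that it is both sudden and temporary.
    """
    disruptions: List[Tuple[int, int]] = []
    active_run = 0                 # consecutive active hours just before this hour
    gap_start = None               # first silent hour of a candidate disruption

    for hour, count in enumerate(hourly_counts):
        if count > 0:
            if gap_start is not None:
                # Activity resumed: the candidate gap was temporary.
                disruptions.append((gap_start, hour))
                gap_start = None
                active_run = 0
            active_run += 1
        elif gap_start is None:
            if active_run >= min_baseline_hours:
                # Silence after a stable baseline: open a candidate gap.
                gap_start = hour
            else:
                # Silence without a stable baseline: not a candidate.
                active_run = 0

    # A gap still open at the end of the series never recovered, so it is
    # not reported as a temporary disruption.
    return disruptions
```

For example, a block that is active for 30 hours, silent for 3, and then active again yields a single event spanning hours 30 through 32. A block with only a short or intermittent activity history yields nothing, which mirrors the abstract's restriction to blocks that contact the CDN every hour.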