Advancing the Art of Internet Edge Outage Detection

to appear in ACM Internet Measurement Conference
Boston, MA


Estimating the reliability of edge networks in the Internet is difficult due to the size and heterogeneity of the network, the rarity of outages, and the difficulty of finding methods and vantage points that can accurately capture such events at scale. In this paper, we leverage logs from a major CDN, detailing hourly request counts from address blocks over one year. We observe that devices from many edge address blocks contact the CDN every single hour over long time periods. We establish that a sudden, temporary absence of these requests indicates a loss of Internet connectivity for the affected IP address blocks; we refer to such events as disruptions. Leveraging our vantage point and detection technique, we present broad and detailed statistics on some 1.5M disruption events over the course of one year. Our approach reveals that many of these disruptions do not reflect actual service outages, but are the result of mass prefix migrations. Methods that attempt to detect outages by sending active probes into the network may thus overestimate the occurrence of outages. Further, we find that while major external events such as natural disasters are clearly represented in our data, a large share of detected disruptions is unlikely to be caused by external factors and instead correlates well with planned human intervention during scheduled ISP maintenance intervals. Our observations of disruptions, service outages, and the different causes of such events have implications both for current and future outage detection systems and for regulators and policymakers seeking to establish outage reporting requirements for Internet services.
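
As a minimal illustration of the detection idea described above, the following Python sketch flags a temporary absence of requests in the hourly counts of a single address block. It is not the detection method used in this paper: the function name `find_disruptions` and the thresholds `min_active_hours` and `max_gap_hours` are hypothetical values chosen for illustration only.

```python
from typing import List, Tuple


def find_disruptions(
    hourly_counts: List[int],
    min_active_hours: int = 168,  # assumed baseline: one week of continuous activity
    max_gap_hours: int = 48,      # assumed upper bound for a "temporary" absence
) -> List[Tuple[int, int]]:
    """Return (start, end) hour indices of temporary gaps in activity.

    A gap is reported only if it follows at least `min_active_hours`
    consecutive hours with requests and if requests resume within
    `max_gap_hours`. `end` is the first hour with requests after the gap.
    """
    disruptions = []
    active_run = 0        # consecutive hours with at least one request
    gap_start = None      # hour index at which the current gap began
    baseline_ok = False   # whether the block was steadily active before the gap

    for hour, count in enumerate(hourly_counts):
        if count > 0:
            if gap_start is not None:
                gap_len = hour - gap_start
                if baseline_ok and gap_len <= max_gap_hours:
                    disruptions.append((gap_start, hour))
                gap_start = None
                active_run = 0
            active_run += 1
        elif gap_start is None:
            # First silent hour: remember where the gap starts and whether
            # the block had been continuously active beforehand.
            gap_start = hour
            baseline_ok = active_run >= min_active_hours

    # A gap that runs to the end of the series is not reported, because
    # recovery has not been observed and it cannot be called temporary.
    return disruptions


if __name__ == "__main__":
    counts = [5] * 200 + [0] * 3 + [7] * 200
    print(find_disruptions(counts))
```

Requiring a long active baseline before a gap reflects the observation that only blocks which contact the CDN every hour make silence informative; blocks with sporadic activity would otherwise produce spurious detections.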