Kalyan Veeramachaneni¹ | Ignacio Arnaldo² | Alfredo Cuesta-Infante² | Vamsi Korrapati² | Costas Bassias² | Ke Li²
CSAIL & IDSS, MIT - Cambridge, MA¹ | PatternEx - San Jose, CA²
We present an analyst-in-the-loop security system, in which analyst intuition is combined with state-of-the-art machine learning to build an end-to-end active modeling system. The system has four key features: a big data behavioral analytics platform, an ensemble of outlier detection methods, a mechanism to obtain feedback from security analysts, and a supervised learning module. Run together on a daily basis, these four components improve the detection rate by 3.41× on average and reduce false positives by more than 5× compared to unsupervised outlier analysis alone. We validate our system with a real-world data set consisting of 3.6 billion log lines. These results show that the system is capable of learning to defend against unseen attacks.
Paper: AI2: Training a big data machine to defend (PDF)
In Figure 1, we present a snapshot of our system’s progress after 12 weeks of use. With 3 months’ worth of data, and with awareness of attacks, we evaluate whether our solution can improve attack detection rates (recall) while reducing the number of alerts shown to the analyst (“daily investigation budget” k).
Using analyst time effectively: The AI2 system achieves a detection rate of 86.8% even at an extremely low daily investigation budget of k = 200 events. This is more than a tenfold improvement over the unsupervised outlier detection approach, whose detection rate at the same budget is 7.9%. Fixing the daily investigation budget at 200 keeps the false positive rate at 4.4%.
Reducing false positives by a factor of 5: If we allow a higher daily investigation budget (for example, up to 1000), the unsupervised outlier detection method can still only achieve a 73.7% detection rate, and its false positive rate is > 22%. AI2 achieves a detection rate > 86% at a false positive rate of 4.4%, a reduction by a factor of 5.
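The two comparisons above rest on simple arithmetic, which can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function name `recall_at_budget`, the toy scores, and the toy labels are all hypothetical, and the sketch covers a single day only. The percentages in the last two lines are the ones quoted above.

```python
def recall_at_budget(scores, labels, k):
    """Fraction of all true attacks that appear among the k highest-scored events.

    scores: one anomaly or classifier score per event (higher = more suspicious)
    labels: 1 for a true attack, 0 for a benign event
    k:      daily investigation budget (number of events shown to the analyst)
    """
    total_attacks = sum(labels)
    if total_attacks == 0:
        return 0.0
    # Rank event indices from most to least suspicious and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    caught = sum(labels[i] for i in ranked[:k])
    return caught / total_attacks


# Toy data (hypothetical): six events, three of which are attacks.
scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
labels = [1, 0, 0, 1, 1, 0]
recall_top3 = recall_at_budget(scores, labels, k=3)  # two of three attacks caught

# Ratios behind the claims above, using the reported percentages.
detection_gain = 86.8 / 7.9  # about 11, hence "more than tenfold"
fp_reduction = 22.0 / 4.4    # about 5; a lower bound, since the baseline rate is > 22%
```

Raising k can only keep recall the same or increase it, which is why the comparison fixes the budget before comparing detection rates.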
On our choice of the title “Training a big data machine to defend”: We define a big data system or machine as a software infrastructure that is able to ingest data in real time and to compute and generate quantities that can then be analyzed, either by data scientists or by a machine learning system. A machine learning substrate that sits on top of this system can analyze the data and automatically produce outliers. We provide a system that collects and incorporates analyst feedback, and that generates and uses models continuously without any involvement from its original developers (that is, us). Thus, we are able to deliver a fully automatic system that can be trained by analysts.
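The feedback loop described above can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the AI2 implementation: the z-score outlier detector and the nearest-centroid classifier are stand-ins chosen for brevity, and every name here (`outlier_scores`, `train_centroids`, `supervised_scores`, `run_day`, the `analyst` callback, the toy events) is hypothetical. The sketch also switches entirely to the supervised ranking once both classes have been labeled, which is a simplification.

```python
from statistics import mean, pstdev


def outlier_scores(events):
    """Unsupervised stand-in: largest absolute z-score across features."""
    cols = list(zip(*events))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [max(abs(x - m) / s for x, m, s in zip(e, mus, sds)) for e in events]


def train_centroids(labeled):
    """Supervised stand-in: per-class feature centroids from analyst labels."""
    by_class = {0: [], 1: []}
    for features, label in labeled:
        by_class[label].append(features)
    return {c: [mean(col) for col in zip(*rows)]
            for c, rows in by_class.items() if rows}


def supervised_scores(model, events):
    """Higher score = closer to the malicious centroid than to the benign one."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [dist(e, model[0]) - dist(e, model[1]) for e in events]


def run_day(events, analyst, labeled, k):
    """One daily iteration: rank events, ask the analyst about the top k."""
    model = train_centroids(labeled)
    if 0 in model and 1 in model:
        scores = supervised_scores(model, events)  # labels for both classes exist
    else:
        scores = outlier_scores(events)            # fall back to outlier ranking
    top = sorted(range(len(events)), key=lambda i: scores[i], reverse=True)[:k]
    for i in top:
        labeled.append((events[i], analyst(events[i])))  # store analyst feedback
    return top


# Hypothetical analyst: flags events whose first feature is large.
analyst = lambda e: 1 if e[0] > 3 else 0

labeled = []
day1_events = [(0.0, 0.0)] * 8 + [(6.0, 6.0), (0.0, -6.0)]
day2_events = [(0.1, 0.0), (5.5, 6.2), (0.0, 0.2), (6.1, 5.8), (0.2, -5.0), (0.0, 0.1)]

day1_top = run_day(day1_events, analyst, labeled, k=2)  # ranked by outlier score
day2_top = run_day(day2_events, analyst, labeled, k=2)  # ranked by learned model
```

On the first day no labels exist, so the analyst sees the strongest outliers, one of which is benign. On the second day the model trained from that feedback ranks the two attack-like events first and pushes the benign oddity `(0.2, -5.0)` to the bottom, which is the effect the analyst-in-the-loop design is after.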