Anomaly Discovery Paradigm

Modern applications, from financial transaction systems to IoT sensor networks, collect staggering volumes of data. Hidden in these data sets are critical insights, ranging from rare phenomena to anomalies indicative of fraud or failure. To separate the valuable from the counterfeit, analysts need to interactively sift through and explore this data deluge. By detecting anomalies, analysts may prevent fraud or catastrophic sensor failures. While prior research offers a treasure trove of stand-alone algorithms for detecting particular types of outliers, these algorithms tend to be variations on a theme. No end-to-end paradigm yet brings this wealth of alternative algorithms to bear in an integrated infrastructure that supports anomaly discovery over potentially huge data sets while keeping the human in the loop.

This project is the first to design an integrated paradigm for end-to-end anomaly discovery. It aims to support all stages of anomaly discovery by seamlessly integrating outlier-related services within a single platform. The result is a database-system-inspired solution that models services as first-class citizens for the discovery of outliers. It integrates outlier detection processes with data sub-spacing, explanations of outliers with respect to their context in the original data set, feedback on the relevance of outlier candidates, and metric learning to refine the effectiveness of the outlier detection process. The resulting system enables the analyst to steer the discovery process with human ingenuity, empowered by near-real-time interactive responsiveness during exploration. Our solution promises to be the first to deliver the sense-making power afforded by outlier explanation services and human feedback integrated into the discovery process.
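
As a rough illustration of how detection and sub-spacing can interact, the sketch below scores points by the distance to their k-th nearest neighbor, restricted to a chosen subset of dimensions. It is not the project's implementation: the function names, the `dims` and `weights` parameters, and the toy readings are all assumptions made for this example, and the per-dimension weights merely stand in for what a metric-learning step driven by analyst feedback might produce.

```python
import math


def knn_outlier_scores(points, k, dims=None, weights=None):
    """Score each point by its weighted distance to its k-th nearest neighbor.

    dims    : indices of the dimensions (the subspace) to use; all if None.
    weights : one weight per selected dimension (hypothetical stand-in for a
              learned metric); all 1.0 if None.
    """
    dims = list(range(len(points[0]))) if dims is None else list(dims)
    weights = [1.0] * len(dims) if weights is None else list(weights)

    def dist(a, b):
        return math.sqrt(
            sum(w * (a[d] - b[d]) ** 2 for d, w in zip(dims, weights))
        )

    scores = []
    for i, p in enumerate(points):
        neighbor_dists = sorted(
            dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(neighbor_dists[k - 1])
    return scores


def top_outliers(points, k, n, dims=None, weights=None):
    """Return the indices of the n highest-scoring points."""
    scores = knn_outlier_scores(points, k, dims, weights)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Toy data (made up): the last point is unusual only in dimension 0.
readings = [(1.0, 10.0), (1.1, 50.0), (0.9, 90.0), (1.0, 30.0), (5.0, 60.0)]

# Restricting the search to dimension 0 ranks the last point (index 4) first.
print(top_outliers(readings, k=2, n=1, dims=[0]))
```

In the full two-dimensional space the wide spread of the second dimension dominates the distances, so the same point no longer ranks first; this is the kind of effect that makes the choice of subspace part of the discovery process.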

Smile: A System to Support Machine Learning on EEG Data at Scale

To reduce the risk of neural injury from seizures and spare neurologists hours of manually reviewing EEG recordings, it is critical to automatically detect and classify "interictal-ictal continuum" (IIC) patterns in EEG data. However, existing IIC classification techniques have been shown to be insufficiently accurate and robust for clinical use because high-quality labels of EEG segments are lacking as training data. Obtaining high-quality labeled data has traditionally been a manual process performed by trained clinicians, one that can be tedious, time-consuming, and error-prone.

In this work, we propose Smile, an industrial-scale system that provides an end-to-end solution to the IIC pattern classification problem. The core components of Smile are a visualization-based time series labeling module and a deep-learning-based active learning module. The labeling module enables users to explore and label 350 million EEG segments (30TB) at interactive speed. Its multiple coordinated views allow users to examine the EEG signals in the time domain and the frequency domain simultaneously. The active learning module first trains a deep neural network that classifies IIC patterns by automatically extracting both the local features of each segment and the long-term dynamics of the EEG signals. Then, leveraging the output of the deep learning model, it selects the EEG segments that can best improve the model and presents them to clinicians for labeling. This process is repeated until the clinicians and the model show a high degree of agreement. Our initial experimental results show that Smile allows clinicians to label EEG segments at will with a response time below 500 ms. The accuracy of the model improves progressively as more high-quality labels are acquired over time.
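
The selection step of such an active learning loop can be sketched as follows. The text above does not state which selection criterion Smile uses; this example assumes entropy-based uncertainty sampling, a common choice, and every name in it (`select_for_labeling`, `agreement_rate`, the segment identifiers and probabilities) is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` segments whose predictions are most uncertain.

    predictions: dict mapping a segment id to the model's class probabilities.
    Assumption: uncertainty is measured by entropy; the actual system may
    use a different criterion.
    """
    ranked = sorted(
        predictions, key=lambda seg: entropy(predictions[seg]), reverse=True
    )
    return ranked[:budget]


def agreement_rate(model_labels, clinician_labels):
    """Fraction of commonly labeled segments on which both sides agree."""
    shared = set(model_labels) & set(clinician_labels)
    if not shared:
        return 0.0
    matches = sum(model_labels[s] == clinician_labels[s] for s in shared)
    return matches / len(shared)


# Made-up model output for three segments and three pattern classes.
predictions = {
    "seg_a": [0.98, 0.01, 0.01],  # confident prediction
    "seg_b": [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    "seg_c": [0.70, 0.20, 0.10],
}
print(select_for_labeling(predictions, budget=2))  # ['seg_b', 'seg_c']
```

A stopping rule in the spirit of the description above would repeat the train, select, and label cycle until `agreement_rate` exceeds a threshold chosen by the clinicians.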

ASSET: A System for Exploring Sequential Patterns

Modern data-intensive applications generate massive amounts of time-stamped data, much of it in the form of discrete, symbolic sequences produced by sensors, system logs, and e-commerce. While valuable knowledge can be found in such sequence data by mining sequential patterns, to date no suitable technology exists that supports interactive exploration of such patterns.

In this work, we present the first comprehensive solution, the ASSET system, which efficiently supports a rich variety of pattern exploration requests with varying pattern semantics and input parameter settings. ASSET employs an innovative index structure that succinctly summarizes sequence data using a small set of sequential patterns. Using this index, ASSET provides highly effective execution strategies for a range of pattern mining and query requests. As a result, ASSET offers near-real-time responsiveness even on large sequence data sets. Our experimental evaluation on several real-world and synthetic sequence data sets demonstrates the versatility and efficiency of ASSET, showing it to be three or more orders of magnitude faster than state-of-the-art techniques.
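
To make the idea of answering pattern requests from a precomputed summary concrete, here is a deliberately simple sketch. It is not the ASSET index, whose design is not described above: it assumes the plain subsequence semantics of sequential patterns, indexes only length-2 patterns, and uses hypothetical names (`build_pair_index`, `frequent_pairs`) and made-up log data.

```python
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """True if the pattern's symbols occur in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)


def build_pair_index(sequences):
    """Map each ordered pair (a, b) to the ids of sequences where a precedes b.

    Built once in a single pass over the data; later requests read the index
    instead of rescanning the sequences.
    """
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i, first in enumerate(seq):
            for second in seq[i + 1:]:
                index[(first, second)].add(seq_id)
    return index


def frequent_pairs(index, min_support):
    """Answer a 'frequent pattern' request for any support threshold."""
    return {
        pattern: len(ids)
        for pattern, ids in index.items()
        if len(ids) >= min_support
    }


# Made-up event sequences, e.g. one per user session.
logs = [
    ["login", "search", "buy"],
    ["login", "buy"],
    ["search", "login", "buy"],
    ["login", "search"],
]
index = build_pair_index(logs)
print(frequent_pairs(index, min_support=3))  # {('login', 'buy'): 3}
```

Because the index is built once, changing `min_support` only requires a pass over the index rather than over the raw sequences, which is the general reason an index can keep parameter exploration interactive.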