Research
Anomaly Discovery Paradigm
Modern applications, from financial transaction systems to IoT sensor networks, collect staggering volumes
of data that contain critical insights, ranging from rare phenomena to anomalies indicative of fraud or failure.
To distinguish the valuable from the counterfeit, analysts need to interactively sift through and explore this
data deluge. By detecting anomalies, analysts may prevent fraud or catastrophic sensor failures.
While prior research offers a treasure trove of stand-alone algorithms for detecting particular
types of outliers, these algorithms tend to be variations on a theme. There is no end-to-end paradigm that brings this
wealth of alternative algorithms to bear in an integrated infrastructure supporting anomaly discovery over
potentially huge data sets while keeping the human in the loop.
This project is the first to design an integrated paradigm for end-to-end anomaly discovery. It
aims to support all stages of anomaly discovery by seamlessly combining outlier-related services within
a single platform. The result is a database-system-inspired solution that models services as first-class
citizens for the discovery of outliers. It integrates outlier detection with data sub-spacing,
explanations of outliers with respect to their context in the original data set, feedback on the relevance
of outlier candidates, and metric learning to refine the effectiveness of the outlier detection process.
The resulting system enables the analyst to steer the discovery process with human ingenuity, supported
by near real-time interactive responsiveness during exploration. Our solution promises to be the first
to deliver the sense-making power afforded by outlier explanation services and human feedback integrated
into the discovery process.
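The interplay of detection, analyst feedback, and metric refinement described above can be illustrated with a minimal sketch. It is not the project's implementation: the k-nearest-neighbor outlier score, the per-feature weights, and the function names `knn_outlier_scores` and `downweight_from_feedback` are assumptions made for illustration, and the re-weighting rule is a toy stand-in for a real metric-learning method.

```python
import math

def knn_outlier_scores(points, weights, k=2):
    """Score each point by its weighted Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores

def downweight_from_feedback(points, weights, irrelevant_idx, factor=0.5):
    """Toy feedback step (hypothetical rule): when the analyst rejects a candidate,
    shrink the weight of the feature in which it deviates most from the data mean."""
    dims = len(weights)
    means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    p = points[irrelevant_idx]
    worst = max(range(dims), key=lambda d: weights[d] * (p[d] - means[d]) ** 2)
    new_weights = list(weights)
    new_weights[worst] *= factor
    return new_weights

# Illustrative data: a tight cluster plus one extreme point per feature.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 0.0), (0.0, 5.0)]
weights = [1.0, 1.0]

scores_before = knn_outlier_scores(points, weights)
# The analyst marks candidate 5 as irrelevant; the metric is refined accordingly.
new_weights = downweight_from_feedback(points, weights, irrelevant_idx=5)
scores_after = knn_outlier_scores(points, new_weights)
```

In this sketch both extreme points start out as the top candidates. After the analyst rejects the second one, the feature that made it stand out counts for less, so it ranks below the first in the next round.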
Smile: A System to Support Machine Learning on EEG Data at Scale
To reduce the risk of neural injury from seizures and spare neurologists the hours
spent manually reviewing EEG recordings, it is critical to automatically detect and classify
"interictal-ictal continuum" (IIC) patterns in EEG data. However, existing IIC classification techniques
have been shown to be neither accurate nor robust enough for clinical use because high-quality labels
of EEG segments are lacking as training data. Obtaining high-quality labeled data is traditionally a manual process
carried out by trained clinicians, one that can be tedious, time-consuming, and error-prone.
In this work, we propose Smile, an industrial-scale system that provides an end-to-end solution to the
IIC pattern classification problem. The core components of Smile include a visualization-based time series
labeling module and a deep-learning-based active learning module. The labeling module enables users
to explore and label 350 million EEG segments (30 TB) at interactive speed. Its multiple coordinated views
allow users to examine the EEG signals in the time domain and the frequency domain simultaneously.
The active learning module first trains a deep neural network that automatically extracts both the local
features of each individual segment and the long-term dynamics of the EEG signals to classify
IIC patterns. Then, leveraging the output of the deep learning model, the system selects the EEG segments that can best improve
the model and presents them to clinicians for labeling. This process is repeated until the clinicians
and the model show a high degree of agreement. Our initial experimental results show that Smile
allows clinicians to label EEG segments at will with a response time below 500 ms. The accuracy
of the model improves progressively as more high-quality labels are acquired over time.
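The select-label-retrain loop can be sketched in a few lines. The text does not state how Smile decides which segments "can best improve the model"; the sketch below assumes entropy-based uncertainty sampling, a common active learning heuristic, and the names `select_for_labeling` and `agreement_rate` as well as the example probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Pick the `budget` segments whose predicted class distribution is most uncertain.

    predictions: dict mapping a segment id to the model's class probabilities.
    """
    ranked = sorted(predictions, key=lambda seg: entropy(predictions[seg]), reverse=True)
    return ranked[:budget]

def agreement_rate(model_labels, clinician_labels):
    """Fraction of segments labeled by both sides on which model and clinician agree."""
    shared = model_labels.keys() & clinician_labels.keys()
    if not shared:
        return 0.0
    return sum(model_labels[s] == clinician_labels[s] for s in shared) / len(shared)

# Illustrative model outputs over three IIC classes (made-up numbers).
predictions = {
    "seg_a": [0.98, 0.01, 0.01],  # confident prediction
    "seg_b": [0.40, 0.35, 0.25],  # highly uncertain prediction
    "seg_c": [0.70, 0.20, 0.10],  # moderately uncertain prediction
}
to_label = select_for_labeling(predictions, budget=2)
```

Under these assumptions, the loop would retrain on the newly labeled segments and stop once `agreement_rate` exceeds a chosen threshold.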
ASSET: A System for Exploring Sequential Patterns
Modern data-intensive applications generate massive amounts of time-stamped data, much of it in the form
of discrete, symbolic sequences produced by sensors, system logs, and e-commerce applications. While valuable knowledge
can be extracted from such sequence data by mining sequential patterns, to date no suitable technology
exists that supports interactive exploration of such patterns.
In this work, we present ASSET, the first comprehensive solution that efficiently supports
a rich variety of pattern exploration requests with varying pattern semantics and input parameter settings.
ASSET employs an innovative index structure that succinctly summarizes sequence data using a small set of
sequential patterns. Using this index, ASSET provides highly effective execution strategies for a
range of pattern mining and query requests. As a result, ASSET offers near real-time responsiveness even
on large sequence data sets. Our experimental evaluation on several real-world and synthetic sequence
data sets demonstrates the versatility and efficiency of ASSET, showing it to be three or more orders of
magnitude faster than state-of-the-art techniques.
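For readers unfamiliar with sequential pattern mining, the following minimal sketch shows the kind of request ASSET is built to answer. It is a plain level-wise (Apriori-style) baseline that rescans the data for every pattern, not the ASSET index or its execution strategies; the function names, the `max_len` parameter, and the example sequences are assumptions made for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the symbols of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)

def support(pattern, sequences):
    """Number of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

def frequent_patterns(sequences, min_support, max_len=3):
    """Level-wise enumeration of frequent sequential patterns.

    A pattern is extended only if it is itself frequent, because every
    frequent pattern must have a frequent prefix.
    """
    alphabet = sorted({symbol for s in sequences for symbol in s})
    frequent = {}
    level = [(a,) for a in alphabet]
    for _ in range(max_len):
        kept = []
        for pattern in level:
            count = support(pattern, sequences)
            if count >= min_support:
                frequent[pattern] = count
                kept.append(pattern)
        level = [pattern + (a,) for pattern in kept for a in alphabet]
        if not level:
            break
    return frequent

# Illustrative click-stream sequences (made-up data).
sequences = [
    ["login", "view", "buy"],
    ["login", "buy"],
    ["view", "login", "view"],
]
patterns = frequent_patterns(sequences, min_support=2)
```

Changing `min_support` or the pattern semantics forces this baseline to rescan all sequences, which is the cost that makes interactive exploration hard without an index.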