DataHub is a unified, managed, collaborative platform that makes data processing easy. It consists of: (1) a flexible data store (files, relational databases, extensible to other data-storage backends) with sharing/collaboration capabilities, managed on behalf of different users/groups, and (2) an app ecosystem that hosts apps for various data-processing activities such as ingestion, curation, integration, discovery, query, analytics, visualization, and machine learning. DataHub users can pick any app from the App Center to process their data as their needs require.
Confer is a tool for conference planning. It helps (a) conference attendees find interesting papers/talks, organize their schedules, and discover people with similar interests, and (b) conference organizers schedule sessions, plan community interactions, understand community structure, and discover new areas of research, practice, methodologies, and emerging application areas. It has been deployed at 13 academic conferences including CHI, CSCW, KDD, ACM MM, SIGMOD, SIGIR, and WSDM, and has more than 18,000 unique users.
Distill is a general-purpose, example-based data cleaning/extraction tool for converting semi-structured text into a structured table. A user provides a few examples (two or three) by specifying the desired tabular output for a given sample of input text. The system uses these examples to automatically infer a model, which can then be used to extract the complete table from the raw file.
Barista is a distributed, synchronously replicated, fault-tolerant relational data store. It runs as a middleware service over database instances to provide an abstraction for a distributed relational store. It ensures that the data is replicated across many sets of Paxos state machines in replica groups to provide fault tolerance and recovery. The replication enables load balancing and availability; clients automatically fail over between replicas. Barista exposes SQL for data management. Client applications can use Barista with the same SQL code they used before, and under the hood it guarantees replication, consistency, and fault tolerance.
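To illustrate what client-side failover between replicas can look like, here is a minimal sketch in Python. It is not Barista's actual client API: `FailoverClient`, `ReplicaUnavailable`, and the callable replicas are all hypothetical names introduced only for this illustration, and the sketch assumes each replica is reachable through a function that either returns rows or raises an error.

```python
class ReplicaUnavailable(Exception):
    """Hypothetical error raised when a replica cannot serve a request."""


class FailoverClient:
    """Tries each replica in order and returns the first successful result."""

    def __init__(self, replicas):
        # Each replica is modeled as a callable taking a SQL string (an assumption).
        self.replicas = list(replicas)

    def execute(self, sql):
        last_error = None
        for replica in self.replicas:
            try:
                return replica(sql)
            except ReplicaUnavailable as exc:
                # Remember the failure and move on to the next replica.
                last_error = exc
        raise ReplicaUnavailable("all replicas failed") from last_error


def down_replica(sql):
    # Simulates a replica that is currently unreachable.
    raise ReplicaUnavailable("replica is down")


def healthy_replica(sql):
    # Simulates a replica that echoes the statement back as a single row.
    return [("ok", sql)]


client = FailoverClient([down_replica, healthy_replica])
rows = client.execute("SELECT 1")
print(rows)
```

The application code issues the same SQL regardless of which replica answers, which is the property the description above attributes to Barista.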
The problem with current mailing lists is that they are all-or-nothing: when one person sends an email, everyone gets it. People need to feel that their email is very important before they use the list, which makes polite people reluctant to use it at all. Many not-so-polite people reply-all and send to the entire list, which floods inboxes. As a result, many subscribers leave the list and miss useful information. Mail-X is a mailing list manager that allows more controlled, yet effective, information dissemination.
Current search engines use an inverted index to encode natural language text. Unfortunately, an inverted index fails to capture semantic understanding, so it can only go as far as returning relevant document hits. This project introduces an indexing scheme that can capture semantic understanding.
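As a minimal sketch of the limitation described above, the following Python builds a toy inverted index (token to document ids) and runs boolean AND queries against it. The function names and sample documents are assumptions made for this illustration; the sketch shows only the standard inverted index, not the indexing scheme this project proposes.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Two sample documents about the same topic, phrased with different words.
docs = {
    1: "the car was repaired yesterday",
    2: "an automobile mechanic fixed the engine",
}
index = build_inverted_index(docs)

print(search(index, "car"))         # only document 1 contains the token "car"
print(search(index, "automobile"))  # only document 2 contains "automobile"
```

Both documents describe a vehicle repair, yet a query for "car" never retrieves the second one, because the index matches tokens rather than meaning.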
Voice-X is an infrastructure designed for rural areas/slums where people do not have computers/internet access. The goal is to enable people to search for information, generate and share information, exchange knowledge, be social, and much more, all through their feature phones, over SMS.
Programmers often borrow code by searching the web (code repositories, documentation, blogs, and forums). Redprint is a browser-based IDE for PHP that integrates web search into the development environment to provide programmers with context-specific example code in real time, thus reducing the cost of searching and the time to develop software.
Sookasa is a security and data management layer on top of cloud services such as Dropbox. I wrote the initial system, which was later turned into a commercial company (http://sookasa.com).
Data Wrangler is an interactive system for creating data transformations. It combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects.
Anant Bhardwaj, Amol Deshpande, Aaron Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, and Rebecca Zhang. Collaborative Data Analytics with DataHub. VLDB 2015.
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J Elmore, Samuel Madden, and Aditya G Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. CIDR 2015.
Anant Bhardwaj, Juho Kim, Steven P. Dow, David Karger, Sam Madden, Robert C. Miller, and Haoqi Zhang. Attendee-Sourcing: Exploring the Design Space of Community-Informed Conference Scheduling. HCOMP 2014.
Juho Kim, Haoqi Zhang, Paul Andre, Lydia B. Chilton, Anant Bhardwaj, David Karger, Steven P. Dow, and Robert C. Miller. Cobi: Community-Informed Conference Scheduling. HCOMP 2013.
Anant P. Bhardwaj, Dave Luciano, and Scott R. Klemmer. Redprint: Integrating API specific "instant example" and "instant documentation" display interface in IDEs. UIST 2011.