Why did we create the Data Science Machine?
In recent years, more and more data has been collected and is coming online (aided by cloud infrastructure). As data scientists who regularly work with this data, we noticed a few important things:
- More and more data was about our interactions with a complex system (learning platforms, retail websites) or about our use of a complex system (physical systems like turbines or cars, or companies like airlines or Uber).
- The data had some very interesting properties: it was bursty, irregular, collected at many different time scales, and came in many different types (categorical, numeric, textual, and temporal).
- Most recent methodological developments, such as deep learning and other automated approaches, focus on images and text. System-related data of this kind was still left largely to human-driven, manual processing.
- We were encountering new problems in new domains at such a rate that the need to solve them far surpassed the need to refine existing machine learning solutions. That is, in most cases the first solution we designed was good enough, so we were never asked to improve it.
Working with this type of data and this growing demand, as we set out to create data-driven solutions for multiple problems, we noticed the following:
- It took a long time to get from data to models, and most of this delay was not due to computational bottlenecks. Without human ingenuity, raw data cannot be transformed into the data representations required by machine learning platforms.
- When hopping from one problem to another, it was cognitively challenging to understand the data and its context, and to remember the nuances of all the fields that composed the data. For example, one day we would be working on a data project from a car company that involved 7,000 signals (numeric, categorical) recorded by different systems within the car; the next day, we might be building predictive models for a software consulting firm that recorded data in 510 fields about the progress of multiple software development projects; and yet another day, we might be analyzing data from an online learning platform.
- We took the same steps for each project: organizing the data, linking and cleaning it, extracting features by writing customized scripts (while handling the nuances of burstiness, multiple time scales, and mixed types), and then building models once we had features. A sketch of this repeated pipeline follows this list.
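To make the repeated steps concrete, here is a minimal sketch of that kind of one-off pipeline in Python with pandas and scikit-learn. The file names, columns (`customer_id`, `timestamp`, `value`, `label`), and aggregations are hypothetical placeholders, not the Data Science Machine itself; it only illustrates the organize-link-clean-featurize-model sequence we kept rewriting by hand.

```python
# Minimal sketch of the manual pipeline, assuming hypothetical CSV inputs
# ("customers.csv", "events.csv") and hypothetical column names.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 1. Organize and link: load the raw tables that share a key.
customers = pd.read_csv("customers.csv")  # one row per entity, with a "label" column
events = pd.read_csv("events.csv", parse_dates=["timestamp"])  # bursty, irregular event log

# 2. Clean: drop malformed rows and fill simple gaps.
events = events.dropna(subset=["customer_id", "timestamp"])
events["value"] = events["value"].fillna(0.0)

# 3. Extract features with a customized script: aggregate each entity's
#    event history at a coarser time scale (counts, means, recency).
features = events.groupby("customer_id").agg(
    n_events=("timestamp", "count"),
    mean_value=("value", "mean"),
    last_seen=("timestamp", "max"),
)
features["days_since_last_event"] = (
    events["timestamp"].max() - features["last_seen"]
).dt.days
features = features.drop(columns=["last_seen"]).reset_index()

# 4. Build a model once the features exist.
data = customers.merge(features, on="customer_id", how="left").fillna(0.0)
X, y = data.drop(columns=["customer_id", "label"]), data["label"]
model = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())
```

Every new project meant rewriting steps 1 through 3 from scratch for a new schema, which is exactly the effort we wanted to systematize.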
We wanted to address our own need to scale up our data science efforts, and also to answer these questions: How can we reduce the time it takes to bring data into a format usable by machine learning algorithms? How can we bring some systematicity to this process, while accommodating as much of the data's complexity as possible?
Why fully automate?
We think it is important to challenge ourselves to build something that could replace our manual efforts on each new incoming data problem. A challenge like that pushes us to develop technologies we would not otherwise conceive. However, once the challenge is addressed, the technologies developed as part of the system are often used to enhance and aid humans, not necessarily to replace them. But starting with this goal allowed us to think big.