Instance-Optimized Database Systems

Machine learning has been applied across many computing fields, e.g., computer vision, natural language processing, artificial intelligence, and bioinformatics, where researchers aim to provide solutions that exhibit useful learning behavior autonomously. Data management is no exception: over the past few decades, a flurry of research efforts has explored the use of machine learning to automatically choose database indexes, update query optimizer plans, and materialize database views, among others. Although these efforts showed the important role that machine learning can play in improving the performance of database operations, they remain limited in scope and have not yet explored the full power of machine learning, because each was proposed to learn the behavior of a specific function within an individual database component.

The goal of this project is to envision a more holistic approach to building an end-to-end, machine-learning-empowered database system that custom-tailors its performance to user workloads and data distributions. The core components of database systems, e.g., data access methods, query optimizer, query scheduling, and query execution, can be fully replaced with learned components. I have been involved in using machine learning to improve data access and processing methods ([CIDR'21], [AIDB@VLDB'20]), multi-query scheduling ([SIGMOD'22]), hashing ([VLDB'23], [AIDB@VLDB'21]), and in-memory join processing ([VLDB'23]).
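
To make the idea of a learned data access method concrete, here is a minimal sketch of a learned index over a sorted array. It is an illustration only, not the code from the cited papers: the class name LinearLearnedIndex, the single linear model, and the error-bounded search window are all assumptions chosen for brevity. The model predicts where a key should sit, and a binary search over a small window corrects the prediction.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index (hypothetical): a linear model maps key -> position."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Largest prediction error over the stored keys; it bounds the
        # window that a lookup has to search.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        pos = self._predict(key)
        lo = max(pos - self.max_err, 0)
        hi = min(pos + self.max_err + 1, len(self.keys))
        # Binary search restricted to the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 99])
print(index.lookup(23))  # 4, the position of 23 in the sorted keys
print(index.lookup(7))   # None, since 7 is not stored
```

The closer the key distribution is to what the model can capture, the smaller max_err becomes and the less searching each lookup needs, which is the sense in which such a structure adapts to the data distribution.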

Open Source Code: Learned hashing [GitHub]
Awards: [NSF/CRA Computing Innovation Fellowship (CiFellow)'20-'23]
Media: [MIT News Article] [Amazon Science Post]


SMLN: Adapting Markov Logic Networks (MLN) for Big Spatial Data and Applications

Recently, there has been a proliferation in the amounts of spatial data produced by devices such as satellites, space telescopes, and medical devices. Various agencies need to analyze these unprecedented amounts of spatial data to extract useful information and make decisions in their applications. Meanwhile, Markov Logic Networks (MLN) have been introduced as an efficient and user-friendly framework for statistical learning and inference. Unfortunately, researchers have not yet taken advantage of the recent advances in MLN to boost the usability, scalability, and accuracy of the spatial machine learning tasks (e.g., spatial regression and spatial-aware knowledge bases) used in these applications.

The goal of this project was to provide the first full-fledged MLN framework with native support for spatial data, called Spatial Markov Logic Networks (SMLN). In particular, SMLN pushes spatial awareness inside the internal data structures and the core learning and inference functionalities of MLN, and hence inside all MLN-based machine learning techniques and applications. In this project, we presented three case studies on the efficiency of SMLN: Sya [ICDE'20] [SIGMOD'18 (Demo)], a system for spatial probabilistic knowledge base construction; TurboReg [TSAS'19] [SIGSPATIAL'18], a framework for scaling up spatial autologistic regression models; and Flash [SIGSPATIAL Special'19] [SIGSPATIAL'19 (SRC)] [VLDB'19 (Demo)], a framework for scalable spatial probabilistic graphical modeling. An introduction to the overall SMLN architecture can be found here [PhD@VLDB'19].

Tutorials related to this project: [MDM'21] [ICDE'20] [VLDB'19] [Slides]
Awards: [University of Minnesota Best Dissertation Honorable Mention'21] [University of Minnesota Doctoral Dissertation Fellowship'19-'20] [Gold Medal of Student Research Competition in SIGSPATIAL'19] [Best Paper Nomination in SIGSPATIAL'18]


CRA: Enabling Data-Intensive Applications in Containerized Environments

Common Runtime for Applications (CRA) is a software layer (library) that makes it easy to create and deploy distributed dataflow-style applications on top of resource managers such as Kubernetes and YARN, as well as in stand-alone cluster execution. Currently, we support stand-alone execution (just deploy an .exe on every machine in your cluster) as well as execution in a Kubernetes/Docker environment. CRA has been used at Microsoft to build both offline and streaming analytics platforms such as Quill and online microservice fabrics such as Ambrosia. An introduction to the overall CRA architecture can be found here [ICDE'19] [Full Version].

Open Source Code: [GitHub]


Efficient Spatial Query Processing/Optimization in MapReduce-based Data Processing Frameworks

In this project, we focused on supporting efficient built-in spatial query processing and optimization inside popular MapReduce-based data processing frameworks. Specifically, I worked toward this goal within two spatial-aware MapReduce-based systems, SpatialHadoop and Sphinx:

  • Optimizing spatial queries in SpatialHadoop: SpatialHadoop is an open-source MapReduce extension to Apache Hadoop designed specifically to work with spatial data. It is used to analyze huge spatial datasets on a cluster of machines. SpatialHadoop was recently acquired by the Eclipse Foundation, under the name GeoJini, as one of its LocationTech projects. I have been working on extending SpatialHadoop to support optimizing large-scale spatial queries, e.g., spatial join [SIGSPATIAL'2017] [SIGMOD'17 (SRC)].

    Open Source Code: [GitHub]
    Awards: [Selected among Top 10 Finalists of the Student Research Competition in SIGMOD'17]

  • Efficient spatial indexing and query execution in Sphinx: Sphinx provides lightning-fast, distributed SQL queries over petabytes of spatial data, based on Cloudera Impala. The main objective is to implement a full stack of spatial data processing, including the query parser, indexer, query planner, and query executor. I have been working on supporting efficient spatial indexing (e.g., grid and R-tree) and query processing inside Sphinx. An introduction to the overall Sphinx architecture can be found here [SSTD'17].

    Open Source Code: [GitHub]
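
As a small illustration of the grid indexing mentioned above, the following sketch builds a uniform grid over 2D points and answers a rectangular range query by visiting only the overlapping cells. It is a single-machine toy under stated assumptions, not the SpatialHadoop or Sphinx implementation: the GridIndex class, its method names, and the fixed cell_size parameter are hypothetical.

```python
from collections import defaultdict


class GridIndex:
    """Toy uniform grid index over 2D points (hypothetical, in-memory)."""

    def __init__(self, cell_size):
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (col, row) -> [(id, x, y), ...]

    def _cell(self, x, y):
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, point_id, x, y):
        self.cells[self._cell(x, y)].append((point_id, x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        """Return ids of points inside the closed query rectangle."""
        col0, row0 = self._cell(xmin, ymin)
        col1, row1 = self._cell(xmax, ymax)
        result = []
        # Visit only the cells that overlap the query rectangle.
        for col in range(col0, col1 + 1):
            for row in range(row0, row1 + 1):
                for point_id, x, y in self.cells.get((col, row), ()):
                    # Refine: boundary cells may hold points outside the query.
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        result.append(point_id)
        return result


grid = GridIndex(cell_size=10.0)
grid.insert("a", 1.0, 1.0)
grid.insert("b", 15.0, 15.0)
grid.insert("c", 25.0, 5.0)
print(sorted(grid.range_query(0.0, 0.0, 20.0, 20.0)))  # ['a', 'b']
```

A grid is cheap to build and easy to partition across machines, while an R-tree adapts better to skewed data; which one fits best depends on the dataset and the query workload.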

Other Projects

Device-Free Passive WLAN Localization

In this project, we developed and evaluated accurate multi-entity tracking solutions that use the effect of humans on Radio Frequency (RF) signals in WiFi environments to infer human presence and location. To achieve that, we investigated recent wireless technologies (e.g., 802.11n) combined with solid machine learning techniques [TMC'15] [WCNC'13] [GLOBECOM'12] [WINTECH'12 (Demo)].


Collaborative Machine Translation Evaluation

In this project, we focused on providing an efficient platform for integrating the automated evaluation of machine translations with human judgments to produce accurate quality estimates for large-scale translations. The resulting platform has been integrated as a web service in Microsoft Translator Hub, where it serves both public and private translation requests. A detailed introduction to this platform can be found here [Thesis].

Media: [Egypt Newspapers Coverage]


Efficient Semantic-based Recommendation

In this project, we proposed a novel approach for constructing an ontology from the Wikipedia graphs of categories and articles, to solve the problems of using traditional ontologies for text analysis in text-based recommendation systems. In addition, we proposed an efficient structure for user profiles that integrates smoothly with the constructed ontology [ISDA'10].

Awards: [Alexandria University's Best CS Bachelor's Thesis Award in 2010]