Database Management Systems

A relational database stores both data and the relationships among the data, and it is the most widely used type of database. At the core of a relational database is the database engine, which processes SQL statements known as database queries. My research projects cover two major topics: query processing and query optimization.

Query Processing

In a database, SQL statements are referred to as queries. The major forms of computation supported by SQL are tuple-level computations and aggregate computations. Tuple-level computations are defined by expressions on variables in calculus expressions, or on the values of each tuple of a relation in algebraic expressions, whereas aggregate computations apply aggregate functions, such as minimum, maximum, average, and count, to sets of tuples. However, SQL provides no direct support for set computations, which involve comparisons between sets of tuples.
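
The difference between the two forms of computation can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The employee table, its columns, and its rows are hypothetical and serve only as an example.

```python
import sqlite3

# Hypothetical example data; the table and its contents are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann", "Dev", 100), ("Bob", "Dev", 80), ("Cid", "Ops", 60)],
)

# Tuple-level computation: the expressions are evaluated once per tuple.
tuple_level = conn.execute(
    "SELECT name, salary + 10 FROM employee WHERE salary > 70 ORDER BY name"
).fetchall()

# Aggregate computation: each function is applied to a set of tuples (one set per group).
aggregate = conn.execute(
    "SELECT dept, COUNT(*), MAX(salary) FROM employee GROUP BY dept ORDER BY dept"
).fetchall()

print(tuple_level)
print(aggregate)
```

The first query produces one output row for each qualifying input tuple, while the second produces one output row for each group of tuples.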

We proposed a collection of set comparison operators to enhance the usability of SQL. These operators let users formulate queries that are conventionally difficult to express in a simpler way, making SQL more user-friendly.
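
To show why such queries are conventionally difficult, the sketch below writes a set containment query ("which students have taken every required course?") in standard SQL using a doubly nested NOT EXISTS. The syntax of the proposed operators is not given here, so only the conventional form is shown; the taken and required tables and their rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taken (student TEXT, course TEXT);
CREATE TABLE required (course TEXT);
INSERT INTO taken VALUES
    ('Ann', 'DB'), ('Ann', 'OS'),
    ('Bob', 'DB'),
    ('Cid', 'OS'), ('Cid', 'DB'), ('Cid', 'AI');
INSERT INTO required VALUES ('DB'), ('OS');
""")

# "The set of courses taken by the student contains the set of required courses"
# has to be rephrased as: there is no required course that the student has not taken.
query = """
SELECT DISTINCT t.student
FROM taken AS t
WHERE NOT EXISTS (
    SELECT 1 FROM required AS r
    WHERE NOT EXISTS (
        SELECT 1 FROM taken AS t2
        WHERE t2.student = t.student AND t2.course = r.course
    )
)
ORDER BY t.student
"""
students = [row[0] for row in conn.execute(query)]
print(students)
```

The double negation is logically equivalent to the set comparison, but it is far from how a user would naturally state the question.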

Query Optimization

For each query submitted to a database, there are many alternative ways to execute it, which are called query plans. The number of possible query plans depends both on how many tables the query involves and on the operators it contains. Even with only a few tables and operations, this number can be very large. The process of finding the best, or optimal, plan for executing a query is referred to as query optimization. The most difficult problem in query optimization is finding the optimal plan for join queries, which is referred to as the join ordering problem.
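
The growth of the plan space can be made concrete with a small sketch. It assumes the standard textbook counts for join orders over n tables when cross products are allowed and only the shape and order of the join tree are considered: n! left-deep orders and (2(n-1))!/(n-1)! bushy trees. The function names are illustrative, and choices such as join algorithms or access paths would enlarge the space further.

```python
from math import factorial


def left_deep_plans(n: int) -> int:
    """Number of left-deep join orders for n tables: n!."""
    return factorial(n)


def bushy_plans(n: int) -> int:
    """Number of ordered bushy join trees for n tables: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_plans(n), bushy_plans(n))
```

With only four tables there are already 24 left-deep orders and 120 bushy trees, which is why enumerating every plan quickly becomes impractical.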

Big Data and NoSQL

Big data is characterized by four "V"s: volume (huge quantity), velocity (high speed), variety (rich formats), and veracity (uncertain data). It usually exceeds the processing capability of any single current computing and storage architecture. My research in big data focuses on two major areas: NoSQL and heterogeneous platforms. The latter topic aims to build a platform that provides both big data and cloud computing services.

NoSQL (or Not Only SQL) refers to databases that do not necessarily store data under relational restrictions. NoSQL databases comprise a large family of systems and are widely used in big data. A large subset of them is column-store databases. A notable feature of column-store databases is that they store data by column instead of by row. This design results in faster reads and a higher data compression rate compared with traditional row-based databases. However, optimizing write operations on column-store databases has long been a well-known challenge. Most existing work on write performance optimization focuses only on the in-memory environment, which is hardly applicable to big data.
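
The trade-off described above can be sketched in a few lines of Python. This is a toy in-memory model, not the storage format of any particular system; the sample rows, the column names, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sales records in row layout: one tuple per record.
rows = [
    ("2024-01-01", "US", 3),
    ("2024-01-01", "US", 5),
    ("2024-01-01", "DE", 2),
    ("2024-01-02", "DE", 7),
]

# Column layout: one list per attribute, so values of the same kind sit together.
columns = {
    "day": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "qty": [r[2] for r in rows],
}


def run_length_encode(values):
    """Compress consecutive repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def append_row(cols, row):
    """Appending one record touches every column, one reason writes are costly."""
    for name, value in zip(cols, row):
        cols[name].append(value)


# Reading: an aggregate over one attribute scans a single column only.
total_qty = sum(columns["qty"])

# Compression: repeated values in a column collapse into short runs.
encoded_day = run_length_encode(columns["day"])

print(total_qty, encoded_day)
```

A read that needs one attribute scans one list, and a column with repeated values compresses well, whereas inserting a single record requires an update to every column.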

Heterogeneous Cloud

Cloud computing has gained a lot of popularity because it promises to provide stability and proficiency while lowering the cost of physical equipment and maintenance. With the surge of big data, more and more requests to enable big data in the cloud environment have been observed. However, to the best of our knowledge, no platform yet provides these functionalities.

The STEM Heterogeneous Cloud (or STEHC) is a research project started in Fall 2013 with my faculty startup fund. The aim of this project is to create a platform that provisions both cloud computing and big data services. Existing platforms provide only one of these services, without the ability to integrate the two. In other words, this heterogeneous cloud combines cloud computing and big data storage in one platform. STEHC has two major components: a (private) cloud and a big data platform. The private cloud platform is built on OpenStack. For the big data platform, we use Apache Hadoop. Both platforms are composed of multiple modules, and each module provisions an individual functional service. The challenge of this project is to aggregate the many services of both platforms into one cluster.