Our research was supported by Amazon Inc. and the Computer Research Association.

Research Areas

Approximate Query Processing (AQP)

Approximate Query Processing (AQP) is a promising direction in big data management. Complex queries on big data can be time-consuming; AQP is an alternative scheme that quickly generates approximate query answers with acceptable accuracy. AQP can be widely employed in data science fields such as exploratory data analysis (EDA), data mining, and machine learning, where approximate answers to complex queries are needed instantly.

Most AQP frameworks need to collect statistics from the data to estimate query results. Based on how the statistics are collected, there are two general directions: Online AQP and Offline AQP. Most existing research belongs to Online AQP, which collects statistics for estimation after a query is given. To achieve a high query estimation speed, online approaches usually rely on costly data structures, such as hash tables or indices, and on expensive high-end hardware. Moreover, statistics collected by online AQP cannot be re-used and must be recollected whenever a different query is given.

We are interested in the distinct direction of Offline AQP, which collects statistics before a query is given. The challenge lies in collecting statistics across multiple correlated tables in a database join graph. We introduce a new offline AQP framework, called Scalable Correlated Sample Synopsis (CS*), which can work on big data. CS* is independent of ancillary structures and can work on lower-end hardware, which is affordable and economical for small and medium-sized businesses. It comes with accurate and unbiased query estimators that can estimate common time-consuming queries, including join and aggregate queries. Moreover, CS* is the first work to address the problem of cyclic join query estimation. Because it is an offline AQP framework, the statistics it collects once can be re-used for future queries on the same database join graph. This work has been published at top conferences, including the 2019 IEEE International Conference on Big Data, and won the Best Paper Award of the 2019 International Conference of Software Engineering and Data Engineering.
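The offline idea described above, collecting statistics once and re-using them for later queries, can be illustrated with a minimal Python sketch. It uses a plain uniform (Bernoulli) sample over a single table, not the correlated multi-table synopsis of the framework itself; the table `orders`, its columns, and all function names are hypothetical.

```python
import random


def build_synopsis(rows, rate, seed=0):
    """Offline step: keep each row with probability `rate`, before any query is known."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def estimate_count(synopsis, rate, predicate):
    """Estimate COUNT(*) WHERE predicate by scaling the sample count by 1/rate."""
    return sum(1 for row in synopsis if predicate(row)) / rate


def estimate_sum(synopsis, rate, column, predicate):
    """Estimate SUM(column) WHERE predicate by scaling the sample sum by 1/rate."""
    return sum(row[column] for row in synopsis if predicate(row)) / rate


# Hypothetical table used only for illustration.
data_rng = random.Random(42)
orders = [
    {"price": data_rng.randint(1, 100), "region": data_rng.choice("ABCD")}
    for _ in range(100_000)
]

RATE = 0.05
synopsis = build_synopsis(orders, RATE)  # collected once

# Two different queries answered from the same synopsis.
approx_count = estimate_count(synopsis, RATE, lambda r: r["region"] == "A")
approx_sum = estimate_sum(synopsis, RATE, "price", lambda r: r["price"] > 50)

# Exact answers, computed here only to compare against the estimates.
exact_count = sum(1 for r in orders if r["region"] == "A")
exact_sum = sum(r["price"] for r in orders if r["price"] > 50)

print(f"COUNT: approx {approx_count:.0f}, exact {exact_count}")
print(f"SUM:   approx {approx_sum:.0f}, exact {exact_sum}")
```

Scaling by `1 / rate` keeps both estimators unbiased for a Bernoulli sample. An online scheme would instead have to gather fresh statistics for each of the two queries.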

AQP Example on TPC-H 100GB Test Data

NoSQL Database Write Optimization

Column-store databases are a special type of NoSQL database. They store data in columns instead of rows, which distinguishes them from common relational databases. Column-store databases offer much faster read speeds than relational databases and are thus widely used in data mining and OLAP (On-Line Analytical Processing) for scientific and business purposes. Their well-known weakness lies in writes, that is, adding or changing data. Addressing this weakness is called write optimization on column-store databases.
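The difference between the two layouts can be shown with a small Python sketch. It is only an in-memory illustration with hypothetical table and column names, not a model of any particular database engine.

```python
# Row layout: one record per entry, all attributes stored together.
rows = [
    {"id": 1, "name": "ann", "price": 10.0},
    {"id": 2, "name": "bob", "price": 25.5},
    {"id": 3, "name": "cat", "price": 4.5},
]


def to_columns(row_table):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in row_table:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def append_record(columns, record):
    """Writing one logical record has to touch every column separately."""
    for key, value in record.items():
        columns[key].append(value)


columns = to_columns(rows)

# An analytical read scans a single column and ignores the others.
total_price = sum(columns["price"])

# A write is spread across all columns.
append_record(columns, {"id": 4, "name": "dan", "price": 7.0})
```

The read touches one contiguous list, while the write touches three, which is the asymmetry that write optimization tries to reduce.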

Existing state-of-the-art work focuses on write optimization for column-store databases held in the main memory of a computer system, which can only store relatively small amounts of data. In contrast, I focused on write optimization for column-store databases in external memory and big data environments. I designed a new storage structure for column-store databases, called the Timestamped Binary Association Table (TBAT), to replace the conventional storage structure, the Binary Association Table (BAT). Based on TBAT, a special write operation called Asynchronous Out-of-Core Update is designed to speed up writes on column-store databases. Experimental studies show that TBAT is faster than BAT in writing by orders of magnitude. To improve the read speed on TBAT after many updates, I then introduced several read optimization methods on TBAT: three data cleaning methods and one index-based method, called the offset B+ Tree (OB-Tree). In addition, I extended the concept of TBAT from conventional file systems to big data file systems to improve the write performance of column-store databases in data-intensive environments.

This work has been published and presented orally at several international conferences: the 2015 International Conference on Database and Expert Systems (DEXA), the 2014 and 2015 International Conference on Computers and Their Applications (CATA), the 2014 and 2015 International Conference on Computer Applications in Industry and Engineering (CAINE), and the 2015 IEEE International Congress on Big Data (IEEE BigData).
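The general idea of a timestamped, append-only column can be sketched in Python as follows. This is a toy in-memory illustration under stated assumptions, not the published on-disk TBAT format or the Asynchronous Out-of-Core Update itself: the class name, its methods, the logical clock, and the single `clean` routine are all hypothetical.

```python
import itertools


class TimestampedColumn:
    """Toy column that records updates by appending (timestamp, oid, value) entries."""

    def __init__(self):
        self._entries = []                  # append-only log of entries
        self._clock = itertools.count(1)    # logical clock standing in for real timestamps

    def append(self, oid, value):
        """Insert or update: never seeks to the old entry, only appends a new one."""
        self._entries.append((next(self._clock), oid, value))

    def read(self, oid):
        """Return the value with the newest timestamp for this oid."""
        latest = None
        for ts, entry_oid, value in self._entries:
            if entry_oid == oid and (latest is None or ts > latest[0]):
                latest = (ts, value)
        if latest is None:
            raise KeyError(oid)
        return latest[1]

    def clean(self):
        """Drop superseded entries so that reads scan less data."""
        newest = {}
        for ts, oid, value in self._entries:
            if oid not in newest or ts > newest[oid][0]:
                newest[oid] = (ts, value)
        self._entries = sorted((ts, oid, value) for oid, (ts, value) in newest.items())

    def __len__(self):
        return len(self._entries)


price = TimestampedColumn()
price.append(1, 10.0)
price.append(2, 20.0)
price.append(1, 15.0)          # update of oid 1 is just another append

size_before = len(price)       # three entries, one of them stale
value_before = price.read(1)
price.clean()
size_after = len(price)        # stale entry removed
value_after = price.read(1)
```

The sketch shows the trade-off described above: appending makes writes cheap, but stale entries accumulate and slow down reads until a cleaning step removes them.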

Write Speed: TBAT (proposed method) vs BAT (traditional)

Database Query Optimization

To retrieve data from multiple tables, the most important query is the join query. Determining the optimal plan to execute a join query has been proven to be an NP-hard problem. Conventional research employs approximation methods such as histograms and samples to estimate the size of each join expression, which helps database query optimizers search for the optimal plan. Samples, as a database synopsis, have been shown to be simpler and faster than histograms; however, it is well known that simple random samples estimate long join queries very poorly.
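One way to see why independent random samples struggle with joins is the following Python sketch. It assumes each table is sampled independently at the same rate, so a join result over k tables survives only if all k of its tuples were picked. The tables, sizes, and function names are hypothetical.

```python
import random
from collections import Counter


def surviving_fraction(rate, num_tables):
    """Expected fraction of join results that survive independent sampling."""
    return rate ** num_tables


def bernoulli_sample(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


def join_size(foreign_keys, primary_keys):
    """Size of an equi-join between a foreign-key column and a key column."""
    available = Counter(primary_keys)
    return sum(available[key] for key in foreign_keys)


rng = random.Random(7)
customer_ids = list(range(10_000))                                   # key column
order_customer_ids = [rng.randrange(10_000) for _ in range(50_000)]  # foreign-key column

RATE = 0.1
sampled_customers = bernoulli_sample(customer_ids, RATE, rng)
sampled_orders = bernoulli_sample(order_customer_ids, RATE, rng)

true_size = join_size(order_customer_ids, customer_ids)
sampled_size = join_size(sampled_orders, sampled_customers)

print(f"true join size:    {true_size}")
print(f"sampled join size: {sampled_size}")
print(f"expected fraction kept for 2 tables: {surviving_fraction(RATE, 2):.4f}")
print(f"expected fraction kept for 5 tables: {surviving_fraction(RATE, 5):.6f}")
```

Although each table keeps about 10% of its rows, the two-table join keeps only about 1% of its results, and the fraction shrinks geometrically as the join chain grows, leaving too few sampled results for a reliable estimate.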

To address this problem, I created a novel synopsis, called correlated sample synopsis (CS2), which uses samples but can generate fast and accurate size estimations for join queries. A distinctive sampling algorithm, called correlated sampling, is developed to create CS2 quickly. Unlike other existing synopses, once a single CS2 synopsis has been created for a database based on its join graph, the sizes of all queries following the same join graph can be estimated efficiently and accurately from that one synopsis. Experimental studies show that CS2 is faster to construct and more accurate than existing synopses. In addition, CS2 provides estimations for a broader range of queries with fewer restrictions.

This result was accepted for a 30-minute oral presentation at the 2013 ACM SIGMOD International Conference on Management of Data, which is the flagship conference in database systems research.
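The intuition behind sampling tables in a correlated way can be sketched in Python. The text above does not spell out the correlated sampling algorithm, so the following is a simplified assumption rather than the published CS2 method: it samples one table, then pulls in exactly the tuples of the other table that join with the sampled ones. All table, column, and function names are hypothetical, and every order is assumed to reference an existing customer.

```python
import random


def correlated_sample(orders, customers, rate, seed=0):
    """Sample orders at `rate`, then keep every customer that a sampled order references."""
    rng = random.Random(seed)
    sampled_orders = [o for o in orders if rng.random() < rate]
    needed_ids = {o["customer_id"] for o in sampled_orders}
    sampled_customers = {cid: customers[cid] for cid in needed_ids}
    return sampled_orders, sampled_customers


def estimate_join_count(sampled_orders, sampled_customers, rate, customer_predicate):
    """Estimate |orders JOIN customers WHERE predicate| by scaling the sample count by 1/rate."""
    matches = sum(
        1 for o in sampled_orders
        if customer_predicate(sampled_customers[o["customer_id"]])
    )
    return matches / rate


# Hypothetical data used only for illustration.
data_rng = random.Random(1)
customers = {cid: {"region": data_rng.choice("ABCD")} for cid in range(10_000)}
orders = [
    {"order_id": i, "customer_id": data_rng.randrange(10_000)}
    for i in range(50_000)
]

RATE = 0.05
sampled_orders, sampled_customers = correlated_sample(orders, customers, RATE)


def in_region_a(customer):
    return customer["region"] == "A"


approx = estimate_join_count(sampled_orders, sampled_customers, RATE, in_region_a)
exact = sum(1 for o in orders if in_region_a(customers[o["customer_id"]]))

print(f"join size estimate: {approx:.0f}, exact: {exact}")
```

Because every sampled order still finds its customer, the join is effectively sampled once at `rate` instead of at `rate` squared, and the same pair of samples can answer other queries over the same two tables.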

Heterogeneous Cloud

Cloud computing has gained a lot of popularity because it promises stability and efficiency while lowering the cost of physical equipment and maintenance. With the surge of big data, more and more requests to enable big data in the cloud environment have been observed. However, to the best of our knowledge, no existing platform provides these functionalities.

YSU STEM Cloud is a research project started in 2013, jointly supported by multiple departments in the STEM College and by government-agency-funded projects directed by YSU professors. It provides versatile services, mainly virtual computing, virtual networks, virtual firewalls, and cloud storage with disaster recovery capability. This infrastructure will assist research and teaching at YSU in cloud computing, big data, high-performance computing, data mining, and machine learning.

The STEM Cloud runs OpenStack, a world-leading private cloud platform that features powerful and flexible cloud functionality. The current cloud infrastructure comprises 160 CPU cores, 320 GB of memory, and 57 TB of cloud storage.

YSU STEM Cloud – Running a Virtual Machine