What is modern query federation?

Query Federation allows the querying of data stored in multiple different data sources, gives the ability to map multiple database systems into a single database, which can all be accessed via a single query.

Problem - Research Question

What is the most efficient way to run SQL queries from different data sources?

Action

Our project compares the speed of the execution of SQL queries which implement JOIN that are run on miniPresto using query federation against those that are run on PostgreSQL without query federation.

How we built it

We first had to create the database and load the data into it. In order to do that, we installed Java (JDK), Maven, PostgreSQL, and IntelliJ IDEA. Then, we installed WSL and Ubuntu in order to run commands in a bash shell to create the database and load the data. After the data was ready, we compiled the necessary code from the Git repository. Then came the main part -- we created the SQL queries using JOIN, and some of us ran the queries using miniPresto which used query federation, while some of us ran the queries using PostgreSQL without query federation. Once we ran the same queries using these two methods, we documented the speed of the execution of these queries in an excel sheet. We were finally able to compare the two methods. The result is that, using query federation is much more efficient than without it.

Results

The pushdown approach had the fastest operation times for each trial, for the hash joins. We were unable to finish investigating the nested loop joins, as the computational time was too long.

Challenges we ran into

One challenge was that the installation was really difficult; most members didn’t manage to install the software in time. Secondly, it required much thinking to implement algorithms with Java data structures, but with discussion and the help of our mentor, we were able to do it. The implementation of tables using Java lists was hard to understand. Also, there were OutOfMemory issues with big tables. We fixed this by increasing the memory. The nested loop join was meant to be inefficient, but it was too slow with large datasets to actually test the speed. One way we dealt with this is, when we ran the query, we turned print off so that it executed faster.

Accomplishments that we're proud of

We're proud that we were able to grasp a difficult, unfamiliar concept about database querying, solve problems and develop queries, and see the results of our research.

What we learned

We learned how to use the platforms and databases we used, and how to develop SQL queries using JOIN. We also learned how databases work.

Share this project:

Updates