Join is a complicated interface which is very commonly used in database systems. A JOIN is a means for combining fields from two tables (or more) by using values common to each. Join can also implemented by the application if the databased itself can not support this type of more complicated operations. For example, most NoSQL database use simple schemas and are therefore easier to scale out, thus they do not support complicate schemes, such as join. NoSQL has a very good scalability. However, the joins implemented in application level suck really badly as you’d need to do a lot of foreign key lookups, which would end up going to many, many different nodes. Distributed database which implement the join inside the database, has the additional optimization opportunity for the performance.
This paper focus on reduce the shuffle in distributed database join. It category the output of a join into 3 categories:
- Dimension:Dimension (D:D) dimensional merge two tables.
- Attribute:Attribute (A:A) compares the values of individual attribute.
- Attribute:Dimension (A:D/D:A) both compares the dimension and the value.
Example exist in the following figure:
The paper first design logical join algorithms, such as Hash Join (for A:A), Merge Join (for D:D) and Nested Loop Join (for A:D/D:A). Then it slice the data and mapping each slice to physical node. Each slice can be marked as Sparse Chunk or Dense Chunk. Finally, Physical Planning of Shuffle and join are executed. To optimize the shuffle, this paper use the density skew in data set.
The optimization is mainly for the beneficial skew as shown in the above figure. The Sparse Chunk is transfered to the Dense Chunk. Thus the data to be transfered can be reduced.