William Kent

Database Technology Department

Hewlett-Packard Laboratories

Palo Alto, California

Aug 1992

> 1 INTRODUCTION . . . 1

> 2 THE GENERAL HYPER-JOIN . . . 2

> 3 SPECIALIZED OPERATORS . . . 3

>> 3.1 Cross Product and Append . . . 3

>> 3.2 Join, Equijoin, and Outer Joins . . . 4

>> 3.3 Complements and Negation . . . 4

> 4 HYPEREXTENSION FOR BOOLEAN VARIABLES . . . 5

> 5 SUMMARY . . . 6

> 6 IMPLEMENTATIONS . . . 6

> 7 CONCLUSIONS . . . 7

The *hyper-join* is a generalized binary operator on bags (multisets) which
subsumes a number of operators as special cases, such as several joins, union, and
complement. The operators are applicable to relations as well as to other aggregates of
objects.

The hyper-join operation is a useful generalization of joins, equi-joins, and outer joins in a form that also supports negation and Boolean variables. The approach is especially useful in strongly typed object systems where the underlying domains will often be finite object types.

The hyper-join provides a basis for defining those other operations by fixing values of parameters to the hyper-join.

The hyper-join operates on a pair of bags and an arbitrarily specified "match condition". The various specialized operators are differentiated by the nature of the bags, the nature of the match condition, and the actions to be taken when the match condition is or is not satisfied.

Let C and D be two bags. Let c IN C denote the number of occurrences of the element c in the bag C.

Let the *match condition* p be a binary predicate such that p(c,d) is true or
false for any c in C and d in D. The match condition induces three new bags:

- The "left" bag L contains those elements c in C for which there does not exist any element d in D such that p(c,d) is true. Cardinality is preserved: if c is in L, then c IN L = c IN C.
- The "right" bag R is correspondingly defined for those elements of d in D for which there are no matching elements in C. Again, if d is in R, then d IN R = d IN D.
- The "middle" or "matched" bag M consisting of all pairs <c,d> for which p(c,d) is true. The cardinality is a product, analogous to the behavior of a cross-product. If p(c,d), then <c,d> IN M = c IN C x d IN D.

C c2 c1 ....-------------.................... L | c2L | c1L | L ....|-----+-----|...-------------.... M | c2M | c1M | | d1M | d2M | M ....-------------...|-----+-----|.... R | d1R | d2R | R ....................-------------.... d1 d2 D

For many operations, C and D will be restricted to bags of pairs c=<c1,c2> and d=<d1,d2>. The components c1 and d1 participate in the match condition p(c1,d1), while the components c2 and d2 do not. For joins, c1 and d1 represent the sets of columns being compared in the join, while c2 and d2 are the other columns that may be in the relations. For operations not requiring C and D to consist of such pairs, we will characterize c2 and d2 as being nil.

The behavior of various operators can be described in terms of the contributions made by the bags L, R, and M to a result bag B.

The general hyper-join, trying to match C=[<c1,c2>] and D=[<d1,d2>] on the match condition p(c1,d1), yields the following results:

- The middle bag M contains tuples <c1,c2,d1,d2> for which p(c1,d1) is true. The number of occurrences of such a tuple is the product of the occurrences of <c1,c2> in C and the occurrences of <d1,d2> in D. (For simplicity, we have flattened <<c1,c2>, <d1,d2>> into <c1,c2,d1,d2>. It shouldn't make any significant difference.)
- For each occurrence of such a tuple in M, the result bag B will contain an occurrence of <c1,c2,d1,d2,T>. T is a Boolean true value.
- The left bag L contains tuples <c1,c2> from C such that there is no <d1,d2> in D for which p(c1,d1) is true. For each occurrence of such a tuple in L, the result bag B will contain an occurrence of <c1,c2,N,N,F>. F is a Boolean false value. N is a null construct; for relations, it is a sequence of an appropriate number of null values. In this particular case, if d1 represents two columns of a relation and d2 represents three columns, then the first N in <c1,c2,N,N,F> denotes two nulls while the second N denotes three nulls.
- Similarly, the right bag R contains tuples <d1,d2> from D such that there is no <c1,c2> in C for which p(c1,d1) is true. For each occurrence of such a tuple in R, the result bag B will contain an occurrence of <N,N,d1,d2,F>.

This can be summarized as follows:

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

hyper-join | hy | any | any | any | <c1,c2,N,N,F> | <c1,c2,d1,d2,T> | <N,N,d1,d2,F> |

Other operations produce restricted forms of these outputs.

As a trivial example, we can describe the cross-product C x D itself in these terms. The match condition p is trivially true for all elements of C and D; all elements qualify. Since nothing is really being compared, the components c2 and d2 are nil; C can be characterized as having elements c1, and D has elements d1. Everything winds up in the middle bag M, with nothing in L or R. The result is simply the middle bag M.

Append (concatenate, union without duplicate removal) can also be characterized in a trivial way by taking the match condition to be uniformly false. Nothing winds up in M. All of C winds up in L, D winds up in R, and the result B is all of L and R.

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

cross-product | xp | nil | any | T | nothing | <c1,d1> | nothing |

append | ap | nil | any | F | <c1> | nothing | <d1> |

The general join is based on an arbitrary match condition p(c1,d1) imposed on columns c1 of C and columns d1 of D. The result consists only of the middle bag M (we do not provide duplicate elimination).

Equijoin is like general join except that the match condition is equality and one copy of the matching columns is eliminated.

Outer joins are like the equijoin except that unmatched elements in L and R also contribute to the results, appropriately padded with nulls. The left outer join adds results from L, the right outer join takes results from R, and the symmetric outer join takes both.

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

join | jo | any | any | any | nothing | <c1,c2,d1,d2> | nothing |

equijoin | ej | any | any | = | nothing | <c1,c2,d2> | nothing |

left outer-join | lj | any | any | = | <c1,c2,N> | <c1,c2,d2> | nothing |

right outer-join | rj | any | any | = | nothing | <c1,c2,d2> | <d1,N,d2> |

sym. outer-join | sj | any | any | = | <c1,c2,N> | <c1,c2,d2> | <d1,N,d2> |

We can define a "general complement" gc which only requires portions of C and D to be comparable. If <c1,c2> is in C, it will occur in C gc D if and only if there is no <d1,d2> in D where c1=d1.

The ordinary relative complement C~D is the limiting case where the entire elements of C and D are compared. This is another case in which c2 and d2 are nil.

In both cases, the results come entirely from the left bag L.

If the members of D are taken from some underlying domain dom(D), then ¬D (the negation of D) can be defined as the relative complement dom(D)~D.

If D is a relation, then dom(D) is the cross-product of the underlying domains on which D is defined. In object systems, where the underlying domains might be such finite object types as employees and departments, dom(D) will often be finite.

Negation ¬D is thus defined here as a relative complement in which the bag C is dom(D), i.e., ¬D = dom(D)~D.

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

gen. complem. | gc | any | any | = | <c1,c2> | nothing | nothing |

rel. complem. | rc | nil | any | = | <c1> | nothing | nothing |

negation | ng | nil | dom(D) | = | <c1> | nothing | nothing |

In a language which supports Boolean variables it might be possible, for example, to ask for a list of employees together with a Boolean flag indicating whether or not they are exempt:

Select Employee e, Boolean b where b=Exempt(e);

This query would be difficult to execute if exempt employees were listed in a separate table, with non-exempt employees simply not included. In general, the extension of a predicate is maintained in a relation by including the values for which it is true and excluding the values for which it is false.

Variables in queries are typically bound to columns of relations. A term of the form y=f(x) is handled by binding x and y to columns. This doesn't generally work for predicate values. If p is a predicate (or Boolean expression), the variable b in the term b=p(z) can't be mapped to any column. The value of b, True or False, generally corresponds to the presence or absence of z in the underlying relation, and not to any explicit column value.

If a relation D is the extension of the predicate p, then p(z) is true for each member of D. For what elements is p false? It should be the members of ¬D.

We define the *hyperextension* of D to be a bag containing each
member of D with a T appended, together with each member of ¬D with an F appended.

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

hyperextension | hx | nil | dom(D) | = | <c1,F> | <c1,T> | nothing |

The hyperextension essentially converts the predicate p into a Boolean-valued function whose extension has an explicit column for the Boolean function value. Queries with Boolean variables can be executed by binding variables to such appended columns.

operator | op | c2,d2 | C | p(c1,d1) | from L | from M | from R |
---|---|---|---|---|---|---|---|

hyper-join | hy | any | any | any | <c1,c2,N,N,F> | <c1,c2,d1,d2,T> | <N,N,d1,d2,F> |

cross-product | xp | nil | any | T | nothing | <c1,d1> | nothing |

append | ap | nil | any | F | <c1> | nothing | <d1> |

join | jo | any | any | any | nothing | <c1,c2,d1,d2> | nothing |

equijoin | ej | any | any | = | nothing | <c1,c2,d2> | nothing |

left outer-join | lj | any | any | = | <c1,c2,N> | <c1,c2,d2> | nothing |

right outer-join | rj | any | any | = | nothing | <c1,c2,d2> | <d1,N,d2> |

sym. outer-join | sj | any | any | = | <c1,c2,N> | <c1,c2,d2> | <d1,N,d2> |

gen. complem. | gc | any | any | = | <c1,c2> | nothing | nothing |

rel. complem. | rc | nil | any | = | <c1> | nothing | nothing |

negation | ng | nil | dom(D) | = | <c1> | nothing | nothing |

hyperextension | hx | nil | dom(D) | = | <c1,F> | <c1,T> | nothing |

This overview is shown in another form in the summary tables.

These operations could be implemented as special cases within a general iterative framework much like current join algorithms. Some, like the symmetric and right outer joins (sj, rj) might be difficult to implement as nested loop joins; it's hard to determine the bag R, i.e., elements of the inner relation not matched by elements of the outer relation. This might be overcome by taking a cross-product with an underlying domain.

When required, the domain relation could be generated within a multi-way hyper-join including, say, the cross-product of types as well as a negation. For example, the query

Select T1 z1, T2 z2 where not q(z1,z2)

would translate to

T1 xp T2 ng Q.

Although we describe negation and Boolean variables in terms of domain relations, actual query computations will often involve subsets of domain relations defined by appropriate query transformations. Some transformations may replace an infinite domain relation with a finite subset.

For example, suppose Award(e,y), defined on employees and integers, is true if an employee e received an award in the year y. Suppose we wanted a list of employees and the years in this century when they did not receive awards:

Select Employee e, Integer y where Between(y,1900,2000) and not Award(e,y).

Although the domain relation domain(Award) is the infinite cross product of employees and integers, this query should be computable by replacing this domain with the cross product of employees and the set of integers between 1900 and 2000.

The general question of query transformation and optimization for these operations requires much investigation.

The hyper-join operation is a useful generalization of joins, equi-joins, and outer joins in a form that also supports negation and Boolean variables. The approach is especially useful in strongly typed object systems where the underlying domains will often be finite object types.

This has not been a formal treatment, but rather an informal exposition of the essential concepts sufficient to motivate both formalizations and implementations.