The Hyper-Join

William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California

Aug 1992


> 1 INTRODUCTION . . . 1
> 2 THE GENERAL HYPER-JOIN . . . 2
> 3 SPECIALIZED OPERATORS . . . 3
>> 3.1 Cross Product and Append . . . 3
>> 3.2 Join, Equijoin, and Outer Joins . . . 4
>> 3.3 Complements and Negation . . . 4
> 4 HYPEREXTENSION FOR BOOLEAN VARIABLES . . . 5
> 5 SUMMARY . . . 6
> 6 IMPLEMENTATIONS . . . 6
> 7 CONCLUSIONS . . . 7


1 INTRODUCTION

The hyper-join is a generalized binary operator on bags (multisets) which subsumes a number of operators as special cases, such as several joins, union, and complement. The operators are applicable to relations as well as to other aggregates of objects.

The hyper-join operation is a useful generalization of joins, equi-joins, and outer joins in a form that also supports negation and Boolean variables. The approach is especially useful in strongly typed object systems where the underlying domains will often be finite object types.

The hyper-join provides a basis for defining those other operations by fixing values of parameters to the hyper-join.

2 THE GENERAL HYPER-JOIN

The hyper-join operates on a pair of bags and an arbitrarily specified "match condition". The various specialized operators are differentiated by the nature of the bags, the nature of the match condition, and the actions to be taken when the match condition is or is not satisfied.

Let C and D be two bags. Let c IN C denote the number of occurrences of the element c in the bag C.

Let the match condition p be a binary predicate such that p(c,d) is true or false for any c in C and d in D. The match condition induces three new bags:

          C
       c2    c1
....-------------....................
 L  | c2L | c1L |                  L
....|-----+-----|...-------------....
 M  | c2M | c1M |   | d1M | d2M |  M
....-------------...|-----+-----|....
 R                  | d1R | d2R |  R
....................-------------....
                       d1    d2
                          D

For many operations, C and D will be restricted to bags of pairs c=<c1,c2> and d=<d1,d2>. The components c1 and d1 participate in the match condition p(c1,d1), while the components c2 and d2 do not. For joins, c1 and d1 represent the sets of columns being compared in the join, while c2 and d2 are the other columns that may be in the relations. For operations not requiring C and D to consist of such pairs, we will characterize c2 and d2 as being nil.

The behavior of various operators can be described in terms of the contributions made by the bags L, R, and M to a result bag B.

The general hyper-join, trying to match C=[<c1,c2>] and D=[<d1,d2>] on the match condition p(c1,d1), yields the following results:

This can be summarized as follows:

operator op c2,d2 C p(c1,d1) from L from M from R
hyper-join hy any any any <c1,c2,N,N,F> <c1,c2,d1,d2,T> <N,N,d1,d2,F>
N=Nulls, T=True, F=False

Other operations produce restricted forms of these outputs.

3 SPECIALIZED OPERATORS

3.1 Cross Product and Append

As a trivial example, we can describe the cross-product C x D itself in these terms. The match condition p is trivially true for all elements of C and D; all elements qualify. Since nothing is really being compared, the components c2 and d2 are nil; C can be characterized as having elements c1, and D has elements d1. Everything winds up in the middle bag M, with nothing in L or R. The result is simply the middle bag M.

Append (concatenate, union without duplicate removal) can also be characterized in a trivial way by taking the match condition to be uniformly false. Nothing winds up in M. All of C winds up in L, D winds up in R, and the result B is all of L and R.

operator op c2,d2 C p(c1,d1) from L from M from R
cross-product xp nil any T nothing <c1,d1> nothing
append ap nil any F <c1> nothing <d1>
N=Nulls, T=True, F=False

3.2 Join, Equijoin, and Outer Joins

The general join is based on an arbitrary match condition p(c1,d1) imposed on columns c1 of C and columns d1 of D. The result consists only of the middle bag M (we do not provide duplicate elimination).

Equijoin is like general join except that the match condition is equality and one copy of the matching columns is eliminated.

Outer joins are like the equijoin except that unmatched elements in L and R also contribute to the results, appropriately padded with nulls. The left outer join adds results from L, the right outer join takes results from R, and the symmetric outer join takes both.

operator op c2,d2 C p(c1,d1) from L from M from R
join jo any any any nothing <c1,c2,d1,d2> nothing
equijoin ej any any = nothing <c1,c2,d2> nothing
left outer-join lj any any = <c1,c2,N> <c1,c2,d2> nothing
right outer-join rj any any = nothing <c1,c2,d2> <d1,N,d2>
sym. outer-join sj any any = <c1,c2,N> <c1,c2,d2> <d1,N,d2>
N=Nulls, T=True, F=False

3.3 Complements and Negation

We can define a "general complement" gc which only requires portions of C and D to be comparable. If <c1,c2> is in C, it will occur in C gc D if and only if there is no <d1,d2> in D where c1=d1.

The ordinary relative complement C~D is the limiting case where the entire elements of C and D are compared. This is another case in which c2 and d2 are nil.

In both cases, the results come entirely from the left bag L.

If the members of D are taken from some underlying domain dom(D), then D (the negation of D) can be defined as the relative complement dom(D)~D.

If D is a relation, then dom(D) is the cross-product of the underlying domains on which D is defined. In object systems, where the underlying domains might be such finite object types as employees and departments, dom(D) will often be finite.

Negation D is thus defined here as a relative complement in which the bag C is dom(D), i.e., D = dom(D)~D.

operator op c2,d2 C p(c1,d1) from L from M from R
gen. complem. gc any any = <c1,c2> nothing nothing
rel. complem. rc nil any = <c1> nothing nothing
negation ng nil dom(D) = <c1> nothing nothing
N=Nulls, T=True, F=False

4 HYPEREXTENSION FOR BOOLEAN VARIABLES

In a language which supports Boolean variables it might be possible, for example, to ask for a list of employees together with a Boolean flag indicating whether or not they are exempt:

Select Employee e, Boolean b where b=Exempt(e);

This query would be difficult to execute if exempt employees were listed in a separate table, with non-exempt employees simply not included. In general, the extension of a predicate is maintained in a relation by including the values for which it is true and excluding the values for which it is false.

Variables in queries are typically bound to columns of relations. A term of the form y=f(x) is handled by binding x and y to columns. This doesn't generally work for predicate values. If p is a predicate (or Boolean expression), the variable b in the term b=p(z) can't be mapped to any column. The value of b, True or False, generally corresponds to the presence or absence of z in the underlying relation, and not to any explicit column value.

If a relation D is the extension of the predicate p, then p(z) is true for each member of D. For what elements is p false? It should be the members of D.

We define the hyperextension of D to be a bag containing each member of D with a T appended, together with each member of D with an F appended.

operator op c2,d2 C p(c1,d1) from L from M from R
hyperextension hx nil dom(D) = <c1,F> <c1,T> nothing
N=Nulls, T=True, F=False

The hyperextension essentially converts the predicate p into a Boolean-valued function whose extension has an explicit column for the Boolean function value. Queries with Boolean variables can be executed by binding variables to such appended columns.

5 SUMMARY

operator op c2,d2 C p(c1,d1) from L from M from R
hyper-join hy any any any <c1,c2,N,N,F> <c1,c2,d1,d2,T> <N,N,d1,d2,F>
cross-product xp nil any T nothing <c1,d1> nothing
append ap nil any F <c1> nothing <d1>
join jo any any any nothing <c1,c2,d1,d2> nothing
equijoin ej any any = nothing <c1,c2,d2> nothing
left outer-join lj any any = <c1,c2,N> <c1,c2,d2> nothing
right outer-join rj any any = nothing <c1,c2,d2> <d1,N,d2>
sym. outer-join sj any any = <c1,c2,N> <c1,c2,d2> <d1,N,d2>
gen. complem. gc any any = <c1,c2> nothing nothing
rel. complem. rc nil any = <c1> nothing nothing
negation ng nil dom(D) = <c1> nothing nothing
hyperextension hx nil dom(D) = <c1,F> <c1,T> nothing
N=Nulls, T=True, F=False

This overview is shown in another form in the summary tables.

6 IMPLEMENTATIONS

These operations could be implemented as special cases within a general iterative framework much like current join algorithms. Some, like the symmetric and right outer joins (sj, rj) might be difficult to implement as nested loop joins; it's hard to determine the bag R, i.e., elements of the inner relation not matched by elements of the outer relation. This might be overcome by taking a cross-product with an underlying domain.

When required, the domain relation could be generated within a multi-way hyper-join including, say, the cross-product of types as well as a negation. For example, the query

Select T1 z1, T2 z2 where not q(z1,z2)

would translate to

T1 xp T2 ng Q.

Although we describe negation and Boolean variables in terms of domain relations, actual query computations will often involve subsets of domain relations defined by appropriate query transformations. Some transformations may replace an infinite domain relation with a finite subset.

For example, suppose Award(e,y), defined on employees and integers, is true if an employee e received an award in the year y. Suppose we wanted a list of employees and the years in this century when they did not receive awards:

Select Employee e, Integer y where Between(y,1900,2000) and not Award(e,y).

Although the domain relation domain(Award) is the infinite cross product of employees and integers, this query should be computable by replacing this domain with the cross product of employees and the set of integers between 1900 and 2000.

The general question of query transformation and optimization for these operations requires much investigation.

7 CONCLUSIONS

The hyper-join operation is a useful generalization of joins, equi-joins, and outer joins in a form that also supports negation and Boolean variables. The approach is especially useful in strongly typed object systems where the underlying domains will often be finite object types.

This has not been a formal treatment, but rather an informal exposition of the essential concepts sufficient to motivate both formalizations and implementations.