William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California
October 1988
> 1 INTRODUCTION . . . 1
>> 1.1 Counter Objects . . . 2
> 2 PROJECTIONS ON NO COLUMNS . . . 2
> 3 FAMILIES OF FUNCTIONS OVER RELATIONS . . . 3
>> 3.1 Generators . . . 3
>> 3.2 Predicates . . . 4
>> 3.3 Duplicates and Profiles . . . 4
>> 3.4 Function Families: Summary . . . 4
> 4 THE ARITHMETIC OF RELATIONAL OPERATORS . . . 6
>> 4.1 Select . . . 6
>> 4.2 Project . . . 7
>> 4.3 Eliminating Duplicates . . . 8
>> 4.4 Join . . . 8
>> 4.5 Concatenation and Union . . . 9
> 5 AN EXPERIMENT . . . 9
> 6 CONCLUSIONS . . . 10
By introducing counters, which are simultaneously integers and relations, we can develop a foundation for the relational model based on arithmetic instead of set theory. Duplicates can be accounted for, and predicates can be integrated with the relational operators. It may also be possible to develop arithmetic approaches to query processing and dependency theory.
A counter is a bag (multiset) of empty tuples. It is like a relation having no columns, yet containing a definite number of rows, each containing the empty tuple - hence, there are duplicates. We can say that its width is zero.
Finite counters are isomorphic to the non-negative integers, i.e., the counting numbers. A counter corresponds to the number expressing the quantity of empty tuples contained in the counter. In fact, until we encounter some consequent anomaly, we will say that the finite counters are the non-negative integers:
0 = [ ]
1 = [<>]
2 = [<>, <>]
3 = [<>, <>, <>]
...
We introduce an infinity object INF, with the property of being larger than any number. An infinite counter, being an infinite bag of empty tuples, corresponds to INF. (If this ever gives rise to insurmountable difficulties, e.g., with arithmetic on counters, we can comfortably limit ourselves to finite counters.)
Thus a counter is simultaneously all of following:
When we project a relation on n of its columns (without removing duplicates), we get a relation containing the same number of tuples as the original, with the resulting tuples containing only the elements from the projection columns. This is a familiar concept for n>0, and we can now extend it naturally to n=0.
Following the same definition, such a projection on no columns yields a relation containing the same number of tuples as the original, with the resulting tuples containing no elements. Thus the result of projecting on no columns is a relation of empty tuples, i.e., a counter.
Relations of empty tuples suggest an image of shadow relations. Projecting a relation on no columns leaves none of the substance of the original relation, yet there is a definite outline of something having been there: an empty tuple in the result marking the place of each tuple in the original. Like shadows, these counters are lacking a dimension.
On the other hand, if we think of counters as being integers (or INF), then such a projection on no columns actually yields the cardinality of the original relation: projecting a relation on no columns returns the number of rows in the relation. This is a pleasant and useful surprise. Serendipity at play.
There is a natural family of functions that can be defined over a relation, which are executed by selecting on the columns corresponding to their arguments and projecting on the columns corresponding to their results.
Consider an EmpDept relation with two columns, Emp and Dept, and two functions:
DeptOf: Emp -> Dept
EmpOf: Dept -> Emp.
DeptOf(Sam) is executed by selecting on Sam in the Emp column (Emp=Sam), and projecting the result on the Dept column.
EmpOf(Sales) is executed by selecting on Sales in the Dept column (Dept=Sales), and projecting the result on the Emp column.
In general, we can define functions whose arguments are selections on some columns of a relation and whose results are projections on the remaining columns.
Can we make sense of the boundary cases? What if we
(1) Select on no columns and project on all?
(2) Select on all columns and project on none?
Do those correspond to functions in any sensible way?
We can think of selection as restricting the tuples of a relation. If, as in Case (1), we have no restriction, then the whole relation is retained after selecting on no columns. If we then project on all columns, we still retain the whole relation. This sequence of operations is a "no-op", leaving the original relation unchanged.
Case (1) corresponds to a function in the family which takes no arguments, since it imposes no selection criteria. We can specify such a function as
Assignments: () -> Emp x Dept.
Such a function serves as a generator. Its result is the whole relation itself. Thus, Assignments() returns the EmpDept relation.
What about Case (2)? What if we select on all columns and project on none? This corresponds to a function that has all the columns as its arguments, e.g.,
Assigned: Emp x Dept -> ().
What should Assigned(Sam,Sales) return?
First we do a selection with the restrictions Emp=Sam and Dept=Sales. The result is either an empty relation or a relation containing the one tuple <Sam,Sales>. Then, when we project this on no columns, we get a zero-width shadow relation containing either nothing or one empty tuple, i.e., 0 or 1.
Thus the result of Assigned(Sam,Sales) is either 0 or 1, depending on whether or not the EmpDept relation contains the tuple <Sam,Sales>, i.e., whether or not Sam is assigned to the Sales department.
This Assigned function is behaving like a predicate, returning 1 for True and 0 for False. For now, we might want to think that True corresponds to a zero-column relation containing an empty tuple (i.e., 1), while False corresponds to an empty zero-column relation (i.e., 0).
In general, the function in the family taking all the columns as its arguments behaves like a predicate, testing whether or not a given tuple is in the relation.
What we've said about predicates applies to ideal relations, containing no duplicates. What happens with relations as they are implemented in real relational systems, which might contain duplicates?
Suppose that EmpDept contains three copies of <Sam,Sales> for some reason. EmpDept might be a view, or an unnamed intermediate result of a projection in a relational expression, or it might actually be a stored relation with duplicates. What is the result of Assigned(Sam,Sales)?
After the selection step, we have a relation containing three copies of <Sam,Sales>. After projection on no columns, we have a shadow relation containing three empty tuples, i.e., the number 3.
The Assigned function isn't exactly returning a truth value; it is returning the number of times that the tuple occurs in the relation. It is only when the relation is ideal, i.e., contains no duplicates, that the result must be 0 or 1.
For this general case, we will call functions such as the Assigned function profiles. Their results are counters. When the underlying relation does not admit duplicates, the profile is a predicate.
A family of functions
fi: x1,...,xj -> y1,...,yk
can be defined over a relation such that the x's and y's correspond to distinct columns of the relation, i.e., j+k=n, the degree of the relation. There are 2**n such functions, corresponding to all the subsets of the n columns which can be taken as arguments (ignoring permutations among the x's or among the y's).
Other similar functions might be defined whose arguments and results do not partition the columns, but we are not interested in those now.
A function call fi(c1,...,cj) returns a relation produced by:
1. Selecting on x1=c1 and ... and xj=cj.
2. Projecting on y1,...,yk.
For the boundary case j=0 (k=n), the function takes no arguments. This function is the generator in the family, returning the entire relation as its result.
For the boundary case j=n (k=0), the arguments correspond to an entire tuple of the underlying relation. This function is the profile in the family, returning the number of times that a tuple occurs in the relation (possibly zero or INF). If the relation does not admit duplicates, then the profile is a predicate, returning 0 or 1 for all arguments.
A profile can be thought of as the characteristic function of a relation, indicating whether - or how many times - a tuple occurs in the relation.
The constraint that a relation may not have duplicates can be expressed in terms of its profile:
p(x1,...,xn) < 2.
Note that the results of all the functions in such a family are relations. The generator returns the whole relation. The profile returns a shadow relation. Even a function call such as
DeptOf(Sam)
returns a relation containing a tuple containing Sam's department as its only element.
We add a few more observations about function families...
Function families also relate to certain notions in entity-relationship models. The generator and profile in a family correspond to an undirected relationship, while the other functions in the family correspond to the various directed relationships that can be defined as going "from" some of the participants "to" the others.
This is particularly intuitive when n=2, as in our example. The Assignments generator and the Assigned profile both correspond to the undirected relationship between employees and departments. The DeptOf function corresponds to the directed relationship from employees to departments, and EmpOf corresponds to the inverse directed relationship.
All the functions in a family are represented in the same underlying relation. In mapping between a functional model and a relational model, any function in the family can be used to correspond to the relation. If a just one function needs to be so designated, then both the generator and the profile are reasonable candidates.
It might be useful to substitute one family member for another during query transformations. A query to find y such that y=f(x) is simply an invocation of f (select on x, project on y); a query to find x such that f(x)=y is an invocation of the family member g such that g(y)=x (select on y, project on x).
The relational operators are closed on relations, i.e., when applied to relations they yield relations. It might be interesting to develop a parallel arithmetic formulation, expressing the profiles of the results in terms of the profiles of the operands.
Assigned(x,y) is the profile for our EmpDept relation. Assigned(Sam,Sales) has the value 1 or 0 (assuming no duplicates), depending on whether or not the tuple <Sam,Sales> occurs in the relation.
Suppose we wanted to select on Emp=Sam, yielding a relation Q, whose profile is q(x,y).
According to our earlier notions of True and False, we could say that x=Sam is a numeric-valued expression, returning 1 if x has Sam as its value and 0 otherwise.
We could then express the profile q in terms of multiplication:
q(x,y) ::= Assigned(x,y) * x=Sam.
What this says is that the number of occurrences of <x,y> in Q will be the number of occurrences of <x,y> in EmpDept multiplied by the trueness of x=Sam. In effect, the relation Q will contain the tuple <x,y> if and only if that tuple occurs in the EmpDept relation and x=Sam. That's exactly what we want for the selection operation.
We also get correct results if there are duplicates. <Sam,Sales> will occur in Q as many times as it occurs in EmpDept.
We could use multiplication for any conjunction of selection criteria. If we wanted to produce Q by selecting on Emp=Sam and Dept=Sales, we could write
q(x,y) ::= Assigned(x,y) * x=Sam * y=Sales.
Selection criteria based on other comparisons can be handled in the same way, so long as we define the comparison operators to return 0 or 1 as their results. And we can easily accommodate comparisons among columns, such as x<y.
Given:
then the result of these selections is a relation Q whose profile q is given by
q(x1,...,xn) ::= p(x1,...,xn) * s1 * ... * sj.
If we wished to be even more general, we could define selection by any expression of the form
q(x1,...,xn) ::= p(x1,...,xn) * f(x1,...,xn),
so long as f always returns 0 or 1. This could turn into a mechanism for injecting duplicates if we allowed f to return values greater than 1.
We will do projection without eliminating duplicates.
Given a relation P characterized by the profile
p(x1,...,xn,y1,...,ym),
a projection on the columns corresponding to x1,...,xn yields a relation Q characterized by the profile
q(x1,...,xn) ::= SUM[y1,...,ym] p(x1,...,xn,y1,...,ym).
What does that mean? Suppose we projected the EmpDept relation on the Emp column, yielding a relation Q whose profile is q(x). The number of times that <Sam> occurs in the projection is given by q(Sam). How do we compute that? We take the summation of
Assigned(Sam,y)
over all possible values of y. Sam will occur in the resulting projection once for each department y such that <Sam,y> occurs in EmpDept.
This formulation is correct for projections of relations containing duplicates. If the tuple <Sam,d1> occurs k times in EmpDept, it will contribute k occurrences of Sam to the projection Q.
Let's check the formula for the boundary cases. When projecting on all columns, we have
q(x1,...,xn) ::= SUM[] p(x1,...,xn).
There is nothing to sum over, and we really have
q(x1,...,xn) ::= p(x1,...,xn),
which is what we wanted: the result of projecting a relation on all of its columns is just the original relation itself.
When projecting on no columns, we have
q() ::= SUM[y1,...,ym] p(y1,...,ym).
That is, we sum the number of occurrences of each tuple in the original relation, yielding - as we had hoped - the cardinality of the original relation.
If P is the underlying relation characterized by the profile p, we can define Card(P) as the projection on no columns:
Card(P) ::= SUM[y1,...,ym] p(y1,...,ym).
We introduce a unary arithmetic operator delta ("distinct"), which returns 0 if its argument is 0 and 1 otherwise.
If p(...) characterizes an arbitrary relation, then the corresponding relation with duplicates removed is characterized by the profile delta p(...). If the original relation contains no duplicates, then delta p(...)=p(...).
Projection with duplicates removed is defined by
q(x1,...,xn) ::= delta SUM[y1,...,ym] p(x1,...,xn,y1,...,ym).
Cross-products are trivially defined by multiplication:
r(x1,...,xn,y1,...,ym) ::= p(x1,...,xn) * q(y1,...,ym).
This is correct for p and q containing duplicates as well.
Generalized joins can be expressed by combining our definitions for cross-product, select, and project. That can get tedious.
Equi-joins are easily characterized by
r(x,y,z) ::= p(x,y) * q(x,z).
(x, y, and z can be sets of variables, e.g., for multi-column joins.)
Try some examples yourself, if you have any doubts.
Concatenation (union without duplicate elimination) is easily characterized in terms of addition:
r(x,y) ::= p(x,y) + q(x,y).
True union is given by
r(x,y) ::= delta (p(x,y) + q(x,y)).
Let's experiment with something more radical. Let's see what happens if we say that a relation P and its profile p are in fact the same thing.
We said that the result of any function in the family is a relation. If the result of f(c) is in fact a profile, what could that mean? It means that f(c) could be applied to another object as its argument as f(c)(d), yielding a counter value. What could that mean?
Let's say that the result of f(c) is r. If it is a profile, then r(d) returns a counter indicating how many times d occurs in the relation characterized by r. In effect, r(d) indicates how many times d occurs in the result of f(c); we can write that as f(c)(d).
We should be able to show that the number f(c)(d) is the same as the number of occurrences of <c,d> in the underlying relation P, which should also be the value of p(c,d). We won't do that exercise right now, and might even leave it for the reader, but we'll just take note of the result:
f(c)(d) = p(c,d).
We will in fact use that notation to indicate how the function f is defined to be in the family characterized by the profile p.
The constraint that a function have no duplicates in its results can be expressed in the form
f(x)(y) < 2,
i.e., the number of occurrences of any value of y in the result of f(x) must be less than 2. This turns out to be equivalent to not allowing duplicates in the underlying relation:
p(x,y) < 2.
Card(P), the cardinality of the underlying relation, can now be written as Card(p). That is, profiles have cardinalities, equivalent to the cardinalities of their underlying relations. Since f(c) yields a relation/profile, it is meaningful to write Card(f(c)) to express the cardinality of the result of f(c). It follows that the constraint that a function f be single-valued can be expressed as
Card(f(x)) < 2.
We summarize the two constraints we have developed for the function f:x->y :
No duplicates: f(x)(y) < 2.
Single-valued: Card(f(x)) < 2.
With single-valuedness constraints, we can explore functional dependencies. Given a relation characterized by p(x,y), and a function defined by
f(x)(y) = p(x,y),
it turns out that the functional dependency x->y is equivalent to f being single-valued. That is,
Card(f(x)) < 2
expresses the functional dependency x->y. This should suggest some means of reasoning over functional dependencies via the laws of arithmetic.
We're not sure where else this experiment might lead. We are open to opportunities.
Almost as an afterthought: we do need to account for generators. If g is the generator function for the relation P, we had said that g()=P. Now it turns out that g()=p, i.e., the result of g() is the profile function itself. Curious.
Another afterthought: it turns out that, for consistency, the counters themselves need to be profiles. Applied to no arguments, they return themselves as results. A mathematician's delight.
Having an arithmetic basis for relational operators has some interesting consequences.
The numeric values of profiles provides a formal means of accounting for duplicates. This extended relational model, admitting duplicates, is still theoretically sound, resting on the soundness of arithmetic itself. The traditional relational model is subsumed, since all the appropriate behaviors are preserved when there are no duplicates.
Results of complex relational expressions might be analyzable in terms of arithmetic calculations.
Valid query transforms may be expressible in terms of arithmetic laws, e.g., commutativity, associativity, distribution, etc. New transforms can be sought out in terms of equivalence of arithmetic expressions.
Predicates behave similarly to relational operators, being functions that return relations. They are strange, on the other hand, since their results have zero width, and are not obtained directly from column values.
There are no clues to implementation in all of this. The theory may be elegant, but how do you physically represent empty tuples and zero-width relations? That's an exercise for the reader.