William Kent, "Shadow Relations and Function Families", HPL-SAL-89-18, Hewlett-Packard Laboratories, Jan. 10, 1989. [10 pp]

William Kent

January 1989

> 1 INTRODUCTION . . . 2

> 2 COUNTER OBJECTS . . . 2

> 3 PROJECTIONS ON NO COLUMNS . . . 2

> 4 FAMILIES OF FUNCTIONS OVER RELATIONS . . . 4

>> 4.1 Generators . . . 5

>> 4.2 Predicates . . . 5

>> 4.3 Duplicates and Profiles . . . 6

>> 4.4 Function Families: Summary . . . 7

> 5 THE ARITHMETIC OF RELATIONAL OPERATORS . . . 8

>> 5.1 Select . . . 8

>> 5.2 Project . . . 9

>> 5.3 Eliminating Duplicates . . . 11

>> 5.4 Join . . . 11

>> 5.5 Concatenation and Union . . . 12

> 6 AN EXPERIMENT . . . 12

> 7 CONCLUSIONS . . . 14

> 8 REFERENCES . . . 15

By introducing counters, which are simultaneously integers and relations, we can develop a foundation for the relational model based on arithmetic instead of set theory. Duplicates can be accounted for, and logical predicates can be integrated with the relational operators. It may also be possible to develop arithmetic approaches to query processing and dependency theory.

Together with function families, also introduced here, counters suggest a basis for incorporating the relational model into a function/object model as used in the Iris object-oriented database system [Fi].

A *counter* is a bag (multiset) of empty tuples. It is like a relation having no
columns, yet containing a definite number of rows, each containing the empty tuple -
hence, there are duplicates. We can say that its width is zero.

Finite counters are isomorphic to the non-negative integers, i.e., the counting
numbers. A counter corresponds to the number expressing the quantity of empty tuples
contained in the counter. In fact, until we encounter some consequent anomaly, we will say
that the finite counters *are* the non-negative integers:

0 = [ ]

1 = [<>]

2 = [<>, <>]

3 = [<>, <>, <>]

...

We introduce an infinity object INF, with the property of being larger than any number. An infinite counter, being an infinite bag of empty tuples, corresponds to INF. (If this ever gives rise to insurmountable difficulties, e.g., with arithmetic on counters, we can comfortably limit ourselves to finite counters.)

Thus a counter is simultaneously all of following:

- A bag of empty tuples.
- A relation (with duplicates) having a definite number of rows, but no columns.
- A non-negative integer or INF.

When we project a relation on n of its columns (without removing duplicates), we get a relation containing the same number of tuples as the original, with the resulting tuples containing elements from only the projection columns. This is a familiar concept for n>0, and we can now extend it naturally to n=0.

Following the same definition, such a projection on no columns yields a relation containing the same number of tuples as the original, with the resulting tuples containing no elements. Thus the result of projecting on no columns is a relation of empty tuples, i.e., a counter.

Relations of empty tuples suggest an image of shadow relations. Projecting a relation on no columns leaves none of the substance of the original relation, yet there is a definite outline of something having been there: an empty tuple in the result marking the place of each tuple in the original. Like shadows, these counters are lacking a dimension.

On the other hand, if we think of counters as being integers (or INF), then such a projection on no columns actually yields the cardinality of the original relation: projecting a relation on no columns returns the number of rows in the relation. This is a pleasant and useful surprise. Serendipity at play.

For example, we start with the following Assignments relation as a base

Emp Dept Date ---------------------- | <Sam, Sales, 1984> | ---------------------- | <Sam, Purch, 1987> | ---------------------- | <Ann, Acctg, 1986> | ----------------------

A projection on the first two columns yields

Emp Dept ---------------- | <Sam, Sales> | ---------------- | <Sam, Purch> | ---------------- | <Ann, Acctg> | ----------------

A projection on the first column (without removing duplicates) yields

Emp --------- | <Sam> | --------- | <Sam> | --------- | <Ann> | ---------

By extension, projection on zero columns yields

- | - | - | -

This last shows our diagramming convention for a shadow relation. In this case it is equal to 3, which is the cardinality of the base relation.

There is a natural family of functions that can be defined over a relation, which are executed by selecting on the columns corresponding to their arguments and projecting on the columns corresponding to their results. (We mean "select" in the sense of the relational "restrict" operation, not the SQL query operator.) Consider the following functions corresponding to our Assignments relation:

WorkHist: Emp -> Dept x Date

StaffHist: Dept -> Emp x Date

Events: Date -> Emp x Dept

AsgmtDate: Emp x Dept -> Date

WhereAsgd: Emp x Date -> Dept

StaffOnDate: Dept x Date -> Emp

WorkHist(Sam) is effectively obtained by

1. Selecting on the argument roles: Emp = Sam

---------------------- | <Sam, Sales, 1984> | ---------------------- | <Sam, Purch, 1987> | ----------------------

2. Projecting on the result roles: Dept, Date

----------------- | <Sales, 1984> | ----------------- | <Purch, 1987> | -----------------

This result is Sam's employment history. The results of other functions in the family can be obtained by the same select/project paradigm.

Two likely candidates are notably absent in our example family:

Assignments: () -> Emp x Dept x Date

WasAssigned: Emp x Dept x Date -> ()

The first of these takes no arguments; it would be evaluated by selection on no columns, projection on all. The second, where all columns correspond to arguments, must involve projection on no columns.

What sense can we make of these boundary cases?

We can think of selection as restricting the tuples of a relation. If, as in the first boundary case, we have no restriction, then the whole relation is retained after selecting on no columns. If we then project on all columns, we still retain the whole relation. This sequence of operations is a "no-op", leaving the original relation unchanged.

Thus the function in the family which takes no arguments serves as a *generator*
for the family. Its result is the whole relation itself. Our Assignments function is aptly
named, since its value is the original Assignments relation itself.

What about the second case? What if we select on all columns and project on none? What should be returned by WasAssigned(Sam,Purch,1987)?

First we do a selection with the restrictions Emp=Sam, Dept=Purch, and Date=1987, yielding the intermediate result

---------------------- | <Sam, Purch, 1987> | ----------------------

In our second step, projecting this on no columns, we get a shadow relation with one empty tuple.

If we instead evaluate WasAssigned(Ann,Purch,1987), we get an empty relation as the intermediate result, and the final result is also the empty relation. Thus

- WasAssigned(Sam, Purch, 1987) = | = [< >] = 1 - WasAssigned(Ann, Purch, 1987) = - = [ ] = 0

The WasAssigned function is behaving like a predicate, returning 1 for True and 0 for False. For now, we might want to think that True corresponds to a zero-column relation containing an empty tuple (i.e., 1), while False corresponds to an empty zero-column relation (i.e., 0).

In general, the function in the family taking all the columns as its arguments behaves like a predicate, testing whether or not a given tuple is in the relation.

What we've said about predicates applies to ideal relations, containing no duplicates. What happens with relations as they are implemented in real relational systems, which might contain duplicates?

Suppose that our relation contained two copies of <Sam,Purch,1987> for some reason:

---------------------- | <Sam, Sales, 1984> | ---------------------- | <Sam, Purch, 1987> | ---------------------- | <Sam, Purch, 1987> | ---------------------- | <Ann, Acctg, 1986> | ----------------------

The relation might be a view, or an unnamed intermediate result of a projection in a relational expression, or it might actually be a stored relation with duplicates.

Now what is the result of WasAssigned(Sam,Purch,1987)?

The intermediate result after the selection, without removing duplicates:

---------------------- | <Sam, Purch, 1987> | ---------------------- | <Sam, Purch, 1987> | ----------------------

And the final result:

- | WasAssigned(Sam, Purch, 1987) = - = [< >,< >] = 2 | -

The WasAssigned function isn't exactly returning a truth value; it is returning the number of times that the tuple occurs in the relation. It is only when the relation is ideal, i.e., contains no duplicates, that the result must be 0 or 1.

For this general case, we will call functions such as the WasAssigned function *profiles*.
Their results are counters. When the underlying relation does not admit duplicates, the
profile is a predicate.

A family of functions

fi: x1,...,xj -> y1,...,yk

can be defined over a relation of degree n=j+k such that the x's and y's correspond to distinct columns. There are 2**n such functions, corresponding to all the subsets of the n columns which can be taken as arguments (ignoring permutations among the x's or among the y's).

Other similar functions might be defined whose arguments and results do not partition the columns, but we are not interested in those now.

A function call fi(c1,...,cj) returns a relation produced by:

1. Selecting on x1=c1 and ... and xj=cj.

2. Projecting on y1,...,yk.

(To be rigorous, we should always write these function calls as fi(<c1,...,cj>), but we sometimes elide the tuple brackets for readability.)

For the boundary case j=0 (k=n), the function takes no arguments. This function is the generator in the family, returning the entire relation as its result.

For the boundary case j=n (k=0), the arguments correspond to an entire tuple of the underlying relation. This function is the profile in the family, returning the number of times that a tuple occurs in the relation (possibly zero or infinity ). If the relation does not admit duplicates, then the profile is a predicate, returning 0 or 1 for all arguments.

A profile can be thought of as the *characteristic* function of a relation,
indicating whether - or how many times - a tuple occurs in the relation.

The constraint that a relation may not have duplicates can be expressed in terms of its profile:

p(x1,...,xn) < 2.

Note that the results of all the functions in such a family are relations. The generator returns the whole relation. The profile returns a shadow relation. Even a function call such as

AsgmtDate(<Sam,Purch>)

returns [<1987>], a relation containing a tuple containing the date as its only element; it doesn't simply return 1987.

We add a few more observations about function families...

Function families also relate to certain notions in entity-relationship models. The generator and profile in a family correspond to an undirected relationship, while the other functions in the family correspond to the various directed relationships that can be defined as going "from" some of the participants "to" the others.

When n=2, the two directed functions in the family are simple inverses of each other.

All the functions in a family are represented in the same underlying relation. In mapping between a functional model and a relational model, any function in the family can be used to correspond to the relation. If just one function needs to be so designated, then either the generator or the profile is a reasonable candidate.

It might be useful to substitute one family member for another during query transformations. A query to find y such that y=f(x) is simply an invocation of f (select on x, project on y); a query to find x such that f(x)=y is an invocation of the family member g such that g(y)=x (select on y, project on x).

The relational operators are closed on relations, i.e., when applied to relations they yield relations. Since every relation has a profile as its characteristic function, we can develop a parallel arithmetic formulation of the relational operators, and we can do so in a way that accounts for duplicates.

WasAssigned(x,y,z) is the characteristic profile for our Assignments relation.

Suppose we wanted to select on Emp=Sam, yielding a relation Q, whose profile is q(x,y,z). How can we relate q(x,y,z) to WasAssigned(x,y,z)?

According to our earlier notions of True and False, we could say that x=Sam is a numeric-valued expression, returning 1 if x has Sam as its value and 0 otherwise. We could then express the profile q in terms of multiplication:

q(x,y,z) ::= WasAssigned(x,y,z) * x=Sam.

What this says is that the number of occurrences of a tuple in Q will be the number of occurrences of that tuple in Assignments multiplied by the trueness of x=Sam for that tuple. In effect, the relation Q will contain the tuple <x,y,z> if and only if that tuple occurs in the Assignments relation and x=Sam. If it does occur in Q, it will occur the same number of times as in Assignments. That's exactly what we want for the selection operation.

We could use multiplication for any conjunction of selection criteria. If we wanted to produce Q by selecting on Emp=Sam and Dept=Sales, we could write

q(x,y,z) ::= WasAssigned(x,y,z) * x=Sam * y=Sales.

Selection criteria based on other comparisons can be handled in the same way, so long as we define the comparison operators to return 0 or 1 as their results. And we can easily accommodate comparisons among columns, such as x<y.

Most generally, selection can be defined in terms of an arbitrary selection expression S, such that the selection

R2 = SEL[S]R1

is defined by

p2(t) = p1(t) * S(t),

where pi is the profile of Ri, so long as S always returns 0 or 1 when applied to a tuple t. This could turn into a mechanism for injecting duplicates if we allowed S to return values greater than 1 [K1,K2].

We will do projection without eliminating duplicates.

Suppose relation R1 is defined on the columns x1,...,xn,y1,...,ym, and the relation R2 is given by the projection

R2 = PROJ[x1,...,xn]R1

Then the profile p2 of R2 is defined in terms of the profile p1 of R1 as

p2(x1,...,xn) ::= SUM[y1,...,ym] p1(x1,...,xn,y1,...,ym).

What does that all mean? It means that the number of times a tuple occurs in result is the sum of the number of times it occurred as part of any larger tuple in the original relation. If we wanted to project our Assignments relation on the first column, the result is a relation R2 having p2 as its profile:

R1: x1 y1 y2 R2: x1 ---------------------- --------- | <Sam, Sales, 1984> | | <Sam> | ---------------------- --------- | <Sam, Purch, 1987> | | <Sam> | ---------------------- => -------- | <Sam, Purch, 1987> | | <Sam> | ---------------------- --------- | <Ann, Acctg, 1986> | | <Ann> | ---------------------- ---------

If we want to know how many times <Sam> occurs in the result, i.e., the value of p2(<Sam>), we have to sum the number of occurrences of Sam in R1 with all possible values of departments and dates:

p1(<Sam,Sales,1984>) = 1 p1(<Sam,Purch,1987>) = 2 p1(<Sam,Sales,1983>) = 0 p1(<Sam,Sales,1982>) = 0 ... ------------------------------- p2(<Sam>) = 3

Let's check the formula for the boundary cases. When projecting on all columns, we have

p2(x1,...,xn) ::= SUM[]p1(x1,...,xn).

There is nothing to sum over, and we really have

p2(x1,...,xn) ::= p1(x1,...,xn),

which is what we wanted: the result of projecting a relation on all of its columns is just the original relation itself.

When projecting on no columns, we have

p2() ::= SUM[y1,...,ym] p1(y1,...,ym).

That is, we sum the number of occurrences of each tuple in the original relation, yielding - as we had hoped - the cardinality of the original relation. The value of p2() is a number corresponding to the number of tuples in the original relation, which is a shadow relation containing that many empty tuples, which is the projection on no columns. It's all consistent.

We introduce a unary arithmetic operator *delta *("distinct"), which
returns 0 if its argument is 0 and 1 otherwise.

If p(...) characterizes an arbitrary relation, then the corresponding relation with
duplicates removed is characterized by the profile *delta *p(...), which is a
predicate. If the original relation contains no duplicates, then *delta *p(...)=p(...).

A select operation that removes duplicates would be defined by

p2(t) =

delta(p1(t) * S(t)).

Projection with duplicates removed is defined by

p2(x1,...,xn) ::=

delta(SUM[y1,...,ym] p1(x1,...,xn,y1,...,ym)).

Cross-products are trivially defined by multiplication:

p3(x1,...,xn,y1,...,ym) ::= p1(x1,...,xn) * p2(y1,...,ym).

It simply means that the number of occurrences of <x1,...,xn,y1,...,ym> in the cross product R3 is the product of the occurrences of <x1,...,xn> in R1 and the occurrences of <y1,...,ym> in R2. This is correct for R1 and R2 containing duplicates as well.

Generalized joins can be expressed by combining our definitions for cross-product, select, and project. That can get tedious.

Equi-joins are easily characterized by

p3(x,y,z) ::= p1(x,y) * p2(x,z).

(x, y, and z can be sets of variables, e.g., for multi-column joins.)

Concatenation (union without duplicate elimination) is easily characterized in terms of addition:

p3(x,y) ::= p1(x,y) + p2(x,y).

True union is given by

p3(x,y) ::=

delta(p1(x,y) + p2(x,y)).

What if... we did something more radical. Let's see what happens if we say that a relation R and its profile p are in fact the same thing.

We said that the result of any function in a family is a relation. If the result of f(c) is in fact a profile, what could that mean? It means that f(c) could be applied to another object as its argument as f(c)(d), yielding a counter value. What could that mean?

Let's say that the result of f(c) is r. If this relation r is a profile, then r(d) returns a counter indicating how many times d occurs in the relation characterized by r. In effect, r(d) indicates how many times d occurs in the result of f(c); we can write that as f(c)(d).

We should be able to show that the number f(c)(d) is the same as the number of occurrences of <c,d> in the underlying relation R, which should also be the value of p(c,d). We won't do that exercise right now, and might even leave it for the reader, but we'll just take note of the result:

f(c)(d) = p(c,d).

Let's look at our example. If we evaluate Events(<1987>) we get a resulting relation that we will call r:

---------------- | <Sam, Purch> | Events(1987) = ---------------- = r | <Sam, Purch> | ----------------

The number of times <Sam,Purch> occurs in this relation is given by r(<Sam,Purch>), i.e.,

Events(1987)(<Sam,Purch>).

We see that this number is 2, which is the number of times that <Sam,Purch,1987> occurred in the Assignments relation (we are using the second version, with duplicates). Thus we can write

Events(<1987>)(<Sam,Purch>) = WasAssigned(<Sam,Purch,1987>) = 2.

We could use such a notation for specifying functions in a family:

f1(x)(y,z) = f7(x,y,z)

f2(y)(x,z) = f7(x,y,z)

f4(z)(x,y) = f7(x,y,z)

f3(x,y)(z) = f7(x,y,z)

f5(x,z)(y) = f7(x,y,z)

f6(y,z)(x) = f7(x,y,z)and even

f0()(x,y,z) = f7(x,y,z).

A sidelight: f0() = f7.

The constraint that a function have no duplicates in its results can be expressed in the form

f(x)(y) < 2,

i.e., the number of occurrences of any value of y in the result of f(x) must be less than 2. This turns out to be equivalent to not allowing duplicates in the underlying relation:

p(x,y) < 2.

Card(R), the cardinality of the underlying relation, can now be written as Card(p). That is, profiles have cardinalities, equivalent to the cardinalities of their underlying relations. Since f(c) yields a relation/profile, it is meaningful to write Card(f(c)) to express the cardinality of the result of f(c). It follows that the constraint that a function f be single-valued can be expressed as

Card(f(x)) < 2.

We summarize the two constraints we have developed for the function f:x->y :

No duplicates: f(x)(y) < 2.

Single-valued: Card(f(x)) < 2.With single-valuedness constraints, we can explore functional dependencies. Given a relation characterized by p(x,y), and a function defined by

f(x)(y) = p(x,y),

it turns out that the functional dependency x->y is equivalent to f being single-valued. That is,

Card(f(x)) < 2

expresses the functional dependency x->y. This should suggest some means of reasoning over functional dependencies via the laws of arithmetic.

We're not sure where else this experiment might lead. We are open to opportunities.

Almost as an afterthought: we do need to account for generators. If g is the generator function for the relation R, we had said that g()=R. Now it turns out that g()=p, i.e., the result of g() is the profile function itself. Curious.

Another afterthought: it turns out that, for consistency, the counters themselves need to be profiles. Applied to no arguments, they return themselves as results. A mathematician's delight.

## 7 CONCLUSIONS

Counters, together with function families, provide a basis for incorporating the relational model into a function/object model as used in Iris [Fi], as suggested in [K1,K2].

Having an arithmetic basis for relational operators has some interesting consequences.

The numeric values of profiles provides a formal means of accounting for duplicates. This extended relational model, admitting duplicates, is still theoretically sound, resting on the soundness of arithmetic itself. The traditional relational model is subsumed, since all the appropriate behaviors are preserved when there are no duplicates.

Results of complex relational expressions might be analyzable in terms of arithmetic calculations.

Valid query transforms may be expressible in terms of arithmetic laws, e.g., commutativity, associativity, distribution, etc. New transforms can be sought out in terms of equivalence of arithmetic expressions.

Predicates behave similarly to relational operators, being functions that return relations. They are strange, on the other hand, since their results have zero width, and are not obtained directly from column values.

There are no clues to implementation in all of this. The theory may be elegant, but how do you physically represent empty tuples and zero-width relations? That's left as an exercise for the reader.

## 8 REFERENCES

[Fi] Fishman, D.H. et al, "Iris: An Object-Oriented Database Management System", ACM Transactions on Office Information Systems, Volume 5 Number 1, January 1987.

[K1] Kent, W., "A Generalized Select Function", HPL-SAL-89-17, Hewlett-Packard Laboratories, Jan 3 1989.

[K2] Kent, W., "Profile Functions and Bag Theory", HPL-SAL-89-19, Hewlett-Packard Laboratories, Jan 10 1989. [pdf]