William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California
March 1993
> 1 INTRODUCTION . . . 2
> 2 MODEL-INDEPENDENT SEMANTICS . . . 3
>> 2.1 A Descriptive Model . . . 3
>> 2.2 Query Overview . . . 4
>> 2.3 The Search Domain . . . 5
>> 2.4 Generalized Selection . . . 6
>> 2.5 Result Generation . . . 7
>> 2.6 Summary . . . 8
>> 2.7 Secondary Characteristics (Post-Processing) .
. . 9
> 3 APPLICATIONS . . . 10
>> 3.1 Group-By as a Composition of Queries . . . 10
>> 3.2 Role of MIQP in Language/Model Specifications
and Standards . . . 15
>> 3.3 Analysis of Query Behaviors . . . 17
>> 3.4 Defining Relational Operators . . . 20
>>> 3.4.1 Project . . . 20
>>> 3.4.2 Select (Restrict) . . . 20
>>> 3.4.3 Joins . . . 20
>> 3.5 Query Transformations . . . 21
>> 3.6 Query Testing and Validation . . . 23
> 4 CONCLUSIONS . . . 23
> 5 ACKNOWLEDGMENTS . . . 24
> 6 REFERENCES . . . 24
> 7 ADDENDA . . . 24
> 8 TYPES AND SIGNATURES . . . 24
> 9 ORDERING . . . 26
> 10 MODEL-DEPENDENT REFINEMENTS AND ELABORATIONS . . .
27
>> 10.1 Aggregates . . . 27
>>> 10.1.1 Search Domains . . . 27
>>> 10.1.2 Extensions of Types . . . 27
>>> 10.1.3 Indexed Things . . . 28
>>> 10.1.4 Regular Collections . . . 28
>>> 10.1.5 Cross Products . . . 29
>>> 10.1.6 Type Constructors . . . 30
>>> 10.1.7 Usage in Grouping . . . 30
>>> 10.1.8 Usage in Result Generation . . . 31
>> 10.2 Expressions . . . 32
>>> 10.2.1 From Clause . . . 32
>>> 10.2.2 Where Clause . . . 32
>>> 10.2.3 Complex Grouping . . . 32
>>> 10.2.4 Results . . . 33
>>> 10.2.5 Logical Expressions (Predicates) .
. . 33
> 11 ORTHOGONALITY PRINCIPLES FOR LANGUAGE DESIGN,
DESCRIPTION, AND STANDARDS . . . 33
>> 11.1 Language Components . . . 33
>> 11.2 A Framework For Specification and
Standardization of Languages and Models . . . 34
>> 11.3 Why This Framework? . . . 35
> 12 APPENDIX: CASE STUDIES . . . 36
>> 12.1 SQL FROM Clause . . . 36
>> 12.2 Object-Oriented Query . . . 36
>> 12.3 Case Studies From Rafi . . . 37
>> 12.4 Case Studies From Ganski & Wong . . .
40
> 13 ADDED 3/12/93 . . . 42
>> 13.1 Negation . . . 42
>> 13.2 Conditionals . . . 42
>> 13.3 Quantification . . . 42
>> 13.4 Function Composition . . . 42
>> 13.5 Recursion . . . 42
"Query" means various things in the database literature. It is sometimes used in the sense of a query language (as in [1], for example), possibly referring to all data retrieval operations, or to all data manipulation operations, or to operations on sets or other aggregates in general. The term is sometimes used in the narrower sense of a query operation which selects things from a collection based on some criterion, often in a "select...from...where" syntactic form. We will use the term in this last sense: a query is a specific operation which generates a collection of things from some other collection. By "query semantics" we mean some formal specification of the results produced by a query operation.
Historically, relational database queries were based on a constructive paradigm. Relations are constructed from other relations via the relational algebra of join, select, and project. The SQL select-from-where query format is a syntactic sugaring firmly rooted in these constructive semantics. Thus the semantics of relational query operations are inextricably tied up with the relational data model, setting an expectation that each new form of data model must give rise to its own form of query operation. In particular, we are currently witnessing various attempts to define "object-oriented" query.
However, the semantics of queries, aggregates, and expressions can be treated as orthogonal language/model components, rendering the query operation itself largely independent of data model. The semantics of a query operator can be defined arithmetically, rather than in terms of a specific data model.
What queries essentially do is select things from collections on the basis of their properties. A query operator can be described as a form involving certain expressions, with virtually no restriction on the expressions except that some return integer values and some return some form of collection. Since expression, collection, and integer are rather universal concepts, such a query paradigm is largely model-independent. Even grouping (Group-by) can be defined in this general style. This foundation can then be refined to exploit additional functionalities provided in various specific models, object-oriented or otherwise.
In this approach, the semantics of queries can be specified and standardized independently of specific data model and computational features, which can be defined orthogonally. The full behavior of queries in any language or model is then described as a composition of query operation semantics with particular data model and computational features supported in the language.
The semantics of this query paradigm are purely external, without relying on any underlying query algebra or computational model other than arithmetic (as contrasted with, for example, [1] and [5]). Nothing is said about optimization (other than to suggest that query transformations might be discovered or validated by arithmetic analysis
[Section 3.4]). The paradigm defines a complete query result, i.e., it has no pipelined or lazy evaluation model. It assumes no side effects to the underlying collection during evaluation of the query operation.A Model-Independent Query Paradigm (MIQP) is proposed in Section 2. Section 3 then discusses various applications of the paradigm, and Section 4 then concludes with a summary and some recommendations.
A collection is anything which can be said to contain one or more occurrences of things. Thus a collection c is characterized by an Occurs function such that for any thing x (we avoid the term "object" for now), Occurs(x,c) returns a counter (non-negative integer) which is the number of occurrences of x in c. A collection does not have to contain duplicates. Sets are collections characterized by the constraint Occurs(x,c) =<1.
Which things occur in c? Those for which Occurs(x,c)>0. How many things are in c? The sum of the number of times each thing occurs in c. Thus the cardinality of a collection c is given by Card(c)=SUM[x]Occurs(x,c). How do we compute such a summation? The exact answer depends on the specific collection. In general, the variable x must at least range over every thing which occurs in c. It is sufficient for the summation domain to be a set X such that Occurs(x,c)>0 => x IN X, i.e., everything which occurs in c must occur in X. The cardinality of c is then the sum of the number of occurrences of each of these things in c. When the collection c is constrained to contain only instances of a certain type T, it suffices to sum over the extension of T.
A principal result of this paper is to develop arithmetic formulas defining the behavior of Occurs(x,c) for the collections produced by queries.
By expression we mean anything which can be evaluated with arguments to return results, subsuming such notions as function, procedure, operation, message, property, predicate, attribute (as in SQL dot notation), variable, etc. Thus any of these constructs can occur where an expression is called for.
We use the term filter to mean an expression which returns a counter (non-negative integer) for any possible argument. Logical predicates are a special case of filters. They can be treated arithmetically by using 1 and 0 (Boolean numbers) for True and False; corresponding computations can be defined to yield logically correct results [2][7].
The Occurs function, filters, and Boolean numbers are elements of a descriptive model used in this paper to carefully define terms and concepts you and I will use in describing other (target) models and languages. It is not assumed that these features are implemented in any real target model or language, as desirable as that might be. If they were supported, then we could be describing query in terms of other language features, rather than via a mechanism external to the language.
In the descriptive model, a collection c also serves as its own "profile function" [7], so that the notation c(x) means Occurs(x,c), giving the number of occurrences of x in c. This notation characterizes the membership of the collection c. Thus, this paper seeks to define an arithmetic formula for c(x) to characterize the collection produced by a query. Such a profile formula can be used in several ways:
With this notation we can define a set by c(x)=<1, and define the cardinality of a collection as Card(c)=SUM[x]c(x). Other elements of the descriptive model will be introduced as needed.
In the descriptive model, a query operation Q is a form containing three expressions d, w(x), and r(x), which will be described in subsequent sections. A query might be written in a syntax such as
SELECT r(x) FROM d x WHERE w(x).
A query in a target language should be rewritten in this form for subsequent analysis. The sequence of components, and the capitalized keywords, are illustrative only. Briefly,
The definition of the query operator imposes minimal requirements on the expressions d, w(x), and r(x):
It is up to the orthogonal, model-dependent portion of the language specifications to define what constructs are allowed in these expressions and how the expressions evaluate to the appropriate value.
The expressions may contain variables other than x. If we let v be the list of such variables, then the query can be denoted as Q(v), an expression which returns a collection when the variables v are bound. (Queries which don't return collections are described in
Section 2.7.) An unparameterized query can be denoted as Q(). For example, the querySELECT Name(x) FROM Employees x WHERE Salary(x)>v1 AND Age(x)>v2
is a parameterized query of the form Q(v1,v2) which can be evaluated once v1 and v2 are bound to specific values. In contrast, the query
SELECT Name(x) FROM Employees x WHERE Salary(x)>50000 AND Age(x)>30
is an unparameterized query of the form Q(), which can be evaluated without requiring further binding.
The result of a query can be characterized as follows:
The FROM d x clause provides the expression d defining the search domain cd. As mentioned, the definition of the query operation says nothing about the search domain other than that the expression d must yield a collection. In effect, the query will search over the elements of the search domain. The FROM clause also names the variable x which will be bound to things occurring in that collection.
For the query fragment
SELECT FROM Departments x,
the profile function cd(x) for the search domain can be written as Department(x), i.e., as a predicate which returns 1 if x is a department and 0 otherwise.
The notion of selection is generalized to allow injection of duplicates. In the query
SELECT FROM d x WHERE w(x),
the filter w must return a counter value for any x in cd; if w can't do anything else, then 0 would seem to be the natural default. The result of this query is a collection cw such that
cw(x) = cd(x) * w(x).
The number of occurrences of some x in cw is the number of its occurrences in cd multiplied by w(x). If x does not occur in cd then cd(x)=0 and x will not occur in cw. If w(x)=0, x will not occur in cw. This general form allows the query result to contain duplicates either because the search domain cd contained duplicates or because the filter w defines multiple occurrences, or both. Simple selection from a set occurs when the search domain cd is a set (cd(x)=<1) and the filter is a predicate, returning 1 or 0.
x is a bound variable with respect to the query as a whole. During the course of executing the query, the variable is bound once to each distinct thing occurring in cd, for which w(x) is evaluated. We could also say that x gets bound to everything in some universe, with cd(x)=0 for things not occurring in cd. In general, the collections cd and cw might be infinite.
For example (in a target language which supports the appropriate notions of types and functions), if Openings(x) gives the number of open positions in a department x, then the collection returned by
SELECT FROM Departments x WHERE Openings(x)
would contain n occurrences of a department which has n openings. This could serve as a roster of open positions containing one entry for each opening, identified by department, which could then be used in subsequent operations to fill data into the openings. This example makes sense when all the jobs in a given department are the same.
The profile (arithmetic specification) of the above sample query is
cw(x) = Department(x) * Openings(x).
A thing x will occur n times in cw if x is a department and x has n openings.
A more conventional query might arise if Hiring(x) returned 1 (True) if department x has any openings, and 0 (False) otherwise. The query
SELECT FROM Departments x WHERE Hiring(x)
returns the set of departments which have any openings. The profile of this query is
cw(x) = Department(x) * Hiring(x).
The filter w (the WHERE clause) is optional, defaulting to a constant 1, in which case cw=cd.
The results of a query are often not the things actually selected from the search domain cd. If cd contains tuples, the query result might contain a projection or restructuring of the tuples. Or the query result might contain properties of the things selected from cd.
The SELECT clause can specify a result-generating expression r, extending the query form to
SELECT r(x) FROM d x WHERE w(x).
This yields another collection cr. The collection cr contains an occurrence of r(x) for each occurrence of x in cw. This in itself could introduce duplicates in cr, when r(x1)=r(x2) for distinct x1, x2 in cw. For example, the query
SELECT Location(x) FROM Departments x WHERE Hiring(x)
will return one occurrence of a given location for each department at that location which is hiring. Each such department contributes one occurrence of its location.
In general, if z=r(x) for some x in cw, the resulting collection cr will contain an occurrence of z for each occurrence of every x in cw such that z=r(x). The number of occurrences of z in cr is therefore obtained by summing over all x's occurring in cw, accumulating those for which z=r(x). The contents of cr can thus be characterized by
cr(z) = SUM[x](cw(x) * z=r(x))
= SUM[x](cd(x) * w(x) * z=r(x)).
The value of the expression z=r(x) is 0 or 1. For a given x occurring in cw, if z¬=r(x), then nothing is contributed to the result. Otherwise, cw(x) occurrences of z are contributed to the result.
The profile of the above sample query is
cr(z) = SUM[x](Department(x) * Hiring(x) * z=Location(x)).
This expression defines the number of times a particular z will occur in the query result cr. For the summation, it suffices to let x range over all departments. The collection cr will contain a certain location z if x is a department, x is hiring, and z is the location of x. If two departments which are hiring had the same location, that location would show up twice in the summation.
To fully characterize the collection cr, it is sufficient to let the variable z run over the range of the function r, since that contains all possible things that could occur in cr. For the last example, this would be the set of all locations. More narrowly, z can run over the "domain image" of r, which in this case would be the set of locations belonging to any department.
The simplest result expression is just the variable x (denoting the identity function r(x)=x), yielding cr=cw.
A query is a form which takes up to three expressions as its basic components, which can be illustrated in the syntactic form
SELECT r(x) FROM d x WHERE w(x).
The sequence of components, and the capitalized keywords, are illustrative only. The query is evaluated as follows:
cw(x) = cd(x) * w(x).
cr(z) = SUM[x](cw(x) * z=r(x))
= SUM[x](cd(x) * w(x) * z=r(x)).
Duplicates can arise as follows:
Many features commonly associated with queries are not included in this basic query paradigm. Such features typically operate on the whole collection cr to do such things as count, sort, remove duplicates, or extract an element. These are treated in our descriptive model as general operations on collections, not necessarily unique to queries. In effect, they do "post-processing" on the query result.
For example, to model DISTINCT, we can define MakeSet as a function in the descriptive model which takes a collection as argument and returns a set. For any collection c, its behavior is defined as
MakeSet(c)(x) = [c(x)>0],
i.e.,
Occurs(x,MakeSet(c)) = [Occurs(x,c)>0].
A thing x will occur once in the set returned by MakeSet(c) if it occurs at least once in c, since the expression Occurs(x,c)>0 will then have the value 1.
With this definition, a target query of the form
SELECT DISTINCT r(x) FROM d x WHERE w(x)
can be expressed in the descriptive model as
MakeSet(SELECT r(x) FROM d x WHERE w(x)).
Such post-processing could be composite, e.g., to sort the results after removing duplicates.
Some post-processing operations yield something other than a collection. Operations which return the cardinality or extract an element fall into this category. Queries which include this kind of post-processing can be treated simply as expressions which return whatever the post-processing expression yields.
Post-processing operations like sorting or duplicate removal yield another collection cp. The profile of cp depends on the specific post-processing operations. For sorting, cp(z)=cr(z), since the cardinality of the contents is unchanged. For duplicate elimination, cp(z)=[cr(z)>0]. These may be composed over intermediate collections if, for example, duplicates are eliminated before or after sorting.
In
Section 3.1 we will show a model-independent definition of Group-by as a composition of two basic queries linked by a post-processing operation.Group-by can be described in terms of the facilities described above, still in model-independent fashion. This exercise suggests that it might also be possible to define other interesting query facilities in model-independent fashion.
Group-by essentially works in three stages. First there is a simple (inner) query. Then its results are partitioned into groups by a GROUP-BY clause. Finally, another query iterates over this collection of groups, with the HAVING clause serving as the WHERE clause for this second (outer) query.
We first extend the descriptive model with a Group operator which takes as arguments an unevaluated grouping expression g(x) and a collection c, returning a collection cg=Group(g(x),c). The Group operator partitions c into equivalence classes based on equal values of g. The collection cg is the collection of these equivalence classes (which we call groups), i.e., cg is a collection of collections. The Group operator also confers values of g onto the groups themselves, as will be described below.
A query with Group-by has the form
SELECT r(u) FROM d x WHERE w(x) GROUP-BY g(x) u HAVING h(u)
which is equivalent to
SELECT r(u)
FROM Group( g(x), [SELECT x FROM d x WHERE w(x)] ) u
WHERE h(u).
In effect, we start with an inner query
Qi ::= SELECT x FROM d x WHERE w(x).
Then we use Group as a post-processing operator to partition the result into a collection of groups
do ::= Group(g(x),Qi)
::= Group(g(x), [SELECT x FROM d x WHERE w(x)])
and then use this as the search domain for an outer query (using the variable u for the outer query to iterate over the groups)
Qo ::= SELECT r(u) FROM do u WHERE h(u)
::= SELECT r(u) FROM Group(g(x), [SELECT x FROM d x WHERE w(x)]) u WHERE h(u).
The HAVING clause provides the WHERE clause for the outer query Qo, whose search domain do is a collection of groups.
Let's continue to use cd, cw, and cr to label the collections for the inner query Qi as before. The corresponding collections for the outer query Qo will be labeled cdo, cwo, and cro. We have:
cro(z)
= SUM[u](cwo(u) * z=r(u))
= SUM[u](cdo(u) * h(u) * z=r(u)).
We know all we need to know except the profile for cdo. We have to define the behavior of the Group operator arithmetically. (This illustrates in general how the behavior of an operation needs to be defined arithmetically for use in our descriptive model.)
Let cg=Group(g(x),c).
As before, the descriptive model says little about g(x) other than that it must return a value for each x in c. Depending on other facilities in the target language/model, it could be a composite of various forms. For example, if c contains employees, then g(x)=<Salary(x),Age(x)> might be used to group together employees having the same salary and the same age, while g(x)=Salary(x)+Bonus(x) might be used to group together employees for whom the sum of salary and bonus is the same.
The variable x ranges over elements of c. The collection cg does not contain such elements; it contains groups (which in turn contain such elements). Hence we use a different variable u to range over the elements of cg. The correspondence is that if a group u occurs in cg, then a thing x which occurs in c might also occur in u.
The Group operation partitions c into a set of non-empty equivalence classes (i.e., the groups) based on equal values of g(x). The result is a collection (actually a set) cg whose members are groups u1...un, where n is the number of distinct values of g(x) occurring for any values of x in c.
Each group ui is a non-empty collection characterized by a value ki such g(x)=ki for every x in the group ui. Since ki is a characteristic property of the group ui, we can "overload" the expression g so that g(ui)=ki. In effect, values of g are conferred on the groups themselves. (We could formalize this by defining the Group operation to return a set of pairs <ui,ki>, associating each group with its grouping value. For simplicity, we avoid that formalization at the moment.) This is crucial to an understanding of Group-by.
For example, if c contains employees and g is the Salary function, then cg will contain groups of employees such that each employee in group ui earns salary si. There are as many groups ui as there are distinct salaries si earned by employees in c. The Salary function is now implicitly defined on the groups as well, with each group ui having its characteristic salary si, so that Salary(ui)=si=Salary(x) for any x in ui.
Since the expression g(x)=g(u) returns 1 if true and 0 if false, the membership of each group u can be characterized as
u(x) = c(x) * g(x)=g(u).
That is, a group of employees will contain all the occurrences of employees in c whose salary equals the group salary.
How do we know which such groups will occur in cg? In a sense, there is a "potential" group for each possible salary, i.e., there is a candidate group u corresponding to each value in the range of g(x). However, cg is a collection of non-empty groups ui which satisfy the description given above for u(x). Hence the membership of cg can be characterized in terms of such groups whose cardinality is greater than zero:
cg(u) = Card(u)>0
= [SUM[x]u(x)]>0
= [SUM[x] (c(x) * g(x)=g(u))]>0
That is, a potential group u will actually occur in cg if u is non-empty, i.e., if there is any employee x in c whose salary equals the salary associated with u.
Recall that if an expression e returns a counter, then the expression e>0 returns 0 if e=0 and 1 otherwise. If u occurs in cg then cg(u)=1, otherwise cg(u)=0. The collection cg is a set.
Thus, in general, for a collection defined by cg=Group(g(x),c), the profile is cg(u) = [SUM[x] (c(x) * g(x)=g(u))]>0. We were looking in particular for the profile of the collection cdo=Group(g(x),cr), which can now be written as
cdo(u) = [SUM[x] (cr(x) * g(x)=g(u))]>0.
Since we had cr(x) = cw(x) = cd(x)*w(x), further substitution yields
cdo(u) = [SUM[x] (cd(x) * w(x) * g(x)=g(u))]>0.
We can at last finish defining the profile function for the collection cro returned by the outer query Qo. We had gotten as far as
cro(z) = SUM[u](cdo(u) * h(u) * z=r(u)).
Substituting our new definition of cdo(u) finally yields
cro(z) = SUM[u]({[SUM[x] (cd(x) * w(x) * g(x)=g(u))]>0} * h(u) * z=r(u)).
Although such profile formulas may appear dense and arcane, they are little more than the laborious composition of summations and multiplications. They can readily be followed one step at a time, and they are easily computable so long as summation domains are appropriately specified.
Without HAVING, h(u)=1. If there is a HAVING clause, the expression h may involve any operations defined on the groups ui, such as the expression g as well as any collection operations provided in a particular target model, such as Count or Max.
For example, in a target model with the appropriate capability, the HAVING clause might take the form
Count(u)>10 AND Salary(u)>50000.
Thus, the query
SELECT Salary(u) FROM Employee x WHERE Age(x)>30
GROUP-BY Salary(x) u HAVING Count(u)>10 AND Salary(u)>50000
returns the salaries over 50000 which are earned by more than 10 employees over the age of 30. The equivalent composed form of this query is
SELECT Salary(u)
FROM Group(Salary(x), [SELECT x FROM Employee x WHERE Age(x)>30] ) u
WHERE Count(u)>10 AND Salary(u)>50000.
The example query will [Figure 1]
For the example query, the profile expression is
cro(z) =
SUM[u][ { [SUM[x] (Employee(x) * Age(x)>30 * Salary(x)=Salary(u))]>0 } * Count(u)>10 * Salary(u)>5000 * z=Salary(u) ].
In order to compute this sum it suffices to let x range over the set of employees, since Employee(x) will be 0 otherwise. Similarly, it suffices to let u range over bags such that if u(x)>0 then u(x)=Employee(x) (saying it in this general way allows for the case where Employee might be a bag, containing duplicates). More simply, it suffices to let u range over the bags obtaining by partitioning the search domain cd into equivalence classes based on equal values of g.
It should be noted that the result expression r(u) may include any operations on the groups ui, including operations which manipulate the contents of the groups. In the above example, the query result cro={53000, 55000} contains two elements, corresponding to the two groups u3 and u4. Suppose now that we have a function Names(u) which takes as argument a bag of tuples of the form of one of our groups, and returns a bag containing the names occurring in those tuples. Then the result of a query of the form
SELECT Names(u) FROM Employee x WHERE Age(x)>30
GROUP-BY Salary(x) u HAVING Count(u)>10 AND Salary(u)>50000
would still be a collection cro containing two elements, but these two elements would be bags of names:
cro = {[Paul,Mary,+ 9 others], [George,Jack,+ 10 others]}.
Query models have historically been associated with data models, which in turn are often reflected in data sublanguages embedded in host languages. The result is a chaotic overlap of specifications for query, data models, data sublanguages, and host languages. In the long run it would be useful to partition language features into orthogonally defined components which can then be used in various combinations. The present paper is an attempt to start by isolating query semantics from other features.
Although a full orthogonalization of language features is far beyond the scope of the present paper, we can suggest a plausible distinction between data model and computational features. General computational features include such aspects as the underlying computational paradigm (e.g., expression evaluation, resolution, or something else), composition of expressions, overloading, coercion, binding and scope of variables, recursion, synchronization, exceptions, special forms, flow control, etc.
Data-related features involve operations which construct, access, and determine equality of data things. These can be partitioned according to the kind of data involved. Thus there are operations for arithmetic, logic, string manipulation, data aggregates, etc.
Data models are essentially concerned with aggregate data structures, involving operations which construct them, access them, and determine their equality. These again can be partitioned according to the type of aggregate, e.g., lists, relations, arrays, various notions of "object", etc.
Some interesting issues arise in such partitioning. Union and difference of relations might not be an inherent part of the relational model: a relation is a relation whether or not the language supports a difference operator. The same with elementary data types: the semantics of relations aren't inherently different if they can or can't have dates or floating point numbers in them ("...a query language should treat data values as essentially uninterpreted objects" [1]).
In any case, however, specification of a query operator can be separated from specification of other language features.
Standards can certainly be orthogonally defined in this fashion. The core behavior of query can be standardized once in a generic way, in a manner applicable to a large variety of languages and models. Multiple standards can be defined for different sorts of data constructs, with each of these combined with the basic query standard. Multiple standards can also be defined for different sorts of computational expressiveness, which can again be combined with the other standards to arrive at the semantic component of a specific language standard. After that's been done, then one or more syntactic forms for expressing such semantics can be standardized.
MIQP illustrates a core query model whose semantics are independent of any particular data model or computational facilities. It defines the semantic behavior of queries in neutral terms, relying only on arithmetic. This level of specification can be adopted, agreed to, standardized by all languages/models claiming to support query. Further refinements can then be superimposed by standardizing details of data model and computational facilities. In the query form
SELECT r(x) FROM d x WHERE w(x),
separately standardized data model operations for constructing collections can be plugged into the specification of d for defining the search domain and also into the result generator r. Standardized data model operations for accessing collections can be plugged into all of the expressions. Similarly, general computational features can also be plugged into all of the expressions.
Such separation can organize the description of an orthogonally designed language, or it can organize the "complete" description of a separately defined query capability. For example, consider the expression d for specifying the search domain cd. In an orthogonally designed language, there would be a separate general specification of collection-valued expressions, with a simple statement that these may be used to define the search domain d. In the other case, there would be a special-case description of what sorts of expressions could be used to specify the search domain. The specification would not apply to collection-valued expressions usable elsewhere in the language.
We mention a few more examples of such orthogonality:
Logical (Boolean-valued) expressions typically occur in the filter expression w of a query, but they can also occur elsewhere, e.g., in if-then and while statements. The definition and behavior of the components of such logical expressions can be defined separately, including such things as treatment of comparisons, negation, nulls, etc. The only requirement is that the overall expression w(x) return a counter (or Boolean) value for each x in cd. How it does so does not affect inherent query semantics.
Such things as nested queries or unions of queries should not require special definition, if queries are treated the same as any other collection-valued expression.
What might make a query object oriented?
Those are all secondary characteristics, layered on top of the fundamental query paradigm. Different object models may have different capabilities and semantics for aggregate constructors and accessors. The query paradigm described here can accommodate any of them, hence is adaptable to various object models. The query paradigm will be standardized to the extent that such aggregate constructors and accessors are standardized.
Even identity of result objects is an orthogonal issue. Query is just one possible collection-valued operation. Others might include union (concatenation), intersection, difference, construction (e.g., {x,y} or [x,x]), and evaluation of collection-valued or set-valued functions like Children(Sam). The question of whether query execution creates new objects depends on whether execution of any of those operations creates a new object.
MIQP can help analyze and define the behavior of difficult queries. This involves two stages:
We illustrate this in terms of queries involving multi-valued attributes, disjunction, and duplicates
Suppose Sam has two-year-old twins Ed and Al, and three-year-old triplets Ted, Hal, and Sue.
The question "Which parents have a child named Ed or a child named Ted?" illustrates a profound difference between functional and relational semantics. In functional terms, we think of Sam as having a set of five children, and we include Sam in the answer - once - because he has children satisfying the selection criterion.
In tuple-oriented first-normal-form semantics, we don't think of Sam's children. We think of a set of parent-child pairs. The question is applied to each such pair, extracting the parent from the pair if the child is named Ed or Ted. Sam would thus show up in the answer twice (assuming duplicates are not removed).
This distinction shows up in the syntax, which in this case turns out to be more than sugaring. The relational form of the query would include terms of the form "Person.Child=Ed" and "Person.Child=Ted" (we're eliding the quotes around people's names). It is often claimed that the dot notation Person.Child is semantically equivalent to the functional notation Child(Person). This does not work for multi-valued attributes. If Child(Sam) was a function invocation, it would have one fixed value so long as there were no updates. Child(Sam)=Ed and Child(Sam)=Ted could not simultaneously be true unless Ed and Ted were the same thing.
If we introduce Children(Sam) as a function having a set of children as its value, then a term of the form "Children(Sam)=Ed" still makes little sense. Ed is not a set, and hence can't equal the value of Children(Sam). One expedient would be to reinterpret x=y - when one is a set and the other isn't - to be true if the non-set thing belongs to the set thing. But this approach raises further complications when we go beyond first normal form: how should we interpret x=y when one is a set and the other is a set of sets? A much cleaner approach is to provide a distinct operator x IN y to be true when x is an instance of y.
Due to first normal form, the Children function could not be modeled as a base relation whose extension is stored in terms of argument-result pairs. It can only be modeled as a query which returns a set of things based on a Parent-Child predicate which can be modeled as a base relation. The relationship between Sam and his child Ed can only be modeled as a pair (tuple) which is a thing x in its own right, such that Parent(x)=Sam and Child(x)=Ed. Note the odd construction here: Sam now appears to be the parent of a tuple rather than the parent of a child.
Clearly, due to syntactic idiosyncrasies within and among languages, the first essential step in analyzing a query is to express the query in an equivalent MIQP form. This in itself could be controversial, forcing clarification of aspects of the query's semantics.
Thus it could make a difference whether the query is expressed as
Q1: SELECT x
WHERE Ed IN Children(x) OR Ted IN Children(x);
or as
Q2: SELECT Parent(x)
WHERE Ed=Child(x) OR Ted=Child(x);
The unwritten FROM clause is quite significant. The first query iterates over people, the second over tuples.
The overall semantics depends both on how the query is expressed and on how the arithmetic of OR is defined. If the expression "x OR y" is defined to return 0 if x and y are both zero and 1 otherwise, then Q1 returns Sam once while Q2 returns him twice (there is one parent x satisfying the first WHERE clause, while there are two tuples x satisfying the second). On the other hand, both queries return the same result if we define "x OR y" simply as x+y (the filter expression in the first query would have the value 2).
Things get messier with duplicates. Suppose we wanted to know which parents have a two-year-old child or a three-year old child. Depending on how we define various things, we might get Sam back once, twice, three times, or five times. Consider the queries
Q3: SELECT x
WHERE 2 IN ChildrenAges(x) OR 3 IN ChildrenAges(x);Q4: SELECT Parent(x)
WHERE 2=ChildAge(x) OR 3=ChildAge(x);
In the relational (tuple-oriented) query Q4, we are now dealing with tuples containing a parent, a child, and the age of this child. The function ChildrenAges(x) returns the bag of ages of the children of x. It can be defined in functional form as
ChildrenAges(x) ::= SELECT Age(y) WHERE y IN Children(x);
or in tuple-oriented form as
ChildrenAges(x) ::= SELECT ChildAge(y) WHERE x=Parent(y);
In either case, ChildrenAges(Sam)=[2,2,3,3,3].
Before analyzing queries Q3 and Q4, we should observe that x IN y could be defined to return 1 if x occurs in y and 0 otherwise, or it could return the number of times that x occurs in y. Furthermore, the possible definitions of "x OR y" need to be expanded to deal with x>1 or y>1. In such cases, OR might be defined as their sum, their maximum, or possibly something else.
Based on such alternatives, the reader should be able to discover cases in which the queries Q3 and Q4 will return Sam one, two, three, or five times, and perhaps in other numbers as well.
Similar anomalies arise with conjunction. In a tuple calculus, it's hard to express the question "Which parents have a child named Ed and a child named Ted?". We essentially have to formulate it in terms of a cross-product of two copies of the original relation, so that we can seek one tuple in which Ed and Ted both occur.
Things are again messier with duplicates. Consider the question "Which parents have a two-year-old child and a three-year-old child?" If we form the cross-product as before, we actually wind up with six tuples in which Sam has a two-year-old child and a three-year-old child. Do we really want to see Sam six times in the answer? How would we rationalize that semantically?
What's the answer? How should these operators behave? MIQP does not answer such questions. MIQP frames the questions, and requires language and model definers to provide the answers. Once the questions are answered, MIQP clearly defines the semantics of the queries. Definers might choose to specify a particular behavior for operators like OR, AND, and IN, or they might choose to replace them with families of operators, each having a different behavior. Thus there might be two flavors of IN, one returning 0 or 1, the other returning the number of occurrences.
These anomalies do illustrate the perils of trying to express the semantics of queries in terms of manipulating underlying relations. It's much cleaner to define semantics in terms of functional operations with well-defined arithmetic characteristics.
The behavior of relational operators can be described arithmetically. For the following illustrations, we treat relations as collections of tuples; formal definition of tuples themselves is omitted in the present paper.
Let c be a relation (collection of tuples), and let x[i] denote the component of tuple x corresponding to attribute i. Let i1...in be any sequence of attributes (allowing repetition). The projection
cp = pi [i1...in]c
yields a collection cp of tuples of the form y=<x[i1...x[in]> whose membership profile is
cp(y) = SUM[x][c(x) * y[1]=x[i1] * ... * y[n]=x[in] ].
(We let y be indexed by the "attributes" 1...n.) Thus a tuple y will occur in cp if there is a tuple x in c such that y[1]=x[i1] AND ... AND y[n]=x[in]. There may be several distinct tuples x in c which contribute the same y. Each of these contributes c(x) occurrences of y to cp.
Selection is easy to characterize in terms of a selection expression S(x) which yields a Boolean number. The selection
cs = sigma [S(x)]c
yields a relation cs with the same attributes as c whose membership profile cs(x) is
cs(x) = c(x) * S(x).
A tuple x occurs n times in cs if x occurs n times in c and S(x) is true.
Let x1 and x2 range over tuples in relations c1 and c2, respectively, and let p(x1,x2) be an arbitrary join predicate.
Let INJ denote the inner join of c1 and c2. Its profile is given by
INJ(z) = (z=<x1,x2>) * p(x1,x2) * c1(x1) * c2(x2).
If z=<x1,x2>, and if p(x1,x2) is true, then the number of occurrences of z in INJ is the product of the number of times x1 occurs in c1 and x2 occurs in c2. Otherwise, z does not occur in INJ.
For outer joins, we first define "appendages" R and L relative to c1 and c2. Ø denotes null.
R(z) = (z=<x1,Ø>) * c1(x1)>0 * 0=SUM[x2][p(x1,x2) * c2(x2)].
L(z) = (z=<Ø,x2>) * c2(x2)>0 * 0=SUM[x1][p(x1,x2) * c1(x1)].
For R, if:
then z will occur once in R. Otherwise z will not occur. Similar analysis applies to L.
Let ROJ, LOJ, and SOJ denote the right, left, and symmetric outer joins of c1 and c2. These are respectively defined as the concatenation of INJ with R, with L, and with both. Concatenation is expressed as simple addition. Hence
ROJ(z) = INJ(z) + R(z),
LOJ(z) = INJ(z) + L(z),
SOJ(z) = INJ(z) + R(z) + L(z).
The correctness of query transformations can be proved via arithmetic analysis. Such analysis might also lead to the discovery of new transformations. To illustrate, we prove the correctness of the familiar transformation involving projection and selection.
Consider the projection
cp = pi [i1...in]c
The k-th element of a tuple y occurring in cp is given by y[k], and is equal to x[ik] for some x in c. If we then select from cp using a selection criterion on this k-th element, we have
cps = sigma [S(y[k])]cp = sigma [S(y[k])]pi [i1...in]c,
where S(y[k]) signifies a selection on y[k]. We are thus selecting on the k-th component of a projection. The membership profile of cp is
cp(y) = SUM[x][c(x) * y[1]=x[i1] * ... * y[n]=x[in] ].
The membership profile of cps is
cps(y) = cp(y) * S(y[k])
= [ SUM[x][ c(x) * y[1]=x[i1] * ... * y[n]=x[in] ] ] * S(y[k]).
Now let's reverse the order of the operations. Since y[k]=x[ik], we have S(y[k])=S(x[ik]), and an equivalent selection on c takes the form
cs = sigma [S(x[ik])]c.
Since cs has the same attributes as c, we can still write the projection as
csp = pi [i1...in]cs = pi [i1...in]sigma [S(x[ik])]c.
The membership profile of cs is
cs(x) = c(x) * S(x[ik]).
The membership profile of csp is
csp(y) = SUM[x]( cs(x) * y[1]=x[i1] * ... * y[n]=x[in] )
= SUM[x]( c(x) * S(x[ik]) * y[1]=x[i1] * ... * y[n]=x[in] ).
If we simply rearrange terms, and substitute S(y[k])=S(x[ik]), we see that
cps(y) = S(y[k]) * [ SUM[x]( c(x) * y[1]=x[i1] * ... * y[n]=x[in] ) ],
csp(y) = SUM[x]( S(y[k]) * c(x) * y[1]=x[i1] * ... * y[n]=x[in] ).
From the distributivity of multiplication over addition, we see that these are equal. QED.
MIQP provides an independently verifiable test of the correctness of implementations. It should be possible to set up an automatic computation scheme to check the correct results of test queries, for verifying the correctness of implementations. MIQP does represent an executable paradigm (of doubtful efficiency) so long as finite summation domains can be identified.
Language features can be grouped into data model, computation, and basic query. The categories do overlap: query and data model operations are part of the computational power, and queries make use of the data model and computational features. In an orthogonally designed language these features can be described separately; in other languages queries can be described as a special-case combination of the three aspects.
In any case, the external behavioral semantics of queries can be described arithmetically, regardless of the data model, by means of formulas for calculating the contents of collections produced by queries. Basic formulas describe basic semantics for any model. Relevant aspects of the data model and computational features can then also be expressed arithmetically to describe their effect on queries.
This paradigm inverts the traditional explanation process. Rather than trying to explain query semantics by describing what an underlying execution algorithm (implementation) yields, this paradigm sets an independent criterion which the execution algorithm is obliged to satisfy. This paradigm rises above the tuple or domain calculus as a vehicle for defining query semantics.
In some ways, this paradigm represents a way of "passing the buck" on the problem of defining query semantics. Once the behaviors of the constituent expressions of the query form are well defined, the behavior of the query is well defined. Most issues in query semantics are really issues in the behavior of the constituent expressions. Thus a question about the behavior of a query may be sidestepped by rephrasing it as a question about the behavior of one of the expressions. This refocuses attention on the appropriate aspect of the target language/model. The behavior of the query is well defined, once those other aspects of the language/model are defined.
Of course, the paradigm itself needs further testing and refinement. An open question: if we couldn't figure out how to express the behavior of a certain query in MIQP, should we then conclude that MIQP is deficient, or that the semantics of the query are ill-defined?
In any case, the following seems to be a natural approach to query standards:
Arithmetic specification of query semantics facilitates such orthogonal specification. They also provide an independently verifiable test of the correctness of implementations. It is possible to set up an automatic computation scheme to check the correct results of test queries, for verifying the correctness of implementations.
We close with a set of recommendations:
Many thanks to Rafi Ahmed, Jurgen Annevelink, Amelia Carlson, Surajit Chaudhuri, Mike Heytens, Stephanie Leichner and other colleagues for their valuable critiques of prior versions.
It would be useful to characterize the types and signatures of the collections and expressions involved in the query paradigm. We first need to extend the descriptive model with some relatively neutral type-related concepts and terms, sufficient to characterize the behavior of queries. These are used for descriptive purposes only, and don't have to be supported in the target model being described.
There is a set of things called "types". A thing x has an associated set of types given by Types(x). To say that "x has type t" or "x is of type t" or "x is an instance of t" means t IN Types(x), also written Types(x) CONTAINS t. We have x IN Instances(t) <=> Types(x) CONTAINS t. We sometimes write x IN t to mean x IN Instances(t). If t2 is a supertype of t1 then x IN t1=>x IN t2. To say that a variable has a type means that it may only be bound to instances of that type.
There is a type called Type. Types(x) CONTAINS t => Types(t) CONTAINS Type. We also assume that Bag is a type.
[At this point we switch terminology, using "bag" for "collection". In the context of a type system, it seems more natural to call the types Bag and BagType, rather than Collection and CollectionType. However, we do not imply that the collections are necessarily unordered.]
A type constructor is a function which returns a type. Type constructors can generally return an infinite set of possible types, hence the set of types is infinite. The type returned by a type constructor may be called a "constructed type". BagType is a particular type constructor, which takes one type as an argument. To say that x has type BagType(t) means that x is an instance of T, where T is the type returned by BagType(t). If a thing x has BagType(t) as its type, it means that x is a bag and each thing occurring in x has type t. Thus
Types(x) CONTAINS BagType(t) <=> x IN Bag AND [x(y)>0 => y IN t].
Bag is a supertype of BagType(t) for any type t. If there is a type T which is the supertype of all other types, then Bag is equivalent to BagType(T).
Set is a subtype of Bag. x IN Set => x(y)=<1. The type constructor SetType can be defined as
Types(x) CONTAINS SetType(t) <=> x IN Set AND [x(y)>0 => y IN t].
BagType(t) is a supertype of SetType(t) for any type t.
An expression e has signature e:tx->ty, meaning that e(x)=y => x IN tx AND y IN ty. Expressions are total, i.e., x IN tx=>e(x) IN ty. In the following, wherever we require a signature e:tx->ty, we can accept a signature e:tz->ty if tz is a supertype of tx.
In the descriptive model, a variable is much like an expression with no arguments. A variable v of type t effectively has the signature v:->t, and the expression f(v) can be treated as a syntactic shorthand for f(v()).
Counter is a type whose instances are the non-negative integers.
Types and signatures in a query are as follows:
We we assume the members of the search domain cd are instances of some arbitrary type td. Hence cd has type BagType(td), and the variable x has type td.
cd->cw: The filter w has signature w:td->Counter.
The bag cw has the same type as cd, i.e., BagType(td).
cw->cr: The result expression r has signature r:td->tr, where tr is any type.
The bag cr has the type BagType(tr). If the result expression is just the variable x, then tr=td. (The latter is the case when there is grouping.) cr is the query result when there is no grouping.
cr->cdo: The Group operator has signature Group:expression, BagType(td)->SetType(BagType(td)).
The bag cdo has the type SetType(BagType(td)) [a subtype of BagType(BagType(td))].
cdo->cwo: The filter h has signature h:BagType(td)->Counter.
The bag cwo has the same type as cdo, except it might have duplicates: BagType(BagType(td)).
cwo->cro: The result expression ro has signature ro:BagType(td)->tro, where tro is any type.
The bag cro has the type BagType(tro). If the result expression is simply the variable u, then tro=two.
It would be nice to define the basic query paradigm in such a way that the query result preserves any ordering defined on the search domain. However, it seems difficult to define this in a canonical way, independent of specific data structures. There are too many possibilities. For example, without duplicates being injected by the filter w, we might wish:
When w injects n>1 copies of x, we might additionally wish that
In the following sections we illustrate how model-specific or language-specific aspects of queries can be described in the descriptive model.
The search domain cd can take many useful forms. In relational queries, the search domain is a set of tuples, obtained as the (flattened) cross product of relations. In the OSQL style of object-oriented query [3][4], the search domain is a set of tuples obtained as the cross product of types.
Other more general forms are useful for query refinement. In a static form, the search domain is itself specified as a query. In a dynamic form, the search domain might simply be a variable holding the results of a previous query, perhaps modified. There are other uses as well for generalized expressions specifying the search domain.
Section 10.1.2 through Section 10.1.6 illustrate some constructs useful for specifying search domains. Then Section 10.1.7 and Section 10.1.8 illustrate the use of such constructs in grouping and result generation as well.Some query syntaxes allow the search domain to be specified as a type, so that
SELECT FROM T x WHERE ...
signifies
SELECT FROM Instances(T) x WHERE ...
i.e., cd can be described as Instances(T).
Queries in various models and languages often involve things like lists or tuples. Indexed things provide a generic way of describing such things in the descriptive model. An indexed thing x has an index set IndexSet(x). If i IN IndexSet(x), then Index(i,x) may return something, though it need not. If i NOT IN IndexSet(x), then Index(i,x) returns nothing, and might be treated as an error. The cardinality of the index set can be considered the length of the indexed thing. The index set might be an initial subset of the positive integers (or non-negative integers for 0-origin), making x a sequence or list. The elements of the index set might be called "attributes" or "accessors", making x a tuple. If the index set contains n-long sequences of integers, then the indexed thing is an n-dimensional array.
We use the notation x[i] for Index(i,x).
A generalized model of indexed thing construction is given by
x <- <i1:e1...in:en>,
which binds x to an indexed thing with index set i1...in such that x[ij]=ej. If omitted, the index set i1...in defaults to the integers 1...n, so that a sequence constructor takes the form
x <- <e1...en>.
This definition appears circular, since one has to provide the sequence constructor with a sequence of arguments. A non-circular definition arises by defining a sequence to be an operator which yields another sequence. Thus, if s is a sequence of length n, then s(e) returns a sequence of length n+1 whose n+1st element is the value of e. There is an empty sequence denoted <>. The expression <>(e1) returns a sequence of length 1, <>(e1)(e2) returns a sequence of length 2, and so on. The syntactic form <e1...en> is equivalent to <>(e1)...(en).
Note that the descriptive model does not specify the behavior of an indexed thing constructor when any of the ei is void (null). The constructor expression <e1...en> might itself be void (null), or it might yield an indexed thing whose i-th element is void (null). When describing a specific target model/language, it is essential to specify how the corresponding constructors behave in this situation.
The members of a regular collection all have the same finite index set, which then becomes a characteristic of the collection itself. (Though associated with the collection, it is only the index set for its members, not for the collection itself; c[i] is not meaningful.) A relation is a regular collection, usually subject to further restrictions as well.
If the search domain cd in a query is a regular collection having index set i...j (not necessarily integers), it is convenient to introduce variables xi...xj for referring to components of the members of cd, such that xi=x[i],...,xj=x[j]. So long as there is an understood ordering of the indexes (and there usually is, e.g., for relational tuples), then a query syntax of the form
SELECT FROM d xi...xj WHERE w(xi...xj)
associates the variable xk with the index k, being equivalent to
SELECT FROM d x WHERE w(xi...xj) AND xi=x[i] AND ... AND xj=x[j].
Since the collection cw obtains its members from cd, it will also be a regular collection characterized by the same index set.
For example, if cd is a collection of triplets (3-long sequences), then a query of the form
SELECT FROM d y,z,u WHERE w(y,z,u)
is equivalent to
SELECT FROM d x WHERE w(y,z,u) AND y=x[1] AND z=x[2] AND u=x[3].
The search domain cd might be defined by various sorts of constructors. Of particular interest is a cross product constructor Xprod(c1...cn) which constructs the cross product of a list of collections c1...cn. The result is a regular collection of sequences with index set 1...n. The result contains k occurrences of <x1,...,xn>, where k is the product of the number of occurrences of xi in ci:
Occurs(x,Xprod(c1...cn)) = Pi iOccurs(x[i],ci),
i.e.,
Xprod(c1...cn)(x) = Pi ici(x[i]).
The main reason behind this excursion is to consider a further special case, namely the construction of the search domain cd as the cross product of types. The syntactic form
SELECT FROM Ti xi ... Tj xj WHERE w(xi,...,xj)
can be described as a special notation for
SELECT FROM Xprod(Instances(Ti)...Instances(Tj)) x
WHERE w(xi,...,xj) AND xi=x[i] AND ... AND xj=x[j].
Note that for SQL queries, the initial cross product of relations is really a flattened cross product. Given two relations of the form {<x,y>} and {<z,w>}, the flattened cross product has the form {<x,y,z,w>}, whereas the true cross product has the form {<<x,y>,<z,w>>}.
Starting from the form which specifies the search domain as a type
[Section 10.1.2], i.e.,SELECT FROM T x WHERE ...,
the syntactic form
SELECT FROM Ti xi ... Tj xj WHERE w(xi,...,xj)
can also be explained in terms of type constructors, being equivalent to
SELECT FROM SequenceType(Ti...Tj) x WHERE ...
Here SequenceType(Ti...Tj) is a constructed type whose extension is given by Xprod(Instances(Ti)...Instances(Tj)).
Grouping by more than one property can be achieved by describing the grouping expression g as a sequence of properties:
g(x) ::= <g1(x),...,gm(x)>,
effectively creating a "vector" of functions and corresponding values. For example, grouping by salary and age could be obtained by
SELECT FROM d x WHERE w(x) GROUP-BY <Salary(x),Age(x)> u HAVING h(u).
Each group ui would contain employees having the same salary and age. The Salary and Age functions would be implicitly overloaded on the groups ui, so that Salary(ui)=Salary(x) and Age(ui)=Age(x) for any x in ui.
To tie this back into the basic model, the overloading of the function gj on the groups ui can be described as
gj(ui) ::= g(ui)[j],
i.e., the j-th element of g(ui). In the present example, letting Salary be g1 and Age be g2, the grouping function g is effectively defined as
g(x) ::= <Salary(x),Age(x)>
so that g(ui)=<si,ai>, where Salary(x)=si and Age(x)=ai for any x in ui. Then Salary(ui) is g(ui)[1], i.e., the first element of g(ui), while Age(ui) is the second element.
In this way, Salary and Age become defined for the collections c3 and c4, and can be used in the HAVING and SELECT (result) clauses.
If supported in the model, sequence constructors can be used in several ways in the SELECT clause to generate cr as a regular collection. If u is an element of c4, the result expression r can be specified as <r1(u)...rm(u)>, e.g.,
SELECT <Name(d), Manager(d)> FROM Departments d WHERE ...
If c4 is itself a regular collection (e.g., a collection of sequences or tuples), then projections or reorderings can be obtained by result expressions of the form <u[j]...u[i]>, or equivalently <uj...ui>, e.g.,
SELECT <p,d> FROM Departments d, Employees e, Projects p WHERE ...
These forms can be combined, as in
SELECT <Leader(p), Name(d)> FROM Departments d, Employees e, Projects p WHERE ...
In general, any collection-valued expression might be used to specify the search domain. It might be itself be a query, or a variable holding the results of a prior query, allowing various forms of query refinement. Any other expression, such as Children(Sam), could be used to bound the search domain.
In order to compute the actual profile for the results of a particular query, it is necessary that the profile of the search domain cd(x) be provided.
The filter expression w(x) in queries can take many forms, with subexpressions involving all sorts of values and structures, but the overriding consideration is that w(x) as a whole must evaluate to a counter value or Boolean number for any x that may occur in the search domain.
As illustrated in earlier examples, logical predicates are modeled as returning 1 for true and 0 for false. In the absence of duplicates, conjunction (AND) is simple multiplication or maximum, disjunction (OR) is minimum, and negation is a simple unary operator which inverts 1 and 0. (For most queries, the search domain bounds the domain in such a way that negation becomes a relative complement.) Plausible arithmetic treatments of logical operators in the presence of duplicates are examined in [2],[7].
Three-valued logics for dealing with nulls may use some form of "unknown" as an intermediate result for subexpressions, but in the end there are only two choices: the thing is either selected or not. Hence the overall result must still be a Boolean value.
Nested queries (subqueries) are collection-valued expressions (unless post-processing yields something else). As such, they may only occur within other expressions which ultimately evaluate to counters or Booleans, such as Occurs(x,Q) or Q1=Q2 or NonEmpty(Q) or Count(Q), etc. Techniques for unnesting nested queries may provide useful computational algorithms, but they need not be used to define the semantics of the result.
Quantification can be defined as general operations on collections such that ForAll(c,e) and Exists(c,e) return 1 if e(x) is true for all or some elements x of c. (These may have to be treated as special forms taking e as a quoted expression.)
Duplicates arising from nested subqueries should be explainable in terms of a counter-valued subexpression of the form Occurs(z, SELECT ...) occurring within the WHERE clause.
Other computational features such as arbitrary functions, composition of functions, etc., are useful in queries, but do not inherently change the query semantics. The filter still has to evaluate to a counter or Boolean value.
Recursion doesn't arise until we use queries to define named expressions, at which point it generalizes to the problem of named expressions of any kind (as in [1]). Thus the problem of recursion is not specific to the query operator.
It is generally useful to allow arbitrary operations in Group-by, e.g., grouping by the sum of some properties. Thus the grouping expression could be an arbitrary construction of subexpressions.
Arbitrary computations are also useful here, including composition of operations, operations on multiple components such as <u[1]+u[2], u[3]>, and generalized aggregate construction.
Logical expressions (predicates) occur in the in the WHERE and HAVING clauses of a query, but also in other language constructs as well, such as conditionals and iteration control. They might also occur in assignment statements to Boolean variables.
Thus the semantics of logical expressions needs to be defined in a larger context than queries. How they are defined does not affect the fundamental semantics of queries, so long as they always return a Boolean value for all possible arguments.
The same is true for identity and equality operators. Once defined in the target model or language, they can simply be incorporated into query expressions. Identity and equality are model issues, not query issues.
Example
The search domain for an SQL query is a relation, or a cross-product of relations, which is again a relation. An orthogonal specification would simply say that:
Whether or not an SQL query allows the search domain to be specified as the union of relations, or as another query, is a specific idiosyncrasy of the SQL language. It does not affect the inherent behavior of the query, nor does it affect the definition of the relational data model.
Example
Union of SQL SELECT statements isn't really a query issue. Each SELECT statement has its own well-defined behavior, yielding a relation.
Union of relations is an orthogonal language feature. It could even be argued that it is purely a computational aspect; relations are relations whether or not the language supports unions of them.
In any case, union of SELECT statements can be described simply as the union of relations, not as a unique characteristic of queries.
Example
DISTINCT isn't a particularly query-specific feature. The ability to remove duplicates, i.e., convert a bag into a set, is another feature common to many data models which can be applied to the results of a query.
Example
Elementary data types are not a query issue. It could be argued that they are not even a data model issue. A query is a query - and a relation is a relation - whether or not relations may contain floating point numbers, or dates.
Elementary data types seem to belong more to the general computational aspect of a language.
Example
The essential filtering behavior of query depends on the where-clause evaluating to true or false (or to a counter value for a more general bag query model) for each element of the search domain. Potentially any expression in the language which evaluates to a Boolean (or counter) value could serve as the where clause. These expressions could be composed of any operations supported in the language.
Similar things can be said about expressions involved in restructuring.
A basic set of operations is provided by the data model, defining how values in data constructs may be retrieved. Most other aspects of the expression
Thus, none of the following is specifically a query issue:
The treatment of negation and nulls, for example, is a general characteristic of Boolean valued expressions, which might occur in queries, if-then, while, and other statements involving truth values.
Nested queries should be describable in the same way as any subexpression which returns a collection. Such things must ultimately be involved in operations which eventually contribute to Boolean-(or counter-)valued results, e.g., Count(Q), Max(Q), Occurs(x,Q), Q1=Q2, etc.
Example
Group-by and Having can be described as generic query features, independent of data model and computational model.
Listed below is a plausible framework of orthogonal components for the specification and standardization of languages and models. It's not particularly original, and some details may be arguable. Our main purpose is to illustrate how queries can be isolated as an orthogonal component.
Independent of specific operations or data constructs. Includes such characteristics as
Types, constructors, accessors, equality.
Types, constructors, accessors, equality.
(It could be argued that some of those should be included under the heading for elementary data items. It doesn't matter.)
(Could be included under specific operators.)
If you're designing a whole language, this leads to a nice orthogonal design.
For partial language (e.g., a data sublanguage), this helps identify
The framework isolates independently standardizable areas.
The FROM clause in SQL is based on the cross product of relations. If orthogonally designed, such a cross product would be a general language feature, allowing something like
x <- R1xR2.
Then the syntax
SELECT FROM R1, R2 WHERE...
could be explained in terms of native language features, rather than with a specialized explanation for query. Furthermore, it might then be possible to also use other sorts of expressions for the search domain. Similarly, the DISTINCT keyword could be described in terms of a general operation to convert collections to sets if it were available in the language, and union of queries could simply be defined as union of collection-valued expressions if such unions were supported in the language.
What might make a query object oriented?
Those are all secondary characteristics, layered on top of the fundamental query paradigm. Different object models may have different capabilities and semantics for aggregate constructors and accessors. The query paradigm described here can accommodate any of them, hence is adaptable to various object models. The query paradigm will be standardized to the extent that such aggregate constructors and accessors are standardized.
Even identity of result objects is an orthogonal issue. Query is just one possible collection-valued operation. Others might include union (concatenation), intersection, difference, construction (e.g., {x,y} or [x,x]), evaluation of collection-valued or set-valued functions like Children(Sam). Does each execution of one of those create a new object?
One position...
Collections are extensional, their identity being determined by their contents: y and z refer to the same collection if and only if y(x)=z(x), i.e., Occurs(x,y) = Occurs(x,z), for all x.
Queries return references to collections without creating anything new, in the same way that arithmetic operations return references to existing numbers, not creating new ones. The effect of new objects can be obtained by assigned query results to be properties of distinct objects, e.g.,
Roster(x) := SELECT ...
Roster(y) := SELECT ...
Identity issues also show up in equality comparisons within expressions. They are also hidden in set constructors which eliminate duplicates, since they have to determine if x=y in {x,y}.
Consider the following relational schema:
S (s# *, name, gpa) /* student
C (c# *, cname, units, dept) /* course
E (c# *, s# *) /* enrollment
Q1. Find names of students who are enrolled in all courses.
In SQL:
SELECT name FROM S WHERE NOT EXISTS (
SELECT * FROM C WHERE NOT EXISTS (
SELECT * FROM E WHERE E.s# = S.s# AND E.c# = C.c#));
In canonical MIQP form:
Q1(): SELECT name(s) FROM S s WHERE NOT EXISTS (
Q1a(s): SELECT c FROM C c WHERE NOT EXISTS (
Q1b(s,c): SELECT e FROM E e WHERE s#(e)=s#(s) AND c#(e)=c#(c) ));
Observe: EXISTS Q ::= 0<Card(Q), and NOT EXISTS Q ::= 0=Card(Q).
Working from the inside out:
/* Q1b(s,c) selects all the enrollments for a given student s and course c */
Q1b(s,c): SELECT e FROM E e WHERE s#(e)=s#(s) AND c#(e)=c#(c); /* the sub-query */
Q1b(s,c)(e) = E(e) * [s#(e)=s#(s)] * [c#(e)=c#(c)]; /* the profile of the sub-query */
Card(Q1b(s,c)) = SUM[e] (E(e) * [s#(e)=s#(s)] * [c#(e)=c#(c)]); /* the cardinality of the sub-query */
/* Q1a(s) selects all the courses c such that for a given student s the number of enrollments of s in c is 0 */
Q1a(s): SELECT c FROM C c WHERE NOT EXISTS Q1b(s,c); /* the sub-query */
Q1a(s)(c) = C(c) * [0=Card(Q1b(s,c))] /* the profile of the sub-query */
= C(c) * [0=SUM[e] (E(e) * [s#(e)=s#(s)] * [c#(e)=c#(c)])];Card(Q1a(s)) = SUM[c] (C(c) * 0=Card(Q1b(s,c))) /* the cardinality of the sub-query *
= SUM[c] (C(c) * [0=SUM[e] (E(e) * [s#(e)=s#(s)] * [c#(e)=c#(c)])]);/* Q1() selects the name of a student s if the number of courses in which s has 0 enrollments is 0 */
Q1(): SELECT name(s) FROM S s WHERE NOT EXISTS Q1a(s);/* the query */
Q1()(n) = SUM[s] (S(s) * [0=Card(Q1a(s))] * n=name(s)) /* the profile of the query */
= SUM[s] (S(s) * [0=SUM[c] (C(c) * [0=SUM[e] (E(e) * [s#(e)=s#(s)] * [c#(e)=c#(c)])])] * n=name(s))/* that profile counts all students having the same name; the result could have duplicate names */
Q2. Get s# of students who are enrolled in at least one course being taken by at least one student who is enrolled in at least one course offered by physics department.
In SQL:
SELECT UNIQUE s# FROM E WHERE c# IN (
SELECT c# FROM E WHERE s# IN (
SELECT s# FROM E WHERE c# IN (
SELECT c# FROM C WHERE dept = "Physics")));
In canonical MIQP form:
Q2: MakeSet (
Q2a: SELECT s#(e1) FROM E e1 WHERE c#(e1) IN (
Q2b: SELECT c#(e2) FROM E e2 WHERE s#(e2) IN (
Q2c: SELECT s#(e3) FROM E e3 WHERE c#(e3) IN (
Q2d: SELECT c#(c) FROM C c WHERE dept(c) = "Physics")))));
Observe: x IN Q ::= 0<Q(x).
Working from the inside out:
/* Q2d() selects the c#'s of all the courses in the Physics department */
Q2d(): SELECT c#(c) FROM C c WHERE dept(c) = "Physics"; /* the sub-query */
Q2d()(c2d) = SUM[c](C(c) * [dept(c)="Physics"] * [c2d=c#(c)]); /* the profile of the sub-query */
/* that profile counts all the courses having the same number *//* Q2c() selects the s#'s of students involved in enrollments in courses in the Physics department, i.e., all students enrolled in courses in the Physics department*/
Q2c(): SELECT s#(e3) FROM E e3 WHERE c#(e3) IN Q2d(); /* the sub-query */
Q2c()(s2c) = SUM[e3] (E(e3) * [0<Q2d()(c#(e3))] * [s2c=s#(e3)]) /* the profile of the sub-query */
= SUM[e3] (E(e3) * [0<SUM[c](C(c) * [dept(c)="Physics"] * [c#(e3)=c#(c)])] * [s2c=s#(e3)]);/* Q2b() selects the c#'s of all the courses involved in enrollments involving students involved in enrollments in courses in the Physics department, i.e., all courses taken by students enrolled in courses in the Physics department*/
Q2b(): SELECT c#(e2) FROM E e2 WHERE s#(e2) IN Q2c(); /* the sub-query */
Q2b()(c2b) = SUM[e2](E(e2) * [0<Q2c()(s#(e2))] * [c2b=c#(e2)]) /* the profile of the sub-query */
= SUM[e2] (E(e2) * 0<SUM[e3] (E(e3) * 0<SUM[c](C(c) * [dept(c)="Physics"] * [c#(e3)=c#(c)]) * [s#(e2)=s#(e3)]) * [c2b=c#(e2)]);/* Q2a() selects the s#'s of students involved in enrollments involving courses returned by query Q2b() */
Q2a(): SELECT s#(e1) FROM E e1 WHERE c#(e1) IN Q2b(); /* the sub-query */
Q2a()(s2a) = SUM[e1] (E(e1) * [0<Q2b()(c#(e1))] * [s2a=s#(e2)]); /* the profile of the sub-query */
= SUM[e1] (E(e1) * 0<SUM[e2] (E(e2) * 0<SUM[e3] (E(e3) * 0<SUM[c](C(c) * [dept(c)="Physics]" * [c#(e3)=c#(c)]) * [s#(e2)=s#(e3)]) * [c#(e1)=c#(e2)]) * [s2a=s#(e2)]);/* Q2() eliminates duplicates from Q2a() */
Q2(): MakeSet(Q2a()); /* the query*/
Q2()(x) = 0<Q2a(x); /* the profile of the query */
= 0<SUM[e1] (E(e1) * 0<SUM[e2] (E(e2) * 0<SUM[e3] (E(e3) * 0<SUM[c](C(c) * [dept(c)="Physics"] * [c#(e3)=c#(c)]) * [s#(e2)=s#(e3)]) * [c#(e1)=c#(e2)]) * [x=s#(e2)]);
Q3. Find name of students enrolled in courses not being taken by students whose gpa's are lower than 3.0.
In SQL:
SELECT name FROM S, E WHERE S.s# = E.s# AND E.c# NOT IN (
SELECT E.c# FROM S, E WHERE S.s# = E.s# AND S.gpa < 3.0);
In canonical MIQP form:
Q3(): SELECT name(s1) FROM XProd(S, E) <s1,e1> WHERE s#(s1)=s#(e1) and c#(e1) NOT IN (
Q3a: SELECT c#(e2) FROM XProd(S, E) <s2,e2> WHERE s#(s2)=s#(e2) AND gpa(s2) < 3.0);
Observe: XProd(X,Y)(<x,y>) ::= X(x)*Y(y) (the number of occurrences of <x,y> in the cross product of X and Y is the product of the occurrences of x in X and y in Y). This can also be expressed as
XProd(X,Y)(x) ::= X(x[1])*Y(x[2]).
Thus:
/* Q3a() selects c#'s of courses involved in enrollments involving students who have gpa<3.0 */
Q3a(): SELECT c#(e2) FROM XProd(S, E) <s2,e2> WHERE s#(s2)=s#(e2) AND gpa(s2) < 3.0); /* the sub-query */
Q3a()(cn) = SUM[<s2,e2>](S(s2)*E(e2) * [s#(s2)=s#(e2)] * [gpa(s2) < 3.0] * [cn=c#(e2)] ); /* profile of the sub-query */
/* Q3() selects names of students enrolled in courses not returned by Q3a() */
Q3(): SELECT name(s1) FROM XProd(S, E) <s1,e1> WHERE s#(s1)=s#(e1) and c#(e1) NOT IN Q3a() */
Q3(n) = SUM[<s1,e1>](S(s1)*E(e1) * [s#(s1)=s#(e1)] * [0=Q3a()(c#(e1)] * [n=name(s1)] )
= SUM[<s1,e1>](S(s1)*E(e1) * [s#(s1)=s#(e1)] * [0=SUM[<s2,e2>](S(s2) * E(e2) * [s#(s2)=s#(e2)] * [gpa(s2) < 3.0] * [c#(e1)=c#(e2)] )] * [n=name(s1)]);
From G&W 3.1 [5]:
Query KQ1:
SELECT Ri.Ck FROM Ri,Rj WHERE Ri.Ch=Rj.Cm;
In MIQP:
SELECT Ck(xi) FROM XProd(Ri,Rj) <xi,xj> WHERE Ch(xi)=Cm(xj);
KQ1()(x) = SUM[<xi,xj>][Ri(xi)*Rj(xj) * Ch(xi)=Cm(xj) * x=Ck(xi)];
Query KQ2:
SELECT Ri.Ck FROM Ri WHERE Ri.Ch IS IN (SELECT Rj.Cm FROM Rj);
In MIQP:
KQ2(): SELECT Ck(xi) FROM Ri xi WHERE Ch(xi) IN
KQ2a: SELECT Cm(xj) FROM Rj(xj);KQ2a()(y) = SUM[xj](Rj(xj) * [y=Cm(xj)]);
KQ2()(x) = SUM[xi](Ri(xi) * [0<KQ2a()(Ch(xi))] * [x=Ck(xi)])
= SUM[xi](Ri(xi) * [0<SUM[xj](Rj(xj) * [Ch(xi)=Cm(xj)])] * [x=Ck(xi)]);
Compare:
KQ1()(x) = SUM[<xi,xj>](Ri(xi)*Rj(xj) * [Ch(xi)=Cm(xj)] * [x=Ck(xi)]);
KQ2()(x) = SUM[xi](Ri(xi) * [0<SUM[xj](Rj(xj) * [Ch(xi)=Cm(xj)])] * [x=Ck(xi)]);
Suppose there is just one pair <xi,xj>, with Ri(xi)=1, Rj(xj)=2, Ch(xi)=Cm(xj), x=Ck(xi). Then:
KQ1()(x) = SUM[<xi,xj>](1*2 * 1 * 1) = 2;
KQ2()(x) = SUM[xi](1 * [0<SUM[xj](2 * 1)] * 1) = 1;
Hence the two queries are not equivalent if Rj contains duplicates. There may be other cases as well.
Note that this analysis was done without any reference to join techniques.
From G&W 3.2:
Query KQ3:
SELECT Ri.Ck FROM Ri WHERE Ri.Ch = (
SELECT Agg(Rj.Cm) FROM Rj WHERE Rj.Cn=Ri.Cp);
In MIQP:
KQ3(): SELECT Ck(xi) FROM Ri xi WHERE Ch(xi) = Agg(
KQ3a(xi): SELECT Cm(xj) FROM Rj xj WHERE Cn(xj)=Cp(xi));KQ3a(xi): SELECT Cm(xj) FROM Rj xj WHERE Cn(xj)=Cp(xi);
KQ3a(xi)(x) = SUM[xj](Rj(xj) * [Cn(xj)=Cp(xi)] * [x=Cm(xj)]);
KQ3(): SELECT Ck(xi) FROM Ri xi WHERE Ch(xi)=Agg(KQ3a(xi));
KQ3()(y) = SUM[xi](Ri(xi) * [Ch(xi)=Agg(KQ3a(xi))] * [y=Ck(xi)]);
[What do we need to know about the behavior of Agg?]
Query KQ4:
SELECT Ri.Ck FROM Ri WHERE Ri.Ch = (
SELECT Rt.C2 FROM Rt WHERE Rt.C1=Ri.Cp);Rt(C1,C2) = (SELECT Rj.Cn, Agg(Rj.Cm) FROM Rj GROUP BY Rj.Cn);
[MAY BE CONTINUED. THIS IS BIZARRE SQL!!!!]
The negation of p(x) is given by the expression p(x)=0.
This returns 1 if p(x) is 0, and 0 otherwise. Double negation always returns 0 or 1, and not the value of p(x) if p(x)>1.
If f(x) then g(x) else h(x)
can be expressed as
(f(x)¬=0)*g(x)+(f(x)=0)*h(x).
EXISTS x IN X p(x) can be expressed as [SUM[x] IN Xp(x)]>0. This is computable when X is finite.
FORALL x IN X p(x) can be expressed via the usual double use of negation, i.e., [SUM[x] IN X(p(x)=0)]=0.
PUT THIS WITH AN EARLIER DEFINITION OF IN : If f is not bag- or set-valued, we can take y IN f(x) to be 1 when y=f(x) and 0 otherwise, considering this case to be isomorphic with {y}=f(x). Hence we use y IN f(x) as the general case. [We don't deal with g being bag-valued. Should we?]
y IN f(g(x)) is equivalent to EXISTS z IN Z|z=g(x) AND y IN f(z). Here Z is the intersection of the result type of g and the argument type of f. For function composition, these would normally all be the same.
Arithmetically, this becomes
[y IN f(g(x))] = [SUM[z IN Z](z=g(x))*(y IN f(z))]>0.
Recursion in general should be manageable in terms of recursive arithmetic functions. A simple form arises when Q(x) has the form
Q(x) = if f(x) then g(x) else Q(h(x)),
where f, g, and h are not recursive. The arithmetic form of if-then yields
Q(x) = (f(x)¬=0)*g(x) + (f(x)=0)*Q(h(x)).
This is computable for any x such that h*(x)=0.
Expanding the composed functions yields
[TO BE REVISED] Q(x) = (f(x)¬=0)*g(x) + (f(x)=0)*Q(h(x)).
Let's try the good old ancestor problem. First let's explore a very simple form in which Parent is a single-valued function:
Ancestors(y) = {Parent(y)} UNION Ancestors(Parent(y)).
(This doesn't even involve a query!)
The corresponding profile might be given by [CAN THIS BE FORMALIZED, NOT JUST INTUITED?]
A(y)(x) = x IN {Parent(y)} + A(y)(Parent(x))
A(y)(x) = (x=Parent(y)) + SUM[z IN Person][(z=Parent(x))*A(y)(z)].
TO BE CONTINUED!!!