William Kent

Database Technology Department

Hewlett-Packard Laboratories

Palo Alto, California

Feb 1994

**CONTENTS:**

> 1 INTRODUCTION

>> 1.1 Terminology

>> 1.2 Overview

> 2 THE COMPUTATIONAL CONTEXT

>> 2.1 Interfaces to a Black Box

>> 2.2 Symbols

> 3 NON-MATERIALIZED AND MATERIALIZED TYPES

> 4 MATERIALIZED TYPES FOR NUMBERS

> 5 IDEAL AND EFFECTIVE TYPES

>> 5.1 Ideal Non-Materialized Numeric Types

>> 5.2 Ideal Materialized Numeric Types

>> 5.3 Effective Materialized Numeric Types

>> 5.4 Effective Non-Materialized Numeric Types

> 6 DIMENSIONAL STUFF

>> 6.1 Relevance of Non-Materialized and Materialized Types

>> 6.2 Use and Mention of Designators

>> 6.3 An Interesting Analogy

>>> 6.3.1 Single Designation

>>> 6.3.2 Multiple Designations

>>> 6.3.3 Conversions

>> 6.4 Another Analogy

This is a preliminary draft of work in progress. Please make appropriate allowances. Any sort of feedback is welcome.

**1 INTRODUCTION**

Most of the time we gloss over a lot of fine details, relegating them to the "obvious assumptions" we can hand-wave around. Attempts to probe into such detail are annoying, academic, philosophical, pointless. We don't want to be bothered with use vs. mention, or the distinction between Bill, a picture of Bill, a picture of the appearance of Bill, ...

Every once in a while, though, our problems can be traced to ambiguities and overlooked assumptions in this "noise". This may be happening in connection with measured quantities, a topic which seems to be getting tangled up in various aspects of concepts, representation, precision, etc. This note is a recap of elementary first principles, to try to get a clearer picture of where the problems really creep in.

Right now this is just thinking out loud, trying to use the writing process to organize my thoughts. The development and conclusions are still in formative stages.

**1.1 Terminology**

We have the usual sorts of terminology problems, and global changes can be made if necessary.

Alternatives to "non-materialized" and "materialized" include "non-symbolic" and "symbolic", "non-lexical" and "lexical", and perhaps others. I've also considered "abstract" instead of "non-materialized", but the term "abstract type" has already been appropriated for another use (less appropriate, in my opinion).

**1.2 Overview**

We're sometimes careful to distinguish between a symbol and the thing it represents, but we sometimes get them confused. We know the difference between a social security number and a person, and between an oid and a person. But we don't make the same distinction between the string "10" and the abstract number it represents (we call them both numbers), or between the expression "6 feet" and the distance it represents (we call them both distances). We somehow try to make the notion of "data type" cover both kinds of concepts.

It's an old story. The number of fingers I have is not the same thing as the two-character sequence "10". The number of fingers I have can be denoted by a variety of symbols such as 10, 1010, 1111111111, A, X, ten, dix, zehn, and so on. When I say "ten", let's think of the abstract number, not the digit sequence. Until I say otherwise, whenever I mention a number I mean the abstract number, not the symbol.

On another front, the notion of data type founders on the rocky boundary between abstract semantics and the ugly realities of implementation. Thus the notion of Integer sometimes means all of the integers, and sometimes only that set of integers which can be represented in some particular machine construct.

**2 THE COMPUTATIONAL CONTEXT**

**2.1 Interfaces to a Black Box**

Let's think of a computational system as a black box with interfaces. The interfaces transfer information back and forth to users (output and input), and also back and forth to persistent storage or network communication facilities. For our purposes we distinguish three relevant viewpoints:

- The input/output interface. We often call this the end-user interface, though it is also relevant to application designers.
- The storage and network interface. We will call people concerned with this interface "application implementers".
- The inside of the black box, known only to system implementers.

Depending on what we have in mind as the boundaries of the black box, the input/output interface is not limited to input/output devices in the usual sense. It may be an interface to other programming facilities, e.g., external address spaces. Thus, for example, certain variables may have their values maintained opaquely inside the black box, while other variables are external. Passing values to or from such external variables counts as crossing the input/output interface. (For this reason, I am sometimes tempted to use the terms "import" and "export" rather than "input" and "output", so that it doesn't necessarily connote an I/O device.)

Following the object-oriented principle of separating interface (semantics) from implementation, we focus primarily on the end user's input/output interface. We talk primarily about the language facilities provided for use at this interface, and how information appears when it crosses this interface. We also talk about the requirements and constraints imposed at this interface, without specifying how these are satisfied in the other viewpoints. Thus we may talk about how information looks on input, and what is consequently expected on output, without dictating how this appears inside the black box or in storage. In many cases it might be most expedient to represent things in a form similar to what was provided on input, but this is not required.

In keeping with this principle, OpenODB should be described as an object-oriented DBMS which implements the object-oriented language OSQL by mapping data into relational tables. Relational technology is used for storage, and for much of the processing inside the black box. However, this is just one of various possible implementations of OSQL. The fact that data is maintained internally in tables should not show through to end users, and should not be involved in explaining the semantics of OSQL. (That such things do occasionally show through is an unfortunate consequence of some difficult implementation choices, often dictated by performance considerations.)

The other viewpoints do need to be considered as well. Somebody does have to decide how things will look in storage and inside the black box. Our position is that these specifications should be segregated in such a way that the end-user's understanding of the system behavior does not require knowledge of these specifications. Furthermore, these "implementation" specifications should be changeable without altering the semantic behavior presented to the end user.

**2.2 Symbols**

The black box is a symbolic system, which means that the only things which pass across the interfaces are symbols. People and abstract numbers don't pass across these interfaces; strings and pictures do. Our general definition of "symbol" is circular, taken to mean whatever can pass across these interfaces. The exact nature expands with advances in technology, including such things as text sequences, pictures, sounds, electrical signals, and even actions such as mouse movements or screen touches. For our purposes, though, it suffices to think of symbols as text sequences.

An important point: although non-symbols can't pass across the interfaces, they can have opaque representations inside the black box. We can tell the box to create a thing such as a person and put it into a local variable. What goes into that local variable does not have to be known externally. Internal operations can be requested without such knowledge, using the local variable or some other sort of expression. The thing does have to be provided in a designed symbolic form when it is to be output (exported), either to some output device or an external variable. There also has to be some convention for understanding that an input symbol designates this thing. The thing also has to be put into a designed symbolic form when it goes to storage, but that need not be a concern of the end user.

Simple examples:

- Suppose that oid formats were not exposed, so that the black box designers were free to redesign the formats used internally. One could create an instance of Person without being able to see the oid, or knowing what the oid looks like. To display a person, e.g., as the result of a query, we would have to establish some other convention for mapping the person into a symbol, such as a name or a social security number. (See the example in Section 6.3.)
- For input and output of numbers, there are conventions which determine whether the symbols are to be interpreted as decimal, binary, hexadecimal, octal, etc. However, the numbers may be represented internally in various bit patterns which may or may not be related to forms in which they were entered.
- Date and time are often supported with a variety of external formats used for input and output, while the system uses some highly encoded timestamp format. (The fact that the internal format is sometimes described in the manual is irrelevant; it shouldn't matter to the end user.)
- By extension, we can imagine any measured quantity as being maintained internally in some unspecified format. Representation in terms of a particular number and units only matters to the end user during input and output. It also matters to the application implementer when specifying how it will be maintained in storage.

This is territory where we usually don't want to be bothered with the distinctions, because we assume some familiar defaulting conventions which gloss over the detail. Thus we hate to have to say that 10 is to be interpreted as decimal digits, because that's the normal assumption. Such things only matter when the defaults aren't self-evident. Thus our overall approach should proceed in stages: first spell out the painful details involved, then seek ways to ease the pain by restoring appropriate defaults wherever possible.

**3 NON-MATERIALIZED AND MATERIALIZED TYPES**

The general idea is that only materialized types can be exported and imported. Any other type has to be cast (coerced?) into a materialized type for export, and an inverse casting mechanism needs to be established for import.

This is closely related to the notion of semantic and presentation objects, taken to an extreme degree. Here, semantic objects are inside the black box, in opaque form. Anything else that is input, output, or stored is in a presentation form, including simple strings of bits, bytes, or characters.

The essential idea: Integer is a non-materialized type, not presuming any symbolic representation. The corresponding materialized types include Decimal_Integer (the traditional default), Hexadecimal, Octal, Binary, etc. We might also have Word_Integer, whose instances include "one", "ten", "one hundred twenty-one", perhaps with variants for different languages. The check-writer in my financial software converts numbers to a type similar to this. It may be the case that we know the internal form used inside the black box, or in storage, but this doesn't really matter to the end user.

**4 MATERIALIZED TYPES FOR NUMBERS**

The concept of a *materialized numeric type* deals with the correspondence between symbols and abstract numbers, with each such type having an associated:

- Set of symbols.
- Corresponding set of abstract numbers.
- Mapping between these.

The mapping is an important part of the type concept, since the same symbol can occur in different numeric data types to denote different numbers. Thus 10 means different things in decimal and binary, and 1E1 means different things in exponential and hex.

For the Hexadecimal type the set of symbols includes all sequences over the alphabet {0...9,A...F}. (This is the ideal type; corresponding effective types will be limited to finite sets [Section 5].)

The mapping is a pair of inverse functions for input and output, establishing correspondences between abstract numbers and symbols. Thus HexIn(A) applies the HexIn function to the symbol A, yielding the number ten (in no specific representation). Similarly, HexOut(OctalIn(10)+BinaryIn(10)) yields the symbol A.

We've actually seen these functions in many languages, though with a different syntax. HexIn(10) often takes the form x'10'. The forms x'10', o'10', and b'10' each map the string 10 into a different number. (Those are "designators".) In fact, we can think of quotation marks in a similar sense: "10" behaves like CharIn(10), mapping the string 10 to an opaque internal representation of itself.
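To make the mapping pairs concrete, here is a small Python sketch. The function names (HexIn, HexOut, etc.) come from the text above; the implementations are mine, using Python's built-in base conversions, and stand in for whatever the black box actually does.

```python
def hex_in(s):
    """HexIn: map a hexadecimal symbol (a string) to the abstract number it denotes."""
    return int(s, 16)

def hex_out(n):
    """HexOut: map an abstract number back to its hexadecimal symbol."""
    return format(n, "X")

def octal_in(s):
    """OctalIn: interpret the symbol as octal."""
    return int(s, 8)

def binary_in(s):
    """BinaryIn: interpret the symbol as binary."""
    return int(s, 2)

# The same symbol denotes different numbers under different mappings:
print(octal_in("10"))   # 8
print(binary_in("10"))  # 2

# HexOut(OctalIn(10) + BinaryIn(10)) yields the symbol A:
print(hex_out(octal_in("10") + binary_in("10")))  # A
```

Note that the mapping, not just the symbol set, is what distinguishes the types: "10" passes through three different functions here and denotes three different numbers.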

[Say something about when the materialized numeric types are specified. Most often left to the default.]

It's interesting to note that there is no materialized type corresponding to the real numbers, if we assume that symbol sets are denumerable. A data type called "Real" is dealing at most with the rational numbers.

**5 IDEAL AND EFFECTIVE TYPES**

This is where we take into account the effect of implementation restrictions on data types. An ideal type is an abstraction which typically cannot be fully implemented in any real system. An effective type is the type that's really supported by an implementation.

We want to try to describe these in semantic terms, being minimally contaminated by "implementation" considerations. Unfortunately, "minimal" is not zero.

The essential problems are these:

- An effective type does not have all the nice characteristics of its corresponding ideal type. It does not satisfy all the axioms, most notably the closure axioms.
- There are usually many effective types corresponding to one ideal type.

Languages typically take one of the following approaches:

- They only mention the ideal types in the language, and essentially tell the user about the effective equivalents by referring to "implementation dependencies" hidden away in some appendix. An ideal type thus might bind to different effective types in different environments.
- They actually incorporate effective types into the language, introducing such things as Long and Short, or Integer_16 and Integer_32. However, they sometimes do funny things, such as not recognizing that an instance of Integer_16 is also an instance of Integer_32. (The number ten occurs in both.)

Dividing these types into non-materialized and materialized, and also into ideal and effective, yields four categories. We'll go around the matrix in the following order:

- Ideal non-materialized types contain the infinitely perfect abstract numbers.
- Ideal materialized types contain their representations without length restrictions.
- Effective materialized types contain the representations with length restrictions.
- Effective non-materialized types contain the finite subsets of the abstract numbers which can be represented.

**5.1 Ideal Non-Materialized Numeric Types**

Ideal non-materialized numeric types are the ones we learned about in school. The counters (non-negative integers) are a subset of the integers, which are a subset of the rationals, which are a subset of the reals, which are a subset of the complex numbers if we want to go that far. By "subset" we mean a naive form of inclusion whereby the integer ten is itself a rational number, not just "equivalent" to something else denoted by 10.0.

All these sets are infinite, and they are all closed under addition, subtraction, and multiplication. All but the integers are closed under division. Comparisons are perfect, i.e., two numbers either are or are not equal, unambiguously. Equality is a perfect equivalence relation (symmetric, reflexive and, hardest of all, transitive), and the orderings perfectly satisfy their axioms. Even assignment works perfectly: x=e always holds after x<-e.

**5.2 Ideal Materialized Numeric Types**

These are the ones described above in Section 4, with infinite symbol sets.

**5.3 Effective Materialized Numeric Types**

These are the materialized numeric types with length restrictions. There are overall length restrictions, as well as restrictions on various components, such as the mantissa and the exponent.

In effect, these are parameterized types, parameterized by different lengths.

**5.4 Effective Non-Materialized Numeric Types**

The effective non-materialized numeric types are the ones actually supported in computational systems. They represent places where implementation considerations show through into language semantics, being impacted by limitations of hardware and representation.

(Alternative terminology: we might also call these "practical", "pragmatic", or even "finite" numeric types. Even the term "real" - as opposed to ideal - comes to mind, but that would get confused with the real numbers.)

Note that several different effective materialized types might correspond to the same effective non-materialized type. For example, 4-character hex and 16-bit binary have the same capacity.

An ideal numeric type can have a large number of overlapping effective numeric subtypes, corresponding to different internal limitations. The essential characteristic of the effective types is that they include only a subset (usually finite) of the instances of the ideal supertype. Thus Integer_16 only contains the integers up to (2**16)-1, while Integer_32 includes those as well as all other integers up to (2**32)-1. There may be interior gaps as well as outer bounds. A type restricted to three significant digits can represent 1230 and 12300, but not 1234; the "successor" to 1230 is 1240.
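The three-significant-digit example can be sketched in a few lines of Python. The function name and the rounding-to-nearest policy are my own choices (a real type might truncate instead); the point is just that the representable numbers have interior gaps, and arbitrary numbers get fudged onto them.

```python
def fudge_to_sig(n, digits=3):
    """Fudge a non-negative integer to the nearest value having at most
    `digits` significant digits (the representable subset of this type)."""
    s = str(n)
    if len(s) <= digits:
        return n                      # already representable
    scale = 10 ** (len(s) - digits)   # zeros beyond the significant digits
    return round(n / scale) * scale

# 1230 and 12300 are representable; 1234 is not, and gets fudged:
print(fudge_to_sig(1230))   # 1230
print(fudge_to_sig(12300))  # 12300
print(fudge_to_sig(1234))   # 1230
# The "successor" of 1230 in this type is 1240:
print(fudge_to_sig(1240))   # 1240
```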

This incompleteness leads to the central semantic problem of effective types: failure of closure under arithmetic operations. This is exactly what causes overflow and underflow errors, as well as truncation and rounding.

Most operations are not closed within an effective type. More generally, trying to
represent an arbitrary abstract number requires an additional *fudge* step (often
roundoff or truncation) to find an appropriate (not always nearest) abstract number which
is representable in that data type. Similarly, comparisons sometimes take such fudge
factors into account.

So, the course of a computation goes roughly as follows:

- Argument symbols are obtained from cells.
- Interpretation of the argument symbols is governed by data type specifications, which might be declared in the computational procedure, associated with the source cells, or embedded in the symbols for self-describing data. These interpretations govern the abstract number yielded by the computation.
- The abstract number resulting from the computation has to be represented in some target cell (variable, table entry, etc.). The first step is to fudge the result to the appropriate number representable in the target data type, and then place the representation of that number into the target cell.

Such data types correspond to mappings into, rather than onto, a certain kind of number (e.g., rational or integer). The set of numbers it maps symbols onto are its representable numbers. The inverse mapping is total on the representable numbers, partial on all the numbers. Alternatively, the inverse mapping is total but not invertible; an arbitrary number first gets fudged to a representable number, whose corresponding symbol is then used. Inverting this yields the representable number, not the original number.

Example: say SmallRational can represent numbers to two decimal places. SmallRational(2/3)=0.67. Number(SmallRational(2/3)) yields a number equal to 67/100, not 2/3.
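The SmallRational example can be checked directly with Python's exact rationals. SmallRational is hypothetical; here I model it as rounding to the nearest two-decimal-place value, which makes the non-invertibility visible.

```python
from fractions import Fraction

def small_rational(q):
    """Hypothetical SmallRational: fudge an arbitrary rational to the
    nearest two-decimal-place value. round() on a Fraction with ndigits
    returns an exact Fraction, so no float error sneaks in."""
    return round(Fraction(q), 2)

x = Fraction(2, 3)
y = small_rational(x)
print(y)        # 67/100, i.e. the number exactly denoted by "0.67"
print(y == x)   # False: inverting the representation recovers 67/100, not 2/3
```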

We might distinguish between "exact numbers" and "true values". A symbol exactly denotes a certain number when interpreted within a specified data type. It might not be the intended "true" number for a variety of reasons. Thus 0.666666666667 very exactly denotes a certain rational number in decimal notation. It may be "wrong" because it was originally computed as 2/3, or because a ruler had finite precision, or because somebody entered the wrong number. But it is a very exact number. This is analogous to saying that 1,000,000 is a very exact number, but it might be "wrong" because it is only the estimated population of a certain city.

We'd like to relegate this all to the realm of "implementation", quite unrelated to the semantics of operations performed by users. Unfortunately, we can't. Users see the consequences in the form of various sorts of errors. Results of certain computations get too large or too small, and comparisons occasionally behave strangely. Assigning x/y to z might not yield equality between x/y and z.

Therefore, the user needs to be conscious of "subtypes" of the number kinds. It is in fact semantically meaningful to differentiate between Integer_16 and Integer_32, or between Rational(5,5) and Exponential(5,5) [that's not the right syntax, is it?]. Each of these types has a different population of representable numbers, and assigning anything to cells having such types has to be understood as implying a fudge step.

On the other hand, we need to be careful whether we have in mind different internal implementations or different sets of abstract numbers. The difference is illustrated by whether or not we consider Integer_16 to be a subtype of Integer_32. The integer ten is an instance of both, and the same can be said of any abstract number representable in Integer_16. Thus the abstract set of numbers denoted by Integer_16 is a subset of the set denoted by Integer_32. On the other hand, a 16-bit field is not a 32-bit field. The fields declared to be of type Integer_16 are not declared to be of type Integer_32. In this sense, Integer_16 is not a subtype of Integer_32.

The same sort of argument arises between such types as Integer and Rational (aka Real). Numerically, one is a subtype of the other, but in terms of representations they are not.

An interesting and possibly useful idea is to treat numeric types as parameterized types, with the "storage" type as a parameter. This would be a useful way of relating the abstract and stored types. For example, Integer might be a type generator rather than a single type. Specific types would be given by Integer(32_bit) and Integer(16_bit). Similarly, rational types might be given by Rational(5,5) and Rational(5E5).
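A minimal sketch of the type-generator idea, in Python. The class and names are illustrative only; I use unsigned ranges to match the (2**16)-1 bound used above. It shows both views at once: the value sets nest, but the storage parameters differ.

```python
class EffectiveInteger:
    """Sketch of Integer as a type generator parameterized by storage width.
    (Unsigned, matching the (2**16)-1 bound in the text.)"""
    def __init__(self, bits):
        self.bits = bits

    def contains(self, n):
        """Value view: is the abstract number n representable in this type?"""
        return isinstance(n, int) and 0 <= n <= 2 ** self.bits - 1

Integer_16 = EffectiveInteger(16)
Integer_32 = EffectiveInteger(32)

# Value view: every number representable in Integer_16 is in Integer_32...
print(Integer_16.contains(10), Integer_32.contains(10))       # True True
print(Integer_16.contains(2**16), Integer_32.contains(2**16)) # False True

# ...but representation view: a 16-bit field is not a 32-bit field.
print(Integer_16.bits == Integer_32.bits)                     # False
```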

**6 DIMENSIONAL STUFF**

**6.1 Relevance of Non-Materialized and Materialized Types**

Clearly the intent here is to think of physical quantities in terms of non-materialized types and quantity designators in terms of materialized types. However, it turns out not to be quite that simple. There seem to be intermediate stages. Mapping a distance into a distance-designator doesn't yet yield something material until we further specify how to materialize the number and the units.

**6.2 Use and Mention of Designators**

When I mentioned "use and mention" way back there in the beginning, I thought I was kidding. It was just a passing reference to a classic philosophical (ugh) distinction that most people find too tedious. But it's relevant. (Does anyone remember the philosopher to whom the distinction is attributed?)

We are getting confused between using and mentioning quantity descriptors.

The distinction is simple yet slippery. If we proceed carefully it's all very clear, yet in a flash it can all tumble down in a confusing mess.

We *use* symbols to *mention* the things denoted by the symbols. When I
say that I was in Manhattan, you know that I just mentioned an island in New York. If I
asked you how long Manhattan is, you might tell me that it's about twelve miles. But what
if I want to mention - that is, talk about - the name of that island? If I ask you how
many characters there are in Manhattan, you might guess there are about two million.
That's because I'm *using* the name. If I want to *mention* the name, the
normal convention is to *use* a quoted form: How many characters in
"Manhattan"? Now the answer is nine. I *use* the 11-character sequence
"Manhattan" to *mention* the nine-character name of the island. I use the
nine-character sequence to mention the island.

It's somewhat analogous to asking whether 2+2 and 1+3 are the same. Yes, if we are using these expressions to mention numbers; they both mention the same number. No, if we are mentioning the expressions; they are different expressions.

What has this got to do with units? Watch carefully.

Good old Bill, his height is six feet.

What were the units of that? Feet.

What was that last question about? It wasn't about the distance, it was about the designator.

Is Bill taller than Bruce? I don't know.

What was that last question about? It was about the distance, not the designator.

At the risk of getting too tedious, let me spell it out again. Designators and distances are different. Sometimes we're talking about (mentioning) the designator, and sometimes we're using the designator to talk about (mention) the distance.

How can we manage the distinction, to know which we're talking about when? A crude, impractical solution is to use something like quotes. We could say that it doesn't make sense to ask about the units in six feet, but we can sensibly talk about the units in "six feet".

A more useful approach involves casting. We recognize that designators and distances are different types of things. We have some operations which are defined on designators but not on distances, i.e., the ones which return the number, units, or dimension. We also have some operations which make sense on distances but not really on designators, e.g., for comparing or combining. We make such operations applicable to designators by invoking the casting concept, saying that if a designator appears where a distance is required, then we cast the designator into the corresponding distance.

This also ties in to the overloading notion mentioned by Stephanie, since casting always implies a form of overloading. Given a casting function c12 from type T1 to type T2, then any function defined on T2 is effectively also defined on T1. If x is an instance of T1, then f(x) is defined as f(c12(x)).
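The casting-implies-overloading rule, f(x) ::= f(c12(x)), can be sketched generically. The names here are mine: `lift` is the overloading mechanism, and the designator/distance example is a toy in which a designator is a (number, inches-per-unit) pair and a distance is just a number of inches.

```python
def lift(f, cast):
    """Given f defined on T2 and a casting function cast: T1 -> T2,
    make f effectively defined on T1 via f(cast(x))."""
    return lambda x: f(cast(x))

# Toy cast: a designator (number, inches_per_unit) -> a distance in inches.
def to_inches(designator):
    number, inches_per_unit = designator
    return number * inches_per_unit

# An operation that makes sense on distances, not on designators:
def shorter_than_100_inches(d):
    return d < 100

# Lifted, it now applies to designators as well:
check = lift(shorter_than_100_inches, to_inches)
print(check((6, 12)))    # True: "6 feet" is 72 inches
print(check((120, 1)))   # False: "120 inches"
```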

So, what's the bottom line? As is often the case, we seem to have traveled a long way around to come to a simple conclusion, but it's important if we've gotten it right.

Designators and distances are different things, both recognized in our type system. (Why? Because there is not a one-to-one correspondence between them. There are many designators for a given distance.) Distances only have opaque internal representations, and have to be mapped ("measured") into designators before they can be exported.

Functions can be defined to return either distances or designators. Those that return designators can be used where distances are required by virtue of casting. Those that return distances can be used where designators are required if a default designator has been established, or if an explicit mapping ("measurement") is applied.

If this conclusion is useful, and if the means of getting there has been useful, then we may have chalked one up for the usefulness of philosophy.

**6.3 An Interesting Analogy**

We can make an interesting analogy between employees and distances if we assume that oid's cannot be exposed.

**6.3.1 Single Designation**

A simple case arises when the default symbol for an employee is the employee number. It can be used on output or input to designate an employee, but this may or may not be the form in which employees are maintained internally.

Distance would be an analogous case if there was one standard unit for representing distances, e.g., inches. Then any distance can be designated on input or output simply by a number.

**6.3.2 Multiple Designations**

Now suppose we had more than one possible way to designate an employee, e.g., social security numbers could also be used. Now we can't simply look at a number and know which employee it designates. We need to know whether the number is an employee number or a social security number. There are several general ways to do this:

- By a self-describing notation, e.g., <123, Enum>, <123, Ssn>.
- By appropriate functional expressions. Suppose that UWC is a variable containing the internal representation of an employee who is the current United Way Coordinator.
  - For internal operations, we can use UWC directly, e.g., If UWC=CEO then...
  - To display the current UWC, we would use something like Display(Enum(UWC)) or Display(Ssn(UWC)).
  - To designate an employee by employee number or social security number, we would use something like UWC<-EmpWithEnum(123) or UWC<-EmpWithSsn(123).

There is a one-to-one correspondence between <123,Enum> and EmpWithEnum(123). They both designate a certain employee. However, we may or may not be able to think of those two constructs as identical, since one is a tuple while the other is an expression. This distinction might give us trouble.

It might be interesting to bridge the two notations with some functional equivalence. Roughly, we might like to say that

EmpDesBy(<x,y>) ::= GetFunc(y)(x),

where GetFunc(y) returns the EmpWithEnum function if y=Enum and EmpWithSsn if y=Ssn. Thus

EmpDesBy(<123,Enum>) = GetFunc(Enum)(123) = EmpWithEnum(123).
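The bridge between the tuple notation and the functional notation can be sketched directly. The function names follow the text; the dictionary "store" is a stand-in for the black box's opaque internal representations.

```python
# Hypothetical internal store; the values stand in for opaque employee objects.
_by_enum = {123: "<opaque employee A>"}
_by_ssn = {123: "<opaque employee B>"}

def emp_with_enum(n):
    """EmpWithEnum: designate an employee by employee number."""
    return _by_enum[n]

def emp_with_ssn(n):
    """EmpWithSsn: designate an employee by social security number."""
    return _by_ssn[n]

def get_func(kind):
    """GetFunc(y): pick the input mapping named by the designation kind."""
    return {"Enum": emp_with_enum, "Ssn": emp_with_ssn}[kind]

def emp_des_by(pair):
    """EmpDesBy(<x,y>) ::= GetFunc(y)(x)."""
    x, y = pair
    return get_func(y)(x)

# <123,Enum> and EmpWithEnum(123) designate the same employee:
print(emp_des_by((123, "Enum")) == emp_with_enum(123))  # True
print(emp_des_by((123, "Ssn")) == emp_with_ssn(123))    # True
```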

[Develop the analogy here with distances in different units. Establish here the idea of units as functions analogous to Enum and Ssn.]

**6.3.3 Conversions**

Given an employee number, we may want to know the corresponding social security number, and vice versa.

With the functional notation, this involves straightforward composition of functions:

E2S(e) ::= Ssn(EmpWithEnum(e)),

S2E(s) ::= Enum(EmpWithSsn(s)).

This is a two-stage process, always getting the internal representation of the employee as an intermediate result. However, let's imagine that there is some computational algorithm e=f(s) which relates employee numbers and social security numbers, e.g., the company defines employee number to be twice the social security number. (Don't ask me why. It may not be practical, but it facilitates the analogy.)

Now we can define the conversions as one-step processes, bypassing the internal representations:

S2E(s) ::= f(s),

E2S(e) ::= f⁻¹(e).

However, these are still logically equivalent to the previous forms.
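Both routes can be sketched side by side, under the assumed (and admittedly artificial) rule from the text that employee number = 2 × social security number. The internal tuples are stand-ins for opaque employee representations.

```python
# Opaque internal representation, keyed internally by employee number.
def emp_with_enum(e): return ("emp", e)
def emp_with_ssn(s):  return ("emp", 2 * s)   # e = f(s) = 2s
def enum(emp):        return emp[1]
def ssn(emp):         return emp[1] // 2

# Two-stage conversions, through the internal representation:
def s2e(s): return enum(emp_with_ssn(s))
def e2s(e): return ssn(emp_with_enum(e))

# One-step conversions via the algorithmic relation, bypassing the internals:
def f(s):     return 2 * s
def f_inv(e): return e // 2

# Logically equivalent:
print(s2e(123), f(123))      # 246 246
print(e2s(246), f_inv(246))  # 123 123
```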

The analogy with units conversion is pretty obvious.

**6.4 Another Analogy**

An interesting analogy can also be made with a more abstract notion, namely geometric points in two dimensions. A measurable quantity such as a particular distance has an abstract existence in its own right, just as does a point in a two-dimensional space. These things aren't symbols, or even numbers, so there is no intrinsic way to represent them in symbols. (They could be represented in a broader set of symbols, if we included diagrams.)

These things do have various properties, which can be used to represent them uniquely.
Thus a point has an x-coordinate, a y-coordinate, an angle, and a magnitude. [Is that
appropriate terminology for polar coordinates?] There are functions which can be applied
to a point (in some opaque internal representation) to return any of these properties.
Conversely, certain functions map values to points, and can be used to uniquely *designate*
the points. Thus Rect(x,y) designates a point whose x- and y-coordinates are x and y,
while Polar(x,y) designates a point whose angle and magnitude are x and y. (Generally not
the same point, obviously.)

Similarly, units serve as a set of functions which map a distance into different numbers. Thus Inches maps a particular distance to a particular number, while Feet maps the same distance to a different number. The inverses of these functions can be used to uniquely designate specific distances. Thus InchDistance(12) designates a distance, while FeetDistance(12) designates a different distance.
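A sketch of units-as-functions, with the distance kept opaque. The internal unit here (meters) is an arbitrary implementation choice that nothing outside the class may depend on; the function names follow the text.

```python
class Distance:
    """Opaque internal representation of a distance. The stored unit
    (meters here) is an implementation detail of the black box."""
    def __init__(self, _meters):
        self._m = _meters

# Units as functions mapping one distance to different numbers:
def inches(d): return d._m / 0.0254
def feet(d):   return d._m / 0.3048

# Their inverses uniquely designate distances:
def inch_distance(n): return Distance(n * 0.0254)
def feet_distance(n): return Distance(n * 0.3048)

d = feet_distance(6)
print(feet(d))    # 6.0: Feet maps this distance to one number...
print(inches(d))  # 72.0 (up to float roundoff): ...Inches maps it to another
```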

If these designators are the only way to uniquely refer to the abstract item, then we need some criterion to determine whether two designator expressions refer to the same item. First of all there is a trivial set of rules of the form:

Rect(x1,y1)=Rect(x2,y2) <=> x1=x2 & y1=y2, etc.

InchDistance(x)=InchDistance(y) <=> x=y, etc.

More interesting are the equivalences of the form

Rect(x,y)=Polar(a,m) <=> tan(a)=y/x & m=sqrt(x²+y²).

InchDistance(x)=FeetDistance(y) <=> x=12y.

[Be sure there's no confusion about signs and quadrants.]
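The cross-system equivalences can be checked numerically. This sketch uses `atan2` and `hypot` rather than the raw tan(a)=y/x form, precisely to sidestep the sign-and-quadrant question; the float comparisons are approximate, which is itself a small instance of the effective-type fudging discussed in Section 5.

```python
import math

def rect_equals_polar(x, y, a, m):
    """Rect(x,y) = Polar(a,m) iff they designate the same point.
    atan2/hypot handle signs and quadrants correctly."""
    return (math.isclose(math.hypot(x, y), m, abs_tol=1e-12) and
            math.isclose(math.atan2(y, x), a, abs_tol=1e-12))

def inch_eq_feet(x, y):
    """InchDistance(x) = FeetDistance(y) iff x = 12y."""
    return math.isclose(x, 12 * y)

print(rect_equals_polar(1.0, 1.0, math.pi / 4, math.sqrt(2.0)))  # True
print(inch_eq_feet(72, 6))                                       # True
print(inch_eq_feet(72, 5))                                       # False
```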

Conversion problems. First within a particular designator system. Then across several. Suppose we want to select two points x and y equally spaced between (0,0) and (0,1). We've already got problems with the number system. The three distances from (0,0) to x to y to (0,1) might not add up to the distance between (0,0) and (0,1). This is independent of units.

Then we deal with accuracy of conversion between polar and rectangular coordinates.