Using an Object Model in Pegasus to Integrate Heterogeneous Data

======= STILL NEED TO DO THE FIGURES!!!! ==========

Rafi Ahmed, Philippe DeSmedt, Weimin Du, William Kent, Mohammad Ketabchi (consultant), Witold Litwin (consultant), Abbas Rafii, Ming-Chien Shan
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California

April 1991



> ABSTRACT
> 1 INTRODUCTION . . . 2
>> 1.1 Objectives . . . 2
>> 1.2 Related Work . . . 3
> 2 AN OVERVIEW OF PEGASUS . . . 4
>> 2.1 Intelligent Information Access . . . 5
>> 2.2 Cooperative Information Management . . . 5
>> 2.3 Foreign Data Access . . . 6
> 3 THE PEGASUS DATA MODEL . . . 6
>> 3.1 Pegasus Data Language . . . 7
> 4 MAPPING . . . 8
>> 4.1 Schema Mapping For Relational Databases . . . 9
> 5 QUERY PROCESSING . . . 10
> 6 DATA INTEGRATION . . . 11
>> 6.1 Minimal Integration . . . 12
>> 6.2 Supertypes and Functions . . . 13
> 7 DOMAIN MISMATCH AND SCHEMA MISMATCH . . . 14
>> 7.1 Domain Mismatch . . . 14
>>> 7.1.1 Domain Mappings . . . 16
>>> 7.1.2 Integrator Functions . . . 17
>> 7.2 Schema Mismatch . . . 18
>> 7.3 Summary . . . 19
> 8 OBJECT IDENTIFICATION . . . 20
> 9 CURRENT STATE AND FUTURE DIRECTIONS . . . 22
> 10 CONCLUSIONS . . . 22
> 11 REFERENCES . . . 23


ABSTRACT

The Pegasus project at Hewlett-Packard Laboratories is extending the Iris prototype OODBMS and its language OSQL to support integration of multiple heterogeneous data sources.

Heterogeneity of data models is managed by mapping to a common Pegasus object model, an extension of the Iris model. The power of the object model is then applied to reconciling schema and data heterogeneity within a uniform model.

Each foreign data source has an image as a Pegasus database. The richness and extensibility of the object model facilitate mappings from other data models to the Pegasus model. Mapping capabilities for specific target models will be available as modular extensions of Pegasus and HOSQL.

A Pegasus user perceives the data as a single database, incorporating the data and schemas of attached databases as extensions of the data and schema of the database to which the user is connected. Database boundaries are transparent in this view whenever possible, but they are also visible when needed or desired through appropriate naming conventions.

Semantic integration exploits the features of the object model, including supertypes, inheritance, object identity, behavior specifications, and the treatment of meta-data (e.g., types and functions) as objects themselves. Behavior specifications, i.e., user-defined functions, provide a general mechanism by which users can specify resolution of discrepancies when there is no natural resolution. Techniques are being developed for imposing object identifiers onto non-object-oriented systems, and for reconciling identifiers among multiple systems.

1 INTRODUCTION

1.1 Objectives

Globalization of economies has made information a more valuable strategic asset than ever before. At the same time, increasingly decentralized and multi-vendor computing environments have made it more difficult to gain efficient access to the data.

In the workplace a decade from now, it is expected that most data will be in databases and there will be many databases around. For instance, a manager will usually have a database of employees who report to him. Each employee will have his or her own personal database. The company will also have employee databases, but their contents will not always coincide with the personal ones. A lab will have a database dealing with its projects and group life insurances. Departments will have databases concerning their suppliers, clients, hardware and software they produce, buy or sell. Employees' e-mail, personal documents, reference materials, schedules, pieces of work under progress, etc. will reside in databases. Many of the current productivity tools will become database applications. There will be many multimedia applications involving voice, image and text. Management and presentation of heterogeneous data residing in heterogeneous databases with local autonomy will be key concerns in such environments.

The Database Technology Department at Hewlett-Packard Laboratories is developing a heterogeneous multidatabase management system called Pegasus to support an open information environment which provides the interoperability required by diverse applications in the 1990s.

Pegasus responds to the need for effective access and management of shared data used by a wide range of applications. Pegasus will provide facilities for multidatabase applications to access and manipulate multiple autonomous heterogeneous distributed object-oriented, relational and other information systems through a uniform interface. The system is intended to interoperate with existing object management systems. Pegasus is not just a front-end to multiple databases but has complete data management capabilities.

Pegasus deals with the problem of heterogeneous database models and data languages by defining a common object data model and data language. The data definition and data manipulation language of Pegasus, called HOSQL (Heterogeneous Object SQL), is a functional language that provides non-procedural statements to manipulate multiple databases. HOSQL allows for transparent and explicit access to multiple heterogeneous databases in a single declarative statement.

The objectives of Pegasus are:

1.2 Related Work

There are a number of heterogeneous database projects and systems that have been described in the literature. [L4] is a survey paper on the subject. In the following we briefly discuss those systems which have interesting differences or similarities with Pegasus.

Multibase [Sm,LR,DH] represents an approach to access multiple existing heterogeneous databases through a single global schema using a high level query language. Multibase uses the concepts of type generalization coupled with an extensive view definition facility. The Multibase query processor decomposes a global query into subqueries against the local schemas. A local database interface translates queries received from the query processor into queries expressed in the local DML. The query language of Multibase is DAPLEX [S2], which provides certain procedural constructs like iteration. The data models of both Pegasus and Multibase are based on the functional paradigm. However, Pegasus does not require a global integrated schema and unlike multibase it is a complete DBMS. Pegasus users can create Pegasus databases without a need for an underlying foreign DBMS.

The Amoco Distributed Database System (ADDS) [BO] uses an extended relational data model to integrate relational, network and hierarchical databases. Like Pegasus, an equivalent description of the local databases is generated in terms of the ADDS data model. ADDS suggests that an extended relational algebra (with joins, natural joins, outer joins, intersection, and selects), rather than the functional approach taken in Multibase, is all that is needed to implement a composite global schema.

Mermaid [Te] started as a research project at UNISYS corporation and is marketed by System Development Corporation. Mermaid allows the user to manipulate the data stored under various relational DBMSs running on different machines using SQL and a common language called ARIEL. The SQL or ARIEL queries get translated into a distributed intermediate language (DIL). A distributor process optimizes DIL and sends subqueries to the underlying DBMSs. There is a driver process at the local site for each DBMS. The driver translates DIL subqueries to the local DBMS query language. All information about schemas, databases, users, host computers and the network is contained in a data dictionary/directory (DD/D). Mermaid provides data translation from local to global representation and vice-versa. Pegasus has a more powerful multidatabase language and it does not require a global schema.

DATAPLEX is a heterogeneous distributed database management system developed at GM Research Labs [Ch]. Like most of the other systems, the data definitions of all participating databases are transformed to an equivalent relational data definition. One of the contributions of this project is the mapping of non-relational (e.g. IMS) data definitions to relational schemas, and translation of relational subqueries to equivalent non-relational queries.

There are a number of commercial products that provide multidatabase access capabilities. For example a Sybase database can become an entry point for access from one database to other databases in the same server. The language Transac-SQL is an extension of SQL with multidatabase access features. There is also a consortium of several hardware and software vendors, as well as users, known as SQL Access [Ba], who are addressing the issues of interoperability of SQL-based applications. SQL Access uses ISO/RDA as a communication mechanism between database applications and servers. RDA specification defines the communication protocol between a single server and client. However, RDA uses the Distributed Transaction Processing (TP) standards [TP,SA] for coordination between a client and multiple servers. SQL Access does not address data and schema integration and is limited to SQL and relational DBMSs.

2 AN OVERVIEW OF PEGASUS

Figure 1 shows the major functional layers of the Pegasus system. Intelligent Information Access (IIA) provides such services as information mining, schema exploration, and free-form queries. Cooperative Information Management (CIM) deals with schema integration, query processing, and transaction management. Foreign Data Access (FDA) manages schema mapping, query and command translation, network communications, foreign system invocation, and data conversion and routing.

\begin{figure}[htbp] \vspace*{7in} \vspace*{6in} \caption{Functional layers of Pegasus.\label{pegsys}} \end{figure}

Figure 1. Functional layers of Pegasus.

2.1 Intelligent Information Access

Users or applications will use HOSQL or other preferred languages, e.g., SQL, to communicate with Pegasus. Additional user friendly interface services such as graphic browsers and information mining will be provided.

2.2 Cooperative Information Management

CIM is responsible for processing HOSQL statements and for coordinating multidatabase transactions. It handles registration of foreign systems and schema mapping and integration (e.g., supertype creation, object equivalence, and identifier management).

User requests are submitted to RM (Request Manager) through various interfaces. RM parses the requests and routes them to the appropriate operational managers which are SM (Schema Manager), QM (Query Manager), and TM (Transaction Manager).

SM implements data definition operations, catalog management, and schema integration services. All information about the mapping of schemas defined in a FS (Foreign System) will be kept in the catalog to be used for query translation and processing, and transaction management. Additional information, e.g., statistical data, are collected by SM.

QM generates efficient execution plans for queries and coordinates the execution of queries and data manipulations. Query processing is described in section 5.

TM manages all transaction-related facilities. As the new computing environments encompassing heterogeneous and autonomous information systems begin to emerge, it becomes increasingly clear that the limitations of the traditional transaction concept begin to outweigh its advantages. The main source of difficulty in applying the traditional transaction management techniques in these new environments is the requirement of local autonomy for the individual FSs which participate in the transactions.

In conventional distributed DBMSs, the execution coordinator communicates with the component databases to enforce data integrity through the well-known two-phase locking and two-phase commit protocols. This is possible because all the component databases which participate in the transaction provide the same transaction management facilities. In a heterogeneous environment, not all FSs will have the same transaction management facilities. Therefore, Pegasus is exploring new transaction management facilities which provide more flexibility[L5].

TM maps an HOSQL DCL statement into the corresponding FS transaction management commands, and submits that to the FS via a Pegasus Shell (P-shell). The mapping information and isolation/recovery protocols are recorded in the catalog during FS registration time. If a particular function, such as an undo operation, is missing from an underlying FS, it could be implemented in the associated P-shell. To support an undo operation, an approach based on compensating transactions and idempotent operation logging are being explored.

2.3 Foreign Data Access

Access to foreign systems is implemented via Mapper and Translator modules and P-shells. FDA allows the rest of the Pegasus system to be independent of specific syntax, semantics, and implementations of FSs. FDA is modular, and can easily be extended to support new kinds of FSs by providing new modules.

One Mapper and Translator module is provided for each kind of FS that is supported by Pegasus. A Mapper supports mapping a foreign schema into a Pegasus schema. Mappers for OSQL (the Iris query language) and HP AllBase SQL have been developed [AR]. A translator translates an HOSQL statement into a request to an FS.

To respect the autonomy of FSs, the P-shell is designed to be a stub to perform those functions which cannot be performed efficiently by FSs. The intent is for the Pegasus site to bear the overhead of integration, rather than imposing it on the FS. A P-shell provides such services as FS invocation, network communication, data format conversion, data blocking/deblocking, and data routing to Pegasus.

A P-shell could be enhanced to provide any desired function. If a functionality such as sorting or two-phase commit is missing from an underlying FS it can be provided by a P-shell. P-shells also could be enhanced to manage run-time execution plan adjustment and statistical data collection.

3 THE PEGASUS DATA MODEL

Pegasus provides an object-oriented model which serves as a framework for uniform interoperation of multiple data sources with different data management systems. The Pegasus model, based on the Iris object-oriented model [F2], contains three basic constructs: types, functions and objects.

A type has a unique name and represents a collection of objects that share common characteristics. Types are organized in a directed acyclic graph that supports generalization and specialization and provides multiple inheritance. A type may be declared to be a subtype of other types. A function defined on a given type is also defined on all its subtypes except on those subtypes which are explicitly prohibited by the supertype from inheriting the function. Objects that are instances of a type are also instances of its supertypes. A simplified version of the Pegasus system type hierarchy is shown in Figure 2, where the subtype relationship is shown by an arrow from a supertype to a subtype. User objects belong to user-defined types, either locally created or obtained by attaching other databases.

\begin{figure}[htbp] \begin{figure}[ht] \vspace*{7in} \vspace*{4.5in} \caption{Simplified type hierarchy.\label{types}} \end{figure}

Figure 2. Simplified type hierarchy.

Objects are instances of types and are uniquely identified by their object identifiers. Some objects such as integers are self-identifying. These objects are instances of literal type shown in Figure 2. Objects may gain and lose types dynamically. For example, an object representing a given person may be created as an instance of the Student type. Later it may lose the Student type and acquire the type Employee.

Functions are the manifestations of operations and provide mappings among objects. Properties of, relationships among, and computations on objects are expressed in terms of functions. Arguments and results of functions are typed. A type can thus be characterized by the roles it plays in the arguments and results of various functions.

3.1 Pegasus Data Language

Pegasus provides a unifying data definition and data manipulation language called HOSQL (Heterogeneous Object SQL), a superset of OSQL [F2]. HOSQL is a functional object-oriented language that incorporates non-procedural statements to manipulate multiple heterogeneous databases. HOSQL provides statements for creating types, functions, and objects in both Pegasus and underlying databases. Specifications of types and functions can also be imported from underlying databases into a Pegasus imported database, which can then be integrated into global schemas and application views.

Databases, types, functions, and instances are defined by statements of the form:

CREATE ObjectSpecification AS ObjectImplementation;

ObjectSpecification is DATABASE or TYPE followed by a user-defined name, or FUNCTION followed by a function name and the specification of its inputs and outputs. The AS ObjectImplementation clause is optional. Figure 3 shows several examples of CREATE statements.

REGISTER System-1 ...;
REGISTER DATA-SOURCE Teachers ON System-1...;
REGISTER DATA-SOURCE Students ON System-1...;
CREATE IMPORTED DATABASE TeacherDB FROM Teachers...;
CONNECT TeacherDB;
CREATE TYPE Teacher AS RDBMS1(Table=TchrTable, OID=TchrNum)

PROPERTIES

(Name AS RDBMS1(Column=TchrTable.Name),

...

 );
DISCONNECT;
CREATE IMPORTED DATABASE StudentDB FROM Students...;
CONNECT StudentDB;
CREATE TYPE Student AS RDBMS1(Table=StudTable, OID=StudNum)

 PROPERTIES

 (Name AS RDBMS1(Column=StudTable.Name),

 ...

 );
DISCONNECT;
CREATE DATABASE PeopleDB ...;
CONNECT PeopleDB;
ATTACH TeacherDB ALIAS TDB;
ATTACH StudentDB ALIAS SDB;
CREATE TYPE Person SUPERTYPE OF t Teacher, s Student,

IDENTITY(TDB.SocSecNum(t) = SDB.SocSecNum(s));
CREATE FUNCTION Name(Person) -> String;
CREATE FUNCTION Activities(Person p) -> String time, String cname, String role 

AS SELECT time, Name(c), role

FOR EACH String time, Course c, String role

WHERE (p=TDB.TeacherOf(c) AND time=TDB.Start(c) AND role="teacher")

OR (c IN SDB.Courses(p) AND time=SDB.Begin(c) AND role="student");
SELECT Activities(:john);
<1:15, Chemistry, teacher>

<2:30, Algebra, student>

<5:00, Psychology, student>

Figure 3. A data integration example.

HOSQL is a uniform functional/object-oriented query and data manipulation language. Variables, which are references to objects in the result or argument of a function, can be used in queries and update statements. Variables range over the domains of types they refer to. An object can be retrieved into a variable which can then be used to refer to the object.

An HOSQL query can be expressed using the following syntax:

SELECT list of variables or functions
FOR EACH list of all variables and their types
WHERE Predicate expression

The SELECT clause contains a list of variables or functions. The FOR EACH clause contains quantification and the types of all variables being used in the select and where clauses. The WHERE clause contains a predicate expression that may involve nested functions (possibly nested), variables, constants or nested subqueries.

Update is done by functional assignment, e.g.,

Age(:john) := 30.

4 MAPPING

Mapping generates a Pegasus schema for a foreign data source in order to

Mapping capabilities are provided in modular fashion, with a separate module for each target data model, e.g., relational models, Codasyl models, IMS, hypertext, spreadsheets, etc. Such modules could be developed, sold, and installed independently.

Each module provides:

Mapping mechanisms support variants of the AS clause of the CREATE statements described in section 3.1. For example, after installing a module for a relational DBMS (RDBMS1), Pegasus will understand the following:

CREATE TYPE Employee AS RDBMS1(Table=Emp, OID=EmpNo)
PROPERTIES
(EmpNo Integer AS RDBMS1(Column=Emp.Empnum),
Name String AS RDBMS1(Column=Emp.Name),
Skills MANY AS RDBMS1(Table=Skills, Cols=Empnum->Skill));

Mapping and translation specifications can also be hand-crafted via the foreign function facility of Pegasus, or automated (fully or partially) via special-purpose tools. The following describes automatic mapping of relational schemas to Pegasus schemas.

4.1 Schema Mapping For Relational Databases

Depending on the amount of information available from the relational schema, one of the following cases may apply [AR]:

HOSQL provides facilities for defining and referencing parameter names in a function. This facility is similar to named parameters provided by some programming languages such as Ada. It allows information provided by the attribute names of a relation to be preserved in this mapping.

For determining the mapping between HOSQL and SQL schemas, the following correspondences between functions, types and relations are maintained:

5 QUERY PROCESSING

The overall query processing flow is shown in Figure 4. The user's query is formulated in HOSQL based on a Pegasus schema. QM (see Figure 1) parses the query into an intermediate structure called an F-tree, which is the syntax tree of HOSQL functional expressions. The nodes of an F-tree include function calls, variables, and literal objects. F-trees are transformed into B-trees which are annotated with catalog information. For queries involving foreign data source this would include the database name, site network id, etc. A query against a single FS will bypass the Pegasus optimization process and will go directly to the FS. This is an attempt to minimize the overhead of Pegasus for single database queries.


\begin{figure}[htbp] \begin{figure}[ht] \vspace*{5in} \vspace*{4.5in} \caption{Query processing flow.\label{qproc}} \end{figure} Figure 4. Query processing flow.

Portions of the query referring to data at the same FS will be identified and grouped in order to reduce the number of invocations to each FS (subject to other optimization criteria). QM will enumerate possible groupings based on FS site information. For each grouping all data nodes in the B-tree belonging to the same group will be merged and become a single data node, called a virtual data node. This process decomposes the original query into a set of subqueries, each of which refers to data residing in only one FS. Pegasus trees produced for each grouping are called D-trees.

Based on the available statistical data, QM will start a cost-based optimization for a D-tree. This includes choice of join order, join methods, join sites, and intermediate data routing and buffering, etc. The result of this process is a globally optimized execution plan for each particular group. The cheapest plan is chosen as the query execution plan.

Once the query execution plan is determined, QM will translate the FS access requests represented by a virtual data node into the FS-specific application program interface languages and will set up other structures to facilitate the run-time process. Steps in a final execution plan are either FS DML statements acceptable by their destination FSs or are operational primitives. Examples of operational primitives are commands to perform joins in Pegasus, to move data across FSs, and to synchronize steps executed in parallel.

QM distributes the relevant execution steps to each P-shell associated with the involved FSs and coordinates the overall execution. This includes establishing network connections, control blocks and buffer allocation, parameter binding of FS queries, invocations of FS, and assembly of results.

Unlike distributed DBMSs, Pegasus has limited statistical information for data residing in FSs. Moreover, Pegasus has no control over the optimization of subqueries sent to each FS. For example, Pegasus has no right and cannot enforce a particular access path to be used at an FS site. Otherwise, it will violate site autonomy. Therefore, Pegasus emphasizes global optimization and tries to find the best possible decomposition and grouping of queries. Accordingly, research on query processing in Pegasus is focused on:

6 DATA INTEGRATION

The interrelationships among systems and data sources are illustrated in Figure 5. A client deals with one Pegasus system. Any other system is a foreign system.

\begin{figure}[htbp] \vspace*{7in} \caption{Interrelationships among systems and databases.\label{pegrels}} \end{figure}

Figure 5. Interrelationships among systems and databases.

A data source is typically a database, though it might be some other type of data source, such as a file system. A foreign system may support one or more data sources. Data sources not in Pegasus are foreign data sources.

A foreign data source is represented in Pegasus by an imported schema, which looks like a native Pegasus schema except that its underlying data is maintained at the foreign system. To preserve the full autonomy of the foreign databases, only that part of the foreign database which is exported is visible through Pegasus. The exported schema of a foreign database is a foreign schema for Pegasus.

A native database is created in Pegasus in the usual fashion, having both schema and data in Pegasus. Other databases (imported or native) may be attached to a native database, providing access to multiple data sources.

A root database is a native database to which a client is connected, and through which he may have access to other attached databases.

The following facilities are provided for creating Pegasus multidatabases:

6.1 Minimal Integration

Basic attachment creates a union of database schemas. As an example, Figure 6 shows the schema of the working environment that a client has when he connects to db0 as the root database and then attaches database db1.

\begin{figure}[htbp] \vspace*{7in} \caption{Minimal schema integration.\label{minschem}} \end{figure}

Figure 6. Minimal schema integration.

The user-created types, functions, and other objects in the root and attached databases are presumed distinct and disjoint from each other. Names of types and functions may be prefixed by their database name or alias to prevent ambiguity. Ambiguities involving the root database will be resolved in favor of the root. Other approaches to resolve ambiguities are being investigated.

System types and functions, as well as literal objects, are shared by all Pegasus databases. Pegasus has one set of system types and functions, and these are assumed to be the same objects in every Pegasus schema.

6.2 Supertypes and Functions

One fairly natural integration technique which is supported in the first Pegasus prototype is the creation of supertypes of types defined in attached databases. If db1 and db2 each contain a Student type, and both are attached to db0 we might create a common supertype in db0:

CREATE TYPE Student SUPERTYPE OF db1.Student, db2.Student;

The new type Student (also known as db0.Student) will continue to exist in db0 even if db1 or db2 are detached. The instances of db0.Student will include the instances of db1.Student and db2.Student if and only if those databases are attached. For now, we assume that the students in the two databases are disjoint, so no student exists in both. This restriction will be relaxed when multidatabase object identification schemes are provided in Pegasus.

What functions are applicable to db0.Student? Initially, none. The approach we have adopted for now is to require functions to be defined explicitly on the supertype. In the future we will support a notion of upward inheritance.

Suppose, for example that functions called Name exist for the Student type in both db1 and db2 i.e., db1.Name is defined on db1.Student and db2.Name is defined on db2.Student. In db0 after

CREATE FUNCTION Name(Student)->String;

the query

SELECT Name(x) FOR EACH Student x;

will return db1.Name(x) if x is an instance of db1.Student and db2.Name(x) if x is an instance of db2.Student.

Simple mismatches of function names can be handled by using the ALIAS feature of HOSQL, allowing functions with different names to participate in inheritance as though they had the same name:

CREATE FUNCTION Name(Student)->String
ALIAS db1.SName, db2.StudentName;

Semantic or behavioral differences among functions in different databases cannot in general be reconciled automatically. The Pegasus mechanisms for defining derived or foreign functions allow a database administrator to specify the appropriate reconciliation strategy. Suppose, for example, the Grade function in db1 and the Points function in db2 did not behave consistently. The user can define functions Map1 and Map2 which convert each to a common result, and then

CREATE FUNCTION Score(Student x)->Real r AS
IF db1.Student(x) THEN Map1(Grade(x))
ELSE IF db2.Student(x) THEN Map2(Points(x))
ELSE ERROR;

7 DOMAIN MISMATCH AND SCHEMA MISMATCH

Domain mismatch and schema mismatch are two of the important semantic integration problems for interoperating heterogeneous databases [K1,K2,BL,DH], and will be focused on in the second Pegasus prototype.

Domain mismatch generally arises when several databases treat some common conceptual territory in different ways, the simplest example being differences in units of measurement. Schema mismatch is much the same thing at the schema level; an interesting "cross-over" problem arises when things in the data of one database correspond to things in the schema of another.

The first step is to understand and structure the problems, along the lines described below. This analysis clearly indicates that the complexity of the problem requires a general-purpose database programming language for its solution. In the future we will assess the power of the Iris Programming Language (IPL) extension of Iris OSQL [An,Ly,F1,F2] to solve these problems, and develop extensions where needed.

7.1 Domain Mismatch

The domain mismatch problem begins when some common conceptual territory is treated in different ways by different domains in different spheres. Consider the "conceptual territory" of money. The different "domains" correspond to different currencies in which money might be represented. A "sphere" is some scope in which a single domain, i.e., a single currency, is used. While we often think of a sphere as a database, a sphere might be a subset of a database in which several currencies are used, or it might span several databases.

\begin{figure}[htbp] \begin{center} \begin{tabular}{r||c|c|c|l} Conceptual&\multicolumn{3}{c|}{Spheres}\\ \cline{2-4} Territories&US &France&Germany& \\ \hline money &m_1&m_2 &m_3 &\leftarrowdomain group \\ jobs &j_1&j_2 &j_3 &\leftarrowdomain group \\ colors &c_1&c_2 &c_3 &\leftarrowdomain group \\ \hline \end{tabular} \end{center} \vspace*{.1in} \caption{Domains.\label{domfig}} \end{figure}

Figure 7. Domains.

A domain group is a set of domains d_i which cover some conceptual territory. Each domain d_i is typically in a distinct sphere s_i. In Figure 7, the domains m_1, m_2, m_3 might be different currencies in which money is represented.

Currencies represent a relatively simple sort of domain mismatch, involving computational conversion among literal data values. More complex discrepancies arise when the same conceptual territory is perceived as being populated, or partitioned, in different ways. The concept of "job" might be common to several spheres, yet each sphere has a different notion of what the specific jobs are. One sphere might have engineer, secretary, and salesman as jobs, while the jobs in another might include technician, designer, engineer, secretary, administrative assistant, and customer representative. The same thing might arise with the sets of skills one might possess, or with ethnic groupings, or with organizational units within different business entities (projects, departments, sections, labs, divisions, groups, operations, sectors, etc.).

Other examples might include different palettes of colors covering the same spectrum, different grading systems at different schools, different rating systems for restaurants (or for movies, or for hotels, etc.), terms in different languages for the same or similar concepts, different kinds of geographic units (counties vs. postal codes vs. voting districts vs. assessment districts, etc.). Another kind of mismatch arises if things are represented in one sphere as character strings but in another as persistent objects.

The domain mismatch problem can generally be decomposed into two aspects:

A typical domain mismatch problem requires presentation of information maintained in different spheres s_i in some globally unified form in an integrating sphere s*. For example, corporate headquarters (the integrating sphere) may wish to see the starting salaries for all jobs. Different divisions (the local spheres s_i) may have different definitions of jobs, different algorithms for defining starting salaries, and different currencies in which they are expressed. Thus there is a global integrator function

f*: ad* -> rd*

which provides a global view of a function (e.g., starting salaries) from an argument domain ad* (e.g., the job set defined at the corporate level) to a result domain rd* (e.g., the common currency used at the corporate level).

The integrator function in turn has to be mapped to a set of localized functions

f_i: ad_i -> rd_i

which provide the corresponding information in each local sphere s_i.

The domain mapping aspect concerns the establishment and maintenance of correspondences between the domains ad* and ad_i (the global and local sets of jobs), and also between the domains rd* and rd_i (the global and local currencies). The integrator aspect concerns the correspondence between the global function f* and the local functions f_i. This includes such problems as how to identify the relevant functions, how to convert them to a uniform semantics (e.g., uniform algorithm for computing starting salaries), and how to reconcile results from several functions (if different divisions report different starting salaries for the same job).

7.1.1 Domain Mappings

A domain mapping µ_D_ij translates from elements of d_i to elements of d_j in a domain group D. We are mostly interested in mappings to and from an integrating domain µ_D_i* and µ_D_*i.

An ideal domain mapping is an invertible computation on a stable population of literal data values, as with currency or units conversions. Real mappings may not be.

The populations may not be stable, requiring the mapping to be updated. Jobs may be added or dropped in various divisions. New letter grades might be adopted at certain schools, requiring an update of the mapping to numeric grades. New colors might be introduced.

The mapping may not be invertible, perhaps being many-to-one. Numeric grades map unambiguously into letter grades, but not vice versa. Lead Programmer might be a distinct job in some divisions, which the corporate level chooses not to distinguish from Programmer; both jobs map into the same one at the corporate level.

Non-literal values, e.g., persistent objects, require object existences to be managed. If corporate has an object-oriented database, then the appearance of a new job name in a relational database at one of the divisions may require explicit creation of a new job object in the corporate database.

The easiest solution to implement puts the burden of responsibility on users, requiring them to manually maintain the domains and mappings by appropriately creating and deleting objects, and by modifying mapping rules or data. In this case, when a mapping encounters an unfamiliar value it simply returns an error. The complexity of the problem still requires this solution as a fall-back for the general case.

These problems are explored in detail in [K2]. An example of possible automation arises when the correspondence between jobs can be established on the basis of their names. The mapping from job names to job objects, with automatic creation of job objects as needed, might be achieved as follows in a database programming language (in the style of the Iris IPL):

CREATE FUNCTION MapName2Job(Char n) -> Job AS
BEGIN
VAR j;
j := SELECT Job jj WHERE Name(jj)=n;
IF IsNull(j) THEN j := CREATE Job;
Name(j) := n;
RETURN j;
END;

7.1.2 Integrator Functions

An ideal integrator function

f*: ad* -> rd*

would have a simple mechanism for evaluating y*=f*(x*):

Real integrators aren't always so simple. The domain mappings might not be well behaved, as described earlier. Among other things, a given x* might map into several x_i's in several domains d_i, e.g., if a given job exists in several divisions.

The corresponding functions f_i might not be so readily identifiable. They might not be identifiable by naming conventions, having a variety of names, perhaps in several languages. They might not exist at all, in which case some sort of default action needs to be specified. It might be possible to fabricate a missing localized function f_i from other information. One division might have a range of starting salaries depending on prior experience, for which the average should be reported for this inquiry. Another division might specify starting salaries indirectly, via job categories.

When x* maps into several x_i's in several domains d_i, then the results of several local functions f_i need to be reconciled by some algorithm in f*. This might be a sum or average of values, or a list of all of them, perhaps annotated with some information about the sources.

In some cases, operator overloading can be used to automatically implement the integrator function. Integrators and overloaded operators both require choosing from a set of other functions to be executed. The simplest such case arises when the argument domains ad_i correspond to disjoint types, and all the functions f* and f_i have the same name.

Assume we have disjoint sets of employees AEmployee...ZEmployee, perhaps in different divisions of the company, and there is a Salary function on each, all returning salaries in US dollars.

In order to access the salaries of all employees, all we have to do is define a supertype spanning all the employees [DH], with a Salary function defined on it (Figure 8):

\begin{figure}[htbp] \begin{center} \verb Employee \\ \verb | Salary \\ \verb | \\ \verb ---------------------- \\ \verb | | \\ \verbAEmployee ... ZEmployee \\ \verb Salary Salary\\ \end{center} \vspace*{.1in} \caption{\label{simp2}} \end{figure}

Figure 8. 

CREATE TYPE Employee SUPERTYPE OF AEmployee ... ZEmployee;

CREATE FUNCTION Salary(Employee) -> Number AS 0;

The Salary function on Employee is defined to provide a default value, in case the Employee supertype is instantiable. Via late binding and overload resolution [Ly], an invocation of Salary(x) will bind to the Salary function defined on whichever subtype x belongs to.

If the salary functions don't all have the same name, overload resolution could be extended with a simple aliasing mechanism to allow them to behave as though they had the same name.

Again, the complete solution requires a general-purpose database programming language, as illustrated in [K2]. Updating via integrator functions is also discussed there.

7.2 Schema Mismatch

Schema mismatch arises when similar concepts are expressed differently in the schema. A common byproduct is that data instances in one sphere correspond to schema elements in another. Depending on the model, the schema elements might be such thing as relations and attributes, entities and relationships, classes and methods, types and functions, etc. Our work is expressed in terms of the types and functions in the Iris functional object model [F1,F2].

Many schema mismatch problems are really domain mismatch problems, except that some of the domains are in the schema instead of in the data. Jobs, for example, are often modeled as types, i.e., subtypes of Employee. Instead of AsgJob(Sam)=Engineer, we know that Sam is an engineer because he is an instance of the type, i.e., Engineer(Sam) is true.

For this sort of mismatch, one of the domains is a set of types (a type group), i.e., those subtypes of Employee which correspond to jobs. (Not all subtypes of Employee qualify. There may be other subtypes corresponding to Male and Female, or Exempt and Nonexempt, or Active and Retired, etc.)

An example involving functions is adapted from the stock market examples of [KL]. We have a sphere StockSphere_1 containing a base stock market Activity function on three arguments

Activity: Company x Reading x Date -> Price

whose extension at the moment is shown in Figure 9.

\begin{figure}[htbp] \begin{center} \begin{tt} \begin{tabular}{|l|l|l||c|} \multicolumn{4}{l}{Activity}\\ \hline Company&Reading& Date & Price \\ \hline hp & close & 1/3/91 & 50 \\ hp & close & 1/4/91 & 51 \\ hp & high & 1/3/91 & 52 \\ hp & high & 1/4/91 & 53 \\ ibm & close & 1/3/91 & 52 \\ ibm & close & 1/4/91 & 51 \\ ibm & high & 1/3/91 & 55 \\ ibm & high & 1/4/91 & 54 \\ \hline \end{tabular} \end{tt} \end{center} \vspace*{.1in} \caption{StockSphere_1.\label{aa}} \end{figure}

Figure 9. StockSphere_1.

Another sphere StockSphere_2 might maintain the same data in separate functions for each company, such as

HPActivity: Reading x Date -> Price
IBMActivity: Reading x Date -> Price
.
.
.

The corresponding extensions of two of these functions are shown in Figure 10.

\begin{figure}[htbp] \begin{center} \begin{tt} \begin{tabular}{|l|l||c|c@{\ \ \ }|l|l||c|} \multicolumn{4}{l}{HPActivity} & \multicolumn{3}{l}{IBMActivity}\\ \cline{1-3} \cline{5-7} Reading & Date & Price & & Reading & Date & Price \\ \cline{1-3} \cline{5-7} close & 1/3/91 & 50 & & close & 1/3/91 & 52 \\ close & 1/4/91 & 51 & & close & 1/4/91 & 51 \\ high & 1/3/91 & 52 & & high & 1/3/91 & 55 \\ high & 1/4/91 & 53 & & high & 1/4/91 & 54 \\ \cline{1-3} \cline{5-7} \end{tabular} \end{tt} \end{center} \vspace*{.1in} \caption{A function group in StockSphere_2.\label{bb}} \end{figure}

Figure 10. A function group in StockSphere_2.

In StockSphere_1 the domain of interest is a set of Company instances in the data. In StockSphere_2 the corresponding domain is a set of functions, i.e., a function group.

Domain mappings involving type groups and function groups are described in [K2].

7.3 Summary

[This could be shortened or eliminated if we want to conserve space.]

Domain mismatch and schema mismatch are complex problems. They can best be understood by structuring the environment in terms of domain groups corresponding to conceptual territories, with different domains occurring in different spheres. Integration then occurs using integrating domains in an integrating sphere. The domain mismatch problem separates into two parts, the definition and maintenance of domain mappings, and the definition and update of integrator functions. Schema mismatch can in many cases be reduced to the domain mismatch problem by treating type groups and function groups as domains in themselves. The problems can generally be decomposed into a mapping aspect (corresponding to the mappings in [KL]) and an integrating aspect (corresponding to the rules in [KL]).

Although such analysis and decomposition is helpful, the solutions generally require sophisticated language capabilities. The role of a database programming language is to permit the solutions to be expressed and maintained with the database, rather than in application code.

Thus behavior specification is an essential contribution of object-orientation to the solution of the mismatch problem. Subtypes and supertypes are another essential feature for reconciling disparate domains. Overloaded operators are also useful. Object identity, on the other hand, seems to add problems: maintaining domain mappings can require explicit creation or deletion of persistent objects.

In a context which includes type systems, persistent objects, and non-trivial correspondences between domains, desirable language facilities include:

OSQL and IPL are still evolving, and we have not completed our analysis of the extent to which they currently support these requirements. This will be continued in the Pegasus project.

8 OBJECT IDENTIFICATION

Object identification in single database systems is relatively simple. Most single database management systems have developed and adopted a workable approach. However, object identification in a heterogeneous multidatabase management systems is difficult because [Ne]:

These problems are being investigated for implementation in the second phase of the Pegasus project, which will begin after the completion of the first Pegasus prototype scheduled for the summer of 1991. Correctness criteria to be used in evaluating candidate solutions are:

The solution approaches being pursued include:

9 CURRENT STATE AND FUTURE DIRECTIONS

The design and implementation of a comprehensive flexible heterogeneous multidatabase system such as Pegasus is a significant undertaking. Such a system has to meet a large set of diverse applications requirements, should interoperate with existing information systems, and should function in heterogeneous distributed computer systems environments. It is important to have extensibility in mind from the beginning and develop a modular system architecture which lends itself to future extension.

The development of the Pegasus multidatabase management system will proceed in phases. In the first phase we are developing a prototype with the following functionality:

The performance of data retrieval in this prototype should exceed the performance requirements of multidatabase applications which we intend to use as testbeds. This prototype will not support multidatabase update. Moreover, the prototype will not be a DBMS, i.e., all the data will reside in the participating databases. Only relational database management systems and Iris will be supported as participating databases. This prototype is due for completion by the summer of 1991.

In the second prototype we will support non-traditional foreign data systems such as HP NewWave. Multidatabase query optimization, transaction management, update, multidatabase object identification schemes, and extensive data integration facilities will be provided in this phase.

10 CONCLUSIONS

There is consensus that flexible, efficient, and comprehensive heterogeneous multidatabases are needed to support the trend toward the extensive use of computers and information as a competitive tool in today's complex business world. However, the design and implementation of such a system is a major undertaking. There are several problems which must be solved before robust general purpose heterogeneous multidatabase management systems become possible. Distinguishing equal but logically different objects, consolidating different representations of the same object, materializing the views of existing applications, resolving the semantic heterogeneity of information stored in multiple databases, maintaining consistency of data in the presence of multidatabase concurrent transactions, and doing all this efficiently, are examples of problems that must be solved.

The Pegasus project at the Database Technology Department in Hewlett-Packard Laboratories has created an environment within which the problems in developing heterogeneous multidatabase management systems are being addressed.

The Pegasus system described in this paper is an integration of some of the advances that we have made so far. As the Pegasus project continues, the results and solutions which are being developed will be considered for inclusion in Hewlett-Packard's DBMS products.

11 REFERENCES

[AD] R. Ahmed, P. DeSmedt, W. Kent, M. Ketabchi, W. Litwin, A. Rafii, M. C. Shan, "Pegasus: A System for Seamless Integration of Heterogeneous Information Sources", Proc. IEEE COMPCON, March 1991, San Francisco, Calif.

[AR] Rafi Ahmed and Abbas Rafii, "Relational Schema Mapping and Query Translation in Pegasus", Workshop on Multidatabases and Semantic Interoperability, Nov. 2-4, 1990, Tulsa OK. Also HPL-DTD-90-4, Hewlett-Packard Laboratories, Oct. 9, 1990.

[An] Jurgen Annevelink, "Database Programming Languages: A Functional Approach", HPL-DTD-90-12, Hewlett-Packard Laboratories, Nov. 30, 1990.

[Ba] J. H. Balboni, "SQL Access Overview," Proceedings of CompCon91, Feb. 91, San Francisco.

[BL] C. Batini, M. Lenzerini, and S.B. Navathe, "A Comparative Analysis of Methodologies for Database Schema Integration", ACM Computing Surveys 18(4), Dec. 1986.

[BG] P. Bernstein, E. Goodman, E. Wong and J. Rothnie, "Query Processing in a System for Distributed Databases (SSD-1)", ACM Trans. on Database Systems, December 1981.

[BO] Breitbart, Olson and Thompson, "Database Integration in a Distributed Heterogeneous Database System", Data Engineering Conference, 1986.

[BS] Y. Breitbart, A. Silberschatz, G.R. Thompson, "Reliable Transaction Management in a Multidatabase System", Proc ACM SIGMOD, May 1990.

[Ch] Chin-Wan Chung, "DATAPLEX: An Access to Heterogeneous Distributed Databases", CACM, Vol. 33, No. 1, January 1990.

[DH] Umeshwar Dayal and Hai-Yann Hwang, "View Definition and Generalization for Database Integration in a Multidatabase System", IEE Trans. Software Engrg, Vol SE-10 No 6, Nov. 1984.

[ED] A.K. Elmagarmid and W. Du, "A Paradigm for Concurrency Control in Heterogeneous Distributed Databases Systems", Proceedings of IEEE Data Engineering, 1990.

[F1] D.H. Fishman et al, "Iris: An Object-Oriented Database Management System", ACM Transactions on Office Information Systems, 5(1), January 1987. Also in Readings in Object-Oriented Database Systems, Zdonik and Maier, editors, Morgan Kaufmann, San Mateo, California, 1989.

[F2] D.H. Fishman, et al, "Overview of the Iris DBMS", Object-Oriented Concepts, Databases, and Applications, Kim and Lochovsky, editors, Addison-Wesley, 1989.

[Fu] I.J. Fuller, "An Overview of the HP NewWave Environment", HP Journal, Vol. 40, No. 4, August 1989.

[HM] D. Heimbigner, D. McLeod, "A Federated Architecture for Information Management", ACM Trans. on Office Information Systems, Vol 3, No. 3, July 1985.

[HK] D.K. Hsiao and M.N. Kamel, "Heterogeneous Databases: Proliferation, Issues, and Solutions", IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 1, March 1989.

[TP] ISO/IEC DIS 10026-2, OSI, "Distributed Transaction Processing".

[KD] M. Kaul, K. Drosten, and E.J. Neuhold, "ViewSystem: Integrating Heterogeneous Information Bases by Object-Oriented Views", Proceedings of IEEE Data Engineering, 1990.

[K1] William Kent, "The Many Forms of a Single Fact", Proc. IEEE COMPCON, Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard Laboratories, Oct. 21, 1988. [html]

[K2] William Kent, "Solving Domain Mismatch and Schema Mismatch Problems with an Object-Oriented Database Programming Language", HPL-91-25 Hewlett-Packard Laboratories, March 1991. [Submitted for publication.] [pdf]

[KL] Ravi Krishnamurthy, Witold Litwin and William Kent, "Language Features for Interoperability of Databases with Schematic Discrepancies", Proc ACM SIGMOD Int'l Conf on Mgmt of Data, Denver, Colorado, May 29-31 1991. Also HPL-DTD-90-14, Hewlett-Packard Laboratories, Dec. 17, 1990.

[LR] T.A. Landers, R.L. Rosenberg, "An Overview of Multibase- A Heterogeneous Database System", North-Holland, 1982.

[L1] W. Litwin, "An Overview of the Multidatabase System: MRDSM", ACM Conference, 1985.

[L2] W. Litwin, A. Abdellatif, "Multidatabase Interoperability", IEEE Computer, Dec. 1986.

[L3] Litwin, Abdellatif, Zeroual and Nicolas, "MSQL: A Multidatabase Language", Elsevier Science Publishing, 1989.

[L4] W. Litwin, L. Mark, N. Roussopoulos, "Interoperability of Multiple Autonomous Databases", ACM Computing Surveys, Vol. 22, No. 3, Sept. 1990.

[L5] Witold Litwin and Ming-Chien Shan, "Value Dates for Concurrency Control and Transaction Management in Interoperable Systems", The 1st international Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April 1991.

[Ly] Peter Lyngbaek, "OSQL: A Language for Object Databases", HPL-DTD-91-14, Hewlett-Packard Laboratories, Dec. 17, 1990.

[Mo] Amihai Motro, "Superviews, Virtual Integration of Multiple Databases", IEEE Transactions of Software Engineering, Vol. SE-13, No. 7, July 1987.

[Ne] Erich J. Neuhold, William Kent, and Ming-Chien Shan, "Object Identification in Interoperable Systems", The 1st international Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April 1991.

[PP] Pegasus Project, "Pegasus: An Interoperable Heterogeneous Information Management System", [Submitted for publication.]

[R1] A. Rafii, et al, "Overview of Multidatabase Management in Pegasus," Technical Report HPL-DTD-90-2, Sep 1990.

[R2] A. Rafii, R. Ahmed, P. DeSmedt, W. Kent, M. Ketabchi, W. Litwin, M. Shan, "Multidatabase Management in Pegasus," The 1st international Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April 1991.

[Sa] H. S. Sabharwal, "SQL Access and ISO/RDA," Proceedings of CompCon91, Feb. 91, San Francisco.

[Sh] M.C. Shan, "Unified Access in a Heterogeneous Information Environment," Office Knowledge Engineering, August 1989.

[S2] D. Shipman, "The Functional Data Model and the Data Language DAPLEX", ACM Transactions on Database Systems 6:1, 1981.

[Sm] Smith, et al, "Multibase, Integrating Heterogeneous Distributed Database Systems", Proc. of AFIPS, 1981.

[TP] Tandem Performance Group, "A Benchmark of NonStop SQL on the Debit Credit Transaction," Proc. 1988 ACM-SIGMOD Conference on Management of Data, Chicago, Ill, June 1988.

[Te] Templeton, et al, "Mermaid- A Front-End to Distributed Heterogeneous Databases", Proc. IEEE, Vol 75, No. 5, May 1987.

[XO] X/OPEN group, "X/OPEN Portability Guide, Data Management", Elsevier Science Publishing, 1987.