Real Time Operating Systems

What Goes Around Comes Around

Michael Stonebraker Joseph M. Hellerstein

Abstract

This paper provides a summary of 35 years of data model proposals, grouped into 9 different eras. We discuss the proposals of each era, and show that there are only a few basic data modeling ideas, and most have been around a long time. Later proposals inevitably bear a strong resemblance to certain earlier proposals. Hence, it is a worthwhile exercise to study previous proposals.

In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.

Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.

I Introduction

Data model proposals have been around since the late 1960’s, when the first author “came on the scene”. Proposals have continued with surprising regularity for the intervening 35 years. Moreover, many of the current day proposals have come from researchers too young to have learned from the discussion of earlier ones. Hence, the purpose of this paper is to summarize 35 years worth of “progress” and point out what should be learned from this lengthy exercise.

We present data model proposals in nine historical epochs:

Hierarchical (IMS): late 1960’s and 1970’s Network (CODASYL): 1970’s Relational: 1970’s and early 1980’s Entity-Relationship: 1970’s Extended Relational: 1980’s Semantic: late 1970’s and 1980’s Object-oriented: late 1980’s and early 1990’s Object-relational: late 1980’s and early 1990’s

Semi-structured (XML): late 1990’s to the present

In each case, we discuss the data model and associated query language, using a neutral notation. Hence, we will spare the reader the idiosyncratic details of the various proposals. We will also attempt to use a uniform collection of terms, again in an attempt to limit the confusion that might otherwise occur.

Throughout much of the paper, we will use the standard example of suppliers and parts, from [CODD70], which we write for now in relational form in Figure 1.

Supplier (sno, sname, scity, sstate) Part (pno, pname, psize, pcolor) Supply (sno, pno, qty, price)

A Relational Schema Figure 1

Here we have Supplier information, Part information and the Supply relationship to indicate the terms under which a supplier can supply a part.

II IMS Era

IMS was released around 1968, and initially had a hierarchical data model. It understood the notion of a record type, which is a collection of named fields with their associated data types. Each instance of a record type is forced to obey the data description indicated in the definition of the record type. Furthermore, some subset of the named fields must uniquely specify a record instance, i.e. they are required to be a key. Lastly, the record types must be arranged in a tree, such that each record type (other than the root) has a unique parent record type. An IMS data base is a collection of instances of record types, such that each instance, other than root instances, has a single parent of the correct record type.

This requirement of tree-structured data presents a challenge for our sample data, because we are forced to structure it in one of the two ways indicated in Figure 2. These representations share two common undesirable properties:

1) Information is repeated. In the first schema, Part information is repeated for each Supplier who supplies the part. In the second schema, Supplier information is repeated for each part he supplies. Repeated information is undesirable, because it offers the possibility for inconsistent data. For example, a repeated data element could be changed in some, but not all, of the places it appears, leading to an inconsistent data base.

2) Existence depends on parents. In the first schema it is impossible for there to be a part that is not currently supplied by anybody. In the second schema, it is impossible to have a supplier which does not currently supply anything. There is no support for these “corner cases” in a strict hierarchy.

What Goes Around Comes Around 3

Two Hierarchical Organizations Figure 2

IMS chose a hierarchical data base because it facilitates a simple data manipulation language, DL/1. Every record in an IMS data base has a hierarchical sequence key (HSK). Basically, an HSK is derived by concatenating the keys of ancestor records, and then adding the key of the current record. HSK defines a natural order of all records in an IMS data base, basically depth-first, left-to-right. DL/1 intimately used HSK order for the semantics of commands. For example, the “get next” command returns the next record in HSK order. Another use of HSK order is the “get next within parent” command, which explores the subtree underneath a given record in HSK order.

Using the first schema, one can find all the red parts supplied by Supplier 16 as:

Get unique Supplier (sno = 16) Until failure do

Get next within parent (color = red) Enddo

The first command finds Supplier 16. Then we iterate through the subtree underneath this record in HSK order, looking for red parts. When the subtree is exhausted, an error is returned.

Notice that DL/1 is a “record-at-a-time” language, whereby the programmer constructs an algorithm for solving his query, and then IMS executes this algorithm. Often there are multiple ways to solve a query. Here is another way to solve the above specification:

Supplier (sno, sname, scity, sstate)

Part (pno, pname, psize, pcolor, qty, price)

Part (pno, pname, psize, pcolor)

Supplier (sno, sname, scity, sstate, qty, price)

4 Chapter 1: Data Models and DBMS Architecture

Until failure do Get next Part (color = red) Enddo

Although one might think that the second solution is clearly inferior to the first one; in fact if there is only one supplier in the data base (number 16), the second solution will outperform the first. The DL/1 programmer must make such optimization tradeoffs.

IMS supported four different storage formats for hierarchical data. Basically root records can either be:

Stored sequentially Indexed in a B-tree using the key of the record Hashed using the key of the record

Dependent records are found from the root using either

Physical sequentially Various forms of pointers.

Some of the storage organizations impose restrictions on DL/1 commands. For example the purely sequential organization will not support record inserts. Hence, it is appropriate only for batch processing environments in which a change list is sorted in HSK order and then a single pass of the data base is made, the changes inserted in the correct place, and a new data base written. This is usually referred to as “old-master-new-master” processing. In addition, the storage organization that hashes root records on a key cannot support “get next”, because it has no easy way to return hashed records in HSK order.

These various “quirks” in IMS are designed to avoid operations that would have impossibly bad performance. However, this decision comes at a price: One cannot freely change IMS storage organizations to tune a data base application because there is no guarantee that the DL/1 programs will continue to run.

The ability of a data base application to continue to run, regardless of what tuning is performed at the physical level will be called physical data independence. Physical data independence is important because a DBMS application is not typically written all at once. As new programs are added to an application, the tuning demands may change, and better DBMS performance could be achieved by changing the storage organization. IMS has chosen to limit the amount of physical data independence that is possible.

In addition, the logical requirements of an application may change over time. New record types may be added, because of new business requirements or because of new government requirements. It may also be desirable to move certain data elements from one record type to another. IMS supports a certain level of logical data independence, because DL/1 is actually defined on a logical data base, not on the actual physical data base that is stored. Hence, a DL/1 program can be written initially by defining the logical

What Goes Around Comes Around 5

data base to be exactly same as the physical data base. Later, record types can be added to the physical data base, and the logical data base redefined to exclude them. Hence, an IMS data base can grow with new record types, and the initial DL/1 program will continue to operate correctly. In general, an IMS logical data base can be a subtree of a physical data base.

It is an excellent idea to have the programmer interact with a logical abstraction of the data, because this allows the physical organization to change, without compromising the runability of DL/1 programs. Logical and physical data independence are important because DBMS application have a much longer lifetime (often a quarter century or more) than the data on which they operate. Data independence will allow the data to change without requiring costly program maintenance.

One last point should be made about IMS. Clearly, our sample data is not amenable to a tree structured representation as noted earlier. Hence, there was quickly pressure on IMS to represent our sample data without the redundancy or dependencies mentioned above. IMS responded by extending the notion of logical data bases beyond what was just described.

Two IMS Physical Data Bases Figure 3

Suppose one constructs two physical data bases, one containing only Part information and the second containing Supplier and Supply information as shown in the diagram of Figure 3. Of course, DL/1 programs are defined on trees; hence they cannot be used directly on the structures of Figure 3. Instead, IMS allowed the definition of the logical data base shown in Figure 4. Here, the Supply and Part record types from two different data bases are “fused” (joined) on the common value of part number into the hierarchical structure shown.

Supplier (sno, sname, scity, sstate)

Supply (pno, qty, price)

Part (pno, pname, psize, pcolor)

6 Chapter 1: Data Models and DBMS Architecture

Basically, the structure of Figure 3 is actually stored, and one can note that there is no redundancy and no bad existence dependencies in this structure. The programmer is presented with the hierarchical view shown in Figure 4, which supports standard DL/1 programs.

An IMS Logical Data Base Figure 4

Speaking generally, IMS allow two different tree-structured physical data bases to be “grafted” together into a logical data base. There are many restrictions (for example in the use of the delete command) and considerable complexity to this use of logical data bases, but it is a way to represent non-tree structured data in IMS.

The complexity of these logical data bases will be presently seen to be pivotial in determining how IBM decided to support relational data bases a decade later.

We will summarize the lessons learned so far, and then turn to the CODASYL proposal.

Lesson 1: Physical and logical data independence are highly desirable

Lesson 2: Tree structured data models are very restrictive

Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data

Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard.

Supplier (sno, sname, scity, sstate)

Supply(pno, qty, price)

Part (pno, pname, psize, pcolor)

What Goes Around Comes Around 7

III CODASYL Era

In 1969 the CODASYL (Committee on Data Systems Languages) committee released their first report [CODA69], and then followed in 1971 [CODA71] and 1973 [CODA73] with language specifications. CODASYL was an ad-hoc committee that championed a network data model along with a record-at-a-time data manipulation language.

This model organized a collection of record types, each with keys, into a network, rather than a tree. Hence, a given record instance could have multiple parents, rather than a single one, as in IMS. As a result, our Supplier-Parts-Supply example could be represented by the CODASYL network of Figure 5.

Supplies Supplied_by

A CODASYL Network Figure 5

Here, we notice three record types arranged in a network, connected by two named arcs, called Supplies and Supplied_by. A named arc is called a set in CODASYL, though it is not technically a set at all. Rather it indicates that for each record instance of the owner record type (the tail of the arrow) there is a relationship with zero or more record instances of the child record type (the head of the arrow). As such, it is a 1-to-n relationship between owner record instances and child record instances.

A CODASYL network is a collection of named record types and named set types that form a connected graph. Moreover, there must be at least one entry point (a record type that is not a child in any set). A CODASYL data base is a collection of record instances and set instances that obey this network description.

Supplier (sno, sname, scity, sstate)

Supply(qty, price)

Part (pno, pname, psize, pcolor)

8 Chapter 1: Data Models and DBMS Architecture

Notice that Figure 5 does not have the existence dependencies present in a hierarchical data model. For example, it is ok to have a part that is not supplied by anybody. This will merely be an empty instance of the Supplied_by set. Hence, the move to a network data model solves many of the restrictions of a hierarchy. However, there are still situations that are hard to model in CODASYL. Consider, for example, data about a marriage ceremony, which is a 3-way relationship between a bride, a groom, and a minister. Because CODASYL sets are only two-way relationships, one is forced into the data model indicated in Figure 6.

Participates-1 Participates-2

Participates-3

A CODASYL Solution Figure 6

This solution requires three binary sets to express a three-way relationship, and is somewhat unnatural. Although much more flexible than IMS, the CODASYL data model still had limitations.

The CODASYL data manipulation language is a record-at-a-time language whereby one enters the data base at an entry point and then navigates to desired data by following sets. To find the red parts supplied by Supplier 16 in CODASYL, one can use the following code:

Bride

Ceremony

Groom

Minister

What Goes Around Comes Around 9

Find Supplier (SNO = 16) Until no-more { Find next Supply record in Supplies

Find owner Part record in Supplied_by Get current record -check for red—

}

One enters the data base at supplier 16, and then iterates over the members of the Supplies set. This will yield a collection of Supply records. For each one, the owner in the Supplied_by set is identified, and a check for redness performed.

The CODASYL proposal suggested that the records in each entry point be hashed on the key in the record. Several implementations of sets were proposed that entailed various combinations of pointers between the parent records and child records.

The CODASYL proposal provided essentially no physical data independence. For example, the above program fails if the key (and hence the hash storage) of the Supplier record is changed from sno to something else. In addition, no logical data independence is provided, since the schema cannot change without affecting application programs.

The move to a network model has the advantage that no kludges are required to implement graph-structured data, such as our example. However, the CODASYL model is considerably more complex than the IMS data model. In IMS a programmer navigates in a hierarchical space, while a CODASYL programmer navigates in a multi-dimensional hyperspace. In IMS the programmer must only worry about his current position in the data base, and the position of a single ancestor (if he is doing a “get next within parent”).

In contrast, a CODASYL programmer must keep track of the:

The last record touched by the application The last record of each record type touched The last record of each set type touched

The various CODASYL DML commands update these currency indicators. Hence, one can think of CODASYL programming as moving these currency indicators around a CODASYL data base until a record of interest is located. Then, it can be fetched. In addition, the CODASYL programmer can suppress currency movement if he desires. Hence, one way to think of a CODASYL programmer is that he should program looking at a wall map of the CODASYL network that is decorated with various colored pins indicating currency. In his 1973 Turing Award lecture, Charlie Bachmann called this “navigating in hyperspace” [BACH73].

10 Chapter 1: Data Models and DBMS Architecture

Hence, the CODASYL proposal trades increased complexity for the possibility of easily representing non-hierarchical data. CODASYL offers poorer logical and physical data independence than IMS.

There are also some more subtle issues with CODASYL. For example, in IMS each data base could be independently bulk-loaded from an external data source. However, in CODASYL, all the data was typically in one large network. This much larger object had to be bulk-loaded all at once, leading to very long load times. Also, if a CODASYL data base became corrupted, it was necessary to reload all of it from a dump. Hence, crash recovery tended to be more involved than if the data was divided into a collection of independent data bases.

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
$26
The price is based on these factors:
Academic level
Number of pages
Urgency
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Benefits of our college essay writing service

  • 80+ disciplines

    Buy an essay in any subject you find difficult—we’ll have a specialist in it ready

  • 4-hour deadlines

    Ask for help with your most urgent short tasks—we can complete them in 4 hours!

  • Free revision

    Get your paper revised for free if it doesn’t meet your instructions.

  • 24/7 support

    Contact us anytime if you need help with your essay

  • Custom formatting

    APA, MLA, Chicago—we can use any formatting style you need.

  • Plagiarism check

    Get a paper that’s fully original and checked for plagiarism

What the numbers say?

  • 527
    writers active
  • 9.5 out of 10
    current average quality score
  • 98.40%
    of orders delivered on time
error: