OED on CD-ROM



The history of the Oxford English Dictionary (continued)



The New Oxford English Dictionary project



Data structure


Once firm plans had been made, it was intended that the conversion of the text into
electronic form should begin as soon as possible. Preparations at ICC were by now well
advanced. But for data capture to begin, a system for structuring the text had to be
agreed upon. It was resolved that the tagging language inserted into the electronic
version should do more than simply express the typographical features - layout,
typeface, type size, font - of the printed text. It must, as its primary function,
identify the structural elements which combine to form a dictionary entry. This was a
prerequisite both for the development of the database in the future, and, as it turned
out, for the automatic processes applied to the text in the course of integration.

Several months were devoted to the analysis of the structure of the OED and its
Supplement, resulting in an inventory of the most important structural elements
(amounting to between forty and fifty) and their current typographical realizations. The
translation of this scheme into a system of tags, though not without its difficulties,
was straightforward compared to the immense task of ensuring that each element of
Dictionary text was supplied with the correct tag. It emerged from discussions with ICC
that a tagging scheme of such size and complexity would be very hard to insert
accurately into the text at the stage of initial data capture. It would require so much
knowledge that the training of keyboarders would be very long and the typing very slow.
It would also require extensive pre-editing of the text, which again would take an
excessively long time and require much training. On the other hand, a more modest scheme
would be manageable. Accordingly, a compromise mark-up scheme was devised. The fifteen
or so most prominent textual elements received tags with structural meaning, while all
other features of the text were coded with tags that had a conventional typographical
meaning. Further coding was deferred to a later stage. Even with this scheme, ICC found
it necessary to carry out a considerable amount of preliminary mark-up, conduct lengthy
training sessions, and undertake several proof-reading cycles, before the text was ready
to be shipped to Oxford.
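
The compromise can be pictured with a minimal sketch. The tag names below are hypothetical, invented purely for illustration, and are not the codes actually used by ICC. A handful of prominent elements carry tags that name their function, while everything else carries a tag that records only its typeface, leaving its structural coding for a later stage.

```python
# Hypothetical tag names for illustration only; not the actual ICC codes.
STRUCTURAL_TAGS = {"hw", "pos", "sense", "quot"}   # say what an element is
TYPOGRAPHIC_TAGS = {"bold", "ital", "smcap"}       # say only how it looks


def classify(tag: str) -> str:
    """Report whether a tag carries structural or typographical meaning."""
    if tag in STRUCTURAL_TAGS:
        return "structural"
    if tag in TYPOGRAPHIC_TAGS:
        return "typographical"
    raise ValueError(f"unknown tag: {tag}")


# A fragment of an entry as (tag, text) pairs under the mixed scheme.
fragment = [
    ("hw", "dictionary"),   # headword: tagged by function
    ("pos", "n."),          # part of speech: tagged by function
    ("ital", "Obs."),       # a label: tagged only as italic for now
]


def deferred(pairs):
    """Return the texts whose structural coding is still to be done."""
    return [text for tag, text in pairs if classify(tag) == "typographical"]
```

In this sketch, deferred(fragment) picks out only the italic label, which is the kind of element whose further coding was left to a later stage.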

On 15 May 1984, at a press conference in the premises of the Royal Society in London, a
formal announcement of the launching of the New Oxford English Dictionary Project was
made, including the news that IBM UK Ltd. would be making a substantial donation to the
first phase of the project. Meanwhile, work on the means of carrying out the integration
of the text was continuing in collaboration with IBM. Matters needing development were
identified as: the database management system for holding and protecting the electronic
text, the software tool by which it might be edited, and a means of correcting
cross-references affected by integration. There was also the problem of enhancing the
system of tagging introduced by ICC so that it should be an entirely 'generalized'
mark-up language, that is to say, one having structural, not typographical
signification. At first this planning was conducted by means of a regular meeting
between staff from OUP and IBM, but at length, in mid-July, the first secondee from IBM
arrived at OUP as the project's computer group manager, and began to build up his team.
From then on, the main instrument by which progress was monitored and problems were
identified was a formal system of meetings, some at half-yearly and monthly intervals,
at which representatives of the management of IBM were present, others occurring weekly
and dealing with the minutiae of the project team's work.

During the following autumn the project gathered momentum. In September the University
of Waterloo was granted Canadian Government funding with which to establish a Centre for
the New OED as a focus for database research, from the point of view of both the
academic user and the computer scientist. Early sketches of a potential database
structure had already been made, and, more importantly, the project had attracted the
interest of several researchers who might be able to provide parsing software which
would facilitate the enhancement of the mark-up language. After some months of
experimentation at the University of Waterloo, work was begun on this part of the system
by the project's computer group, a vital contribution at the start being made by a
secondee from Waterloo.

Also in September 1984, ICC sent to Oxford test data consisting of 100 pages of
Dictionary text on magnetic tape. This not only proved the feasibility of the scheme for
data capture but also made it possible to try out methods of proof-reading.

In October the project team drew up a formal Statement of User Requirements, which set
out the aims of the first phase and the operations which the computer system would be
required to perform. This gave the computer group a basis on which to develop their
detailed design of the system, an activity which occupied their attention over the two
succeeding years. An Editorial Board was constituted, consisting of about forty scholars
in a wide range of disciplines, the idea being that they should give advice to the
project team especially when the revision, updating, and enhancement of the dictionary
were planned.


Data capture


At the beginning of November 1984 the computer equipment from IBM was installed. At the
same time, ICC began data capture in earnest. A team of ICC copy editors, based in Fort
Washington, Pennsylvania, began to insert structural mark-up on enlarged copies of the
Dictionary pages. These were passed to the data conversion personnel (both on the same
site and in Tampa, Florida) for keyboarding. Data-validation routines and sample
proof-reading were carried out by ICC before the proofs were shipped to Oxford. It was
stipulated that the rate of errors should be no more than 7 in 10,000 keystrokes; and
this requirement was met.

The first batch of magnetic tapes and proofs arrived in January 1985, and proof-reading
immediately got under way. From then until June 1986 a regular cycle of data capture,
proof-reading, and data correction was maintained. A team of some fifty freelance
proof-readers was directed from Oxford. They were required to check not only the
accuracy of the text but also the selection and positioning of the computer codes. They
were provided with a detailed manual describing the structure of the Dictionary and the
correct application of the tagging system. Double proof-reading - the reading of the
same section of text by two people independently, followed by cross-collation - was
employed for a trial period. It proved, owing mainly to the very low error rate
maintained by ICC, not to reveal a markedly higher number of errors than a single
reading; certainly not enough to justify the double outlay of expense and editorial
effort. A single reading was therefore conducted, but experienced staff checked,
emended, and supplemented all the corrections before the proofs were returned to ICC. In
addition, a system of monitoring the proof-readers' work by detailed rechecking of
random samples was carried out until satisfactory standards had been achieved. During
the same stage, a prototype of the parsing program was run on most of the electronic
text to validate its structure: this functioned rather like an additional (and, within
certain limits, infallible) proof-reader.

When ICC returned the corrected tapes, these were subjected to a further check, on the
screen, to ensure that the corrections had been carried out within the agreed margins.
This left the text with an estimated residual error-rate of only 1 in 235,000
characters. Since most of these were minor errors of punctuation and spacing, and the
text would subsequently be proof-read a second time, this was felt to be an acceptable
level at which the data could proceed to automatic processing by computer.
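
The two error rates quoted in this section can be set side by side with a little arithmetic. The sketch below treats a keystroke and a character as roughly the same unit, which is an assumption made only for the sake of comparison; the variable names are likewise illustrative.

```python
from fractions import Fraction

# Figures quoted in the text.
capture_limit = Fraction(7, 10_000)   # maximum permitted at data capture
residual = Fraction(1, 235_000)       # estimated rate after correction


def per_million(rate: Fraction) -> float:
    """Express an error rate as errors per million characters."""
    return float(rate * 1_000_000)


# How many times lower the residual rate is than the permitted capture rate.
improvement = capture_limit / residual

print(per_million(capture_limit))   # errors per million allowed at capture
print(per_million(residual))        # errors per million remaining
print(float(improvement))           # ratio between the two rates
```

On that assumption, the permitted capture rate corresponds to 700 errors per million, and the residual rate is lower by a factor of 164.5.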


Computer development


In July 1985 the computer group issued an Outline System Design, describing the
essential components and features of the New OED computer system. Over the following
eighteen months, in close consultation with the lexicographers, the group built a unique
dictionary system tailored to the special needs of the project.

Once the text had been captured, it was loaded on to the project's IBM 4341 mainframe at
OUP. It was important that it should be stored in a database system that would allow the
necessary access and processing facilities. The operating system used was IBM's VM370;
the database management system was SQL/DS. Every new version of the data created by each
successive stage of processing and editing was retained in the database; no older
version was overwritten, and the whole was regularly archived on to magnetic tape and
stored at a remote site for safety.
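
The principle that no older version was overwritten can be sketched as an append-only table. This is an illustration under stated assumptions, not a description of the project's actual schema: SQLite stands in for SQL/DS, and the table, column, and function names are all hypothetical.

```python
import sqlite3

# SQLite is used here only as a stand-in for SQL/DS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry_version ("
    " entry_id TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " stage TEXT NOT NULL,"
    " body TEXT NOT NULL,"
    " PRIMARY KEY (entry_id, version))"
)


def save(entry_id, stage, body):
    """Store a new version; existing rows are never updated or deleted."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM entry_version WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO entry_version VALUES (?, ?, ?, ?)",
        (entry_id, latest + 1, stage, body),
    )
    return latest + 1


def history(entry_id):
    """List every retained version of an entry, oldest first."""
    return conn.execute(
        "SELECT version, stage FROM entry_version"
        " WHERE entry_id = ? ORDER BY version",
        (entry_id,),
    ).fetchall()
```

Each processing stage calls save, so that history shows the whole sequence of versions and any earlier state can be recovered.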

The structure devised by Sir James Murray and used by him and all his successors for
writing Dictionary entries was so regular that it was possible to analyse them as if
they were sentences of a language with a definite syntax and grammar. They could
therefore be parsed, and this was the next process to which the text was submitted. The
objective of parsing, as already mentioned, was to transform the text into a version
categorized by a system of generalized mark-up, known as SGML (Standard Generalized
Mark-up Language), in which each element is identified by its function, not its printed
appearance. The programs used for parsing were written by staff of the University of
Waterloo. The 'grammar' of the Dictionary text with which they operated was written at
Oxford. It was developed by running a postulated grammar against the Dictionary text to
establish whether the latter could be transformed without rejection of the input or
ambiguity in the output. Revised versions of the grammar were run repeatedly until the
closest possible approximation was achieved. The grammar had to be descriptive, not
prescriptive, since the computer could not be allowed to override lexicographical
judgement, and only the most minor rewriting of the text to accommodate computerization
was acceptable.
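
The idea of running a postulated grammar against the text can be illustrated with a toy recognizer. The grammar, the typographic codes, and the structural tag names below are all assumptions made for the example; the actual grammar was far larger, and the Waterloo parsing programs were not hand-written in this way.

```python
# Toy grammar, assumed for illustration:
#   entry          := headword [part_of_speech] sense+
#   headword       := BOLD
#   part_of_speech := ITAL
#   sense          := NUM ROMAN


class Rejected(Exception):
    """The input does not conform to the postulated grammar."""


def parse_entry(tokens):
    """Convert (typographic_code, text) pairs to (structural_tag, text) pairs."""
    pos = 0
    out = []

    def take(code, tag):
        # Consume the next token if it has the expected typographic code.
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] == code:
            out.append((tag, tokens[pos][1]))
            pos += 1
            return True
        return False

    if not take("BOLD", "headword"):
        raise Rejected("entry must start with a headword")
    take("ITAL", "part_of_speech")  # optional element
    senses = 0
    while take("NUM", "sense_number"):
        if not take("ROMAN", "definition"):
            raise Rejected("sense number without a definition")
        senses += 1
    if senses == 0:
        raise Rejected("entry has no senses")
    if pos != len(tokens):
        raise Rejected(f"unexpected {tokens[pos][0]} at position {pos}")
    return out
```

An entry that fits the grammar comes back with every element labelled by function; one that does not is rejected, which would signal that either the text or the postulated grammar needs attention.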

A particularly important proposal in the outline design was that the computer system
should automatically carry out as much as possible of the integration of corresponding
OED and Supplement entries, leaving the lexicographical team the task of correcting
errors, harmonizing adjacent text, and coping with difficult cases. The integration
routines used the mark-up to create a single sequence of text from the two component
parts, following the main structural cues (headwords and sense divisions) and the
instructions in the Supplement that were identified as 'integration instructions' during
parsing. Subsequent analysis of the integration program's performance showed that it
successfully handled about 80 per cent of the text, and spared the lexicographers and
keyboarders between 50 per cent and 60 per cent of the number of tasks which they would
otherwise have been obliged to perform interactively at the computer screen.

Integration caused the targets of thousands of cross-references to be changed, rendering
the cross-references inaccurate. To cope with this problem, every cross-reference
identified by the parser was numbered and copied; after integration, the stored copies
were automatically matched with their targets, changed wherever necessary, and returned
to the text. In a similar way the pronunciations were copied, translated into the
International Phonetic Alphabet, and restored.
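
The number, copy, match, and restore sequence for cross-references can be sketched as follows. The data layout is hypothetical: entries are held in a dictionary, a cross-reference target is a (headword, sense number) pair, and the effect of integration is represented by a simple renumbering table.

```python
def collect_xrefs(entries):
    """Number and copy every cross-reference before integration."""
    xrefs = {}
    for headword, entry in entries.items():
        for target in entry["xrefs"]:
            xrefs[len(xrefs) + 1] = {"source": headword, "target": target}
    return xrefs


def retarget(xrefs, renumbering):
    """After integration, point each stored copy at its target's new location."""
    return {
        number: {**ref, "target": renumbering.get(ref["target"], ref["target"])}
        for number, ref in xrefs.items()
    }


def restore(entries, xrefs):
    """Return the corrected cross-references to the entries they came from."""
    for entry in entries.values():
        entry["xrefs"] = []
    for ref in xrefs.values():
        entries[ref["source"]]["xrefs"].append(ref["target"])


# Two entries that refer to senses of a third entry, "cat".
entries = {
    "feline": {"xrefs": [("cat", 2)]},
    "kitten": {"xrefs": [("cat", 1)]},
}

# Suppose integration inserted a new sense, so the old sense 2 became sense 3.
renumbering = {("cat", 2): ("cat", 3)}

stored = collect_xrefs(entries)
restore(entries, retarget(stored, renumbering))
```

After the run, the reference in "feline" points to sense 3 of "cat", while the reference in "kitten", whose target did not move, is unchanged.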

The problem arose of finding editorial software suitable for emending and integrating
entries interactively at the computer terminal. Failing to discover any proprietary
software that was adequate, the team resolved to develop its own. The product of this
development was a new kind of text editor, designed for structured text, and originally
known as LEXX. The initial work was carried out by an IBM secondee, and then taken over
and extended by the OUP staff. This highly versatile editorial tool was designed to
interface with a number of programs that controlled access to the Dictionary data held
on the computer, allowed entries to be proofed for immediate checking, and provided a
complete working environment with checks and controls to protect the integrity of the
text. The combined sub-system was eventually named the OED Integration, Proofing, and
Updating System (OEDIPUS).

Once editing was complete, the text was to be transferred for composition of galley and
page proofs. It was decided that this part of the process should be performed by an
outside supplier.

During 1986 data capture of the main OED and Supplement text was completed (the
remaining text - the entirely new entries and the bibliography - was keyboarded during
the following half-year). The last of the eighteen monthly batches of proofs was
returned, corrected, to ICC in mid-August. A month later the automatic processing of the
Dictionary data on the computer system began. First the text was read on to the system
and validated. Next the parser was run. Structural errors encountered by the parser were
corrected on-line by the editorial group. During the three months that elapsed, 5,711
corrections were made. Automatic integration itself began in March 1987, and the
automatic processing of the whole text of the Dictionary was completed at the end of May.


The editing of the integrated text


After OEDIPUS had undergone acceptance trials, the editorial group was given access to
the system at the end of June 1987. The most efficient working method had already been
determined by experimentation. Proofs, or more strictly speaking, printouts, of all
entries that were subject to integration and the modifications resulting from it were
run off by the computer system. The lexicographical group would work through these,
examining the results of automatic integration and making corrections and other
emendations. These alterations would be entered into the text on-line by a separate
group of keyboarders. Galley proofs of the complete integrated text would then be
produced by an outside supplier. Accordingly, editing of the printouts began in June,
and, at the same time, a team of keyboard operators was engaged, trained, and assigned
to the task of 'interactive integration'.

After the first few months, during which no galley proofs were composed, the editorial
group found itself occupied on several fronts simultaneously. On account of its huge
size, the text was handled by the computer in forty alphabetical ranges or 'tables'. At
any one time, the group would be editing up to half a dozen text tables. Each of these
would be undergoing one of four consecutive editorial processes. The first was the
editing by lexicographers of proofs of all entries that had in any way been modified by
the integration and cross-referencing programs. Next, these marked-up proofs were passed
to the keyboard operators, who made the necessary emendations to the electronic text. At
this stage, a number of other corrections had also to be made, some unconnected with the
action of integration; also, many complicated problems of integration came to light
(including entries that had wrongly eluded automatic integration) and had to be
resolved, at the keyboard, by the lexicographical staff. Once the integration of a table
had been approved, a magnetic tape was produced and sent to the composition suppliers,
Filmtype Services Ltd., of Scarborough, North Yorkshire.

Galley proofs of the entire Dictionary text for each text table were produced and
distributed to the team of proof-readers (now increased to more than sixty). On their
return, the third stage began. The editorial group checked all proof-readers'
corrections, and carried out many additional systematic checks, some facilitated by
specific computer scans. Cross-references were dealt with at this stage. Once approved,
the table was again put on tape and sent for composition. This time fully formatted page
proofs were produced, and the breaks between volumes were inserted. The fourth stage
consisted of the checking of these proofs to ensure that all galley proof corrections
appeared correctly on them, and that no errors had crept into the text for any other
reason, such as the malfunctioning of the composition programs. The final corrections to
the page proofs were again keyboarded into the database at Oxford; they were applied to
the printed version by Filmtype Services either by the processing of a new magnetic tape
copy or by simple keyboarding. When the final proof pages for a volume were deemed
acceptable, the volume was passed for press.


Copyright © Oxford University Press 2009
