OED statistics: a look around the leader board

By John Simpson

Each quarter the OED Online database changes in several ways. The most obvious way relates to the new and revised entries added to the database for the first time. But there are other ways of looking at the changes which the OED database undergoes and the sections below give a round-up of several of these.

Visualizing words

From time to time we are asked if it’s possible to provide a visual summary of the entries we’ve revised in comparison with those that are still awaiting revision. Just recently we’ve found some time to start working on this, and the following image is a static prototype of the sort of working image we are hoping to provide online in future.

visualisation
Red lines show revised entries, and black lines indicate unrevised entries, A to Z across the panel from left to right.

So we can read this to show that the first half of the letter A in the OED has been fully revised, as have sections in the range from mid-A to the end of L. Then all entries in the range M to the end of R have been fully revised, as have two discrete sections at the end of S (sub- and super- words). Elsewhere the dictionary entries have been revised in short stretches only.

The central block of solid red represents the period early in the revision process where we worked solely in consecutive alphabetical ranges. By the time we reached the end of R the OED’s new computer system made it feasible for us to select the entries we edited. We no longer needed to follow strict alphabetical order. As a result – as noted in previous publication notes – we have subsequently selected entries for revision and update based largely on factors indicating how significant they are to the language and how frequently they are consulted by online readers.

The dog ears between some letters along the top indicate letters containing too few entries to fit on the main index: the letters J, K, Q, W, X, Y, and Z.

Users would then be able to click on the various lines for further information; we hope to make this feature available soon. One further developmental stage, linked to the horizontal panel above, is shown by the image below:

Letter L

If the letter L is clicked in the horizontal panel, then this visual appears. This image shows that 12% of the letter L has been revised and updated since OED2 (1989), and that the letter L accounts for 2.79% of the dictionary database in terms of headword-count. By headword we mean a word which begins a separate entry in the dictionary.

The words displayed are all revised entries which would link through to the entry itself. Subsequent clicks on L would retrieve some further selections of revised entries in this range.

By providing different views of the data (not just by letter of the alphabet) we expect to offer new ways into the OED’s hoard of words. Of course, since the OED is an ongoing research project, all of these numbers are liable to change as more entries are added and updated.
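As a rough illustration of how the two percentages quoted for the letter L could be derived from headword counts, here is a minimal Python sketch. The `Entry` record, the `letter_stats` function, and the sample data are all assumptions made up for this example; they do not describe the OED's actual database or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    headword: str   # the word that begins a separate entry
    revised: bool   # True if revised and updated since OED2 (1989)


def letter_stats(entries: list[Entry], letter: str) -> tuple[float, float]:
    """Return (percent of the letter revised, percent of all headwords in the letter)."""
    in_letter = [e for e in entries if e.headword.lower().startswith(letter.lower())]
    if not entries or not in_letter:
        return 0.0, 0.0
    revised_pct = 100 * sum(e.revised for e in in_letter) / len(in_letter)
    share_pct = 100 * len(in_letter) / len(entries)
    return revised_pct, share_pct


# Invented toy data for illustration only, not real OED figures.
sample = [
    Entry("lamp", True),
    Entry("lark", False),
    Entry("loom", False),
    Entry("lute", False),
    Entry("make", True),
    Entry("run", True),
    Entry("set", False),
    Entry("time", True),
]

revised_pct, share_pct = letter_stats(sample, "l")
print(f"L: {revised_pct:.0f}% revised, {share_pct:.0f}% of headwords")
```

With the toy data, one of the four L headwords is revised and L holds four of the eight headwords, so the sketch reports 25% and 50%. Both figures would shift whenever entries are added or revised.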

OED statistics


From time to time we run a check to see whether any new entries have crept into the top ten entries measured by size (character count). The present leader board shows the following ranking order:

  1. run (verb)
  2. red (adjective, etc.)
  3. put (verb)
  4. time (noun)
  5. be (verb)
  6. pre- (prefix)
  7. set (verb/1)
  8. make (verb/1)
  9. black (adjective, etc.)
  10. over- (prefix)

All of these entries have been revised and updated except set, which we can expect to climb back up the leader board once it has been revised. But whether it will regain the top spot which it held in OED2 (1989) is a moot point. Notice that the largest entries are all short words (monosyllables) in their base form. The verbs are verbs of doing and being, the adjectives are the big colours (which spawn many compounds), and we have several major prefixes (but not re-).

For the record, the smallest entries (by character count) are these:

  1. ysunged (verb)
  2. capulin (noun)
  3. banche (verb)
  4. bouge (noun/5)
  5. han’t/ha’n’t (verb)
  6. hain’t/haint (verb)
  7. benow (adverb)
  8. cruciately (adverb)
  9. jumpish (adjective)
  10. ef (noun)

These are all unrevised entries, without any illustrative quotations. Some will be subsumed elsewhere after revision.
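
A check of this kind can be sketched in a few lines of Python. This is only an illustration of ranking entries by character count: the `leader_board` function, the placeholder entry names, and the sizes are hypothetical and are not taken from the OED's own systems or data.

```python
def leader_board(entry_sizes: dict[str, int], n: int = 10) -> tuple[list[str], list[str]]:
    """Return the n largest and the n smallest entries by character count."""
    # Largest first; ties are broken alphabetically so the order is stable.
    largest = sorted(entry_sizes, key=lambda name: (-entry_sizes[name], name))[:n]
    # Smallest first, with the same alphabetical tie-break.
    smallest = sorted(entry_sizes, key=lambda name: (entry_sizes[name], name))[:n]
    return largest, smallest


# Invented placeholder entries and character counts, for illustration only.
sizes = {
    "entry A": 12000,
    "entry B": 300,
    "entry C": 45000,
    "entry D": 80,
    "entry E": 4500,
}

largest, smallest = leader_board(sizes, n=2)
print("Largest:", largest)
print("Smallest:", smallest)
```

Rerunning such a check after each quarterly update would show whether a newly revised entry has moved into either list.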