FOA Home
In the last chapter, we described the FOA process linguistically, in
terms of words that occur in documents, morphological features of these
words, structures organizing the sentences of documents, etc. We now
want to treat all of these words --- which have meaning to their
authors and to us reading them --- as a \rikmeaning-less stream of data
- word after word after word. (Imagine it coming from some SETI radio
telescope, eaves-dropping on the communication of some other planet!) We
will now seek common patterns and trends to this data, using the same
sorts of statistical tricks that physicists typically use on their data
streams. What can we learn from looking at the statistics of our data
stream, treating text as \rikmeaning-less and attempting to infer a new
notion of meaning from those statistics?
But now let's narrow our focus,
all the way down to the bits and characters used to represent the
corpus, for example as a file on a physical device, like a hard disk.
Imagine that you are an archaeologist, trying to study some civilization
that had left this evidence behind. How might you interpret this modern
Rosetta Stone?
Let's ignore those issues relating to the basic ASCII
encoding. That is, suppose we have special knowledge of a character set.
Then the frequency of these characters' occurrences would already give
us a great deal of information. Anyone who has studied simple cipher
techniques (or played Scrabble!) knows that a table of most frequently
used letters (cf. Figure (FOAref) [Welsh88] ) can be well-exploited for
breaking simple codes.
\hline Letter & Frequency\\
\hline E & .120 \\ T & .085 \\ A & .077 \\ I & .076 \\ N & .067 \\ O &
.067 \\ S & .067 \\ R & .059 \\ H & .050 \\ D & .042 \\ L & .042 \\ U &
.037 \\ C & .032 \\ F & .024 \\ M & .024 \\ W & .022 \\ Y & .022 \\ P &
.020 \\ B & .017 \\ G & .017 \\ V & .012 \\ K & .007 \\ Q & .005 \\ J &
.004 \\ X & .004 \\ Z & .002 \\ \hline \caption{English letter
frequencies}
In this chapter we will move another level above
characters. We will consider morphological transformations we can
perform on character sequences that help us to identify root words. We
will briefly mention phrases by which multiple words can be joined into
simple phrasal units.
At each level we will ask very similar questions:
What is our unit of analysis; i.e., just what are we counting? Then,
what does the distribution of frequency occurrences across this level of
features tell us about the pattern of their use? What can we tell about
the meaning of these features, based on such statistics [Francis82] ?
In fact, many influential
thinkers have looked at such patterns among symbols. Going back to some
of our most ancient writings suggests that statistical analyses of the
original Hebrew characters and their positions within the
two-dimensional array of the page reveals new ``codes'' [Witztum94] [Drosnin97] .
Donald Knuth, one of
computer science's most reknowned theoreticians, has analyzed an
apparently random verse (Chapter 3, verse 16) from 59 of the Bible's
books and used these as the basis of STRATIFIED SAMPLING of the
approximately 30000 Biblical verses [Knuth90] . He found, for example, that
the 3:16 verses were particulary richin occurrances of {\tt YHWH}, the
ancient Hebrew name for God. Personally, Knuth found this analysis the
source for ``historical and spiritual insights,'' as part of a Bible
study class he lead. But speaking scientifically, how can we
find meaning in text, and when are such attempts merely
numerology?
Top of Page
Microscopic semantics and the statistics of communication