For some time the study of evolution has demanded an especially
interdisciplinary approach, and so it is no wonder that profound
difficulties in communication arise as scientists trained in paradigms
as varied as biology, psychology and even computer science [REF1123] attempt to communicate with one
another. Now, of course, theories of evolution are increasingly informed
by huge volumes of concrete data, generated by the Human Genome Project
and related efforts. Serendipitously, the field of molecular biology is
also one of the first (but quite certainly not the last) disciplines to
undergo a qualitative change because of the WWW. The nearly simultaneous
growth of the WWW and genomic databases has meant that computational
biology as a science has grown up with a very advanced notion of
publication. Beyond formal publication channels, even beyond informal
email and discussion groups, the genomic databases at the heart of
molecular biology today may point to forms of communication among
scientists which are arguably, like the image-based WWW traffic,
POST-VERBAL.
The flood of biological sequence data -- nucleic
acids, proteins, and now gene expression networks and metabolic pathways --
into sequence databases, with the related flood of molecular biology
literature, represents an unprecedented opportunity to investigate how
concepts learned automatically from various data sets relate to the
words and phrases used by scientists to describe them. Learning this
linkage -- between molecular biology concepts and the
genomic data relating to them -- can be described as
annotating the data. It is now possible to learn many of these
correspondences automatically, guided by the RelFbk of practicing
scientists, as a natural by-product of their browsing through genome
data and publications related to them. RelFbk provides a key
additional piece of information to learning algorithms, beyond the
statistical correlations that may exist within the genome data or
textual corpora treated independently: It captures the fact that a
scientist who understands both the sequence data and the journal
articles deeply does (or does not) believe that a particular sequence
and particular keyword/concept share a common referent. Sequences are
posted, and annotations are often constructed automatically, based on
HOMOLOGOUS relations to other sequences found in the databases. A
different variety of ``sequence search engines,'' specially developed to
look for similarities among sequences rather than among documents,
becomes the basis for retrievals. These retrievals can and often do
connect the work of one scientist to that of another without a single
verbal expression passing between them.
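
As a rough illustration of how an annotation might be carried over from a homologous sequence, consider the following minimal Python sketch. It is not how BLAST or any production sequence search engine works. The toy database, the function names, the 0.8 threshold, and the use of the generic `difflib.SequenceMatcher` ratio as a stand-in for a real homology score are all assumptions made only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy database: sequence -> set of annotation keywords.
# Real databases hold far longer sequences and far richer annotations.
ANNOTATED = {
    "ATGGCCATTGTAATGGGCCGC": {"kinase"},
    "ATGAAACGCATTAGCACCACC": {"transporter"},
}


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real homology score."""
    return SequenceMatcher(None, a, b).ratio()


def transfer_annotations(query, database, threshold=0.8):
    """Copy the annotations of the most similar database sequence.

    Returns (annotations, best_score). The annotation set is empty when
    no database sequence reaches the (assumed) similarity threshold.
    """
    best_seq, best_score = None, 0.0
    for seq in database:
        score = similarity(query, seq)
        if score > best_score:
            best_seq, best_score = seq, score
    if best_seq is None or best_score < threshold:
        return set(), best_score
    return set(database[best_seq]), best_score


# A query differing from the first database entry by one substitution.
labels, score = transfer_annotations("ATGGCCATTGCAATGGGCCGC", ANNOTATED)
```

In this sketch the newly posted sequence inherits the keyword of its nearest neighbor without any scientist having written a word about it, which is the sense in which such retrievals connect scientists without verbal expression.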
The figure sketches the basic
relations. On the bottom are the most fundamental classes of molecular
data, namely gene and protein sequences. On the top is a set of
scientific documents, such as those found in MEDLINE. The primary
relation connecting the raw genetic data and the textual corpora is
ANNOTATION. The links that scientists have (manually) established
between articles and sequences are both significant and useful. They are
significant because they help to establish the construction of the
genome as a piece of the scientific enterprise, linking it to the
traditions of academic publication. They are also useful to many
scientists who, for example, are interested in a particular gene or
protein and want to find out all that others might know about it. But
annotation is not done consistently by all participating scientists, nor
has a precise semantics been established for what exactly an annotation
should mean. The Entrez
interface to MEDLINE makes it convenient for a user with a particular
sequence in mind to find its corresponding publication, and vice versa.
Together with the MESH thesaurus of medical terms (cf. Section §6.3), these features make the National
Library of Medicine's resource one of the most advanced on the WWW.
In addition to expediting the searches of scientists and doctors, the
identification of significant patterns in one modality (i.e., in text or
in sequence data) can be used to suggest hypotheses in the other
(similar to suggestions made by Swanson; cf. Section §6.5.3). Also shown in the figure are
$\mathcal{S}im$ arcs relating ``similar'' data. In the case of genetic
or protein sequence data, these similarity measures are typically based
on a notion of ``edit distance'' generated by string-matching tools such
as BLAST and FastA, but the investigation of new methods for this
problem is one of the most active areas within machine learning (cf. [Glasgow96]). The investigation of
inter-document similarities has been an important problem within the
field of information retrieval (IR) for many decades. Most document
similarity measures are based on correlations between ``keywords''
contained in pairs of documents, but other methods (e.g., based on a
bibliometric analysis of shared entries in the documents'
bibliographies) have also received considerable attention.
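
The three kinds of similarity just mentioned can be made concrete with a small Python sketch. It rests on several assumptions: `edit_distance` is the textbook Levenshtein distance, not the heuristic local alignment that BLAST and FastA actually perform; `keyword_cosine` treats every whitespace-separated token as a ``keyword''; and `shared_reference_overlap` uses a simple Jaccard ratio as one possible bibliometric measure. All function names are illustrative.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def keyword_cosine(doc_a, doc_b):
    """Cosine similarity between raw keyword-count vectors."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_reference_overlap(refs_a, refs_b):
    """Jaccard overlap of two bibliographies (shared / total distinct)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A lower edit distance means more similar sequences, whereas the two document measures grow toward 1 as documents share more keywords or more cited references, so the $\mathcal{S}im$ arcs for the two kinds of data need not be on a common scale.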