The popularity of the worldwide web (WWW) for disseminating information to a large pool of users has dramatically increased the amount of documents available through the Internet. This continuous growth of publicly available information has given rise to many new problems related to the search for information of interest to a user. Entirely new areas of research have sprung up in an effort to address such problems. Examples are the resource discovery, knowledge discovery and data mining fields.
My research interests focus on the integration of multiple information sources and their use in locating relevant information for a user. Traditional database approaches cannot deal with the information available over the Internet because these approaches rely on highly structured data while the WWW contains mostly unstructured or semi-structured data. Through my previous experience in the database field while at UCSD and summer employment at AT&T Bell Labs and Los Alamos National Laboratory, I have gained expertise in extending traditional database methods to tackle problem in this new environment. In my dissertation, I have addressed three problems in this domain. The first problem is the matching of records from possibly many heterogeneous information sources. To solve this problem, I have developed two approximate record matching algorithms. The second is the development of algorithms for detecting duplicate records in databases. Finally, I have designed and implemented an application, WebFind, for finding scientific papers over the Internet. WebFind is an example of how to integrate information sources available over the Internet in order to solve the resource discovery problem.
Approximate record matching
The record matching task is to determine whether or not two syntactically different records describe the same real-world entity. For example the different address records "Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0114, La Jolla, CA 92093" and "UCSD, Computer Science and Engineering Department, CA 920930114" designate the same department. Record matching is necessary to combine information from heterogeneous knowledge sources. This task is sometimes called record linkage. Record matching is also required for detecting duplicate records in databases.
Record matching is important in many areas: identifying medical records for the same individual in different databases, correlating different pieces of information about the same taxpayer when social security numbers are missing or incorrect, fraud detection, money laundering, etc.
I have studied several domain-independent record matching algorithms, evaluating their performance in realistic applications. My research shows how to use the well-known Smith-Waterman edit-distance algorithm for record matching. This is the first time the Smith-Waterman algorithm is proposed for general knowledge discovery applications. The Smith-Waterman algorithm is a string matching procedure that was first developed to find similarities between related gene or protein sequences. Given costs for mutations and insertion/deletions, the algorithm uses dynamic programming to find the lowest cost set of changes that converts one sequence into the other. The other algorithms that I have developed take advantage of the recursive and hierarchical nature of records being matched. General heuristics are used to deal with abbreviations. In future research I am interested in pursuing solutions to this problem that have the flexibility of the Smith-Waterman algorithm but are less expensive computationally.
Alvaro E. Monge and Charles P. Elkan. Domain-independent record matching algorithms for knowledge discovery applications. November, 1996.
WebFind is a knowledge discovery tool developed for locating scientific papers published on the worldwide web. WebFind combines external information sources as a guide for locating where to look for information on the web. The external information sources used are MELVYL and NETFIND. MELVYL is a University of California library service that includes comprehensive databases of bibliographic records, including a science and engineering database called INSPEC. NETFIND is a white pages service that gives Internet host addresses and people's e-mail addresses. Separately these services do not provide enough information to locate papers on the web. WebFind integrates the information provided by each in order to find a path for discovering on the web the information actually wanted by a user. Figure 1 shows an example of a WebFind discovery session.
Many knowledge discovery and database mining applications need to combine information from heterogeneous sources. Different information sources, such as relational databases or worldwide web pages, provide data about the same real-world entities, but describe these entities differently. To integrate these information sources, we need to find the correspondence between records in different sources. To this end, a record matching algorithm is used to resolve discrepancies in how entities are described. Once discrepanci es are resolved, records from one source are matched with the corresponding records of other sources and the sources can be integrated.
Detecting database records that are approximate duplicates , but not exact duplicates, is an important task that has received much attention recently because of the growth of the knowledge discovery and data mining fields. Databases may contain duplicate records concerning the same realworld entity because of data entry errors, unstandardized abbreviations, or difference s in the schema of records across multiple databases, among other reasons. A set of records that refer to the same entity can be interpreted in two ways. One way is to view one of the records as correct and the other records as duplicates containing erroneous information. The task then is to cleanse the database of the duplicate records. Another interpretation is to consider each matching record as a partial source of information. The aim is then to merge the duplicate records, yielding one record with more complete information.
In my dissertation work, I have developed an efficient algorithm for recognizing clusters of approximately duplicate records. Three key ideas distinguish the algorithm. First, I have identified the Smith-Waterman algorithm for computing minimum edit-distance as the most appropriate domain-independent method to recognize pairs of approximately duplicate records. Second, I have also identified the most appropriate data structure to keep track of duplicates. The union/find data structure and algorithm are used to keep track of clusters of duplicate records incrementa lly, as pair-wise duplicate relationships are discovered. Third, I have implemented a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This typically reduces by over 75% the number of times that the expensive pair-wiserecord matching (Smith-Waterman or other) is applied, without impairing accuracy. Comprehensive experiments on synthetic databases and on a real database of bibliographic records confirm the effectiveness of the new algorithm.