Wendy Ark

COGS 200: Finding Out About

April 20, 2001

Topic: Susan Dumais, Bringing Order to the Web ... and Beyond

The beauty of the Web!

Issues which need to be resolved to keep the web a beautiful place

- lots of info

- getting the right info to the right people

- easily accessible

- millions of people use the web (surfing, posting, developing, etc)

- all in one place

- the internet is huge and what we get is a pinhole view of all that's out there

- info overload, remembering URLs, etc.

*is spatial information important? Do we want to move around in the internet space?

Internet searching is a skill!!!!!!!!
*side note: there was a job notice sent out recently and one of the eligibility requirements was: experience with computers and ability to perform internet searches
*getting the right info at the right time is a powerful skill

Relevance + Reliability of a Document (not just the search engine): It would be ideal to structure the results of a search so that confidence in each answer is explicit (ex: a student believes the teacher on a particular subject more than other students, and peer-reviewed articles are typically more highly regarded than web publications).

Enter: The Search Engine

There are search engines which categorize (relying on trained professionals to categorize new items):

Yahoo!, LookSmart, Dewey, MeSH, CyberPatrol

These are costly, time-consuming, and cannot be done in real time. Therefore, these search engines are limited in the number of web pages they will search.

The SVM search engine consists of 2 main components:

1. a text classifier that categorizes web pages on the fly

2. a user interface that presents the web pages within the category structure and allows the user to manipulate the structured view.

**If used for an entire web search, how does the search engine get the web pages so it can search them? Is there a database of URLs which needs to be constantly updated? Does it use links on the document page to find new URLs? Is there a database of past search terms and their corresponding search results (which would need to be updated b/c the web is dynamic), or are the searches done ‘on the fly’ (and how can that be real time)?
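
*one possible answer (my guess, not from the talk): a crawler keeps a "frontier" of URLs, fetches each page, pulls new URLs out of its links, and feeds them back into the frontier; the index built from the fetched pages is what gets searched at query time, so it is only as fresh as the last crawl rather than truly real time. A rough sketch:

    # Hypothetical crawler sketch (my illustration; seed URLs and details are made up).
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=10):
        frontier = deque(seed_urls)          # URLs waiting to be fetched
        seen = set(seed_urls)                # avoid re-fetching the same URL
        pages = {}                           # url -> html, raw material for the index
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue                     # skip pages that fail to load
            pages[url] = html
            for link in re.findall(r'href="(http[^"]+)"', html):
                link = urljoin(url, link)
                if link not in seen:         # follow links on the page to find new URLs
                    seen.add(link)
                    frontier.append(link)
        return pages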

Support Vector Machine (SVM) for text classification:

- A linear SVM is a hyperplane that separates a set of positive examples (words or pages that are part of the category) from a set of negative examples (words or pages that are not part of the category) with maximum margin. The input feature vectors consist of the k words with highest mutual information with each category. For example, the SVM representation for the category "interest" includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46) with large positive weights and the words group (-.24), year (-.25), sees (-.33), world (-.35) and dlrs (-.71) with large negative weights.
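
*rough sketch of the idea (my own toy example, assuming today's scikit-learn rather than the actual 1998 system): select the k words with the highest mutual information for the category, then fit a linear SVM whose positive and negative word weights play the role of the "interest" weights quoted above.

    # Toy sketch: mutual-information feature selection + linear SVM text classifier.
    # (Assumption: scikit-learn; the corpus and category are made up for illustration.)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["prime rate raised again",          # in the "interest" category
            "discount rate unchanged",          # in the "interest" category
            "world group sees record year"]     # not in the category
    labels = [1, 1, 0]

    classifier = make_pipeline(
        CountVectorizer(binary=True),              # binary word features (raw counts ignored)
        SelectKBest(mutual_info_classif, k=5),     # keep the k words with highest mutual information
        LinearSVC(),                               # maximum-margin linear separator
    )
    classifier.fit(docs, labels)
    print(classifier.predict(["prime rate discount"]))   # expected: [1]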

**Are the negative examples related? How is the set of negative examples formed? All the words on the page which are not in the category? It is possible to get more than 2 support vectors—what would this mean? The number of times words/categories are mentioned on a page does not affect its rating. Wouldn’t this help?

** In the CIKM ’98 paper, the authors found that initial NLP analyses did not improve classification accuracy. For example, the phrase "interest rate" is more predictive of the Reuters category "interest" than is either the word "interest" or "rate" ("New York" vs "new" or "york"). In fact, for the SVM, the NLP features actually reduce performance on the 118 Reuters categories by 0.2% (probably not significant?). Why wouldn’t NLP-derived phrases help?
*a search engine may fail to find a phrase such as "New York" because it would need BOTH words, and that could degrade performance. Also, if "York" is the search criterion, it is likely part of "New York" anyway.
*However, perhaps the verb phrase could be informative
*Note: negation is a hard case for the search engine
*Perhaps if the user could control the weights for each of the words in the search query, it would be more helpful. Currently this is hidden from the user

** The Reuters database is a bit biased in that each article is written with a specific topic in mind. Plus, it probably does not have any links. How would the SVM scale up to web documents, which are not necessarily intended for newspaper-type reading? Web docs have fewer words per page, so there are fewer points to use in the feature vector.
*It does not do as well. However, it gets about 70% correct categorization on the web documents versus the 90+% on the Reuters database.

** FOA mentions combining classifiers so that there might be particular expert search engine systems (collection fusion). Is there any particular classifier which might complement the SVM? Another option besides fusing search engines (skimming and chorus effects) is expert opinions. For example, OneLook Dictionaries is useful because if you look up the word "mean", OneLook will point you to expert websites, each giving different information based on the category in which that website is an expert.

- The SVM uses supervised learning to set the weights correctly.

Advantages of SVMs: good classification accuracy, fast to learn, fast to classify new instances

Disadvantages:

*SIDE DISCUSSION: Latent Semantic Indexing (LSI)
What is it?
-It is work done in conjunction with Tom Landauer and others
-It gives you a way to associate words through dimension reduction
-How do humans associate words? Is it the same as LSI? Probably not. However, there are similarities....
-LSI infers that words are related based on context use
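
*toy sketch of LSI-style dimension reduction (assuming scikit-learn's TruncatedSVD; the original LSI work takes an SVD of the term-document matrix in the same spirit): words and documents that share contexts end up near each other in the reduced space, even when they share few literal words.

    # Toy LSI sketch (assumption: scikit-learn; corpus made up for illustration).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the prime interest rate rose",
            "banks cut the discount rate",
            "the orchestra played a new symphony"]

    X = TfidfVectorizer().fit_transform(docs)    # document-term matrix
    lsi = TruncatedSVD(n_components=2)           # keep 2 latent "semantic" dimensions
    X_reduced = lsi.fit_transform(X)             # documents in the reduced space

    # The two rate documents should be more similar to each other than to the
    # symphony document, because their words co-occur in similar contexts.
    print(cosine_similarity(X_reduced))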

Other Algorithm Questions:

** People often don’t know the best way to structure a search. People won’t take the time to read the rules which direct a search. Results of a search may vary depending on whether the person used quotation marks, AND or OR. This may be a novice versus advanced user issue, but what is the best way to approach this? Do we state the rules, offer other search options, or just let the user struggle?

** Another important point that FOA makes is that web publishers want search engines to work. They want people to find out about the information they have to offer the world. It behooves them to know how to write their websites so that they get the correct classification. Knowing how the search engine parses their web pages is important!!

** If text is organized into categories, are people more likely to find what they want because they themselves use the categories, or because the machines use the categories?

** Is the level of breakdown into classifications inversely proportional to its usefulness?

** In FOA: "One critical, simplifying assumption shared by both models is that word features occur independently in the documents. As we have discussed a number of times, any such NAÏVE BAYESIAN model will miss a great deal of the interactions arising among real words in real documents. It is somewhat curious, then, that such naïve classifiers do as well as they do [Domingos 97]."
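
*toy sketch of the kind of naive Bayesian classifier FOA is describing (assuming scikit-learn's MultinomialNB; example is mine): the "naive" independence assumption is visible in the model itself, since P(document | category) is just a product of per-word probabilities, so interactions among words are ignored.

    # Toy naive Bayes sketch (assumption: scikit-learn; corpus made up for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["prime rate raised", "discount rate cut",
            "new symphony premiere", "orchestra tour dates"]
    labels = ["interest", "interest", "music", "music"]

    nb = make_pipeline(CountVectorizer(), MultinomialNB())
    nb.fit(docs, labels)
    print(nb.predict(["discount rate rose"]))    # expected: ['interest']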

Interface Questions:

** Unlike Yahoo!, it did not seem like you had a beginning interface which allowed people to follow the path of a particular category (to narrow their search space) before entering the search text. Is this because you found that people tend not to use this option?

General Questions:

** Part of the problem with searching is that people may not know the common verbiage to make a correct search and will therefore have to make several attempts (or people will misspell words, e.g., Hilary Clinton vs Hillary; or use purposeful misspellings, e.g., Napster).

** Sometimes people will run a search and then want to revisit it later. How do we save searches? Or rank the most useful docs from our searches for ourselves?

--Last Updated April 23, 2001