Wendy Ark

COGS 200: Finding Out About

April 20, 2001

Topic: Susan Dumais, Bringing Order to the Web ... and Beyond

The beauty of the Web!

Issues which need to be resolved to keep the web a beautiful place

- lots of info

- getting the right info to the right people

- easily accessible

- millions of people use the web (surfing, posting, developing, etc)

- all in one place

- the internet is huge and what we get is a pinhole view of all that's out there

- info overload, remembering URLs, etc.

*is spatial information important? Do we want to move around in the internet space?

Internet searching is a skill!!!!!!!!
*side note: there was a job notice sent out recently and one of the eligibility requirements was: experience with computers and ability to perform internet searches
*getting the right info at the right time is a powerful skill

Relevance + Reliability of a Document (not just the search engine): It would be ideal to structure the results of a search so that confidence in each answer is explicit (ex: a student believes the teacher on a particular subject more than other students, and peer-reviewed articles are typically more highly regarded than web publications).

Enter: The Search Engine

There are search engines which categorize (relying on trained professionals to categorize new items):

Yahoo!, LookSmart, Dewey, MeSH, CyberPatrol

These are costly, time-consuming, and cannot be done in real time. Therefore, these search engines are limited in the number of web pages they will search.

The SVM search engine consists of 2 main components:

1. a text classifier that categorizes web pages on the fly

2. a user interface that presents the web pages within the category structure and allows the user to manipulate the structured view.

**If used for an entire web search, how does the search engine get the web pages so it can search them? Is there a database of URLs which needs to be constantly updated? Does it use links on the document page to find new URLs? Is there a database of past search terms and their corresponding search results (which would need to be updated b/c the web is dynamic), or are the searches done ‘on the fly’ (and how can that be real time)?
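
*one possible answer (my guess, not from the talk): a crawler keeps a "frontier" of URLs, fetches each page, pulls new URLs out of its links, and feeds them back into the frontier; the index built from the fetched pages is what gets searched at query time, so it is only as fresh as the last crawl rather than truly real time. A rough sketch:

    # Hypothetical crawler sketch (my illustration; seed URLs and details are made up).
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=10):
        frontier = deque(seed_urls)          # URLs waiting to be fetched
        seen = set(seed_urls)                # avoid re-fetching the same URL
        pages = {}                           # url -> html, raw material for the index
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue                     # skip pages that fail to load
            pages[url] = html
            for link in re.findall(r'href="(http[^"]+)"', html):
                link = urljoin(url, link)
                if link not in seen:         # follow links on the page to find new URLs
                    seen.add(link)
                    frontier.append(link)
        return pages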

Support Vector Machine (SVM) for text classification:

- A linear SVM is a hyperplane that separates a set of positive examples (words or pages that are part of the category) from a set of negative examples (words or pages that are not part of the category) with maximum margin. The input feature vectors consist of the k words with highest mutual information with each category. For example, the SVM representation for the category "interest" includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46) with large positive weights and the words group (-.24), year (-.25), sees (-.33), world (-.35) and dlrs (-.71) with large negative weights.
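
*rough sketch of the idea (my own toy example, assuming today's scikit-learn rather than the actual 1998 system): select the k words with the highest mutual information for the category, then fit a linear SVM whose positive and negative word weights play the role of the "interest" weights quoted above.

    # Toy sketch: mutual-information feature selection + linear SVM text classifier.
    # (Assumption: scikit-learn; the corpus and category are made up for illustration.)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["prime rate raised again",          # in the "interest" category
            "discount rate unchanged",          # in the "interest" category
            "world group sees record year"]     # not in the category
    labels = [1, 1, 0]

    classifier = make_pipeline(
        CountVectorizer(binary=True),              # binary word features (raw counts ignored)
        SelectKBest(mutual_info_classif, k=5),     # keep the k words with highest mutual information
        LinearSVC(),                               # maximum-margin linear separator
    )
    classifier.fit(docs, labels)
    print(classifier.predict(["prime rate discount"]))   # expected: [1]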

**Are the negative examples related? How is the set of negative examples formed? All the words on the page which are not in the category? It is possible to get more than 2 support vectors—what would this mean? The number of times words/categories are mentioned on a page does not affect its rating. Wouldn’t this help?

** In the CIKM ’98 paper, the authors found that initial NLP analyses did not improve classification accuracy. For example, the phrase "interest rate" is more predictive of the Reuters category "interest" than is either the word "interest" or "rate" ("New York" vs "new" or "york"). In fact, for the SVM, the NLP features actually reduce performance on the 118 Reuters categories by 0.2% (probably not significant?). Why wouldn’t NLP-derived phrases help?
*a search engine may fail to find a phrase such as "New York" because it would need BOTH words, and that could degrade performance. Also, if "York" is the search criterion, it is likely part of "New York" anyway.
*However, perhaps the verb phrase could be informative
*Note: negation is a hard case for the search engine
*Perhaps if the user could control the weights for each of the words in the search query, it would be more helpful. Currently this is hidden from the user

** The Reuters database is a bit biased in that each article is written with a specific topic in mind. Plus, it probably does not have any links. How would the SVM scale up to web documents, which are not necessarily intended for newspaper-type reading? Web docs have fewer words per page, so there are fewer points to use in the feature vector.
*It does not do as well. However, it gets about 70% correct categorization on the web documents versus the 90+% on the Reuters database.

** FOA mentions combining classifiers so that there might be particular expert search engine systems (collection fusion). Is there any particular classifier which might complement the SVM? Another option besides fusing search engines (skimming and chorus effects) is expert opinions. For example, OneLook Dictionaries is useful because if you look up the word "mean", OneLook will point you to expert websites, each giving different information based on the category in which that website is an expert.

- The SVM uses supervised learning to set the weights correctly.

Advantages of SVMs: good classification accuracy, fast to learn, fast to classify new instances

Disadvantages:

*SIDE DISCUSSION: Latent Semantic Indexing (LSI)
What is it?
-It is work done in conjunction with Tom Landauer and others
-It gives you a way to associate words through dimension reduction
-How do humans associate words? Is it the same as LSI? Probably not. However, there are similarities....
-LSI infers that words are related based on context use
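
*toy sketch of LSI-style dimension reduction (assuming scikit-learn's TruncatedSVD; the original LSI work takes an SVD of the term-document matrix in the same spirit): words and documents that share contexts end up near each other in the reduced space, even when they share few literal words.

    # Toy LSI sketch (assumption: scikit-learn; corpus made up for illustration).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the prime interest rate rose",
            "banks cut the discount rate",
            "the orchestra played a new symphony"]

    X = TfidfVectorizer().fit_transform(docs)    # document-term matrix
    lsi = TruncatedSVD(n_components=2)           # keep 2 latent "semantic" dimensions
    X_reduced = lsi.fit_transform(X)             # documents in the reduced space

    # The two rate documents should be more similar to each other than to the
    # symphony document, because their words co-occur in similar contexts.
    print(cosine_similarity(X_reduced))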

Other Algorithm Questions:

** People often don’t know the best way to structure a search. People won’t take the time to read the rules which direct a search. Results of a search may vary depending on whether the person used quotation marks, AND or OR. This may be a novice versus advanced user issue, but what is the best way to approach this? Do we state the rules, offer other search options, or just let the user struggle?

** Another important point that FOA makes is that web publishers want search engines to work. They want people to find out about the information they have to offer the world. It behooves them to know how to write their websites so that they get the correct classification. Knowing how the search engine parses their web pages is important!!

** If text is organized into categories, are people more likely to find what they want because they themselves use the categories, or because the machines use the categories?

** Is the level of breakdown into classifications inversely proportional to its usefulness?

** In FOA: "One critical, simplifying assumption shared by both models is that word features occur independently in the documents. As we have discussed a number of times, any such NAÏVE BAYESIAN model will miss a great deal of the interactions arising among real words in real documents. It is somewhat curious, then, that such naïve classifiers do as well as they do [Domingos 97]."
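
*toy sketch of the kind of naive Bayesian classifier FOA is describing (assuming scikit-learn's MultinomialNB; example is mine): the "naive" independence assumption is visible in the model itself, since P(document | category) is just a product of per-word probabilities, so interactions among words are ignored.

    # Toy naive Bayes sketch (assumption: scikit-learn; corpus made up for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["prime rate raised", "discount rate cut",
            "new symphony premiere", "orchestra tour dates"]
    labels = ["interest", "interest", "music", "music"]

    nb = make_pipeline(CountVectorizer(), MultinomialNB())
    nb.fit(docs, labels)
    print(nb.predict(["discount rate rose"]))    # expected: ['interest']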

Interface Questions:

** Unlike Yahoo!, it did not seem like you had a beginning interface which allowed people to follow the path of a particular category (to narrow their search space) before entering the search text. Is this because you found that people tend not to use this option?

General Questions:

** Part of the problem with searching is that people may not know the common verbiage to make a correct search and will therefore have to make several attempts (or people will misspell words, e.g., Hilary Clinton vs Hillary; or use purposeful misspellings, e.g., Napster).

** Sometimes people will run a search and then want to revisit it later. How do we save searches? Or rank the most useful docs from our searches for ourselves?

--Last Updated April 23, 2001