Projects, 2004-2007
Many colleges and individual researchers are grappling with a flood of electronic data, as weblogs, courseware, digitized print collections, and online journals continue to grow in size and scope. These resources often lack any kind of semantic markup, yet are full of useful information. Automated semantic search technologies such as Latent Semantic Indexing (LSI) and Contextual Network Search (CNS) can play a critical role in helping organize and navigate this material.
Our projects aim to make these technologies available to scholars and the wider public. Please contact us if you are interested in collaborating on either an existing project or a new one.
Andrew W. Mellon Foundation Grant Proposal (2003)
The original proposal to the Andrew W. Mellon Foundation is provided here [PDF: 708K] as a complete description of the research being conducted and the topic areas being investigated. Current project areas are described in more detail below.
Blog Census
Weblogs are on-line, often personal, journals, but topics range widely. The blogosphere has grown extremely rapidly in both size and scope, and this project aims to create a catalog of active weblogs, across all languages. Analysis of this data collection will begin to show some of the underlying structure of the blogosphere. The blog census project page is available at http://www.blogcensus.net
A subset of the blogosphere consists of political writers from all sides of the political spectrum. A collection of the most popular political bloggers and columnists was assembled here to support discourse analysis of the day's news.
Literary Analysis Tool
Literature provides a unique challenge in indexing. Metaphors, anaphora, and allusions will all potentially stump a computer, and so the texts must be carefully indexed. This tool has been used to index a number of public domain texts. Visualizations, key topic lists, and search capabilities are all available.
Sample visualizations from the literary engine include a Jane Austen novel. Of particular interest among these character visualizations is the penultimate chapter of Emma, in which the computer was able to identify the characters who marry each other; they are displayed in pairs accordingly.
More recently, we have indexed two editions of the novel Clarissa, perhaps the longest novel ever written in English. As an experiment with Unicode, we have also created visualizations for the 18th-century Chinese novel The Dream of the Red Chamber (you will need to have Chinese fonts installed in order to view the graphs).
Refinement of search and clustering algorithms
This project will continue to develop and deploy a Contextual Network Search engine based on a graph traversal algorithm. Combined with distributed processing, this set of techniques will give us the ability to scale up to very large collections (e.g., for text, millions of documents).
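To give a rough sense of what a graph traversal search can look like, here is a minimal sketch. It is not the project's engine: the spreading-activation scheme, the function names `build_graph` and `search`, and the `steps` and `decay` parameters are all assumptions made for illustration. Documents and terms are nodes in a bipartite graph, and a query pushes energy outward from its terms so that documents sharing vocabulary with matching documents also receive a score.

```python
from collections import defaultdict


def build_graph(docs):
    """Build a bipartite term-document graph from {doc_id: text} (illustrative only)."""
    graph = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            graph[("doc", doc_id)].add(("term", term))
            graph[("term", term)].add(("doc", doc_id))
    return graph


def search(graph, query, steps=4, decay=0.5):
    """Spread energy from the query terms and rank documents by energy received.

    `steps` and `decay` are hypothetical tuning parameters, not values from the project.
    """
    energy = defaultdict(float)
    frontier = defaultdict(float)
    for term in query.lower().split():
        node = ("term", term)
        if node in graph:
            frontier[node] += 1.0
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, amount in frontier.items():
            energy[node] += amount
            neighbors = graph[node]
            # Each neighbor receives an equal share of the decayed energy.
            share = amount * decay / len(neighbors)
            for neighbor in neighbors:
                next_frontier[neighbor] += share
        frontier = next_frontier
    scores = {node[1]: e for node, e in energy.items() if node[0] == "doc"}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "a": "emma marries knightley",
    "b": "harriet marries martin",
    "c": "knightley owns donwell",
}
results = search(build_graph(docs), "emma")
```

In this toy collection, document "a" ranks first because it contains the query term, while "b" and "c" still receive a small score through the terms they share with "a". That indirect reach is the property that distinguishes this kind of search from plain keyword matching.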
Some sample search engine interfaces are available for exploration. The largest collection contains articles by a group of well-known syndicated columnists, published in the spring of 2005. In addition, the British Museum in London has graciously allowed us to experiment with the descriptions of the art objects contained in its online museum catalog, COMPASS. Finally, we worked with Steven Johnson, the popular science author, to organize his research notes in a search engine that allowed for discovery and exploration. He used a version of this tool in his research for a book.
Development of information management tools and user interface features
This project is developing archive management tools (including a stand-alone desktop application) with information management features such as auto-clustering, individual user views, and support for integrating user feedback, along with a graphical user interface, peer-to-peer search capabilities, and data visualization options. In addition, features such as topic subsumption, auto-completion, and summarization have been integrated into a number of demonstration sites.
One such application is available as an example of what is possible with semantic visualization. The Semantic Explorer allows you to enter a search query and watch as the resulting sub-graph is laid out on screen, visually clustering documents and terms together.
In order to semantically index a document collection, additional tools are needed to calculate the statistical distribution of words and phrases. A part-of-speech tagger was developed for this purpose and is used in conjunction with the Semantic Engine toolkit. The tagger itself uses data derived from the Penn Treebank, and its per-word accuracy ranges from 95% to 98%. It is also capable of extracting noun phrases and other useful grammatical constructs. The tagger is widely used by students and researchers in the field of Natural Language Processing.
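The counting step behind a "statistical distribution of words and phrases" can be sketched in a few lines. This is a simplified illustration, not the tagger or the Semantic Engine toolkit: it substitutes plain n-gram counting for part-of-speech-based phrase extraction, and the names `tokenize` and `ngram_distribution` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_distribution(texts, n=1):
    """Return the relative frequency of every n-word sequence across `texts`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


texts = ["The cat sat.", "The cat ran."]
words = ngram_distribution(texts, n=1)
phrases = ngram_distribution(texts, n=2)
```

A real pipeline would use the tagger's output to keep only grammatically meaningful phrases, such as noun phrases, rather than every adjacent word pair.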
Bioinformatics
Latent Semantic Indexing and graph theory are, at their core, language-agnostic algorithms. To test this, we have conducted some research into their application in Bioinformatics, a field in which large data sets reign. Some investigations of protein sequences have shown promising results.
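One way to see why such algorithms are language agnostic is that they only need a stream of tokens, not words in any particular language. The sketch below is an assumption about how a sequence could be prepared for indexing, not a description of the method used in this research: it splits a protein sequence into overlapping k-character "words" (k-mers) so that it can be handled like a text document. The function names and the choice of k = 3 are hypothetical.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-character 'words'."""
    sequence = sequence.strip().upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def as_document(sequence, k=3):
    """Join the k-mers with spaces so the sequence resembles a text document."""
    return " ".join(kmers(sequence, k))


example = as_document("mkvla")
```

Once a sequence is expressed this way, the same term-document machinery used for prose can in principle be applied without modification.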