NAME

Semantic::API - Perl extension for graph-based search


DESCRIPTION

The Semantic Engine emerged from an attempt to improve on standard keyword searches. By analyzing the statistical patterns in natural language, it can establish concept-based relationships between distinct texts that may not share particular keywords. Representing a text collection as a graph-theoretic network makes it easy to find similarities and relationships in otherwise unstructured data.


SYNOPSIS - Indexing

  use Semantic::API;
  my $semantic = Semantic::API::Index->new( 
                            collection => 'my_collection',
                            storage => 'mysql',
                            database => 'my_database',
                            host => 'localhost',
                            username => 'my_user',
                            password => '***',
                            min_term_frequency => 2,
                            max_document_frequency => '0.2' );
  # -- OR --
  my $semantic = Semantic::API::Index->new( 
                            collection => 'my_collection',
                            storage => 'sqlite',
                            database => 'my_database' );
  $semantic->add_word_filters( 
                            too_many_numbers     => 10,
                            minimum_length       => 3,
                            maximum_word_length  => 15,
                            maximum_phrase_length=> 3,
                            blacklist            => \@blacklist,
                            whitelist            => \@whitelist);
  $semantic->index_file( $filename ); # read this file and index it!
  ...
  $semantic->set_default_encoding( "utf8"); # use this encoding for any incoming text
  $semantic->index( $id, $text );
  ...
  $semantic->index( $id, $text, $weight ); # give this item a different weight
  $semantic->finish(); # commit everything to the database


SYNOPSIS - Searching

  use Semantic::API;
  my $semantic = Semantic::API::Search->new( collection => 'my_collection',
                                             storage => 'mysql',
                                             database => 'my_database',
                                             username => 'my_user',
                                             password => '***',
                                             host => 'localhost' );
  # -- OR --
  my $semantic = Semantic::API::Search->new( collection => 'my_collection',
                                             storage => 'sqlite',
                                             database => 'my_database' );
  my ($results, $terms ) = $semantic->semantic_search( 'query' );
  # -- OR --
  my ($results, $terms ) = $semantic->keyword_search( 'query' );
  my @term_list   = sort { $terms->{$b}   <=> $terms->{$a}   } keys %$terms;
  my @result_list = sort { $results->{$b} <=> $results->{$a} } keys %$results;
  foreach( @result_list ){
      ...
  }


METHODS

new( %PARAMETERS )
Parameters: takes a named parameter list (see Synopsis above) specifying the storage policy, a collection name, and some database parameters:
    storage    => 'mysql' or 'sqlite'
    collection => 'collection name'
    database   => 'database name'
    username   => 'mysql username' (optional)
    password   => 'mysql password' (optional)
    host       => 'mysql host' (optional)
    min_term_frequency => 'minimum occurrence of a term' (optional)
    max_document_frequency => 'maximum fraction of the collection in
                               which a term may occur, e.g. 0.2' (optional)
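
To make the two frequency thresholds concrete, here is a minimal pure-Perl sketch that does not use Semantic::API at all. The helper `prune_terms` and the toy collection are hypothetical, and the sketch assumes that min_term_frequency counts occurrences across the whole collection and that max_document_frequency is a fraction of documents; the module's actual behaviour may differ.

```perl
use strict;
use warnings;

# Invented toy collection: document ID => list of extracted terms.
my %docs = (
    d1 => [qw(graph search graph)],
    d2 => [qw(graph index)],
    d3 => [qw(graph walk)],
    d4 => [qw(search walk)],
    d5 => [qw(graph perl)],
);

# Hypothetical helper (not part of Semantic::API): keep terms that occur at
# least $min_tf times overall and appear in at most $max_df (a fraction)
# of the documents.
sub prune_terms {
    my ($docs, $min_tf, $max_df) = @_;
    my (%tf, %df);
    for my $terms (values %$docs) {
        my %seen;
        for my $t (@$terms) {
            $tf{$t}++;                       # total occurrences
            $df{$t}++ unless $seen{$t}++;    # documents containing the term
        }
    }
    my $n    = keys %$docs;                  # number of documents
    my @kept = grep { $tf{$_} >= $min_tf && $df{$_} / $n <= $max_df } keys %tf;
    return sort @kept;
}

my @kept = prune_terms(\%docs, 2, 0.5);
print "@kept\n";   # prints "search walk"
```

In this toy data "graph" is dropped because it appears in four of five documents, while "index" and "perl" are dropped because they occur only once.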

Additional parameters are listed below.

Indexing

Additional optional parameters:

    lexicon => 'path/to/lexicon.gz'
    default_encoding => 'iso-8859-1'
    parsing_method => 'nouns'

add_word_filters( %FILTERS )
Various filters can be added to trim the list of words that are indexed. All nouns are, by default, added to the index, but some other words will sometimes slip through. Filters available for use include:
    minimum_length        => $num  # omit words with fewer characters than $num
    maximum_word_length   => $num  # omit words with more characters than $num
    maximum_phrase_length => $num  # omit phrases with more words than $num
    too_many_numbers      => $num  # omit words containing more numbers than $num
    blacklist             => \@array # omit words in this array
    whitelist             => \@array # keep only words in this array
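
As a rough illustration of how such filters could combine, the following pure-Perl sketch applies them to a list of candidate terms. The helper `passes_filters` is hypothetical and not part of Semantic::API; it assumes that the length limits apply to each word of a phrase, that too_many_numbers counts digit characters, and that list matching is exact and case-sensitive.

```perl
use strict;
use warnings;

# Hypothetical helper: returns true when $term survives every filter
# present in the hash reference $f. Filter names mirror the list above.
sub passes_filters {
    my ($term, $f) = @_;
    my @words  = split ' ', $term;
    my $digits = () = $term =~ /\d/g;    # count digit characters

    return 0 if defined $f->{maximum_phrase_length}
             && @words > $f->{maximum_phrase_length};
    return 0 if defined $f->{too_many_numbers}
             && $digits > $f->{too_many_numbers};
    for my $w (@words) {
        return 0 if defined $f->{minimum_length}
                 && length($w) < $f->{minimum_length};
        return 0 if defined $f->{maximum_word_length}
                 && length($w) > $f->{maximum_word_length};
    }
    if ($f->{blacklist}) {
        return 0 if grep { $_ eq $term } @{ $f->{blacklist} };
    }
    if ($f->{whitelist}) {
        return 0 unless grep { $_ eq $term } @{ $f->{whitelist} };
    }
    return 1;
}

my %filters = (
    minimum_length        => 3,
    maximum_word_length   => 15,
    maximum_phrase_length => 3,
    too_many_numbers      => 10,
    blacklist             => ['thing'],
);
my @kept = grep { passes_filters($_, \%filters) }
           ('graph', 'of', 'thing', 'random walk');
print join(', ', @kept), "\n";   # prints "graph, random walk"
```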

set_parsing_method( $METHOD )
This controls which classes of words are extracted from a document. The default is 'nouns'; other values include 'proper_nouns', 'noun_phrases', 'adjectives', and 'verbs'.

set_default_encoding( $ENCODING )
This sets the encoding to assume for incoming text when indexing. The default encoding is ISO-8859-1 (latin1). All text is converted to UTF-8.

index( $ID, $TEXT, [$WEIGHT=1] )
This method reads the text, extracts nouns, applies any filters, and adds the data to the semantic index.

index_file( $FILENAME, [$WEIGHT=1] )
Reads the given file and indexes its contents. See `index` above.

reindex( $ID, $TEXT, [$WEIGHT=1] )
If the text for this document has changed, the old one will be removed and the new document will be added to the index.

reindex_file( $FILENAME, [$WEIGHT=1] )
See `reindex` above.

unindex( $ID )
Removes this document (or term) from the index.

finish()
VERY IMPORTANT! This saves the entire index to the storage medium. If you do not call this method, nothing will be saved.

merge( $FIRST => $SECOND )
Merges the two documents or terms. The $FIRST item is merged into the $SECOND. (All references to the $FIRST item are removed.)

Searching

Additional optional parameters (with default values):

    depth          => 4    # depth of graph traversal
    trials         => 100  # number of trials for random walk
    keep_top_edges => 0.3  # fraction of edges kept before traversal
                           # set this to `1' to do no pruning

semantic_search( $QUERY )
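Parameters: $QUERY is the raw search query from a user. As shown in the Synopsis, the call returns two hash references, ($results, $terms), whose values are numeric scores that can be sorted on.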
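
The depth and trials parameters suggest a random walk over a graph of terms and documents. The sketch below is not the module's implementation; it is a toy, pure-Perl illustration of one way such a walk could work, with an invented graph and a hypothetical helper `random_walk_counts`. It assumes edges are chosen with probability proportional to their weight, and it does not model keep_top_edges.

```perl
use strict;
use warnings;

# Toy weighted graph: terms link to documents and back. All names invented.
my %graph = (
    'term:graph'  => { 'doc:1' => 0.9, 'doc:2' => 0.4 },
    'term:search' => { 'doc:2' => 0.8 },
    'doc:1'       => { 'term:graph' => 0.9 },
    'doc:2'       => { 'term:graph' => 0.4, 'term:search' => 0.8 },
);

# Hypothetical helper: run $trials random walks of at most $depth steps
# from $start and count how often each node is visited.
sub random_walk_counts {
    my ($g, $start, $depth, $trials) = @_;
    my %visits;
    for (1 .. $trials) {
        my $node = $start;
        for (1 .. $depth) {
            my $edges = $g->{$node} or last;
            my @next  = sort keys %$edges;
            last unless @next;
            my $total = 0;
            $total += $edges->{$_} for @next;
            # Weighted choice: subtract weights until the draw is used up.
            my $pick = rand($total);
            for my $n (@next) {
                $node = $n;
                last if ($pick -= $edges->{$n}) < 0;
            }
            $visits{$node}++;
        }
    }
    return \%visits;
}

srand(42);   # fixed seed so repeated runs on one machine agree
my $visits = random_walk_counts(\%graph, 'term:graph', 4, 100);
printf "%-12s %d\n", $_, $visits->{$_}
    for sort { $visits->{$b} <=> $visits->{$a} || $a cmp $b } keys %$visits;
```

Nodes visited more often can be read as more closely related to the starting node; raising trials smooths the counts, and raising depth lets the walk reach more distant nodes.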

keyword_search( $QUERY )
Parameters: same as above. The results, however, come from a simple keyword search rather than a semantic search.
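
The Synopsis sorts the returned hashes by score. A small sketch of taking only the best few matches follows; it uses mock data in place of a real search call, and the helper `top_n`, the document IDs, and the scores are all invented for illustration.

```perl
use strict;
use warnings;

# Mock data standing in for the $results hash reference shown in the
# Synopsis; IDs and scores are invented.
my $results = { doc_a => 0.42, doc_b => 0.91, doc_c => 0.10, doc_d => 0.67 };

# Hypothetical helper: the $n highest-scoring keys of a score hash,
# with ties broken alphabetically so the order is stable.
sub top_n {
    my ($scores, $n) = @_;
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b }
                 keys %$scores;
    $n = @ranked if $n > @ranked;
    return @ranked[0 .. $n - 1];
}

my @top = top_n($results, 3);
printf "%-6s %.2f\n", $_, $results->{$_} for @top;
```

The same helper would work on the $terms hash, since the Synopsis sorts both in the same way.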

find_similar( @DOCUMENT_IDS )
Parameters: one or more document IDs. Works like semantic_search, except that the search begins on the given document node(s) rather than on a term node.

summarize( @DOCUMENT_IDS )
Returns a summary of the given document(s). If more than one document is provided, it will find the best summary for the document set.

get_document_text( $DOCUMENT_ID )
Returns the text of the given document.

Utilities

These are exported by Semantic::API on request only.

Semantic::API::have_sqlite()
Returns true if SQLite support is enabled.

Semantic::API::have_mysql()
Returns true if MySQL support is enabled.


SEE ALSO

For more information, please visit http://www.knowledgesearch.org


AUTHORS

    Aaron Coburn, <acoburn@middlebury.edu>
    Gabe Schine, <gschine@middlebury.edu>


COPYRIGHT AND LICENSE

Copyright (C) 2006 by Aaron Coburn and Gabe Schine

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.