Gerard Salton

Professor
gs@cs.cornell.edu

Ph.D., Harvard University, 1958

Natural-language text processing is a rapidly expanding field of research and development. Large masses of machine-readable text now exist that can be cheaply stored on high-density optical storage media and rapidly retrieved on demand. Furthermore, sophisticated methods are available for analyzing document texts, formulating appropriate user queries, conducting rapid file searches, and ranking the retrieved items in decreasing order of importance to the users.

At Cornell, we design and operate large, general-purpose text processing environments where texts can be handled without restrictions as to size or subject matter. In the absence of knowledge bases that would be useful for unrestricted text databases, we use corpus-based text analysis systems that determine the meaning of words and expressions by a refined context analysis using statistical and probabilistic criteria. With these corpus-based approaches, we can determine text similarity with a high degree of accuracy. There are two main applications:

  1. The automatic generation of structured text collections (hypertext) where semantically similar pieces of text are automatically linked. Hypertext representations of large databases provide flexible browsing capabilities for general-purpose text access.

  2. The automatic retrieval of interesting text excerpts in response to available search queries.

We have done extensive work with an automated encyclopedia consisting of about 25,000 articles (the Funk and Wagnalls New Encyclopedia). We are also processing the TREC collection, which consists of about 800,000 full-text documents covering a number of different subject areas (over 2 gigabytes of text).

A sophisticated search and retrieval service exists, as well as a text linking system capable of relating different text sections, paragraphs, and sentences. The main test vehicle continues to be the current version of the Smart text analysis and retrieval system, operating under UNIX on Sun SPARCstations and Sun-4 terminal equipment.
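
As a rough illustration of how statistical term weighting can support both retrieval and text linking, the sketch below scores texts by cosine similarity over tf-idf weighted term vectors. This is a generic vector-space technique chosen for illustration; the weighting actually used in the Smart system is not described here, and all names in the code (`tokenize`, `idf_table`, `vectorize`, `cosine`, `rank`, `link`) and the linking threshold are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only. A production system would
    # typically also remove stop words and stem; that is omitted here.
    return re.findall(r"[a-z]+", text.lower())


def idf_table(token_lists):
    # Inverse document frequency: log(N / df) for each term in the collection.
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(tokens, idf):
    # Unit-length tf-idf vector as a dict. Terms unknown to the collection,
    # or present in every text (idf of 0), are dropped.
    tf = Counter(tokens)
    vec = {t: c * idf[t] for t, c in tf.items() if idf.get(t, 0.0) > 0.0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    # Both vectors are already unit length, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank(query_vec, vectors):
    # Retrieval: (score, index) pairs in decreasing order of similarity.
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


def link(vectors, threshold):
    # Linking: pairs of texts whose similarity reaches an assumed threshold.
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


texts = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply",
]
token_lists = [tokenize(t) for t in texts]
idf = idf_table(token_lists)
vectors = [vectorize(tokens, idf) for tokens in token_lists]

query_vec = vectorize(tokenize("cat mat"), idf)
for score, index in rank(query_vec, vectors):
    print(f"{score:.3f}  {texts[index]}")
print(link(vectors, threshold=0.01))
```

The same similarity function serves both applications: ranking compares one query vector against every text, while linking compares texts pairwise. The pairwise loop is quadratic in the number of texts, so a collection of any real size would need an inverted index or similar structure, which this sketch leaves out.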


If you have questions or comments please contact: www@cs.cornell.edu.


Last modified: 9 November 1994 by Denise Moore (denise@cs.cornell.edu).