RMIT University Research Repository
 

Federated text retrieval from independent collections

Shokouhi, M 2007, Federated text retrieval from independent collections, PhD Thesis, School of Electrical and Computer Engineering, RMIT University.

Document type: Thesis
Collection: Theses
Attached Files
Name Description MIMEType Size Downloads
Shokouhi.pdf Thesis application/pdf 2.48MB 200

Title Federated text retrieval from independent collections
Author(s) Shokouhi, M
Year 2007
Abstract Federated information retrieval is a technique for searching multiple text collections simultaneously. Queries are submitted to the subset of collections most likely to return relevant answers, and the results returned by the selected collections are merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot index uncrawlable hidden web collections, whereas federated information retrieval systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, the subset of collections most likely to return relevant documents must be selected; this is the collection selection problem. To select suitable collections, a federated information retrieval system must acquire some knowledge about the contents of each collection; this is the collection representation problem. The results returned from the selected collections must be merged before they are presented to the user; this final step is the result merging problem. In this thesis, we propose new approaches for each of these problems. Our suggested methods for collection representation, collection selection, and result merging outperform state-of-the-art techniques in most cases. We also propose novel methods for estimating the number of documents in collections, and for pruning unnecessary information from collection representation sets. Although management of document duplication has been cited as one of the major problems in federated search, prior research in this area often assumes that collections are free of overlap. We investigate the effectiveness of federated search on overlapping collections, and propose new methods for maximizing the number of distinct relevant documents in the final merged results. In summary, this thesis introduces several new contributions to the field of federated information retrieval, including practical solutions to some historically unsolved problems in federated search, such as document duplication management. We test our techniques on multiple testbeds that simulate both hidden web and enterprise search environments.
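The three steps named in the abstract (collection representation, collection selection, result merging) can be illustrated with a toy sketch. This is not the thesis's method: the term-count scoring, the max-score normalisation, the use of shared document IDs to detect duplicates, and all names and data below are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy collections: name -> {doc_id: text}.
# "n1" appears in two collections to mimic overlap between collections.
COLLECTIONS = {
    "news": {"n1": "federated search merges results",
             "n2": "weather report for monday"},
    "papers": {"p1": "federated search over hidden web collections",
               "n1": "federated search merges results"},
    "recipes": {"r1": "tomato soup recipe", "r2": "bread recipe"},
}


def build_representation(docs, sample_size=2):
    """Collection representation: term counts from a small document sample."""
    sample = list(docs.values())[:sample_size]
    return Counter(word for text in sample for word in text.lower().split())


def select_collections(query, representations, k):
    """Collection selection: keep the k collections whose sample best matches."""
    terms = query.lower().split()
    scores = {name: sum(rep[t] for t in terms)
              for name, rep in representations.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


def search_collection(query, docs):
    """Stand-in for a collection's own search engine (query term frequency)."""
    terms = query.lower().split()
    results = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


def merge_results(result_lists):
    """Result merging: normalise each list by its top score, drop duplicates."""
    best = {}
    for results in result_lists:
        if not results:
            continue
        top = results[0][1]
        for doc_id, score in results:
            # Assumption: identical doc_id means the same document.
            best[doc_id] = max(best.get(doc_id, 0.0), score / top)
    return sorted(best.items(), key=lambda r: (-r[1], r[0]))


def federated_search(query, collections, k=2):
    reps = {name: build_representation(docs)
            for name, docs in collections.items()}
    selected = select_collections(query, reps, k)
    return merge_results(
        [search_collection(query, collections[name]) for name in selected])


if __name__ == "__main__":
    for doc_id, score in federated_search("federated search", COLLECTIONS):
        print(doc_id, round(score, 2))
```

In this sketch the duplicate document is kept only once in the merged list, which is the simplest possible form of the duplication management the abstract mentions; a real system would not usually have shared identifiers to rely on.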
Degree PhD Thesis
Institution RMIT University
School, Department or Centre School of Electrical and Computer Engineering
Keyword(s) Information retrieval
Search engines
 
Access Statistics: 68 Abstract Views, 200 File Downloads
Created: Wed, 16 Feb 2011, 11:37:14 EST