Improved cross-language information retrieval via disambiguation and vocabulary discovery

Zhang, Y 2006, Improved cross-language information retrieval via disambiguation and vocabulary discovery, Doctor of Philosophy (PhD), Computer Science and Information Technology, RMIT University.


Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Zhang.pdf Thesis application/pdf 2.30MB
Title Improved cross-language information retrieval via disambiguation and vocabulary discovery
Author(s) Zhang, Y
Year 2006
Abstract Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR.

In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from seg mentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy.

Good quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval; but also have wider applications than CLIR.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Computer Science and Information Technology
Keyword(s) Machine translating
Versions
Version Filter Type
Access Statistics: 241 Abstract Views, 771 File Downloads  -  Detailed Statistics
Created: Mon, 29 Nov 2010, 16:09:00 EST by Catalyst Administrator
© 2014 RMIT Research Repository • Powered by Fez SoftwareContact us