Using Wikipedia for cross-language named entity recognition

Fernandes, E, Brefeld, U, Blanco Gonzalez, R and Asterias, J 2016, 'Using Wikipedia for cross-language named entity recognition', in Big Data Analytics in the Social and Ubiquitous Context, Nancy, France, 15 September 2014, pp. 1-25.


Document type: Conference Paper
Collection: Conference Papers

Title Using Wikipedia for cross-language named entity recognition
Author(s) Fernandes, E
Brefeld, U
Blanco Gonzalez, R
Asterias, J
Year 2016
Conference name MUSE SenseML 2014
Conference location Nancy, France
Conference dates 15 September 2014
Proceedings title Big Data Analytics in the Social and Ubiquitous Context
Publisher Springer
Place of publication Germany
Start page 1
End page 25
Total pages 25
Abstract Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations to the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
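The three steps named in the abstract (label source-language entries, propagate tags along language links, annotate target-language mentions) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dictionaries, the function names `propagate_labels` and `annotate_mentions`, the example sentence, and the longest-match string lookup are all assumptions made for illustration. Tokens that match no known entity are left as `None` (unknown) rather than tagged as non-entities, which mirrors the abstract's point that the resulting corpus is only partially annotated.

```python
from typing import Dict, List, Optional, Tuple

# Step 1 (assumed already done): source-language entry titles with NERC tags.
# Hypothetical toy data, not taken from the paper.
SOURCE_LABELS: Dict[str, str] = {"Angela Merkel": "PER", "Berlin": "LOC"}

# Hypothetical language links: source-language title -> target-language title.
LANGUAGE_LINKS: Dict[str, str] = {
    "Angela Merkel": "Angela Merkel",
    "Berlin": "Berlim",
}


def propagate_labels(
    source_labels: Dict[str, str], language_links: Dict[str, str]
) -> Dict[str, str]:
    """Step 2: carry each tag over to the linked target-language title."""
    return {
        language_links[title]: tag
        for title, tag in source_labels.items()
        if title in language_links
    }


def annotate_mentions(
    tokens: List[str], target_labels: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Step 3: tag mentions of labeled entities; other tokens stay None."""
    # Try longer titles first so multi-word entities win over their parts.
    entries = sorted(
        ((title.split(), tag) for title, tag in target_labels.items() if title.split()),
        key=lambda entry: -len(entry[0]),
    )
    tags: List[Optional[str]] = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        for words, tag in entries:
            n = len(words)
            if tokens[i : i + n] == words:
                tags[i : i + n] = [tag] * n
                i += n
                break
        else:
            i += 1  # no entity starts here; leave the token unannotated
    return list(zip(tokens, tags))


if __name__ == "__main__":
    target_labels = propagate_labels(SOURCE_LABELS, LANGUAGE_LINKS)
    sentence = "Angela Merkel visitou Berlim ontem".split()
    for token, tag in annotate_mentions(sentence, target_labels):
        print(token, tag)
```

Keeping unmatched tokens as `None` instead of a definite "outside" tag is the design choice that matters here: a learner trained on such data has to treat those positions as unobserved, which is the setting the abstract's extensions of hidden Markov models and structural perceptrons are said to address.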
Subjects Information and Computing Sciences not elsewhere classified
DOI - identifier 10.1007/978-3-319-29009-6_1
Copyright notice © Springer International Publishing Switzerland 2016.
ISBN 9783319290089