Sense-tagged corpora should ideally represent all the senses of words (including rare senses), cover the variety of contexts in which they occur, and provide information on sense distribution. These three requirements often conflict in practice. Our approach to meeting them in DutchSemCor consisted of three phases:
1.) we manually created a lexical-sample corpus with double annotations that met the first requirement (2,874 lemmas; 274,344 tokens annotated by at least two annotators, with an inter-annotator agreement (IAA) of 94%);
2.) we semi-automatically extended this corpus through Active Learning: we trained a supervised WSD system on the data from the first phase and let it select further examples to be validated by annotators (1,133 lemmas; 132,666 tokens annotated by the WSD system and checked by at least one annotator; agreement between the annotators and the WSD system was 44%);
3.) finally, we automatically annotated the whole SoNaR corpus for the selected lemmas (Machine Sonar) to add more examples that better represent the context variety and the sense distribution of the selected lemmas.
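The selection step in the second phase can be illustrated with a minimal sketch. It assumes uncertainty sampling, a common Active Learning strategy in which the examples the classifier is least sure about go to the annotators first; the project's actual selection criterion may have differed. The function name, the token identifiers and the sense labels are all hypothetical.

```python
from typing import Dict, List, Tuple


def select_for_validation(
    predictions: Dict[str, Dict[str, float]],
    batch_size: int,
) -> List[Tuple[str, str, float]]:
    """Return the batch_size tokens whose best sense has the lowest confidence.

    predictions maps a token id to the classifier's confidence per sense.
    Uncertainty sampling is an assumption here, not the documented method.
    """
    ranked = []
    for token_id, sense_scores in predictions.items():
        # The sense the (hypothetical) WSD system would assign to this token.
        best_sense = max(sense_scores, key=sense_scores.get)
        ranked.append((token_id, best_sense, sense_scores[best_sense]))
    # Least confident predictions first: these go to the annotators.
    ranked.sort(key=lambda item: item[2])
    return ranked[:batch_size]


# Hypothetical classifier output for three occurrences of one lemma.
predictions = {
    "tok1": {"bank_1": 0.95, "bank_2": 0.05},
    "tok2": {"bank_1": 0.55, "bank_2": 0.45},
    "tok3": {"bank_1": 0.30, "bank_2": 0.70},
}
batch = select_for_validation(predictions, batch_size=2)
```

In such a loop, the validated examples would be added to the training data and the system retrained before the next batch is selected.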
The WSD systems (TiMBL-WSD, SVM-BC-WSD and UKB-WSD) were built from the final sense-annotated corpus. UKB-WSD was trained using circa 1.8 million relations (1 million derived from Cornetto and WordNet; 800,000 derived from the manually tagged data). The systems were tested in three independent evaluations:
First evaluation:
TiMBL 84.42 (Nouns) 83.74 (Verbs) 73.45 (Adjs)
SVM 82.51 (Nouns) 84.80 (Verbs) 73.62 (Adjs)
UKB 74.12 (Nouns) 59.56 (Verbs) 53.58 (Adjs)
Second evaluation:
TiMBL 54.25 (Nouns) 48.25 (Verbs) 46.50 (Adjs)
SVM 64.10 (Nouns) 52.20 (Verbs) 52.00 (Adjs)
UKB 49.37 (Nouns) 44.15 (Verbs) 38.13 (Adjs)
Combo 60.70 (Nouns) 53.95 (Verbs) 50.83 (Adjs)
Third evaluation:
TiMBL 55.76 (Nouns) 37.96 (Verbs) 49.09 (Adjs)
SVM 64.58 (Nouns) 45.81 (Verbs) 55.70 (Adjs)
UKB 56.81 (Nouns) 31.37 (Verbs) 35.93 (Adjs)
Combo 66.09 (Nouns) 45.68 (Verbs) 52.24 (Adjs)
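The Combo rows report a combination of the three systems, but this summary does not say how their outputs were merged. As an illustration only, the sketch below uses majority voting with a fixed fallback order, one common way to combine classifiers. The function name, the fallback order and the sense labels are assumptions, not the project's method.

```python
from collections import Counter
from typing import Dict, Sequence


def combine_by_vote(
    system_outputs: Dict[str, str],
    fallback_order: Sequence[str] = ("SVM", "TiMBL", "UKB"),
) -> str:
    """Combine the sense chosen by each of three systems for one token.

    If at least two systems agree, their sense wins. Otherwise the sense
    of the first available system in fallback_order is used. Both the
    voting scheme and the fallback order are illustrative assumptions.
    """
    counts = Counter(system_outputs.values())
    top_sense, top_count = counts.most_common(1)[0]
    if top_count > 1:
        return top_sense
    for name in fallback_order:
        if name in system_outputs:
            return system_outputs[name]
    return top_sense


# Two of the three systems agree, so their sense is selected.
agreed = combine_by_vote({"TiMBL": "sense_a", "SVM": "sense_b", "UKB": "sense_a"})
# No agreement: fall back to the first system in fallback_order.
split = combine_by_vote({"TiMBL": "sense_a", "SVM": "sense_b", "UKB": "sense_c"})
```

A weighted vote, for example one using per-part-of-speech accuracy on held-out data, would be a natural refinement, since the systems' relative strengths differ between nouns, verbs and adjectives in the figures above.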
At the beginning of the project, we indexed and made available the first two releases of the Dutch corpus SoNaR. As a next step, a domain classification was carried out at the paragraph level. In addition, a corpus annotation format named Folia was developed by the UvT.
Named Entity recognition and Wikification were also carried out on the whole corpus, but independently of the Cornetto database. Each Named Entity received a link to the corresponding Wikipedia page, if one exists. Besides representing a separate semantic annotation, the Named Entities can also be used as features for WSD.
Besides processing the corpora, we wrote the Semantic Annotation Guideline for sense tagging, a document that contains our annotation protocols and procedures and that can be reused in similar projects. The applications developed during the project are open source and available to the research community. The Semantic Annotation Tool (SAT), for instance, provides human annotators with an ergonomic, easy-to-use, web-based environment for computer-assisted semantic annotation. Where the number of examples in the corpus was not sufficient, the annotators used the so-called Snippet-tool to import examples from the Internet. At the end of the manual annotation phase, 67% of all annotations came from SoNaR, 5% from CGN and 28% from the Internet in the form of snippets. To analyse the annotations in the log file, we created the Loganalyser.