HUMAN.log contains all annotations made by human annotators to train WSD systems in the DutchSemCor project.
(Download HUMAN.log.zip in DSC-XML)
1.2.1.HUMAN_ANNOTATIONS.zip contains three DSC-XML files, one for each part of speech (nouns, verbs, adjectives), with the manual annotations of the DutchSemCor project. The .zip file also contains the annotations resulting from the Active Learning process (the second annotation phase, see below). Where a human annotator disagreed with the machine, the tag assigned by the human was selected.
(Download 1.2.1.HUMAN_ANNOTATIONS.zip in DSC-XML)
The file table.annotations.DSC.08-01-12.sql is an SQL dump of the table used to store all annotations made in the DutchSemCor project. The table contains annotations made manually by humans, annotations made automatically by a machine-learning system, and annotations for the all-words corpus.
(Download 1.1.1.SAT-SQL_DUMP.zip)
The file table.annotations.DSC.08-01-12.csv is a CSV export of the same annotations table, with the same content: manual annotations, automatic annotations, and the annotations for the all-words corpus.
(Download 1.1.2.SAT-CSV_EXPORT.zip)
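For quick inspection, the CSV export can be loaded with standard tooling. A minimal sketch in Python; the "annotator" column name queried below is hypothetical, so check the printed header of the actual export first:

    import csv
    from collections import Counter

    # Load the CSV export of the annotations table.
    with open("table.annotations.DSC.08-01-12.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    print(f"{len(rows)} annotation rows, columns: {list(rows[0]) if rows else []}")

    # "annotator" is a hypothetical column name; inspect the header above first.
    if rows and "annotator" in rows[0]:
        print(Counter(r["annotator"] for r in rows).most_common(5))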
FIRST ANNOTATION PHASE
In the first phase of the project, 282,503 tokens for 2,870 nouns, verbs and adjectives (11,982 senses) were annotated manually by two annotators. The annotators needed to reach high agreement and were instructed to select 25 diverse examples for each sense. The examples were selected using the Sense Annotation Tool, SAT (Görög & Vossen 2010; Van Gompel 2010; Vossen et al. 2011), from the 500-million-token SoNaR corpus of written Dutch (Oostdijk et al., 2008) and the CGN Spoken Dutch Corpus (Eerten, 2007). The SAT is a user-friendly web application for semantic tagging developed for DutchSemCor. Its user interface combines lexicographic information from the Cornetto database with corpus data from the available corpora.
(Download Human1stPhase.zip)
SECOND ANNOTATION PHASE (ACTIVE LEARNING)
In the second phase of the project, the manually annotated corpus of the first phase was extended with more examples using a supervised WSD system and Active Learning (AL). We designed the following procedure:
1. We build and test a supervised WSD system (Initial Learning WSD, or IL-WSD) using the examples from the first phase.
2. All words with a tested accuracy above 80% are considered done and ready for completing the corpus: for these words, the supervised system can find more examples with sufficiently high confidence and precision (the split is sketched below).
3. All words that perform below 80% undergo AL to obtain better results and more examples.
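The 80% split in steps 2 and 3 can be pictured as follows. A minimal sketch in Python; the per-word accuracies are invented for illustration and, in the project, came from testing IL-WSD on the first-phase data:

    THRESHOLD = 0.80  # accuracy cut-off used in the project

    # Hypothetical tested accuracies of IL-WSD per word (illustrative values).
    accuracy = {"bank": 0.93, "spelen": 0.71, "licht": 0.64}

    done = {w for w, acc in accuracy.items() if acc > THRESHOLD}   # ready for corpus completion
    w_al = {w for w, acc in accuracy.items() if acc <= THRESHOLD}  # WAL: undergoes Active Learning

    print("done:", sorted(done))  # -> ['bank']
    print("WAL:", sorted(w_al))   # -> ['licht', 'spelen']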
We defined the following AL procedure for all words tagged below 80% accuracy by IL-WSD:
1. Use IL-WSD to annotate all remaining SoNaR tokens of word w in WAL, where WAL is the set of words performing below 80% accuracy.
2. Present the annotators with a selection of 50 tokens for w tagged with senses that perform badly.
3. The annotators tag these tokens with the proper sense, creating an Active Learning corpus (AL-corpus). Annotators can assign any of the senses to the tokens, and can therefore also re-assign tokens to senses that already perform well.
The AL-corpus is used to improve the WSD system for the words in WAL. This process is repeated until we attain the desired results for all words; a minimal sketch of the whole loop follows.
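Putting the two procedures together, the retraining loop can be sketched as below. Everything here is illustrative: the callables train, evaluate, select and annotate, and the tag method on the trained system, are hypothetical stand-ins for the project's actual tooling (IL-WSD, the SAT, the human annotators), not released components:

    from typing import Callable, Dict, List

    THRESHOLD = 0.80   # accuracy cut-off from the project
    BATCH_SIZE = 50    # tokens shown to the annotators per word per round

    def active_learning_loop(
        corpus: Dict[str, List],     # word -> manually annotated examples (phase 1)
        remaining: Dict[str, List],  # word -> remaining untagged SoNaR tokens
        train: Callable,             # trains a supervised WSD system on the corpus
        evaluate: Callable,          # tested accuracy of the system for one word
        select: Callable,            # picks tokens tagged with badly performing senses
        annotate: Callable,          # human annotators assign the proper senses
    ):
        """Sketch of the DutchSemCor AL procedure described above."""
        wsd = train(corpus)                                         # IL-WSD
        w_al = {w for w in corpus if evaluate(wsd, w) < THRESHOLD}  # set WAL
        while w_al:
            for w in w_al:
                tagged = [(t, wsd.tag(t)) for t in remaining[w]]    # step 1: machine-tag
                batch = select(tagged, n=BATCH_SIZE)                # step 2: 50 tokens for w
                corpus[w].extend(annotate(batch))                   # step 3: humans re-tag
            wsd = train(corpus)                                     # retrain on the AL-corpus
            w_al = {w for w in w_al if evaluate(wsd, w) < THRESHOLD}
        return wsd, corpus

The loop stops once every word in WAL reaches the threshold, matching the "repeated until we attain the desired results" condition above.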