Update 11-04-2015: A GitHub repository, SoNar2Naf, has been created to convert FoLiA XML to NAF. The all-words part of DutchSemCor is therefore now also available in NAF (with dependencies and entities) and contains, besides Cornetto senses, Open Source Dutch WordNet 1.0 senses.


All-words.txt contains sequential annotations made by linguists to evaluate the WSD systems used in the DutchSemCor project.

Number of tokens = 23,907

Number of lemmas = 1,527

The All-words corpus covers a small number of texts, limited to a selection of genres and domains.

Overview of genres and domains.


The download contains three XML files for each part of speech (noun, verb, adjective), including the annotated examples for the all-words corpus developed in the DutchSemCor project. For each annotated example (element), the following information is stored in the DSC-XML file:

+ lemma: the lemma of the example

+ pos: the part-of-speech

+ sense_id: the lexical unit identifier of Cornetto for the example

+ token_id: the token identifier of the example in SONAR
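
As a rough illustration of how the four fields listed above might be read, here is a minimal Python sketch. It assumes that each annotated example is an XML element carrying lemma, pos, sense_id and token_id as attributes; the page does not specify the element name or whether the fields are attributes or child elements, so check this against the actual files. The function name `read_annotations`, the tag names and all identifier values in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# The four fields described for each annotated example.
# ASSUMPTION: they are stored as XML attributes; the real DSC-XML layout may differ.
FIELDS = ("lemma", "pos", "sense_id", "token_id")


def read_annotations(xml_text: str) -> List[Dict[str, str]]:
    """Collect every element that carries all four fields as attributes."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for element in root.iter():  # visits the root and all of its descendants
        if all(name in element.attrib for name in FIELDS):
            records.append({name: element.attrib[name] for name in FIELDS})
    return records


# Made-up sample: tag names and identifier values are placeholders,
# not real Cornetto or SONAR identifiers.
SAMPLE = """
<examples>
  <example lemma="bank" pos="noun" sense_id="hypothetical-sense-1" token_id="hypothetical-token-1"/>
  <example lemma="bank" pos="noun" sense_id="hypothetical-sense-2" token_id="hypothetical-token-2"/>
  <note>elements without the four fields are skipped</note>
</examples>
"""

records = read_annotations(SAMPLE)
for record in records:
    print(record["lemma"], record["pos"], record["sense_id"], record["token_id"])
```

Because the sketch matches on the presence of the four attributes rather than on a tag name, it does not depend on what the example element is actually called. If the fields turn out to be child elements instead, the attribute lookup would need to be replaced by something like `element.findtext(name)`.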


