The CLIN26 Shared Task presents the first collocated Shared Task for Dutch!
Please register via Registration CLIN26 shared task; we will send you the evaluation data within a day of your registration.
Participants can join in one or more of the following tasks:
- Entity Recognition
- Named Entity Recognition *new*
- Entity Disambiguation
- Event Detection
- Event Factuality Classification
- Entity coreference
- Event coreference
Please direct any questions about the task or the data to clin26org.
Development data release: August 29th *updated September 9*
Evaluation data release: November 15th
System output due: November 30th
Results & presentations: December 18th (during CLIN26)
The full corpus consists of 120 files, all of which are Dutch translations of English WikiNews articles. The first 7 sentences have been annotated according to the NewsReader annotation guidelines for Dutch (Schoen et al. 2014). The first sentence is always the title. The second sentence is the document creation time. We provide 30 documents as a development corpus. The other 90 documents will be used for evaluation. Some main characteristics of the data are provided at the bottom of this page, after the downloads.
Data to Download
The development corpora can be downloaded at the following locations. The CoNLL files provide the relevant annotations that can be used for developing your system and, at the same time, serve as an example of the output data we expect from participating systems. Please note that the corpus provides the full articles (including the sentences that are not annotated). This will also be the case for the evaluation data. We also provide the corpus with sentence and token numbers, so you can select the relevant sentences.
update 15-11-2015: Please register at Registration CLIN26 shared task after which we will send you the evaluation data within a day.
update 14-10-2015: the format for the coreference tasks has changed. The filename has been added as the first column. In addition, each document starts with ‘#begin document (name_of_file);’ and ends with ‘#end document’
update 20-10-2015: the format for the factuality task has been updated.
update 22-10-2015: The first 7 sentences have been annotated according to the NewsReader annotation guidelines for Dutch (Schoen et al. 2014). The first sentence is always the title. The second sentence is the document creation time.
update 2-11-2015: The scorer for the different tasks can be found at: CLIN26 scorer shared task
- Entity Recognition and Disambiguation in CoNLL format: corpus-entities
- Named Entity Recognition and Disambiguation in CoNLL format: corpus-named-entities
- Event recognition and Factuality in CoNLL format: corpus-event-factuality
- Entity and Event coreference: corpus-coreference
- Entity coreference: corpus-entity-coreference
- Event coreference: corpus-event-coreference
The full text with token and sentence identifiers in basic XML format can be found here: apple-corpus-tokens-sentences
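As a rough illustration of the coreference format described in the 14-10-2015 update, the sketch below groups rows by document using the ‘#begin document (name_of_file);’ and ‘#end document’ markers. It assumes whitespace-separated columns with the filename first; the function name read_documents, the sample rows, and every column after the filename are hypothetical and only serve the example. Please check the downloaded files for the actual column layout.

```python
import re
from collections import defaultdict

# Matches the opening marker, e.g. "#begin document (name_of_file);"
BEGIN = re.compile(r"^#begin document \((.*)\);")


def read_documents(lines):
    """Group token rows by document, using the #begin/#end markers."""
    docs = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = BEGIN.match(line)
        if match:
            current = match.group(1)
            docs[current]  # register the document even if it has no rows
        elif line.startswith("#end document"):
            current = None
        elif line.strip() and current is not None:
            # Assumption: columns are whitespace-separated, filename first.
            docs[current].append(line.split())
    return dict(docs)


# Hypothetical sample; the columns after the filename are made up.
sample = [
    "#begin document (example_file);",
    "example_file\t1\tDe\t-",
    "example_file\t2\tdirecteur\t(1)",
    "#end document",
]
docs = read_documents(sample)
```

The same function can be applied to an open file handle, since it only iterates over lines.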
About the data
The Dutch NewsReader corpus is part of a multilingual evaluation corpus. The original corpus consists of 4 subcorpora from English Wikinews, each containing 30 files about a certain topic. These 120 files were translated into Dutch, Spanish and Italian. The first five sentences of each article were annotated in all four languages according to the same principles.
Please read the NewsReader annotation guidelines for details on how the annotations were carried out. A few important differences with other comparable entity recognition and factuality identification tasks are highlighted below.
The entity annotations include all entities, including embedded ones. The data has the following properties:
- All entities (not only named entities) are annotated
- The full descriptive span is annotated
- If an entity contains other entities, these are annotated as well
For instance, [De directeur van [Ford]] consists of the person “the director of Ford” and the organization “Ford”. Likewise, in [Dallas, [Texas]], the span Dallas, Texas is annotated as one entity and Texas as another.
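One way to hold such embedded annotations in a system is to store each entity as a token span and test for containment. The sketch below is only an illustration: the Entity class, the inclusive token offsets, and the type labels PER and ORG are assumptions made for the example, not the labels or format used in the corpus files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    etype: str  # hypothetical type label


def contains(outer, inner):
    """True if inner lies within outer and is a different entity."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end


tokens = ["De", "directeur", "van", "Ford"]
entities = [
    Entity(0, 3, "PER"),  # [De directeur van [Ford]]
    Entity(3, 3, "ORG"),  # [Ford]
]

# Collect every (outer, inner) pair of embedded entities.
nested = [(o, i) for o in entities for i in entities if contains(o, i)]
for outer, inner in nested:
    outer_text = " ".join(tokens[outer.start:outer.end + 1])
    inner_text = " ".join(tokens[inner.start:inner.end + 1])
    print(outer_text, "contains", inner_text)
# prints: De directeur van Ford contains Ford
```

A flat, one-label-per-token representation cannot express this example, because the token Ford belongs to two entities at once.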
Factuality annotations consist of two values:
- CERTAINTY: with values certain, probable, possible and unknown
- POLARITY: with values positive, negative and unknown
Together, they can be mapped directly to the annotation values used in FactBank.
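As an illustration, the two attributes could be combined into FactBank-style labels such as CT+ or PS- roughly as follows. This is a sketch under assumptions: the abbreviations follow the usual FactBank notation, but the exact mapping intended by the organisers is not spelled out on this page, and the function name to_factbank_style is hypothetical. Some combinations (for example, unknown certainty with a known polarity) may not correspond to a value that FactBank actually uses.

```python
# Assumed abbreviations, following common FactBank notation.
CERTAINTY = {"certain": "CT", "probable": "PR", "possible": "PS", "unknown": "U"}
POLARITY = {"positive": "+", "negative": "-", "unknown": "u"}


def to_factbank_style(certainty, polarity):
    """Concatenate the certainty and polarity codes into a single label."""
    return CERTAINTY[certainty] + POLARITY[polarity]


print(to_factbank_style("certain", "positive"))   # CT+
print(to_factbank_style("possible", "negative"))  # PS-
print(to_factbank_style("unknown", "unknown"))    # Uu
```

Unrecognised attribute values raise a KeyError, which makes malformed system output easy to spot.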