Dutchsemcor-domain-classification-data contains the training and test data that were used to assign domain labels to paragraphs in SONAR.
The training data are derived from the Cornetto database version 2.0. Synsets in Cornetto can have one or more labels from WordnetDomains, version 1.1 [1,2]. For each synset, we took the synonyms and the words of the definition and added them to the training file of every domain label associated with that synset. The result is a single training file for each domain label, containing all the Dutch words associated with it. For example, the file “sub.txt” contains the following words:
sub zuurstoffles aqualong cilinder met samengeperste zuurstof;
The 161 training files are stored in the folder: dsc-domain-cornetto-vocabulary.
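The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the files: the record layout (the keys "synonyms", "definition" and "domains"), the whitespace tokenisation and the function name build_domain_vocabulary are all assumptions, and the example record is made up to resemble the “sub.txt” sample rather than copied from Cornetto.

```python
from collections import defaultdict


def build_domain_vocabulary(synsets):
    """Group synonym and definition words by domain label.

    Each synset is a dict with hypothetical keys 'synonyms',
    'definition' and 'domains'; this layout is an assumption.
    """
    vocabulary = defaultdict(list)
    for synset in synsets:
        # Synonyms first, then the definition split on whitespace.
        words = list(synset["synonyms"]) + synset["definition"].split()
        # A synset contributes its words to every domain it carries.
        for label in synset["domains"]:
            vocabulary[label].extend(words)
    return dict(vocabulary)


# Illustrative record only, not an actual Cornetto entry.
synsets = [
    {
        "synonyms": ["zuurstoffles", "aqualong"],
        "definition": "cilinder met samengeperste zuurstof",
        "domains": ["sub"],
    },
]
vocabulary = build_domain_vocabulary(synsets)
```

Writing each entry of vocabulary to a file named after its label would then give one training file per domain.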
The Wordnet domains are organised in a hierarchy according to the Dewey decimal system. Some specialized domains, e.g. specific types of sports, do not need to be distinguished for the purpose of word-sense disambiguation. We therefore created a mapping to more global domains and also replaced the English domain labels by shorter Dutch abbreviations. The mapping between the domain labels is provided in the folder: wordnet-domain-labels.
We transferred the training data from the WordnetDomain labels to the Dutch labels, resulting in fewer, coarser domains. These training data are found in the folder: dsc-domain-cornetto-vocabulary. The Dutch domains are represented by 37 folders, some of which group several English WordnetDomain labels, e.g. sp (sport) contains 36 training files.
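The regrouping of fine-grained labels under the Dutch labels might look like the sketch below. The mapping entries are hypothetical placeholders (the real mapping is the one in wordnet-domain-labels, whose file format is not assumed here), and merge_to_coarse is an illustrative name. The nested result mirrors the folder/file layout: one outer key per Dutch label, one inner key per original training file.

```python
def merge_to_coarse(fine_vocabulary, mapping):
    """Regroup {fine_label: words} as {coarse_label: {fine_label: words}}.

    Labels that are missing from the mapping are kept under their own name.
    """
    coarse = {}
    for fine_label, words in fine_vocabulary.items():
        coarse_label = mapping.get(fine_label, fine_label)
        coarse.setdefault(coarse_label, {})[fine_label] = words
    return coarse


# Hypothetical mapping entries; 'sp' (sport) is the only Dutch label
# named in this description, the assignments themselves are assumed.
mapping = {"sub": "sp", "tennis": "sp"}

fine_vocabulary = {
    "sub": ["zuurstoffles", "aqualong"],
    "tennis": ["racket", "bal"],
}
coarse_vocabulary = merge_to_coarse(fine_vocabulary, mapping)
```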
A domain classifier can be trained from either the file names or the folder names, resulting in classifiers at different levels.
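As a rough illustration of the two levels, the sketch below reads a folder/file layout and labels the words either by file name (fine-grained) or by folder name (coarse). The classifier is a small add-one-smoothed multinomial Naive Bayes with uniform priors, chosen here only for brevity; it is not necessarily the classifier that was used for SONAR. The demo files "tennis.txt" and "music.txt", the folder "mu" and their contents are invented for the example.

```python
import math
import tempfile
from collections import Counter
from pathlib import Path


def load_training_data(root, level):
    """Return {label: Counter of words}; level is 'file' or 'folder'."""
    data = {}
    for path in sorted(Path(root).glob("*/*.txt")):
        # File name gives the fine label, folder name the coarse one.
        label = path.stem if level == "file" else path.parent.name
        words = path.read_text(encoding="utf-8").split()
        data.setdefault(label, Counter()).update(words)
    return data


def classify(text, data):
    """Pick the label with the highest add-one-smoothed log-likelihood."""
    vocab_size = len({w for counts in data.values() for w in counts})
    best_label, best_score = None, -math.inf
    for label, counts in sorted(data.items()):
        total = sum(counts.values())
        score = sum(
            math.log((counts[w] + 1) / (total + vocab_size))
            for w in text.lower().split()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up layout standing in for the real training folders.
with tempfile.TemporaryDirectory() as root:
    layout = {
        "sp/sub.txt": "zuurstoffles aqualong cilinder met samengeperste zuurstof",
        "sp/tennis.txt": "racket bal net",
        "mu/music.txt": "noot melodie instrument",
    }
    for rel, content in layout.items():
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")

    fine = load_training_data(root, "file")
    coarse = load_training_data(root, "folder")
    fine_label = classify("een aqualong met zuurstof", fine)
    coarse_label = classify("een aqualong met zuurstof", coarse)
```

The same paragraph thus receives a label such as "sub" from the file-level model and "sp" from the folder-level model; only the choice of label source differs between the two.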
(Download 2.6.DOMAIN_CLASSIFIER.zip in DSC-XML)
The evaluation data are stored in the folder sonar_random_paragraphs_with_dsc_domains_for_testing.