DOMAINS
This is the evaluation of our domain classifier, trained with an SVM approach using the data retrieved from Cornetto.
For evaluation purposes, we have compiled two different datasets from the SONAR corpus.
sonar1_2_random_paragraphs consists of a set of randomly selected paragraphs from SONAR, which have been manually annotated with the appropriate Dutch domain labels (mapped from WordNet Domains). These documents can be found in the folder test_data/sonar1_2_random_paragraphs
sonar1_2_random_paragraphs_genre is similar to the previous test set, but the paragraphs were selected so that the different genres within SONAR are equally represented. Again, these documents have been manually annotated. They are contained in the folder test_data/sonar1_2_random_paragraphs_genre
Our trained domainClassiferDSC has been applied to both test sets. The results, together with the gold standard, can be found in the files sonar1_2_random_paragraphs.eval and sonar1_2_random_paragraphs_genre.eval. For each document, these files list the document identifier (the name of the plain text file), the set of manually annotated domains (gold), and the top five domains assigned by our SVM system (svm), along with the confidence of the machine learning algorithm.
In our evaluation, a document is considered correctly classified if at least one domain in the gold standard matches one of the top five domains returned by the system, regardless of the position at which it is returned within those five. This approach reflects the fact that more than one domain can apply to a given paragraph. The total accuracy of our system, computed as the number of correct documents divided by the number of correct plus wrong documents, is:
* sonar1_2_random_paragraphs: 84.62 %
* sonar1_2_random_paragraphs_genre: 79.88 %
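
The matching criterion described above can be illustrated with a short Python sketch. This is not the script used to produce the figures above; the function names, the in-memory data layout, and the English domain labels are assumptions made for illustration only, and the sketch does not parse the .eval files.

```python
def is_correct(gold, predicted, top_k=5):
    """Return True if any of the first top_k predicted domains is in the gold set."""
    return bool(set(gold) & set(predicted[:top_k]))


def accuracy(documents, top_k=5):
    """Return the percentage of documents counted as correct.

    documents is a list of (gold, predicted) pairs, where gold is a
    collection of domain labels and predicted is a ranked list of labels.
    """
    if not documents:
        raise ValueError("no documents to evaluate")
    correct = sum(1 for gold, predicted in documents
                  if is_correct(gold, predicted, top_k))
    wrong = len(documents) - correct
    return 100.0 * correct / (correct + wrong)


# Hypothetical example data: the labels are placeholders, not real annotations.
example = [
    ({"sport"}, ["economy", "sport", "politics", "law", "art"]),
    ({"medicine", "biology"}, ["history", "art", "music", "law", "religion"]),
    ({"politics"}, ["politics", "law", "economy", "history", "sport"]),
    # The gold label appears only in sixth position, so it does not count.
    ({"music"}, ["art", "law", "economy", "history", "sport", "music"]),
]

print(accuracy(example))
```

Because a single overlap within the top five is enough, this measure is more lenient than requiring the first-ranked domain to match the gold standard.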