Target Words Matrix

The Target Words Matrix includes statistics for all target words in the DutchSemCor project. There are two files (.csv and .txt) for the three word categories (nouns, verbs and adjectives), containing the same information. The .csv file follows the usual separated-values layout, except that the separator is ‘|’ rather than a comma, while the .txt file is a more human-readable plain text file.

The information represented for each lemma is the following:

+ lemma

+ part-of-speech

+ Number of instances in SONAR for the lemma

+ Manual annotations: the number of instances manually annotated per sense for the lemma, and how many of them had to be retrieved from the web because they could not be found in SONAR.

+ Accuracy: the accuracy of each system (timbl, svm and ukb) for the lemma, in the fold-cross evaluation as well as in the all-words evaluation.

+ Automatic annotations: the number of instances automatically annotated per sense, along with the average confidence of the system over those instances. In the CSV format, each sense carries a prefix naming the system that produced it (timbl, svm or ukb). For the manual annotations the prefix is ‘manual-’.

The CSV file contains one line for each lemma-pos pair, with the following fields:

lemma | pos | total instances in sonar | timbl acc. fold-cross | svm acc. fold-cross | ukb acc. fold-cross | timbl acc. all-words | svm acc. all-words | ukb acc. all-words | manual-sense | num. manual instances | num. instances from web | … | [timbl/svm/ukb]-sense | num. automatically annotated by [timbl/svm/ukb] | avg. confidence | … |
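The line layout above can be read with a few lines of Python. The following is only a minimal sketch, not part of the DutchSemCor distribution: the names `LemmaRow` and `parse_row` are hypothetical, and it assumes that every count and accuracy field is filled with a plain number, that there is no header line, and that each line may end with a trailing ‘|’. The sample line in the usage note is invented for illustration and does not come from the real files.

```python
from dataclasses import dataclass, field

# Assumed layout: 9 fixed fields, then repeating groups of 3 fields per sense.
FIXED_FIELDS = 9
SYSTEMS = ("timbl", "svm", "ukb")  # order of the accuracy columns


@dataclass
class LemmaRow:  # hypothetical container, not an official schema
    lemma: str
    pos: str
    sonar_instances: int
    fold_cross_acc: dict  # system name -> accuracy
    all_words_acc: dict   # system name -> accuracy
    manual: dict = field(default_factory=dict)     # sense -> (num manual, num from web)
    automatic: dict = field(default_factory=dict)  # system -> {sense: (num, avg confidence)}


def parse_row(line: str) -> LemmaRow:
    """Parse one '|'-separated line into a LemmaRow (assumes numeric fields)."""
    cells = [c.strip() for c in line.strip().split("|")]
    if cells and cells[-1] == "":
        cells.pop()  # drop the empty cell produced by a trailing '|'

    row = LemmaRow(
        lemma=cells[0],
        pos=cells[1],
        sonar_instances=int(cells[2]),
        fold_cross_acc={s: float(v) for s, v in zip(SYSTEMS, cells[3:6])},
        all_words_acc={s: float(v) for s, v in zip(SYSTEMS, cells[6:9])},
    )

    rest = cells[FIXED_FIELDS:]
    for i in range(0, len(rest) - 2, 3):
        label, second, third = rest[i:i + 3]
        # Split at the first '-' only, so sense ids containing '-' stay intact.
        prefix, _, sense = label.partition("-")
        if prefix == "manual":
            row.manual[sense] = (int(second), int(third))
        else:
            row.automatic.setdefault(prefix, {})[sense] = (int(second), float(third))
    return row
```

For example, `parse_row("huis|noun|1200|0.81|0.85|0.62|0.78|0.80|0.60|manual-s1|30|2|timbl-s1|700|0.91|")` would fill `manual` with one sense and `automatic` with one timbl entry. If the real files contain empty or non-numeric cells, the `int` and `float` conversions would need to be made tolerant.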


