The program SVM implements a machine learning Word Sense Disambiguation (WSD) system based on Support Vector Machines. Features are represented with a bag-of-words model. The program is written in Python and requires no installation, but a few external libraries and resources need to be preinstalled:

1) SONAR corpus
2) Python SVM_light package
3) Pynlpl NLP library

Once packages 2 and 3 are installed, the system can be run simply by calling the main script, both for training and for testing.

For training, the system requires a list of annotated examples as well as the SONAR corpus. These annotated examples have to be provided in an input file with the following format:

tokenid#1 lexicalUnit#1 lemma#1 pos#1 Annotators DateTime
tokenid#2 lexicalUnit#2 lemma#1 pos#1 Annotators DateTime
tokenid#3 lexicalUnit#1 lemma#1 pos#1 Annotators DateTime

There is one line per annotated example, and each line contains six tab-separated fields: the SONAR token identifier, the Cornetto lexical unit identifier, the lemma, the part-of-speech, and two further fields that are not actually used by the program, namely the annotators and the datetime of the annotation.
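As a sketch of how such a file could be read, the snippet below parses the tab-separated format described above. The function name and the dictionary keys are illustrative, not part of the actual program:

```python
import csv

def read_annotations(path):
    """Parse the tab-separated annotation file into dictionaries.

    Only the first four fields are kept, since the annotators and
    datetime fields are not used by the program.
    """
    examples = []
    with open(path, encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 4:
                continue  # skip malformed lines
            token_id, lexical_unit, lemma, pos = row[:4]
            examples.append({
                "token_id": token_id,
                "lexical_unit": lexical_unit,
                "lemma": lemma,
                "pos": pos,
            })
    return examples
```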


For training a WSD system we have to run:

>> pathToOutputFolder pathToTrainFile PartOfSpeech pathToSonar

For instance:

>> /exp/myWSDmodel /exp/data/ n /exp/corpora/sonar

Testing and evaluation

The testing (and evaluation) follows a similar interface:

>> pathToExistingWSDFolder pathToTestFile PartOfSpeech pathToSonar

In this case, pathToExistingWSDFolder corresponds to the folder of a WSD model created by the training step.

The evaluation generates a set of files in the same folder as the WSD model: one .csv file per lemma, plus types.scores.csv with the results for all lemmas and system.result.csv with the overall result of the system.
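A small helper like the following could collect these per-lemma result files after an evaluation run. Since the column layout of the .csv files is not documented here, rows are returned as plain lists keyed by filename; the function itself is a hypothetical convenience, not part of the program:

```python
import csv
import glob
import os

def load_scores(model_folder):
    """Collect rows from every .csv file in a WSD model folder.

    Returns a dict mapping each csv filename to its parsed rows.
    The exact column layout is left to the caller to interpret.
    """
    scores = {}
    for path in glob.glob(os.path.join(model_folder, "*.csv")):
        with open(path, encoding="utf-8") as fh:
            scores[os.path.basename(path)] = list(csv.reader(fh))
    return scores
```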

The Python module

This module can be easily configured by means of some variables in the source code. The main settings can be found in the constructor __init__ of the class SVMclassifier. Two of the most important are:

self.NUMPROC: the number of processes to run in parallel during training
self.__ctxsize: the size of the context around the target word used to generate the bag-of-words features

The method self.__getFeaturesForWordid(…) controls the type of features used, and can be modified to add new ones.
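The bag-of-words extraction itself can be sketched roughly as follows. This is a simplified stand-in for what a method like __getFeaturesForWordid might compute (the real program resolves tokens through the SONAR corpus; the function below just works on a plain token list):

```python
def bag_of_words_features(tokens, target_index, ctx_size):
    """Return the bag of words in a window of ctx_size tokens on
    each side of the target word, excluding the target itself."""
    start = max(0, target_index - ctx_size)
    end = min(len(tokens), target_index + ctx_size + 1)
    window = tokens[start:target_index] + tokens[target_index + 1:end]
    # Lowercase and deduplicate: a bag-of-words feature set.
    return sorted(set(w.lower() for w in window))
```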

(Download in DSC-XML)
