Evaluation of the Named Entity Recognizer (NER)

To evaluate the performance of the Named Entity Recognizer (NER) and Semanticizer (SEM), we took a random sample of 100 documents from the set of documents processed so far and manually evaluated both systems on this sample.

We use a broad definition of entities in our manual evaluation: persons, locations, and organizations, but also books, films, and animal and plant classes. More generally, most phrases that start with a capital letter are considered entities. This definition makes it more challenging for the NER to recognize all entities, on top of the issues caused by the wide range of document types (clean Wikipedia and news text vs. noisy discussion lists, subtitles, and autocues). Partially matching NEs are considered incorrect.

We evaluate the Semanticizer only on those NEs that were correctly recognized by the NER. For example, if for the sentence “Lionel Messi scores 6 goals for FC Barcelona against Real Madrid” the NER recognizes “Lionel Messi” and “FC Barcelona,” but not “Real Madrid,” the Semanticizer is only evaluated on those two entities. The same goes for the situation in which the NER incorrectly identifies “Madrid” as an entity and the Semanticizer then (correctly) links it to Madrid (the capital of Spain); this result is ignored. If the Semanticizer fails to link an entity to Wikipedia even though the entity has a page, it is counted as a false negative.
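
To make these rules concrete, the sketch below shows how NER and SEM precision and recall could be computed under this scheme. It assumes that gold and predicted entities are available per document as mappings from surface form to Wikipedia page (or None); these data structures and names are illustrative only, since the actual evaluation was done manually.

```python
# Minimal sketch of the evaluation rules described above (not the actual evaluation code).
# Each document is a (gold, predicted) pair of {surface form: Wikipedia page or None} dicts;
# these structures and names are assumptions for illustration.

def safe_div(num, den):
    return num / den if den else 0.0

def evaluate(docs):
    ner_tp = ner_fp = ner_fn = 0
    sem_tp = sem_fp = sem_fn = 0

    for gold, predicted in docs:
        for surface, sem_link in predicted.items():
            if surface not in gold:
                ner_fp += 1                # spurious or only partially matching NE
                continue                   # SEM output for NER errors is ignored
            ner_tp += 1
            gold_link = gold[surface]
            # The Semanticizer is scored only on entities the NER got right.
            if sem_link is not None and sem_link == gold_link:
                sem_tp += 1
            elif sem_link is not None:
                sem_fp += 1                # linked, but to the wrong page
            elif gold_link is not None:
                sem_fn += 1                # no link produced although a page exists
        ner_fn += sum(1 for s in gold if s not in predicted)

    return {
        "NER": (safe_div(ner_tp, ner_tp + ner_fp), safe_div(ner_tp, ner_tp + ner_fn)),
        "SEM": (safe_div(sem_tp, sem_tp + sem_fp), safe_div(sem_tp, sem_tp + sem_fn)),
    }
```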

Overall results are as follows:

NER – P: 0.78, R: 0.65

SEM – P: 0.84, R: 0.82



The results per text type are as follows:

# docs  Type             NER: P; R     SEM: P; R     Notes
    42  Wikipedia        0.79; 0.69    0.89; 0.89    1 no WP
    26  discussion list  0.90; 0.33    0.75; 1.00    12 no entities, 6 no WP
     9  web snippet      0.67; 0.67    1.00; 0.25    6 no entities, 1 no WP
     7  autocue          0.73; 0.63    0.96; 0.96    1 no WP
     6  newspapers       0.74; 0.61    0.90; 0.85    –
     4  periodical       0.72; 0.50    0.95; 0.95    1 no WP
     4  e-magazines      0.83; 0.76    0.86; 0.89    –
     1  legal text       0.52; 0.73    0.56; 0.47    –
     1  subtitles        0.82; 0.51    0.54; 0.49    –



The first observation is that a large share of the discussion list documents (46%, i.e., 12 of 26) and especially the web snippets (67%, 6 of 9) contain no entities at all, so the impact of NER on these text types is limited. Other than that, we find that NER works well in general, but precision is low on legal texts and recall suffers for noisy user-generated texts (discussion lists), where capitalization is often missing. Once the NER has identified the correct entities, the Semanticizer achieves very high precision and recall on most text types, except for legal text (very domain-specific) and subtitles (person names referring to TV show characters). Recall is low for web snippets (an artifact of the low number of entities for this type).
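
The percentages above follow directly from the “no entities” counts in the table; a quick check (illustrative snippet, not part of the evaluation code):

```python
# Fraction of documents without any entities, per text type (counts from the table above).
no_entities = {"discussion list": (12, 26), "web snippet": (6, 9)}
for text_type, (empty, total) in no_entities.items():
    print(f"{text_type}: {empty}/{total} = {empty / total:.0%}")
# discussion list: 12/26 = 46%
# web snippet: 6/9 = 67%
```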

(Download NER eval results.zip)
