Abstracts
The booklet with abstracts can be downloaded here.
STIL Thesis Award: Explaining relationships between entities
Nikos Voskarides
Modern search engines are increasingly aiming to understand users’ intent in order to answer information needs more effectively by providing richer information than the traditional “ten blue links”. This information might include context about the entities present in the query, direct answers to questions that concern entities and more. A recent trend when answering queries that refer to a single entity is providing an additional panel that contains some basic information about the entity, along with links to other entities that are related to the initial entity. A problem that remains largely unexplored is how to provide an explanation of why two entities are related. In this work, we study the problem of explaining pre-defined relations of entity pairs with natural language sentences in the context of search engines. We propose a method that first extracts sentences that refer to each entity pair and then ranks the sentences by how well they describe the relation between the two entities. Our ranking module combines a rich set of features using state-of-the-art learning to rank algorithms. We evaluate our method on a dataset of entities and relations used by a commercial search engine. The experimental results demonstrate the effectiveness of our method, which can be efficiently applied in a search engine scenario.
1. Machine Translation-based Language Model Adaptation for Automatic Speech Recognition of Spoken Translations
Joris Pelemans, Tom Vanallemeersch, Kris Demuynck, Lyan Verwimp, Hugo Van Hamme and Patrick Wambacq
We present the results of our work on the SCATE – Smart Computer-Aided Translation Environment – project, which addresses the integration of machine translation and automatic speech recognition for the recognition of spoken translations. We propose a technique that applies language model adaptation on the sentence level based on translation model probabilities of the source language text. We show that omitting language model renormalization after adaptation, as well as applying probability weights, drastically improves the efficiency compared to a similar technique described in the literature. Disk storage per sentence is reduced by ca. 5GB for a 3-gram language model and up to 15GB for a 5-gram language model. Adaptation takes ca. 0.2s per sentence, which enables the integration of the technique into existing translation environments. The effect on recognition accuracy is investigated for both word-based and phrase-based translation models and is combined with tailored models for named entities. The final model achieves a 25.3% relative error reduction compared to a 3-gram baseline without adaptation on a corpus of 167 English-to-Dutch spoken translations.
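As an illustration of the general idea only (a hedged sketch, not the SCATE implementation: the function, the weighting scheme and the data are invented), the snippet below boosts the scores of target words that the translation model predicts for the current source sentence and deliberately skips renormalization over the vocabulary:

```python
# Hedged sketch: sentence-level language model adaptation that raises the
# weight of words predicted by a translation model for the current source
# sentence, without renormalizing over the full vocabulary.
import math

def adapt_lm(base_logprobs, translation_probs, weight=2.0):
    """base_logprobs: dict mapping target words to base LM log10 probabilities.
    translation_probs: dict mapping target words to p(word | source sentence)
    taken from the translation model. Returns adapted (unnormalized) scores."""
    adapted = dict(base_logprobs)
    for word, p in translation_probs.items():
        if word in adapted and p > 0.0:
            # Boost words the translation model expects; no renormalization.
            adapted[word] += weight * math.log10(1.0 + p)
    return adapted

base = {"huis": -3.2, "woning": -4.1, "kat": -3.8}
tm = {"huis": 0.6, "woning": 0.3}
print(adapt_lm(base, tm))
```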
2. Augmenting Recurrent Neural Network Language Models with Subword Information
Lyan Verwimp, Joris Pelemans, Hugo Van Hamme and Patrick Wambacq
Many tasks in natural language processing (e.g. speech recognition, machine translation, …) require a language model: a model that predicts the next word given the history. Traditionally, these models are count-based and assign a probability to a sequence of n words based on the frequency of that sequence in the training text. However, these so-called n-gram models suffer from data sparsity and are not capable of modeling long-distance dependencies. Neural network-based language models (partly) solve data sparsity issues by projecting words onto a continuous space such that better generalizations can be made. Moreover, recurrent neural network language models (RNNLMs) are capable of modeling long-distance dependencies because they have a memory. RNNLMs take as input a one-hot encoded vector of the current word together with a copy of the hidden layer at the previous time step; this input is projected onto a hidden layer, from which a probability distribution for the next word is computed in the output layer. As a result, words that occur in similar contexts end up close to each other in the continuous space. However, since RNNLMs treat words as atoms, it is not possible to capture the formal/morphological properties of words. In the context of the project STON (IWT – INNOVATIEF AANBESTEDEN), we explore the addition of subword information to the input of RNNLMs. In this way, we transform the projection such that not only words occurring in similar contexts but also words with a comparable structure get vector representations that are close to each other.
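A minimal sketch of what adding subword information to the input could look like (an assumed setup, not the STON implementation): the input vector concatenates a one-hot word encoding with binary character-trigram features, so that formally similar words such as "werken" and "werkte" share part of their representation.

```python
# Sketch: RNNLM input vector = one-hot word encoding + character trigram features.
import numpy as np

vocab = ["werken", "werkte", "lopen"]
char_ngrams = sorted({w[i:i+3] for w in vocab for i in range(len(w) - 2)})

def encode(word):
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.index(word)] = 1.0
    subword = np.array([1.0 if ng in word else 0.0 for ng in char_ngrams])
    return np.concatenate([one_hot, subword])

print(encode("werken"))   # shares several trigram features with "werkte"
print(encode("werkte"))
```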
4. Annotating Telugu corpus: Using the BIS-POS Tagset
Viswanatha Naidu
Linguistically-annotated resources are of immense help for processing any language in general and agglutinative languages in particular. This paper describes an on-going effort to tag a Telugu corpus created as part of the Indian Languages Corpora Initiative (ILCI) project, which aims to create a parallel corpus of approximately 27 million tagged and chunked words for 17 Indian languages including English (http://sanskrit.jnu.ac.in/projects/ilci.jsp?proj=ilci). 50,000 sentences have already been annotated manually and another 50,000 are being annotated using a common standardized tagset known as the BIS-POS (Bureau of Indian Standards-Parts of Speech) tagset. There is no published work available on how to tag a Telugu corpus using this BIS-POS tagset. Hence, the paper describes how the Telugu corpus is (to be) tagged and the issues encountered while tagging it. The main aim of the paper is to develop a manual to be used for human annotation tasks and for building automatic systems.
5. Very quaffable and great fun: Applying NLP to Wine Reviews
Els Lefever, Iris Hendrickx and Antal van den Bosch
People find it difficult to name odors and flavors. In blind tests with everyday smells and tastes like orange or chocolate, only 50% are recognized and described correctly. Certain experts like wine reviewers are trained in recognizing and reporting on odor and flavor on a daily basis and they have a much larger vocabulary than lay people. In this research, we want to examine whether expert wine tasters provide consistent descriptions in terms of perceived sensory attributes of wine, both across various wine types and colors. We collected a corpus of wine reviews and performed preliminary experiments to analyse the semantic fields of "flavor" and "odor" in wine reviews. To do so, we applied distributional methods as well as pattern-based approaches. In addition, we show the first results of automatically predicting the "color" and "region" of a particular wine, solely based on the reviewer's text. Our classifiers perform very well when predicting red and white wines, whereas it seems more challenging to distinguish rosé wines.
6. Exploring the Realization of Irony in Twitter Data
Cynthia Van Hee, Els Lefever and Véronique Hoste
Handling figurative language like irony is currently a challenging task in natural language processing. However, understanding irony is of key importance if we want to push the state of the art in NLP tasks such as sentiment analysis. In this research, we present the construction of a dataset of ironic tweets for English and Dutch and a new annotation scheme for the textual annotation of verbal irony in social media texts. Furthermore, we present inter-annotator agreement results and some statistics on the annotated corpus. From these preliminary results we can conclude that the detection of contrasting evaluations might be a good indicator for recognizing irony in social media text.
7. Intricate Natural Language Processing made easier with Symbolic Computation software: Pattern Matching lessons from Bracmat
Bart Jongejan
An important routine in Natural Language Processing is searching for patterns in tree structured data. However, the libraries and tool-kits for querying such data do not give us the same control as programming languages, thus limiting us in what we can ask. Many interesting queries exceed the capacities of query tools, such as discovering repeating groups of labels within constituents in a treebank. This task can be solved with traditional programming languages, but it is hard. A high level solution starts with the observation that repetition is a pattern. The task is much easier if we can use tree pattern matching, associative pattern matching, and, because the pattern must be updated repeatedly during pattern matching, also expression evaluation during pattern matching. Term rewriting systems and some functional languages feature pattern matching to decompose data structures and sometimes even allow expression evaluation in the midst of a pattern matching operation, but they seldom master the art of associative pattern matching. Associative pattern matching is almost exclusively the territory of regular expressions (RE), but REs are restricted to arrays of characters. Sufficiently powerful pattern matching can only be implemented by treating it as a first class programming construct and not as an embedded pattern language in which basic language expressions cannot be evaluated. To illustrate the point, I will present Bracmat, a pattern matching programming language that evolved from a symbolic computation system to a versatile NLP tool. Bracmat has been used in areas from validation of Dutch corpora to workflow management.
8. UGENT-LT3 SCATE System for Machine Translation Quality Estimation
Arda Tezcan, Veronique Hoste, Bart Desmet and Lieve Macken
We report on the submission of the UGENT-LT3 SCATE system to the WMT15 Shared Task on Quality Estimation (QE), viz. English-Spanish word- and sentence-level QE. We conceived QE as a supervised Machine Learning (ML) problem and designed additional features, which we combined with the baseline feature set to estimate quality. The sentence-level QE system re-uses the word-level predictions of the word-level QE system. We experimented with different learning methods and observe improvements over the baseline system for word-level QE with the use of the new features and by combining learning methods into ensembles. For sentence-level QE we show that using a single feature based on word-level predictions can perform better than the baseline system and that using this in combination with additional features led to further improvements in performance.
9. A novel approach to aspect extraction taking into account linguistic structure.
Stéphan Tulkens and Walter Daelemans
One of the current open problems in sentiment analysis is the extraction of aspects, that is, given that we know that some phrase expresses positive or negative sentiment, can we also discover to what target this sentiment applies? Current approaches in aspect extraction use state of the art machine learning approaches to facilitate extraction, but, due to issues like data sparseness, fail to pick up on the syntactic regularities behind the appearance of aspect. In this talk we will highlight some ways in which current approaches to aspect extraction have been lacking, and show that a proper formalization of the concept of aspect lies at the heart of solving these issues. In addition to this, we present our own formalization and framework for analyzing aspects, showing that incorporating more linguistics into the framework leads to a stronger hypothesis. Finally, we show that using this framework with an already existing approach to aspect extraction leads to improved results for a novel Dutch dataset, but does not improve results for a comparable English dataset, which suggests that there might be a difference in how these languages express aspect.
10. Detecting racism in Dutch social media posts
Stéphan Tulkens, Lisa Hilte, Elise Lodewyckx, Ben Verhoeven and Walter Daelemans
We report on an exploratory study on automatic racism detection in Dutch social media posts, using a supervised classification method. First, we retrieved Dutch comments from two public Belgian Facebook pages which are notorious for attracting racist comments and reactions. All posts were annotated as being racist or not by two annotators, with a third annotator as tiebreaker. Because of the diverse array of racism on display, we used a very broad definition of racism, including negative or stereotypical utterances based on culture or religion. These annotations were then used as a gold standard in a supervised classification approach. The model classifies utterances as either racist or non-racist based on various linguistic features, both stylistic and content-based. We also attempt to use a novel set of Dutch word embeddings, created from a cross-genre corpus, to aid classification. As a test set, we retrieved and annotated additional posts from the same two public Facebook pages. The most relevant features are explicit racist words and expressions, which could indicate that the racist opinions in the training corpus are expressed in an open way, and that it is content, not style, which unifies the racist utterances. A possible downside to this, however, is that these content-based features might generalize badly to new data, as they are strongly connected to the target of the racist opinions.
11. Wikification for Implicit MT evaluation
Iris Hendrickx and Antal van den Bosch
Implicit MT evaluation aims to estimate the quality of an MT system without using a reference translation. Instead, we aim to detect and compare topical information elements (named entities, events, specific terms) in source texts and their generated translations by using Wikification. We make use of the fact that most Wikipedia pages have translations in many other languages. Wikification aims to identify topics in a document and link them to their corresponding Wikipedia page. We apply the following method to use wikification for implicit MT evaluation. We first align source and translated data at the phrase level using Giza++ (Och and Ney, 2003). We apply a Wikifier (Ratinov et al., 2011) to find and link the topics in the translated data to their relevant Wikipedia pages. We use the alignment between source and translation to get the corresponding translation of the topics in the source language. We check whether this translated topic corresponds to the same Wikipedia page in the source language; when a match is found, we count this as a correct topic translation, or as an error when no matching page is found. Using name translation as a measure for overall MT quality has been suggested before and has been shown to correlate well with human MT judgements (Hirschman, 2000); with wikification we aim to generalize this technique. These experiments are carried out in the context of the TraMOOC project (http://tramooc.eu/) that aims to automatically translate online course material of MOOCs into 11 different languages. In this presentation we will show our first results on English source material being translated to Dutch, German, and Portuguese. We discuss the coverage and accuracy of this approach and provide a qualitative error analysis.
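The matching step can be summarized with the following hedged sketch (the data structures and names are assumptions; the actual system builds on Giza++ and the Ratinov et al. Wikifier): a topic found in the translation counts as correct when the cross-language link of its Wikipedia page matches the page found for the aligned source phrase.

```python
# Sketch of the topic-matching step for implicit MT evaluation.
def implicit_mt_score(translated_topics, source_topics, alignment, langlinks):
    """translated_topics: {target_phrase: wikipedia_page_in_target_language}
    source_topics: {source_phrase: wikipedia_page_in_source_language}
    alignment: {target_phrase: aligned_source_phrase}
    langlinks: {target_page: corresponding_source_page}"""
    correct = 0
    for phrase, page in translated_topics.items():
        src_phrase = alignment.get(phrase)
        expected = source_topics.get(src_phrase)
        if expected is not None and langlinks.get(page) == expected:
            correct += 1
    return correct / len(translated_topics) if translated_topics else 0.0

print(implicit_mt_score({"Den Haag": "Den_Haag"}, {"The Hague": "The_Hague"},
                        {"Den Haag": "The Hague"}, {"Den_Haag": "The_Hague"}))
```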
12. BlackLab: advantages and disadvantages of using Lucene for a linguistic corpus search system
Jan Niestadt
BlackLab is an open source linguistic corpus search system built on the high-performance search engine Apache Lucene. It supports flexible indexing and querying of linguistically annotated corpus material. It handles multiple input formats and multiple query languages. It enables incremental indexing and provides a Java API, web service access, and a search front-end. It is designed for ease of use. We are working to integrate it with search servers like Apache Solr and ElasticSearch. BlackLab's matching algorithm relies heavily on Lucene's reverse index. In this talk we will show the advantages of this approach, as well as the interesting challenges it presents. We will focus on the hybrid search strategy we are currently working on, which combines information from the reverse and forward index to enhance performance. We will conclude with a brief discussion of future work on enabling distributed search and querying treebanks and other hierarchical structures.
13. Evaluation and analysis of term scoring methods for keyphrase extraction
Suzan Verberne, Maya Sappelli and Wessel Kraaij
We evaluate five unsupervised term scoring methods on four different keyphrase extraction tasks: author profiling, query suggestion, query expansion and terminology discovery. Keyphrases (keywords) are short phrases that describe the main topic of a document or a document collection. Manually assigning keywords to documents is an expensive process, especially when large collections are involved. The goal of automatic keyphrase extraction is to extract and rank the most descriptive terms from a document or a document collection. Many methods for keyphrase extraction have been proposed in the literature. However, it is as yet unclear how these methods compare to each other, how they perform on different tasks than the one they were designed for, and which method is the most suitable for a given (new) task. We analyse three aspects of keyphrase extraction tasks: the size of the document (collection) from which the keyphrases are extracted, the type of background collection that is used to find the most informative terms and the importance of multi-word phrases. In a series of experiments, we evaluate, compare and analyse the output of the term scoring methods for the tasks at hand. We found that the most important factors in the success of a term scoring method are collection size and the importance of multi-word terms in the domain. For modelling small collections (up to 10,000 words), the best performing method is Parsimonious Language Models (PLM). For collections larger than 20,000 words, the best performing method is Pointwise Kullback-Leibler Divergence. For this latter method, we introduced the parameter gamma with which the proportion of multi-word terms in the output can easily be tuned. We conclude that evaluation of term scoring methods is a challenge: not only characteristics of the collection but also the task and the evaluation method determine the success of a method.
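For reference, the basic Pointwise Kullback-Leibler Divergence term score can be sketched as below (the authors' gamma parameter for tuning the proportion of multi-word terms is left out, and the smoothing is a crude placeholder): a term scores high when it is frequent in the foreground collection and rare in the background collection.

```python
# Sketch of pointwise KL divergence term scoring: p_fg * log(p_fg / p_bg).
import math
from collections import Counter

def pointwise_kl(foreground_tokens, background_tokens):
    fg, bg = Counter(foreground_tokens), Counter(background_tokens)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, count in fg.items():
        p_fg = count / fg_total
        p_bg = bg.get(term, 0.5) / bg_total   # crude smoothing for unseen terms
        scores[term] = p_fg * math.log(p_fg / p_bg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(pointwise_kl("wine tasting wine aroma".split(),
                   "the cat sat on the mat wine".split())[:3])
```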
14. Querying Parallel Treebanks with GrETEL
Vincent Vandeghinste, Liesbeth Augustinus and Tom Vanallemeersch
We describe GrETEL for Parallel Treebanks, an online tool which enables syntactic querying in parallel treebanks. We have parsed the Dutch and English sides of Europarl version 7 with Alpino and the Stanford parser respectively. We have word-aligned these corpora with GIZA++ and subtree-aligned them with the Dublin Tree Aligner and Lingua:Align. We provide freely available online access to this parallel treebank for Dutch and English, allowing users to query the treebank using either XPath expressions or an example, looking for similar constructions and their equivalents in the target language.
15. Sustainability report readability prediction: an NLP expansion on the tried-and-true formulae
Nils Smeuninx, Orphée De Clercq, Véronique Hoste and Bernard De Clerck
In this presentation we will discuss how feasible it is to employ readability prediction techniques to distinguish between sustainability reports and assess their accessibility to a wider, non-shareholder audience. Though corporate reporting strives for objectivity, corporate narrative inevitably – sometimes intentionally – retains some subjectivity, at times even manipulation. The ARCHIVE project aims to analyse whether different language varieties express this subjectivity differently. It examines how corporate reports across four industries (mining, oil, textiles and semiconductors) and five English-speaking regions (British, American, European, Australian and Indian) differ in their use of understandable language and how these features relate to (financial and non-financial) company performance as attested by Thomson Reuters' Datastream and ASSET4 databases. We assembled a corpus of 312 sustainability report samples (extracted from 163 reports) and presented it to experts (i.e. linguists-in-training) for readability analysis. We compare these assessments to various techniques for readability prediction ranging from classical formulae (e.g. Flesch Reading Ease, Flesch 1948) to assessment by a state-of-the-art readability prediction system (De Clercq et al. 2014) able to assess generic text using both shallow (e.g. word length, type/token ratio) and deep (e.g. parse tree depth, coreference) linguistic characteristics. We show how finer-grained readability analysis, be it through deeper-level syntactic analysis or machine learning, can supplement – and in some cases supersede – the tried-and-true readability formulae that have hitherto dominated readability research in accounting studies (e.g. Courtis 1995).
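For reference, the classical Flesch Reading Ease formula mentioned above is 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word); the sketch below implements it with a rough vowel-group heuristic for syllable counting (the heuristic, not the formula, is an approximation).

```python
# Flesch Reading Ease (Flesch 1948), with an approximate syllable counter.
import re

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    n = len(words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

print(round(flesch_reading_ease("We report strong growth. Results improved."), 1))
```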
16. Multimodal distributional semantic models and conceptual representations in sensory deprived subjects
Giovanni Cassani and Alessandro Lopopolo
Multimodal Distributional Semantic Models (DSMs) are increasingly used because they provide perceptually-grounded semantic representations, which have improved many Natural Language Processing (NLP) tasks, and provided useful information about the cognitive mechanisms underlying semantic representations. The goal of this study is twofold: i) we provide further evidence about the goodness of such multimodal distributed representations in approximating human semantic organisation; ii) we exploit them to test predictions from embodied cognition. To do so, we rely on feature-norms collected with a feature listing task involving congenitally blind and sighted people and we compare semantic similarity across pairs of concepts, as extracted by an image-based DSM, a sound-based DSM, and two norms-based DSMs, one for each group. Multimodal DSMs encode semantics by leveraging perceptual information extracted from labelled images and sounds. In doing so, they approximate the modality-specific perceptual information that is likely to contribute to the semantic representation of concrete concepts. Feature-norms provide data about the way blind and sighted people organise their semantic knowledge, encoding differences in perceptual experience. Under embodied cognition, we would expect no differences between groups when the norms-based DSMs are compared to the sound-based DSM, since acoustic information is available to both groups. On the contrary, visual information should approximate concept relatedness worse in blind subjects, since they cannot access this modality. Thus, we tested whether this deprivation has an impact on the way semantic spaces are organised, and looked at how those models approximate modality-general or modality-specific norms representations. We observed that the information extracted from images correlates better with norms produced by sighted subjects, while the sound-based model better approximates blind subjects' semantic space, falsifying the starting hypothesis. The magnitude of the differences is similar, suggesting that a possible compensation might be at work.
17. Natural Language Generation from Pictographs
Leen Sevens, Vincent Vandeghinste, Ineke Schuurman and Frank Van Eynde
Being unable to access ICT is a major form of social exclusion. For people with Intellectual or Developmental Disabilities (IDD), the use of social media or applications that require the user to be able to read or write well, such as email clients, is a huge stumbling block if no personal assistance is given. There is a need for digital communication interfaces that enable written contact for people with IDD. We present a first version of a Pictograph-to-Text translation system. It provides help in constructing Dutch textual messages by allowing the user to introduce a series of pictographs and translates these messages into natural language using WordNet synsets and a trigram language model. English and Spanish versions of the tool are currently in development. It can be considered as the inverse translation engine of the Text-to-Pictograph system as described by Vandeghinste et al. (in press), which is primarily conceived to improve the comprehension of textual content. The Pictograph-to-Text translation engine relies on pictograph input. We have developed two different input methods. The first approach offers a static hierarchy of pictographs, while the second option scans the user input and dynamically adapts itself in order to suggest appropriate pictographs. Two different prototypes for this pictograph predictor have been developed so far. The first evaluations show that a trigram language model for finding the most likely combination of every pictograph’s alternative textual representations is already an improvement over the initial baseline, but there is ample room for improvement in future work.
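The core selection step can be illustrated with the hedged sketch below (not the actual engine: the toy trigram model and the exhaustive search over combinations are assumptions; a real system would use a trained language model and beam search): it picks, for a sequence of pictographs, the combination of textual alternatives with the highest trigram score.

```python
# Sketch: choose the best verbalisation of a pictograph sequence with a trigram LM.
from itertools import product
import math

def trigram_logprob(w1, w2, w3, model):
    return model.get((w1, w2, w3), math.log(1e-6))   # back-off to a small constant

def best_verbalisation(picto_alternatives, model):
    """picto_alternatives: list of lists, one list of candidate words per pictograph."""
    best, best_score = None, float("-inf")
    for words in product(*picto_alternatives):       # exhaustive; toy-sized input only
        padded = ("<s>", "<s>") + words
        score = sum(trigram_logprob(*padded[i:i+3], model)
                    for i in range(len(padded) - 2))
        if score > best_score:
            best, best_score = words, score
    return best

toy_model = {("<s>", "<s>", "ik"): math.log(0.2), ("<s>", "ik", "eet"): math.log(0.1)}
print(best_verbalisation([["ik"], ["eet", "voedsel"]], toy_model))
```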
18. The CLARIN Concept Registry
Ineke Schuurman, Menzo Windhouwer and Olha Shkaravska
The CLARIN Concept Registry (CCR; www.clarin.eu/conceptregistry), the open-access, OpenSKOS-based registry replacing ISOcat, has been operational for a few months now. Although using ISOcat had been encouraged by CLARIN, it had its drawbacks: a rich data model combined with a very open update strategy turned out to be too demanding. The new concept registry is much more 'under control': representatives from the various countries involved in CLARIN will check every entry before it becomes visible to the general public, and the model has been made less complex. This model is now based on SKOS (Simple Knowledge Organization System), a W3C recommendation widely in use for concept and vocabulary services. The OpenSKOS system underlying the CCR is also used by various Dutch cultural heritage institutes, e.g., Sound & Vision. They and CLARIN joined forces to develop the next version of the system.
19. Exploiting the existing: Towards a Dutch frame-semantic parser
Chantal van Son
In computational linguistics, frame-semantic parsing refers to the task of automatically extracting frame-semantic structures from text. Whereas most research in this area has focused on English, the Dutch NLP community faces a problem shared by many languages other than English: what are the most efficient ways to develop NLP tools, such as frame-semantic parsers, when there is little or no annotated data available? In this talk I will present an approach that exploits existing interlingual and intralingual mappings between lexical resources for frame-semantic parsing in Dutch. As a starting point, the system takes the output of an NLP pipeline developed in project NewsReader, which includes predicate-argument structures generated by a Dutch PropBank-style semantic role labeler (the SSRL) as well as conceptual information on the predicates generated by a WSD system. From there, it exploits information provided by SemLink, the Predicate Matrix and/or a corpus of cross-annotations for frame and frame element identification. These resources provide mappings between the predicates and roles of FrameNet, PropBank, VerbNet and WordNet, which makes it possible to map the English frames and frame elements from FrameNet onto the Dutch predicates and their arguments (labeled with PropBank roles), provided that there is a way to translate the Dutch predicate to its English equivalent. For this purpose, alignments between the Dutch and English WordNets are used, as well as machine translations. The results show that this approach indeed offers great potential for frame-semantic parsing in a cross-lingual setting.
21. The effects of Semantic Role Labeling on Sub-Sentential Alignment in Parallel Treebanks
Mathias Coeckelbergs
In the last decade, Semantic Role Labeling (SRL) has proven to be an important NLP task with many notable results. In our exploratory study we show that adding SRL can positively affect sub-sentential alignment between treebanks. We added PropBank semantic roles to a small parallel Dutch-English treebank, and used this information as an additional feature for training the discriminative tree aligner Lingua-Align. We compared conditions with manually added semantics and automatically annotated semantics, noting that the former, not surprisingly, outperforms the latter. If we add a dictionary-based word alignment before our test, adding semantic roles does not improve this score, due to the high baseline. Further exploration has shown that results can be improved when grouping certain PropBank roles into a coarser subdivision. We also experimented with the annotation of several sets of semantic roles in the same sentence, which currently does not yet improve our score, but which already gives a basis for further research. We used the Stockholm Tree Aligner software to visualize the changes in alignment. This allowed us to interpret the changes made by the various configurations of features, and to propose additional features and changes for further improvement of alignment.
22. Lexical preferences in Dutch verbal cluster ordering
Jelke Bloem, Arjen Versloot and Fred Weerman
This study discusses the contribution of lexical associations towards explaining the word order variation in Dutch verbal clusters, such as 'heeft gezien' (has seen). There are two grammatical word orders for two-verb clusters, with no clear meaning difference between them. Various factors have been linked to a preference for either of the two word orders. Because human language learning makes use of statistical learning abilities, it might be expected that lexical verbs like 'to see' have an associated word order preference: a word that is more often heard in one of the two possible word orders may also be produced more often in that order. Because any verb could be used in such a construction, data sparseness has made statistical tests of this phenomenon difficult. In a 1.7 million word sample of text, the fifth most frequent verb can only be found 31 times in a verb cluster construction, meaning that we can only reliably test the most frequent of verbs. However, the availability of large, automatically-annotated corpora alleviates this problem. It is now possible to obtain enough data to test lexical associations for many different lexical verbs. Using the method of collostructional analysis, we show that there are statistically significant associations between a majority of lexical verbs and either of the two word orders. Based on this evidence, we claim that these associations can only be explained fully if they are encoded in the lexicon in some way, for example, as links between lexical verbs and word order constructions.
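The association test behind collostructional analysis is typically a Fisher exact test on a 2x2 contingency table of verb-by-order counts; the sketch below shows such a test on invented counts (whether the authors use exactly this variant is not stated in the abstract).

```python
# Sketch: association between one verb and the two cluster orders (Fisher exact test).
from scipy.stats import fisher_exact

def cluster_order_association(verb_in_order1, verb_in_order2,
                              other_in_order1, other_in_order2):
    table = [[verb_in_order1, verb_in_order2],
             [other_in_order1, other_in_order2]]
    odds_ratio, p_value = fisher_exact(table)
    return odds_ratio, p_value

# e.g. 'zien' observed 40 times in the '1-2' order and 10 times in the '2-1' order,
# against 3000 and 2500 cluster tokens for all other verbs (invented numbers):
print(cluster_order_association(40, 10, 3000, 2500))
```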
23. Mapping Leiden: Automatically extracting street names from digitized newspaper articles
Kim Groeneveld and Menno van Zaanen
The Dutch institute “Erfgoed Leiden en Omstreken” (ELO) has developed an interactive map of Leiden that allows people to search for monumental buildings and other interesting geographical entities or properties. We aim to extend this map with functionality that allows users to select a geographical area on the map and search for newspaper articles related to that area. This functionality requires an identification of street names in the collection of digitized newspaper articles. This task falls in the area of Geographical Entity Recognition (GER), which is the subfield of Named Entity Recognition (NER) that identifies geographical entities such as countries, cities, or street names. In this research we evaluate two existing tools in the context of identifying street names: Memory Based Tagger (MBT) and the Stanford Named Entity Recognizer (Stanford NER). Based on an automatically created and manually checked Gold Standard (GS) dataset, both taggers are trained and tested using 10CV. Two experiments are performed. Firstly, learning curves (increasing the amount of training data) are created that show that with more training data, the tools are better at removing false positives (non-street names tagged as street names). Secondly, both systems are trained on data from which annotations of certain street names are removed. This allows us to investigate whether the systems are able to identify street names that are not annotated as street names in the training data. It turns out that both systems are unable to generalize over the training data.
24. The Event and Implied Situation Ontology (ESO)
Roxane Segers and Piek Vossen
We present a new release of the Event and Implied Situation Ontology (ESO), an OWL ontology which formalizes the pre- and post-situations of events and the roles of the entities affected by an event. The ontology relies on SRL-annotated text and focuses primarily on the interpretation of event implications rather than the semantics of the event predicates. As such, the ontology is designed to infer information from text that otherwise would remain implicit. The ontology reuses and maps across existing resources such as FrameNet, WordNet and SUMO. Through these mappings, ESO serves as a hub to other vocabularies as well, such as Princeton Wordnet (PWN) and the Wordnets in the Global Wordnet grid. As such, ESO provides a hierarchy of events and their implications across languages. The ontology consists of 63 event classes with 103 mappings to FrameNet and 46 mappings to SUMO on class level. In total, 58 properties were defined to model the pre- and post-situations of events. For the roles of the entities affected by an event, 131 mappings to FrameNet Frame Entities were created. We show how ESO is designed and employed to assign annotations on millions of processed articles on both predicate and role level, thus allowing for inferencing over various events and implications. First evaluation results on a subset of our corpus show that 50% of the automatically derived events and their participants, typed and enriched with ESO assertions, are correct. Most errors in the ESO events stem from errors introduced by the pipeline used to process the documents. The ontology, the documentation and all mappings to external resources are available at: https://github.com/newsreader/eso
25. A look inside Babelfy: Examining the Bubble
Minh Le and Filip Ilievski
BabelFy has the unique capability of simultaneously performing word sense disambiguation and named entity disambiguation in multiple languages and has been among the state-of-the-art systems for both tasks. However, its implementation is not open-sourced (only APIs are made available). This fact makes investigation of the internal working of the system and analysis of the impact of its components impossible, which hinders further development. Therefore, we set out to re-implement Babelfy and make the source code freely available for the research community and others. In this talk, we will discuss our progress, difficulties and preliminary results.
26. Initial steps towards building a large vocabulary automatic speech recognition system for the Frisian language
Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Frits Van der Kuip, Hans Van de Velde, Frederik Kampstra, Jouke Algra, Henk Van den Heuvel and David Van Leeuwen
Frisian is one of the official languages of the Netherlands and mainly spoken in the province of Fryslân. In the scope of the FAME! Project, we are building a spoken document retrieval system for the archives of the local broadcaster, Omrop Fryslân, which contains more than 2600 hours of radio broadcasts. In this work, we describe the initial steps towards building a large vocabulary automatic speech recognizer (ASR) for the Frisian language that will be incorporated in this retrieval system. The native speakers of Frisian often code-switch in daily conversations due to the extensive influence of the other official language in the province, Dutch. This phenomenon introduces new challenges to the ASR task requiring dedicated handling of the ASR resources such as the bilingual pronunciation dictionary and phonetic alphabet. Moreover, the ASR architecture has to be designed accordingly, e.g. either adopting an all-in-one approach using language-independent modeling or language-dependent models activated based on the output of an embedded language identification system. We will discuss our initial efforts to collect the ASR resources and compare the recognition performance of several different recognition architectures using state-of-the-art acoustic modeling techniques.
27. Distributional bootstrapping with Memory-based learning
Giovanni Cassani, Robert Grimm, Steven Gillis and Walter Daelemans
In this work, we explore Distributional Bootstrapping using Memory-based learning. In language acquisition, this concept refers to the hypothesis that children start breaking into the language by extracting the distributional patterns of co-occurrence of words and lexically-specific contexts. We started by identifying the pitfalls of past accounts and investigated the usefulness of different kinds of information encoded in distributional patterns, of different types of contexts, and of their interaction. In greater detail, we analysed the impact of three pieces of information that children are able to extract, as shown by several experimental studies: i) token frequency, or how many times a distributional cue occurs in the input; ii) type frequency, or the number of different words a cue occurs with; iii) (average) conditional probability of context given word, or the ease with which the occurrence of a specific cue can be predicted given the occurrence of a target word, averaging over the words it occurs with. Moreover, we investigated the information conveyed when i) only bigrams; ii) only trigrams; or iii) both are considered. Using several corpora of child-directed speech from typologically different languages (English, French, Hebrew), we show the impact of distributional information and contexts on cue selection, performed in an unsupervised way. The quality of the selected set of cues is assessed using learning curves resulting from a supervised part-of-speech tagging experiment, performed with Memory-based learning. This way, we do not simply get a picture of the end state, but can also compare the learning trajectories that result from the use of different models and evaluate the specific contribution of each of the different pieces of information and types of context. We show that only certain conditions make learning possible, while others do not lead to any improvement.
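The three cue statistics can be made concrete with the following sketch (the representation of cues as (context, word) pairs is an assumption, not the authors' code): token frequency, type frequency, and the average conditional probability of the context given the words it occurs with.

```python
# Sketch: token frequency, type frequency, and average p(context | word) per cue.
from collections import Counter, defaultdict

def cue_statistics(pairs):
    """pairs: iterable of (context, word) tuples, e.g. (('the', '_'), 'dog')."""
    cue_tokens = Counter()
    cue_words = defaultdict(Counter)
    word_tokens = Counter()
    for context, word in pairs:
        cue_tokens[context] += 1
        cue_words[context][word] += 1
        word_tokens[word] += 1
    stats = {}
    for context in cue_tokens:
        types = len(cue_words[context])
        avg_cond = sum(c / word_tokens[w] for w, c in cue_words[context].items()) / types
        stats[context] = {"token_freq": cue_tokens[context],
                          "type_freq": types,
                          "avg_p_context_given_word": avg_cond}
    return stats

pairs = [(("the", "_"), "dog"), (("the", "_"), "cat"), (("a", "_"), "dog")]
print(cue_statistics(pairs))
```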
28. Extracting present perfects from a multilingual corpus
Martijn van der Klis, Bert Le Bruyn and Henriette de Swart
The perfect is used differently even in closely related languages (e.g. de Swart 2007). We report on preliminary results from our attempt to use corpora to provide new insights into how perfects map onto each other and onto other tenses. We aim to contribute to a linguistic analysis of the perfect and to develop translation tools. Taking inspiration from Loáiciga et al. (2014), we used the Dutch Parallel Corpus (Paulussen et al. 2006) to generate a high quality sample of tense mappings between English, French and Dutch. Our first step was to develop algorithms able to automatically extract perfects from the corpus without extending the metadata already provided. We consequently looked for auxiliary verbs (BE/HAVE, depending on the language) in combination with a past participle. Challenges for the language-specific algorithms included: dealing with words between the auxiliary and the participle; distinguishing between passive presents and active BE present perfects (Dutch/French); distinguishing between the present perfect and the present perfect continuous (English); dealing with the possibility of auxiliary verbs appearing after the past participle (Dutch). The script is available via GitHub (https://github.com/UUDigitalHumanitieslab/time-in-translation). Our second step was to check the data, establish the translation equivalents and investigate the factors that influence the mappings. This second step is different from Loáiciga et al. (2014) who start from a pre-established set of factors. We are convinced that preliminary human exploration of the data may uncover factors previously missed. One of those, prominent in our sample, is the influence of passives for mappings to Dutch.
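A much-simplified illustration of the extraction idea (not the script at the GitHub link above; the tag names and auxiliary list are assumptions) is to scan a POS-tagged sentence for a HAVE/BE auxiliary plus a past participle, allowing intervening material and either ordering of the two.

```python
# Sketch: find auxiliary + past participle pairs in a POS-tagged Dutch sentence.
DUTCH_AUX = {"heb", "hebt", "heeft", "hebben", "ben", "bent", "is", "zijn"}

def find_present_perfect(tagged_sentence):
    """tagged_sentence: list of (token, pos) pairs; 'VVPP' marks past participles
    (the tag name is an assumption here)."""
    aux_positions = [i for i, (tok, _) in enumerate(tagged_sentence)
                     if tok.lower() in DUTCH_AUX]
    part_positions = [i for i, (_, pos) in enumerate(tagged_sentence)
                      if pos == "VVPP"]
    return [(tagged_sentence[a][0], tagged_sentence[p][0])
            for a in aux_positions for p in part_positions]

sent = [("Ik", "PRON"), ("heb", "VAFIN"), ("het", "DET"),
        ("boek", "NOUN"), ("gelezen", "VVPP")]
print(find_present_perfect(sent))   # [('heb', 'gelezen')]
```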
29. Delving Deeper into the (Dutch) Twitter Tribal Language Hierarchy
Hans Van Halteren and Nelleke Oostdijk
It has often been observed that the language use in social media differs markedly from that in more traditional written texts. Investigations of Dutch tweets also clearly support this observation. Such language use is sometimes harshly classified as erroneous. A more friendly approach suggests that we are dealing with a new language variety (here of Dutch), say Twitter Dutch. We agree that what we see in Dutch tweets is not erroneous, but suggest that viewing it as a single new variety is too much of a simplification. Instead, we suggest that the tweets can be assigned to a large number of different varieties. Following the term “tribe” suggested in a 2013 article in the Guardian Datablog, we adopted the term “tribal languages” for these varieties (CTR2015). The choice of tribal language for each specific tweet is primarily determined by its author and its topic, being further influenced by such factors as target audience and communicative goal. Since authors may be part of (potentially temporary) communities, and topics part of topic areas, we expect that the tribal languages form a kind of (multiple) hierarchy. In earlier research, we have shown that broad topic areas indeed show a certain coherence in language use (CTR2015), and that the same is true with relation to author characteristics (CLIN24, CLIN25). In this talk, we will investigate whether our hypothesis holds true when we delve deeper into the hierarchy, focusing on smaller user groups and topic areas.
30. Modeling the impact of contextual diversity on word learning
Robert Grimm, Giovanni Cassani, Walter Daelemans and Steven Gillis
In this presentation, we utilise computational modeling techniques in order to investigate aspects of human language acquisition. We first establish a connection between human word learning and a distributional property of language. Building on this, we then explore the learning mechanism by inducing syntactic categories from artificially generated data. This allows us to control for confounding variables and isolate the impact of a given distributional property on learning. In particular, we quantify the predictability of words via the diversity of their linguistic context, and relate this to learnability in three ways: (1) We find a positive correlation between a word's contextual diversity and its age of acquisition, suggesting that words which appear in more diverse contexts are acquired later. Importantly, this relationship is independent of frequency. (2) The language directed at young children by their caretakers is marked by less diverse contexts than the language adults use to converse among themselves. This is in line with the hypothesis that child-directed speech is easier to learn from than speech exchanged among adults. (3) We show that distributional algorithms (e.g., a simple count-based approach, but also currently popular neural word embedding models) achieve better results on artificially created input data with less diverse linguistic contexts. Taken together, (1) and (2) provide evidence for the impact of contextual diversity on learnability: the less diverse a word's linguistic context, the easier it is to learn that word. A possible cognitive mechanism is suggested by (3): for any given word, if its context is less diverse, a distributional learner will have to keep track of fewer contextual dimensions, which in turn results in faster learning.
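One simple operationalisation of contextual diversity (an assumption chosen here only for illustration, not necessarily the measure used in the study) is the number of distinct left-neighbour contexts a word occurs with, relative to its frequency:

```python
# Sketch: contextual diversity as distinct left contexts per occurrence.
from collections import defaultdict, Counter

def contextual_diversity(tokens):
    contexts = defaultdict(set)
    freq = Counter(tokens)
    for left, word in zip(tokens, tokens[1:]):
        contexts[word].add(left)
    return {w: len(contexts[w]) / freq[w] for w in contexts}

tokens = "the dog saw the cat and the dog ran".split()
print(contextual_diversity(tokens))
```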
31. Breaking the Glass Ceiling: Semantic Approaches to Machine Translation
Paul Van Eecke, Miquel Cornudella and Remi van Trijp
Currently, the most successful machine translation (MT) paradigm is statistical MT (SMT). In this approach, translation models are trained using machine-learning algorithms on large sentence-aligned parallel corpora. These models have a broad coverage and are good at capturing which sequences in one language are most often translated by which sequences in another language. However, they do not attempt to model the semantics of the languages in any way and are not able to capture the richness and subtleties of natural languages needed for achieving human-like translation quality. Therefore, we argue that the rich conceptualisations expressed by natural language grammars should be integrated into MT systems. In this presentation, we will demonstrate a system that combines our state-of-the-art SMT system with two semantic approaches. In the first approach, a processing model (bidirectional computational construction grammar) of the source language is used to extract the meaning of the input utterance. Then, a processing model of the target language is used to produce an output utterance, expressing this meaning representation. In the second approach, the output of the SMT system is collected first. Then, a robust comprehension algorithm in combination with a processing model for the target language is used for correcting the output utterance. Both approaches are implemented in Fluid Construction Grammar and will be demonstrated for English-French and Japanese-English translation. We will show that the first approach is very successful when SMT is not meaning-preserving enough, while the second approach is very effective when the output of SMT is ungrammatical.
32. Skipping meals, skipping words, and how the latter can benefit you and the first just makes you hungry.
Louis Onrust
In this presentation we will revisit some of the earlier talks on Bayesian language modelling, discuss the difficulties that I encountered, and present new results. First I will briefly introduce Bayesian language modelling, and show how it differs from other (popular) language modelling techniques. Secondly, I will discuss the concept of skipgrams, how they relate to n-grams, and show that recently there has been a renewed interest in using skipgrams for language modelling. In the last part I will talk about the intrinsically evaluated experiments, and show the results of training and testing on four corpora (of multiple sizes), to show in-domain and cross-domain effects of the language models. We find that using skipgrams reduces the perplexity, both within- and cross-domain, and we hypothesise about the effects of thresholding, limiting the vocabulary, and using samples of the training data in lieu of complete billion-word corpora.
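The difference between n-grams and skipgrams can be illustrated as follows: a k-skip-n-gram is an n-gram in which up to k tokens in total may be skipped between its elements, so "the ... dog" is still captured when an unseen adjective intervenes.

```python
# Minimal illustration of n-grams versus k-skip-n-grams.
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """All n-grams in which up to k tokens in total may be skipped."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        window = tokens[i + 1 : i + n + k]
        for tail in combinations(window, n - 1):
            grams.add((tokens[i],) + tail)
    return grams

tokens = "the big dog barks".split()
print(ngrams(tokens, 2))                 # contiguous bigrams only
print(sorted(skipgrams(tokens, 2, 1)))   # adds ('the', 'dog') and ('big', 'barks')
```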
33. Learning metrical stress systems using finite-state machines
Cesko Voeten and Menno Van Zaanen
Syllables may be assigned a certain level of stress. Within a word, at most one syllable may have primary stress (strong degree of stress) and multiple syllables may have secondary stress (weaker than primary stress). Languages differ in their preferred stress assignment patterns within words, although this variation across languages is constrained by recurrent linguistic regularities. These patterns can be described using finite-state machines. The research presented here investigates to what extent the regularities found in a sample of stress patterns can be identified by a finite-state-machine-based learner. Starting from stress sequences generated from an acceptor A, the approach relies on the Myhill-Nerode theorem, which allows for the construction of a canonical acceptor, making sure that each sequence leading to state x in the acceptor has the same set of acceptable tails. Given a canonical acceptor and a sufficient sample generated from A, a partition P of the states can be found such that state merging based on P results in an acceptor isomorphic to A. The task is to find the correct partition. We tested the effectiveness of several partitioning approaches on 108 finite-state acceptors describing a wide range of stress systems of natural (and some synthetic) languages from StressTyp2 (http://st2.ullet.net/). These approaches merge states when similar context (to the left and/or right of the state) is found. Additionally, we varied the amount of training data. The results show that these approaches allow for learning of a limited range of stress patterns.
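The Myhill-Nerode idea on a finite sample can be sketched as follows (a simplification of the partitioning approaches tested in the talk; the stress strings are invented, with 1 for primary, 2 for secondary and 0 for no stress): prefixes are grouped into one state whenever they are followed by exactly the same set of tails in the sample.

```python
# Sketch: group prefixes of a finite sample by their tail sets (Myhill-Nerode).
from collections import defaultdict

def canonical_states(sample):
    prefixes = {word[:i] for word in sample for i in range(len(word) + 1)}
    tails = defaultdict(set)
    for prefix in prefixes:
        for word in sample:
            if word.startswith(prefix):
                tails[prefix].add(word[len(prefix):])
    # Each group of prefixes with identical tail sets is one state of the
    # canonical acceptor for the sample language.
    groups = defaultdict(list)
    for prefix, ts in tails.items():
        groups[frozenset(ts)].append(prefix)
    return list(groups.values())

sample = ["10", "010", "0010"]          # e.g. penultimate primary stress
for state in canonical_states(sample):
    print(sorted(state))
```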
34. Automatic detection and correction of preposition errors in learners’ Dutch
Lennart Kloppenburg and Malvina Nissim
In this work we address the automatic detection and correction of preposition errors in essays written in L2 Dutch by leveraging native data. Using Support Vector Machines on the Lassy Large corpus, which is supposed to exhibit correct preposition usage, we created language models for the 15 most frequent prepositions in Dutch using 20M sentences (a multiclass model of preposition selection), and for preposition presence or absence, using 2M sentences (a binary model of preposition detection). For both models, we used a set of features based on token and POS n-grams and dependency relations obtained via the Alpino parser. The binary model predicts whether a context vector has a preposition label or not. The multiclass model predicts a specific preposition given a context vector. These models form a pipeline for detecting and correcting preposition errors of the following three types: (1) insertion (a preposition inserted erroneously), (2) deletion (a preposition omitted erroneously), and (3) substitution (the wrong preposition chosen). The models were evaluated on native and learners' data. On L1 data, the binary and selection models achieve F-scores of 100% and 75%, respectively. Because the learner corpus (Leerdercorpus Nederlands) is not error-annotated, we used crowdsourcing to evaluate performance. We gathered human judgements for 1,499 cases and compared the system's decisions on them with the annotators' choices. Of all substitution errors identified by the annotators, 90% (n=72) were found by the system, with a precision of 13%. Out of these 90%, 60% were appropriately corrected. For insertion errors (n=29), recall was 23% and precision 21%. Only four deletion errors were annotated, of which 3 were detected and 1 appropriately corrected. In the talk, results will be discussed in detail, including an assessment against baselines and a comparison between L1 and L2 data.
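A hedged sketch of the two-stage pipeline (the bag-of-words features and tiny training sets below are toys; the real models use token/POS n-grams and Alpino dependency features on millions of sentences): a binary detector first decides whether a context needs a preposition, and a multiclass selector then picks one.

```python
# Sketch: binary preposition detection followed by multiclass preposition selection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

detection_contexts = ["ik wacht __ de bus", "ik zie de bus"]
needs_preposition = [1, 0]
selection_contexts = ["ik wacht __ de bus", "ik woon __ Utrecht"]
chosen_preposition = ["op", "in"]

detector = make_pipeline(CountVectorizer(), LinearSVC()).fit(
    detection_contexts, needs_preposition)
selector = make_pipeline(CountVectorizer(), LinearSVC()).fit(
    selection_contexts, chosen_preposition)

context = "zij wacht __ de trein"
if detector.predict([context])[0] == 1:
    print(selector.predict([context])[0])   # expected: 'op'
```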
35. Inducing multi-sense word representations multilingually
Simon Suster, Ivan Titov and Gertjan Van Noord
We will present our ongoing work on unsupervised learning of sense-specific word embeddings in which the sense estimation step is facilitated by the information from another language that is accessible through word alignment. Our model consists of two parts: an encoder that predicts senses for a pivot word based on the contextual information from the source as well as the aligned language; and a reconstruction component aiming to predict context words relying on the sense representation of the pivot. The components are estimated jointly to minimize the reconstruction error. We will report the results of an evaluation on several semantic similarity datasets.
36. Representing and Implementing Constructions in Fluid Construction Grammar
Paul Van Eecke, Luc Steels, Miquel Cornudella and Remi van Trijp
Fluid Construction Grammar (FCG) (Steels 2011, 2012) is a flexible, fully operational, computational platform for developing grammars from a constructional perspective. It is designed to be as neutral as possible with respect to the linguistic theories that one might want to explore. The platform includes mechanisms for representing grammars and for using these grammars for language comprehension (mapping from an utterance to a meaning representation) and production (mapping from a meaning representation to an utterance). Recently, Steels (forthcoming) has introduced a new notation for FCG. This new notation is easier to write in and clearer to understand, as it orders information in a more natural way, and handles default behaviour in a more intuitive way. Here, we will discuss some of the implementation issues encountered when using the new notation. We will show how a grammar can be created, how constructions can be defined and visualised in the new web-interface, and how grammars can be used in processing. The source code of the implementation is freely downloadable from http://www.fcg-net.org/.
37. Transcriptor: a transcription app for the Cyrillic script
Martin Reynaert, Pepijn Hendriks and Nicoline van der Sijs
Names are of the essence in the media, but display great variability in spelling and even script, owing to the variety of languages and each language's individual pronunciation system. In an NWO Kiem project we built an app that focuses on recommending Dutch spellings for originally Cyrillic names to news reporters, who more often than not first encounter new Slavic names in English, French or German newswire rather than in the original Cyrillic script. The project's user groups are ANP, NOS, VRT and the Dutch Language Union, who all strive for more conformity in news reports. The system is partly based on a combination of rule-based transliterators. These emulate the transliteration systems used in the countries themselves (e.g. on signposts and passports), systems that are used in the scientific world and libraries, and transliterations intended for a general audience in other countries. The latter are mainly based on pronunciation, and therefore differ by language. This means names are usually spelled differently across and even within languages. With anagram hashing, we perform a speedy fuzzy lookup through the names databases of JRCnames for personal names and Geonames for place names. In combination with the rule-based systems, this gives the user corpus-based evidence and guidance. We intend to unveil the CLAM-based Transcriptor web application at CLIN.
38. DISCOSUMO: Summarizing forum threads
Sander Wubben, Suzan Verberne, Antal van den Bosch and Emiel Krahmer
We present the DISCOSUMO project. In this project, we develop a computational toolkit to automatically summarize discussion forum threads. We briefly present the initial design of the toolkit, the data that we work with and the challenges we face. Discussion threads on a single topic can easily consist of hundreds or even thousands of individual contributions, with no obvious way to gain a quick overview of what kind of information is contained within the thread. We address the summarization of forum threads with domain-independent and language-independent methodology. We aim to evaluate our system on data from four different web forums, covering different domains, languages and user communities. Our approach will be largely unsupervised, using recurrent neural networks. We show preliminary results of the first component of the summarization pipeline: encoding forum posts to fixed length vectors. Evaluation of the first version should point out where in the pipeline supervised techniques and/or heuristics are required to improve our summarization toolbox. If successful, the automatic summarization of discussion forum threads will play an important role in facilitating online discussions in many domains.
39. Cross-lingual transfer of a semantic parser via parallel data
Kilian Evang and Johan Bos
To date, semantic parsers that map text to logical form mostly exist for restricted domains. Combinatory Categorial Grammar (CCG) has yielded promising results for moving semantic parsing into more open domains, but most such systems remain limited to English. We propose a method for transferring an English CCG semantic parser to another language automatically. We assume access to parallel data and a part-of-speech tagger for the target language. Annotation with syntax or meaning representations is not required but instead provided by the existing source-language system. Our approach starts from the automatically derived English CCG derivations and n-best word alignments for the target-language sentences and generates a set of candidate CCG categories for target-language words and multiwords. A shift-reduce parser with a perceptron model is then trained to produce similar meaning representations for the target language, jointly learning a parsing model and to discriminate the useful lexical categories from the less useful ones. We test our approach on an English-Dutch parallel corpus of mostly simple sentences collected for language learners. Our results show the parser training to significantly outperform a baseline juxtaposing semantic fragments and to acquire some knowledge about Dutch syntax. This includes some structural differences from the source language, which however remain a challenge, as does lexical coverage. We conclude that a basic semantic grammar and lexicon can be bootstrapped by automatic cross-lingual transfer, but conjecture that more language-specific engineering and training is required to increase coverage.
40. 'This concert was an anticipointment': Automatically detecting emotion in open-domain event reports on Twitter
Florian Kunneman and Antal van den Bosch
Social events can spur the contrastive emotional sequence of hopeful anticipation followed by grave disappointment; in short, they are an 'anticipointment'. We aim to identify anticipointment and other sequential patterns of emotions for a wide range of events that are mentioned on Twitter. As many spectators of events share their opinions and emotions on social platforms like Twitter, this information is a valuable source for assessing the quality of these events. However, apart from a handful of case studies, little is known about the (sequence of) emotions that people display when mentioning different types of social events on Twitter. Starting from Lama Events, an open-domain calendar of Dutch events mentioned on Twitter, we set out to automatically analyze the emotions in the event tweets. We describe a hashtag-based approach to model different types of (complex) emotions. These models are applied to a large number of events, with their tweets divided into tweets posted before, during and after event time. We give an analysis of the most dominant types of emotion sequences and the events with the most contrasting emotions. Finally, we will reveal the most anticipointing event of the year.
41. Details from a distance? A Dutch pipeline for event detection
Chantal van Son, Marieke van Erp, Antske Fokkens, Paul Huygen, Ruben Izquierdo Bevia and Piek Vossen
In text interpretation, close reading is sometimes distinguished from distant reading, where the former focuses on precise interpretation of text paying attention to the details and the latter on broad analyses over large amounts of aggregated data. In the NewsReader [1] and BiographyNet [2] projects, we aim to apply the detailed analyses typically associated with close (or at least non-distant) reading to large amounts of textual data. In particular, our analyses attempt to identify who did what, when and where according to the text, aggregating results from different sources. This is achieved in two main steps using open source tools. In the first step, an NLP pipeline analyses individual documents. This pipeline identifies time expressions, semantic roles, named entities and events. The pipeline also applies word sense disambiguation, maps semantic roles to FrameNet, disambiguates named entities by linking them to DBpedia where possible and establishes intra-document event coreference. In the second step, cross-document event coreference is performed on the output of the first step. In this talk, we present the event detection pipeline we are using for Dutch. To our knowledge, this is the first pipeline for Dutch that provides this rich information about events. We provide an overview of results of individual components and show how they interact and can be combined.
43. Discrete-State Autoencoders for Joint Discovery and Factorization of Relations
Diego Marcheggiani and Ivan Titov
We present a method for unsupervised open-domain relation extraction, the task of discovering semantic relations between pairs of entities present in text. In contrast to previous (mostly generative and agglomerative clustering) approaches, our model relies on rich contextual features and makes minimal independence assumptions. The proposed model is inspired by neural autoencoders and, similarly, it is composed of two parts: an encoder and a decoder. Unlike in neural autoencoders, the hidden state of our autoencoder is a latent relation (a categorical variable) instead of a continuous vector, and the encoder and decoder belong to different model families. The encoder is a feature-rich relation extractor, which predicts a semantic relation between two entities. The decoder is a factorization model, which reconstructs the arguments (i.e., the entities), modeled as vector embeddings, relying on the predicted relation. The two components are estimated jointly so as to minimize errors in recovering arguments. This framework allows us to both exploit rich features (in the encoding component) and capture interdependencies between arguments (in the decoding component) thanks to expressive factorization functions. For the decoder component we rely on factorization models inspired by previous work in relation factorization and selectional preference modeling. We evaluate our model on the New York Times dataset with entity-pair relations aligned with the Freebase knowledge base. Empirical results show that our models substantially outperform a generative baseline (Rel-LDA) and achieve state-of-the-art performance.
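As a rough, simplified sketch of this model family (not the authors' implementation), the PyTorch snippet below marginalises over a categorical latent relation while reconstructing one argument from the other with a bilinear factorization; the real model scores both arguments and uses richer features and different estimation details. All sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAutoencoder(nn.Module):
    """Encoder: features -> distribution over K latent relations.
    Decoder: bilinear factorization reconstructing entity e1 given (relation, e2)."""
    def __init__(self, n_feats, n_entities, n_relations, dim):
        super().__init__()
        self.encoder = nn.Linear(n_feats, n_relations)
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Parameter(torch.randn(n_relations, dim, dim) * 0.01)

    def forward(self, feats, e1, e2):
        log_p_r = F.log_softmax(self.encoder(feats), dim=-1)          # (B, R)
        v2 = self.ent(e2)                                             # (B, D)
        proj = torch.einsum('bd,rde->bre', v2, self.rel)              # (B, R, D)
        scores = torch.einsum('bre,ne->brn', proj, self.ent.weight)   # (B, R, N)
        log_p_e1 = F.log_softmax(scores, dim=-1)
        # marginalise the latent relation: log sum_r p(r|x) p(e1|r, e2)
        ll = torch.logsumexp(log_p_r.unsqueeze(-1) + log_p_e1, dim=1) # (B, N)
        return -ll.gather(1, e1.unsqueeze(1)).mean()                  # reconstruction NLL

model = RelationAutoencoder(n_feats=1000, n_entities=5000, n_relations=30, dim=50)
feats = torch.randn(4, 1000)
e1, e2 = torch.randint(0, 5000, (4,)), torch.randint(0, 5000, (4,))
loss = model(feats, e1, e2)
loss.backward()
```

In practice one would also reconstruct e2 symmetrically and replace the full softmax over entities with negative sampling for efficiency.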
44. Word Sense Disambiguation in Text-to-Pictograph Translation
Gilles Jacobs, Leen Sevens, Vincent Vandeghinste, Ineke Schuurman and Frank Van Eynde
We describe the implementation and evaluation of a word sense disambiguation (WSD) tool in a translation system that converts English text messages into sequences of pictographic images. The Text-to-Picto tool for Dutch, English, and Spanish is used on the online communication platform "WAI-NOT" by people who have trouble reading and writing. The translation system relies on WordNets, in which synsets are populated with pictographs. In the original system, many ambiguous words are translated into an incorrect pictograph, because the pictograph is linked to the wrong word sense. The WSD method required for our translation engine must work on general-domain text and use WordNet sense inventories. We opted for the gloss-overlap (extended Lesk) algorithm as described by Banerjee and Pedersen (2002). During translation, each possible WordNet synset of every content word in the input sentence receives a disambiguation score. This score, alongside other parameters, is used in a path-finding algorithm to determine the optimal pictograph sequence during translation. This implementation approach is easily generalised to other sense labelling algorithms, such as an SVM-based WSD tool for Dutch (Izquierdo 2015). In an evaluation of the translation output, an improvement over the baseline system without WSD was not obtained. However, we found that WSD works well for ambiguous words for which sufficient pictographs are linked in our lexical-pictorial database.
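For readers unfamiliar with gloss overlap, the following is a minimal Lesk-style scorer over NLTK's WordNet; it simplifies Banerjee and Pedersen's phrase overlaps to single-word overlaps and is unrelated to the Text-to-Picto code base.

```python
# Simplified extended-Lesk scoring with NLTK's WordNet (requires nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def extended_gloss(synset):
    """Words from the synset's gloss plus glosses of directly related synsets."""
    related = synset.hypernyms() + synset.hyponyms() + synset.part_meronyms()
    glosses = [synset.definition()] + [s.definition() for s in related]
    return {w.lower() for g in glosses for w in g.split()}

def lesk_scores(target, context_words):
    """Score each sense of `target` by gloss overlap with the context words' senses."""
    context_bags = [extended_gloss(s) for w in context_words for s in wn.synsets(w)]
    scores = {}
    for sense in wn.synsets(target):
        bag = extended_gloss(sense)
        scores[sense] = sum(len(bag & other) for other in context_bags)
    return scores

scores = lesk_scores("bank", ["river", "water", "shore"])
print(max(scores, key=scores.get))   # expected to prefer the riverbank sense
```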
45. Surfacing Dutch syntactic parses
Erwin Komen
Automatic syntactic parsing for Present-day Dutch can be done either with Frog (van den Bosch et al. 2007) or with Alpino (van der Beek et al. 2002). The former program uses statistical techniques and produces dependency output. The latter is, in essence, rule-based and produces a dependency output that is partly transformed into a constituency parse, since it identifies phrases such as PPs, NPs, subclauses and main clauses. One feature of dependency parses that invokes mixed reactions is the fact that the constituents that are determined can be 'split constituents': they consist of elements that are spread out over the sentence. Linguistic research in areas such as scrambling, extraposition, information structure and 'first-constituent' behaviour needs access to surface-level constituents. While the information needed to determine which groups of words form uninterrupted constituents at the surface level is, in principle, derivable from the Alpino dependency parses, it is also possible to transform a dependency parse into a constituency one that satisfies the criterion that all hierarchically accessible constituents map onto an uninterrupted surface string. I refer to this process as 'surfacing'. I will provide examples that illustrate the need to 'surface' dependency parses, and I will present an algorithm that not only surfaces the parses delivered by Alpino, but also retains the information that makes it possible to reconstruct split surface constituents in a manner that is in line with existing, widely used, other corpora (e.g. the historical English parsed corpora).
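The following is a small illustrative helper (not the algorithm from the talk) showing the core 'surfacing' idea: splitting a possibly discontinuous constituent, given as a set of token positions, into maximal uninterrupted surface spans.

```python
def surface_spans(token_positions):
    """Split a (possibly discontinuous) constituent into contiguous surface spans.
    Example: {3, 4, 5, 9, 10} -> [(3, 5), (9, 10)]."""
    positions = sorted(token_positions)
    spans, start = [], positions[0]
    for prev, cur in zip(positions, positions[1:]):
        if cur != prev + 1:            # gap in the surface string: close the span
            spans.append((start, prev))
            start = cur
    spans.append((start, positions[-1]))
    return spans

print(surface_spans({3, 4, 5, 9, 10}))   # [(3, 5), (9, 10)]
```

A surfacing transformation would introduce one constituent node per span while recording that the spans belong together, so the original split constituent remains reconstructable.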
46. Character-level modelling of noisy microposts
Fréderic Godin, Wesley De Neve and Rik Van de Walle
Over the last few years, word embeddings have been a key ingredient in a lot of NLP applications, thanks to their ability to automatically capture syntactic and semantic relationships between words. Moreover, they alleviate the task of manual feature engineering. However, the most successful word embeddings only use the context of a word to learn. When these algorithms are applied to noisy datasets (e.g., collections of Twitter microposts or other user-generated datasets), they face rapidly expanding vocabularies due to spelling mistakes and slang usage, making the often short context much noisier. Consequently, many words have several variants, adding a lot of extra, less frequent words to the vocabulary and thus hampering the quality of the word embeddings. In this talk, a neural network-inspired architecture will be discussed that models words and sentences starting from the character level. This architecture can be used both to (1) train word embeddings and to (2) build end-to-end systems that perform NLP tasks such as part-of-speech tagging and text normalization. By starting from the character level, the model is able to learn how to deal with frequently occurring spelling mistakes in Twitter microposts.
47. Enriching machine translation input using semantics-based fuzzy matches
Tom Vanallemeersch and Leen Sevens
Computer-aided translation tools allow translators to look up source sentences in a translation memory (TM) and retrieve similar source sentences (fuzzy matches) and their translation. Fuzzy matching can be integrated with machine translation (MT) by pretranslating the matched source parts in order to have the MT system focus on the non-matched parts. Alignment links between a source sentence in the TM and its translation allow for this kind of pretranslation. We compare the MT results produced using the basic type of fuzzy matching, Levenshtein distance, with the results produced using two types of semantics-based fuzzy matching, in the context of English to Dutch translation and vice versa. The first of these types consists of matching based on semantic predicates and roles, while the second type applies lexical semantic information, more specifically WordNet and paraphrase tables, to compare sentences based on the relatedness of words and word groups. We perform the second type of matching through the METEOR metric, which was originally designed for evaluating MT systems.
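As a point of reference for the baseline, here is a minimal token-level Levenshtein fuzzy matcher over a toy translation memory; it is purely illustrative and is not the SCATE implementation.

```python
# Minimal sketch of TM fuzzy matching with a token-level Levenshtein score.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(source, memory):
    """Return the TM entry with the highest 1 - distance / max_length score."""
    src = source.split()
    def score(entry):
        tm_src = entry[0].split()
        return 1.0 - levenshtein(src, tm_src) / max(len(src), len(tm_src))
    best = max(memory, key=score)
    return best, score(best)

tm = [("the committee approved the report", "de commissie keurde het verslag goed"),
      ("the committee rejected the proposal", "de commissie verwierp het voorstel")]
print(fuzzy_match("the committee approved the proposal", tm))
```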
48. Topic-guided token cloud visualisations
Thomas Wielfaert, Kris Heylen, Dirk Geeraerts and Dirk Speelman
Distributional Semantic Models have become the mainstay of large-scale semantic modelling, including Word Sense Induction (see Turney and Pantel 2010 for an overview). Token-level models make it possible to structure individual word occurrences into different senses and uses. Although potentially highly relevant for human interpretation in lexicographic work, the output of these models consists of large similarity matrices which are, as such, uninterpretable to humans. We therefore propose a human-friendly visualisation of token-level models. Using the Twente News Corpus (TwNC, 500M words) we generate a token similarity matrix for a number of Dutch lexical items. By applying a dimension reduction technique, we obtain 2D coordinates for each token and use these to create a simple scatter plot with links to concordances. Although scatter plots are a very intuitive type of visualisation, they pose two challenges here. First, distributional models are completely unsupervised, with no sense annotations that could be added to the plot to facilitate the interpretation. Second, plots for high-frequency word types, with many tokens, quickly become overpopulated. We propose a solution by introducing an extra structuring layer in the plot in the form of topics derived from Latent Dirichlet Allocation (LDA, Blei et al. 2003). For each word type, we train a topic model, and for each topic we create a separate scatter plot. The tokens are then assigned to all topics for which they reach a threshold. As a result, users can navigate the plot by gradually zooming in on tokens representing specific topics.
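A plausible instantiation of this set-up with scikit-learn and matplotlib is sketched below, assuming a precomputed token-by-token similarity matrix (values in [0, 1], ones on the diagonal) and one context-window pseudo-document per token; it is not the authors' visualisation code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def token_plots(similarity, contexts, n_topics=5, threshold=0.2):
    # 2D coordinates from the (symmetric) token similarity matrix
    dist = 1.0 - similarity
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dist)

    # LDA topics over the tokens' context windows (one pseudo-document per token)
    counts = CountVectorizer().fit_transform(contexts)
    doc_topic = LatentDirichletAllocation(n_components=n_topics,
                                          random_state=0).fit_transform(counts)

    # one scatter plot per topic, containing the tokens that reach the threshold
    for t in range(n_topics):
        members = doc_topic[:, t] >= threshold
        plt.figure()
        plt.scatter(coords[members, 0], coords[members, 1])
        plt.title(f"topic {t}: {members.sum()} tokens")
    plt.show()
```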
49. Leveraging psycholinguistic data to evaluate Dutch semantic models
Emiel van Miltenburg
This paper shows how we can use existing Dutch psycholinguistic norms datasets to evaluate semantic models. To this end I have trained a distributional model on over 5 billion Dutch words. We will look at how well this model is able to predict the following:
* Relatedness: which pair of words is more strongly related?
* Similarity: rank pairs of words in terms of their similarity.
* Goodness rankings: how exemplary are words for the category they are in? E.g. is 'apple' more exemplary of the category FRUIT than 'coconut'?
* Outliers: which word doesn't belong in a sequence of words? E.g. for the words 'apple', 'banana', 'cherry', and 'forklift', the latter is a clear outlier.
We will also look at how the performance of this model compares to (path-based) similarity measures in Cornetto. As expected from similar experiments in English, the distributional model outperforms Cornetto on the relatedness task. On the similarity task, we see that Cornetto has the upper hand. Somewhat surprisingly, we also find that the distributional model outperforms Cornetto on the outliers task, while the goodness task is inconclusive. In the talk, we will also take a closer look at the variation within each task. For example: Cornetto seems to be much better at predicting the similarity between different sports than it is at predicting the similarity between fruits.
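To illustrate the outlier task, a few lines with any pre-trained Dutch embedding model suffice; the gensim file name below is a placeholder, not the model used in the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("dutch-vectors.kv")   # hypothetical pre-trained model

def outlier(words):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_sim(w):
        return np.mean([vectors.similarity(w, o) for o in words if o != w])
    return min(words, key=mean_sim)

print(outlier(["appel", "banaan", "kers", "vorkheftruck"]))  # expected: 'vorkheftruck'
```

gensim's built-in doesnt_match method implements essentially the same heuristic.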
50. Constructing viable medical records from noisy source documents
Eva D'Hondt, Brigitte Grau and Pierre Zweigenbaum
Due to privacy concerns and patient confidentiality, there are very few corpora of medical records publicly available for Information Retrieval purposes. Research groups that are interested in biomedical Natural Language Processing or Information Retrieval usually obtain their data sets from collaborations with individual (university) hospitals. However, as each hospital has its own system of gathering and managing patient records, the quality and format of these data sets can vary significantly from hospital to hospital. Consequently, the patient records in these real-life data sets may contain textual noise, conflicting information and high amounts of textual redundancy. Moreover, in the interest of protecting the privacy of current patients, the data sets that are released for research purposes often span older periods and are not necessarily in an electronic format, which necessitates an additional OCR step. All this results in a high level of noise, which in turn has a direct negative impact on NLP performance (Cohen et al. 2014). In this paper we present a method for the automatic construction of clean patient records from unreliable and noisy source documents. Our method is an extension of the well-known Hunt–McIlroy (diff) algorithm in which domain knowledge and document metrics are used to resolve merge conflicts and reconstruct missing information in the text. We evaluate our method on a synthetic noisy data set that was created from Wikipedia articles, and perform a manual evaluation of the medical records that were constructed based on OCR-ed French medical documents in the domain of foetopathology.
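The sketch below gives a much-simplified flavour of diff-based record merging, with an in-lexicon count standing in for the domain knowledge and document metrics that the actual method uses to resolve conflicts.

```python
import difflib

def merge_versions(a_tokens, b_tokens, lexicon):
    """Merge two noisy versions of the same record; on conflicting spans,
    keep the variant with more in-lexicon tokens (a crude stand-in for
    the domain knowledge used in the real system)."""
    merged = []
    matcher = difflib.SequenceMatcher(a=a_tokens, b=b_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(a_tokens[i1:i2])
        elif op == "insert":
            merged.extend(b_tokens[j1:j2])        # only version B has this span
        elif op == "delete":
            merged.extend(a_tokens[i1:i2])        # only version A has this span
        else:                                     # 'replace': a genuine conflict
            cand_a, cand_b = a_tokens[i1:i2], b_tokens[j1:j2]
            score = lambda toks: sum(t.lower() in lexicon for t in toks)
            merged.extend(cand_a if score(cand_a) >= score(cand_b) else cand_b)
    return merged

lexicon = {"the", "patient", "was", "admitted", "with", "chest", "pain"}
a = "the patjent was admitted with chest pain".split()
b = "the patient was admitted w1th chest pain".split()
print(" ".join(merge_versions(a, b, lexicon)))   # the patient was admitted with chest pain
```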
51. Recurrent Neural Networks for Genre-specific Language Generation in Dutch
Tim Van de Cruys
In recent years, neural networks have shown impressive performance on a broad range of natural language processing tasks, such as machine translation and image caption generation. Recurrent neural networks, in particular, are very good at capturing both syntactic and semantic properties of language sequences, resulting in very good perplexity scores for language modeling. In this presentation, we will look at a number of different recurrent neural network architectures for language generation in Dutch. Specifically, we will investigate the application of Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures within an encoder-decoder framework. In a first step, the LSTM and GRU networks are used as an encoder, aiming to construct a broad-scale, general purpose language model for Dutch. In a second step, the networks are then applied as a decoder, aiming to generate language expressions for specific text genres, such as news articles or poetry.
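By way of illustration, a bare-bones word-level LSTM generator in PyTorch is sketched below; it shows the general decoder mechanics only and is not the architecture, vocabulary or data used in the presentation.

```python
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        hidden, state = self.lstm(self.embed(tokens), state)
        return self.out(hidden), state           # logits for the next token

    @torch.no_grad()
    def generate(self, start_token, length=20, temperature=1.0):
        tokens, state = [start_token], None
        inp = torch.tensor([[start_token]])
        for _ in range(length):
            logits, state = self.forward(inp, state)
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            tokens.append(nxt)
            inp = torch.tensor([[nxt]])
        return tokens

# Training minimises next-word cross-entropy; genre-specific generation then
# follows by fine-tuning or conditioning on genre-specific text.
```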
52. Reordering Grammar Induction
Milos Stanojevic and Khalil Simaan
We present a novel approach for unsupervised induction of a Reordering Grammar using a modified form of permutation trees (Zhang and Gildea, 2007), which we apply to preordering in phrase-based machine translation. Unlike previous approaches, we induce in one step both the hierarchical structure and the transduction function over it from word-aligned parallel corpora. Furthermore, our model (1) handles non-ITG reordering patterns (up to 5-ary branching), (2) is learned from all derivations by treating not only labeling but also bracketing as latent variable, (3) is entirely unlexicalized at the level of reordering rules, and (4) requires no linguistic annotation. Our model is evaluated both for accuracy in predicting target order, and for its impact on translation quality. We report significant performance gains over phrase reordering, and over two known preordering baselines for English-Japanese.
53. An editor to quickly produce age-specific texts for children
Thijs Westerveld
Today, children grow up with digital media and spend an increasing portion of their time online. Yet, the amount of online information that is targeted at children is small, and the content that does exist is hard to find. In the main web search engines, this information gets overpowered by the plethora of information that is available for adults. Moreover, the information that is aimed at children often targets them as one homogeneous group, failing to differentiate between children of different ages or comprehension levels. WizeNoze provides a child-friendly technology platform that increases the amount of content available to children, improves access to this information, and targets each child at their own comprehension level. In this demo we will showcase our Content Editor, a tool that helps copywriters, publishers and other organizations adapt texts into age-specific copy for children between the ages of 6 and 15 years. The tool automatically classifies the reading level of a text and indicates which elements of a text need further simplification to make it readable for the indicated target audience. Authors can maintain various versions of the same text for different target audiences. The built-in summarizer helps to reduce the volume of copy for the younger age groups. The tools were initially trained on texts that were labeled by experts, but we are working on mechanisms to let children provide feedback on the texts to further adapt the classification to their needs.
55. Extending Alpino with Multiword Expressions: an initial investigation
Jan Odijk
I investigate whether and how the Dutch Alpino parser can be extended to deal appropriately with flexible Multiword Expressions (MWEs). Though at first I thought this would have to start as a paper exercise, I realized I can actually use PaQu (http://portal.clarin.nl/node/4182) to test out some initial ideas. I created a mini-treebank of sentences containing relevant MWEs with the help of PaQu and can now systematically investigate what is needed for such an extension and how Alpino should be adapted. The treebank is accessible to everyone who logs on to PaQu (and everybody can do that, without a password). I will consider the interaction of MWEs (mostly of the form V+NP) of different types (opaque and idiosyncratically transparent idioms and (simple) support verb constructions) with a wide range of grammatical phenomena, including (for the verb) Verb final, V2, V1, Verb Raising; (for NP) Topicalisation, Wh-questioning, scrambling, independent occurrence; (for N inside NP) modification, relativisation, diminutives, plural, other determiners; and (for V+NP) passive. In this poster and demo, I will explain the idea and show how PaQu aids me in this investigation.
56. SHEBANQ: Annotations based hub of research in the Hebrew Bible
Wido van Peursen, Dirk Roorda and Martijn Naaijer
The database of the Hebrew Bible (ca. 400,000 words) of the Eep Talstra Centre for Bible and Computer (ETCBC) contains a wealth of annotations with linguistic information on the levels of morphemes, words, phrases, clauses, sentences and text-syntactic structures. These annotations are the result of almost four decades of encoding work. Recently the CLARIN-NL project SHEBANQ [1] has made this work available online. It contains various representations of the database with the possibility to save queries as annotations. Other annotation tools serve the integration of research results into the database. The stand-off markup of the Linguistic Annotation Framework (LAF) [2] proved to be a suitable basis for the curation of the database. A new Python tool was developed for handling LAF resources. The new research environment created in SHEBANQ harbours all kinds of research: searching for clause connections, verbal valence patterns, parallel passages, author recognition. The new potential of this research environment and the tools it includes will be presented in the demo. [1] https://shebanq.ancient-data.org [2] http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326
57. Personality traits on Twitter for less-resourced languages
Barbara Plank, Ben Verhoeven and Walter Daelemans
Most existing work on personality prediction focuses on small samples and closed-vocabulary investigations, which limits the applicability of learned models and overall generality of the results. In this paper, we build on recent work that explores the use of social media as a resource for large-scale, open-vocabulary personality detection for English (Plank and Hovy, 2015). We investigate to what extent the proposed method generalizes to other languages. In particular, we examine how it extends to three geographically more confined and less representative languages on Twitter, namely, Italian, German and Dutch. We present a novel corpus of Italian, German and Dutch tweets annotated with Myers-Briggs personality types and gender, compare it to English, and present results on Myers-Briggs personality prediction for all languages.
58. Towards learning domain-general representations for language from multi-modal data
Ákos Kádár, Grzegorz Chrupała and Afra Alishahi
Recurrent neural networks (RNNs) have gained a reputation for producing state-of-the-art results on many NLP tasks and for producing representations of words, phrases and larger linguistic units that encode complex syntactic and semantic structures. Recently these types of models have also been used extensively to deal with multi-modal data, e.g. to solve problems such as caption generation or automatic video description. The contribution of our present work is two-fold: a) we propose a novel multi-modal recurrent neural network architecture for domain-general representation learning, and b) we propose a number of methods that “open up the black box” and shed light on what kind of linguistic knowledge the network learns. We propose IMAGINET, an RNN architecture that learns visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with a shared word embedding matrix. It uses a multi-task objective by receiving a textual description of a scene and concurrently predicting its visual representation (extracted from images using a CNN trained on ImageNet) and the next word in the sentence. Moreover, we perform two exploratory analyses: a) we show that the model learns to effectively use sequential structure in semantic interpretation, and b) we propose two methods to explore the importance of grammatical categories with respect to the model and the task. We observe that the model pays most attention to head words, noun subjects and adjectival modifiers and least to determiners and prepositions.
59. Deep Machine Translation for Dutch: the QTLeap project
Dieke Oele, Gertjan Van Noord and Ondrej Dusek
The QTLeap project (Quality Translation by Deep Language Engineering Approaches) is a collaborative project funded by the European Commission that aims to improve Machine Translation (MT) through deep language engineering approaches, in order to achieve higher-quality translations. Incremental advances in MT research have been obtained by taking advantage of increasingly sophisticated statistical approaches and of fine-grained linguistic features that add to the surface-level alignment on which these approaches are ultimately based. The goal of this project is to contribute to the advancement of quality MT by pursuing an approach that further relies on semantics. We build on the two pillars of language technology, symbolic and probabilistic, and seek to advance their hybridization by exploring combinations of them that amplify their strengths and mitigate their drawbacks. The work is organised around the development of three MT pilot systems that progressively integrate deep language engineering approaches. We present the second pilot system, which has just been completed. The new system shows promising results, outperforming a Moses baseline for the English-Dutch translation direction.
60. Unbiased Expectations for Statistical Machine Translation
Wilker Aziz
State-of-the-art Statistical Machine Translation (SMT) relies on a pipeline of heuristics and independently trained models which are typically combined under a linear model. Inference is performed by means of approximate search techniques (Viterbi and n-best decoding) which rely on heavy pruning. As the field moves towards end-to-end learning frameworks, particularly those which optimise a probabilistic objective, we are often faced with the need for robust estimation of expectations. However, estimating expectations by relying on techniques primarily designed for non-probabilistic linear models leads to arbitrary biases. Ultimately, this misalignment between learning framework and inference technique clouds our understanding of the merits of proposed models, and calls for techniques better aligned with probabilistic inference. Expectations play a major role in training models under probabilistic criteria such as maximum likelihood, maximum a posteriori, and empirical Bayes risk. Expectations also play a crucial role in decoding by probabilistic disambiguation. One can estimate expectations through statistical sampling; however, most interesting models are intractable to represent exactly. Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, offer a way to overcome the tractability issues in sampling, but convergence to unbiased estimates is only guaranteed in the limit of infinitely many samples. Rejection sampling (a Monte Carlo method) offers stronger guarantees and unbiased samples, but it is intractable in high-dimensional settings such as SMT. In this work, expectations are estimated from exact samples obtained by an adaptive rejection sampling technique. Our technique combines naturally with importance sampling to yield improved estimates of a model's partition function, entropy, and derivatives.
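As a toy reminder of why rejection sampling yields unbiased estimates, the snippet below estimates an expectation under an unnormalised one-dimensional target with a uniform proposal; the adaptive, high-dimensional sampler used in this work is of course far more involved, and the target density here is an arbitrary illustrative choice.

```python
import math
import random

def target_unnorm(x):
    """Unnormalised target density on [0, 1] (illustrative choice, max value 1)."""
    return math.exp(-8 * (x - 0.3) ** 2)

def rejection_sample(n, envelope=1.0):
    """Proposal q = Uniform(0, 1); accept x with probability p(x) / (M * q(x))."""
    samples = []
    while len(samples) < n:
        x = random.random()
        if random.random() < target_unnorm(x) / envelope:
            samples.append(x)
    return samples

# Unbiased Monte Carlo estimate of E[f(X)] for f(x) = x**2
xs = rejection_sample(10000)
print(sum(x * x for x in xs) / len(xs))
```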
61. A Multi-Agent Model Approach to Resemanticization in Pronominal Agreement in Dutch
Roxana Radulescu and Katrien Beuls
Do you refer to a table with 'him', 'her', or 'it'? In Flemish, this depends on where you live. In general, there are two possible strategies to tackle this dilemma: the syntactic approach, where a speaker employs the lexical gender of the noun, or the semantic approach, where one makes use of the noun's semantic properties. Due to morphological changes, the syntactic system is becoming increasingly opaque and lexical gender knowledge is being lost. One can observe here a shift towards a more transparent semantic system. In studied spoken Dutch language data, this new mapping seems to be realized over the individuation hierarchy defined by Audring (2006), where concepts are clustered based on their level of individuation (e.g., animates, concrete objects and bounded abstracts, unbounded abstracts and masses). In the context of cultural evolution, we propose a multi-agent model that simulates this shift through pair-wise interactions between homogeneous individuals of a population. The interactions are carried out through sequential language games, grounded within a context and driven by a communicative goal. We explore various gender mapping and learning mechanisms that allow the agents to form a new agreement system using their semantic knowledge about the world. We investigate whether our strategies can yield cohesive clusterings over the semantic space. We observe that the system reaches full convergence in terms of gender preference at the population level and that there are multiple successful ways of dividing the semantic space, including the individuation hierarchy.
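A toy simulation in the spirit of such language games (not the actual model) is sketched below: agents align on a pronoun per semantic class through pair-wise interactions with lateral inhibition. The classes, update rule and parameters are all illustrative assumptions.

```python
import random

CLASSES = ["animate", "concrete_object", "bounded_abstract", "mass"]
GENDERS = ["hij", "zij", "het"]

class Agent:
    def __init__(self):
        # one preference score per (semantic class, pronoun) pair
        self.scores = {c: {g: random.random() for g in GENDERS} for c in CLASSES}

    def choose(self, cls):
        return max(self.scores[cls], key=self.scores[cls].get)

    def update(self, cls, gender, delta, inhibit=False, step=0.1):
        self.scores[cls][gender] += delta * step
        if inhibit:                                   # lateral inhibition of competitors
            for g in GENDERS:
                if g != gender:
                    self.scores[cls][g] -= step

def play(agents, rounds=20000):
    for _ in range(rounds):
        speaker, hearer = random.sample(agents, 2)
        cls = random.choice(CLASSES)
        gender = speaker.choose(cls)
        if gender == hearer.choose(cls):              # communicative success
            speaker.update(cls, gender, +1, inhibit=True)
            hearer.update(cls, gender, +1, inhibit=True)
        else:                                         # failure: speaker backs off, hearer aligns
            speaker.update(cls, gender, -1)
            hearer.update(cls, gender, +1)
    return {c: max(GENDERS, key=lambda g, c=c: sum(a.scores[c][g] for a in agents))
            for c in CLASSES}

print(play([Agent() for _ in range(20)]))             # per-class dominant pronoun
```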
62. Semantic Relatedness and Textual Entailment via Corpus Patterns
Ngoc Phuoc An Vo and Octavian Popescu
In recent years, two tasks that involve meaning processing, namely Semantic Relatedness (SR) (Marelli et al., 2014b) and Textual Entailment (TE) (Dagan et al., 2006), have received particular attention. In various competitions, both TE- and SR-related tasks have been proposed, and useful benchmarks have been created. Yet, until 2014, no corpus annotated with both SR and TE labels was available. At SemEval 2014, the SICK corpus, containing both SR and TE annotations for the same pairs, was released (Marelli et al., 2014a). The importance of this corpus comes from the fact that many systems for resolving either one of the tasks are actually quite similar, but the relation between SR and TE has not been analyzed yet. SICK allows a direct comparison between systems that address both tasks. In this paper we present a system that combines distributional and structural information for both the SR and TE tasks. The method proposed here makes two major contributions to the field: (1) it shows that there is a correlation between SR scores and TE judgments which can be exploited in an active learning approach to improve the accuracy of both tasks, and (2) it shows that we can handle structural information via patterns extracted from corpora and that this approach brings a substantial improvement to distributional systems. By employing (1) and (2), we built a system that performs competitively in the SR and TE tasks, reaching a new state of the art on the SICK corpus for TE and being less than 1.5% below the state of the art for SR.
63. Data Selection for SMT: Comparison of Various Methods
Amir Kamran and Khalil Sima'An
The performance of Statistical Machine Translation (SMT) is highly dependent on the availability and quality of the parallel corpus. For a number of languages, availability has increased in recent years; however, blind concatenation of all available training data may shift translation probabilities away from the domain that the user is interested in. Selecting data with the help of a limited amount of in-domain corpus to train domain-adapted translation models has shown significant improvements. In this work, three different approaches for data selection have been explored and compared:
1. Baseline: Bilingual Cross-Entropy. The bilingual cross-entropy selection method by Axelrod et al. (2011), which is, in turn, based on a monolingual method to select language model training data by Moore and Lewis (2010).
2. Invitation-based Selection. Cuong and Sima'an (2014) exploit in-domain data (both monolingual and bilingual) as a prior to guide word alignment.
3. IR-based Selection. A simple but effective selection technique based on a standard Vector-Space Model (VSM) in which sentences are represented as word vectors weighted by tf-idf scores (Tamchyna et al., 2012).
Invitation-based and IR-based selection methods show significant improvement over the baseline approach. Results are reported for English-Spanish and English-German language pairs.
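To give a flavour of the cross-entropy baseline, the sketch below scores sentences by the Moore-Lewis cross-entropy difference using the kenlm Python bindings; the ARPA file names are placeholders for an in-domain and a general-domain language model trained beforehand.

```python
import kenlm

in_domain = kenlm.Model("in_domain.arpa")
general = kenlm.Model("general_domain.arpa")

def ce_difference(sentence):
    """Per-word cross-entropy under the in-domain LM minus the general LM;
    lower means the sentence looks in-domain rather than merely frequent."""
    n = len(sentence.split()) + 1                       # + end-of-sentence token
    h_in = -in_domain.score(sentence, bos=True, eos=True) / n
    h_gen = -general.score(sentence, bos=True, eos=True) / n
    return h_in - h_gen

def select(sentences, keep=100000):
    return sorted(sentences, key=ce_difference)[:keep]
```

The bilingual variant of Axelrod et al. (2011) simply sums this difference over the source and target sides of each sentence pair.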
64. Text Clustering for Improved Statistical Machine Translation
Bushra Jawaid and Amir Kamran
“The problem of clustering can be very useful in the text domain, where the objects to be clustered can be of different granularities such as documents, paragraphs, sentences or terms”. The main focus of this research is to find suitable metrics and representations for “sentence” clustering for the purpose of improved Statistical Machine Translation (SMT). There are many potential advantages of clustering in SMT. Appropriate representations can “smartly” partition a corpus into highly “specialized” clusters, where each cluster can contain sentences or documents that belong to a particular “domain” or a mixture of similar “domains”. As a result, SMT performance could be greatly improved by, e.g., training separate SMT models for each cluster and consequently obtaining translations that are more appropriate for that particular cluster. By knowing the domain/cluster of the sentence or document that is being translated, a better and more informed translation is possible. Essentially, by clustering, the idiosyncrasies of a corpus can be better exposed than if it were treated as a whole. In order to perform clustering, the very first step is to define an effective representation for sentences/documents and determine a notion of similarity between them. We present work on the selection of vector representations and appropriate similarity metrics in order to obtain good-quality clusters. We discuss the results of our preliminary experiments in both monolingual and bilingual data settings. In the bilingual setting, we observe that mere concatenation of source and target sentences works much better than the vector representation obtained from aligned source and target language tokens. The quality of the clusters is measured by estimating the entropy of a domain-specific test set on a language model trained on the clustered data.
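One plausible, minimal instantiation of the monolingual set-up is a tf-idf representation clustered with k-means, as sketched below with toy sentences; the actual experiments use other representations and far larger data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "the patient was admitted with chest pain",
    "parliament adopted the budget amendment",
    "the committee voted on the new regulation",
    "the patient received antibiotics for pneumonia",
]

vectors = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for sentence, label in zip(sentences, labels):
    print(label, sentence)

# For the bilingual setting, concatenating source and target strings before
# vectorisation corresponds to the simple approach the abstract reports to work well.
```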
65. Task specific warping of word vector representations using limited labeled data
Paul Neculoiu, Chao Li, Carsten Lygteskov Hansen and Mihai Rotaru
Information extraction from CVs (resumes) is one of the success stories of applying NLP in industry. The standard approach is to cast the problem as a sequence labeling problem and solve it via statistical models like CRFs. In our previous and other related work, adding word embeddings has been shown to improve the performance of these models. Word embeddings obtained via tools like word2vec have proven to be very efficient at capturing general thematic relations. However, for our application they do not properly represent functional categories. For example, the words "nurse" and "hospital" will be closer in word vector space than "nurse" and "teacher". Thus, these representations are suboptimal for the task of identifying concepts like job titles and companies. We show that by using a relatively small number of annotated data points, we can warp a vector space into a space that is more relevant for our labeling task and also more compact. We train a deep neural network to serve as the morphing function. We frame it as a classification task from word embedding to the target label and use the last hidden layer representation as the new word embedding space for our sequence labeler. This representation retains both the thematic relations as well as information about the label space. Our approach is able to leverage the strengths of both unsupervised and supervised learning and significantly boost parsing performance.
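A hedged PyTorch sketch of the warping idea is given below: a small classifier maps pre-trained vectors to task labels, and its last hidden layer is reused as the new, more compact embedding. Dimensions, data and optimiser settings are illustrative, not the authors'.

```python
import torch
import torch.nn as nn

class Warper(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=50, n_labels=10):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.Tanh())
        self.classify = nn.Linear(hidden_dim, n_labels)

    def forward(self, vectors):
        h = self.hidden(vectors)           # the warped, more compact representation
        return self.classify(h), h

model = Warper()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy batch: 32 random stand-ins for word2vec vectors with random functional labels
x, y = torch.randn(32, 300), torch.randint(0, 10, (32,))
for _ in range(100):
    logits, _ = model(x)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_, warped = model(x)                        # feed these vectors to the sequence labeler
```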
66. Machine Translation with Source-Predicted Target Morphology
Joachim Daiber and Khalil Simaan
We propose a novel pipeline for translation into morphologically rich languages which consists of two steps: initially, the source string is enriched with target morphological features and then fed into a translation model which takes care of reordering and lexical choice that matches the provided morphological features. As a proof of concept we first show improved translation performance for a phrase-based model translating source strings enriched with morphological features projected through the word alignments from target words to source words. Given this potential, we present a model for predicting target morphological features on the source string and its predicate-argument structure, and tackle two major technical challenges: (1) how to fit the morphological feature set to the training data, and (2) how to integrate the morphology into the back-end phrase-based model such that it can also be trained on projected (rather than predicted) features for a more efficient pipeline. For the first challenge we present a latent variable model and show that it learns a feature set of quality comparable to a manually selected set for German. For the second challenge we present results showing that it is possible to bridge the gap between a model trained on a predicted and a model trained on a projected morphologically enriched parallel corpus. Finally, we present translation results showing a promising improvement over the baseline phrase-based system.
67. Modelling Adjunction in Hierarchical Phrase-Based SMT
Sophie Arnoult and Khalil Sima'An
Hierarchical phrase-based models (Hiero) learn generic rules, leading to very ambiguous grammars. To limit ambiguity, Hiero is constrained to learn local rules only, but this prevents the model from capturing long-distance dependencies and clause-level reorderings. We propose to use adjunction to let the model cover larger contexts selectively. Adjunction is known to introduce long-distance dependencies, but its application in Statistical Machine Translation has mostly been restricted to syntax-based models, which lend themselves more readily than Hiero to syntactic modelling. Our work constitutes the first application of adjunction to Hiero. We model adjunction in Hiero by assuming translation independence between adjuncts and non-adjunct material. We accordingly adapt phrase extraction to accept phrases with any number of adjuncts; from these phrases we then generalize over unseen adjunction patterns. The model still retains Hiero’s ability to embed phrases hierarchically, to allow for generalization in longer fragments. The resulting grammar is a Hiero grammar extended with an adjunct non-terminal and abstract adjunction rules. Our work is currently in progress. We plan to test our model against a Hiero baseline on Chinese-English and French-English. We are particularly interested in exploring different degrees of adjunct translation independence when segmenting for phrase extraction, and in the effect of our model on sentence-level reordering.
68. Self-Attentive Neural Models for Machine Reading
Ehsan Khoddam and Ivan Titov
Machine Reading (MR) is the task of teaching machines to read and comprehend standalone documents. The reading skills of machines are tested by answering questions about the documents. Recently, neural models with an attention mechanism have shown a lot of promise for the MR task. These models represent a document as a sequence of neural hidden states, with states corresponding to individual positions in the document, and they answer questions by aligning the questions to the document text using an attention model. The potential shortcoming of this approach is that it relies on a linear representation of the document; hence, such MR models are unlikely to capture non-local context and answer complex questions requiring chains of reasoning. In our work, we induce an attention model within a document, or, equivalently, we allow for non-Markovian connections (jumps) in the sequence of states representing the document. We discuss alternative ways in which discourse and syntactic knowledge can inform such within-document attention models.
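For concreteness, generic dot-product attention over a document's hidden states can be written in a few lines of NumPy; this is the standard mechanism the abstract refers to, not the proposed model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, states):
    """query: (d,) question encoding; states: (n, d) per-position document states.
    Returns attention weights over positions and the attended summary vector."""
    weights = softmax(states @ query)        # alignment of the question to positions
    return weights, weights @ states

states = np.random.randn(50, 64)             # 50 document positions
query = np.random.randn(64)                  # encoded question
weights, summary = attend(query, states)
print(weights.argmax(), summary.shape)
```

Within-document (self-)attention replaces the question vector with each position's own state, so every position can attend to every other one, yielding the non-Markovian jumps described above.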
69. Clause analysis: using syntactic information to identify who is attacking whom in political news
Wouter Van Atteveldt, Tamir Sheafer, Shaul Shenhav and Yair Foger-Dror
Automatic text analysis methods, including many supervised and unsupervised methods for topic detection and sentiment analysis, are often frequency based. Political news, however, often contains multiple actors and topics in the same article and even the same sentence, making it very important to identify to which actors or topics sentiment is being attributed and by whom. While “Palestinian civilians killed by Israeli attack” and “Palestinian attack kills Israeli civilians” contain almost the same words, the meaning and implications are radically different. This paper shows how syntactic information can be used to automatically extract clauses from text, where a clause consists of a subject, predicate, and optional source. Since the output of this analysis can be seen as an enriched token list or bag of words, normal frequency based or corpus linguistic analyses such as dictionary-based sentiment analysis and LDA topic modeling can be used on the output of this method. Taking the 2008–2009 Gaza war as an example, we show how corpus comparison, topic modelling, and semantic network analysis can be used to explore the differences between US and Chinese coverage of this war. Although the case study focuses on coverage of violent conflict, the same method can also be used to analyse ‘regular’ political news, showing who is criticizing whom or who is claiming and framing which issues.
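To illustrate the kind of clause extraction meant here, the toy sketch below pulls subject-verb-object triples out of spaCy's English dependency parses; the paper's own pipeline and rules are different, and passives, sources and Dutch input would need additional handling.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def clauses(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                if subj:
                    # note: the agent of a passive ("by ...") is not captured here
                    triples.append((subj[0].text, token.lemma_,
                                    obj[0].text if obj else None))
    return triples

print(clauses("Palestinian civilians were killed by an Israeli attack."))
print(clauses("A Palestinian attack kills Israeli civilians."))
```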
70. Tagging Variation-Rich Languages Using Convolutional Neural Networks
Mike Kestemont, Guy De Pauw, Walter Daelemans, Renske van Nie, Sander Dieleman and Gilles-Maurice de Schryver
Part-of-speech tagging and lemmatization are low-level, yet essential stages in modern NLP pipelines. While these basic tasks have long been considered “solved” for many languages, existing architectures have difficulties processing dirtier real-world corpora, such as internet-intermediated communication (e.g. Twitter) or collections of historic texts (e.g. newspaper databases). Standard NLP pipelines suffer from the orthographic variation in this material and struggle to reliably identify words at an early stage in the processing, thus causing a substantial percolation of errors to further stages in the pipeline. In this paper, we report the application of convolutional neural networks to the problem of POS-tagging and lemmatizing languages that are rich in surface variation. Convolutional networks are popular in computer vision research, where they are used to learn local filters, that can detect features, independently of their exact position in an image. Here, we apply one-dimensional convolutions at the character level, to represent words, and word embeddings, to represent word contexts. We will show how this approach enables the extraction of powerful, morpheme-level features from noisy training data. We apply this technique to a series of medieval corpora, as well as a number of Bantu languages, demonstrating the language-independence of our approach. Our results suggest that a single, deep architecture performs well across a variety of tasks, and in many cases improves upon previously reported results.
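A generic character-level convolutional word encoder of the kind described can be sketched in PyTorch as follows; the hyperparameters and the byte-based character mapping are illustrative assumptions, not the authors' set-up.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=256, char_dim=16, n_filters=64, width=3):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_length) integer character codes
        x = self.char_embed(char_ids).transpose(1, 2)   # (batch, char_dim, length)
        x = torch.relu(self.conv(x))                    # local character n-gram filters
        return x.max(dim=2).values                      # max-pool over positions -> word vector

def encode_words(words, encoder, max_len=20):
    ids = torch.zeros(len(words), max_len, dtype=torch.long)
    for i, w in enumerate(words):
        for j, c in enumerate(w[:max_len]):
            ids[i, j] = ord(c) % 256
    return encoder(ids)

encoder = CharCNNWordEncoder()
print(encode_words(["goeiemorgen", "goedemorgen", "gm"], encoder).shape)  # (3, 64)
```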
71. Text-based age and gender prediction for online safety monitoring
Janneke van de Loo, Guy De Pauw and Walter Daelemans
We present results of author profiling experiments that explore the capabilities of text-based age and gender prediction for online safety monitoring. In the project AMiCA, we are developing a monitoring tool for automatically detecting harmful content and conduct in online social networks, such as cyberbullying, “grooming” activities by sexual predators and suicidal behavior. Author profiling – i.e., automatically detecting “metadata” of authors, such as their age and gender – is an important subtask in this application. The use case for which the relevance of age and gender classification is most evident, is the detection of sexual predators, who may provide false age and gender information in their user profiles. Also for other use cases, however, automatically detecting age and gender information can be useful for risk estimation. Regarding age prediction, various age categories can be relevant, based on legal constraints (e.g. minors vs. adults) or age related statistics (e.g. suicide incidence rates across age groups). In our study, we evaluated and compared binary age classifiers trained to separate younger and older authors according to different age boundaries. Experiments were carried out on a dataset of nearly 380,000 Dutch chat posts from the Belgian social network Netlog, using a ten-fold cross-validation setup. We found that macro-averaged F-scores increased when the age boundary was raised and that practically applicable performance levels can be achieved, thereby providing a useful component in a cybersecurity monitoring tool for social network moderators.
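A minimal version of such a binary age classifier with cross-validation looks as follows in scikit-learn; the posts and labels are dummies, since the chat data is not public, and character n-grams are just one reasonable feature choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

posts = ["ga je mee naar school morgen", "vergadering verzet naar dinsdag",
         "lol egt grappig gwn", "de kinderen zijn net naar bed"]
is_older = [0, 1, 0, 1]        # 1 = above the chosen age boundary

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, posts, is_older, cv=2, scoring="f1_macro")
print(scores.mean())
```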
72. Automatic writing assistance in multiple languages
Dennis de Vries
GridLine is the expert on language and search technology for Dutch and other languages. One of our main products is Sonaling, a writing aid that helps users to write correct and understandable texts. Sonaling is built on a flexible, modular architecture with a text analysis pipeline at its core. The composition of this analysis pipeline can be changed, depending on the goal of the application or the user's needs. For this we have a large library of language analysis modules that we call our Language Server. Sonaling's user interfaces allow users to create, evaluate and improve their texts. They integrate with Microsoft Word and many web-based systems such as CMSes and CRMs. We began development of Sonaling over five years ago, starting with the Dutch version called Klinkende Taal. The tool has proven very successful and is now used by around 50 organizations, including local and national governments, insurance companies and hospitals. Currently, we are expanding the product to other languages, such as English, German, Frisian and Afrikaans. In this talk, I will present the current developments on Sonaling and demonstrate some of our new products. I will also discuss how we combine our Language Server, our modular architecture and our various user interfaces to quickly develop applications for new languages and new types of writing assistance.
SHARED TASK PAPERS
73. Rule-Based Coreference Resolution for Dutch
Rob van der Goot, Hessel Haagsma and Dieke Oele
We have adapted Stanford's multi-pass sieve coreference resolution system for Dutch. Our experiments show that this rule-based system works robustly on Dutch, across different domains. Because no training data is needed, it is a well-suited approach for low-resource languages.
74. Running Frog on the CLIN26 NER task
Iris Hendrickx, Ko van der Sloot, Maarten van Gompel and Antal van den Bosch
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool, which is currently maintained and developed by the Language Machines Research Group at Radboud University Nijmegen.
For the CLIN26 NER task we applied the Frog named-entity recognizer module, which is trained on the SoNaR-1 named entity labels. We will briefly discuss the architecture of Frog, its NER module, and the results. The main source of misclassifications is the difference between the label sets of SoNaR-1 and the CLIN26 task.
75. Rule based classification of events and factuality
Iris Monster, Iris Hendrickx-Dekkers
The aim of this project was to solve the event detection and factuality classification task of CLIN26. The data was parsed by the annotation tool Frog. Using these annotations (part-of-speech tags, lemmas, etc.), a rule-based classifier was built that discovers events. In order to annotate the resulting events with the corresponding polarity and factuality tags, two other rule-based classifiers were implemented. The performance of the classifiers varied across the test corpora, which could be due to the small training set.
76. Event detection and event factuality classification for shared task
Oliver Louwaars and Chris Pool
In this paper we present RuGGED, our approach to detecting events and determining their factuality. This task was organised as a double shared task within CLIN26, and the data was annotated following the FactBank [2] and NewsReader [3] guidelines. Given the small amount and high skewness of the data provided to develop our system, we opted for a rule-based rather than a learning approach. Rule development was heavily based on the annotation guidelines and on data observation.
As the annotation of events depends on part-of-speech and, more widely, on morphosyntactic information, we also based our rules mainly on this kind of input. Thus, we POS-tagged and parsed all sentences using the Dutch parser Alpino [4], and used the output to determine whether a word should be tagged as an event, based on its POS tag and its relation to other events in the sentence. As the data is BIO-annotated, if a word formed a new event, the system had to tag it as B-E, while it should be tagged I-E if it was deemed part of an existing event. This posed a hard task, but it resulted in an F-score of 0.88 for B-E and 0.46 for I-E on the development data.
Events can be signalled by verbs, nouns, and adjectives. While verbs and adjectives were relatively easy to identify via progressively specific rules, the nouns posed the most difficult challenge. Indeed, the types of nouns that can be events (deverbal or modal) are stated clearly in [2], but it is not as straightforward to identify them automatically, partly because Alpino does not directly output such information. Therefore, we opted for creating lists of known or potential event nouns (via existing databases and via heuristics, also exploiting Cornetto [5]), and checked list inclusion for each encountered noun.
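A much-simplified sketch of this rule idea is shown below; the tag-set fragment and the noun list are stand-ins, and the real system uses Alpino's full morphosyntactic output and considerably richer rules.

```python
# Simplified rule-based BIO event tagging (hypothetical tag set and noun list).
EVENT_POS = {"WW", "ADJ"}                 # verbs and adjectives, CGN-style POS tags
EVENT_NOUNS = {"aanval", "vergadering"}   # stand-in for the deverbal/modal noun lists

def bio_tag_events(tokens, pos_tags):
    labels, inside = [], False
    for tok, pos in zip(tokens, pos_tags):
        is_event = pos in EVENT_POS or (pos == "N" and tok.lower() in EVENT_NOUNS)
        if is_event:
            labels.append("I-E" if inside else "B-E")
        else:
            labels.append("O")
        inside = is_event
    return labels

print(bio_tag_events(["De", "man", "werd", "aangevallen"], ["LID", "N", "WW", "WW"]))
# ['O', 'O', 'B-E', 'I-E']
```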
Aside from detecting events, the certainty and polarity of each detected event also had to be established. This task proved even harder than the initial detection, as almost every event is certain and positive, so there was little evidence for the other categories to go on when developing rules. This also makes for a very high baseline (94% for certainty and 95% for polarity). By implementing rules based on indicative words found in the data and selected by common sense, and by exploiting Alpino's dependency relations, this baseline was not significantly beaten, but the results were better distributed in terms of recall and precision across the various categories. We tested the factuality rules both on the gold labels for events as provided by the organisers and on the labels assigned by our detection system (and thus noisy).