
Measuring the degree of semantic similarity between linguistic items has been a great challenge in the field of Natural Language Processing (NLP), a sub-field of Artificial Intelligence concerned with the handling of human language by computers. Over the last two decades, several different approaches have been put forward for computing similarity using a variety of methods and techniques. However, before examining such approaches, it is crucial to provide a definition of similarity: what is meant exactly by the term 'similar'? Are all semantically related items 'similar'? Resnik (1995) and Budanitsky and Hirst (2001) make a fundamental distinction between two apparently interchangeable concepts, that is, similarity and relatedness. While similarity refers to items which can be substituted in a given context (such as cute and pretty) without changing the underlying semantics, relatedness indicates items which have semantic correlations but are not substitutable. Relatedness thus encompasses a much larger set of semantic relations, ranging from antonymy (beautiful and ugly) to correlation (beautiful and appeal). As is apparent from Figure 1, beautiful and appeal are related but not similar, whereas pretty and cute are both related and similar. In fact, similarity is often considered to be a specific instance of relatedness (Jurafsky 2000), where the concepts evoked by the two words belong to the same ontological class. In this paper, relatedness will not be discussed and the focus will lie on similarity.

Figure 1. An explicative illustration of word similarity and relatedness.

In general, semantic similarity can be classified on the basis of two fundamental aspects. The first concerns the type of resource employed, whether it be a lexical knowledge base (LKB), that is, a wide-coverage structured repository of linguistic data, or large collections of raw textual data, that is, corpora. Accordingly, we distinguish between knowledge-based semantic similarity, in the former case, and distributional semantic similarity, in the latter. Furthermore, hybrid semantic similarity combines both knowledge-based and distributional methods.

The second aspect concerns the type of linguistic item to be analysed, which can be:

- Words, which are the basic building blocks of language, also including their inflectional information.
- Word senses, that is, the meanings that words convey in given contexts (e.g., the device meaning of a word as opposed to its other possible meanings).
- Sentences, that is, grammatical sequences of words which typically include a main clause, made up of a predicate, a subject and, possibly, other syntactic elements.
- Paragraphs and texts, which are made up of sequences of sentences.

This paper focuses on the first two items, that is, words and senses, and provides a review of the approaches used for determining to what extent two or more words or senses are similar to each other, ranging from the earliest attempts to recent developments based on embedded representations. Note that the set of linguistic items can also be cross-level, that is, it can include (and therefore enable the comparison of) items of different types, such as words and senses (Jurgens 2016).

In order to compute the degree of semantic similarity between items, two major steps have to be carried out. First, it is necessary to identify a suitable representation of the items to be analysed. The way a linguistic item is represented has a fundamental impact on the effectiveness of the computation of semantic similarity, as a consequence of the expressiveness of the representation. Second, a similarity function sim : I × I → ℝ is applied to the chosen representations, where I is the set of linguistic items of interest and the output of the function typically ranges between 0 and 1, or between −1 and 1.
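As a minimal, purely illustrative sketch (not taken from any specific system discussed here), the two families of approaches and a similarity function of the form sim : I × I → ℝ can be mocked up in a few lines of Python. The word vectors and the tiny taxonomy below are invented for the example; a real system would use corpus-derived embeddings and a genuine LKB such as WordNet.

```python
import math

# Distributional sketch: cosine similarity over vector representations.
# These vectors are hand-made for illustration only.
VECTORS = {
    "pretty":    [0.90, 0.80, 0.10],
    "cute":      [0.85, 0.75, 0.20],
    "beautiful": [0.80, 0.90, 0.15],
    "appeal":    [0.30, 0.70, 0.90],
}

def cosine_similarity(u, v):
    """sim(u, v) = cos(u, v); the output ranges between -1 and 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Knowledge-based sketch: path-based similarity over a toy is-a taxonomy,
# standing in for a wide-coverage lexical knowledge base.
PARENT = {
    "cat": "feline", "feline": "mammal",
    "dog": "canine", "canine": "mammal",
    "mammal": "animal", "animal": None,
}

def hypernym_path(node):
    """Return the chain of ancestors from node up to the taxonomy root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    path_a = hypernym_path(a)
    ancestors_b = {n: i for i, n in enumerate(hypernym_path(b))}
    for i, n in enumerate(path_a):
        if n in ancestors_b:
            return 1.0 / (1.0 + i + ancestors_b[n])
    return 0.0  # no common ancestor

if __name__ == "__main__":
    # Similar words score higher than merely related ones here.
    print(cosine_similarity(VECTORS["pretty"], VECTORS["cute"]))
    print(cosine_similarity(VECTORS["pretty"], VECTORS["appeal"]))
    print(path_similarity("cat", "dog"))  # 4 edges via mammal -> 0.2
```

In this toy setting, pretty and cute obtain a higher cosine score than pretty and appeal, mirroring the similarity/relatedness distinction drawn above, while the path-based function returns values in (0, 1] that grow as two nodes sit closer together in the taxonomy.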
