Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus

Nina Khairova, Svitlana Petrasova, Włodzimierz Lewoniewski, Mamyrbayev Orken, Mukhsina Kuralai


Automatic extraction of synonymous collocation pairs from text corpora is a challenging NLP task. To find collocations of similar meaning in English texts, we use logical-algebraic equations. These equations combine the grammatical and semantic characteristics of the words in substantive, attributive and verbal collocation types. We identify the grammatical characteristics of words with the Stanford POS tagger and the Stanford Universal Dependencies parser, and we use WordNet synsets to select synonymous words within collocations. The candidate synonymous word combinations are then checked for compliance with the grammatical and semantic characteristics of the proposed logical-linguistic equations. Our dataset includes more than half a million Wikipedia articles from a few portals. The experiment shows that the more frequently synonymous collocations occur in texts, the more closely related the topics of those texts might be. The precision of the synonymous collocation search in our experiment is close to that reported in similar studies.
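As a rough illustration of the synonym-matching step described in the abstract, the sketch below pairs collocations of the same type whose corresponding words are equal or share a synset. The `TOY_SYNSETS` table is a hypothetical stand-in for WordNet, the collocation tuples stand in for the output of a POS tagger and dependency parser, and the matching rule is a simplifying assumption; it is not the logical-algebraic equations of the paper.

```python
from itertools import combinations

# Hypothetical stand-in for WordNet: each set is one toy "synset".
TOY_SYNSETS = [
    {"big", "large"},
    {"house", "home"},
    {"buy", "purchase"},
    {"car", "automobile"},
]

def share_synset(a, b):
    """True if the two words are equal or appear together in a toy synset."""
    return a == b or any(a in s and b in s for s in TOY_SYNSETS)

def synonymous_pairs(collocations):
    """Return pairs of distinct collocations of the same type whose
    words are pairwise equal or synonymous.

    Each collocation is (type, first_word, second_word). The type label
    (e.g. "attributive", "verbal") is assumed to come from an upstream
    POS tagger and dependency parser.
    """
    result = []
    for c1, c2 in combinations(collocations, 2):
        (t1, a1, b1), (t2, a2, b2) = c1, c2
        if c1 != c2 and t1 == t2 and share_synset(a1, a2) and share_synset(b1, b2):
            result.append((c1, c2))
    return result

# Toy input: candidate collocations already labeled by type.
collocations = [
    ("attributive", "big", "house"),
    ("attributive", "large", "home"),
    ("verbal", "buy", "car"),
    ("verbal", "purchase", "automobile"),
    ("verbal", "big", "house"),
]
pairs = synonymous_pairs(collocations)
for first, second in pairs:
    print(first, "~", second)
```

In a fuller pipeline, the toy table would presumably be replaced by real synset lookups and the type labels by parser output; the pairwise comparison here is quadratic in the number of collocations and is meant only to make the idea concrete.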
Authors:
Nina Khairova
Svitlana Petrasova
Włodzimierz Lewoniewski (WIiGE / KIE) - Department of Information Systems
Mamyrbayev Orken - Institute of Information and Computational Technologies, Kazakhstan
Mukhsina Kuralai - Al-Farabi Kazakh National University, Kazakhstan
Publication size in sheets: 0.5
Book: Ganzha Maria, Maciaszek Leszek, Paprzycki Marcin (eds.): Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, Annals of Computer Science and Information Systems, vol. 15, 2018, Polish Information Processing Society, ISBN 978-83-949419-5-6, [978-83-949419-6-3, 978-83-949419-7-0], 1094 p., DOI:10.15439/978-83-949419-5-6
Keywords in Polish: Wikipedia, jakość informacji, eksploracja danych
Keywords in English: Wikipedia, information quality, data mining
Language: en (English)
Score (nominal): 15
Score source: conferenceIndex
Score: Ministerial score = 15.0, 21-09-2020, ChapterFromConference
Publication indicators: WoS Citations = 1
Citation count*: 3 (2020-09-16)

* The presented citation count is obtained through analysis of information on the Internet and is close to the number calculated by the Publish or Perish system.