Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

Svitlana Petrasova, Nina Khairova, Włodzimierz Lewoniewski, Mamyrbayev Orken, Mukhsina Kuralai

Abstract

Extracting similar text fragments from weakly formalized data is a task in natural language processing and intelligent data analysis, and it is used to automatically identify connected knowledge fields. To search for such common communities in Wikipedia, we propose using a logical-algebraic model for extracting similar collocations as an additional stage. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of key common up-to-date Wikipedia communities.
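
The idea of matching synonymous collocations across articles can be illustrated with a minimal sketch. It is not the authors' logical-algebraic model: the synonym sets in `TOY_SYNSETS` are a hypothetical stand-in for WordNet synsets, the part-of-speech tags are written by hand rather than produced by the Stanford tagger, and the article names and helper functions are invented for illustration. Two collocations are treated as similar when they share the same tag pattern and their words are synonymous position by position.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for WordNet synsets (assumption, not real WordNet data).
TOY_SYNSETS = [
    {"big", "large"},
    {"city", "metropolis"},
    {"fast", "quick", "rapid"},
    {"growth", "increase"},
]


def synonymous(w1, w2):
    """Return True if the words are identical or share a toy synset."""
    return w1 == w2 or any(w1 in s and w2 in s for s in TOY_SYNSETS)


def similar(c1, c2):
    """Collocations are similar if they differ as a whole but match in
    tag pattern and are synonymous word by word."""
    if len(c1) != len(c2) or c1 == c2:
        return False
    return all(
        t1 == t2 and synonymous(w1, w2)
        for (w1, t1), (w2, t2) in zip(c1, c2)
    )


def shared_fragment_counts(articles):
    """Count similar collocation pairs for every pair of articles."""
    counts = Counter()
    for (a, colls_a), (b, colls_b) in combinations(sorted(articles.items()), 2):
        n = sum(similar(x, y) for x in colls_a for y in colls_b)
        if n:
            counts[(a, b)] = n
    return counts


# Hand-tagged (word, tag) collocations per article; all data is illustrative.
articles = {
    "Article A": [
        (("big", "ADJ"), ("city", "NOUN")),
        (("rapid", "ADJ"), ("growth", "NOUN")),
    ],
    "Article B": [
        (("large", "ADJ"), ("metropolis", "NOUN")),
        (("fast", "ADJ"), ("increase", "NOUN")),
    ],
    "Article C": [
        (("quick", "ADJ"), ("growth", "NOUN")),
    ],
}

counts = shared_fragment_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

Article pairs with higher counts share more synonymous collocations, which is the kind of signal the abstract describes as indicating a common community. In a fuller pipeline the toy synonym sets and hand-written tags would be replaced by real WordNet lookups and tagger output.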
Authors:
Svitlana Petrasova
Nina Khairova
Włodzimierz Lewoniewski (WIiGE / KIE) - Department of Information Systems
Mamyrbayev Orken - Institute of Information and Computational Technologies, Kazakhstan
Mukhsina Kuralai - Al-Farabi Kazakh National University, Kazakhstan
Journal series: Data, ISSN 2306-5729 (0 points, indicated Indexes)
Issue year: 2018
Vol: 3
No: 4
Pages: 1-9
Publication size in sheets: 0.5
Keywords in Polish: Wikipedia, eksploracja danych, jakość informacji, informacja
Keywords in English: Wikipedia, data mining, information quality, information
DOI: 10.3390/data3040066
URL: https://www.mdpi.com/2306-5729/3/4/66
Language: English
Score (nominal): 15
Score source: journalIndex
Score: Ministerial score = 15.0, 11-09-2020, ArticleFromJournal
Publication indicators: WoS Citations = 0
Citation count*: 2 (2020-09-16)
Additional fields
Notes: Special Issue: Data Stream Mining and Processing

* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.