Using morphological and semantic features for the quality assessment of Russian Wikipedia

Włodzimierz Lewoniewski, Nina Khairova, Krzysztof Węcel, Nataliia Stratiienko, Witold Abramowicz


Assessing the quality and credibility of Wikipedia articles is becoming increasingly important. We propose using morphological and semantic features to estimate the quality of Wikipedia articles in the Russian language. We distinguished over 150 linguistic features and divided them into four groups. Within these groups, we considered features of encyclopedic style, readability, and subjectivism of the article's text. Using Random Forest as the classification algorithm, we show the most important linguistic features that affect the quality of Russian Wikipedia articles. We also compare the classification results obtained separately for each of the four groups of linguistic features. We achieved an F-measure of 89.75%.
Authors
Włodzimierz Lewoniewski (WIiGE / KIE)
- Department of Information Systems
Nina Khairova
Krzysztof Węcel (WIiGE / KIE)
- Department of Information Systems
Nataliia Stratiienko
Witold Abramowicz (WIiGE / KIE)
- Department of Information Systems
Publication size in sheets 0.5
Book Damaševičius Robertas, Mikašytė Vilma (eds.): Information and Software Technologies, Communications in Computer and Information Science, vol. 756, 2017, Springer, ISBN 978-3-319-67641-8, [978-3-319-67642-5], 624 p., DOI:10.1007/978-3-319-67642-5
Keywords in English quality assessment of texts, morphological and semantics features, Russian Wikipedia articles, random forests classification, encyclopedic, readability, subjectivism
ASJC Classification 2600 General Mathematics; 1700 General Computer Science
Language en (English)
Score (nominal) 20
Score source publisherList
Score Ministerial score = 20.0, 24-03-2020, ChapterFromConference
Publication indicators WoS Citations = 0; Scopus SNIP (Source Normalised Impact per Paper): 2017 = 0.354
Citation count* 3 (2020-09-23)

* The presented citation count is obtained through analysis of information available on the Internet and is close to the number calculated by the Publish or Perish system.