An overview of methods for treating selectivity in big data sources

Maciej Beręsewicz, Risto Lehtonen, Fernando Reis, Loredana Di Consiglio, Martin Karlberg


Official statistics now seriously considers big data as a significant data source for producing statistics. Big data holds the potential to provide faster, cheaper, more detailed and completely new types of statistics. However, its use also brings several challenges. One of them is the non-probabilistic character of most big data sources, which very often were not designed to produce statistics. The resulting selectivity bias is therefore a major concern when using big data. This paper presents a statistical approach to big data, searching for a definition that is meaningful from the statistical point of view and identifying its main statistical characteristics. It then argues that big data sources share many characteristics with Internet opt-in panel surveys and proposes these surveys as a reference for addressing selectivity and coverage problems in big data. Coverage and the self-selection process are briefly discussed for mobile network data, Twitter, Google Trends and Wikipedia page view data. An overview of methods that can be used to address selectivity and to eliminate, or mitigate, bias is then presented, covering both methods applied at individual level, i.e. at the level of the statistical unit, and methods applied at domain level, i.e. at the level of the produced statistics. Finally, the applicability of the methods to the several big data sources is briefly discussed and a framework for adjusting for selectivity in big data is proposed.
Book type: Monograph
Author: Maciej Beręsewicz (WIiGE / KS) - Department of Statistics; Risto Lehtonen - University of Helsinki; Fernando Reis - European Commission (Eurostat); Loredana Di Consiglio - European Commission (Eurostat); Martin Karlberg - European Commission (Eurostat)
Publisher name (outside publisher list): Publications Office of the European Union
Publishing place (Publisher address): Luxembourg
Issue year: 2018
Book series / Journal (in case of Journal special issue): Statistical Working Papers - EUROSTAT, ISSN 2315-0807 (0 points)
Publication size in sheets: 5.7
Keywords in English: big data, selectivity
Language: en (English)
Score (nominal): 20
Score source: publisherList
Score: Ministerial score = 20.0, 21-04-2020, MonograhOrBookMainLanguagesAuthor
Citation count*: 13 (2020-09-23)

* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.