It is big, but not always clever

Ella Rhodes reports on the flaws of large data sets from social media.

08 December 2014

Computer scientists at McGill and Carnegie Mellon Universities have pointed out some of the flaws in using large sets of data from social media to learn about human behaviour, in a new article published in Science. They said the approach is now taken in thousands of papers, some of which contain erroneous data.

Derek Ruths, an assistant professor in McGill’s School of Computer Science, said that faulty results gleaned from using such data can have huge implications. He said: ‘Many of these papers are used to inform and justify decisions and investments among the public and in industry and government.’

Ruths and Jürgen Pfeffer of Carnegie Mellon’s Institute for Software Research have highlighted several issues involved in using social media datasets, as well as strategies to address them. The challenges include the fact that different social media platforms attract different users: for example, Pinterest is dominated by females aged 25–34, but researchers often don’t correct for the distorted picture these populations can produce. They also pointed out that the design of social media platforms can dictate how users behave and, therefore, what behaviour can be measured. For instance, on Facebook the absence of a ‘dislike’ button makes negative responses to content harder to detect than positive likes.

The large numbers of spammers and bots, which masquerade as normal users on social media, also pose an interesting problem for researchers – they can get mistakenly incorporated into many measurements and predictions of human behaviour. The researchers said that many of these problems have well-known solutions from other fields, such as epidemiology, statistics and machine learning. Ruths and Pfeffer wrote: ‘The common thread in all these issues is the need for researchers to be more acutely aware of what they’re actually analysing when working with social media data.’

Tom Stafford (University of Sheffield) told The Psychologist that having access to digital traces of human behaviour was a tremendously exciting possibility. He added: ‘And just because they’re online doesn’t mean they don’t reflect real feelings and behaviour. Using such data means you can remove sample error, but you cannot remove sample bias. What psychology needs to get to grips with is doing properly controlled comparisons.

‘The same things which were fundamentally important in pre-big-data times are still important. Things such as theoretical questions – what you are interested in and why – and a strong causal inference based on experiments using random assignment where possible. And, where experiments aren’t possible, using properly controlled comparisons like those used by epidemiologists.’