

Understanding online (and offline) society

Can we link social media and survey data?


Social media has transformed how we communicate with each other, get news, and do business – and it’s pretty much everywhere. In 2011, 45% of people accessed the internet to use social media, and by 2020, that figure was 70% (90% or more for people aged 44 and under).

It makes sense, then, that social scientists would want not just to explore its effects, but to have access to the wealth of data it generates. Large, longitudinal surveys are the gold standard of social and economic data, but social media gives us real-time, ‘natural’ data about attitudes and behaviours – and the two sources can complement each other.

X/Twitter data, for example, can be collected continuously, in between annual surveys, perhaps picking up on fluctuations in work, mental health, or political allegiances. Social media data can give us a greater breadth of information on attitudes, beliefs and behaviours that a questionnaire might not have collected, without any extra burden on participants.

The project

We’re part of the ESRC-funded project Understanding (Offline/Online) Society: Linking survey and digital trace data, which set out to look at how the traces of ourselves we leave online can enhance our understanding of society.

We’ve focused on four main areas:

  • how digital trace data and survey data can enhance each other
  • how to maximise informed consent to linking survey and digital trace data
  • how digital trace data can be collected, linked to survey data, and shared in a legal and ethical manner, and still be useful
  • examples of using linked survey and digital trace data.

We looked specifically at Twitter data, not least because they were easy to get hold of at first. Twitter’s API (application programming interface – code that allows two software programs to communicate) was free to access, and widely used by academic researchers. (Unfortunately, this free access ended in February 2023. We refer to ‘X’ as Twitter partly because most of our work on the project came before the company changed hands and changed its name.)

Consent

Perhaps the biggest challenge for this area of research is consent to linkage. Previous research has looked at the effect of demographic factors such as age, race, and sex on consent, but produced inconsistent results. Using Understanding Society and other data, we’ve found that the numbers of people who will allow their Twitter data to be linked to survey data are low – between 27% and 37% – and that face-to-face surveys have higher consent rates than web versions.

We also know that privacy concerns are an important factor, and that these concerns are affected by demographics. Some young people are less likely to consent than middle-aged and older people – perhaps because, having grown up in the digital age, they are more adept at managing online privacy. Some older people may place more trust in public bodies, such as a university, collecting data.

We have also examined whether how people use their smartphones – and how often – can affect consent rates. The wider the variety of things people do on their phones, the more likely they are to say yes to linking. However, the more often people used their phone, the more likely they were to be concerned about privacy.

It may be that people who use smartphones for a wider range of activities tend to be more open to new experiences and practices, and more used to sharing personal information in different places. Meanwhile, the more we use our phones, the more likely we are to come across privacy and security risks, such as data breaches or attempted hacking. It may also be that self-reported frequent use is linked to heightened anxiety and stress, which could amplify concerns over privacy and security.

Variability

Once you have consent, you then face the issue of the variability of the data, such as:

  • the number of followers an account has
  • the number of accounts the account follows
  • the number of tweets they’ve sent.

On the last point, for example, there are some accounts which rarely or never post – people who are simply on Twitter to look at the news, for example. In our sample, from Understanding Society’s Innovation Panel, we found users who’d only ever sent one tweet, and others who’d sent as many as 36,000.

Asymmetry

Linking Twitter with survey data also presents the challenge of linking very structured data – from a survey – to this much less structured data. We describe this as asymmetry between the two datasets. All these considerations can affect the amount of data available, and the accuracy of results, and introduce bias.
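To make the asymmetry concrete, here is a minimal sketch of linking one structured survey row per person to a variable number of unstructured tweets. It is an illustration only, not the project's pipeline: the field names (`pid`, `consented`, `text`) and the records are hypothetical.

```python
from collections import defaultdict

# Hypothetical survey records: exactly one structured row per respondent.
survey = [
    {"pid": 1, "age": 34, "consented": True},
    {"pid": 2, "age": 58, "consented": True},
    {"pid": 3, "age": 41, "consented": False},
]

# Hypothetical tweets: any number of free-text rows per respondent.
tweets = [
    {"pid": 1, "text": "Great day at the park"},
    {"pid": 1, "text": "Train delayed again"},
    {"pid": 1, "text": "New job starts Monday"},
]


def link(survey_rows, tweet_rows):
    """Attach a per-person tweet count to each consenting survey row."""
    texts_by_person = defaultdict(list)
    for tweet in tweet_rows:
        texts_by_person[tweet["pid"]].append(tweet["text"])

    linked = []
    for row in survey_rows:
        if not row["consented"]:
            continue  # no consent, no linkage
        texts = texts_by_person.get(row["pid"], [])
        linked.append({**row, "n_tweets": len(texts)})
    return linked


linked = link(survey, tweets)
for row in linked:
    print(row["pid"], row["n_tweets"])
```

In this toy example one consenting respondent contributes three tweets and another contributes none, so the survey side stays the same shape while the Twitter side varies from person to person.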

We also have small sample sizes. In our research on the Innovation Panel data, 22% of the sample had a Twitter account, and 9% consented to linkage. Once we get down to those with public accounts, we are left with a sample of just 127 people.

Results

These are just some of the challenges of using these data, but we have produced some interesting results nonetheless. For example, the people who were more likely to have sent a higher number of tweets tended to:

  • have larger numbers of followers
  • be men
  • have higher levels of education.

We also looked at whether there was a link between people’s answers on mental wellbeing in the survey and the content of their tweets – if people said they were generally happy, did their tweets reflect this? We found that people who said they felt less or much less happy than usual tended to send more tweets which used negative words.
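As a rough sketch of this kind of comparison, the snippet below counts how many of a respondent's tweets contain at least one negative word, so the count can sit alongside their survey answer. The word list, the answer labels, and the tweets are all hypothetical, and a real analysis would likely use a validated sentiment lexicon rather than this toy one.

```python
import re

# Hypothetical toy lexicon, for illustration only.
NEGATIVE_WORDS = {"sad", "awful", "tired", "angry"}


def count_negative_tweets(texts):
    """Count tweets containing at least one word from NEGATIVE_WORDS."""
    count = 0
    for text in texts:
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & NEGATIVE_WORDS:
            count += 1
    return count


# Hypothetical linked records: a survey answer plus that person's tweets.
respondents = [
    {
        "happiness": "less than usual",
        "tweets": ["So tired of this", "Awful news today", "Lunch was fine"],
    },
    {
        "happiness": "same as usual",
        "tweets": ["Lovely walk", "Feeling sad about the match"],
    },
]

for person in respondents:
    person["negative_tweets"] = count_negative_tweets(person["tweets"])
    print(person["happiness"], person["negative_tweets"])
```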

New dataset

Perhaps the most important output from our work has been the deposit of a new dataset with the UK Data Service: Understanding Society: Innovation Panel Twitter Study, 2007-2023. This contains Twitter data collected from consenting respondents which can be linked to data on the same people from previous and future waves of the Innovation Panel.

Creating it involved a great deal of work to protect individual privacy while still producing detailed individual records. The data have been de-identified: we created useful variables, but removed any metadata that could be fed into the API to identify specific survey participants.
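The general idea of deriving variables and then dropping identifying metadata can be sketched as follows. This is an assumption-laden illustration rather than the procedure used for the deposited dataset: the input fields (`user_id`, `screen_name`, `tweet_id`, `text`) and the derived variable names are hypothetical.

```python
def derive_variables(tweet):
    """Return derived, non-identifying variables for one tweet.

    Only the derived values are returned, so identifiers and the
    raw text never reach the output record.
    """
    words = tweet["text"].split()
    return {
        "n_words": len(words),
        "n_hashtags": sum(word.startswith("#") for word in words),
        "is_reply": tweet["text"].startswith("@"),
    }


# Hypothetical raw record as it might come back from an API.
raw = {
    "user_id": "12345",
    "screen_name": "example_user",
    "tweet_id": "987",
    "text": "Hello world #monday",
}

safe = derive_variables(raw)
print(safe)
```

Building the output from scratch, rather than deleting fields from the raw record, means a newly added identifying field cannot slip through by accident.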

We ended up with 30 variables of platform-based behaviour – including numbers of followers, numbers following, number of tweets and replies, amount of original content, and use of hashtags. There are also 135 variables which allow researchers to consider content: sentiment scores for sentences in a tweet, for example, and readability.

The data are safeguarded, which means researchers can only access them by registering and accepting the UK Data Service’s End User Licence. We will be doing further work to protect identities by using a neural network to rewrite tweets, so that they carry the same sentiments in vocabulary different from the original tweet, making the account significantly more difficult to identify.

Conclusion

There is more work to do – not least in persuading social networks to release their data, and in convincing users to consent to linkage. So far, though, we have established that linking is possible, and shown that linked Twitter and survey data have the potential to be a unique source for social research.

Find out more about the project on the National Centre for Social Research website

Published papers include:

Linking survey with Twitter data: examining associations among smartphone usage, privacy concern and Twitter linkage consent

Linking Twitter and survey data: asymmetry in quantity and its impact

Linking Twitter and Survey Data: The Impact of Survey Mode and Demographics on Consent Rates Across Three UK Studies

Understanding Society: Innovation Panel Twitter Study, 2007-2023 is available from the UK Data Service

Authors


Tarek Al Baghal

Professor Tarek Al Baghal is Deputy Director, Understanding Society and lead for content development


Paulo Serodio

Dr Paulo Serodio is Senior Research Officer, University of Essex, working on data linkage and survey data augmentation at Understanding Society
