I have worked with quantitative and qualitative data for 25 years, but when I started working with Understanding Society’s omics data team it was time to learn some new terms. Gene expression, GWAS, EWAS, proteomics, SNPs, alleles, bim, fam and bed… it took me back to my childhood, annoying my sister with endless replays of Mary Poppins and The Sound of Music which I ‘sang’ along to. The new language was disorientating, but I had once coped with ‘supercalifragilisticexpialidocious’ and ‘Do-Re-Mi’, and as I set about familiarising myself with this new language, I learned about the benefits of working with data from the social sciences and biology side by side.
My learning was consolidated by attending the Biology for Social Scientists workshop run by Understanding Society in March this year. Listening to the presenters talk about the genetic, epigenetic, and proteomic data (collectively known as omics data) I learnt the fundamentals of what a social scientist needs to know about them, and how the attitudes, opinions and environmental factors of social science data create a rich resource when combined with omics data.
What’s in the data – Sample size
The End User Licence data contain:
- Bio-medical measures on height, weight and lung function for around 20,000 participants
- Blood samples (biomarkers) for around 13,000 participants
- Genotype data for just under 10,000 participants (9920)
- DNA methylation data for over 3,500 participants
- Proteomics data for just under 6,200 participants
- The data represents a broad age range of participants
- But only white Europeans are included in the DNA-derived datasets (genotype and methylation), because the ethnic minority sample is too small and, if used, could introduce sample bias and yield misleading results
Biomarkers are representative of your health
Let’s start at the very beginning (a very good place to start) by talking about biomarkers. Understanding Society collects information on general health in each wave, and in Waves 2 and 3, a follow-up health assessment visit from a registered nurse also collected bio-medical measures from around 20,000 adults. These included blood pressure, height, weight, waist measurements, body fat, grip strength and lung function. Blood samples were also taken, and biomarker data is available for measures of fat in the blood and indicators of diabetes, inflammation and immune system, anaemia and hormones. Collecting these interdisciplinary data allows us to get ‘under the skin’ and see how biomarkers can show the effect of social and economic environment on health. This is illustrated well in the cartoon Inequality under the skin, which summarises research showing that people in a lower social and economic position have higher levels of biomarkers that indicate inflammation caused by stress and infection much earlier in life.
What’s upstream?
There are great advantages to combining social and biological data. For instance, biological measures such as cholesterol and blood pressure are objective measures of health, free from reporting bias. You are born with your genes, and they are free from reverse causation. Harnessing these with social science measures means you can determine how environmental factors affect people. They can become a powerful indicator of what’s ‘upstream’. For example, measuring biomarkers can give an indication of diabetes or heart disease even before people know about their illness. Examining their environment can identify trends such as those illustrated in the ‘inequality under the skin’ cartoon.
Advantages of using biological data
You are born with your genes and they are free from reverse causation. Using UKHLS data means you can determine how environmental factors can affect individuals.
GWAS and SNPs
A SNP (Single Nucleotide Polymorphism, pronounced ‘snip’) is a variation at a single position in human DNA. An entire set of 23 human chromosome pairs is called a genome. A genome consists of around 3 billion base pairs, built from four bases known by the letters A, T, C, and G. When the genome is copied, a single letter can get altered, and a variation of this kind at one position is called a SNP. Estimates vary, but there are reckoned to be around 10 million SNPs in the human genome. The different versions of each SNP are referred to as alleles, and these can contribute to differences in how we look and in our health outcomes. SNPs are also how we establish if we are related to others. Every SNP is catalogued and has a name, recorded in the dbSNP database at the National Center for Biotechnology Information (NCBI).
Alleles example
ACTGA
AATGA
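To make the example concrete, here is a minimal Python sketch that lines up the two sequences above and reports where they differ. The function name find_variant_positions is made up for this illustration, and real genotype data are not compared letter by letter like this; it is only meant to show what a single-letter difference looks like.

```python
def find_variant_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) for every place two sequences differ.

    Positions are 1-based, so the first letter is position 1.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length to compare base by base")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# The two example sequences from the text.
differences = find_variant_positions("ACTGA", "AATGA")
print(differences)  # [(2, 'C', 'A')]
```

The two sequences differ only at the second letter, where one carries a C and the other an A. Those two letters are the alleles at that position.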
A SNP array is used to measure inter-individual variation across the genome. (An array is the glass slide used to test a subset of SNPs in a lab.) Array data can be used to perform Genome-Wide Association Studies (GWAS), which involve “rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease.” Understanding Society data has been used in many large GWAS to help understand the causes of diseases and other outcomes.
Imputation fills the gap
Not all SNPs are directly tested in each array. However, researchers on the 1000 Genomes Project sequenced the full genomes of more than 1,000 participants, providing a reference resource that allows us to estimate genotypes that haven’t been tested on the array, in a process called imputation.
Phenotype
A phenotype is any observable characteristic of an individual, and usually results from a complex interplay between genetics and the environment. Many social science variables could be classed as phenotypes, and environmental factors that influence phenotypes include nutrition, temperature, humidity and stress. Flamingos are a famous example – naturally white, their pink appearance comes from pigments in the organisms they eat.
Free from cultural labelling
Omics data is free from social science cultural labels, and studies an individual’s ancestry not their ethnicity.
This is a workshop presentation slide by Anna Dearman, showing the different types of omics data derived from the blood sample collected from participants in Waves 2 and 3 of Understanding Society.
Genetic dataset
A genome-wide array has been conducted on DNA samples from approximately 10,000 people, which enables us to examine gene-environment interactions for health and social phenomena. Researchers can apply for Understanding Society survey data linked to genetic and/or epigenetic data. See ‘How to access the data – the application process’ below.
Epigenetic dataset
DNA methylation is one aspect of epigenetics. While SNPs cannot be altered by the environment, DNA methylation can increase or decrease over your lifetime as a result of your environment. Methylation profiling has been conducted on DNA samples from approximately 3,650 people: 1,425 individuals from the British Household Panel Survey component of Understanding Society and another 2,230 from the General Population Sample. This is particularly important to advance understanding of how people’s social, economic and physical environments over their lifetime influence their biological processes by altering how their genes work. Researchers have controlled for socially patterned variations such as smoking in studies of subjects as diverse as educational attainment and schizophrenia.
Other health datasets
Some new Understanding Society biological datasets will be available soon for researchers to test, and the Data Releases page will announce when they are released. They include:
Clocks and polygenic scores
Epigenetic clocks are derived from methylation data. There are several different methylation ‘signatures’ of ageing, also known as epigenetic clocks, and several of these have been devised to measure wear and tear on the body. For example, accelerated epigenetic age has been associated with socioeconomic position. A polygenic score summarises the alleles an individual carries which may contribute to a given outcome (phenotype), such as a disease or other characteristic. It is usually calculated as a weighted sum across many SNPs, with the weights taken from a GWAS. As such, a polygenic score can be used to estimate somebody’s relative risk of developing a disease, or to estimate a measurement like body mass index or blood cholesterol. However, polygenic scores are not robust enough to predict outcomes on their own.
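As a rough illustration of the idea, a polygenic score can be sketched as a weighted sum: for each SNP, count how many copies (0, 1 or 2) of the relevant allele a person carries, multiply by a weight for that SNP, and add everything up. In the Python sketch below, the SNP names, weights and allele counts are all invented for illustration, and weighting each SNP by an effect size is an assumption about common practice rather than a description of how any Understanding Society score is built.

```python
def polygenic_score(allele_counts: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of allele counts over the SNPs that have a weight.

    allele_counts: SNP name -> copies of the allele carried (0, 1 or 2).
    weights: SNP name -> weight for that SNP (hypothetical values here).
    SNPs missing from allele_counts are skipped rather than guessed.
    """
    return sum(
        weight * allele_counts[snp]
        for snp, weight in weights.items()
        if snp in allele_counts
    )


# Invented weights and genotypes, for illustration only.
example_weights = {"snp_1": 0.5, "snp_2": -0.25, "snp_3": 1.0}
person_a = {"snp_1": 2, "snp_2": 1, "snp_3": 0}
person_b = {"snp_1": 0, "snp_2": 0, "snp_3": 2}

print(polygenic_score(person_a, example_weights))  # 0.75
print(polygenic_score(person_b, example_weights))  # 2.0
```

The score only has meaning relative to other people’s scores: here person_b scores higher than person_a, which says nothing definite about either individual’s outcome.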
Proteomic dataset
Proteins are the products of gene expression, and sit closer to the phenotype than genes do. Proteomics is a young field and work continues to try to understand the significance of these measures in social science. Unlike biomarkers, which are more established measures of health, proteins have a variable amount of research behind them and researchers are still learning what insights they provide.
Covid-19 has brought familiarity to omics terminology
In our podcast on Public health and the COVID-19 pandemic, Professor Meena Kumari, Professor of Biological and Social Epidemiology and Deputy Director of ISER, and David Finch, Assistant Director at The Health Foundation, look at the health issues that Covid has highlighted and the future challenges for public health. In David’s words, “Understanding Society is providing a really valuable long-term resource”.
How to access the data – the application process
Eligible researchers can combine genetic and/or epigenetic data with EUL survey data from Understanding Society. The application process is outlined on our health and biomarkers webpages, and you may find the FAQs useful to see who is eligible to apply. Register to use the application portal and deposit the application form and variable request form. The health and biomarkers webpages also contain links to the variables, user guide, questionnaires and fieldwork documents.
Complete the application form and variable request form to apply:
- Genetic/Epigenetic application form
- Variable request form
What information is contained in the SNP array data files?
Behind the names: bed, bim, fam
- .bed contains the genotypes themselves (each individual’s alleles at each SNP), stored in a compact binary format
- .bim contains information about each SNP, including its chromosome, its position, its name (e.g. rs123456) and its alleles
- .fam contains information about each individual, including their sex, their father and mother, or other family information
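To show what these files look like in practice, here is a small Python sketch that parses one line of a .bim file and one line of a .fam file. It assumes the files follow the standard PLINK layout (six whitespace-separated columns each). The example lines, IDs, position and function names are invented for illustration and are not taken from Understanding Society data. The .bed file is binary and is normally read with specialist software rather than by hand, so it is not parsed here.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chromosome: str
    snp_name: str
    genetic_distance: float  # position in centimorgans (often 0 if unknown)
    position: int            # base-pair coordinate on the chromosome
    allele_1: str
    allele_2: str


class FamRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str  # "0" means the father is not in the dataset
    mother_id: str  # "0" means the mother is not in the dataset
    sex: str        # "1" = male, "2" = female, "0" = unknown
    phenotype: str  # "-9" or "0" conventionally means missing


def parse_bim_line(line: str) -> BimRecord:
    """Split one .bim line into its six columns and convert the numeric ones."""
    chrom, name, cm, pos, a1, a2 = line.split()
    return BimRecord(chrom, name, float(cm), int(pos), a1, a2)


def parse_fam_line(line: str) -> FamRecord:
    """Split one .fam line into its six columns, all kept as strings."""
    return FamRecord(*line.split())


# Invented example lines, not real data.
snp = parse_bim_line("1 rs123456 0 752566 A G")
person = parse_fam_line("FAM001 IND001 0 0 2 -9")

print(snp.snp_name, snp.chromosome, snp.position)
print(person.individual_id, person.sex)
```

Each row of the .bim file describes one SNP and each row of the .fam file describes one person; the .bed file then holds the grid of genotypes for every person at every SNP.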
Working across disciplines – the best approach
Hopefully you have now become a little more familiar with omics terminology and you may be wondering what level or depth of background knowledge a social scientist needs to reach in order to comfortably use this type of data, given the complex dynamics behind the genetic measures and biomarkers. As Meena says, “Working with a bio-statistician is a good idea. It’s hard to work with the omics data on your own, so working as a multidisciplinary team is mutually beneficial to both fields.”
Glossary and… adieu
I hope this blog has helped you become a little more familiar with the world of omics and perhaps even given you a hankering to (re)watch The Sound of Music. To help remind yourself of those omics meanings, I’ll leave you with a glossary of terms. Do get in touch if you are thinking of using the omics data: genetics@understandingsociety.ac.uk and until then… so long, farewell, auf wiedersehen, goodbye.
Authors
Annette Pasotti
Annette is Senior Web and Communications Officer (User Engagement) at Understanding Society






