About the study
It is a household longitudinal survey of UK residents living in private households. The survey started in 2009.
A household survey is one where the population of interest is households (and their members), so a sample of households is chosen and information is collected about the household members.
A longitudinal survey is a survey where the same set of people is interviewed at regular intervals and asked the same questions. This helps in understanding the dynamic processes of their lives and measuring change.
The regular intervals at which a longitudinal survey collects information from its sample members are referred to as waves (some surveys refer to these as sweeps).
The data for each wave is collected over a period of 24 months, referred to as the fieldwork period. The fieldwork period for the first wave was 2009 to 2010. But because the interviews have to take place at one-year intervals, the second wave covered the period 2010 to 2011, overlapping with the first. Generally, those who were interviewed in 2009 were next interviewed in 2010, 2011 and so on, and similarly those who were interviewed in 2010 were next interviewed in 2011, 2012 and so on. Read the survey timeline.
The data is available one year after the end of the collection period. So, Wave 1 data was available in November 2011, Wave 2 data in November 2012, and so on.
At each sampled address, up to three dwelling units are selected, and at each dwelling unit up to three households are randomly selected into the sample. In most cases, an address has one dwelling unit and one household in that dwelling unit. All members of the households chosen in Wave 1 form the core sample, that is, they represent the population of interest. They are referred to as Original Sample Members (OSMs). The children of OSM mothers also become OSMs. OSMs are followed wherever they go as long as they live in the UK. Anyone who joins the household of one or more OSMs is called a Temporary Sample Member (TSM) and is interviewed only as long as they live with at least one OSM. This is because TSMs are not part of the core sample representing the population of interest; they are interviewed to provide the household context for the OSMs. Exception: if a TSM is the father of an OSM child, their status changes to Permanent Sample Member (PSM) and, like the OSMs, they are followed wherever they go as long as it is within the UK. PSMs are followed permanently because they provide information about the family background of an OSM child. If we did not do this, then if the OSM mother and TSM father separated, we would not have information about the father of the OSM child.
We use the standard ONS definition: "one person living alone or a group of people who either share living accommodation OR share one meal a day and who have the address as their only or main residence".
A Dwelling Unit (DU) is a living space with its own front door – this can be either a street door or a door within a house or block of flats. Usually there is only one dwelling unit at an address.
Students who have left the household are followed and treated as a "split-off" household. If they are living in institutional accommodation (e.g., Halls of Residence), they are considered to be the only member of the household. If they are living in private residence (e.g., renting a flat), then the usual rules apply of who is considered part of the household.
The sample is chosen from private households and so institutions are not part of the core sample. However, if sampled individuals move into institutions after the first wave, they are interviewed if it is possible to do so. The only exception to this is prisons, where we do not seek an interview.
Information is collected about different aspects of the lives of the sample members. The main topic areas cover income, education, labour market activities, family background, partnerships, fertility behaviour, health and wellbeing, attitudes and values, identity, ethnicity and religion. For a good overview of the different topics see the Long Term Content Plan. You can also search for variables using the online documentation search facility.
All information collected is voluntary. Interviewees can stop at any point in time and can also ask for their data to be destroyed after it is collected. First, information is collected by interviewers from a responsible adult in the household about the household members: their names, date of birth or age, sex, marital status, employment status and relationship to each other. This is called the ENUMERATION GRID. Second, generally the owner or renter (or if more than one, the eldest of them) is asked about the household: household expenditure, mortgage, number of bedrooms and so on. This is the HOUSEHOLD QUESTIONNAIRE. Then all those aged 16 or over, referred to as adults for the purposes of the survey, are asked questions about different aspects of their lives. This is the INDIVIDUAL QUESTIONNAIRE. This questionnaire also includes questions, answered by parents and guardians, about children aged 0-9, such as parenting styles, birthweight, etc. All 10-15 year olds are asked a number of questions about aspects of their school, friends, eating and drinking habits, happiness and wellbeing. This is the YOUTH QUESTIONNAIRE. The interviewers also provide some information about the condition of the property, the cooperativeness of the interviewees, any difficulties faced by interviewees in answering the questions, etc.
Parents and guardians answer some questions about their children, such as parenting styles, birthweight, etc. All 10-15 year olds are asked a smaller number of questions about aspects of their school, friends, eating and drinking habits, happiness and wellbeing. This is the youth questionnaire.
The collection of information is subcontracted to a fieldwork agency. Until 2014 (Waves 1-5) it was NatCen Social Research; for Waves 6-8 it was Kantar Public. For the current Waves 9-11 contract, the fieldwork is conducted by Kantar Public, with part of the interviewing being done by NatCen. There are different ways of collecting data. (1) Until recently the most common method was face-to-face, that is, interviewers went to the homes of interviewees and asked them questions in person. Some sensitive questions were answered by the interviewees directly on paper or computer and the interviewer did not see their answers. (2) Around 500 households are interviewed via telephone. (3) Since Wave 7, an increasing proportion of the sample completes the interview directly on the web, that is, no interviewers are directly involved in asking the questions.
To meet the different objectives of the study, in the first Wave, a random sample of 26000 households was chosen from the UK (referred to as the General Population Sample) AND a sample of 4800 households where at least one household member belonged to an ethnic minority group was chosen (referred to as the Ethnic Minority Boost Sample). Then in 2010, the BHPS sample was incorporated into Understanding Society and was interviewed as part of the second Wave. In 2015, as part of the sixth Wave, a new sample was added. This was the Immigrant and Ethnic Minority Boost Sample, which included 2500 households where there was at least one person who was born outside the UK and/or belonged to an ethnic minority group.
Yes, the BHPS in Wave 2 and the Immigrant and Ethnic Minority Boost Sample in Wave 6.
Biomarkers (such as grip strength, lung function tests, blood samples, etc.) were collected from a sub-sample of the survey sample. This is referred to as the biomarker sample.
Five minutes of question time was set aside for questions of interest to ethnicity and migration research. These questions were only asked of a sub-sample of the survey sample. This is referred to as the extra five minutes sample.
The BHPS is the British Household Panel Survey, which started in 1991 and continued until 2008. It started with a Great Britain sample of around 5500 households. In 1999, Scottish and Welsh boost samples of around 1500 households were added, and then in 2001 the Northern Ireland boost sample was added. There are similarities between the BHPS and Understanding Society in terms of sample design, questionnaire content and data structure, but there are some differences. For example, the BHPS did not include an ethnic minority boost sample. There are also some differences in questionnaire content. See the relevant FAQ for details.
No. As a longitudinal study it is important that we interview the same people over a long period of time. We cannot add new people to the sample who are not connected to the households already in the Study.
The first step is to look at the frequency distribution of the variable responses. For example, for the question "how satisfied are you with your life overall?" asked in Wave 1, 5588 individuals said they were completely satisfied, 17453 said they were mostly satisfied and so on, and 1033 said they were completely dissatisfied. Read more
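As a rough illustration of what a frequency distribution looks like in practice, the sketch below tabulates a handful of made-up responses. The response values and the helper name `frequency_table` are invented for this example and are not survey data or part of any Understanding Society tool.

```python
from collections import Counter

# Made-up responses for illustration only; not real survey data.
responses = [
    "completely satisfied", "mostly satisfied", "mostly satisfied",
    "completely dissatisfied", "mostly satisfied", "completely satisfied",
]


def frequency_table(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, round(100 * n / total, 1)) for v, n in counts.most_common()]


for value, count, percent in frequency_table(responses):
    print(f"{value:<25} {count:>3} {percent:>5}%")
```

The same tabulation is a one-line command in most statistical packages; the point here is only to show counts and percentages side by side.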
Understanding Society data can be accessed via the UK Data Service (UKDS). Researchers will need to be registered with the UKDS before requesting and downloading data. You will need to decide which level of the main data you need - End User Licence (EUL), Special Licence or Secure Data access.
- Read an outline of the different licence types available and the conditions of their access.
- Read the study's full Data Access Strategy
The EUL version of the main survey dataset (Study Number 6614) includes all of the data that should be required by the majority of researchers and is the simplest and quickest to access from the UK Data Service (UKDS). It does not, however, contain: dates of birth more detailed than the year of birth; detailed country of birth; detailed occupation or industry codes; or geographical identifiers less than Government Office Region (GOR). In addition, the income and pay variables are top-coded. Read the UKDS catalogue details and the Understanding Society documentation.
Government Office Regions (GOR) and countries are available as part of the main datasets (Study Numbers 6614 and 6931). The most detailed geographical indicators, grid references, are supplied as part of the Secure Access version (Study Number 6676). The geographical indicators between GOR and grid references are available as Special Licence datasets. A maximum of three Special Licences are usually allowed per application and combinations of certain 2001 and 2011 census type geographical identifiers are not allowed due to disclosure risks. The Understanding Society geography Special Licence datasets are only applicable to the Understanding Society waves. Users wishing to use Special Licence geographies with the harmonised BHPS waves will need to apply for the relevant BHPS geography Special Licence dataset(s).
It is free to use for non-commercial purposes.
The data is available in Stata (*.dta), SPSS (*.sav) and in tab delimited formats.
Read the terms and conditions of access to End User Licence data
Read the terms and conditions of access to Special User Licence data
The Special Licence / Secure Access application process is two-stage. Applications are made to the UK Data Service (UKDS) and they perform a series of checks on security and other aspects of the application. The UKDS will inform applicants of any delays in their processing. Secure Access applications are likely to take longer than Special Licence ones due to the additional processing required. The second stage is the approval of the application by the Understanding Society Scientific Leadership Team (SLT) under delegated authority from the data owners. The SLT is required to process applications within 10 working days of receiving them although the time-frame is often considerably shorter. Applications with insufficient or unclear details on the form are likely to require clarification being sought from the applicants via the UKDS and this will prolong the application process. It should also be noted that users of Secure Access data will require special training - the UKDS will advise on this matter.
A limit of three is normally applied to the number of Special Licences granted per application. In addition, combinations of certain 2001 and 2011 census type geographical identifiers are normally prohibited due to potential disclosure risks. Read the access restrictions. Secure Access data is only available in a controlled environment. It is possible to import Special Licence data, or indeed datasets from elsewhere, into that environment, but a special application will need to be made to the UKDS in such circumstances.
Data that a researcher has been granted access to on a research project may not be used on another research project. A separate application will need to be submitted to the UK Data Service (UKDS). If clarification is required before submission then please get in touch with the UKDS for advice via their standard contact mechanisms.
Only those persons named in the data access application will be able to use the data for the specific project. Each applicant on a project must submit an application form. Supervisors requiring access to the data must also apply in the same way.
Project team members from different institutions can apply for access to data as part of the same project and to work on the same data. Special consideration for access to some Special Licence datasets may be given for applicants from non-UK institutions.
No person may access Understanding Society data without authorisation. Please contact the UK Data Service (UKDS) for advice in the first instance.
Data that a researcher has been granted access to on a research project may not be used on another research project. A separate application will need to be submitted to the UK Data Service (UKDS). If clarification is required before submission then please get in touch with the UKDS via their standard contact mechanisms.
All released Understanding Society data should be accessed from the UK Data Service (UKDS).
The survey data files that are downloaded from the UK Data Service (UKDS) are supplied in a single ZIP file which is a compressed format that results in faster download times. Read full details on ZIP files.
Yes. When you download the data from the UK Data Service (UKDS) you will be asked to check a box to say you will be using it for teaching purposes. You will then be asked to download a simple form with instructions. Basically it will require you to provide some simple details about you and the data and get you and everyone in your class to sign the form.
Please contact the UK Data Service (UKDS) Helpdesk for all download and accessibility issues.
A large number of questions are asked every year, while others are asked every few years. The frequency of a variable can be seen below the variable description in the online documentation for that variable. For example, current economic activity status is asked every year. If you click here, you will see the waves in which this question has been asked under "Wave Occurance". For an overview of the question module frequency see the long term content plan.
If the questions have the same name then they are comparable across time.
Participation in the survey is completely voluntary and respondents can skip any question that they do not want to answer. As part of the survey we also collect respondents' consent to link to various administrative data sources such as information held by DWP, HMRC, DVLA and FCA. Read more
Read the BHPS - Harmonised User Guide. This guide accompanies the first edition of the Understanding Society – harmonised BHPS. It focuses on the harmonisation process, not the differences in scope, fieldwork practices, questionnaire design and content of the two studies.
Data and research
All data files that include information collected in Wave 1 start with the prefix A_; similarly, all files pertaining to Wave 2 start with the prefix B_. The same types of information are available in files with the same names across all waves; just the prefix changes. So, all information collected in the adult interviews is put in INDRESP files: the Wave 1 adult interview information is in A_INDRESP, the Wave 2 adult interview information is in B_INDRESP. Since November 2018, 18 waves of BHPS data files which have been harmonised (to a large extent) have also been released along with 7 waves of Understanding Society data. These data files have a similar structure, but their names start with the prefixes BA_, BB_, ... up to BR_. For details about the other files see the study's documentation and information on harmonisation of the BHPS.
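The prefix rule can be sketched as a small helper that turns a wave number into a file name. The function names `wave_prefix` and `file_name` are hypothetical, lowercase names are assumed, and the upper limit of 26 single-letter waves is an assumption of this sketch rather than something stated by the study.

```python
import string


def wave_prefix(wave, bhps=False):
    """Return the prefix for a 1-based wave number.

    Understanding Society waves use one letter (1 -> 'a_', 2 -> 'b_').
    Harmonised BHPS waves add a leading 'b' (1 -> 'ba_', 18 -> 'br_').
    """
    # 18 harmonised BHPS waves per the text; 26 is an assumed cap (a..z).
    max_wave = 18 if bhps else 26
    if not 1 <= wave <= max_wave:
        raise ValueError(f"wave {wave} is out of range")
    letter = string.ascii_lowercase[wave - 1]
    return ("b" if bhps else "") + letter + "_"


def file_name(wave, file_type, bhps=False):
    """Build a file name such as 'b_indresp' for Wave 2 adult interviews."""
    return wave_prefix(wave, bhps) + file_type


print(file_name(1, "indresp"))             # a_indresp
print(file_name(2, "indresp"))             # b_indresp
print(file_name(18, "indresp", bhps=True))  # br_indresp
```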
These files have names starting with the letter "x" to indicate that these are the cross-wave files, that is, they include data collected in different waves. For example, XWAVEDAT includes time-fixed information such as date of birth, country of birth and parents' occupation when the person was 14 years old. These types of information are only collected once, generally the first time a person is interviewed. Although most people were interviewed for the first time in Wave 1, others who joined the households of the core sample members (TSMs) after Wave 1 were asked in the wave they joined. So, this file puts the data collected in different waves together in one file.
The naming convention for variables is the same as for files, that is, the same variable across waves has the same root name but a different wave prefix. So, the variable for age in Wave 1 is A_DVAGE, in Wave 2 it is B_DVAGE, and so on.
If a variable does not change across waves it will not have a wave prefix. One example is the individual crosswave identifier pidp.
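Because the root name stays fixed and only the prefix changes, wave-specific records can be stacked into a "long" layout by stripping the prefix and recording the wave separately. The sketch below assumes lowercase variable names; the helper `to_long` and the data values are made up for illustration.

```python
def to_long(record, prefix, wave):
    """Strip the wave prefix from variable names and add a wave column."""
    row = {"wave": wave}
    for name, value in record.items():
        if name.startswith(prefix):
            row[name[len(prefix):]] = value  # e.g. a_dvage -> dvage
        else:
            row[name] = value                # unprefixed, e.g. pidp
    return row


# Made-up records for one person in two waves.
wave1 = {"pidp": 1001, "a_dvage": 34}
wave2 = {"pidp": 1001, "b_dvage": 35}

long_rows = [to_long(wave1, "a_", 1), to_long(wave2, "b_", 2)]
for row in long_rows:
    print(row)
```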
A_HHORIG, B_HHORIG, C_HHORIG,… in wave specific files AND HHORIG in the cross-wave files will show which sample a sample member belongs to.
If the variables have the same root name then they are comparable across waves.
In some data files each row represents a unique household. These are household level files, e.g., A_HHRESP. In some data files, each row represents a unique individual. These are individual level files, e.g., A_INDRESP.
These are derived variables, that is, the Understanding Society data team has produced these variables using different variables to make it easier for users. Sometimes these are imputed values of the raw variables (e.g., A_PAYGU_DV), sometimes these combine different bits of information (e.g., A_HIQUAL_DV), and sometimes these have been checked for consistency using different bits of information available for that person (e.g., A_AGE_DV). Note that the relationship pointers are another type of derived variable; these do not have a _DV suffix.
This is the unique cross-wave person identifier. That is, within or across waves, this variable uniquely identifies each person ever associated with the survey.
These are the wave specific household identifiers, that is, within any wave these values uniquely identify each household. Every person in the same household in each wave will have the same household identifier.
These are person numbers assigned to all household members in a responding household. Together with the household identifiers, they uniquely identify an individual member within a wave. They can change across waves and the ordering has no significance. So, one person could have the value A_PNO = 1 in Wave 1 and B_PNO = 7 in Wave 2.
Use the household identifier: A_HIDP in Wave 1, B_HIDP in Wave 2, and so on.
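A minimal sketch of attaching household-level information to individuals within one wave, matching on the wave-specific household identifier. Only the identifier names and A_DVAGE follow the text (written in lowercase here as an assumption); `a_hh_var`, the helper `attach_household` and all values are placeholders.

```python
# Made-up rows standing in for a household-level and an individual-level file.
a_hhresp = [
    {"a_hidp": 501, "a_hh_var": "x"},
    {"a_hidp": 502, "a_hh_var": "y"},
]
a_indresp = [
    {"pidp": 1001, "a_hidp": 501, "a_dvage": 34},
    {"pidp": 1002, "a_hidp": 501, "a_dvage": 36},
    {"pidp": 1003, "a_hidp": 502, "a_dvage": 70},
]


def attach_household(individuals, households, hid_name):
    """Add each person's household row, matched on the household identifier."""
    by_hid = {h[hid_name]: h for h in households}
    merged = []
    for person in individuals:
        household = by_hid.get(person[hid_name], {})
        merged.append({**household, **person})
    return merged


linked = attach_household(a_indresp, a_hhresp, "a_hidp")
for row in linked:
    print(row)
```

Both members of household 501 receive the same household row, which is the expected result of a many-to-one match.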
Use the individual crosswave identifier PIDP.
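A minimal sketch of linking the same people across two waves on the cross-wave person identifier. The rows and the helper `link_waves` are made up, lowercase variable names are assumed, and keeping only people present in both waves (an inner join) is a choice made for this example, not a recommendation from the study.

```python
# Made-up individual-level rows for two waves.
wave_a = [
    {"pidp": 1001, "a_dvage": 34},
    {"pidp": 1002, "a_dvage": 36},
]
wave_b = [
    {"pidp": 1001, "b_dvage": 35},
    {"pidp": 1004, "b_dvage": 22},  # joined after Wave 1
]


def link_waves(first, second):
    """Keep people present in both waves and combine their rows on pidp."""
    second_by_pidp = {r["pidp"]: r for r in second}
    return [
        {**r, **second_by_pidp[r["pidp"]]}
        for r in first
        if r["pidp"] in second_by_pidp
    ]


panel = link_waves(wave_a, wave_b)
print(panel)
```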
You cannot do that. There is no concept of a longitudinal household because over time people move in together as well as live separately. As a result a cross-wave household identifier cannot be provided.
When individuals live in the same household as their spouse or partner, the PIDP of their spouse or partner (W_PPID) is provided along with the other information for the person. Use this to link information about spouses.
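A minimal sketch of using the partner identifier to look up a partner's information within the same wave. The rows are made up, lowercase names are assumed, `partner_age` and `attach_partner_age` are hypothetical names, and the value -8 used here for a person without a co-resident partner is an assumption of this example.

```python
# Made-up Wave 1 rows; a_ppid holds the pidp of a co-resident partner.
rows = [
    {"pidp": 1001, "a_ppid": 1002, "a_dvage": 34},
    {"pidp": 1002, "a_ppid": 1001, "a_dvage": 36},
    {"pidp": 1003, "a_ppid": -8, "a_dvage": 70},  # assumed no-partner code
]


def attach_partner_age(people, prefix):
    """Add the partner's age (or None) by looking up ppid among the pidps."""
    by_pidp = {r["pidp"]: r for r in people}
    out = []
    for r in people:
        partner = by_pidp.get(r[prefix + "ppid"])
        age = partner[prefix + "dvage"] if partner else None
        out.append({**r, "partner_age": age})
    return out


couples = attach_partner_age(rows, "a_")
for row in couples:
    print(row["pidp"], row["partner_age"])
```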
The negative values -21, -20, -11, -10, -9, -8, -7, -2 and -1 are reserved for different types of missing data. Any other negative value represents an actual value of the variable. -1: Don't know; -2: Refusal; -7: Not asked as proxy interview; -8: Not asked as not eligible for question; -9: Other reasons for missing data. ONLY IN THE XWAVEDAT FILE: -20 means "no data from BHPS" and -21 means "no data from Understanding Society". ONLY IN WAVE 6 DATA FILES: -11 means "variable not available for non-IEMB samples" and -10 means "variable not available for IEMB sample".
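Before analysis, the reserved codes usually need to be separated from genuine values. The sketch below treats only the reserved codes listed in this answer as missing and leaves every other value, including other negative numbers, untouched. The names `MISSING_CODES` and `clean` are hypothetical.

```python
# Reserved missing-data codes and their meanings, as listed in the text.
MISSING_CODES = {
    -1: "don't know",
    -2: "refusal",
    -7: "not asked as proxy interview",
    -8: "not asked as not eligible for question",
    -9: "other reasons for missing data",
    -10: "variable not available for IEMB sample (Wave 6 files)",
    -11: "variable not available for non-IEMB samples (Wave 6 files)",
    -20: "no data from BHPS (cross-wave file)",
    -21: "no data from Understanding Society (cross-wave file)",
}


def clean(value):
    """Return None for a reserved missing code, otherwise the value itself."""
    return None if value in MISSING_CODES else value


raw = [35, -9, -7, -3, 0]
print([clean(v) for v in raw])  # -3 is kept: it is not a reserved code
```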
Questions about variables
The two key age variables are W_DVAGE and W_AGE_DV. You should use either W_DVAGE or W_AGE_DV depending on your requirements (see the relevant FAQ). The data includes other age-related variables. Here is an explanation. As long as a person's exact DOB has not been provided or confirmed by the person themselves, adults will be asked to provide their DOB in their personal interview [see CDOBM and CDOBY in the demographics module]. If they do report _CDOBY and _CDOBM, this is not used to re-calculate DVAGE on that occasion: there is no instruction following the _CDOBM and _CDOBY questions to re-compute _DVAGE. Youths also provide their DOB on the youth self-completion form, and this information is also not available during the current wave interview. Following the fieldwork, the information will be picked up to update existing information in the sampling database for the next wave's feed forward information. This requires some manual checking and cross-validation, as there will always be cases where the seemingly correct DOB has been entered incorrectly; for example, a typo could make a person born in 1959 appear to be a person born in 1995, or a youth's sloppily written '3' may have been misread as '8' by the scanner. These variables are used by the Understanding Society data team to maintain and update the sampling file, but no attempt is made to make these variables consistent with _DVAGE or _AGE_DV.
There are two key age variables: W_DVAGE and W_AGE_DV. (i) DVAGE is the person's age, in number of completed years, at the time the individual was enumerated as a member of a responding household. DVAGE is computed during the interview at the end of the household grid. This is the one used for filtering age-triggered questions. So, if you want to understand or check who gets asked these age-triggered questions you should use W_DVAGE. (ii) But W_DVAGE is not checked for longitudinal consistency or even consistency across multiple age reports. As an alternative we provide AGE_DV. AGE_DV is computed using the latest DOB information as it is recorded in the sampling file and uses the date of the interview. Note that all input variables used in computing AGE_DV may include imputations (we provide the associated imputation flags _IF). For some respondents the information collected across different waves of the study continues to be inconsistent. If inconsistencies could not be resolved, W_AGE_DV in the wave-specific files is set to -9. Corresponding to W_DVAGE we also provide date of birth variables: BIRTHM (only SL version) and BIRTHY. Corresponding to AGE_DV we also provide date of birth variables: DOBM_DV (only SL version) and DOBY_DV.
The script instruction for computation of DVAGE is: Compute DVAge = using Ff_birthd/ff_birthm/ff_birthy | BIRTHD/BIRTHM/BIRTHY | Ageif | Cageif If DOB Not Known. In other words, DVAGE uses the date of the enumeration and either the exact date of birth as has been provided during that enumeration, as it was available from a previous interview or from an age estimate if the exact DOB is unknown. DVAGE is used to route age-triggered questions during the following interviews with individuals. Note that DVAGE is not necessarily a good measure of the person’s age at the time of interview: the DOB used to calculate DVAGE may have been incorrect, estimated or unavailable and sometimes a lot of time will have passed between the household enumeration and the personal interview.
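To make "age in number of completed years" concrete, the sketch below computes it from an exact date of birth and a reference date. This is not the study's fieldwork script, which also falls back on fed-forward dates and age estimates; the function name `completed_years` and the dates are made up for illustration.

```python
from datetime import date


def completed_years(dob, on):
    """Age in completed years on a given date, from an exact date of birth."""
    years = on.year - dob.year
    # Subtract one if the birthday has not yet been reached in that year.
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years


dob = date(1959, 6, 15)
enumeration = date(2010, 6, 14)  # made-up household enumeration date
interview = date(2010, 6, 20)    # made-up personal interview date

print(completed_years(dob, enumeration))  # 50
print(completed_years(dob, interview))    # 51
```

The two results differ because a birthday falls between the two dates, which is one way an age fixed at enumeration can differ from the age at the personal interview.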
Year of birth is available in End User Licence; Year and month of birth variables are available in Special Licence. Day of birth is only available through Secure Access.
W_QFHIGH_DV and W_HIQUAL_DV (which is a summary variable for W_QFHIGH_DV) provide the highest educational qualification as of a particular wave. These variables are produced by combining information collected in the first wave the person was interviewed (W_QFHIGH) with information about new qualifications reported each wave since then (W_QFVOC1-15, W_QUALNEW1-33, W_TRQUAL1-34). From Wave 2 onward this also includes information fed forward from the BHPS. WARNING! Not all respondents were ever asked the highest educational qualification question (W_QFHIGH). Such cases are identified by a flag variable W_QFHIGHFL_DV. This group mainly comprises BHPS respondents and Rising 16s who had a youth interview in Wave 1 and were incorrectly routed out of the initial conditions module in Wave 2. From Wave 6 onwards this group also includes members of the IEMB sample who provided an adult interview and reported that their highest qualification was obtained abroad (see F_QFHIGHOTH and F_ISCED11_DV). Highest qualifications may be picked up through W_QUALNEW or W_TRQUAL for these groups but may be additional to pre-existing (potentially higher) qualifications.
The key highest educational qualification variables are W_QFHIGH_DV and W_HIQUAL_DV. These are computed using information provided in the other education variables: W_QFHIGH records the highest educational qualification of respondents in their first interview, so it will be missing for anyone for whom that is not the first interview, that is, anyone who has been interviewed before. W_HIQUAL is a summary variable for W_QFHIGH, containing five categories of qualification. Respondents who have taken part before are asked about any new qualifications in these variables: W_QFVOC1-15 W_QUALNEW1-33 W_TRQUAL1-34. W_NQFHIGH_DV records the highest educational qualification reported in the current wave, including vocational qualifications which were reported in W_QFVOC1-15, BUT not including any information fed forward from previous waves. Qualifications obtained since last wave are recorded in W_QUALNEW1-33 OR W_TRQUAL1-34. Responses that are not part of the code frame for W_QFHIGH are coded to  "None of the above". W_NHIQUAL_DV is a summary variable for W_NQFHIGH_DV, containing five categories of qualification.
Data is recorded as reported by the respondent and is provided as is. If there are discrepancies or inconsistencies in their reports then these will appear in the data. But in some cases we also provide "derived variables" with the suffix _dv which are edited to be consistent. These are sex_dv, w_age_dv, and w_relationship_dv. For more information please click on the relevant link.
Paygu_dv is derived as “Gross pay at last payment” (paygl) or if the last payment is not usual then “usual pay” (payu). Amounts are converted to a monthly equivalent. If “payu” is reported net instead of gross, then it is converted back to a gross amount by applying the rules of the tax system. If there is missing data then it is imputed. See the user guide for further details on imputation methods used in UKHLS.
Information about the sex of every household member is asked as part of the household grid and reported by someone in the household. This information is confirmed each wave and is recorded in W_SEX in individual and multi-level files. The variable SEX in the crosswave files contains the latest value available. As this value may be inconsistent across waves, we have created a longitudinally consistent version based on the information collected from all sources across all waves. This is named W_SEX_DV in individual level files and SEX_DV in crosswave files.
W_RELATIONSHIP is the relationship of the EGO with the ALTER as reported in the household grid part of the interview. But sometimes the information reported is not consistent within the household and/or across waves. For example, person 1 is person 2's parent, person 2 is person 3's sibling, but person 3 is person 1's partner. So, such inconsistencies are checked and corrected, and the result is recorded in W_RELATIONSHIP_DV.
The best source is the questionnaire. In the questionnaire search for this question using the question name and then look at the "Universe" description below the question. That will tell you who was eligible for this question.
When an adult sample member (aged 16 or over) is unable to respond to an interview, someone else who knows them very well (generally a spouse, partner or adult child) is asked to answer some questions on their behalf. These questions are restricted to those that collect factual information; information on attitudes, values and beliefs is not collected by proxy. As information for proxy and non-proxy cases is provided together in the data file, responses to the questions not included in the proxy questionnaire will be missing. To identify such cases, these variables are assigned a value of -7 for proxy respondents.
Samples and Sampling
That is not recommended because if you do that you will have coverage error. This is because the IEMB sample is not sampled from the whole of GB but only from areas of high ethnic minority and immigrant concentration. Also, no weights are provided for this sample separately.
That is not recommended because if you do that you will have coverage error. This is because the EMB sample is not sampled from the whole of GB but only from areas of high ethnic minority concentration. Also, no weights are provided for this sample separately.