Stata code for matching files

The syntax examples below show how to perform some common data management tasks useful in analysing the Innovation Panel data files.

Each task is illustrated with code for Stata. Statements beginning with // are comments. The 6 tasks are:

Distributing household level information to individual level
Summarising individual level information at the household level
Matching individuals within a household
Using the egoalt file to create household composition variables
Merging individual files across waves into long format
Merging individual files across waves into wide format

Example 1: Distributing household level information to individual level

In this example we will distribute household level information to the individuals in those households. We can do this by merging a household level file (such as w_hhresp_ip) with an individual level file (such as w_indresp_ip) from the same wave. Here w_ stands for the wave prefix, which is a_ in the code below.

// open the household level file

use a_hidp a_hhsize using a_hhresp_ip, clear 

// sort it on the household identifier, w_hidp

sort a_hidp

// save this temporary file

save hhinfo, replace

// open the individual level file

use pidp a_hidp  a_marstat using a_indresp_ip, clear 

// sort it on the household identifier, w_hidp

sort a_hidp

// merge it with the earlier saved file on w_hidp. The output shows how many cases matched

merge m:1 a_hidp using hhinfo   

// drop _merge: an essential step, because a later merge fails if this variable already exists

drop _merge

save final1, replace

// clean up unwanted files

erase hhinfo.dta
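
Before dropping _merge it can be useful to check how the match went. The lines below are a minimal sketch rather than part of the original example: they repeat the steps above, tabulate _merge, and keep only rows that exist in the individual level file. The output file name final1_checked is hypothetical. No sort is used because the merge m:1 syntax sorts the data itself when needed.

```stata
// rebuild the temporary household level file
use a_hidp a_hhsize using a_hhresp_ip, clear
save hhinfo, replace

// open the individual level file and merge on the household identifier
use pidp a_hidp a_marstat using a_indresp_ip, clear
merge m:1 a_hidp using hhinfo

// 1 = individual file only, 2 = household file only, 3 = matched
tabulate _merge

// drop households that have no record in the individual level file
drop if _merge == 2
drop _merge

// final1_checked is a hypothetical file name
save final1_checked, replace
erase hhinfo.dta
```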

Example 2: Summarising individual level information at the household level

In this example we will summarise individual level information within a household (number of 18-24 year olds in the household) and then match that onto the household level file.

use a_hidp a_hhsize using a_hhresp_ip, clear

sort a_hidp

save hhinfo, replace

use pidp a_hidp a_dvage using a_indall_ip, clear

// create a variable that counts the number of 18-24 year olds in each household

bysort a_hidp: egen n1824= sum(a_dvage>=18 & a_dvage<=24)

// keep only first observation for every household

bysort a_hidp: keep if _n==1

// keep only household level information

keep a_hidp n1824

// now merging this household information with the household level file

sort a_hidp

merge 1:1 a_hidp using hhinfo

drop _merge

save final2, replace

erase hhinfo.dta
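
The same household count could also be produced with collapse, which reduces the data to one row per household in a single step. This is an alternative sketch, not the method used above; the variable is1824 and the file name final2_collapse are hypothetical names introduced for illustration.

```stata
use a_hidp a_dvage using a_indall_ip, clear

// flag 18-24 year olds; inrange() returns 0 when the age is missing
gen byte is1824 = inrange(a_dvage, 18, 24)

// sum the flag within each household, leaving one row per household
collapse (sum) n1824 = is1824, by(a_hidp)

// attach the count to the household level file
merge 1:1 a_hidp using a_hhresp_ip, keepusing(a_hhsize)
drop _merge

// final2_collapse is a hypothetical file name
save final2_collapse, replace
```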

Example 3: Matching individuals within a household

In this example we will match the information of spouses/partners onto that of each person who has a spouse/partner in the household.

/* Open the dataset with information on all persons in responding households and keep only those persons who have a spouse/partner in the household*/

use a_hidp a_pno a_hgpart a_sex a_dvage using a_indall_ip if a_hgpart>0, clear

// rename the prefix a_ to something that would indicate that this information relates to the spouse or partner

renpfix a_ sp_

/* rename the spouse/partner pno variable to the respondent pno variable as this will be used to match on to the respondent information. Then sort and save the data*/

rename sp_hgpart a_pno

rename sp_hidp a_hidp

drop sp_pno

sort a_hidp a_pno

save spousepartner, replace

/* Again open the data with information on all persons in responding households*/

use a_hidp a_pno a_hgpart a_sex a_dvage using a_indall_ip if a_hgpart>0, clear

/* rename the prefix a_ to something that would indicate that this information relates to the respondent */

renpfix a_ r_

/* as we want to match on a_hidp and a_pno rename r_hidp and r_pno back to these */

rename r_hidp a_hidp

rename r_pno a_pno

// Now sort and merge with the spouse partner file

sort a_hidp a_pno

merge 1:1 a_hidp a_pno using spousepartner

drop _merge

save final3, replace

erase spousepartner.dta
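
In Stata 12 and later, rename accepts wildcards and lists of names, so the same steps can be written without renpfix. The following is a sketch under that assumption; it also uses tempfile, so no file has to be erased at the end. It is meant to be run as one do-file, because the tempfile macro exists only while that do-file runs. The file name final3_rename is hypothetical.

```stata
// spouse/partner side: give every variable the sp_ prefix
use a_hidp a_pno a_hgpart a_sex a_dvage using a_indall_ip if a_hgpart>0, clear
rename a_* sp_*

// the partner's person number and the household identifier become the match keys
rename (sp_hgpart sp_hidp) (a_pno a_hidp)
drop sp_pno

tempfile spousepartner
save `spousepartner'

// respondent side: give every variable the r_ prefix, then restore the match keys
use a_hidp a_pno a_hgpart a_sex a_dvage using a_indall_ip if a_hgpart>0, clear
rename a_* r_*
rename (r_hidp r_pno) (a_hidp a_pno)

merge 1:1 a_hidp a_pno using `spousepartner'
drop _merge

// final3_rename is a hypothetical file name
save final3_rename, replace
```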

Example 4: Using the egoalt file to create household composition variables

In this example we will create a variable that measures the number of siblings in the household using the w_egoalt_ip file.

use b_hidp b_epno b_relationship using b_egoalt_ip, clear

// create a variable that counts the number of siblings in the household

bysort b_hidp b_epno: egen nsiblings = sum(b_relationship>=14 & b_relationship<=17)

lab var nsiblings "number of siblings in household"

// keep one observation per person

bysort b_hidp b_epno: keep if _n==1

sort b_hidp b_epno

save final4, replace

Now this information can be merged with any individual level file.
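
One way such a merge might look is sketched below. It rests on two assumptions that should be checked against the data documentation: that b_epno in the egoalt file identifies the same person as b_pno in the individual level file, and that people with no rows in the egoalt file (for example those living alone) have no siblings in the household. The file name final4_ind is hypothetical.

```stata
use pidp b_hidp b_pno using b_indall_ip, clear

// ASSUMPTION: b_epno in final4 is the same person number as b_pno here
rename b_pno b_epno

merge 1:1 b_hidp b_epno using final4, keepusing(nsiblings)

// ASSUMPTION: a person with no egoalt rows has no siblings in the household
replace nsiblings = 0 if _merge == 1

// drop egoalt records that have no match in the individual level file
drop if _merge == 2
drop _merge

rename b_epno b_pno

// final4_ind is a hypothetical file name
save final4_ind, replace
```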

Example 5: Merging individual files across waves into long format

To match individual level files across two waves into a long format, do the following (for more waves, add the wave specific prefixes to the foreach statements and to the string in strpos() that generates the wave number):

foreach w in a b {

     // open the individual level file

     use pidp `w'_jbhas using `w'_indresp_ip, clear

     // drop the wave prefix from all variables

     renpfix `w'_

     // create a wave variable

     gen wave=strpos("ab", "`w'")

     // save one file for each wave

     save temp`w', replace

}

// open the file for the first wave (wave a_)

use tempa, clear

foreach w in b {

     // append the files for second wave onwards

     append using temp`w'

}

// save the long file

save final5, replace 

// erase temporary files

foreach w in a b {

     erase temp`w'.dta

}
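
Once the long file exists, a few quick checks can confirm that it has the expected structure. This is a short sketch that assumes final5 was saved as above, with one row per person per wave.

```stata
use final5, clear

// stops with an error if pidp and wave do not uniquely identify the rows
isid pidp wave

// number of observations contributed by each wave
tabulate wave

// declare the panel structure so that panel commands and time-series operators can be used
xtset pidp wave
```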

Example 6: Merging individual files across waves into wide format

The following code shows how to match individual level files across two waves into a wide format. The code can be adapted to handle more waves by adding wave specific prefixes in the foreach statement:

use pidp a_jbhas using a_indresp_ip, clear

sort pidp

save temp, replace

foreach w in b {

     use pidp `w'_jbhas using `w'_indresp_ip, clear

     sort pidp

     merge 1:1 pidp using temp

     drop _merge

     sort pidp

     save temp, replace

}

save final6, replace

erase temp.dta
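
A wide file can also be derived from the long file of Example 5 with reshape. The sketch below is an alternative to the merge loop, not the method used above; it assumes final5 contains only pidp, jbhas and wave, with wave coded 1 and 2. The file name final6_reshape is hypothetical.

```stata
use final5, clear

// one row per person; jbhas becomes jbhas1 (wave a) and jbhas2 (wave b)
reshape wide jbhas, i(pidp) j(wave)

// restore the wave prefixes used elsewhere in the data
rename jbhas1 a_jbhas
rename jbhas2 b_jbhas

// final6_reshape is a hypothetical file name
save final6_reshape, replace
```

People observed in only one wave keep a row, with a missing value for the other wave's variable, which corresponds to the unmatched cases in the merge approach.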
