Read Additional CPS Variables • cpsvote

library(cpsvote)
library(dplyr)

The basic mechanism of reading in the CPS depends on two included data sets:

cps_cols, containing information about column positions in the raw CPS files
cps_factors, containing information about the factor levels for the raw numeric codes

These contain default specifications for some commonly used variables in the CPS and for all of the VRS-specific variables. Across years, there have been many changes in the questions asked by the CPS, their possible responses, and their locations in the fixed-width files that make up the data. For example, in 1996 the question asking how long a respondent had lived at their current residence was called PES6, occupied positions 823-824, and had six valid responses: less than 1 month, 1-6 months, 7-11 months, 1-2 years, 3-4 years, and 5 years or longer. In 2014, the same question was called PES8, occupied positions 965-966, and had replaced the first three categories from 1996 with a single response, “less than 1 year”. This is why you must provide the correct column specification and factor levels, for each year, for any variable you want to read in.
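The residence example can be written out as rows in the same layout the column specification uses later in this article (year, cps_name, new_name, start_pos, end_pos). This is only an illustrative sketch: the data frame name residence_cols is made up, and pairing these rows with the name VRS_RESIDENCE is an assumption rather than a copy of what cps_cols contains.

```r
# Stand-in rows in the cps_cols layout; names and positions are the ones
# quoted in the paragraph above. new_name is an assumed harmonized name.
residence_cols <- data.frame(
  year = c(1996, 2014),
  cps_name = c("PES6", "PES8"),
  new_name = "VRS_RESIDENCE",
  start_pos = c(823, 965),
  end_pos = c(824, 966),
  stringsAsFactors = FALSE
)

# The field is two characters wide in both years, but its location moved.
residence_cols$width <- residence_cols$end_pos - residence_cols$start_pos + 1
residence_cols
```

With the package loaded, filtering cps_cols on new_name in the same way should show how the defaults handle a given variable across years.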

In an ideal world, we would have provided this information for every variable in the CPS across all years. This quickly becomes time-prohibitive, as there are several hundred unique variables over the 13 years of data. We chose instead to focus on several important demographic variables and all of the voting and registration questions, and have provided full specifications for these most relevant variables.

With that in mind, let’s explore how you can add more variables to your data. In this example, we’ll read in the 2006 and 2008 CPS data with family income as an additional variable. First, we need to specify which column positions contain the family income variable in those years; for both years, this is positions 39-40. These positions come from the 2006 and 2008 documentation files, which you can download with cps_download_docs().

income_cols <- data.frame(
  year = c(2006, 2008),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  start_pos = 39,
  end_pos = 40,
  stringsAsFactors = FALSE
)

year	cps_name	new_name	start_pos	end_pos
2006	HUFAMINC	FAM_INCOME	39	40
2008	HUFAMINC	FAM_INCOME	39	40
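To make start_pos and end_pos concrete, here is a minimal base R sketch of pulling a two-character field out of one fixed-width record. The record is made up, and the sketch assumes positions are 1-indexed and inclusive; it is not how cps_read is implemented.

```r
# A made-up 60-character record; real CPS lines are much longer.
record <- paste0(strrep("0", 38), "07", strrep("0", 20))

start_pos <- 39
end_pos <- 40

# Extract characters 39 through 40 and convert them to a numeric code.
raw_code <- as.integer(substr(record, start_pos, end_pos))
raw_code
```

If the positions were off by even one character, the extracted code would be read from a neighboring field, which is why the positions need to be checked against the documentation for each year.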

We should then specify which factor levels are needed for those years of data. This is also obtained from the 2006 and 2008 documentation files.

income_factors <- data.frame(
  year = c(rep(2006, 16), rep(2008, 16)),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  code = c(1:16, 1:16),
  value = rep(c("LESS THAN $5,000",
                "5,000 TO 7,499",
                "7,500 TO 9,999",
                "10,000 TO 12,499",
                "12,500 TO 14,999",
                "15,000 TO 19,999",
                "20,000 TO 24,999",
                "25,000 TO 29,999",
                "30,000 TO 34,999",
                "35,000 TO 39,999",
                "40,000 TO 49,999",
                "50,000 TO 59,999",
                "60,000 TO 74,999",
                "75,000 TO 99,999",
                "100,000 TO 149,999",
                "150,000 OR MORE"), 2),
  stringsAsFactors = FALSE
)

year	cps_name	new_name	code	value
2006	HUFAMINC	FAM_INCOME	1	LESS THAN $5,000
2006	HUFAMINC	FAM_INCOME	2	5,000 TO 7,499
2006	HUFAMINC	FAM_INCOME	3	7,500 TO 9,999
2006	HUFAMINC	FAM_INCOME	4	10,000 TO 12,499
2006	HUFAMINC	FAM_INCOME	5	12,500 TO 14,999
2006	HUFAMINC	FAM_INCOME	6	15,000 TO 19,999
2006	HUFAMINC	FAM_INCOME	7	20,000 TO 24,999
2006	HUFAMINC	FAM_INCOME	8	25,000 TO 29,999
2006	HUFAMINC	FAM_INCOME	9	30,000 TO 34,999
2006	HUFAMINC	FAM_INCOME	10	35,000 TO 39,999
2006	HUFAMINC	FAM_INCOME	11	40,000 TO 49,999
2006	HUFAMINC	FAM_INCOME	12	50,000 TO 59,999
2006	HUFAMINC	FAM_INCOME	13	60,000 TO 74,999
2006	HUFAMINC	FAM_INCOME	14	75,000 TO 99,999
2006	HUFAMINC	FAM_INCOME	15	100,000 TO 149,999
2006	HUFAMINC	FAM_INCOME	16	150,000 OR MORE
2008	HUFAMINC	FAM_INCOME	1	LESS THAN $5,000
2008	HUFAMINC	FAM_INCOME	2	5,000 TO 7,499
2008	HUFAMINC	FAM_INCOME	3	7,500 TO 9,999
2008	HUFAMINC	FAM_INCOME	4	10,000 TO 12,499
2008	HUFAMINC	FAM_INCOME	5	12,500 TO 14,999
2008	HUFAMINC	FAM_INCOME	6	15,000 TO 19,999
2008	HUFAMINC	FAM_INCOME	7	20,000 TO 24,999
2008	HUFAMINC	FAM_INCOME	8	25,000 TO 29,999
2008	HUFAMINC	FAM_INCOME	9	30,000 TO 34,999
2008	HUFAMINC	FAM_INCOME	10	35,000 TO 39,999
2008	HUFAMINC	FAM_INCOME	11	40,000 TO 49,999
2008	HUFAMINC	FAM_INCOME	12	50,000 TO 59,999
2008	HUFAMINC	FAM_INCOME	13	60,000 TO 74,999
2008	HUFAMINC	FAM_INCOME	14	75,000 TO 99,999
2008	HUFAMINC	FAM_INCOME	15	100,000 TO 149,999
2008	HUFAMINC	FAM_INCOME	16	150,000 OR MORE
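Each row above is effectively a lookup from a raw numeric code to a label. As a rough illustration of that idea (not the internals of cps_label), base R's factor() can apply such a lookup. The raw codes below are hypothetical, and only the first three income codes are included to keep the sketch short.

```r
# Lookup rows in the code/value layout (first three income codes only).
lookup <- data.frame(
  code = 1:3,
  value = c("LESS THAN $5,000", "5,000 TO 7,499", "7,500 TO 9,999"),
  stringsAsFactors = FALSE
)

# Hypothetical raw codes; -1 stands in for a code that has no label.
raw <- c(2, 1, 3, -1, 2)

# Codes found in the lookup get their label; anything else becomes NA.
labelled <- factor(raw, levels = lookup$code, labels = lookup$value)
labelled
```

In this sketch, a code missing from the lookup silently turns into NA, so an incomplete set of factor levels would show up as unexpected missing values.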

To read income in alongside the default variables, we bind income_cols and income_factors to the bottom of the included data sets.

my_cols <- bind_rows(cps_cols, income_cols)
my_factors <- bind_rows(cps_factors, income_factors)
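Before reading any data, it can be worth checking that the combined specification has no accidental duplicates or inverted ranges. The helper below, check_spec, is a hypothetical function written for this sketch and is not part of cpsvote; it is shown on a toy data frame that deliberately repeats one row.

```r
# Hypothetical helper: flag repeated (year, new_name) pairs and ranges
# where start_pos is greater than end_pos.
check_spec <- function(cols) {
  dup <- duplicated(cols[, c("year", "new_name")])
  bad_range <- cols$start_pos > cols$end_pos
  list(
    duplicates = cols[dup, , drop = FALSE],
    bad_ranges = cols[bad_range, , drop = FALSE]
  )
}

# Toy specification; the third row intentionally duplicates the first.
toy_cols <- data.frame(
  year = c(2006, 2008, 2006),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  start_pos = 39,
  end_pos = 40,
  stringsAsFactors = FALSE
)

result <- check_spec(toy_cols)
result$duplicates
```

Calling check_spec(my_cols) on the combined specification would apply the same checks to the real rows.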

Then we can read in the CPS data with our new column specifications and factor it according to the updated factors.

cps_income <- cps_read(years = c(2006, 2008),
                       dir = here::here("cps_data"),
                       cols = my_cols) %>%
  cps_label(factors = my_factors)
#> Warning in cps_read(years = c(2006, 2008), dir = here::here("cps_data"), : The
#> column names provided by the CPS do not refer to the same question across all
#> years. Be cautious that you are joining columns which correspond across years.

str(cps_income)
#> tibble [304,054 × 18] (S3: tbl_df/tbl/data.frame)
#>  $ FILE                      : Factor w/ 2 levels "cps_nov2006.zip",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ YEAR                      : int [1:304054] 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
#>  $ STATE                     : Factor w/ 51 levels "AL","AK","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ AGE                       : int [1:304054] 35 11 50 78 63 63 37 18 15 8 ...
#>  $ SEX                       : Factor w/ 2 levels "MALE","FEMALE": 2 2 2 1 2 2 2 1 2 1 ...
#>  $ EDUCATION                 : Factor w/ 16 levels "LESS THAN 1ST GRADE",..: 9 NA 10 16 11 4 9 9 4 NA ...
#>  $ RACE                      : Factor w/ 21 levels "WHITE ONLY","BLACK ONLY",..: 1 1 1 1 1 1 2 2 2 2 ...
#>  $ HISPANIC                  : Factor w/ 2 levels "HISPANIC","NON-HIPSANIC": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ WEIGHT                    : num [1:304054] 2411 2666 3254 3647 2754 ...
#>  $ VRS_VOTE                  : Factor w/ 5 levels "YES","NO","DON'T KNOW",..: 1 NA 1 1 1 5 1 2 NA NA ...
#>  $ VRS_REG                   : Factor w/ 5 levels "YES","NO","DON'T KNOW",..: NA NA NA NA NA 5 NA 2 NA NA ...
#>  $ VRS_REG_WHYNOT            : Factor w/ 12 levels "DID NOT MEET REGISTRATION DEADLINES",..: NA NA NA NA NA NA NA 6 NA NA ...
#>  $ VRS_VOTE_WHYNOT           : Factor w/ 14 levels "OUT OF TOWN OR AWAY FROM HOME",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ VRS_VOTEMODE_2004toPRESENT: Factor w/ 5 levels "IN PERSON","BY MAIL",..: 1 NA 1 1 1 NA 1 NA NA NA ...
#>  $ VRS_VOTEWHEN_2004toPRESENT: Factor w/ 5 levels "ON ELECTION DAY",..: 1 NA 1 1 1 NA 1 NA NA NA ...
#>  $ VRS_REG_METHOD            : Factor w/ 11 levels "AT A SCHOOL, HOSPITAL, OR ON CAMPUS",..: 5 NA 10 9 9 NA 6 NA NA NA ...
#>  $ VRS_RESIDENCE             : Factor w/ 9 levels "LESS THAN 1 MONTH",..: 6 NA 4 4 4 9 4 4 NA NA ...
#>  $ FAM_INCOME                : Factor w/ 16 levels "LESS THAN $5,000",..: 7 7 6 NA NA NA 5 5 5 5 ...

One note: the warning from cps_read appears whenever join_dfs = TRUE (the default). It is a reminder that variable names change across years, so take care to join only columns that correspond to the same question.

This is an unweighted breakdown of family income responses in 2006 and 2008.

table(cps_income$FAM_INCOME, cps_income$YEAR)

	2006	2008
LESS THAN $5,000	2800	2647
5,000 TO 7,499	2224	1963
7,500 TO 9,999	2188	2164
10,000 TO 12,499	3447	2992
12,500 TO 14,999	3057	2815
15,000 TO 19,999	5046	4596
20,000 TO 24,999	6352	5945
25,000 TO 29,999	6833	6192
30,000 TO 34,999	7213	6901
35,000 TO 39,999	6662	6068
40,000 TO 49,999	10231	9951
50,000 TO 59,999	10604	9855
60,000 TO 74,999	12607	12488
75,000 TO 99,999	14291	13838
100,000 TO 149,999	11881	12886
150,000 OR MORE	8257	8936
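Raw counts are hard to compare across years, so a natural next step is to convert them to within-year shares, or to sum the WEIGHT column instead of counting rows. The sketch below uses a small made-up data frame (toy) with the same column names as cps_income; the values are invented, and whether the raw WEIGHT column is the appropriate weight for a given analysis is outside the scope of this sketch.

```r
# Made-up stand-in for cps_income with two income levels.
income_levels <- c("LESS THAN $5,000", "5,000 TO 7,499")
toy <- data.frame(
  YEAR = c(2006, 2006, 2006, 2008, 2008, 2008),
  FAM_INCOME = factor(income_levels[c(1, 2, 2, 1, 1, 2)],
                      levels = income_levels),
  WEIGHT = c(1000, 1500, 500, 2000, 1000, 3000)
)

# Unweighted counts, then the share of each category within a year.
counts <- table(toy$FAM_INCOME, toy$YEAR)
shares <- prop.table(counts, margin = 2)

# Weighted totals: sum WEIGHT within each income-by-year cell.
weighted <- xtabs(WEIGHT ~ FAM_INCOME + YEAR, data = toy)

shares
weighted
```

The same three calls should work on cps_income once it has been read in, since it carries FAM_INCOME, YEAR, and WEIGHT columns.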