
The basic mechanism of reading in the CPS depends on two included data sets:

  • cps_cols, containing information about column positions in the raw CPS files
  • cps_factors, containing information about the factor levels for the raw numeric codes

These contain default specifications for some commonly used CPS variables and for all of the VRS-specific variables. Across years, the CPS has changed many of the questions it asks, their possible responses, and where they are located in the fixed-width files that make up the data. For example, in 1996 the question asking how long a respondent had lived at their current residence was called PES6, occupied positions 823-824, and had six valid responses: less than 1 month, 1-6 months, 7-11 months, 1-2 years, 3-4 years, and 5 years or longer. In 2014, the same question was called PES8, occupied positions 965-966, and had replaced the first three 1996 categories with a single response, “less than 1 year”. This is why you must provide the correct column specification and factor levels, year by year, for any variable you want to read in.
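The residence example above can be written down as rows of a column specification. This is a minimal sketch that borrows the shape of the income_cols table built later in this article; the new_name value VRS_RESIDENCE is taken from the output shown further down, and whether the bundled cps_cols stores these two rows in exactly this form is an assumption.

```r
# Column specifications for the residence-duration question, using the
# question names and positions quoted in the text above.
residence_cols <- data.frame(
  year = c(1996, 2014),
  cps_name = c("PES6", "PES8"),
  new_name = "VRS_RESIDENCE",  # assumed shared name across years
  start_pos = c(823, 965),
  end_pos = c(824, 966),
  stringsAsFactors = FALSE
)

# Both fields are two characters wide, but they sit in different places.
residence_cols$width <- residence_cols$end_pos - residence_cols$start_pos + 1
residence_cols
```

Reading the 2014 file with the 1996 positions would pull two characters from an unrelated field, which is why each year needs its own row.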

In an ideal world, we would provide this information for every variable in the CPS across all years. Doing so quickly becomes time-prohibitive, as there are several hundred unique variables over the 13 years of data. We chose instead to focus on several important demographic variables and all of the voting and registration questions, and we provide full specifications for these most relevant variables.

With that in mind, let’s explore how you can add more variables to your data. In this example, we’ll read in the 2006 and 2008 CPS data with family income as an additional variable. First, we need to specify which column positions contain the family income variable in those years; for both years, this is positions 39-40. These positions are listed in the 2006 and 2008 documentation files, which you can download with cps_download_docs().

income_cols <- data.frame(
  year = c(2006, 2008),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  start_pos = 39,
  end_pos = 40,
  stringsAsFactors = FALSE
)
year  cps_name  new_name    start_pos  end_pos
2006  HUFAMINC  FAM_INCOME  39         40
2008  HUFAMINC  FAM_INCOME  39         40
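To make the meaning of start_pos and end_pos concrete, here is a small base-R sketch that pulls a two-character field out of a fixed-width record. The record is synthetic (filler characters around a made-up code), and the sketch assumes the positions are 1-based and inclusive, as in the CPS record layouts. It illustrates the idea only and is not a description of how cps_read() is implemented.

```r
# A synthetic fixed-width record: 38 filler characters, the two-character
# income code "07", then more filler. This is NOT real CPS data.
raw_line <- paste0(strrep("x", 38), "07", strrep("x", 20))

start_pos <- 39
end_pos <- 40

# substr() uses 1-based, inclusive positions, matching start_pos/end_pos.
income_code <- as.integer(substr(raw_line, start_pos, end_pos))
income_code
```

An off-by-one error in either position would return filler characters instead of the code, so it is worth checking a few raw lines against the documentation.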

We should then specify which factor levels are needed for those years of data. This is also obtained from the 2006 and 2008 documentation files.

income_factors <- data.frame(
  year = c(rep(2006, 16), rep(2008, 16)),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  code = c(1:16, 1:16),
  value = rep(c("LESS THAN $5,000",
                "5,000 TO 7,499",
                "7,500 TO 9,999",
                "10,000 TO 12,499",
                "12,500 TO 14,999",
                "15,000 TO 19,999",
                "20,000 TO 24,999",
                "25,000 TO 29,999",
                "30,000 TO 34,999",
                "35,000 TO 39,999",
                "40,000 TO 49,999",
                "50,000 TO 59,999",
                "60,000 TO 74,999",
                "75,000 TO 99,999",
                "100,000 TO 149,999",
                "150,000 OR MORE"), 2),
  stringsAsFactors = FALSE
)
year  cps_name  new_name    code  value
2006  HUFAMINC  FAM_INCOME  1     LESS THAN $5,000
2006  HUFAMINC  FAM_INCOME  2     5,000 TO 7,499
2006  HUFAMINC  FAM_INCOME  3     7,500 TO 9,999
2006  HUFAMINC  FAM_INCOME  4     10,000 TO 12,499
2006  HUFAMINC  FAM_INCOME  5     12,500 TO 14,999
2006  HUFAMINC  FAM_INCOME  6     15,000 TO 19,999
2006  HUFAMINC  FAM_INCOME  7     20,000 TO 24,999
2006  HUFAMINC  FAM_INCOME  8     25,000 TO 29,999
2006  HUFAMINC  FAM_INCOME  9     30,000 TO 34,999
2006  HUFAMINC  FAM_INCOME  10    35,000 TO 39,999
2006  HUFAMINC  FAM_INCOME  11    40,000 TO 49,999
2006  HUFAMINC  FAM_INCOME  12    50,000 TO 59,999
2006  HUFAMINC  FAM_INCOME  13    60,000 TO 74,999
2006  HUFAMINC  FAM_INCOME  14    75,000 TO 99,999
2006  HUFAMINC  FAM_INCOME  15    100,000 TO 149,999
2006  HUFAMINC  FAM_INCOME  16    150,000 OR MORE
2008  HUFAMINC  FAM_INCOME  1     LESS THAN $5,000
2008  HUFAMINC  FAM_INCOME  2     5,000 TO 7,499
2008  HUFAMINC  FAM_INCOME  3     7,500 TO 9,999
2008  HUFAMINC  FAM_INCOME  4     10,000 TO 12,499
2008  HUFAMINC  FAM_INCOME  5     12,500 TO 14,999
2008  HUFAMINC  FAM_INCOME  6     15,000 TO 19,999
2008  HUFAMINC  FAM_INCOME  7     20,000 TO 24,999
2008  HUFAMINC  FAM_INCOME  8     25,000 TO 29,999
2008  HUFAMINC  FAM_INCOME  9     30,000 TO 34,999
2008  HUFAMINC  FAM_INCOME  10    35,000 TO 39,999
2008  HUFAMINC  FAM_INCOME  11    40,000 TO 49,999
2008  HUFAMINC  FAM_INCOME  12    50,000 TO 59,999
2008  HUFAMINC  FAM_INCOME  13    60,000 TO 74,999
2008  HUFAMINC  FAM_INCOME  14    75,000 TO 99,999
2008  HUFAMINC  FAM_INCOME  15    100,000 TO 149,999
2008  HUFAMINC  FAM_INCOME  16    150,000 OR MORE
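The code/value pairs above act as a lookup from raw numeric codes to labels. The sketch below shows that idea with base R's factor() on an abbreviated lookup (codes 5 to 7 only) and made-up raw codes. Treating unmatched codes as NA is how factor() behaves; whether cps_label() handles them the same way is an assumption.

```r
# Abbreviated lookup: only codes 5 to 7 from the table above.
lookup <- data.frame(
  code = 5:7,
  value = c("12,500 TO 14,999", "15,000 TO 19,999", "20,000 TO 24,999"),
  stringsAsFactors = FALSE
)

# Made-up raw codes; -1 has no entry in the lookup.
raw_codes <- c(7, 7, 6, -1, 5)

# Codes become factor levels in lookup order; unmatched codes become NA.
labelled <- factor(raw_codes, levels = lookup$code, labels = lookup$value)
labelled
```

Because the levels follow the order of the lookup rows, listing the codes in ascending order keeps the income brackets sorted in later tables and plots.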

To read income in with our default data, we bind these to the bottom of the included data sets.

library(dplyr)  # provides bind_rows() and %>%

my_cols <- bind_rows(cps_cols, income_cols)
my_factors <- bind_rows(cps_factors, income_factors)
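Before reading any data, it can help to sanity-check the combined specification. The sketch below uses a toy stand-in for cps_cols (the default_cols rows are hypothetical placeholders, not the package's real data) and base R's rbind() in place of bind_rows(). The rule that each year/new_name pair should appear only once is an assumption about what a sensible specification looks like, not a documented requirement of the package.

```r
# Toy stand-in for the bundled cps_cols (hypothetical placeholder rows).
default_cols <- data.frame(
  year = c(2006, 2008),
  cps_name = "EXAMPLE",
  new_name = "EXAMPLE_VAR",
  start_pos = 1,
  end_pos = 2,
  stringsAsFactors = FALSE
)

# The income specification from this article.
extra_cols <- data.frame(
  year = c(2006, 2008),
  cps_name = "HUFAMINC",
  new_name = "FAM_INCOME",
  start_pos = 39,
  end_pos = 40,
  stringsAsFactors = FALSE
)

# rbind() stacks the rows because both tables share the same columns.
combined <- rbind(default_cols, extra_cols)

# Assumed rule: one row per year/new_name pair, and start <= end.
key <- paste(combined$year, combined$new_name)
dupes <- combined[duplicated(key), ]
bad_range <- combined[combined$start_pos > combined$end_pos, ]

nrow(dupes)
nrow(bad_range)
```

If either count is above zero, inspect the offending rows before passing the specification on.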

Then we can read in the CPS data with our new column specifications and factor it according to the updated factors.

cps_income <- cps_read(years = c(2006, 2008),
                       dir = here::here("cps_data"),
                       cols = my_cols) %>%
  cps_label(factors = my_factors)
#> Warning in cps_read(years = c(2006, 2008), dir = here::here("cps_data"), : The
#> column names provided by the CPS do not refer to the same question across all
#> years. Be cautious that you are joining columns which correspond across years.

str(cps_income)
#> tibble [304,054 × 18] (S3: tbl_df/tbl/data.frame)
#>  $ FILE                      : Factor w/ 2 levels "cps_nov2006.zip",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ YEAR                      : int [1:304054] 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
#>  $ STATE                     : Factor w/ 51 levels "AL","AK","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ AGE                       : int [1:304054] 35 11 50 78 63 63 37 18 15 8 ...
#>  $ SEX                       : Factor w/ 2 levels "MALE","FEMALE": 2 2 2 1 2 2 2 1 2 1 ...
#>  $ EDUCATION                 : Factor w/ 16 levels "LESS THAN 1ST GRADE",..: 9 NA 10 16 11 4 9 9 4 NA ...
#>  $ RACE                      : Factor w/ 21 levels "WHITE ONLY","BLACK ONLY",..: 1 1 1 1 1 1 2 2 2 2 ...
#>  $ HISPANIC                  : Factor w/ 2 levels "HISPANIC","NON-HIPSANIC": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ WEIGHT                    : num [1:304054] 2411 2666 3254 3647 2754 ...
#>  $ VRS_VOTE                  : Factor w/ 5 levels "YES","NO","DON'T KNOW",..: 1 NA 1 1 1 5 1 2 NA NA ...
#>  $ VRS_REG                   : Factor w/ 5 levels "YES","NO","DON'T KNOW",..: NA NA NA NA NA 5 NA 2 NA NA ...
#>  $ VRS_REG_WHYNOT            : Factor w/ 12 levels "DID NOT MEET REGISTRATION DEADLINES",..: NA NA NA NA NA NA NA 6 NA NA ...
#>  $ VRS_VOTE_WHYNOT           : Factor w/ 14 levels "OUT OF TOWN OR AWAY FROM HOME",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ VRS_VOTEMODE_2004toPRESENT: Factor w/ 5 levels "IN PERSON","BY MAIL",..: 1 NA 1 1 1 NA 1 NA NA NA ...
#>  $ VRS_VOTEWHEN_2004toPRESENT: Factor w/ 5 levels "ON ELECTION DAY",..: 1 NA 1 1 1 NA 1 NA NA NA ...
#>  $ VRS_REG_METHOD            : Factor w/ 11 levels "AT A SCHOOL, HOSPITAL, OR ON CAMPUS",..: 5 NA 10 9 9 NA 6 NA NA NA ...
#>  $ VRS_RESIDENCE             : Factor w/ 9 levels "LESS THAN 1 MONTH",..: 6 NA 4 4 4 9 4 4 NA NA ...
#>  $ FAM_INCOME                : Factor w/ 16 levels "LESS THAN $5,000",..: 7 7 6 NA NA NA 5 5 5 5 ...

One note: the warning from cps_read appears when join_dfs = TRUE (which is the default). It is intended to remind the user that variable names change across years, and to urge caution so that only columns referring to the same question are joined.

This is an unweighted breakdown of family income responses in 2006 and 2008.

table(cps_income$FAM_INCOME, cps_income$YEAR)
                      2006   2008
LESS THAN $5,000      2800   2647
5,000 TO 7,499        2224   1963
7,500 TO 9,999        2188   2164
10,000 TO 12,499      3447   2992
12,500 TO 14,999      3057   2815
15,000 TO 19,999      5046   4596
20,000 TO 24,999      6352   5945
25,000 TO 29,999      6833   6192
30,000 TO 34,999      7213   6901
35,000 TO 39,999      6662   6068
40,000 TO 49,999     10231   9951
50,000 TO 59,999     10604   9855
60,000 TO 74,999     12607  12488
75,000 TO 99,999     14291  13838
100,000 TO 149,999   11881  12886
150,000 OR MORE       8257   8936
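Since the table above is unweighted, a natural next step is a weighted version. The sketch below shows one way to do that in base R with xtabs(), on a toy data frame with made-up weights and two made-up income groups. It assumes a WEIGHT column like the one in the str() output can be summed directly; check the CPS documentation for how the weights should be scaled and interpreted before relying on such totals.

```r
# Toy data with made-up weights; NOT real CPS values.
toy <- data.frame(
  YEAR = c(2006, 2006, 2006, 2008, 2008),
  FAM_INCOME = factor(c("LOW", "HIGH", "HIGH", "LOW", "HIGH"),
                      levels = c("LOW", "HIGH")),
  WEIGHT = c(1500, 2000, 2500, 1000, 3000)
)

# Unweighted counts, the same kind of table as above.
table(toy$FAM_INCOME, toy$YEAR)

# Weighted totals: sum WEIGHT within each income-by-year cell.
weighted <- xtabs(WEIGHT ~ FAM_INCOME + YEAR, data = toy)
weighted

# Weighted share of each income group within a year (columns sum to 1).
shares <- prop.table(weighted, margin = 2)
round(shares, 3)
```

Comparing shares rather than raw totals makes the two years easier to read side by side, since the number of respondents differs between them.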