Example 06: More dplyr

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycdogs)
library(socviz)

Using across() to apply functions to multiple columns

gss_sm
# A tibble: 2,867 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Let’s summarize the age, childs, and sibs columns.

gss_sm |>
  summarize(
    age_mean = mean(age, na.rm = TRUE),
    age_sd = sd(age, na.rm = TRUE),
    childs_mean = mean(childs, na.rm = TRUE),
    childs_sd = sd(childs, na.rm = TRUE),
    sibs_mean = mean(sibs, na.rm = TRUE),
    sibs_sd = sd(sibs, na.rm = TRUE)
  )
# A tibble: 1 × 6
  age_mean age_sd childs_mean childs_sd sibs_mean sibs_sd
     <dbl>  <dbl>       <dbl>     <dbl>     <dbl>   <dbl>
1     49.2   17.7        1.85      1.67      3.72    3.21

This is tedious, and we have to name every column we create by hand, which is error-prone. Instead, we can use across() to apply the same functions to multiple columns at once.

gss_sm |>
  summarize(
    across(c(age, childs, sibs),
           list(mean = \(x) mean(x, na.rm = TRUE),
                sd = \(x) sd(x, na.rm = TRUE)))
  )
# A tibble: 1 × 6
  age_mean age_sd childs_mean childs_sd sibs_mean sibs_sd
     <dbl>  <dbl>       <dbl>     <dbl>     <dbl>   <dbl>
1     49.2   17.7        1.85      1.67      3.72    3.21
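
The output columns get their names from across()’s default naming scheme, which is "{.col}_{.fn}" when you supply a list of functions. If you prefer a different convention you can pass your own glue-style specification to the .names argument. Here is a small sketch; the particular pattern is just an illustration, not something we need below:

gss_sm |>
  summarize(
    across(c(age, childs, sibs),
           list(mean = \(x) mean(x, na.rm = TRUE),
                sd = \(x) sd(x, na.rm = TRUE)),
           .names = "{.fn}_of_{.col}")  # e.g. mean_of_age, sd_of_age, mean_of_childs, ...
  )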

We can also use where() to select columns based on their type. Here we summarize all the numeric columns. First we drop id, because it’s just a row identifier and it doesn’t make sense to summarize it. We drop year, ballot, and the weight variables for similar reasons. We use the tidy selector starts_with("wts") to find the weight variables, and negate with - before the c() to drop the selected columns. Then, in the summarize statement, we use where(is.numeric) to select all the remaining numeric columns.

gss_sm |>
  select(-c(id, ballot, year, starts_with("wts"))) |>
  summarize(
    across(where(is.numeric),
           list(mean = \(x) mean(x, na.rm = TRUE),
                sd = \(x) sd(x, na.rm = TRUE)))
  )
# A tibble: 1 × 10
  age_mean age_sd childs_mean childs_sd sibs_mean sibs_sd pres12_mean pres12_sd
     <dbl>  <dbl>       <dbl>     <dbl>     <dbl>   <dbl>       <dbl>     <dbl>
1     49.2   17.7        1.85      1.67      3.72    3.21        1.42     0.625
# ℹ 2 more variables: obama_mean <dbl>, obama_sd <dbl>
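
If you find yourself reusing the same set of summary functions, you can store the list in an object and pass that to across() instead of typing it out each time. This is just a convenience sketch; my_stats is a name we are making up here:

my_stats <- list(mean = \(x) mean(x, na.rm = TRUE),
                 sd = \(x) sd(x, na.rm = TRUE))

gss_sm |>
  select(-c(id, ballot, year, starts_with("wts"))) |>
  summarize(across(where(is.numeric), my_stats))  # same result as the chunk above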

This will of course work with grouping variables as well. Here we also add a count of the number of rows in each group.

gss_sm |>
  select(-c(id, ballot, year, starts_with("wts"))) |>
  group_by(bigregion, degree) |>
  summarize(
    n = n(),
    across(
      where(is.numeric),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd = \(x) sd(x, na.rm = TRUE))
    )
  ) |>
  ungroup()
`summarise()` has grouped output by 'bigregion'. You can override using the
`.groups` argument.
# A tibble: 24 × 15
   bigregion degree            n age_mean age_sd childs_mean childs_sd sibs_mean
   <fct>     <fct>         <int>    <dbl>  <dbl>       <dbl>     <dbl>     <dbl>
 1 Northeast Lt High Scho…    38     51.3   21.0        2.66      2.26      5.18
 2 Northeast High School     237     50.2   18.6        1.76      1.52      3.44
 3 Northeast Junior Colle…    40     51.5   17.0        1.92      1.87      2.9 
 4 Northeast Bachelor         98     48.8   16.7        1.34      1.29      2.66
 5 Northeast Graduate         74     51.0   15.0        1.23      1.28      2.42
 6 Northeast <NA>              1     63     NA          2        NA         7   
 7 Midwest   Lt High Scho…    67     52.7   18.5        2.99      2.19      5.33
 8 Midwest   High School     378     49.7   18.6        1.99      1.67      3.88
 9 Midwest   Junior Colle…    44     44.8   16.8        1.57      1.47      3.80
10 Midwest   Bachelor        137     46.5   16.4        1.40      1.46      2.91
# ℹ 14 more rows
# ℹ 7 more variables: sibs_sd <dbl>, pres12_mean <dbl>, pres12_sd <dbl>,
#   obama_mean <dbl>, obama_sd <dbl>, n_mean <dbl>, n_sd <dbl>

See how, when we do this, we don’t put the n() call in the across()? That’s because n() is not a function that we are applying across all the numeric variables. We’re just counting the number of rows within the group, which will not change no matter how many summary statistics we calculate for variables within each group.
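
One wrinkle is visible in the output above: because n is created before the across() call, where(is.numeric) picks it up too, which is where the not-very-useful n_mean and n_sd columns come from. A minimal fix, as a sketch, is to move n = n() after across() so it doesn’t exist yet when the selection happens; here we also use summarize()’s .groups argument to drop the grouping instead of calling ungroup():

gss_sm |>
  select(-c(id, ballot, year, starts_with("wts"))) |>
  group_by(bigregion, degree) |>
  summarize(
    across(where(is.numeric),
           list(mean = \(x) mean(x, na.rm = TRUE),
                sd = \(x) sd(x, na.rm = TRUE))),
    n = n(),  # counted after across(), so it is not itself summarized
    .groups = "drop"
  )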

Aggregating NYC dog breeds by Zip Code

Zip codes are more complicated than they look. We’ll see more about them in a future class. For now, we’ll just use them as an example of how to do some data manipulation with dplyr. Here is a table of New York City zip codes (actually ZIP Code Tabulation Areas, or ZCTAs):

nyc_ztca <- readxl::read_xlsx(here::here(
  "files",
  "examples",
  "nyc-ztca-crosswalk.xlsx"))

nyc_ztca
# A tibble: 214 × 5
    zcta modzcta uhfcode uhfname                        borough  
   <dbl>   <dbl>   <dbl> <chr>                          <chr>    
 1 10001   10001     306 Chelsea - Clinton              Manhattan
 2 10002   10002     309 Union Square - Lower East Side Manhattan
 3 10003   10003     309 Union Square - Lower East Side Manhattan
 4 10004   10004     310 Lower Manhattan                Manhattan
 5 10005   10005     310 Lower Manhattan                Manhattan
 6 10006   10006     310 Lower Manhattan                Manhattan
 7 10007   10007     310 Lower Manhattan                Manhattan
 8 10009   10009     309 Union Square - Lower East Side Manhattan
 9 10010   10010     307 Gramercy Park -  Murray Hill   Manhattan
10 10011   10011     306 Chelsea - Clinton              Manhattan
# ℹ 204 more rows

The nycdogs package also has some information on zip codes. There’s a zip_code column in nyc_license. How many distinct zip codes does it contain?

nyc_license |>
    filter(extract_year == 2018) |>
    distinct(zip_code)
# A tibble: 441 × 1
   zip_code
      <int>
 1    10013
 2    10022
 3    11215
 4    11238
 5    10025
 6    11232
 7    11209
 8    11208
 9    10308
10    10002
# ℹ 431 more rows

That’s too many: people enter invalid zip codes for all kinds of reasons. You can look more carefully at what’s going on if you like. For now we’ll just keep the rows whose zip codes appear in our vector of valid codes, and then group and count by breed within zip code.
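
If you do want a quick peek at the codes that don’t match, one way (a sketch, querying the crosswalk table directly) is to negate the membership test and count the offenders:

nyc_license |>
    filter(extract_year == 2018,
           !zip_code %in% as.integer(nyc_ztca$zcta)) |>
    count(zip_code, sort = TRUE)  # most common non-matching codes first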

zip_vec <- as.integer(nyc_ztca$zcta)

nyc_license |>
    filter(extract_year == 2018,
           zip_code %in% zip_vec) |>
    count(zip_code, breed_rc)
# A tibble: 18,373 × 3
   zip_code breed_rc                           n
      <int> <chr>                          <int>
 1    10001 American Bully                     1
 2    10001 American Eskimo Dog                2
 3    10001 American Foxhound                  1
 4    10001 American Staffordshire Terrier     5
 5    10001 Australian Cattle Dog              2
 6    10001 Australian Cattledog               4
 7    10001 Australian Kelpie                  1
 8    10001 Australian Shepherd               14
 9    10001 Australian Silky Terrier           1
10    10001 Basset Hound                       1
# ℹ 18,363 more rows

That’s a lot of combinations: 18,373 of them. But many combinations are still dropped, because count() only returns combinations that actually occur in the data. We can work out how many slots there are in principle. We have 214 valid zip codes, but not every one of them shows up in the dog license data. Let’s see how many unique zip codes we have in the filtered data.

nyc_license |>
  filter(extract_year == 2018, zip_code %in% zip_vec) |>
  distinct(zip_code)
# A tibble: 190 × 1
   zip_code
      <int>
 1    10013
 2    10022
 3    11215
 4    11238
 5    10025
 6    11232
 7    11209
 8    11208
 9    10308
10    10002
# ℹ 180 more rows

So we get 190 zip codes. And how many unique breeds?

nyc_license |>
  filter(extract_year == 2018, zip_code %in% zip_vec) |>
  distinct(breed_rc)
# A tibble: 311 × 1
   breed_rc                       
   <chr>                          
 1 Basenji                        
 2 Unknown                        
 3 Labrador (or Crossbreed)       
 4 Miniature Pinscher             
 5 Dachshund Smooth Coat Miniature
 6 Cavalier King Charles Spaniel  
 7 Havanese                       
 8 Beagle                         
 9 Boxer                          
10 Pit Bull (or Mix)              
# ℹ 301 more rows

That’s 311 breeds. So in principle we have 190 * 311 = 59,090 combinations of zip code and breed. That’s a lot more than the 18,373 rows we actually have, which means many breed/zip code combinations are empty. We can recover them and fill them with zeros using complete(). This adds rows for every combination of zip code and breed, filling in zeros for any that are missing.

nyc_license |>
    filter(extract_year == 2018,
           zip_code %in% zip_vec) |>
    count(zip_code, breed_rc) |>
    complete(zip_code, breed_rc, fill = list(n = 0))
# A tibble: 59,090 × 3
   zip_code breed_rc                       n
      <int> <chr>                      <int>
 1    10001 Affenpinscher                  0
 2    10001 Afghan Hound                   0
 3    10001 Afghan Hound Crossbreed        0
 4    10001 Airedale Terrier               0
 5    10001 Akita                          0
 6    10001 Akita Crossbreed               0
 7    10001 Alaskan Malamute               0
 8    10001 American Bully                 1
 9    10001 American English Coonhound     0
10    10001 American Eskimo Dog            2
# ℹ 59,080 more rows
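
As a quick sanity check (a sketch, not part of the pipeline above), after complete() every zip code should have exactly 311 rows, one per breed:

nyc_license |>
    filter(extract_year == 2018,
           zip_code %in% zip_vec) |>
    count(zip_code, breed_rc) |>
    complete(zip_code, breed_rc, fill = list(n = 0)) |>
    group_by(zip_code) |>
    summarize(n_breeds = n()) |>  # rows per zip code after completing
    distinct(n_breeds)            # should collapse to a single value: 311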

This sort of thing is useful when you want to make sure that you have all combinations of two categorical variables. It’s also relevant when we’re doing things like making maps, where we want to make sure that every geographic unit is represented, even if there are no cases of whatever we’re counting in that unit.

If you look a little more closely at what’s inside the zip_code column in nyc_license compared to what’s in zip_vec, you’ll see that there are mismatches in both directions. That is, some codes appear in each that don’t appear in the other. This situation is very common, and we’ll return to it later when we discuss joining tables and relational data more generally. In addition, one problem with the way the zip codes are read in here is that they are numeric. Really they should be read as character strings: if we had a zip code of 06511, for instance, it would be read as 6511, which is not a valid code. Next week we’ll learn how to exercise more control over column types as we read in our data, and how to validate our data a bit more rigorously.
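
As a preview, readxl lets you declare column types when you read the file. In the sketch below, the col_types vector matches the five columns of the crosswalk file; reading the zip-like columns as text (and calling the result nyc_ztca_chr) is our own choice for illustration:

nyc_ztca_chr <- readxl::read_xlsx(
  here::here("files", "examples", "nyc-ztca-crosswalk.xlsx"),
  col_types = c("text", "text", "numeric", "text", "text"))  # keep zcta and modzcta as character strings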