Example 08: Databases

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/kjhealy/Documents/courses/mptc

Loading required package: sysfonts

Loading required package: showtextdb

ARCOS Opioid Data

This example shows the results of setting up the ARCOS data as a duckdb database. Unlike most of the other examples we are not actually rendering this live on these pages because I do not want to include the >40GB ARCOS raw data file in the course repository.

First we take a look at a sample of the data; just the fist thousand rows:

library(DBI) # will be autoloaded by dbplyr too
library(dbplyr)
library(here)

## Testing df
tmp <- read_tsv(here::here("raw", "arcos_all_washpost_noquotes.tsv"), n_max = 1000)
#> Rows: 1000 Columns: 16
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr  (14): BUYER_DEA_NO, BUYER_BUS_ACT, BUYER_NAME, BUYER_ADDL_CO_INFO, BUYER_ADDRESS1, BUYER_ADDRESS2, BUYER_CITY, BUYER_COUNTY, REPOR...
#> dbl   (1): BUYER_ZIP
#> date  (1): date_iso
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’re going to import the date_iso column directly as a date by creating a custom class for it.

raw_colclasses <- rep("character", length(colnames(tmp)))
names(raw_colclasses) <- colnames(tmp)

## Hand-rolled date converter
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format="%Y-%m-%d"))


raw_colclasses[which(colnames(tmp) %in% "date_iso")] <- "myDate"
raw_colclasses
#>          BUYER_DEA_NO         BUYER_BUS_ACT            BUYER_NAME    BUYER_ADDL_CO_INFO        BUYER_ADDRESS1        BUYER_ADDRESS2 
#>           "character"           "character"           "character"           "character"           "character"           "character" 
#>            BUYER_CITY             BUYER_ZIP          BUYER_COUNTY         REPORTER_NAME              date_iso          Product_Name 
#>           "character"           "character"           "character"           "character"              "myDate"           "character" 
#> Combined_Labeler_Name  Revised_Company_Name       Reporter_family           BUYER_STATE 
#>  

When we import the data to duckdb in a moment we’ll use this raw_colclasses vector to set the column classes for duckdb’s duckdb_read_csv() function.

First we need to open a connection and say where the database file will exist.

# Raw data
path <- here::here("raw", "arcos_all_washpost_noquotes.tsv")

# DB connection
con <- dbConnect(duckdb::duckdb(),
                 dbdir = here("db", "arcos.duckdb"),
                 read_only = FALSE)

Now we read in the data using duckdb’s CSV reader. It’s parallelized and very fast. It will index all the columns by default.

## tictoc just to time things
tictoc::tic("Read DB")
duckdb::duckdb_read_csv(con, "arcos", path,
                        delim = "\t",
                        lower.case.names = TRUE,
                        colClasses = raw_colclasses)
#> Warning in duckdb::duckdb_read_csv(con, "arcos", path, delim = "\t", lower.case.names = TRUE, : Arguments passed to ... are currently not
#> used
tictoc::toc()
#> Read DB: 27.477 sec elapsed

This takes about 30 seconds on an MacBook Pro M3 Max with 128GB of RAM. On a Mac Mini with an M2 Pro processor and 16GB of RAM it takes about 80 seconds.

Now we can create a pseudo-table of our data. As a matter of convention we’ll use the _db prefix for objects that refer to the database but have not been materialized.


## Check
df_db <- tbl(con, "arcos")

df_db
#> # Source:   table<arcos> [?? x 16]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#>    buyer_dea_no buyer_bus_act  buyer_name  buyer_addl_co_info buyer_address1 buyer_address2 buyer_city buyer_zip buyer_county reporter_name
#>    <chr>        <chr>          <chr>       <chr>              <chr>          <chr>          <chr>      <chr>     <chr>        <chr>        
#>  1 BC2511980    PRACTITIONER   COLEMAN, P… NA                 FAA C/O AMERI… PSC 801 BOX 41 LONDON     9498      NA           DIAMOND PHAR…
#>  2 BC2511980    PRACTITIONER   COLEMAN, P… NA                 FAA C/O AMERI… PSC 801 BOX 41 LONDON     9498      NA           DIAMOND PHAR…
#>  3 BC2462719    CHAIN PHARMACY CARR-GOTTS… CARRS PHARMACY #1… 1501 E HUFFMA… NA             ANCHORAGE  99511     ANCHORAGE    MCKESSON COR…
#>  4 AM3167120    PRACTITIONER   MERCHANT, … NA                 3710 WOODLAND… SUITE 1100     ANCHORAGE  99517     ANCHORAGE    DRX PHARMACE…
#>  5 AM3167120    PRACTITIONER   MERCHANT, … NA                 3710 WOODLAND… SUITE 1100     ANCHORAGE  99517     ANCHORAGE    DRX PHARMACE…
#>  6 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  7 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  8 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  9 BJ5538371    PRACTITIONER   JOHNSTON, … ALASKA  SPINE INS… 3801 UNIVERSI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#> 10 BM9540320    PRACTITIONER   MERCHANT, … NA                 1301 HUFFMAN … SUITE 205      ANCHORAGE  99515     ANCHORAGE    DRX PHARMACE…
#> # ℹ more rows
#> # ℹ 6 more variables: date_iso <date>, product_name <chr>, combined_labeler_name <chr>, revised_company_name <chr>, reporter_family <chr>,
#> #   buyer_state <chr>

Tabulating the Data

Now that our data are read in we can in many ways treat it just as if it existed as an ordinary tibble, even though it isn’t one.

How many rows do we have in our table anyway?

## How many rows?
df_db |> count() 
#> # Source:   SQL [1 x 1]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#>           n
#>       <dbl>
#> 1 178598026

That’s 178,598,026 rows.

We can create new columns in the table. Not all functions are supported for this kind of operation.

## We could extract year and month from date using sql directly
# df_db <- df_db |>
#   mutate(year = sql("year(date_iso)"), 
#          month = sql("month(date_iso)"))  

## But we can now do the same thing using lubridate
## (Previously one couldn't do this)
df_db <- df_db |>
  mutate(year = lubridate::year(date_iso), 
         month = lubridate::month(date_iso))  
         
df_db
#> # Source:   SQL [?? x 18]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#>    buyer_dea_no buyer_bus_act  buyer_name  buyer_addl_co_info buyer_address1 buyer_address2 buyer_city buyer_zip buyer_county reporter_name
#>    <chr>        <chr>          <chr>       <chr>              <chr>          <chr>          <chr>      <chr>     <chr>        <chr>        
#>  1 BC2511980    PRACTITIONER   COLEMAN, P… NA                 FAA C/O AMERI… PSC 801 BOX 41 LONDON     9498      NA           DIAMOND PHAR…
#>  2 BC2511980    PRACTITIONER   COLEMAN, P… NA                 FAA C/O AMERI… PSC 801 BOX 41 LONDON     9498      NA           DIAMOND PHAR…
#>  3 BC2462719    CHAIN PHARMACY CARR-GOTTS… CARRS PHARMACY #1… 1501 E HUFFMA… NA             ANCHORAGE  99511     ANCHORAGE    MCKESSON COR…
#>  4 AM3167120    PRACTITIONER   MERCHANT, … NA                 3710 WOODLAND… SUITE 1100     ANCHORAGE  99517     ANCHORAGE    DRX PHARMACE…
#>  5 AM3167120    PRACTITIONER   MERCHANT, … NA                 3710 WOODLAND… SUITE 1100     ANCHORAGE  99517     ANCHORAGE    DRX PHARMACE…
#>  6 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  7 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  8 BH6074431    PRACTITIONER   HALL, ROBE… NA                 3801 LAKE OTI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#>  9 BJ5538371    PRACTITIONER   JOHNSTON, … ALASKA  SPINE INS… 3801 UNIVERSI… SUITE 300      ANCHORAGE  99508     ANCHORAGE    DRX PHARMACE…
#> 10 BM9540320    PRACTITIONER   MERCHANT, … NA                 1301 HUFFMAN … SUITE 205      ANCHORAGE  99515     ANCHORAGE    DRX PHARMACE…
#> # ℹ more rows
#> # ℹ 8 more variables: date_iso <date>, product_name <chr>, combined_labeler_name <chr>, revised_company_name <chr>, reporter_family <chr>,
#> #   buyer_state <chr>, year <dbl>, month <dbl>
         

Now we have year and month as their own columns.

We can try out counts and other summary operations without materializing them:

## Count, but not really, yet
tictoc::tic("Pretend counting")
by_state_yr_db <- df_db |>
  count(buyer_state, year, month)
tictoc::toc()
#> Pretend counting: 0.014 sec elapsed

## This object is still not real, 
## or "materialized" NB the ?? in the rows
## (the counts shown are correct though)
by_state_yr_db
#> # Source:   SQL [?? x 4]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#>    buyer_state  year month     n
#>    <chr>       <dbl> <dbl> <dbl>
#>  1 AL           2012     7 58027
#>  2 AL           2007    12 40010
#>  3 AL           2010     6 53036
#>  4 AL           2012     4 54197
#>  5 AL           2010    10 47835
#>  6 AL           2007     1 39079
#>  7 AL           2008     5 42683
#>  8 AL           2011     5 52198
#>  9 AL           2006     7 36516
#> 10 AL           2006     4 34656
#> # ℹ more rows

And when we are ready we materialize the results with collect(). With data of this size, these operations are orders of magnitude faster than regular dplyr would be.

tictoc::tic("For reals counting")
by_state_yr <- by_state_yr_db |>
  collect()
tictoc::toc()
#> For reals counting: 0.175 sec elapsed

by_state_yr
#> # A tibble: 4,635 × 4
#>    buyer_state  year month     n
#>    <chr>       <dbl> <dbl> <dbl>
#>  1 AL           2010    12 53201
#>  2 AL           2009     3 44807
#>  3 AL           2009    11 43733
#>  4 AL           2011     4 49788
#>  5 AL           2010     1 41560
#>  6 AL           2012     5 59970
#>  7 AL           2011     7 49985
#>  8 AL           2008     4 42716
#>  9 AL           2008     7 45914
#> 10 AL           2007    10 43320
#> # ℹ 4,625 more rows

## Lots of data
by_state_yr |> 
  summarize(tot = sum(n))
#> # A tibble: 1 × 1
#>         tot
#>       <dbl>
#> 1 178598026