library(here) # manage file paths
library(socviz) # data and some useful functions
library(tidyverse) # your friend and mine
library(furrr) # Also loads `future`
Loading required package: future
Modern Plain Text Social Science: Week 10
Duke University
November 5, 2024
Split
Apply
Combine
We start with a dataset
We split it into pieces, usually according to some feature or categorical variable, or by file or something.
We do something—the same thing—to each of those pieces. That is, we apply a function or procedure to each piece. The procedure returns some result for each piece, and the results will all be of the same form: a number, a vector of counts, some summary statistics, a model, a list, whatever.
Finally we combine those results into a final piece of output, usually a tibble or somesuch.
dplyr is all about this.
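For example, counting GSS respondents by Census region. A minimal sketch of the kind of pipeline that produces the table below, assuming the gss_sm data from the socviz package loaded above (the original call isn't shown on this slide):

gss_sm |>
  group_by(bigregion) |>   # split by region
  tally()                  # apply a count within each group, combine into one tibble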
# A tibble: 4 × 2
bigregion n
<fct> <int>
1 Northeast 488
2 Midwest 695
3 South 1052
4 West 632
We split into groups, apply the sum() function within the groups, and combine the results into a new tibble showing the resulting sum per group. The various dplyr functions are oriented to doing this in a way that gives you a consistent set of outputs.
We can split, apply, combine in various ways.
Base R has the split() function:
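For example (a sketch using mtcars for illustration; the output is omitted here):

split(mtcars, mtcars$cyl)   # returns a named list of data frames, one per cyl value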
The Tidyverse has group_split():
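A sketch of the kind of call that produces the list of tibbles shown below:

mtcars |>
  group_split(cyl)   # one tibble per cylinder group, combined in a list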
<list_of<
tbl_df<
mpg : double
cyl : double
disp: double
hp : double
drat: double
wt : double
qsec: double
vs : double
am : double
gear: double
carb: double
>
>[3]>
[[1]]
# A tibble: 11 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
[[2]]
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
[[3]]
# A tibble: 14 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
11 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
12 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
13 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
14 15 8 301 335 3.54 3.57 14.6 0 1 5 8
The application step is “I want to fit a linear model to each piece”. We did this with nest() last time.
The application step is “I want to fit a linear model to each piece” and get a summary. Again, we did this with group_by() and nest() last time.
In this case the “combine” step is implicit at the end: we get a vector of R-squared values back, and it’s as long as the number of groups.
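A sketch of the sort of pipeline being described, borrowing the model formula from the nest() example below; the combine step is implicit in map_dbl() returning one number per group:

mtcars |>
  group_split(cyl) |>
  map(\(df) lm(mpg ~ wt + hp + gear, data = df)) |>  # fit a model to each piece
  map_dbl(\(m) summary(m)$r.squared)                 # one R-squared per group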
This is also what we’re doing, more elegantly (staying within a tibble structure), if we nest() and use broom to get a summary out.
mtcars |>
group_by(cyl) |>
nest() |>
mutate(model = map(data, \(df) lm(mpg ~ wt + hp + gear, data = df)),
perf = map(model, broom::glance)) |>
unnest(perf)
# A tibble: 3 × 15
# Groups: cyl [3]
cyl data model r.squared adj.r.squared sigma statistic p.value df
<dbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 <tibble> <lm> 0.660 0.319 1.20 1.94 0.300 3
2 4 <tibble> <lm> 0.730 0.615 2.80 6.31 0.0211 3
3 8 <tibble> <lm> 0.500 0.349 2.06 3.33 0.0648 3
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
In each of these cases, the data is processed sequentially or serially. R splits the data according to your instructions, applies the function or procedure to each one in turn, and combines the outputs in order out the other side. Your computer’s processor is handed each piece in turn.
For small tasks that’s fine. But for bigger tasks it gets inefficient quickly.
If this is the sort of task we have to apply to a bunch of things, it’s going to take ages.
A feature of many split-apply-combine activities is that it does not matter in what order the “apply” step happens across the groups. All that matters is that we can combine the results at the end, no matter what order they come back in. E.g., in the mtcars example, the model being fit to the 4-cylinder cars doesn’t depend in any way on the results of the model being fit to the 8-cylinder cars.
This sort of situation is embarrassingly parallel.
When a task is embarrassingly parallel, and the number of pieces or groups is large enough or complex enough, then if we can do them at the same time, we should. There is some overhead—we have to keep track of where each piece was sent and when the results come back in—but if that overhead is low enough in comparison to doing things serially, then we should parallelize the task.
Parallel computing used to mean “I have a cluster of computers at my disposal”. But modern CPUs are constructed in a semi-modular fashion. They have some number of “cores”, each one of which is like a small processor in its own right. In effect you have a computer cluster already.
Socket: The physical connection on your computer that houses the processor. These days, mostly there’s just one.
Core: The part of the processor that actually performs the computation. Most modern processors have multiple cores. Each one can do wholly independent work.
Process: A single instance of a running task or program (R, Slack, Chrome, etc.). A core can run only one process at a time. But cores are fast, and so in practice they can run many threads in quick succession.
Thread: A piece of a process that can share memory and resources with other threads. If your processor supports it, you can do something Intel calls hyperthreading, which is a way of dividing a physical core into (usually) two logical cores that work on the same clock cycle and share some resources on the physical core.
Cluster: A collection of things that are capable of hosting cores. Might be a single socket (on your laptop) or an old-fashioned room full of many physical computers that can be made to act as if they were a single machine.
Most of the time, even when we are using them, our computers sit around doing PRETTY MUCH NOTHING.
Remember, processor clock cycles are really fast. They’re measured in billions of cycles per second.
We need to put those cores to work!
These packages make parallel computing way more straightforward than it used to be. In particular, for Tidyverse-centric workflows, furrr provides a set of future_ functions that are drop-in replacements for map and friends. So map() becomes future_map(), and so on.
We set things up like this:
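The setup code itself isn't shown on this slide, but judging from the comments later on (“We did this already”), it was the standard future plan, something like:

plan(multisession)  # tell future/furrr to farm work out to several background R sessions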
# Another slow function (from
# Grant McDermott this time)
slow_square <- function(x = 1) {
x_sq <- x^2
out <- tibble(value = x, value_sq = x_sq)
Sys.sleep(2) # Zzzz
out
}
tictoc::tic("Serially")
## This is of course way slower than just writing
## slow_square(1:15) but nvm that for now
serial_out <- map(1:15, slow_square) |>
list_rbind()
tictoc::toc()
Serially: 30.105 sec elapsed
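The parallel run that produced the next timing isn't shown; presumably it was the drop-in replacement, something like this (the object name is illustrative):

tictoc::tic("Parallelized")
parallel_out <- future_map(1:15, slow_square) |>  # future_map() instead of map()
  list_rbind()
tictoc::toc()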
Parallelized: 4.709 sec elapsed
Data obtained with:
mkdir raw
cd raw
wget --no-parent -r -l inf --wait 5 --random-wait 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/'
This tries to be polite to the NOAA: it enforces a wait time and in addition randomizes it to make it variably longer. Note also that there is no limit on the depth of folder recursion (-l inf). There are a lot of files (>15,000). Doing it this way will take several days of real time (though much, much less in actual transfer time, of course).
Raw data directories:
[1] "198109" "198110" "198111"
[4] "198112" "198201" "198202"
[7] "198203" "198204" "198205"
[10] "198206" "198207" "198208"
[13] "198209" "198210" "198211"
[16] "198212" "198301" "198302"
[19] "198303" "198304" "198305"
[22] "198306" "198307" "198308"
[25] "198309" "198310" "198311"
[28] "198312" "198401" "198402"
[31] "198403" "198404" "198405"
[34] "198406" "198407" "198408"
[37] "198409" "198410" "198411"
[40] "198412" "198501" "198502"
[43] "198503" "198504" "198505"
[46] "198506" "198507" "198508"
[49] "198509" "198510" "198511"
[52] "198512" "198601" "198602"
[55] "198603" "198604" "198605"
[58] "198606" "198607" "198608"
[61] "198609" "198610" "198611"
[64] "198612" "198701" "198702"
[67] "198703" "198704" "198705"
[70] "198706" "198707" "198708"
[73] "198709" "198710" "198711"
[76] "198712" "198801" "198802"
[79] "198803" "198804" "198805"
[82] "198806" "198807" "198808"
[85] "198809" "198810" "198811"
[88] "198812" "198901" "198902"
[91] "198903" "198904" "198905"
[94] "198906" "198907" "198908"
[97] "198909" "198910" "198911"
[100] "198912" "199001" "199002"
[103] "199003" "199004" "199005"
[106] "199006" "199007" "199008"
[109] "199009" "199010" "199011"
[112] "199012" "199101" "199102"
[115] "199103" "199104" "199105"
[118] "199106" "199107" "199108"
[121] "199109" "199110" "199111"
[124] "199112" "199201" "199202"
[127] "199203" "199204" "199205"
[130] "199206" "199207" "199208"
[133] "199209" "199210" "199211"
[136] "199212" "199301" "199302"
[139] "199303" "199304" "199305"
[142] "199306" "199307" "199308"
[145] "199309" "199310" "199311"
[148] "199312" "199401" "199402"
[151] "199403" "199404" "199405"
[154] "199406" "199407" "199408"
[157] "199409" "199410" "199411"
[160] "199412" "199501" "199502"
[163] "199503" "199504" "199505"
[166] "199506" "199507" "199508"
[169] "199509" "199510" "199511"
[172] "199512" "199601" "199602"
[175] "199603" "199604" "199605"
[178] "199606" "199607" "199608"
[181] "199609" "199610" "199611"
[184] "199612" "199701" "199702"
[187] "199703" "199704" "199705"
[190] "199706" "199707" "199708"
[193] "199709" "199710" "199711"
[196] "199712" "199801" "199802"
[199] "199803" "199804" "199805"
[202] "199806" "199807" "199808"
[205] "199809" "199810" "199811"
[208] "199812" "199901" "199902"
[211] "199903" "199904" "199905"
[214] "199906" "199907" "199908"
[217] "199909" "199910" "199911"
[220] "199912" "200001" "200002"
[223] "200003" "200004" "200005"
[226] "200006" "200007" "200008"
[229] "200009" "200010" "200011"
[232] "200012" "200101" "200102"
[235] "200103" "200104" "200105"
[238] "200106" "200107" "200108"
[241] "200109" "200110" "200111"
[244] "200112" "200201" "200202"
[247] "200203" "200204" "200205"
[250] "200206" "200207" "200208"
[253] "200209" "200210" "200211"
[256] "200212" "200301" "200302"
[259] "200303" "200304" "200305"
[262] "200306" "200307" "200308"
[265] "200309" "200310" "200311"
[268] "200312" "200401" "200402"
[271] "200403" "200404" "200405"
[274] "200406" "200407" "200408"
[277] "200409" "200410" "200411"
[280] "200412" "200501" "200502"
[283] "200503" "200504" "200505"
[286] "200506" "200507" "200508"
[289] "200509" "200510" "200511"
[292] "200512" "200601" "200602"
[295] "200603" "200604" "200605"
[298] "200606" "200607" "200608"
[301] "200609" "200610" "200611"
[304] "200612" "200701" "200702"
[307] "200703" "200704" "200705"
[310] "200706" "200707" "200708"
[313] "200709" "200710" "200711"
[316] "200712" "200801" "200802"
[319] "200803" "200804" "200805"
[322] "200806" "200807" "200808"
[325] "200809" "200810" "200811"
[328] "200812" "200901" "200902"
[331] "200903" "200904" "200905"
[334] "200906" "200907" "200908"
[337] "200909" "200910" "200911"
[340] "200912" "201001" "201002"
[343] "201003" "201004" "201005"
[346] "201006" "201007" "201008"
[349] "201009" "201010" "201011"
[352] "201012" "201101" "201102"
[355] "201103" "201104" "201105"
[358] "201106" "201107" "201108"
[361] "201109" "201110" "201111"
[364] "201112" "201201" "201202"
[367] "201203" "201204" "201205"
[370] "201206" "201207" "201208"
[373] "201209" "201210" "201211"
[376] "201212" "201301" "201302"
[379] "201303" "201304" "201305"
[382] "201306" "201307" "201308"
[385] "201309" "201310" "201311"
[388] "201312" "201401" "201402"
[391] "201403" "201404" "201405"
[394] "201406" "201407" "201408"
[397] "201409" "201410" "201411"
[400] "201412" "201501" "201502"
[403] "201503" "201504" "201505"
[406] "201506" "201507" "201508"
[409] "201509" "201510" "201511"
[412] "201512" "201601" "201602"
[415] "201603" "201604" "201605"
[418] "201606" "201607" "201608"
[421] "201609" "201610" "201611"
[424] "201612" "201701" "201702"
[427] "201703" "201704" "201705"
[430] "201706" "201707" "201708"
[433] "201709" "201710" "201711"
[436] "201712" "201801" "201802"
[439] "201803" "201804" "201805"
[442] "201806" "201807" "201808"
[445] "201809" "201810" "201811"
[448] "201812" "201901" "201902"
[451] "201903" "201904" "201905"
[454] "201906" "201907" "201908"
[457] "201909" "201910" "201911"
[460] "201912" "202001" "202002"
[463] "202003" "202004" "202005"
[466] "202006" "202007" "202008"
[469] "202009" "202010" "202011"
[472] "202012" "202101" "202102"
[475] "202103" "202104" "202105"
[478] "202106" "202107" "202108"
[481] "202109" "202110" "202111"
[484] "202112" "202201" "202202"
[487] "202203" "202204" "202205"
[490] "202206" "202207" "202208"
[493] "202209" "202210" "202211"
[496] "202212" "202301" "202302"
[499] "202303" "202304" "202305"
[502] "202306" "202307" "202308"
[505] "202309" "202310" "202311"
[508] "202312" "202401" "202402"
[511] "202403" "202404" "index.html"
[514] "index.html?C=D;O=A" "index.html?C=D;O=D" "index.html?C=M;O=A"
[517] "index.html?C=M;O=D" "index.html?C=N;O=A" "index.html?C=N;O=D"
[520] "index.html?C=S;O=A" "index.html?C=S;O=D"
Raw data files, in netCDF (Version 4) format
[1] "oisst-avhrr-v02r01.20240201.nc" "oisst-avhrr-v02r01.20240202.nc"
[3] "oisst-avhrr-v02r01.20240203.nc" "oisst-avhrr-v02r01.20240204.nc"
[5] "oisst-avhrr-v02r01.20240205.nc" "oisst-avhrr-v02r01.20240206.nc"
[7] "oisst-avhrr-v02r01.20240207.nc" "oisst-avhrr-v02r01.20240208.nc"
[9] "oisst-avhrr-v02r01.20240209.nc" "oisst-avhrr-v02r01.20240210.nc"
[11] "oisst-avhrr-v02r01.20240211.nc" "oisst-avhrr-v02r01.20240212.nc"
[13] "oisst-avhrr-v02r01.20240213.nc" "oisst-avhrr-v02r01.20240214.nc"
[15] "oisst-avhrr-v02r01.20240215.nc" "oisst-avhrr-v02r01.20240216.nc"
[17] "oisst-avhrr-v02r01.20240217.nc" "oisst-avhrr-v02r01.20240218.nc"
[19] "oisst-avhrr-v02r01.20240219.nc" "oisst-avhrr-v02r01.20240220.nc"
[21] "oisst-avhrr-v02r01.20240221.nc" "oisst-avhrr-v02r01.20240222.nc"
[23] "oisst-avhrr-v02r01.20240223.nc" "oisst-avhrr-v02r01.20240224.nc"
[25] "oisst-avhrr-v02r01.20240225.nc" "oisst-avhrr-v02r01.20240226.nc"
[27] "oisst-avhrr-v02r01.20240227.nc" "oisst-avhrr-v02r01.20240228.nc"
[29] "oisst-avhrr-v02r01.20240229.nc"
## Seasons for plotting
season <- function(in_date){
br = yday(as.Date(c("2019-03-01",
"2019-06-01",
"2019-09-01",
"2019-12-01")))
x = yday(in_date)
x = cut(x, breaks = c(0, br, 366))
levels(x) = c("Winter", "Spring", "Summer", "Autumn", "Winter")
x
}
season_lab <- tibble(yrday = yday(as.Date(c("2019-03-01",
"2019-06-01",
"2019-09-01",
"2019-12-01"))),
lab = c("Spring", "Summer", "Autumn", "Winter"))
Raw data files
#install.packages("ncdf4")
#install.packages("terra")
library(terra)
## For the filename processing
## This one gives you an unknown number of chunks each with approx n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x)/n))
## This one gives you n chunks each with an approx equal but unknown number of elements
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
## All the daily .nc files:
all_fnames <- as.character(fs::dir_ls(here("avhrr"), recurse = TRUE, glob = "*.nc"))
chunked_fnames <- chunk(all_fnames, 25)
length(all_fnames)
[1] 15549
length(chunked_fnames)
[1] 622
The data is in netCDF (Version 4) format, an interesting, self-documenting format. Read one in with the netCDF reader.
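A sketch of reading one file's metadata with the ncdf4 package (the file index here is illustrative; printing the connection produces the dump below):

library(ncdf4)
nc <- nc_open(all_fnames[1])  # open a connection to one daily file
nc                            # print the file's self-documenting metadata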
File /Users/kjhealy/Documents/courses/mptc/avhrr/200901/oisst-avhrr-v02r01.20090120.nc (NC_FORMAT_NETCDF4):
4 variables (excluding dimension variables):
short anom[lon,lat,zlev,time] (Chunking: [1440,720,1,1]) (Compression: shuffle,level 4)
long_name: Daily sea surface temperature anomalies
_FillValue: -999
add_offset: 0
scale_factor: 0.00999999977648258
valid_min: -1200
valid_max: 1200
units: Celsius
short err[lon,lat,zlev,time] (Chunking: [1440,720,1,1]) (Compression: shuffle,level 4)
long_name: Estimated error standard deviation of analysed_sst
units: Celsius
_FillValue: -999
add_offset: 0
scale_factor: 0.00999999977648258
valid_min: 0
valid_max: 1000
short ice[lon,lat,zlev,time] (Chunking: [1440,720,1,1]) (Compression: shuffle,level 4)
long_name: Sea ice concentration
units: %
_FillValue: -999
add_offset: 0
scale_factor: 0.00999999977648258
valid_min: 0
valid_max: 100
short sst[lon,lat,zlev,time] (Chunking: [1440,720,1,1]) (Compression: shuffle,level 4)
long_name: Daily sea surface temperature
units: Celsius
_FillValue: -999
add_offset: 0
scale_factor: 0.00999999977648258
valid_min: -300
valid_max: 4500
4 dimensions:
time Size:1 *** is unlimited ***
long_name: Center time of the day
units: days since 1978-01-01 12:00:00
zlev Size:1
long_name: Sea surface height
units: meters
actual_range: 0, 0
positive: down
lat Size:720
long_name: Latitude
units: degrees_north
grids: Uniform grid from -89.875 to 89.875 by 0.25
lon Size:1440
long_name: Longitude
units: degrees_east
grids: Uniform grid from 0.125 to 359.875 by 0.25
38 global attributes:
title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolation Sea Surface Temperature (OISST) Analysis, Version 2.1 - Final
Description: Reynolds, et al.(2007) Daily High-resolution Blended Analyses. Available at ftp://eclipse.ncdc.noaa.gov/pub/OI-daily/daily-sst.pdf Climatology is based on 1971-2000 OI.v2 SST, Satellite data: Navy NOAA18 METOP AVHRR, Ice data: NCEP ice
source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfinder_AVHRR, Navy_AVHRR
id: oisst-avhrr-v02r01.20090120.nc
naming_authority: gov.noaa.ncei
summary: NOAAs 1/4-degree Daily Optimum Interpolation Sea Surface Temperature (OISST) (sometimes referred to as Reynolds SST, which however also refers to earlier products at different resolution), currently available as version v02r01, is created by interpolating and extrapolating SST observations from different sources, resulting in a smoothed complete field. The sources of data are satellite (AVHRR) and in situ platforms (i.e., ships and buoys), and the specific datasets employed may change over time. At the marginal ice zone, sea ice concentrations are used to generate proxy SSTs. A preliminary version of this file is produced in near-real time (1-day latency), and then replaced with a final version after 2 weeks. Note that this is the AVHRR-ONLY DOISST, available from Oct 1981, but there is a companion DOISST product that includes microwave satellite data, available from June 2002
cdm_data_type: Grid
history: Final file created using preliminary as first guess, and 3 days of AVHRR data. Preliminary uses only 1 day of AVHRR data.
date_modified: 2020-05-08T19:05:13Z
date_created: 2020-05-08T19:05:13Z
product_version: Version v02r01
processing_level: NOAA Level 4
institution: NOAA/National Centers for Environmental Information
creator_url: https://www.ncei.noaa.gov/
creator_email: oisst-help@noaa.gov
keywords: Earth Science > Oceans > Ocean Temperature > Sea Surface Temperature
keywords_vocabulary: Global Change Master Directory (GCMD) Earth Science Keywords
platform: Ships, buoys, Argo floats, MetOp-A, MetOp-B
platform_vocabulary: Global Change Master Directory (GCMD) Platform Keywords
instrument: Earth Remote Sensing Instruments > Passive Remote Sensing > Spectrometers/Radiometers > Imaging Spectrometers/Radiometers > AVHRR > Advanced Very High Resolution Radiometer
instrument_vocabulary: Global Change Master Directory (GCMD) Instrument Keywords
standard_name_vocabulary: CF Standard Name Table (v40, 25 January 2017)
geospatial_lat_min: -90
geospatial_lat_max: 90
geospatial_lon_min: 0
geospatial_lon_max: 360
geospatial_lat_units: degrees_north
geospatial_lat_resolution: 0.25
geospatial_lon_units: degrees_east
geospatial_lon_resolution: 0.25
time_coverage_start: 2009-01-20T00:00:00Z
time_coverage_end: 2009-01-20T23:59:59Z
metadata_link: https://doi.org/10.25921/RE9P-PT57
ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
comment: Data was converted from NetCDF-3 to NetCDF-4 format with metadata updates in November 2017.
sensor: Thermometer, AVHRR
Conventions: CF-1.6, ACDD-1.3
references: Reynolds, et al.(2007) Daily High-Resolution-Blended Analyses for Sea Surface Temperature (available at https://doi.org/10.1175/2007JCLI1824.1). Banzon, et al.(2016) A long-term record of blended satellite and in situ sea-surface temperature for climate monitoring, modeling and environmental studies (available at https://doi.org/10.5194/essd-8-165-2016). Huang et al. (2020) Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Version v02r01, submitted.Climatology is based on 1971-2000 OI.v2 SST. Satellite data: Pathfinder AVHRR SST and Navy AVHRR SST. Ice data: NCEP Ice and GSFC Ice.
For analysis we are going to use the terra package, which understands many GIS-type formats. Fundamentally we have a grid or raster of data that’s 1440 x 720 in size. The data has several layers, each one a specific measure, such as sea-surface temperature, sea ice coverage, and so on.
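A sketch of reading one daily file as a SpatRaster (the file index is illustrative; terra was loaded above):

d <- terra::rast(all_fnames[1])
d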
class : SpatRaster
dimensions : 720, 1440, 4 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : 0, 360, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
sources : oisst-avhrr-v02r01.20090120.nc:anom
oisst-avhrr-v02r01.20090120.nc:err
oisst-avhrr-v02r01.20090120.nc:ice
oisst-avhrr-v02r01.20090120.nc:sst
varnames : anom (Daily sea surface temperature anomalies)
err (Estimated error standard deviation of analysed_sst)
ice (Sea ice concentration)
...
names : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0
unit : Celsius, Celsius, %, Celsius
time (days) : 2009-01-20
Terra can understand rasters that are aggregated into slabs or “bricks” covering the same area repeatedly, and has methods for aggregating these arrays in different ways.
Read one in with terra. Get the sst (Sea Surface Temperature) layer only.
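Again a sketch, with the particular file being illustrative; the layer is subset by name, as in the process_raster() function below:

d_sst <- terra::rast(all_fnames[1])["sst"]
d_sst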
class : SpatRaster
dimensions : 720, 1440, 1 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : 0, 360, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : oisst-avhrr-v02r01.20090120.nc:sst
varname : sst (Daily sea surface temperature)
name : sst_zlev=0
unit : Celsius
time (days) : 2009-01-20
What reading a chunk of filenames (with all their layers) does:
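A sketch of the call; the particular chunk shown in the output below is an arbitrary one:

terra::rast(chunked_fnames[[1]])  # 25 files x 4 layers = 100 layers in one SpatRaster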
class : SpatRaster
dimensions : 720, 1440, 100 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : 0, 360, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
sources : oisst-avhrr-v02r01.19820414.nc:anom
oisst-avhrr-v02r01.19820414.nc:err
oisst-avhrr-v02r01.19820414.nc:ice
... and 97 more source(s)
varnames : anom (Daily sea surface temperature anomalies)
err (Estimated error standard deviation of analysed_sst)
ice (Sea ice concentration)
...
names : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, anom_zlev=0, err_zlev=0, ...
unit : Celsius, Celsius, %, Celsius, Celsius, Celsius, ...
time (days) : 1982-04-14 to 1982-05-08
Write a function to get a file, read in the SST raster, and get its area-weighted mean, for the North Atlantic region only.
process_raster <- function(fnames, crop_area = c(-80, 0, 0, 60), var = "sst") {
tdf <- terra::rast(as.character(fnames))[var] |>
terra::rotate() |> # Fix 0-360 lon
terra::crop(crop_area) # Manually crop to a defined box. Default is Atlantic lat/lon box
wts <- terra::cellSize(tdf, unit = "km") # For scaling
data.frame(
date = terra::time(tdf),
# global() calculates a quantity for the whole grid on a particular SpatRaster
# so we get one weighted mean per file that comes in
wt_mean_sst = terra::global(tdf, "mean", weights = wts, na.rm=TRUE)$weighted_mean
)
}
Try it on one data file:
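A sketch (the output is omitted here; the file index is illustrative):

process_raster(all_fnames[1])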
Do it in parallel for all files:
# library(furrr) # We did this already
# plan(multisession)
tictoc::tic("Terra Method")
df <- future_map(chunked_fnames, process_raster) |>
list_rbind() |>
as_tibble() |>
mutate(date = ymd(date),
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date),
yrday = lubridate::yday(date),
season = season(date))
tictoc::toc()
Terra Method: 27.999 sec elapsed
dim(df)
[1] 15549 7
# A tibble: 10 × 7
date wt_mean_sst year month day yrday season
<date> <dbl> <dbl> <dbl> <int> <dbl> <fct>
1 1989-01-21 19.3 1989 1 21 21 Winter
2 2014-12-10 21.1 2014 12 10 344 Winter
3 1995-04-07 19.5 1995 4 7 97 Spring
4 2014-12-01 21.5 2014 12 1 335 Autumn
5 1999-03-26 19.2 1999 3 26 85 Spring
6 2019-01-18 20.0 2019 1 18 18 Winter
7 1999-01-01 20.3 1999 1 1 1 Winter
8 1985-09-07 23.3 1985 9 7 250 Autumn
9 2003-12-26 20.7 2003 12 26 360 Winter
10 1982-08-23 23.5 1982 8 23 235 Summer
Make our cores work even harder
Simple feature collection with 101 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -180 ymin: -85.5625 xmax: 180 ymax: 90
Geodetic CRS: WGS 84
# A tibble: 101 × 11
NAME ID Longitude Latitude min_X min_Y max_X max_Y area MRGID
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Rio de La Pl… 33 -56.8 -35.1 -59.8 -36.4 -54.9 -31.5 3.18e4 4325
2 Bass Strait 62A 146. -39.5 144. -41.4 150. -37.5 1.13e5 4366
3 Great Austra… 62 133. -36.7 118. -43.6 146. -31.5 1.33e6 4276
4 Tasman Sea 63 161. -39.7 147. -50.9 175. -30 3.34e6 4365
5 Mozambique C… 45A 40.9 -19.3 32.4 -26.8 49.2 -10.5 1.39e6 4261
6 Savu Sea 48o 122. -9.48 119. -10.9 125. -8.21 1.06e5 4343
7 Timor Sea 48i 128. -11.2 123. -15.8 133. -8.18 4.34e5 4344
8 Bali Sea 48l 116. -7.93 114. -9.00 117. -7.01 3.99e4 4340
9 Coral Sea 64 157. -18.2 141. -30.0 170. -6.79 4.13e6 4364
10 Flores Sea 48j 120. -7.51 117. -8.74 123. -5.51 1.03e5 4341
# ℹ 91 more rows
# ℹ 1 more variable: geometry <MULTIPOLYGON [°]>
## Rasterize the seas polygons using one of the nc files
## as a reference grid for the rasterization process
## They're all on the same grid so it doesn't matter which one
one_raster <- all_fnames[1]
seas_vect <- terra::vect(seas)
tmp_tdf_seas <- terra::rast(one_raster)["sst"] |>
terra::rotate()
seas_zonal <- terra::rasterize(seas_vect, tmp_tdf_seas, "NAME")
We’ll need to apply this grid of seas and oceans repeatedly — once for each .nc file — which means it has to get passed to all our worker cores. But it is stored as a pointer, so we can’t do that directly. We need to wrap it:
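The wrapping step itself isn't shown on the slide, but given the unwrap() call in the function below it was presumably:

## Serialize the rasterized seas grid so it can be shipped to the worker processes
seas_zonal_wrapped <- terra::wrap(seas_zonal)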
Write a function very similar to the other one.
process_raster_zonal <- function(fnames, var = "sst") {
tdf_seas <- terra::rast(as.character(fnames))[var] |>
terra::rotate() |>
terra::zonal(terra::unwrap(seas_zonal_wrapped), mean, na.rm = TRUE)
colnames(tdf_seas) <- c("ID", paste0("d_", terra::time(terra::rast(as.character(fnames))[var])))
tdf_seas |>
pivot_longer(-ID, names_to = "date", values_to = "sst_wt_mean")
}
Try it on one record:
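A sketch (the file shown in the output below happens to be from January 2009; the index here is illustrative):

process_raster_zonal(all_fnames[1])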
# A tibble: 101 × 3
ID date sst_wt_mean
<chr> <chr> <dbl>
1 Adriatic Sea d_2009-01-20 13.5
2 Aegean Sea d_2009-01-20 15.9
3 Alboran Sea d_2009-01-20 14.8
4 Andaman or Burma Sea d_2009-01-20 27.1
5 Arabian Sea d_2009-01-20 26.6
6 Arafura Sea d_2009-01-20 29.6
7 Arctic Ocean d_2009-01-20 -1.74
8 Baffin Bay d_2009-01-20 -1.58
9 Balearic (Iberian Sea) d_2009-01-20 14.0
10 Bali Sea d_2009-01-20 28.7
# ℹ 91 more rows
We’ll tidy the date later. You can see we get 101 summary records per day.
Try it on one chunk of records:
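A sketch; the output below appears to be the first chunk (101 seas x 25 days = 2,525 rows):

process_raster_zonal(chunked_fnames[[1]])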
# A tibble: 2,525 × 3
ID date sst_wt_mean
<chr> <chr> <dbl>
1 Adriatic Sea d_1981-09-01 23.0
2 Adriatic Sea d_1981-09-02 23.1
3 Adriatic Sea d_1981-09-03 22.9
4 Adriatic Sea d_1981-09-04 22.9
5 Adriatic Sea d_1981-09-05 22.5
6 Adriatic Sea d_1981-09-06 22.4
7 Adriatic Sea d_1981-09-07 22.4
8 Adriatic Sea d_1981-09-08 22.5
9 Adriatic Sea d_1981-09-09 22.6
10 Adriatic Sea d_1981-09-10 22.5
# ℹ 2,515 more rows
Now future_map() it:
tictoc::tic("Big op")
seameans_df <- future_map(chunked_fnames, process_raster_zonal) |>
list_rbind() |>
mutate(date = str_remove(date, "d_"),
date = ymd(date),
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date),
yrday = lubridate::yday(date),
season = season(date)) |>
rename(sea = ID)
tictoc::toc()
Big op: 47.064 sec elapsed
# A tibble: 1,570,449 × 8
sea date sst_wt_mean year month day yrday season
<chr> <date> <dbl> <dbl> <dbl> <int> <dbl> <fct>
1 Adriatic Sea 1981-09-01 23.0 1981 9 1 244 Summer
2 Adriatic Sea 1981-09-02 23.1 1981 9 2 245 Autumn
3 Adriatic Sea 1981-09-03 22.9 1981 9 3 246 Autumn
4 Adriatic Sea 1981-09-04 22.9 1981 9 4 247 Autumn
5 Adriatic Sea 1981-09-05 22.5 1981 9 5 248 Autumn
6 Adriatic Sea 1981-09-06 22.4 1981 9 6 249 Autumn
7 Adriatic Sea 1981-09-07 22.4 1981 9 7 250 Autumn
8 Adriatic Sea 1981-09-08 22.5 1981 9 8 251 Autumn
9 Adriatic Sea 1981-09-09 22.6 1981 9 9 252 Autumn
10 Adriatic Sea 1981-09-10 22.5 1981 9 10 253 Autumn
# ℹ 1,570,439 more rows