Search and Edit Text

Modern Plain Text Social Science: Week 3

Kieran Healy

October 7, 2024

Getting around in the Shell

Your command history

Eventually you will accumulate a history of shell commands you have typed, many of which you may want to reuse from time to time.

Go to the previous command with the up arrow, ↑
Search your command history with control-R, ^R
^R will also work for history search at the RStudio console and in many other places.

Aside: Standard Modifier Key Symbols

Symbol	Key	Unicode	Symbol	Key	Unicode
⎋	Escape	`U+238B`	⌫	Backspace	`U+232B`
⇥	Tab	`U+21E5`	⌦	Delete	`U+2326`
⇪	Caps Lock	`U+21EA`	⇱	Home	`U+21F1`
⇧	Shift	`U+21E7`	⇲	End	`U+21F2`
⌃	Control	`U+2303`	⇞	Page Up	`U+21DE`
⌥	Option/Alt	`U+2325`	⇟	Page Down	`U+21DF`
⌘	Command	`U+2318`	⌤	Enter	`U+2324`
⏎	Return	`U+23CE`	␣	Space	`U+2423`

Searching inside files

`grep`

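find searches file and folder names only. To search inside files we use grep. Or rather, we will mostly use a flavor of grep called egrep, which is equivalent to grep -E and understands extended regular expressions. (Recent versions of GNU grep warn that egrep is obsolescent; grep -E does the same job.)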

grep 'Stately' files/examples/ulysses.txt

Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of

Search more than one file:

## -i for case-insensitive search
egrep -i 'jabber' files/examples/*.txt

files/examples/jabberwocky.txt:“Beware the Jabberwock, my son! 
files/examples/jabberwocky.txt:      The Jabberwock, with eyes of flame, 
files/examples/jabberwocky.txt:“And hast thou slain the Jabberwock? 
files/examples/ulysses.txt:jabber on their girdles: roguewords, tough nuggets patter in their
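A few other grep options are handy once you are searching several files. The sketch below is self-contained: it builds two throwaway files in a temporary folder (their contents are made up for illustration) rather than assuming files/examples exists on your machine.

```shell
# Build two small throwaway files so the example is self-contained.
tmpdir=$(mktemp -d)
printf 'Beware the Jabberwock, my son!\nThe jaws that bite\n' > "$tmpdir/a.txt"
printf 'jabber on their girdles\nno match here\n' > "$tmpdir/b.txt"

# -n prefixes each matching line with its line number.
numbered=$(grep -E -i -n 'jabber' "$tmpdir/a.txt")
echo "$numbered"

# -c prints a count of matching lines instead of the lines themselves.
count_b=$(grep -E -i -c 'jabber' "$tmpdir/b.txt")
echo "$count_b"

# -l lists only the names of the files that contain a match;
# here we just count how many files that is.
files=$(grep -E -i -l 'jabber' "$tmpdir"/*.txt | wc -l | tr -d ' ')
echo "$files"

rm -rf "$tmpdir"
```

When only one file is searched, grep leaves off the filename prefix, which is why the -n output starts with the line number.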

The `grep` family

grep and its derivatives are very powerful thanks to their ability to use regular expressions. We will learn about those momentarily. There are also more recent command-line search tools like ripgrep, or rg, a modern version of grep that is very fast, automatically searches subfolders, and has other nice additional features. For more details see the ripgrep project page.

# On a Mac with Homebrew (https://brew.sh)
# The Homebrew formula is called ripgrep; the command it installs is rg
brew install ripgrep

Integrate `rg` and `fzf`

fzf is a command-line fuzzy-finder. It makes ^R really powerful and convenient. For details see the fzf project page.

# On a Mac
brew install fzf

Regular Expressions

Or,

Waiter, there appears to be a language inside my language

Regexps are their own world of text processing

☜ This book is a thing of beauty.

Searching for patterns

A regular expression is a way of searching for a piece of text, or pattern, inside some larger body of text, called a string.
The simplest sort of search is like the “Find” functionality in a Word Processor. The pattern is a literal letter, number, punctuation mark, word or series of words; the strings are all the sentences in a document.
When searching a plain-text file, our strings are the lines of some plain text file. We search the file for some pattern and we want to see every line in the file where there is a match.

Searching for patterns

Here’s a file:

# cat sends the contents of a file to the console
cat files/examples/basics.txt

apple, banana, pear
Apple, Banana, Pear,
apple pie, apple, apple cake
apple
Apple
Apple. Banana. Pear.
Banana
Guava
Alabama
pear
peach, apple, plum
grey
gray
griy
groy
A period at the end.

Searching for patterns

Search basics.txt for apple:

egrep 'apple' files/examples/basics.txt

apple, banana, pear
apple pie, apple, apple cake
apple
peach, apple, plum

Searching for patterns

Regular expressions get their real power from their ability to search for patterns that match more than just literal strings.
We want to match things like “Any word that follows an @ symbol”, or “Dates formatted as YYYY/MM/DD” or “Any word that’s repeated”, and so on.
To do this we need a way of expressing search terms like “any word” or “a four digit number” and so on. Regexps do this by creating a little mini-language where some tokens stand for classes of things we might search for.
The most general matching pattern is “match anything”. This is represented by the period, or ., which matches any single character.

Searching for patterns

egrep '.' files/examples/basics.txt

apple, banana, pear
Apple, Banana, Pear,
apple pie, apple, apple cake
apple
Apple
Apple. Banana. Pear.
Banana
Guava
Alabama
pear
peach, apple, plum
grey
gray
griy
groy
A period at the end.

Every line in the file matches the . pattern, because each line contains at least one character.
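To see that the period stands for exactly one character, it helps to put it inside a longer pattern. This is a small sketch using a handful of made-up lines (including an empty one) piped into grep, so nothing on disk is needed.

```shell
# A few hand-made lines, including an empty one between gry and great.
lines=$(printf 'grey\ngray\ngroy\ngry\n\ngreat\n')

# 'gr.y' needs exactly one character (any character) between r and y.
one_char=$(printf '%s\n' "$lines" | grep -E 'gr.y')
echo "$one_char"

# '.' alone matches any line with at least one character, so the
# empty line is the only one left out of the count.
nonempty=$(printf '%s\n' "$lines" | grep -E -c '.')
echo "$nonempty"
```

The first search prints grey, gray, and groy; gry is too short and great has no y. The second prints 5, one for each non-empty line.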

Searching for patterns

But … if “.” matches any character, how do you specifically match the character “.”?
You have to “escape” the period to tell the regex you want to match it exactly, or literally, rather than interpret it as meaning “match anything”.
As in the shell, regular expressions use the backslash character, \, to signal “escape the next character”, or “treat the next character in a predefined special way”. (E.g. \n for “New Line”).
To match a “.”, you need the regex “\.”

Searching for patterns

egrep '\.' files/examples/basics.txt

Apple. Banana. Pear.
A period at the end.

Now only the lines containing a literal period match.
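If you only ever want literal text, an alternative to escaping is grep -F, which treats the whole pattern as a fixed string. The sketch below compares the three approaches on some made-up lines.

```shell
lines=$(printf 'Apple. Banana. Pear.\napple pie\nversion 1.5\nversion 105\n')

# As a regex, '1.5' means "1", then any one character, then "5".
as_regex=$(printf '%s\n' "$lines" | grep -E -c '1.5')

# Escaped, the period must be a literal period.
escaped=$(printf '%s\n' "$lines" | grep -E -c '1\.5')

# -F switches regex interpretation off, so no escaping is needed.
fixed=$(printf '%s\n' "$lines" | grep -F -c '1.5')

echo "$as_regex $escaped $fixed"
```

This prints 2 1 1: the unescaped pattern also matches "version 105", while the escaped and fixed-string searches match only "version 1.5".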

Hang on, I see a further problem

… how do you match a literal \ then?

cat files/examples/specials.txt


Two backslashes: \ \ 
A period or full stop: . 
A dollar sign: $ hello
A caret: ^
[
The @ symbol
]

A backslash \ and a forward slash /

egrep '\\' files/examples/specials.txt

Two backslashes: \ \ 
A backslash \ and a forward slash /

Well that’s ugly
This is the price we pay for having to express searches for patterns using a language containing these same characters, which we may also want to search for.
I promise this will pay off, though.
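Part of the ugliness comes from the shell, which has its own opinions about backslashes before grep ever sees the pattern. The sketch below, using one line borrowed from specials.txt, shows why single quotes are the safer habit for regex patterns.

```shell
line='A backslash \ and a forward slash /'

# Single quotes: the shell passes \\ to grep untouched, and the regex
# engine reads it as one literal backslash.
single=$(printf '%s\n' "$line" | grep -E -c '\\')

# Double quotes: the shell itself turns each \\ into \, so four are
# needed for grep to receive two.
double=$(printf '%s\n' "$line" | grep -E -c "\\\\")

# -F switches regex interpretation off, so one backslash is enough.
fixed=$(printf '%s\n' "$line" | grep -F -c '\')

echo "$single $double $fixed"
```

All three searches find the line, so this prints 1 1 1.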

Line delimiters

Use ^ to match the start of a string.
Use $ to match the end of a string.

egrep '^a' files/examples/basics.txt

apple, banana, pear
apple pie, apple, apple cake
apple

egrep 'a$' files/examples/basics.txt

Banana
Guava
Alabama

Matching start and end

To force a regular expression to only match a complete string, anchor it with both ^ and $

egrep 'apple' files/examples/basics.txt

apple, banana, pear
apple pie, apple, apple cake
apple
peach, apple, plum

egrep '^apple$' files/examples/basics.txt

apple
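Anchors are easiest to understand by counting what each one lets through. A minimal sketch on a few made-up lines, using -c to count matching lines:

```shell
# Five lines; the fourth is empty.
lines=$(printf 'apple\napple pie\npineapple\n\nApple\n')

starts=$(printf '%s\n' "$lines" | grep -E -c '^apple')   # apple, apple pie
ends=$(printf '%s\n' "$lines" | grep -E -c 'apple$')     # apple, pineapple
whole=$(printf '%s\n' "$lines" | grep -E -c '^apple$')   # apple only
blank=$(printf '%s\n' "$lines" | grep -E -c '^$')        # the empty line

echo "$starts $ends $whole $blank"
```

This prints 2 2 1 1. The last pattern, ^$, is a common idiom for finding empty lines: a start immediately followed by an end.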

Matching character classes

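\d matches any digit.
\s matches any whitespace (e.g. space, tab, newline).
[abc] matches a, b, or c.
[^abc] matches anything except a, b, or c.
\d and \s are Perl-style shorthands. They work in R's stringr, in rg, and in grep -P where available, but support in egrep varies by implementation (GNU grep -E does not understand \d). The portable forms are [0-9] or [[:digit:]] for digits and [[:space:]] for whitespace.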
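Here is a short sketch of bracket expressions and POSIX named classes with grep -E, which should behave the same way across grep implementations. The input lines are made up; the last one contains a tab.

```shell
# Four lines; the last one has a tab between "tab" and "here".
lines=$(printf 'room 101\nroom B\nplain text\ntab\there\n')

# Lines containing a digit, written two equivalent ways.
digits=$(printf '%s\n' "$lines" | grep -E -c '[0-9]')
posix_digits=$(printf '%s\n' "$lines" | grep -E -c '[[:digit:]]')

# Lines containing any whitespace (space or tab).
spaces=$(printf '%s\n' "$lines" | grep -E -c '[[:space:]]')

# Lines containing something that is NOT a lowercase letter or a space.
other=$(printf '%s\n' "$lines" | grep -E -c '[^a-z ]')

echo "$digits $posix_digits $spaces $other"
```

This prints 1 1 4 3. Only "plain text" consists entirely of lowercase letters and spaces, so it is the one line the negated class leaves out.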

Alternation

| means “or”: the pattern matches if what is on either side of it matches. Use parentheses to make the precedence of | clear:

# e or a variant 
egrep 'gr(e|a)y' files/examples/basics.txt

grey
gray
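Why the parentheses matter: without them, | splits the entire pattern in two. A small sketch with made-up lines shows the difference, along with the bracket form that does the same job for single characters.

```shell
lines=$(printf 'grey\ngray\ngreen\nday\n')

# Grouped: "gr", then e or a, then "y".
grouped=$(printf '%s\n' "$lines" | grep -E 'gr(e|a)y')

# Ungrouped: either "gre" anywhere, or "ay" anywhere.
ungrouped=$(printf '%s\n' "$lines" | grep -E -c 'gre|ay')

# For single characters a bracket expression is equivalent to the group.
bracket=$(printf '%s\n' "$lines" | grep -E 'gr[ea]y')

echo "$grouped"
echo "$ungrouped"
```

The grouped and bracket patterns match only grey and gray. The ungrouped pattern matches all four lines, because green contains "gre" and day contains "ay".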

Repeated patterns

? is 0 or 1
+ is 1 or more
* is 0 or more

cat files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'CC+' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C[LX]+' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII
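Because grep prints the whole matching line, the searches above all produce the same output. To see which part of the line each pattern actually matched, one option is the -o flag, which prints only the matched text. It is not part of POSIX but is available in both GNU and BSD grep. A sketch, with the numeral typed in directly:

```shell
numeral='MDCCCLXXXVIII'

# ? is 0 or 1: D followed by an optional C.
d_opt=$(printf '%s\n' "$numeral" | grep -E -o 'DC?')

# + is 1 or more: C followed by at least one more C.
cc_plus=$(printf '%s\n' "$numeral" | grep -E -o 'CC+')

# * is 0 or more: V followed by any number of I.
vi_star=$(printf '%s\n' "$numeral" | grep -E -o 'VI*')

# A C followed by one or more characters that are L or X.
c_lx=$(printf '%s\n' "$numeral" | grep -E -o 'C[LX]+')

echo "$d_opt $cc_plus $vi_star $c_lx"
```

This prints DC CCC VIII CLXXX. Each quantifier grabs as much as it can, which is what “greedy” means.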

Exact numbers of repetitions

{n} is exactly n
{n,} is n or more
{,m} is at most m
{n,m} is between n and m

egrep 'C{2}' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C{2,}' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C{2,3}' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

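By default regexps make greedy matches. In Perl-style engines (stringr, rg, grep -P) you can make them match the shortest string possible by adding a ?. This is often very useful! POSIX extended regexps, as used by egrep, have no lazy quantifiers, so there the extra ? does not shorten the match. And because egrep prints whole matching lines, the output is the same either way.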

egrep 'C{2,3}?' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C[LX]+?' files/examples/roman.txt

1888 is the longest year in Roman numerals: MDCCCLXXXVIII

And finally … backreferences

cat files/examples/fruit.txt

apple
apricot
avocado
banana
bell pepper
bilberry
blackberry
blackcurrant
blood orange
blueberry
boysenberry
breadfruit
canary melon
cantaloupe
cherimoya
cherry
chili pepper
clementine
cloudberry
coconut
cranberry
cucumber
currant
damson
date
dragonfruit
durian
eggplant
elderberry
feijoa
fig
goji berry
gooseberry
grape
grapefruit
guava
honeydew
huckleberry
jackfruit
jambul
jujube
kiwi fruit
kumquat
lemon
lime
loquat
lychee
mandarine
mango
mulberry
nectarine
nut
olive
orange
pamelo
papaya
passionfruit
peach
pear
persimmon
physalis
pineapple
plum
pomegranate
pomelo
purple mangosteen
quince
raisin
rambutan
raspberry
redcurrant
rock melon
salal berry
satsuma
star fruit
strawberry
tamarillo
tangerine
ugli fruit
watermelon

Grouping and backreferences

Find all fruits that have a repeated pair of letters:

# Using grep -E here because `rg` doesn't support backreferences by default (they need its PCRE2 mode, rg -P)
grep -E '(..)\1' files/examples/fruit.txt

banana
coconut
cucumber
jujube
papaya
salal berry

Grouping and backreferences

Backreferences and grouping are very useful for string replacements.
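As a sketch of what that looks like in practice, here are two replacements using sed -E, a standard stream editor that these slides do not otherwise cover. In the replacement half of s/pattern/replacement/, \1 and \2 refer to the first and second parenthesized groups, and & refers to the whole match. The input strings are just examples.

```shell
# Swap "Last, First" into "First Last" using two captured groups.
swapped=$(printf 'Healy, Kieran\n' | sed -E 's/^([^,]+), (.+)$/\2 \1/')
echo "$swapped"

# Put brackets around the first repeated pair of characters.
marked=$(printf 'banana\n' | sed -E 's/(..)\1/[&]/')
echo "$marked"
```

The first command prints Kieran Healy. The second prints b[anan]a, showing that the repeated pair found in banana is "an".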

OK that was a lot

Learning and testing regexps

Practice with a tester like https://regexr.com or https://regex101.com

Or an app like Patterns

The regex engine or “flavor” used by `stringr` is Perl- or PCRE-like.

Beyond `grep` in the shell

There’s also Perl, a programming language that’s been displaced to some extent by Python but which remains very good at compactly manipulating strings and being a kind of “glue language” for work in the shell. Perl can act as a kind of more consistent and powerful superset of shell stream-of-strings tools like grep, sed, and awk.

One useful (but be-careful-not-to-cut-yourself dangerous) thing Perl can do is easily edit a lot of files “in place”.

# Find every Quarto file beneath the current directory
# Then edit each one in place to replace every instance of 
# `percent_format` with `label_percent`
# (-print0 and -0 keep filenames that contain spaces intact)
find . -name "*.qmd" -print0 | xargs -0 perl -p -i -e "s/percent_format/label_percent/g"

Beyond `grep` in the shell

You can protect a bit against the dangers of doing this by making the -i option create backup files of everything it touches:

# Find every Quarto file beneath the current directory
# Then edit each one in place to replace every instance of 
# `percent_format` with `label_percent`, keeping a .orig backup of each file
# (-print0 and -0 keep filenames that contain spaces intact)
find . -name "*.qmd" -print0 | xargs -0 perl -p -i.orig -e "s/percent_format/label_percent/g"

Here the -i.orig flag will back up e.g. analysis.qmd to analysis.qmd.orig before changing analysis.qmd.
The other protection, of course, is to have your working files under version control, which we will get to later in the semester.
For more on Perl one-liners see, for example, the Perl one-liners cookbook.

Regular Expressions in R

Why they appear

To detect text, to extract it, to replace or transform it.

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019

# A tibble: 3,320 × 13
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberavon          50747 Conservat… Charlott…  6518               20.6
 3 W07000… Aberavon          50747 The Brexi… Glenda D…  3108                9.8
 4 W07000… Aberavon          50747 Plaid Cym… Nigel Hu…  2711                8.6
 5 W07000… Aberavon          50747 Liberal D… Sheila K…  1072                3.4
 6 W07000… Aberavon          50747 Independe… Captain …   731                2.3
 7 W07000… Aberavon          50747 Green      Giorgia …   450                1.4
 8 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 9 W07000… Aberconwy         44699 Labour     Emily Ow… 12653               39.7
10 W07000… Aberconwy         44699 Plaid Cym… Lisa Goo…  2704                8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency)

# A tibble: 3,320 × 13
# Groups:   constituency [650]
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberavon          50747 Conservat… Charlott…  6518               20.6
 3 W07000… Aberavon          50747 The Brexi… Glenda D…  3108                9.8
 4 W07000… Aberavon          50747 Plaid Cym… Nigel Hu…  2711                8.6
 5 W07000… Aberavon          50747 Liberal D… Sheila K…  1072                3.4
 6 W07000… Aberavon          50747 Independe… Captain …   731                2.3
 7 W07000… Aberavon          50747 Green      Giorgia …   450                1.4
 8 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 9 W07000… Aberconwy         44699 Labour     Emily Ow… 12653               39.7
10 W07000… Aberconwy         44699 Plaid Cym… Lisa Goo…  2704                8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes)

# A tibble: 650 × 13
# Groups:   constituency [650]
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 3 S14000… Aberdeen No…      62489 Scottish … Kirsty B… 20205               54  
 4 S14000… Aberdeen So…      65719 Scottish … Stephen … 20388               44.7
 5 S14000… Aberdeenshi…      72640 Conservat… Andrew B… 22752               42.7
 6 S14000… Airdrie & S…      64008 Scottish … Neil Gray 17929               45.1
 7 E14000… Aldershot         72617 Conservat… Leo Doch… 27980               58.4
 8 E14000… Aldridge-Br…      60138 Conservat… Wendy Mo… 27850               70.8
 9 E14000… Altrincham …      73096 Conservat… Graham B… 26311               48  
10 W07000… Alyn & Dees…      62783 Labour     Mark Tami 18271               42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup()

# A tibble: 650 × 13
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 3 S14000… Aberdeen No…      62489 Scottish … Kirsty B… 20205               54  
 4 S14000… Aberdeen So…      65719 Scottish … Stephen … 20388               44.7
 5 S14000… Aberdeenshi…      72640 Conservat… Andrew B… 22752               42.7
 6 S14000… Airdrie & S…      64008 Scottish … Neil Gray 17929               45.1
 7 E14000… Aldershot         72617 Conservat… Leo Doch… 27980               58.4
 8 E14000… Aldridge-Br…      60138 Conservat… Wendy Mo… 27850               70.8
 9 E14000… Altrincham …      73096 Conservat… Graham B… 26311               48  
10 W07000… Alyn & Dees…      62783 Labour     Mark Tami 18271               42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name)

# A tibble: 650 × 2
   constituency                    party_name             
   <chr>                           <chr>                  
 1 Aberavon                        Labour                 
 2 Aberconwy                       Conservative           
 3 Aberdeen North                  Scottish National Party
 4 Aberdeen South                  Scottish National Party
 5 Aberdeenshire West & Kincardine Conservative           
 6 Airdrie & Shotts                Scottish National Party
 7 Aldershot                       Conservative           
 8 Aldridge-Brownhills             Conservative           
 9 Altrincham & Sale West          Conservative           
10 Alyn & Deeside                  Labour                 
# ℹ 640 more rows

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )"))

# A tibble: 650 × 14
   constituency party_name shire field dale  pool  ton   wood  saint port  ford 
   <chr>        <chr>      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Aberavon     Labour     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Aberconwy    Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 3 Aberdeen No… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Aberdeen So… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Aberdeenshi… Conservat… TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Airdrie & S… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 7 Aldershot    Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 8 Aldridge-Br… Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 9 Altrincham … Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 Alyn & Dees… Labour     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# ℹ 640 more rows
# ℹ 3 more variables: by <lgl>, boro <lgl>, ley <lgl>
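The patterns passed to str_detect() are ordinary regular expressions, so it can be worth trying them out at the shell before putting them in a pipeline. The sketch below checks the "ton" pattern with grep -E against a few constituency-style names typed in by hand for illustration. They are not read from the ukvote2019 data.

```shell
# Hand-typed names for illustration only.
names=$(printf 'Bolton West\nTaunton Deane\nAshton-under-Lyne\nTonbridge\nLuton North\nDarlington\n')

# The pattern from the mutate() call: "ton" at the end of the name,
# or "ton" followed by a space.
strict=$(printf '%s\n' "$names" | grep -E -c '(ton$)|(ton )')

# A looser pattern also picks up "ton" in the middle of a hyphenated name.
loose=$(printf '%s\n' "$names" | grep -E -c 'ton')

# Which name does the stricter pattern leave out relative to the looser one?
skipped=$(printf '%s\n' "$names" | grep -E 'ton' | grep -E -v '(ton$)|(ton )')

echo "$strict $loose"
echo "$skipped"
```

This prints 4 5 and then Ashton-under-Lyne. Whether a hyphenated name like that should count as a "ton" is a substantive decision that the regex forces you to make explicitly. Tonbridge is matched by neither pattern because the search is case-sensitive.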

Example: Politics and Placenames

library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |>
  pivot_longer(shire:ley, names_to = "toponym")

# A tibble: 7,800 × 4
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab <- ukvote2019 |> 
  group_by(constituency) |> 
  slice_max(votes) |> 
  ungroup() |> 
  select(constituency, party_name) |> 
  # We could write these more efficiently but we don't care about that rn
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |> 
  pivot_longer(shire:ley, names_to = "toponym")

Example: Politics and Placenames

place_tab

# A tibble: 7,800 × 4
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym)

# A tibble: 7,800 × 4
# Groups:   party_name, toponym [120]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour"))

# A tibble: 6,816 × 4
# Groups:   party_name, toponym [24]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 6,806 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name)

# A tibble: 6,816 × 4
# Groups:   toponym, party_name [24]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 6,806 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value))

# A tibble: 24 × 3
# Groups:   toponym [12]
   toponym party_name    freq
   <chr>   <chr>        <int>
 1 boro    Conservative     7
 2 boro    Labour           1
 3 by      Conservative     6
 4 by      Labour           2
 5 dale    Conservative     3
 6 dale    Labour           1
 7 field   Conservative    10
 8 field   Labour          10
 9 ford    Conservative    17
10 ford    Labour          12
# ℹ 14 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq))

# A tibble: 24 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 boro    Conservative     7 0.875
 2 boro    Labour           1 0.125
 3 by      Conservative     6 0.75 
 4 by      Labour           2 0.25 
 5 dale    Conservative     3 0.75 
 6 dale    Labour           1 0.25 
 7 field   Conservative    10 0.5  
 8 field   Labour          10 0.5  
 9 ford    Conservative    17 0.586
10 ford    Labour          12 0.414
# ℹ 14 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq)) |>
  filter(party_name == "Conservative")

# A tibble: 12 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 boro    Conservative     7 0.875
 2 by      Conservative     6 0.75 
 3 dale    Conservative     3 0.75 
 4 field   Conservative    10 0.5  
 5 ford    Conservative    17 0.586
 6 ley     Conservative    26 0.722
 7 pool    Conservative     2 0.286
 8 port    Conservative     3 0.333
 9 saint   Conservative     3 0.5  
10 shire   Conservative    37 0.974
11 ton     Conservative    37 0.507
12 wood    Conservative     7 0.636

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq)) |>
  filter(party_name == "Conservative") |>
  arrange(desc(pct))

# A tibble: 12 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 shire   Conservative    37 0.974
 2 boro    Conservative     7 0.875
 3 by      Conservative     6 0.75 
 4 dale    Conservative     3 0.75 
 5 ley     Conservative    26 0.722
 6 wood    Conservative     7 0.636
 7 ford    Conservative    17 0.586
 8 ton     Conservative    37 0.507
 9 field   Conservative    10 0.5  
10 saint   Conservative     3 0.5  
11 port    Conservative     3 0.333
12 pool    Conservative     2 0.286

Text Editors

Choices, choices

There are many good text editors.
The main point is: pick one, and learn the hell out of it.
The RStudio IDE has many of the features of a good editor built in, as well as doing other things.
Several of the other editors also have good support for R and many other languages.

RStudio’s Text Editor

I’ll mostly confine my examples to RStudio’s text editor

Danger, Will Robinson

One view of things

Danger, Will Robinson

Endlessly futzing with your text editor’s setup is a displacement activity.
The tools are not magic. They cannot by themselves make you do good work. Or any work.

Things any good text editor will do

Specialized text display

Syntax highlighting
Brace and parenthesis matching
Outlining / Folding

Edit text!

Navigation and action via keyboard shortcuts
Keyboard-based selection and movement of text, lines, and logical sections
Search and replace using regular expressions

Things most good text editors also do

Cursor and Insertion-Point tricks

Multiple Cursors
Rectangular / Columnar editing
A snippet system of some sort

VS Code: cmd-shift-L for example

Emacs:

C-x r M-w Save the text of the region-rectangle as the last killed rectangle (copy-rectangle-as-kill).

C-x r y Yank [paste] the last killed rectangle with its upper left corner at point (yank-rectangle).

C-x r o Insert blank space to fill the space of the region-rectangle (open-rectangle). This pushes the previous contents of the region-rectangle to the right.

C-x r N Insert line numbers along the left edge of the region-rectangle (rectangle-number-lines). This pushes the previous contents of the region-rectangle to the right.

C-x r c Clear the region-rectangle by replacing all of its contents with spaces (clear-rectangle).

The command C-x SPC (rectangle-mark-mode) toggles whether the region-rectangle or the standard region is highlighted (first activating the region if necessary). When this mode is enabled, commands that resize the region (C-f, C-n etc.) do so in a rectangular fashion, and killing and yanking operate on the rectangle. See Killing and Moving Text. The mode persists only as long as the region is active.

Things most good text editors also do

IDE-like functionality

Diffing files
Integration with documentation
Static Analysis and Linting
Integration with a REPL
A Command Palette (Windows: Ctrl+Shift+P; Mac: Command+Shift+P)
Integration with version control systems
Remote editing

Other things

Specifically for academia

Citation and reference management
Integration with Zotero
Connection to Pandoc
