grep 'Stately' files/examples/ulysses.txt
Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
Modern Plain Text Social Science: Week 3
October 7, 2024
Eventually you will accumulate a history of shell commands you have typed, many of which you may want to reuse from time to time.
Press ↑ to step back through previous commands and ^R to search your history. ^R will also work for history search at the RStudio console and in many other places.

Symbol | Key | Unicode | Symbol | Key | Unicode |
---|---|---|---|---|---|
⎋ | Escape | U+238B | ⌫ | Backspace | U+232B |
⇥ | Tab | U+21E5 | ⌦ | Delete | U+2326 |
⇪ | Caps Lock | U+21EA | ⇱ | Home | U+21F1 |
⇧ | Shift | U+21E7 | ⇲ | End | U+21F2 |
⌃ | Control | U+2303 | ⇞ | Page Up | U+21DE |
⌥ | Option/Alt | U+2325 | ⇟ | Page Down | U+21DF |
⌘ | Command | U+2318 | ⌤ | Enter | U+2324 |
⏎ | Return | U+23CE | ␣ | Space | U+2423 |
grep
find searches file and folder names only. To search inside files we use grep. Or rather we will use a flavor of grep called egrep.
Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
Search more than one file:
files/examples/jabberwocky.txt:“Beware the Jabberwock, my son!
files/examples/jabberwocky.txt: The Jabberwock, with eyes of flame,
files/examples/jabberwocky.txt:“And hast thou slain the Jabberwock?
files/examples/ulysses.txt:jabber on their girdles: roguewords, tough nuggets patter in their
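A minimal sketch of a multi-file search. The scratch files `poem.txt` and `novel.txt` are made up for illustration and stand in for the example texts; the `-i` flag (ignore case) is an assumption about how one would catch both "Jabberwock" and "jabber".

```shell
# Hypothetical scratch files standing in for the example texts
tmpdir=$(mktemp -d)
printf '%s\n' 'Beware the Jabberwock, my son!' 'The frumious Bandersnatch' > "$tmpdir/poem.txt"
printf '%s\n' 'jabber on their girdles' 'Stately, plump Buck Mulligan' > "$tmpdir/novel.txt"

# Given more than one file, grep prefixes each matching line with its filename.
# -i ignores case, so both "Jabberwock" and "jabber" match.
matches=$(cd "$tmpdir" && grep -i 'jabber' poem.txt novel.txt)
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

Matches are reported file by file, in the order the files were named on the command line.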
The grep family
grep and its derivatives are very powerful thanks to their ability to use regular expressions. We will learn about those momentarily. There are also more recent command-line search tools like ripgrep, or rg, a modern version of grep that is very fast, automatically searches subfolders, and has other nice additional features. For more details see the ripgrep project page.
rg and fzf
fzf is a command-line fuzzy-finder. It makes ^R really powerful and convenient. For details see the fzf project page.
Or, Waiter, there appears to be a language inside my language
Regexps are their own world of text processing
☜ This book is a thing of beauty.
Here’s a file:
Search basics.txt for apple:
Regexps can describe patterns such as “… @ symbol”, “Dates formatted as YYYY/MM/DD”, or “Any word that’s repeated”, and so on.
apple, banana, pear
Apple, Banana, Pear,
apple pie, apple, apple cake
apple
Apple
Apple. Banana. Pear.
Banana
Guava
Alabama
pear
peach, apple, plum
grey
gray
griy
groy
A period at the end.
Everything in the file matches the . pattern.
If “.” matches any character, how do you specifically match the character “.”? Escape it with a backslash (the same character you have seen in \n for “New Line”): the pattern is “\.”
Now the only match is the period.
… how do you match a literal \ then?
Well that’s ugly
This is the price we pay for having to express searches for patterns using a language containing these same characters, which we may also want to search for.
I promise this will pay off, though.
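Here is a small sketch of escaping at the shell, assuming `grep -E` and single-quoted patterns. The three sample lines are made up for illustration.

```shell
# A three-line sample; the middle line contains a literal backslash.
sample=$(printf '%s\n' 'A period at the end.' 'C:\temp' 'no punctuation here')

# An unescaped . matches any character, so every non-empty line matches.
any_char=$(printf '%s\n' "$sample" | grep -c -E '.')

# \. matches only a literal period.
literal_dot=$(printf '%s\n' "$sample" | grep -E '\.')

# \\ matches one literal backslash. Single quotes stop the shell from
# interpreting the backslashes before grep sees them.
literal_backslash=$(printf '%s\n' "$sample" | grep -E '\\')

printf '%s\n' "$any_char" "$literal_dot" "$literal_backslash"
```

In a language like R, where the pattern is itself written inside a quoted string, each of those backslashes has to be escaped again, which is where the ugliness comes from.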
^ to match the start of a string.
$ to match the end of a string.
To force a regular expression to only match a complete string, anchor it with both ^ and $.
apple, banana, pear
apple pie, apple, apple cake
apple
peach, apple, plum
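A quick sketch of the anchors at work on the four lines above, assuming `grep -E`. Because grep works one line at a time, "string" here means "line".

```shell
lines=$(printf '%s\n' 'apple, banana, pear' 'apple pie, apple, apple cake' 'apple' 'peach, apple, plum')

# ^apple: lines that start with "apple"
starts=$(printf '%s\n' "$lines" | grep -c -E '^apple')

# apple$: lines that end with "apple"
ends=$(printf '%s\n' "$lines" | grep -c -E 'apple$')

# ^apple$: lines that are exactly "apple" and nothing else
whole=$(printf '%s\n' "$lines" | grep -E '^apple$')

printf 'start: %s, end: %s, whole line: %s\n' "$starts" "$ends" "$whole"
```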
\d matches any digit.
\s matches any whitespace (e.g. space, tab, newline).
[abc] matches a, b, or c.
[^abc] matches anything except a, b, or c.
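A sketch of the same ideas with `grep -E`. One assumption to flag: `\d` and `\s` are Perl-style shorthands, so this example uses the POSIX spellings `[0-9]` and `[[:space:]]`, which should behave the same way for plain ASCII text. The sample lines are made up.

```shell
sample=$(printf '%s\n' 'room 101' 'cab' 'xyz' 'no-digits')

# [0-9] stands in for \d: lines containing at least one digit
has_digit=$(printf '%s\n' "$sample" | grep -E '[0-9]')

# [[:space:]] stands in for \s: lines containing whitespace
has_space=$(printf '%s\n' "$sample" | grep -E '[[:space:]]')

# [abc]: lines made up only of the letters a, b, and c
only_abc=$(printf '%s\n' "$sample" | grep -E '^[abc]+$')

# [^abc]: count of lines that contain no a, b, or c at all
no_abc=$(printf '%s\n' "$sample" | grep -c -E '^[^abc]+$')

printf '%s\n' "$has_digit" "$has_space" "$only_abc" "$no_abc"
```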
Use parentheses to make the precedence of | clear:
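For instance, here is a sketch with `grep -E` on a made-up word list. The first pattern groups the alternatives; the second leaves the parentheses out, so the | splits the whole pattern in two.

```shell
words=$(printf '%s\n' grey gray griy groy great okay)

# Grouped: "gr", then e or a, then "y"
grouped=$(printf '%s\n' "$words" | grep -E '^gr(e|a)y$' | paste -s -d ' ' -)

# Ungrouped: "starts with gre" OR "ends with ay"
ungrouped=$(printf '%s\n' "$words" | grep -E '^gre|ay$' | paste -s -d ' ' -)

printf 'grouped:   %s\nungrouped: %s\n' "$grouped" "$ungrouped"
```

The ungrouped version also picks up "great" and "okay", which is probably not what was intended.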
? is 0 or 1
+ is 1 or more
* is 0 or more
{n} is exactly n
{n,} is n or more
{,m} is at most m
{n,m} is between n and m
By default regexps make greedy matches. You can make them match the shortest string possible by adding a ? after the quantifier (e.g. +? or *?). This is often very useful!
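A sketch of the quantifiers using `grep -E`. The word list and the `matching` helper are made up for illustration. Lazy quantifiers are left out because POSIX extended regexps do not have them; they need a Perl-style engine.

```shell
words='color colour colouur ct cat caat caaat'

# Hypothetical helper: print the words matching an extended regexp on one line
matching() {
  printf '%s\n' "$words" | tr ' ' '\n' | grep -E "$1" | paste -s -d ' ' -
}

optional=$(matching '^colou?r$')      # ? : the u appears 0 or 1 times
one_or_more=$(matching '^ca+t$')      # + : one or more a
zero_or_more=$(matching '^ca*t$')     # * : zero or more a
exactly_two=$(matching '^ca{2}t$')    # {n} : exactly two a
two_or_more=$(matching '^ca{2,}t$')   # {n,} : two or more a
one_to_two=$(matching '^ca{1,2}t$')   # {n,m} : between one and two a

printf '%s\n' "$optional" "$one_or_more" "$zero_or_more" \
  "$exactly_two" "$two_or_more" "$one_to_two"
```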
apple
apricot
avocado
banana
bell pepper
bilberry
blackberry
blackcurrant
blood orange
blueberry
boysenberry
breadfruit
canary melon
cantaloupe
cherimoya
cherry
chili pepper
clementine
cloudberry
coconut
cranberry
cucumber
currant
damson
date
dragonfruit
durian
eggplant
elderberry
feijoa
fig
goji berry
gooseberry
grape
grapefruit
guava
honeydew
huckleberry
jackfruit
jambul
jujube
kiwi fruit
kumquat
lemon
lime
loquat
lychee
mandarine
mango
mulberry
nectarine
nut
olive
orange
pamelo
papaya
passionfruit
peach
pear
persimmon
physalis
pineapple
plum
pomegranate
pomelo
purple mangosteen
quince
raisin
rambutan
raspberry
redcurrant
rock melon
salal berry
satsuma
star fruit
strawberry
tamarillo
tangerine
ugli fruit
watermelon
Find all fruits that have a repeated pair of letters:
Backreferences and grouping are very useful for string replacements.
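A sketch of both ideas at the shell, on a handful of fruits from the list. It assumes a `grep -E` that accepts backreferences (GNU and BSD grep do, as an extension) and a `sed` that supports `-E`. The name-swapping example is made up for illustration.

```shell
fruits=$(printf '%s\n' apple banana coconut cucumber papaya pear)

# (..) captures any two characters; \1 requires the same two again right after
repeated=$(printf '%s\n' "$fruits" | grep -E '(..)\1' | paste -s -d ' ' -)

# In a replacement, \1 and \2 refer back to the captured groups
swapped=$(printf '%s\n' 'Mulligan, Buck' | sed -E 's/^(.+), (.+)$/\2 \1/')

printf '%s\n' "$repeated" "$swapped"
```

The first pattern finds the "anan" in banana and the "coco" in coconut; the second turns "Last, First" into "First Last".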
stringr is Perl- or PCRE-like.
grep in the shell
There’s also Perl, a programming language that’s been displaced to some extent by Python but which remains very good at compactly manipulating strings and being a kind of “glue language” for work in the shell. Perl can act as a kind of more consistent and powerful superset of shell stream-of-strings tools like grep, sed, and awk.
One useful (but be-careful-not-to-cut-yourself dangerous) thing Perl can do is easily edit a lot of files “in place”.
Have Perl’s -i option create backup files of everything it touches:
# Find every quarto file beneath the current directory
# Then edit each one in place to replace every instance of
# `percent_format` with `label_percent`
find . -name "*.qmd" | xargs perl -p -i.orig -e "s/percent_format/label_percent/g"
The -i.orig flag will back up e.g. analysis.qmd to analysis.qmd.orig before changing analysis.qmd.
To detect text, to extract it, to replace or transform it.
# A tibble: 3,320 × 13
cid constituency electorate party_name candidate votes vote_share_percent
<chr> <chr> <int> <chr> <chr> <int> <dbl>
1 W07000… Aberavon 50747 Labour Stephen … 17008 53.8
2 W07000… Aberavon 50747 Conservat… Charlott… 6518 20.6
3 W07000… Aberavon 50747 The Brexi… Glenda D… 3108 9.8
4 W07000… Aberavon 50747 Plaid Cym… Nigel Hu… 2711 8.6
5 W07000… Aberavon 50747 Liberal D… Sheila K… 1072 3.4
6 W07000… Aberavon 50747 Independe… Captain … 731 2.3
7 W07000… Aberavon 50747 Green Giorgia … 450 1.4
8 W07000… Aberconwy 44699 Conservat… Robin Mi… 14687 46.1
9 W07000… Aberconwy 44699 Labour Emily Ow… 12653 39.7
10 W07000… Aberconwy 44699 Plaid Cym… Lisa Goo… 2704 8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
# vrank <int>, turnout <dbl>, fname <chr>, lname <chr>
# A tibble: 3,320 × 13
# Groups: constituency [650]
cid constituency electorate party_name candidate votes vote_share_percent
<chr> <chr> <int> <chr> <chr> <int> <dbl>
1 W07000… Aberavon 50747 Labour Stephen … 17008 53.8
2 W07000… Aberavon 50747 Conservat… Charlott… 6518 20.6
3 W07000… Aberavon 50747 The Brexi… Glenda D… 3108 9.8
4 W07000… Aberavon 50747 Plaid Cym… Nigel Hu… 2711 8.6
5 W07000… Aberavon 50747 Liberal D… Sheila K… 1072 3.4
6 W07000… Aberavon 50747 Independe… Captain … 731 2.3
7 W07000… Aberavon 50747 Green Giorgia … 450 1.4
8 W07000… Aberconwy 44699 Conservat… Robin Mi… 14687 46.1
9 W07000… Aberconwy 44699 Labour Emily Ow… 12653 39.7
10 W07000… Aberconwy 44699 Plaid Cym… Lisa Goo… 2704 8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
# vrank <int>, turnout <dbl>, fname <chr>, lname <chr>
# A tibble: 650 × 13
# Groups: constituency [650]
cid constituency electorate party_name candidate votes vote_share_percent
<chr> <chr> <int> <chr> <chr> <int> <dbl>
1 W07000… Aberavon 50747 Labour Stephen … 17008 53.8
2 W07000… Aberconwy 44699 Conservat… Robin Mi… 14687 46.1
3 S14000… Aberdeen No… 62489 Scottish … Kirsty B… 20205 54
4 S14000… Aberdeen So… 65719 Scottish … Stephen … 20388 44.7
5 S14000… Aberdeenshi… 72640 Conservat… Andrew B… 22752 42.7
6 S14000… Airdrie & S… 64008 Scottish … Neil Gray 17929 45.1
7 E14000… Aldershot 72617 Conservat… Leo Doch… 27980 58.4
8 E14000… Aldridge-Br… 60138 Conservat… Wendy Mo… 27850 70.8
9 E14000… Altrincham … 73096 Conservat… Graham B… 26311 48
10 W07000… Alyn & Dees… 62783 Labour Mark Tami 18271 42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
# vrank <int>, turnout <dbl>, fname <chr>, lname <chr>
# A tibble: 650 × 13
cid constituency electorate party_name candidate votes vote_share_percent
<chr> <chr> <int> <chr> <chr> <int> <dbl>
1 W07000… Aberavon 50747 Labour Stephen … 17008 53.8
2 W07000… Aberconwy 44699 Conservat… Robin Mi… 14687 46.1
3 S14000… Aberdeen No… 62489 Scottish … Kirsty B… 20205 54
4 S14000… Aberdeen So… 65719 Scottish … Stephen … 20388 44.7
5 S14000… Aberdeenshi… 72640 Conservat… Andrew B… 22752 42.7
6 S14000… Airdrie & S… 64008 Scottish … Neil Gray 17929 45.1
7 E14000… Aldershot 72617 Conservat… Leo Doch… 27980 58.4
8 E14000… Aldridge-Br… 60138 Conservat… Wendy Mo… 27850 70.8
9 E14000… Altrincham … 73096 Conservat… Graham B… 26311 48
10 W07000… Alyn & Dees… 62783 Labour Mark Tami 18271 42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
# vrank <int>, turnout <dbl>, fname <chr>, lname <chr>
# A tibble: 650 × 2
constituency party_name
<chr> <chr>
1 Aberavon Labour
2 Aberconwy Conservative
3 Aberdeen North Scottish National Party
4 Aberdeen South Scottish National Party
5 Aberdeenshire West & Kincardine Conservative
6 Airdrie & Shotts Scottish National Party
7 Aldershot Conservative
8 Aldridge-Brownhills Conservative
9 Altrincham & Sale West Conservative
10 Alyn & Deeside Labour
# ℹ 640 more rows
library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )"))
# A tibble: 650 × 14
constituency party_name shire field dale pool ton wood saint port ford
<chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 Aberavon Labour FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 Aberconwy Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 Aberdeen No… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 Aberdeen So… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 Aberdeenshi… Conservat… TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 Airdrie & S… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
7 Aldershot Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
8 Aldridge-Br… Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
9 Altrincham … Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 Alyn & Dees… Labour FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# ℹ 640 more rows
# ℹ 3 more variables: by <lgl>, boro <lgl>, ley <lgl>
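To see what a pattern like "(ford$)|(ford )" does and does not pick up, here is a sketch that tries it with `grep -E` on a few names. The names are chosen for illustration, not pulled from the dataset.

```shell
names=$(printf '%s\n' 'Ashford' 'Oxford East' 'Stratford-on-Avon' 'Aberavon')

# Match "ford" at the very end of the name, or "ford" followed by a space
ford=$(printf '%s\n' "$names" | grep -E '(ford$)|(ford )')
printf '%s\n' "$ford"
```

"Stratford-on-Avon" is missed because its "ford" is followed by a hyphen, which is the kind of edge case worth checking before trusting the counts.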
library(tidyverse)
library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |>
  pivot_longer(shire:ley, names_to = "toponym")
# A tibble: 7,800 × 4
constituency party_name toponym value
<chr> <chr> <chr> <lgl>
1 Aberavon Labour shire FALSE
2 Aberavon Labour field FALSE
3 Aberavon Labour dale FALSE
4 Aberavon Labour pool FALSE
5 Aberavon Labour ton FALSE
6 Aberavon Labour wood FALSE
7 Aberavon Labour saint FALSE
8 Aberavon Labour port FALSE
9 Aberavon Labour ford FALSE
10 Aberavon Labour by FALSE
# ℹ 7,790 more rows
place_tab <- ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  # We could write these more efficiently but we don't care about that rn
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |>
  pivot_longer(shire:ley, names_to = "toponym")
# A tibble: 7,800 × 4
constituency party_name toponym value
<chr> <chr> <chr> <lgl>
1 Aberavon Labour shire FALSE
2 Aberavon Labour field FALSE
3 Aberavon Labour dale FALSE
4 Aberavon Labour pool FALSE
5 Aberavon Labour ton FALSE
6 Aberavon Labour wood FALSE
7 Aberavon Labour saint FALSE
8 Aberavon Labour port FALSE
9 Aberavon Labour ford FALSE
10 Aberavon Labour by FALSE
# ℹ 7,790 more rows
# A tibble: 7,800 × 4
# Groups: party_name, toponym [120]
constituency party_name toponym value
<chr> <chr> <chr> <lgl>
1 Aberavon Labour shire FALSE
2 Aberavon Labour field FALSE
3 Aberavon Labour dale FALSE
4 Aberavon Labour pool FALSE
5 Aberavon Labour ton FALSE
6 Aberavon Labour wood FALSE
7 Aberavon Labour saint FALSE
8 Aberavon Labour port FALSE
9 Aberavon Labour ford FALSE
10 Aberavon Labour by FALSE
# ℹ 7,790 more rows
# A tibble: 6,816 × 4
# Groups: party_name, toponym [24]
constituency party_name toponym value
<chr> <chr> <chr> <lgl>
1 Aberavon Labour shire FALSE
2 Aberavon Labour field FALSE
3 Aberavon Labour dale FALSE
4 Aberavon Labour pool FALSE
5 Aberavon Labour ton FALSE
6 Aberavon Labour wood FALSE
7 Aberavon Labour saint FALSE
8 Aberavon Labour port FALSE
9 Aberavon Labour ford FALSE
10 Aberavon Labour by FALSE
# ℹ 6,806 more rows
# A tibble: 6,816 × 4
# Groups: toponym, party_name [24]
constituency party_name toponym value
<chr> <chr> <chr> <lgl>
1 Aberavon Labour shire FALSE
2 Aberavon Labour field FALSE
3 Aberavon Labour dale FALSE
4 Aberavon Labour pool FALSE
5 Aberavon Labour ton FALSE
6 Aberavon Labour wood FALSE
7 Aberavon Labour saint FALSE
8 Aberavon Labour port FALSE
9 Aberavon Labour ford FALSE
10 Aberavon Labour by FALSE
# ℹ 6,806 more rows
# A tibble: 24 × 3
# Groups: toponym [12]
toponym party_name freq
<chr> <chr> <int>
1 boro Conservative 7
2 boro Labour 1
3 by Conservative 6
4 by Labour 2
5 dale Conservative 3
6 dale Labour 1
7 field Conservative 10
8 field Labour 10
9 ford Conservative 17
10 ford Labour 12
# ℹ 14 more rows
# A tibble: 24 × 4
# Groups: toponym [12]
toponym party_name freq pct
<chr> <chr> <int> <dbl>
1 boro Conservative 7 0.875
2 boro Labour 1 0.125
3 by Conservative 6 0.75
4 by Labour 2 0.25
5 dale Conservative 3 0.75
6 dale Labour 1 0.25
7 field Conservative 10 0.5
8 field Labour 10 0.5
9 ford Conservative 17 0.586
10 ford Labour 12 0.414
# ℹ 14 more rows
# A tibble: 12 × 4
# Groups: toponym [12]
toponym party_name freq pct
<chr> <chr> <int> <dbl>
1 boro Conservative 7 0.875
2 by Conservative 6 0.75
3 dale Conservative 3 0.75
4 field Conservative 10 0.5
5 ford Conservative 17 0.586
6 ley Conservative 26 0.722
7 pool Conservative 2 0.286
8 port Conservative 3 0.333
9 saint Conservative 3 0.5
10 shire Conservative 37 0.974
11 ton Conservative 37 0.507
12 wood Conservative 7 0.636
# A tibble: 12 × 4
# Groups: toponym [12]
toponym party_name freq pct
<chr> <chr> <int> <dbl>
1 shire Conservative 37 0.974
2 boro Conservative 7 0.875
3 by Conservative 6 0.75
4 dale Conservative 3 0.75
5 ley Conservative 26 0.722
6 wood Conservative 7 0.636
7 ford Conservative 17 0.586
8 ton Conservative 37 0.507
9 field Conservative 10 0.5
10 saint Conservative 3 0.5
11 port Conservative 3 0.333
12 pool Conservative 2 0.286
I’ll mostly confine my examples to RStudio’s text editor