Search and Edit Text

Modern Plain Text Social Science: Week 3

Kieran Healy

September 6, 2024

Getting around in the Shell

Your command history

Eventually you will accumulate a history of shell commands you have typed, many of which you may want to reuse from time to time.

  • Go to the previous command with the up arrow.
  • Search your command history with Control-R, written ^R.
  • ^R also works for history search at the RStudio console and in many other places.

Aside: Standard Modifier Key Symbols

Symbol  Key         Unicode    Symbol  Key        Unicode
⎋       Escape      U+238B     ⌫       Backspace  U+232B
⇥       Tab         U+21E5     ⌦       Delete     U+2326
⇪       Caps Lock   U+21EA     ⇱       Home       U+21F1
⇧       Shift       U+21E7     ⇲       End        U+21F2
⌃       Control     U+2303     ⇞       Page Up    U+21DE
⌥       Option/Alt  U+2325     ⇟       Page Down  U+21DF
⌘       Command     U+2318     ⌤       Enter      U+2324
⏎       Return      U+23CE     ␣       Space      U+2423

Searching inside files

grep

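find searches file and folder names only. To search inside files we use grep. Or rather, we will mostly use a flavor of grep called egrep, which understands extended regular expressions. On current systems egrep is equivalent to grep -E.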

grep 'Stately' files/examples/ulysses.txt 
Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of

Search more than one file:

## -i for case-insensitive search
egrep -i 'jabber' files/examples/*.txt
files/examples/jabberwocky.txt:“Beware the Jabberwock, my son! 
files/examples/jabberwocky.txt:      The Jabberwock, with eyes of flame, 
files/examples/jabberwocky.txt:“And hast thou slain the Jabberwock? 
files/examples/ulysses.txt:jabber on their girdles: roguewords, tough nuggets patter in their
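A few other standard grep options are handy when searching several files. The sketch below builds two small throwaway files (the file names and contents are made up for illustration, they are not part of the course materials) and tries -n, -l, and -c.

```shell
# Build two small sample files in a temporary directory (illustrative data only)
tmpdir=$(mktemp -d)
printf 'Beware the Jabberwock\nThe jaws that bite\n' > "$tmpdir/poem.txt"
printf 'No monsters here\n' > "$tmpdir/notes.txt"

# -n prefixes each matching line with its line number
numbered=$(grep -n -i 'jabber' "$tmpdir/poem.txt")

# -l lists only the names of the files that contain a match
matching_files=$(cd "$tmpdir" && grep -l -i 'jabber' *.txt)

# -c counts the matching lines in a file
match_count=$(grep -c -i 'ja' "$tmpdir/poem.txt")

printf '%s\n' "$numbered" "$matching_files" "$match_count"
rm -r "$tmpdir"
```

The options combine freely with -i, so a case-insensitive count or file listing works the same way.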

The grep family

grep and its derivatives are very powerful thanks to their ability to use regular expressions. We will learn about those momentarily. There are also more recent command-line search tools like ripgrep, or rg, a modern version of grep that is very fast, automatically searches subfolders, and has other nice additional features. For more details see the ripgrep project page.

# On a Mac with Homebrew (https://brew.sh)
brew install ripgrep

Integrate rg and fzf

fzf is a command-line fuzzy-finder. It makes ^R really powerful and convenient. For details see the fzf project page.

# On a Mac
brew install fzf

Regular Expressions

Or,

Waiter, there appears to be a language inside my language

Regexps are their own world of text processing

☜ This book is a thing of beauty.

Searching for patterns

  • A regular expression is a way of searching for a piece of text, or pattern, inside some larger body of text, called a string.
  • The simplest sort of search is like the “Find” functionality in a Word Processor. The pattern is a literal letter, number, punctuation mark, word or series of words; the strings are all the sentences in a document.
  • When searching a plain-text file, our strings are the lines of some plain text file. We search the file for some pattern and we want to see every line in the file where there is a match.

Here’s a file:

# cat sends the contents of a file to the console
cat files/examples/basics.txt
apple, banana, pear
Apple, Banana, Pear,
apple pie, apple, apple cake
apple
Apple
Apple. Banana. Pear.
Banana
Guava
Alabama
pear
peach, apple, plum
grey
gray
griy
groy
A period at the end.

Search basics.txt for apple:

egrep 'apple' files/examples/basics.txt 
apple, banana, pear
apple pie, apple, apple cake
apple
peach, apple, plum

  • Regular expressions get their real power from their ability to search for patterns that match more than just literal strings.
  • We want to match things like “Any word that follows an @ symbol”, or “Dates formatted as YYYY/MM/DD” or “Any word that’s repeated”, and so on.
  • To do this we need a way of expressing search terms like “any word” or “a four digit number” and so on. Regexps do this by creating a little mini-language where some tokens stand for classes of things we might search for.
  • The most general matching pattern is “match any single character”. This is represented by the period, or .

egrep '.' files/examples/basics.txt
apple, banana, pear
Apple, Banana, Pear,
apple pie, apple, apple cake
apple
Apple
Apple. Banana. Pear.
Banana
Guava
Alabama
pear
peach, apple, plum
grey
gray
griy
groy
A period at the end.

Everything in the file matches the . pattern.
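The period gets more interesting when it is combined with literal characters. As a minimal sketch, the lines below recreate the grey/gray/griy/groy entries from basics.txt, plus two made-up lines (gry and green) added only for contrast.

```shell
# A few lines from basics.txt plus two invented ones, so the example is self-contained
sample=$(printf 'grey\ngray\ngriy\ngroy\ngry\ngreen\n')

# 'gr.y' means: g, r, exactly one character of any kind, then y
one_any=$(printf '%s\n' "$sample" | grep -E 'gr.y')

# Each . stands for exactly one character, so 'gr..n' needs two of them
two_any=$(printf '%s\n' "$sample" | grep -E 'gr..n')

printf '%s\n' "$one_any" "$two_any"
```

Note that gry does not match 'gr.y': the period has to consume one character, it cannot match nothing.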

  • But … if “.” matches any character, how do you specifically match the character “.”?
  • You have to “escape” the period to tell the regex you want to match it exactly, or literally, rather than interpret it as meaning “match anything”.
  • As in the shell, regular expressions use the backslash character, \, to signal “escape the next character”, or “treat the next character in a predefined special way”. (E.g. \n for “New Line”).
  • To match a “.”, you need the regex “\.”

egrep '\.' files/examples/basics.txt
Apple. Banana. Pear.
A period at the end.

Now only the lines containing a literal period match. (In a terminal, grep highlights the matched periods in red.)

Hang on, I see a further problem

… how do you match a literal \ then?

cat files/examples/specials.txt

Two backslashes: \ \ 
A period or full stop: . 
A dollar sign: $ hello
A caret: ^
[
The @ symbol
]

A backslash \ and a forward slash / 

egrep '\\' files/examples/specials.txt
Two backslashes: \ \ 
A backslash \ and a forward slash / 

  • Well that’s ugly.
  • This is the price we pay for having to express searches for patterns using a language containing these same characters, which we may also want to search for.
  • I promise this will pay off, though.
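The same escaping rule applies to the other special characters. Here is a small sketch using a temporary file modeled on specials.txt (the temporary file is an assumption made so the example runs on its own). It also shows grep's standard -F option, which treats the whole pattern as a fixed string so that nothing needs escaping.

```shell
# A few lines modeled on specials.txt; the quoted 'EOF' keeps backslashes literal
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
A period or full stop: .
A dollar sign: $ hello
A caret: ^
A backslash \ and a forward slash /
EOF

# Escape $ and ^ to match them literally rather than as line anchors
dollar=$(grep -E '\$' "$tmpfile")
caret=$(grep -E '\^' "$tmpfile")

# -F means "fixed string": the pattern is a single literal backslash
backslash=$(grep -F '\' "$tmpfile")

printf '%s\n' "$dollar" "$caret" "$backslash"
rm "$tmpfile"
```

When you only want to find a literal piece of text, -F is often the less error-prone choice.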

Line delimiters

  • Use ^ to match the start of a string.
  • Use $ to match the end of a string.

egrep '^a' files/examples/basics.txt
apple, banana, pear
apple pie, apple, apple cake
apple

egrep 'a$' files/examples/basics.txt
Banana
Guava
Alabama

Matching start and end

To force a regular expression to match only a complete string, anchor it with both ^ and $.


egrep 'apple' files/examples/basics.txt
apple, banana, pear
apple pie, apple, apple cake
apple
peach, apple, plum


egrep '^apple$' files/examples/basics.txt
apple
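Two related tricks, sketched on a few made-up lines: grep's standard -x option asks for whole-line matches, which amounts to wrapping the pattern in ^ and $, and the pattern '^$' (start of line immediately followed by end of line) finds empty lines.

```shell
# Illustrative input: note the empty third line
sample=$(printf 'apple\napple pie\n\npineapple\napple\n')

# Anchoring with ^ and $ matches only lines that are exactly "apple"
whole_regex=$(printf '%s\n' "$sample" | grep -E '^apple$')

# -x requires the pattern to match the entire line, giving the same result
whole_flag=$(printf '%s\n' "$sample" | grep -x 'apple')

# '^$' matches lines with nothing between start and end; -c counts them
blank_count=$(printf '%s\n' "$sample" | grep -c '^$')

printf '%s\n' "$whole_regex" "$blank_count"
```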

Matching character classes

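  • \d matches any digit.
  • \s matches any whitespace (e.g. space, tab, newline).
  • [abc] matches a, b, or c.
  • [^abc] matches anything except a, b, or c.
  • \d and \s are Perl-style shorthands, and not every version of grep supports them. The portable equivalents are [0-9] or [[:digit:]] for digits, and [[:space:]] for whitespace.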
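A short sketch of character classes in action, using the portable bracket forms so that it should behave the same across grep implementations. The sample lines are invented for the example.

```shell
# Illustrative input; the last line contains a tab character
sample=$(printf 'grey\ngray\ngriy\nroom 101\ntab\there\n')

# [ea] matches one character that is either e or a
either=$(printf '%s\n' "$sample" | grep -E 'gr[ea]y')

# [^ea] matches one character that is neither e nor a
neither=$(printf '%s\n' "$sample" | grep -E 'gr[^ea]y')

# [0-9] and [[:digit:]] both match a single digit
digits=$(printf '%s\n' "$sample" | grep -E '[0-9]+')
posix_digits=$(printf '%s\n' "$sample" | grep -E '[[:digit:]]+')

# [[:space:]] matches a space or a tab
whitespace=$(printf '%s\n' "$sample" | grep -E '[[:space:]]')

printf '%s\n' "$either" "$neither" "$digits" "$whitespace"
```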

Alternation

The | operator matches either the pattern on its left or the pattern on its right. Use parentheses to make the precedence of | clear:

# e or a variant 
egrep 'gr(e|a)y' files/examples/basics.txt
grey
gray
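To see why the parentheses matter, compare the grouped pattern with an ungrouped one. This is a sketch on four invented lines: without parentheses, | splits the entire pattern, so 'gre|ay' means “gre, or else ay”.

```shell
# Illustrative input
sample=$(printf 'grey\ngray\ngreat\nokay\n')

# Grouped: g, r, then e or a, then y
grouped=$(printf '%s\n' "$sample" | grep -E 'gr(e|a)y')

# Ungrouped: any line containing "gre" or containing "ay"
ungrouped=$(printf '%s\n' "$sample" | grep -E 'gre|ay')

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```

The ungrouped version also picks up great and okay, which is almost never what was intended.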

Repeated patterns

  • ? is 0 or 1
  • + is 1 or more
  • * is 0 or more

cat files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'CC+' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C[LX]+' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII
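Without color highlighting, every one of these searches prints the same whole line, which hides what each pattern actually matched. One way to see the difference is the -o option, which prints only the matched part of the line. It is supported by GNU and BSD grep, though it may not be available in every grep. A minimal sketch:

```shell
roman='1888 is the longest year in Roman numerals: MDCCCLXXXVIII'

# + is one or more: C followed by at least one more C
plus=$(printf '%s\n' "$roman" | grep -oE 'CC+')

# A character class can be repeated too: C, then a run of Ls and Xs
class_plus=$(printf '%s\n' "$roman" | grep -oE 'C[LX]+')

# ? is zero or one: L followed by at most one X
optional=$(printf '%s\n' "$roman" | grep -oE 'LX?')

# * is zero or more: L followed by as many Xs as there are
star=$(printf '%s\n' "$roman" | grep -oE 'LX*')

printf '%s\n' "$plus" "$class_plus" "$optional" "$star"
```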

Exact numbers of repetitions

  • {n} is exactly n
  • {n,} is n or more
  • {,m} is at most m
  • {n,m} is between n and m

egrep 'C{2}' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C{2,}' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C{2,3}' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII
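Again the printed line is identical each time, so here is a sketch that uses grep's -o option (print only the matched text; available in GNU and BSD grep) to show what each counted repetition picks out of the numeral.

```shell
roman='1888 is the longest year in Roman numerals: MDCCCLXXXVIII'

# {2} is exactly two: the first two Cs match, the third is left over
exactly=$(printf '%s\n' "$roman" | grep -oE 'C{2}')

# {2,} is two or more: all three Cs
at_least=$(printf '%s\n' "$roman" | grep -oE 'C{2,}')

# {2,3} is between two and three: greedy, so it takes all three
between=$(printf '%s\n' "$roman" | grep -oE 'C{2,3}')

# {4} needs four Cs in a row; there are only three, so nothing matches
none_found=$(printf '%s\n' "$roman" | grep -oE 'C{4}' || true)

printf '%s\n' "$exactly" "$at_least" "$between" "[$none_found]"
```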

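By default these are greedy matches. In Perl-style (PCRE-like) regex engines you can make them “lazy”, matching the shortest string possible, by putting a ? after them. This is often very useful! Be aware that the POSIX extended regular expressions used by egrep have no lazy quantifiers: there the extra ? is at best read as “optional”, and the match stays greedy. To get lazy matching, use a Perl-compatible tool such as perl or rg.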

egrep 'C{2,3}?' files/examples/roman.txt 
1888 is the longest year in Roman numerals: MDCCCLXXXVIII

egrep 'C[LX]+?' files/examples/roman.txt
1888 is the longest year in Roman numerals: MDCCCLXXXVIII
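To actually see greedy and lazy matching side by side you need an engine that supports lazy quantifiers. The sketch below uses perl for this (an assumption about tooling: it needs perl on your PATH) and prints only the matched text, which Perl stores in the special variable $&.

```shell
roman='MDCCCLXXXVIII'

# Greedy: take as many Cs as allowed (up to three)
greedy_braces=$(printf '%s\n' "$roman" | perl -ne 'print "$&\n" if /C{2,3}/')

# Lazy: stop as soon as the minimum (two) is satisfied
lazy_braces=$(printf '%s\n' "$roman" | perl -ne 'print "$&\n" if /C{2,3}?/')

# Greedy: C followed by the whole run of Ls and Xs
greedy_plus=$(printf '%s\n' "$roman" | perl -ne 'print "$&\n" if /C[LX]+/')

# Lazy: C followed by just one L or X
lazy_plus=$(printf '%s\n' "$roman" | perl -ne 'print "$&\n" if /C[LX]+?/')

printf '%s\n' "$greedy_braces" "$lazy_braces" "$greedy_plus" "$lazy_plus"
```

Lazy matching matters most when a pattern is followed by something else, for example when capturing the text up to the first closing quote rather than the last one.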

And finally … backreferences

cat files/examples/fruit.txt
apple
apricot
avocado
banana
bell pepper
bilberry
blackberry
blackcurrant
blood orange
blueberry
boysenberry
breadfruit
canary melon
cantaloupe
cherimoya
cherry
chili pepper
clementine
cloudberry
coconut
cranberry
cucumber
currant
damson
date
dragonfruit
durian
eggplant
elderberry
feijoa
fig
goji berry
gooseberry
grape
grapefruit
guava
honeydew
huckleberry
jackfruit
jambul
jujube
kiwi fruit
kumquat
lemon
lime
loquat
lychee
mandarine
mango
mulberry
nectarine
nut
olive
orange
pamelo
papaya
passionfruit
peach
pear
persimmon
physalis
pineapple
plum
pomegranate
pomelo
purple mangosteen
quince
raisin
rambutan
raspberry
redcurrant
rock melon
salal berry
satsuma
star fruit
strawberry
tamarillo
tangerine
ugli fruit
watermelon

Grouping and backreferences

Find all fruits that have a repeated pair of letters:

# Using grep -E here because rg's default regex engine doesn't support backreferences
grep -E '(..)\1' files/examples/fruit.txt
banana
coconut
cucumber
jujube
papaya
salal berry

Backreferences and grouping are very useful for string replacements.
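As a small illustration of a replacement, here is a sketch that uses sed with extended regular expressions to turn “Last, First” into “First Last”. The names are sample data, and sed is a swapped-in tool chosen for this example. Each parenthesized group is captured, and \1 and \2 in the replacement refer back to the first and second group.

```shell
# Sample data: one "Last, First" name per line
names=$(printf 'Lovelace, Ada\nHopper, Grace\n')

# Group 1: everything before the comma. Group 2: everything after ", ".
# The replacement writes the groups back in the opposite order.
swapped=$(printf '%s\n' "$names" | sed -E 's/^([^,]+), (.+)$/\2 \1/')

printf '%s\n' "$swapped"
```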

OK that was a lot

Learning and testing regexps

Practice with a tester like https://regexr.com

Or an app like Patterns

The regex engine or “flavor” used by stringr is Perl- or PCRE-like.

Beyond grep in the shell

  • There’s also Perl, a programming language that’s been displaced to some extent by Python but which remains very good at compactly manipulating strings and being a kind of “glue language” for work in the shell. Perl can act as a kind of more consistent and powerful superset of shell stream-of-strings tools like grep, sed, and awk.
  • One useful (but be-careful-not-to-cut-yourself dangerous) thing Perl can do is easily edit a lot of files in place.

# Find every Quarto file beneath the current directory,
# then edit each one in place to replace every instance of
# `percent_format` with `label_percent`.
# -print0 and -0 keep file names containing spaces intact.
find . -name "*.qmd" -print0 | xargs -0 perl -p -i -e "s/percent_format/label_percent/g"

Beyond grep in the shell

  • You can protect a bit against the dangers of doing this by making the -i option create backup files of everything it touches:

# Find every Quarto file beneath the current directory,
# then edit each one in place to replace every instance of
# `percent_format` with `label_percent`, first saving a copy
# of each file with an .orig extension.
find . -name "*.qmd" -print0 | xargs -0 perl -p -i.orig -e "s/percent_format/label_percent/g"

  • Here the -i.orig flag will back up e.g. analysis.qmd to analysis.qmd.orig before changing analysis.qmd.
  • The other protection, of course, is to have your working files under version control, which we will get to later in the semester.
  • For more on Perl oneliners see, for example, the Perl one-liners cookbook.
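One cautious workflow, sketched here in a throwaway directory with made-up file names: first list the files that would change, then edit with backups, then compare the backup against the result. It assumes perl is installed, along with a find and xargs that support -print0 and -0 (GNU and BSD versions do).

```shell
# Work in a temporary directory so nothing real is touched
workdir=$(mktemp -d)
printf 'scale_y_continuous(labels = percent_format())\n' > "$workdir/analysis.qmd"
printf 'No matches in this file\n' > "$workdir/notes.qmd"

# Step 1: preview which files contain the text, without editing anything
to_change=$(cd "$workdir" && find . -name "*.qmd" -exec grep -l 'percent_format' {} +)

# Step 2: edit in place, keeping a .orig backup of each file perl processes
(cd "$workdir" && find . -name "*.qmd" -print0 |
  xargs -0 perl -p -i.orig -e 's/percent_format/label_percent/g')

# Step 3: compare the backup with the edited file (diff exits non-zero on differences)
changes=$(diff "$workdir/analysis.qmd.orig" "$workdir/analysis.qmd" || true)
edited=$(cat "$workdir/analysis.qmd")
backup=$(cat "$workdir/analysis.qmd.orig")

printf '%s\n' "$to_change" "$changes"
rm -r "$workdir"
```

Once you are satisfied with the changes, the .orig files can be deleted.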

Text Editors

Choices, choices

  • There are many good text editors.
  • The main point is: pick one, and learn the hell out of it.
  • The RStudio IDE has many of the features of a good editor built in, as well as doing other things.
  • Several of the other editors also have good support for R and many other languages.

RStudio’s Text Editor

  • I’ll mostly confine my examples to RStudio’s text editor

Danger, Will Robinson

One view of things

  • Endlessly futzing with your text editor’s setup is a displacement activity.
  • The tools are not magic. They cannot by themselves make you do good work. Or any work.

Things any good text editor will do

Specialized text display

  • Syntax highlighting
  • Brace and parenthesis matching
  • Outlining / Folding

Edit text!

  • Easy navigation with keyboard shortcuts
  • Keyboard-based selection and movement of text, lines, and logical sections
  • Search and replace using regular expressions

Things most good text editors also do

Cursor and Insertion-Point tricks

  • Multiple Cursors
  • Rectangular / Columnar editing
  • A snippet system of some sort

IDE-like functionality

  • Integration with documentation
  • Static Analysis and Linting
  • Integration with a REPL
  • Diffing files
  • Integration with version control systems
  • Remote editing

Other things

Specifically for academia

  • Citation and reference management
  • Integration with Zotero
  • Connection to Pandoc