Text Editing

Modern Plain Text Social Science: Week 3

Kieran Healy

September 18, 2023

Shell Scripting

Shell Scripts

  • If you find yourself doing the same task repeatedly, think about whether it makes sense to write a script
  • Shell scripts can become mini-programs, but can also be just one or two lines that pull together a few commands
  • They really show their strength when there’s some fiddly thing you want to do to a lot of files or directories

Shell Scripts

#!/usr/bin/env bash

echo "Hello World!"
  • The first line, starting with #! (the “shebang”), tells the system which interpreter should run the script
  • chmod 755 script.sh or chmod +x script.sh to make executable
  • The interpreter doesn’t have to be the shell: other languages can be scripted too
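
As a rough sketch of the steps above, this writes a two-line script into a temporary directory, makes it executable, and runs it. The file name hello.sh and the use of mktemp are assumptions made for illustration.

```shell
# Work in a throwaway directory so nothing in your project is touched
tmpdir=$(mktemp -d)

# Write a two-line script: the shebang line, then one command
printf '%s\n' '#!/usr/bin/env bash' 'echo "Hello World!"' > "$tmpdir/hello.sh"

chmod +x "$tmpdir/hello.sh"     # make it executable
output=$("$tmpdir/hello.sh")    # run it directly; the shebang picks the interpreter
echo "$output"

rm -r "$tmpdir"                 # clean up
```

Without the chmod step, running the file directly would fail with a "Permission denied" error.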

Shell Scripts

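#!/usr/bin/env bash

# Make a thumbnail for each PNG in the current directory
# (convert is ImageMagick's command-line tool)
for i in *.png; do
  [ -e "$i" ] || continue          # Skip the unexpanded pattern if no PNGs exist

  FILENAME=$(basename -- "$i")     # Full filename
  EXTENSION="${FILENAME##*.}"      # Extension only
  FILENAME="${FILENAME%.*}"        # Filename without extension

  convert "$i" -thumbnail 500 "${FILENAME}-thumb.${EXTENSION}"
done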
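
A minimal sketch of what the two parameter expansions in the loop are doing. It uses a made-up path (figures/plot.final.png is hypothetical) rather than real image files, so it runs without ImageMagick.

```shell
# Hypothetical path, used only for illustration
path="figures/plot.final.png"

filename=$(basename -- "$path")   # strip the directory: plot.final.png
extension="${filename##*.}"       # drop the longest prefix ending in ".": png
stem="${filename%.*}"             # drop the shortest suffix starting with ".": plot.final

thumb="${stem}-thumb.${extension}"
echo "$thumb"                     # plot.final-thumb.png
```

Note that ## removes the longest matching prefix while % removes the shortest matching suffix, which is why a name with two dots still splits at the last one.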

Shell Scripts

  • The shell can talk to the clipboard (pbcopy and pbpaste are macOS tools):
echo I am sending this sentence to the clipboard | pbcopy
  • Back from the clipboard:
pbpaste | wc -c
      44

Regular Expressions

Setup

  • We’ll do this in R because that’s where most of you will use regexps first.
  • Most of the ideas we’ll cover carry over to the shell and other regexp contexts
library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine
library(stringr)   # string processing

Regular Expressions

Or,

Waiter, there appears to be a language inside my language

Regular Expressions

  • Regexps are their own world of text processing

☜ This book is a thing of beauty.

Searching for patterns

  • A regular expression is a way of searching for a piece of text, or pattern, inside some larger body of text, called a string.
  • The simplest sort of search is like the “Find” function in a word processor, where the pattern is a literal letter, number, punctuation mark, word, or series of words, and the text is a document that gets searched one line at a time. The next step up is “Find and Replace”.

Searching for patterns

  • Every pattern-searching function in stringr has the same basic form:
str_view(<STRING>, <PATTERN>, [...]) # where [...] means "maybe some options"

Searching for patterns

  • Functions that replace as well as detect strings all have this form:
str_replace(<STRING>, <PATTERN>, <REPLACEMENT>)

Searching for patterns

x <- c("apple", "banana", "pear")

str_view(x, "an")
[2] │ b<an><an>a

Searching for patterns

  • Regular expressions get their real power from wildcards, i.e. tokens that match not just literal strings but more general and more complex patterns.
  • The most general pattern-matching token is “match any single character” (except, by default, a newline). This is represented by the period, or .
  • But … if “.” matches any character, how do you specifically match the character “.”?

Escaping

  • You have to “escape” the period to tell the regex you want to match it exactly, rather than interpret it as meaning “match anything”.
  • Regexes use the backslash, \, to signal “escape the next character”.
  • To match a “.”, you need the regex “\.”

Hang on, I see a further problem

  • We use strings to represent regular expressions. \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string “\\.”
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
\.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>
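
The same idea carries over to the shell, with one difference worth seeing: inside single quotes the shell leaves the backslash alone, so one backslash is enough. A small sketch, assuming grep is on your PATH, using the same three strings:

```shell
# The same three strings as in the R example, one per line
strings=$(printf '%s\n' "abc" "a.c" "bef")

# Escaped dot: matches only the literal "a.c"
literal=$(printf '%s\n' "$strings" | grep 'a\.c')

# Unescaped dot: "." matches any character, so "abc" matches too
any_count=$(printf '%s\n' "$strings" | grep -c 'a.c')

echo "$literal"      # a.c
echo "$any_count"    # 2
```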

But … how do you match a literal \?

x <- "a\\b"
writeLines(x)
a\b

str_view(x, "\\\\") # In R you need four!
[1] │ a<\>b
  • Well that’s ugly

  • This is the price we pay for having to express searches for patterns using a language containing these same characters, which we may also want to search for.

  • I promise this will pay off!

Line delimiters

  • Use ^ to match the start of a string.
  • Use $ to match the end of a string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
[1] │ <a>pple
str_view(x, "a$")
[2] │ banan<a>

Matching start and end

  • To force a regular expression to only match a complete string, anchor it with both ^ and $
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake


str_view(x, "^apple$")
[2] │ <apple>
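
In the shell, grep works line by line, so ^ and $ anchor to the start and end of each line. A short sketch, assuming the same three strings as input; the -x flag is shown as an alternative that asks grep to match whole lines only:

```shell
lines=$(printf '%s\n' "apple pie" "apple" "apple cake")

anywhere=$(printf '%s\n' "$lines" | grep -c 'apple')      # pattern may appear anywhere
anchored=$(printf '%s\n' "$lines" | grep -c '^apple$')    # whole line must be "apple"
whole_line=$(printf '%s\n' "$lines" | grep -cx 'apple')   # -x: same effect without anchors

echo "$anywhere $anchored $whole_line"                    # 3 1 1
```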

Matching character classes

  • \d matches any digit.
  • \s matches any whitespace (e.g. space, tab, newline).
  • [abc] matches a, b, or c.
  • [^abc] matches anything except a, b, or c.
  • In an R string, remember to double the backslash: "\\d" and "\\s".
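
As a hedged illustration of the same classes in the shell, this sketch assumes grep -E and uses the POSIX spellings [[:digit:]] and [[:space:]] in place of \d and \s. The sample strings are made up.

```shell
# Four made-up lines to search
words=$(printf '%s\n' "cab" "dog 42" "xyz" "b")

has_abc=$(printf '%s\n' "$words" | grep -cE '[abc]')            # lines containing a, b, or c
no_abc=$(printf '%s\n' "$words" | grep -cE '^[^abc]+$')         # lines made only of other characters
has_digit=$(printf '%s\n' "$words" | grep -cE '[[:digit:]]')    # lines containing a digit
has_space=$(printf '%s\n' "$words" | grep -cE '[[:space:]]')    # lines containing whitespace

echo "$has_abc $no_abc $has_digit $has_space"                   # 2 2 1 1
```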

Matching the special characters

To match a literal character that normally has special meaning in a regex, you can put it inside a character class instead of escaping it:

Example 1

x <- c("abc", "a.c", "a*c", "a c")
str_view(x, "a[.]c")
[2] │ <a.c>

Example 2

str_view(x, ".[*]c")
[3] │ <a*c>

Alternation

Use parentheses to make the precedence of | clear:

str_view(c("groy", "grey", "griy", "gray"), "gr(e|a)y")
[2] │ <grey>
[4] │ <gray>
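
To see why the parentheses matter, here is a small sketch with grep -E and four made-up words. Without parentheses, | splits the whole pattern in two, so gre|ay means "gre" or "ay" rather than "gr, then e or a, then y".

```shell
words=$(printf '%s\n' "grey" "gray" "great" "day")

# With parentheses: only grey and gray match
grouped=$(printf '%s\n' "$words" | grep -cE 'gr(e|a)y')

# Without parentheses: anything containing "gre" or "ay" matches
ungrouped=$(printf '%s\n' "$words" | grep -cE 'gre|ay')

echo "$grouped $ungrouped"    # 2 4
```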

Repeated patterns

  • ? is 0 or 1
  • + is 1 or more
  • * is 0 or more

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII

str_view(x, "CC+")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, 'C[LX]+')
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII

Exact numbers of repetitions

  • {n} is exactly n
  • {n,} is n or more
  • {,m} is at most m
  • {n,m} is between n and m

str_view(x, "C{2}")
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII

str_view(x, "C{2,}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

str_view(x, "C{2,3}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
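
The repetition operators behave the same way in the shell. A sketch assuming a grep that supports -o (print only the matched text) alongside -E; paste is used here just to put multiple matches on one line.

```shell
numeral="MDCCCLXXXVIII"

# Each command prints only the text that matched, joined with spaces
optional=$(printf '%s\n' "$numeral" | grep -oE 'CC?' | paste -sd ' ' -)       # C, then an optional C
exactly_two=$(printf '%s\n' "$numeral" | grep -oE 'C{2}' | paste -sd ' ' -)   # exactly two Cs
two_or_more=$(printf '%s\n' "$numeral" | grep -oE 'C{2,}' | paste -sd ' ' -)  # two or more Cs
c_then_lx=$(printf '%s\n' "$numeral" | grep -oE 'C[LX]+' | paste -sd ' ' -)   # C, then one or more L or X

echo "$optional"       # CC C
echo "$exactly_two"    # CC
echo "$two_or_more"    # CCC
echo "$c_then_lx"      # CLXXX
```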

By default these are greedy matches. You can make them “lazy”, matching the shortest string possible by putting a ? after them. This is often very useful!

str_view(x, 'C{2,3}?')
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII

str_view(x, 'C[LX]+?')
[1] │ 1888 is the longest year in Roman numerals: MDCC<CL>XXXVIII

And finally … backreferences

fruit # built into stringr
 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            "lychee"            "mandarine"        
[49] "mango"             "mulberry"          "nectarine"        
[52] "nut"               "olive"             "orange"           
[55] "pamelo"            "papaya"            "passionfruit"     
[58] "peach"             "pear"              "persimmon"        
[61] "physalis"          "pineapple"         "plum"             
[64] "pomegranate"       "pomelo"            "purple mangosteen"
[67] "quince"            "raisin"            "rambutan"         
[70] "raspberry"         "redcurrant"        "rock melon"       
[73] "salal berry"       "satsuma"           "star fruit"       
[76] "strawberry"        "tamarillo"         "tangerine"        
[79] "ugli fruit"        "watermelon"       

Grouping and backreferences

Find all fruits that have a repeated pair of letters:

str_view(fruit, "(..)\\1", match = TRUE)
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Grouping and backreferences

Backreferences and grouping are very useful for string replacements.
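
As one illustration of grouping in a replacement, this sketch uses sed -E in the shell: each parenthesized group is captured, and \1 and \2 refer back to them in the replacement. The name string is just sample input, and the second command repeats the repeated-pair search on three fruits.

```shell
# Swap "Last, First" into "First Last" using two captured groups
swapped=$(printf '%s\n' "Healy, Kieran" | sed -E 's/^([^,]+), (.+)$/\2 \1/')
echo "$swapped"    # Kieran Healy

# A backreference inside the pattern itself: find a repeated pair of characters
repeated=$(printf '%s\n' "banana" "apple" "coconut" | grep -E '(..)\1' | paste -sd ' ' -)
echo "$repeated"   # banana coconut
```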

OK that was a lot

Learning and testing regexps

Practice with a tester like https://regexr.com

Or an app like Patterns

The regex engine or “flavor” used by stringr is ICU's, via the stringi package. Its syntax is Perl- or PCRE-like.

Regexes in the Shell

  • Grep searches for text inside files
# Search recursively through all subdirs below the current one
grep -r "Grep searches for text" . 


grep "format: " *.qmd


# Count the number of matching lines in each file
grep -c "format: " *.qmd
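
One thing to keep in mind is that grep -c counts matching lines rather than individual matches. A small sketch, assuming a throwaway file (notes.qmd is a made-up name) and a grep that supports -o:

```shell
tmpdir=$(mktemp -d)

# A made-up file in which one line contains the search text twice
printf '%s\n' '---' 'title: "Notes"' 'format: html' '---' \
  '# format: pdf, format: docx' > "$tmpdir/notes.qmd"

# -c counts the lines that match
line_count=$(grep -c "format: " "$tmpdir/notes.qmd")

# -o prints each match on its own line, so counting those lines counts matches
match_count=$(grep -o "format: " "$tmpdir/notes.qmd" | wc -l | tr -d ' ')

echo "$line_count $match_count"    # 2 3

rm -r "$tmpdir"
```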

Regexes in the Shell

  • Ripgrep, or rg, is quicker than grep and has some nice features
rg Kieran .

rg -t yaml "url:" .

Regexes in the Shell

  • Standard shell tools like sed, awk, and grep can all use some version of regular expressions.
grep -E "^The sky|night.$" files/examples/sentences.txt
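
The contents of sentences.txt are not shown here, so this sketch uses four made-up sentences to show what the pattern does. Because the . before $ is unescaped, it matches any final character, not just a period. Escaping it is one way to be stricter.

```shell
# Made-up sample sentences standing in for the real file
sentences=$(printf '%s\n' \
  "The sky was clear." \
  "We looked at the sky." \
  "It was a dark night." \
  "What a night!")

# Lines that start with "The sky", or end in "night" plus any one character
loose=$(printf '%s\n' "$sentences" | grep -cE '^The sky|night.$')

# Same, but the final character must be a literal period
strict=$(printf '%s\n' "$sentences" | grep -cE '^The sky|night\.$')

echo "$loose $strict"    # 3 2
```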

Regexes in the Shell

  • There’s also Perl, a programming language that’s been displaced to some extent by Python but which remains very good at compactly manipulating strings.
  • One useful (but be-careful-not-to-cut-yourself dangerous) thing Perl can do is easily edit a lot of files in place.
# Find every Rmarkdown file beneath the current directory
# Then edit each one in place to replace every instance of 
# `percent_format` with `label_percent`
# (-print0 and -0 keep filenames containing spaces intact)
find . -name "*.Rmd" -print0 | xargs -0 perl -p -i -e "s/percent_format/label_percent/g"

Regexes in the Shell

  • You can protect a bit against the dangers of doing this by making the -i option create backup files of everything it touches:
# Find every Rmarkdown file beneath the current directory
# Then edit each one in place to replace every instance of 
# `percent_format` with `label_percent`, keeping a backup
# of each original (-print0 and -0 handle spaces in filenames)
find . -name "*.Rmd" -print0 | xargs -0 perl -p -i.orig -e "s/percent_format/label_percent/g"
  • Here the -i.orig flag will back up e.g. analysis.Rmd to analysis.Rmd.orig.
  • For more on Perl one-liners see, for example, the Perl one-liners cookbook.

Text Editors

Choices, choices

  • There are many good text editors.
  • The main point is: pick one, and learn the hell out of it.
  • The RStudio IDE has many of the features of a good editor built in, as well as doing other things.
  • Several of the other editors also have good support for R and many other languages.

Danger, Will Robinson

One view of things

Danger, Will Robinson

  • Endlessly futzing with your text editor’s setup is a displacement activity.
  • The tools are not magic. They cannot by themselves make you do good work. Or any work.

Things any good text editor will do

Specialized text display

  • Syntax highlighting
  • Brace and parenthesis matching
  • Outlining / Folding

Edit text!

  • Easy navigation with keyboard shortcuts
  • Keyboard-based selection and movement of text, lines, and logical sections
  • Search and replace using regular expressions

Things most good text editors also do

Cursor and Insertion-Point tricks

  • Multiple Cursors
  • Rectangular / Columnar editing
  • A snippet system of some sort

Things most good text editors also do

IDE-like functionality

  • Integration with documentation
  • Static analysis and Linting
  • Integration with a REPL
  • Diffing files
  • Integration with version control systems
  • Remote editing

Other things

Specifically for academia

  • Citation and reference management
  • Integration with Zotero
  • Connection to Pandoc