Files & the File System

Modern Plain Text Computing
Week 01b

Kieran Healy

August 28, 2024

Files

What is a file?

You very likely have never used one of these. Perhaps you’ve never even seen one in real life.

The file cabinet!

“Could capitalism, surveillance, and governance have developed in the twentieth century without filing cabinets? Of course, but only if there had been another way to store and circulate paper efficiently; if that had been the case, that technology would be the object of this book.” — Craig Robertson The Filing Cabinet: A Vertical History of Information (University of Minnesota Press, 2021), 3.

The file cabinet!

“Cabinet logic involves the creation of interior compartments to organize storage space according to classification and indexing systems … Partitions made from paper, not wood, divided storage space to create rigorous order; these partitions took the form of tabbed manila folders separated by tabbed guide cards. This iteration of the logic dispensed with a separate index to make paper discoverable by utilizing the “very organization of the material and its location” with the “vertical guides serving as locating medium.” Elimination of an index was signaled in filing literature by the terms “direct alphabet index” and “automatic index” … Without the need to consult a separate index, a clerk grouped papers together on their edge behind tabs labeled with classifications, so any given paper could be found quickly.” — Robertson, The Filing Cabinet, 104–5.

Index cards

Like a filing cabinet, but smol

Automating Information and Control

A music box

A Jacquard Loom

Jacquard Loom Cards

Tabulation Machines

Hollerith Cards

Hollerith Machines

Hollerith Operators

Demonstrating an older card-puncher, probably to show how things had improved with census tabulation methods. This is likely the “Before” picture with a roll from the 1890 Census. The card-puncher is a Pantograph.

Hollerith Operators

Same woman as the previous photo; her colleague on the right is demonstrating the newer, faster IBM Type 001 Key Puncher. (Again, probably a re-enactment / demo of earlier techniques.)

Programmable Computers

Logic from Sand

The best book to read about how the guts of a programmable computer work is Charles Petzold’s Code: The Hidden Language of Computer Hardware and Software, 2nd ed. (Microsoft Press, 2022).

IBM punch cards

In the longer term, punch card writers got much more efficient. And now the cards could be fed into machines that could use them to run programs instead of just tabulating the punches.

IBM punch cards

An IBM punch card is 80 columns wide. The first CRT terminals displayed 80 columns of text for this reason. You’ll see 80 columns of text pop up as a standard in all kinds of places.

Big Iron

No screens! Paper in, paper out for the operator; magnetic tapes for storage in the background. This is an IBM System/360, the most important class of mainframe in the 1960s and early 1970s.

One thing that’s hard to convey in pictures is the way that—because of all the daisy-wheel or tractor-fed printing, mechanical card processing, and huge reels of tape spinning up and down—rooms like this were loud.

Storage

Notice that the “File” here is the machine itself, or at most a single disk platter.

Storage

The older way of speaking is still with us, as when we speak of someone’s “Application File” or “Tenure File”; that is, a file is a collection of related documents.

But the newer way, where “file” means “a single document”, is now dominant, especially in computing.

What Files Are

A file is a metaphor

  • Your computer does not have “files” in the way that a filing cabinet has files.
  • A file is an abstraction, a way of naming and organizing data on your computer that at a lower level is “just zeros and ones” (and at a lower level than that is just patterns in some physical substrate that can be interpreted as zeros and ones).
  • The file metaphor in computing dates most prominently to the development of the Unix operating system in the early 1970s
  • Files are organized in filesystems

There are many kinds of files

  • As many as there are kinds of application.
  • Files have the name someone gives them. My Thesis, term_paper, and so on.
  • There’s a longstanding (though weak) convention about using file extensions, tagged on to the end of a name, to signal to users what kind of file it is: term_paper.docx, .xlsx, .ppt, .pdf, .sqlite, .png, .jpg, .ps, .mp3, .mp4, .gif, .csv, .Rmd, .qmd, .md, .txt.
  • Files don’t know what their extension is, a bit like how electrons don’t know what color the outside of their copper wire is.
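One way to see that the extension is only part of the name is to compare it with the file's actual contents. The sketch below is illustrative, not a general file-type detector: it writes the well-known eight-byte PNG signature into a temporary file that has been given a hypothetical, misleading name (term_paper.txt), then checks the name and the contents separately. The helper looks_like_png is an invented name for this example.

```python
from pathlib import Path
import tempfile

# The eight-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path):
    """Return True if the file's first bytes match the PNG signature."""
    with open(path, "rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file: PNG-style bytes saved under a misleading name.
    mislabeled = Path(tmp) / "term_paper.txt"
    mislabeled.write_bytes(PNG_SIGNATURE + b"not a real image body")

    suffix = mislabeled.suffix            # comes from the name only
    is_png = looks_like_png(mislabeled)   # comes from the contents only

print(suffix, is_png)  # .txt True
```

The name says "text" and the bytes say "PNG". Applications that rely on the extension will believe the name; applications that inspect the contents will not.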

Binary and Plain Text files

  • Understanding the general notion of “encoding information” is a very rich and deep topic that, sadly, we are going to skip.
  • If a file is in some binary format then in general you won’t be able to read its contents just by looking inside it. You will need an application that understands the file’s particular format; i.e. the way that information in it is encoded.
  • A .jpg file uses a set of rules to store numbers that can be interpreted as corresponding to things like the hue and location of a pixel. But you won’t see a picture if you look inside a .jpg file using a text editor. You’ll need an application that knows how to read .jpg files.
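The difference between a binary format and text can be sketched with a single number. Below, the value 1000 is stored two ways: as the characters "1000", and as a four-byte little-endian unsigned integer (one arbitrary binary layout chosen for illustration; real formats such as .jpg are far more elaborate). Both take four bytes, but only one is legible in a text editor.

```python
import struct

number = 1000

# As text: one byte per character, '1', '0', '0', '0'.
as_text = str(number).encode("ascii")

# As binary: a little-endian unsigned 32-bit integer ("<I").
as_binary = struct.pack("<I", number)

print(as_text)    # b'1000'
print(as_binary)  # b'\xe8\x03\x00\x00'

# Reading the binary form back requires knowing the rule used to write it.
recovered = struct.unpack("<I", as_binary)[0]
print(recovered)  # 1000
```

Without the "<I" rule, the four binary bytes are just noise; that rule is the part an application that "understands the format" supplies.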

What is Plain Text?

  • Text files, though, are sort of special. What’s visibly in them appears to correspond much more closely to what they represent. A plain text file seems to represent the letter “A” with a symbol that looks like an “A”. So much so that we can say it is an “A”.

  • That means that when you look at a text file you can see what is in it immediately. And editing the contents of the file is the same as editing its text.

  • There’s still an “encoding” of course! It’s still necessary to have an application that can read the text file and display it on a screen, etc. But what’s inside seems much closer to being immediately interpretable “just by looking”, because most of it is letters and numbers.

But wait!

  • I thought you said computers just store ones and zeros?
  • Yes, this is true. In ASCII encoding, for instance, an “A” is really just conventionally the symbol represented by the seven-bit binary number 1000001, which exists on some sort of storage medium (an SSD, a hard disk, a floppy disk, a punch card, a reel of paper, whatever) in such a way that some device can read its contents.
  • ASCII is the American Standard Code for Information Interchange. It was first specified in 1963.
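The correspondence between “A” and 1000001 can be checked directly. This is a small sketch using Python's built-in ord, chr, and format functions; the variable names are just for illustration.

```python
letter = "A"

code = ord(letter)            # the character's code point as an integer
bits = format(code, "07b")    # the same number as seven binary digits
back = chr(int(bits, 2))      # and back from bits to a character

print(code)  # 65
print(bits)  # 1000001
print(back)  # A

# Encoding the one-character string as ASCII yields a single byte.
stored = letter.encode("ascii")
print(stored, len(stored))  # b'A' 1
```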

ASCII

The venerable and now outdated ASCII character set: 26 uppercase letters; 26 lowercase letters; 10 digits; 32 punctuation marks and other printable symbols; the space; and 33 control characters ultimately derived from telegraph code and teletype machines.

Binary ASCII Decimal Hexadecimal Octal
0000000 null 0 0 0
0000001 start of header 1 1 1
0000010 start of text 2 2 2
0000011 end of text 3 3 3
0000100 end of transmission 4 4 4
0000101 enquire 5 5 5
0000110 acknowledge 6 6 6
0000111 bell 7 7 7
0001000 backspace 8 8 10
0001001 horizontal tab 9 9 11
0001010 linefeed 10 A 12
0001011 vertical tab 11 B 13
0001100 form feed 12 C 14
0001101 carriage return 13 D 15
0001110 shift out 14 E 16
0001111 shift in 15 F 17
0010000 data link escape 16 10 20
0010001 device control 1/Xon 17 11 21
0010010 device control 2 18 12 22
0010011 device control 3/Xoff 19 13 23
0010100 device control 4 20 14 24
0010101 negative acknowledge 21 15 25
0010110 synchronous idle 22 16 26
0010111 end of transmission block 23 17 27
0011000 cancel 24 18 30
0011001 end of medium 25 19 31
0011010 end of file/ substitute 26 1A 32
0011011 escape 27 1B 33
0011100 file separator 28 1C 34
0011101 group separator 29 1D 35
0011110 record separator 30 1E 36
0011111 unit separator 31 1F 37
0100000 space 32 20 40
0100001 ! 33 21 41
0100010 " 34 22 42
0100011 # 35 23 43
0100100 $ 36 24 44
0100101 % 37 25 45
0100110 & 38 26 46
0100111 ' 39 27 47
0101000 ( 40 28 50
0101001 ) 41 29 51
0101010 * 42 2A 52
0101011 + 43 2B 53
0101100 , 44 2C 54
0101101 - 45 2D 55
0101110 . 46 2E 56
0101111 / 47 2F 57
0110000 0 48 30 60
0110001 1 49 31 61
0110010 2 50 32 62
0110011 3 51 33 63
0110100 4 52 34 64
0110101 5 53 35 65
0110110 6 54 36 66
0110111 7 55 37 67
0111000 8 56 38 70
0111001 9 57 39 71
0111010 : 58 3A 72
0111011 ; 59 3B 73
0111100 < 60 3C 74
0111101 = 61 3D 75
0111110 > 62 3E 76
0111111 ? 63 3F 77
1000000 @ 64 40 100
1000001 A 65 41 101
1000010 B 66 42 102
1000011 C 67 43 103
1000100 D 68 44 104
1000101 E 69 45 105
1000110 F 70 46 106
1000111 G 71 47 107
1001000 H 72 48 110
1001001 I 73 49 111
1001010 J 74 4A 112
1001011 K 75 4B 113
1001100 L 76 4C 114
1001101 M 77 4D 115
1001110 N 78 4E 116
1001111 O 79 4F 117
1010000 P 80 50 120
1010001 Q 81 51 121
1010010 R 82 52 122
1010011 S 83 53 123
1010100 T 84 54 124
1010101 U 85 55 125
1010110 V 86 56 126
1010111 W 87 57 127
1011000 X 88 58 130
1011001 Y 89 59 131
1011010 Z 90 5A 132
1011011 [ 91 5B 133
1011100 \ 92 5C 134
1011101 ] 93 5D 135
1011110 ^ 94 5E 136
1011111 _ 95 5F 137
1100000 ` 96 60 140
1100001 a 97 61 141
1100010 b 98 62 142
1100011 c 99 63 143
1100100 d 100 64 144
1100101 e 101 65 145
1100110 f 102 66 146
1100111 g 103 67 147
1101000 h 104 68 150
1101001 i 105 69 151
1101010 j 106 6A 152
1101011 k 107 6B 153
1101100 l 108 6C 154
1101101 m 109 6D 155
1101110 n 110 6E 156
1101111 o 111 6F 157
1110000 p 112 70 160
1110001 q 113 71 161
1110010 r 114 72 162
1110011 s 115 73 163
1110100 t 116 74 164
1110101 u 117 75 165
1110110 v 118 76 166
1110111 w 119 77 167
1111000 x 120 78 170
1111001 y 121 79 171
1111010 z 122 7A 172
1111011 { 123 7B 173
1111100 | 124 7C 174
1111101 } 125 7D 175
1111110 ~ 126 7E 176
1111111 DEL 127 7F 177
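A table like this one can be regenerated from the code numbers alone, since each column is the same integer written in a different base. The sketch below is one possible way to do it; the function name ascii_row is invented for this example, and non-printing characters are shown using Python's escaped notation rather than the descriptive names used in the table.

```python
def ascii_row(code):
    """Format one code point as: binary, character, decimal, hex, octal."""
    ch = chr(code)
    # Show the space and control characters in escaped form, e.g. '\n'.
    label = ch if ch.isprintable() and ch != " " else repr(ch)
    return f"{code:07b} {label} {code} {code:X} {code:o}"


print("Binary ASCII Decimal Hexadecimal Octal")
for code in range(128):
    print(ascii_row(code))

# Spot checks against the table:
print(ascii_row(65))  # 1000001 A 65 41 101
print(ascii_row(64))  # 1000000 @ 64 40 100
```

Formatting the binary column as a zero-padded string (the "07b" format) keeps all seven digits, including leading zeros.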

Modern Text: Unicode and UTF-8

  • ASCII is a seven-bit system that only has \(2^7\) or 128 “code points”, i.e. individual slots that could represent anything. It left out all kinds of things. (Other alphabets, for instance. Also any diacritics or accents. And any number of symbols.)
  • Eight-bit computers allowed for 256 code points. The second 128 never had a single standard for what they should represent. The most common extension was ISO-8859-1 or “Latin1” encoding, but there were others too. This created conflicts and confusion when a program or application expecting text encoded according to one standard was fed text encoded with a different standard.
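The conflict is easy to reproduce. In the sketch below, the same four bytes are decoded under three different assumptions: Latin-1, ISO-8859-5 (a Cyrillic eight-bit encoding, picked here only as a contrasting example), and strict seven-bit ASCII. The byte 0xE9 sits in the "second 128", so its meaning depends entirely on which standard the reader assumes.

```python
# Four bytes: 'C', 'a', 'f', and one byte from the upper half (0xE9).
raw = b"Caf\xe9"

as_latin1 = raw.decode("latin-1")      # 0xE9 read as an accented e
as_cyrillic = raw.decode("iso8859_5")  # 0xE9 read as a Cyrillic letter

print(as_latin1)    # Café
print(as_cyrillic)  # the first three letters match; the last does not

# Strict ASCII has no slot for 0xE9 at all, so decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)  # False
```

The first three bytes decode identically every time because both eight-bit encodings agree with ASCII on the lower 128 code points.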

Modern Text: Unicode and UTF-8

  • Encoding conflicts are why you still sometimes see this sort of thing on web pages: “CafÃ©” or “Caf◻” instead of “Café”.
  • It is surprisingly difficult to establish the encoding of a large text file that doesn’t explicitly declare how it’s encoded in some sort of metadata. (You can guess, but it can be super-annoying.)
  • Nowadays this has mostly been resolved by the adoption of Unicode and its simplest and most widespread encoding, UTF-8, which is backward-compatible with ASCII and can represent 1,112,064 code points. It uses between one and four eight-bit bytes to represent each code point.
  • Many older datasets may still be encoded in something other than UTF-8, however.
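Both points, the variable width of UTF-8 and the garbling that results from a mismatch, can be sketched in a few lines. The sample characters are arbitrary choices for illustration.

```python
# UTF-8 uses one to four bytes per code point.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)  # [1, 2, 3, 4]

# Plain ASCII text is already valid UTF-8, byte for byte.
same_bytes = "term_paper".encode("ascii") == "term_paper".encode("utf-8")
print(same_bytes)  # True

# Mismatch 1: UTF-8 bytes read as Latin-1 turn one character into two.
garbled = "Caf\u00e9".encode("utf-8").decode("latin-1")
print(garbled)  # CafÃ©

# Mismatch 2: Latin-1 bytes read as UTF-8 are invalid, so a tolerant
# decoder substitutes the replacement character U+FFFD.
replaced = "Caf\u00e9".encode("latin-1").decode("utf-8", errors="replace")
print(replaced == "Caf\ufffd")  # True
```

When an older dataset shows these symptoms, re-reading it with the encoding it was actually written in (often Latin-1 or a Windows code page) is usually the fix, though which one applies has to be established case by case.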

Organizing Files

Input/Output

  • Beginning in the 1970s, computing rapidly moves away from print I/O and towards screens.
  • Storage capacity and processing power increase radically (and the hardware gets much smaller) with the development of hard drives and integrated circuits.
  • We get to a point where our “Teletype” interface with the machine is purely metaphorical: this is the command line or console.
  • And after that, in the late 1970s and early 1980s, an entirely new set of metaphors gets introduced: files represented by “icons” inside “windows”, first on a metaphorical “desktop” and then later on a more abstract touch-based surface.

A late-model teletype (TTY) machine

The DEC VT-100 Terminal (1978)

The IBM PC (1981)

The Apple Macintosh (1984)

The macOS Terminal app icon

This is where we came in

  • The “Office” and “Engineering” models really start to diverge in the 1980s
  • A lot of computing gets done using the Engineering model and its metaphors, even as the Office model comes to dominate.
  • But many of these newer systems remain built on top of the world made out of the older metaphors. And in particular, the idea of named files living in a hierarchical file system that are acted on in sequence through written instructions remains extremely important for many computing tasks.
  • Especially the stuff we need to do.

Back to the file system

Files

  • Our data is stored — or represented as being stored — in a file system.
  • This is, again, a way of organizing items for our benefit.
  • The UNIX operating system developed at Bell Labs codifies the modern “file” metaphor.
  • Files are named items that live in a hierarchical file system. “Ordinary” documents like notes.txt are thought of as files, which seems natural to us now.
  • The hierarchy is made of folders or “directories” that, like a filing cabinet, can nest inside one another and inside larger storage units.
  • By navigating the hierarchy from its root, we can trace a path to any particular file.
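A path is just the list of steps from the root down to a file. The sketch below uses Python's pathlib to take one apart; the path itself is hypothetical and nothing is read from disk, so the example behaves the same on any machine.

```python
from pathlib import PurePosixPath

# Hypothetical Unix-style path, for illustration only.
path = PurePosixPath("/Users/student/Documents/course/notes.txt")

print(path.parts)   # ('/', 'Users', 'student', 'Documents', 'course', 'notes.txt')
print(path.parent)  # /Users/student/Documents/course
print(path.name)    # notes.txt
print(path.suffix)  # .txt

# Walking upward through the enclosing folders ends at the root.
ancestors = [str(p) for p in path.parents]
print(ancestors[-1])  # /
```

Each element of parts is one folder in the hierarchy, and the last element is the file itself.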