Files & the File System

Modern Plain Text Computing
Week 01b

Kieran Healy

October 7, 2024

Files

What is a file?

You very likely have never used one of these. Perhaps you’ve never even seen one in real life.

The file cabinet!

“Could capitalism, surveillance, and governance have developed in the twentieth century without filing cabinets? Of course, but only if there had been another way to store and circulate paper efficiently; if that had been the case, that technology would be the object of this book.” — Craig Robertson The Filing Cabinet: A Vertical History of Information (University of Minnesota Press, 2021), 3.

“The ubiquity of the filing cabinet in the twentieth-century office space, along with its noticeable absence of style, has obscured its transformative role in the histories of both information technology and work. In the first in-depth history of this neglected artifact, Craig Robertson explores how the filing cabinet profoundly shaped the way that information and data have been sorted, stored, retrieved, and used.

Invented in the 1890s, the filing cabinet was a result of the nineteenth-century faith in efficiency. Previously, paper records were arranged haphazardly: bound into books, stacked in piles, curled into slots, or impaled on spindles. The filing cabinet organized loose papers in tabbed folders that could be sorted alphanumerically, radically changing how people accessed, circulated, and structured information.

Robertson’s unconventional history of the origins of the information age posits the filing cabinet as an information storage container, an “automatic memory” machine that contributed to a new type of information labor privileging manual dexterity over mental deliberation. Gendered assumptions about women’s nimble fingers helped to naturalize the changes that brought women into the workforce as low-level clerical workers. The filing cabinet emerges from this unexpected account as a sophisticated piece of information technology and a site of gendered labor that with its folders, files, and tabs continues to shape how we interact with information and data in today’s digital world.”

The file cabinet!

“Cabinet logic involves the creation of interior compartments to organize storage space according to classification and indexing systems … Partitions made from paper, not wood, divided storage space to create rigorous order; these partitions took the form of tabbed manila folders separated by tabbed guide cards. This iteration of the logic dispensed with a separate index to make paper discoverable by utilizing the “very organization of the material and its location” with the “vertical guides serving as locating medium.” Elimination of an index was signaled in filing literature by the terms “direct alphabet index” and “automatic index” … Without the need to consult a separate index, a clerk grouped papers together on their edge behind tabs labeled with classifications, so any given paper could be found quickly.” — Robertson, The Filing Cabinet, 104–5.

Index cards

Like a filing cabinet, but smol

Index cards

Automating Information and Control

A music box

A Jacquard Loom

Jacquard Loom Cards

Tabulation Machines

Hollerith Cards

“At the end of the 1800s Herman Hollerith invented the recording of data on a medium that could then be read by a machine,[dubious – discuss][13][14][15][16] developing punched card data processing technology for the 1890 U.S. census.[17] His tabulating machines read and summarized data stored on punched cards and they began use for government and commercial data processing. Initially, these electromechanical machines only counted holes, but by the 1920s they had units for carrying out basic arithmetic operations.[18]: 124 Hollerith founded the Tabulating Machine Company (1896) which was one of four companies that were amalgamated via stock acquisition to form a fifth company, Computing-Tabulating-Recording Company (CTR) (1911), later renamed International Business Machines Corporation (IBM) (1924). Other companies entering the punched card business included The Tabulator Limited (Britain, 1902), Deutsche Hollerith-Maschinen Gesellschaft mbH (Dehomag) (Germany, 1911), Powers Accounting Machine Company (US, 1911), Remington Rand (US, 1927), and H.W. Egli Bull (France, 1931).[19] These companies, and others, manufactured and marketed a variety of punched cards and unit record machines for creating, sorting, and tabulating punched cards, even after the development of electronic computers in the 1950s.”

Hollerith Machines

Field 1	Civil War Veterans	CM, CL, CS: Marine, Sailor, Soldier	Confederate Army
		UM, UL, US: Marine, Sailor, Soldier	United States Army
Field 2	Relationship (family)	Hd: Head of Household	Mb: Member of family
		Wf: Wife	O: Other
Field 3	Race	Jp: Japanese	In: Civilized Indian (taxpaying)
		Ch: Chinese	Mu: Mulatto
		B: Negro	Qd: Quadroon
		W: White	Oc: Octoroon
Field 4	Gender	Male or Female
Field 5 & 6	Your Age	Combination of numbers	0: if under 1 year old
		Un: Unknown	Mo: # of months if a baby
Field 7	Conjugal Condition	Dv: Divorced	Mr: Married more than 1 year
		Un: Unknown	CY: Married this year
		Wd: Widow/Widower	(in the Census Year)
		Sg: Single
Field 8	# of Children born	Up to 19 children
Field 9	# of Children alive	Up to 19 children
Field 10 & 11	Your birthplace	Combination of upper case letters	St: birth was in state of residence
		and lower case letters	Un: Unknown
Field 12	Mother’s birthplace	Two Letter Country Codes
Field 13	Father’s birthplace	Two Letter Country Codes
Field 14	# of years in the US
Field 15	Citizenship for Foreign-born	Al: Alien	Pa: First Papers submitted
		Na: Naturalized citizen	Un: Unknown
Field 16 & 17	Occupation	Combination of upper case letters and lower case letters	NG: Not Gainfully employed or
			in School
Field 18	Unemployed	1-12: Months unemployed	Oc
			O
Field 19	Education	OK: Can read and write	————-OR———–
		W: Can read but not write	1-9: Number of months in school
		R: : Can write but not read	0: : Not in school
Field 20	Language Spoken	Un: Understand English?	Ft: Father’s language
		En: English	Mt: Mother’s language
		Ot: Other
Field 21	Tenure	Fh: Farm Hired (Rented)	Hh: Home Hired (Rented)
		FM: Farm Owned (w/ mortgage)	Hf: Home owned Free and clear
		Ff: Farm owned Free and clear	Hm: Home Owned (w/ mortgage)
		X: : Does not apply
Far Left Side	Location	State, county, city,: local divisions and	Enumerator District

Hollerith Operators

Demonstrating a older card-puncher, probably to show how things had improved with census tabulation methods. This is likely the “Before” picture with a roll from the 1890 Census. The card-puncher is a Pantograph.

Hollerith Operators

Same woman as the previous photo; her colleague on the right is demonstrating the newer, faster IBM Type 001 Key Puncher. (Again, probably a re-enactment / demo of earlier techniques.)

Programmable Computers

Logic from Sand

The best book to read about how the guts of a programmable computer works is Charles Petzold’s Code: The Hidden Language of Computer Hardware and Software, 2nd ed. (Microsoft Press, 2022).

IBM punch cards

In the longer term, punch card writers got much more efficient. And now they could be fed into machines that could use them to run programs instead of just tabulate the punches.

IBM punch cards

An IBM punch card is 80 columns wide. The first CRT terminals displayed 80 columns of text for this reason. You’ll see 80 columns of text pop up as a standard in all kinds of places.

Big Iron

No screens! Paper in, paper out for the operator; magnetic tapes for storage in the background. This is an IBM/360, the most important class of mainframe in the 1960s and early 1970s.

One thing that’s hard to convey in pictures is the way that—because of all the daisy-wheel or tractor-fed printing, mechanical card processing, and huge reels of tape spinning up and down—rooms like this were loud.

Storage

Notice that the “File” here is the machine itself, or at most a single disk platter.

Storage

The older way of speaking is still with us, as when we speak of someone’s “Application File” or “Tenure File”; that is, a file is a collection of related documents.

But the newer way, where “file” means “a single document”, is now dominant, especially in computing.

What Files Are

A file is a metaphor

Your computer does not have “files” in the way that a filing cabinet has files.
A file is an abstraction, a way of naming and organizing data on your computer that at a lower-level is “just zeros and ones” (and at a lower level than that is just patterns in some physical substrate that can be interpreted as zeros and ones)
The file metaphor in computing dates most prominently to the development of the Unix operating system in the early 1970s
Files are organized in filesystems

There are many kinds of files

As many as there are kinds of application.
Files have the name someone gives them. My Thesis, term_paper, and so on.
There’s a longstanding (though weak) convention about using file extensions, tagged on to the end of a name, to signal to users what kind of file it is: term_paper.docx, .xlsx, .ppt, .pdf, .sqlite, .png, .jpg, .ps, .mp3, .mp4, .gif, .csv, .Rmd, .qmd, .md, .txt.
Files don’t know what their extension is, a bit like how electrons don’t know what color the outside of their copper wire is.

Binary and Plain Text files

Understanding the general notion of “encoding information” is a very rich and deep topic that, sadly, we are going to skip.
If a file is in some binary format then in general you won’t be able to read its contents just by looking inside it. You will need an application that understands the file’s particular format; i.e. the way that information in it is encoded.
A .jpg file uses a set of rules to store numbers that can be interpreted as corresponding to things like the hue and location of a pixel. But you won’t see a picture if you look inside a .jpg file using a text editor. You’ll need an application that knows how to read .jpg files.

What is Plain Text?

Text files, though, are sort of special. What’s visibly in them appears to correspond much more closely to what they represent. A plain text file seems to represent the letter “A” with a symbol that looks like an “A”. So much so that we can say it is an “A”.
That means that when you look at a text file you can see what is in it immediately. And editing the contents of the file is the same as editing its text.
There’s still an “encoding” of course! It’s still necessary to have an application that can read the text file and display it on a screen, etc. But what’s inside seems much closer to being immediately interpretable “just by looking”, because most of it is letters and numbers.

But wait!

I thought you said computers just store ones and zeros?

Yes this is true. In ASCII encoding, for instance, an “A” is really just conventionally the symbol represented by the seven-bit binary number 1000001, which exists on some sort of storage medium (an SSD, a Hard Disk, a floppy disk, a punch card, a reel of paper, whatever) in such a way that some device can read its contents.

ASCII is the American Standard Code for Information Interchange. It was first specified in 1963.

ASCII

The venerable and now outdated ASCII character set: 26 uppercase letters; 26 lowercase letters; 10 digits; 32 printable symbols; and 33 control characters ultimately derived from telegraph code and teletype machines.

Binary	ASCII	Decimal	Hexadecimal	Octal
0000000	null	0	0	0
0000001	start of header	1	1	1
0000010	start of text	2	2	2
0000011	end of text	3	3	3
0000100	end of transmission	4	4	4
0000101	enquire	5	5	5
0000110	acknowledge	6	6	6
0000111	bell	7	7	7
0001000	backspace	8	8	10
0001001	horizontal tab	9	9	11
0001010	linefeed	10	A	12
0001011	vertical tab	11	B	13
0001100	form feed	12	C	14
0001101	carriage return	13	D	15
0001110	shift out	14	E	16
0001111	shift in	15	F	17
0010000	data link escape	16	10	20
0010001	device control 1/Xon	17	11	21
0010010	device control 2	18	12	22
0010011	device control 3/Xoff	19	13	23
0010100	device control 4	20	14	24
0010101	negative acknowledge	21	15	25
0010110	synchronous idle	22	16	26
0010111	end of transmission block	23	17	27
0011000	cancel	24	18	30
0011001	end of medium	25	19	31
0011010	end of file/ substitute	26	1A	32
0011011	escape	27	1B	33
0011100	file separator	28	1C	34
0011101	group separator	29	1D	35
0011110	record separator	30	1E	36
0011111	unit separator	31	1F	37
001e+05	space	32	20	40
0100001	!	33	21	41
0100010	"	34	22	42
0100011	#	35	23	43
0100100	$	36	24	44
0100101	%	37	25	45
0100110	&	38	26	46
0100111	'	39	27	47
0101000	(	40	28	50
0101001	)	41	29	51
0101010	*	42	2A	52
0101011	+	43	2B	53
0101100	,	44	2C	54
0101101	-	45	2D	55
0101110	.	46	2E	56
0101111	/	47	2F	57
0110000	0	48	30	60
0110001	1	49	31	61
0110010	2	50	32	62
0110011	3	51	33	63
0110100	4	52	34	64
0110101	5	53	35	65
0110110	6	54	36	66
0110111	7	55	37	67
0111000	8	56	38	70
0111001	9	57	39	71
0111010	:	58	3A	72
0111011	;	59	3B	73
0111100	<	60	3C	74
0111101	=	61	3D	75
0111110	>	62	3E	76
0111111	?	63	3F	77
001e+06	@	64	40	100
1000001	A	65	41	101
1000010	B	66	42	102
1000011	C	67	43	103
1000100	D	68	44	104
1000101	E	69	45	105
1000110	F	70	46	106
1000111	G	71	47	107
1001000	H	72	48	110
1001001	I	73	49	111
1001010	J	74	4A	112
1001011	K	75	4B	113
1001100	L	76	4C	114
1001101	M	77	4D	115
1001110	N	78	4E	116
1001111	O	79	4F	117
1010000	P	80	50	120
1010001	Q	81	51	121
1010010	R	82	52	122
1010011	S	83	53	123
1010100	T	84	54	124
1010101	U	85	55	125
1010110	V	86	56	126
1010111	W	87	57	127
1011000	X	88	58	130
1011001	Y	89	59	131
1011010	Z	90	5A	132
1011011	[	91	5B	133
1011100	\	92	5C	134
1011101	]	93	5D	135
1011110	^	94	5E	136
1011111	_	95	5F	137
1100000	`	96	60	140
1100001	a	97	61	141
1100010	b	98	62	142
1100011	c	99	63	143
1100100	d	100	64	144
1100101	e	101	65	145
1100110	f	102	66	146
1100111	g	103	67	147
1101000	h	104	68	150
1101001	i	105	69	151
1101010	j	106	6A	152
1101011	k	107	6B	153
1101100	l	108	6C	154
1101101	m	109	6D	155
1101110	n	110	6E	156
1101111	o	111	6F	157
1110000	p	112	70	160
1110001	q	113	71	161
1110010	r	114	72	162
1110011	s	115	73	163
1110100	t	116	74	164
1110101	u	117	75	165
1110110	v	118	76	166
1110111	w	119	77	167
1111000	x	120	78	170
1111001	y	121	79	171
1111010	z	122	7A	172
1111011	{	123	7B	173
1111100	\|	124	7C	174
1111101	}	125	7D	175
1111110	~	126	7E	176
1111111	DEL	127	7F	177

Modern Text: Unicode and UTF-8

ASCII is a seven bit system that only has $2^7$ or 128 “code points” — i.e. individual slots that could represent anything. It left out all kinds of things. (Other alphabets, for instance. Also any diacritics or accents. And any number of symbols.)
Eight bit computers allowed for 256 code points. The second 128 never had a single standard for what they should represent. The most common extension was ISO-8859-1 or “Latin1” encoding, but there were others too. This created conflicts and confusion when a program or application expecting text encoded according to one standard was fed text encoded with a different standard.

Modern Text: Unicode and UTF-8

It is surprisingly difficult to establish the encoding of a large text file that doesn’t explicitly declare how it’s encoded in some sort of metadata. (You can guess, but it can be super-annoying.)
Nowadays this has mostly been resolved by the adoption of Unicode and its simplest and most widespread encoding, UTF-8, which extends ASCII to 1,112,064 code points. It uses between one and four eight-bit elements to represent particular character glyphs.
Many older datasets may still be encoded in something other than UTF-8, however.

Organizing Files

Input/Output

Beginning in the 1970s, computing rapidly moves away from print I/O and towards screens.
Storage capacity and processing power increase radically (and get much smaller) with the development of hard drives and integrated circuits.
We get to a point where our “Teletype” interface with the machine is purely metaphorical: this is the command line or console.
And after that, in the late 1970s and early 1980s, an entirely new set of metaphors gets introduced: files represented by “icons” inside “windows”, first on on a metaphorical “desktop” and then later on a more abstract touch-based surface.

A late-model teletype (TTY) machine

The DEC VT-100 Terminal (1978)

The IBM PC (1981)

The Apple Macintosh (1984)

The macOS Terminal app icon

This is where we came in

The “Office” and “Engineering” models really start to diverge in the 1980s
A lot of computing gets done using the Engineering model and its metaphors, even as the Office model comes to dominate.
But many of these newer systems remain built on top of the world made out of the older metaphors. And in particular, the idea of named files living in a hierarchical file system that are acted on in sequence through written instructions remains extremely important for many computing tasks.
Especially the stuff we need to do.

Back to the file system

Files

Our data is stored — or represented as being stored — in a file system.
This is, again, a way of organizing items for our benefit.
The UNIX operating system developed at Bell Labs codifies the modern “file” metaphor.
Files are named items that live in a hierarchical file system. “Ordinary” documents like notes.txt are thought of as files, which seems natural to us now.
The hierarchy is made of folders or “directories” that, like a filing cabinet, can nest inside one another and inside larger storage units.
By navigating the hierarchy from its root, we can trace a path to any particular file.