Week | Date | Topic |
---|---|---|
1 | Aug 25 | Big Picture: Doing your work properly |
2 | Sep 1 | No class (Labor Day) |
3 | Sep 8 | The file system; the shell; the terminal |
4 | Sep 15 | Using R to look at data |
5 | Sep 22 | Tidy data and dplyr |
6 | Sep 29 | Ingesting and cleaning data |
7 | Oct 6 | Better tables, better graphs |
8 | Oct 13 | No class (Fall break) |
9 | Oct 20 | Version Control: git and GitHub |
10 | Oct 27 | Working with models |
11 | Nov 3 | Databases and APIs |
12 | Nov 10 | Functional programming patterns |
13 | Nov 17 | Build systems, environments, and packages |
14 | Nov 24 | Leveraging Minions: What AI tools can and can't do for you |
Sociol 703: Modern Plain Text Computing
About this course
This course is an introduction to modern plain-text methods of data analysis, data management, and coding. It is required for first-year graduate students in the Sociology department. It introduces a collection of computing tools and techniques that are widely used in the department, the discipline, across the social sciences, and beyond. We will learn to use these tools and also learn why they exist and why they are important for producing work that is reliable, reproducible, and open to inspection.
Motivation
As researchers and scholars we depend on software to get our work done. But often, we do not know enough about how our computers work. Nor are we encouraged to reflect on why they work the way they do, or given any basic grounding in such matters as part of our training. Instead we end up fending for ourselves and pick things up informally. Or, instead of getting on with the task at hand, course instructors are forced to spend time quickly bringing people up to speed about where that document went, or what a file is, or why “that didn’t work” just now. In the worst case, we never get a feel for this stuff at all and end up marinating in an admixture of magical thinking about and sour resentment towards the machines we sit in front of for hours each day, and will likely sit in front of for the rest of our careers.
All of that is bad. This course is meant to help. The coding and data analysis tools we have are powerful, but the way they work tends to run against the grain of the devices we use most often: our phones. As a rule, apps on your phone hide their implementation details from you. They do not want you to worry too much about where things are stored, or how things are accomplished, or what happens if you need to do the same thing again later. They don’t talk to each other much, either. The fragmented and multifaceted tasks associated with scholarly research, meanwhile, make distinctive demands on software. Most of them have to do with the need for control over what you are doing, and especially the importance of having a record of what you did that you can revisit and reproduce if needed. Our tools also need to let us track down and diagnose errors. And they must assist us in pulling disparate pieces of a project together into a presentable final product like a talk, an article, or a book. This can be a tricky process to think about and manage in a systematic way.
To address these challenges, modern technical computing platforms provide us with a suite of powerful, modular, composable tools and techniques. The bad news is that they are not magic; they cannot do our thinking for us. The good news is that they are stable and reliable. Many are supported by helpful communities. Most are developed and improved in the open. Almost all are available for free. Nearly without exception, they tend to work through the medium of explicit, structured instructions written out in plain text. In other words they work by having you write some code, in the broadest sense. People who do research involving structured data of any kind should become familiar with these tools. Lack of familiarity with the basics encourages bad habits and unhealthy attitudes ranging from misplaced contempt to creeping despair.
Throughout the seminar we will move back and forth between two perspectives. First, we will learn about specific tools and tricks associated with using them. Concretely we will learn how to use the file system, the terminal, a text editor, and a programming language and development environment oriented to the analysis of tabular data. At this level we will focus on examples that come up in our everyday work. But second, we will try to develop a way of thinking about what we are doing. We don’t need to learn every tool in the box right away. There are far too many of them to even try, in any case. Rather, we will try to understand why these tools work the way they do, and why the approach they embody is so useful. In the process we will cultivate an attitude of determined curiosity that will help us notice and solve problems in our work as they (inevitably) arise, even when they are (undeniably) frustrating.
Reading
Required reading from books and articles will be provided on the course website, Canvas, or (in most cases) will be freely available online. Useful texts to acquire in hardcopy or to bookmark include:
- Hadley Wickham, Garrett Grolemund, and Mine Çetinkaya-Rundel, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, 2nd ed. (Sebastopol, CA: O’Reilly Media, 2023), https://r4ds.hadley.nz.
- Jeroen Janssens, Data Science at the Command Line, 2nd ed. (O’Reilly Media, 2021), https://jeroenjanssens.com/dsatcl/.
- Chester Ismay and Albert Y. Kim, Statistical Inference via Data Science (CRC Press, 2019), https://moderndive.com.
- Scott Chacon and Ben Straub, Pro Git, 2nd ed. (Apress, 2014), https://git-scm.com/book/en/v2.
- Kieran Healy, Data Visualization: A Practical Introduction (Princeton: Princeton University Press, 2019), http://socviz.co/.
- Jeffrey E. F. Friedl, Mastering Regular Expressions, 3rd ed. (Sebastopol, CA: O’Reilly Media, 2006).
- Will Landau, “The targets R Package User Manual,” 2022, https://books.ropensci.org/targets/.
- Hadley Wickham and Jennifer Bryan, R Packages: Organize, Test, Document, and Share Your Code, 2nd ed. (O’Reilly Media, 2023), https://r-pkgs.org.
- Garrett Christensen, Jeremy Freese, and Edward Miguel, Transparent and Reproducible Social Science Research (Berkeley: University of California Press, 2019).
Software
We will do almost all of our work in class using Unix-style command line tools, R, and RStudio. R is a freely available programming language that is designed for statistical computing and widely used across the natural and social sciences, as well as in the world of “data science” generally. RStudio is an integrated development environment, or IDE, for R, a kind of control center from which you can manage the engine-room of R itself. It is also freely available.
Website, Canvas, and Slack
The course website is at https://mptc.io. Each week, slides, readings, and other class material will be posted there along with additional examples and the problem set due the following week. We will use Duke’s Learning CMS, Canvas, for submission of assignments and to host some readings. There is also a Slack workspace for the course. Details about joining it will be provided via email.
Weekly Schedule
The schedule is shown in Table 1. It is subject to change.
Expectations
This is a graduate seminar. I take it for granted that you have a basic interest in the material, an enthusiastic attitude toward participation, and a respectful attitude to everyone in the room. I expect you to attend each meeting, do the required reading and assignments thoroughly and on time, and participate actively. Participating actively means contributing to class discussion and problem-solving, something that involves both speaking and listening. Do not silently wander off during class, as if the person speaking in front of you were on a screen. Each class meeting will have a short break in the middle.
The main purpose of the first-year graduate sequence is to teach you some core things about the field that are either required for you to do good work, or extremely useful toward that end, but which you do not already know. This has some implications that might not be immediately obvious. First, I am not making you read or do things in order to waste your time or haze you in some weird fashion. Second, your role in the class is to try to learn things you don’t already know and not, for example, to try to impress me, your peers, or yourself with how clever you are. Third, this point also applies to my role as the instructor. Fourth, the people in the room (including me) are not your competitors or enemies; they are your interlocutors. Academic disciplines are just highly structured, long-running conversations involving specialized show-and-tell and question-and-answer formats. This is where you start learning what the conversation topics are, what you need to know in order to make something useful, and how to begin making your own contributions. So please, trust the process. Everything will go better if you do.
Required work and assessment
Weekly Class Participation (50% of final grade) and Assignments (50% of final grade) will let you reflect on the reading and practice your skills. Assignments are due by the end of the day on the Friday after the class in which they are assigned. Submit assignments via Canvas.
Course policies
- Attendance is required, and important. I am a reasonable person; if you need to be absent please let me know in advance insofar as that is possible.
- Do the assigned readings in advance of class.
- Submit problem sets, or other assignments, on time.
- Do not use AI tools to complete assignments unless I explicitly say that you may do so. We’ll discuss them, their uses, and their limitations by the end of the course. But if you use them before then for an assignment, you will get an F for the course. I am not joking. The point of this seminar, the whole reason it is required, is for you to learn how to do things for yourself, and to understand what it is you are doing. When it comes to writing scripts or navigating documentation or debugging bits of code, AI can be very handy once you know what you are doing. But first you have to know what you are doing.
Duke community standard
Like all classes at the university, this course is conducted under the Duke Community Standard. Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to protect and promote a culture of integrity. To uphold the Duke Community Standard you will not lie, cheat, or steal in academic endeavors; you will conduct yourself honorably in all your endeavors; and you will act if the Standard is compromised.