Mass-replacing patterns in Emacs

Some people have project cars—cars that don’t work, that they put up on blocks or in the garage, and then lovingly restore to working order. I have some project books: e-texts that I got on a compilation CD in the 1990s, and finally decided to convert to EPUBs.

I’m using the Standard Ebooks style manual, since they seem to know what they’re doing. And these style guidelines can get pretty specific. For instance, if you have a numeric or date range, you should use an en dash (U+2013) surrounded by Unicode word joiner characters (U+2060).

None of this is a problem if you’re using Emacs, of course. You can enter weird characters with M-x insert-char, which lets you look up Unicode characters by name. And of course M-x query-replace-regexp will find two numbers separated by an ASCII hyphen.

The clever part, though, was the realization that even though I have about 900 source XHTML files, they only come to about 35 MB, and in the 2020s, that’s not a lot (though it would have been science fiction back when I got that CD). So why not just load them all?
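Loading everything can be scripted rather than done by hand. Here is a minimal sketch, assuming Emacs 25.1 or later (for directory-files-recursively). The function name my-load-xhtml-files and the path in the usage comment are placeholders for illustration:

```elisp
(defun my-load-xhtml-files (dir)
  "Visit every .xhtml file under DIR without displaying it.
Return the list of resulting buffers.  (Illustrative name.)"
  (mapcar #'find-file-noselect
          ;; The regexp is matched against each file's base name.
          (directory-files-recursively dir "\\.xhtml\\'")))

;; Usage, with a placeholder path:
;; (my-load-xhtml-files "~/books/src/")
```

find-file-noselect visits each file in a buffer without popping up a window for it, which is what you want for 900 files.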

Once we do that, we can use occur-mode to look for patterns. It’s like grep for Emacs. In this case, we want to search for patterns in all of the source buffers, so we’ll use multi-occur-in-matching-buffers, which requires us to specify one pattern for the buffers to search (\.xhtml$) and another for the string or pattern to search in those buffers ([[:digit:]]-[[:digit:]]).
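The same search can be run from Lisp. This is only a sketch, and the command name my-occur-digit-hyphens is made up for illustration. Note that backslashes have to be doubled inside a Lisp string, and that \\' (end of string) is a slightly stricter anchor than $:

```elisp
(defun my-occur-digit-hyphens ()
  "List digit-hyphen-digit sequences in all buffers visiting .xhtml files."
  (interactive)
  ;; First argument: regexp matched against each buffer's file name.
  ;; Second argument: regexp to search for inside those buffers.
  (multi-occur-in-matching-buffers "\\.xhtml\\'"
                                   "[[:digit:]]-[[:digit:]]"))
```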

That brings up a buffer called *Occur* with all of the matching lines. And here’s the next cool bit: you can press e to switch to occur-edit mode: if you make a change to the *Occur* buffer, those changes propagate back to the source buffers. Which means that I can use standard tools like replace-string, query-replace, replace-regexp, and query-replace-regexp to either make changes in bulk if I’m sure of what I’m doing, or one at a time if I’m not.
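For the en dash case, the replacement itself might look something like the sketch below. The function name my-en-dash-ranges is hypothetical, and it acts on whatever buffer is current. U+2060 is the word joiner and U+2013 is the en dash:

```elisp
(defun my-en-dash-ranges ()
  "Turn DIGIT-DIGIT into DIGIT, word joiner, en dash, word joiner, DIGIT.
Works on the whole current buffer.  Return the number of replacements."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "\\([[:digit:]]\\)-\\([[:digit:]]\\)" nil t)
        ;; \\1 and \\2 re-insert the digits on either side of the hyphen.
        (replace-match "\\1\u2060\u2013\u2060\\2" t)
        (setq count (1+ count))))
    count))
```

Interactively, the equivalent is query-replace-regexp with \([[:digit:]]\)-\([[:digit:]]\) as the pattern and \1, the three characters, and \2 as the replacement. The query version is the safer choice, since this pattern also matches things that aren't ranges, such as hyphenated dates.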

One difficulty: what if some lines match what you want, and some don’t, but the “bad” lines match some other pattern? For instance, in my case, the CD publishers used an ASCII dash followed by a space for what should have been an em dash. But searching for “- ” (a dash followed by space) cluttered up my *Occur* buffer with a lot of HTML comments.

No problem: *Occur* is just another Emacs buffer, so all I had to do was use replace-regexp to delete the lines with HTML comments, which cleared away a lot of distracting noise. (And no, deleting whole lines in *Occur* doesn’t delete the corresponding lines in the source buffers.)

A Few More Thoughts on Literate Programming

A while back, I became intrigued by Donald Knuth’s idea of Literate Programming, and decided to give it a shot. That first attempt was basically just me writing down what I knew as quickly as I learned it, and trying to pass it off as a knowledgeable tutorial. More recently, I tried a second project, a web-app that solves Wordle, and thought I’d write it in the Literate style as well.

The first time around, I learned the mechanics. The second time, I was able to learn one or two things about the coding itself.

(For those who don’t remember, in literate programming, you write code intertwined with prose that explains the code, and a post-processor turns the result into a pretty document for humans to read, and ugly code for computers to process.)

1) The thing I liked the most, the part where literate programming really shines, is having the code be grouped not by function or by class, but by topic. I could introduce a <div class="message-box"></div> in the main HTML file, and in the next paragraph introduce the CSS that styles it, and the JavaScript code that manipulates it.

2) In the same vein, several times I rearranged the source to make the explanations flow better, so that I didn’t discuss variables or functions until I had explained why they’re there and what they do. None of this altered the underlying HTML or JavaScript source. In fact, this led to a stylistic quandary:

3) I defined a few customization variables. You know, the kind that normally go at the top for easy customization:

var MIN_FOO = 30;
var MAX_FOO = 1500;
var LOG_FILE = "/var/log/mylogfile.log";

Of course, the natural tendency was to put them next to the code that they affect, somewhere in the middle of the source file. Should I have put them at the top of my source instead?

4) Even smaller: how do you pass command-line option definitions to getopt()? If you have options -a, -b, and -c, each will normally be defined in its own section. So in principle, the literate thing to do would be to write

getopt("{{option-a}}{{option-b}}{{option-c}}");

and have a section that defines option-a as “a”. As you can see, though, defining single-letter strings isn’t terribly readable, and literate programming is all about readability.

5) Speaking of readability, one thing that can come in really handy is the ability to generate a pretty document for human consumption. Knuth’s original tools generated TeX, of course, and it doesn’t get prettier than that.

I used org-mode, which accepts TeX-style math notation, but also allows you to embed images and graphviz graphs. In my case, I needed to calculate the entropy of a variable, so being able to use proper equations, with nicely formatted sigmas and italicized variables, was very nice. I’ve worked in the past on a number of projects where it would have been useful to embed a diagram with circles and arrows, rather than using words or ASCII art.
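To give a flavor of what that looks like: the formula isn't spelled out above, so this is simply the standard definition of Shannon entropy for a discrete variable X (presumably the one in question), written in the TeX notation that org-mode accepts:

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x)
```

Here p(x) is the probability of outcome x, and the base-2 logarithm gives the result in bits.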

6) I was surprised to find that I had practically no comments in the base code (in the JavaScript, HTML, and CSS that were generated from my org-mode source file). I normally comment a lot. It’s not that I was less verbose. In fact, I was more verbose than usual. It’s just that I was putting all of the explanations about what I was trying to do, and why things were the way they are, in the human-docs part of the source, not the parts destined for computer consumption. Which, I guess, was the point.

7) Related to this, I think I had fewer bugs than I would normally have gotten in a project of this size. I don’t know why, but I suspect that it was due to some combination of thinking “out loud” (or at least in prose) before pounding out a chunk of code, and of having related bits of code next to each other, and not scattered across multiple files.

8) I don’t know whether I could tackle a large project in this way. You might say, “Why not? Donald Knuth wrote both TeX and Metafont as literate code, and even published the source in two fat books!” Well, yeah, but he’s Donald Knuth. Also, he was writing before IDEs, or even color-coded code, were available.

I found org-mode to be the most comfortable tool for me to use for this project. But of course that effectively prevents people who don’t use Emacs (even though they obviously should) from contributing.

One drawback of org-mode as a literate programming development environment is that you’re pretty much limited to one source file, which obviously doesn’t scale. There are other tools out there, like noweb, but I found those harder to set up, or they forced me to use (La)TeX when I didn’t want to, or the like.

9) One serious drawback of org-mode is that it makes it nearly impossible to add cross-reference links. If you have a section like

function myFunc() {
var thing;
{{calculate thing}}
return thing;
}

it would be very useful to have {{calculate thing}} be a link that you can click on to go to the definition of that chunk. But this is much harder to do in org-mode than it should be. So is labeling chunks, so that people can chase cross-references even without convenient links. Org-mode still has a lot of work to do in that regard.

WFHing with Emacs: Work Mode and Command-Line Options

Like the rest of the world, I’m working from home these days. One of the changes I’ve made has been to set up Emacs to work from home.

I use Emacs extensively both at home and at work. So far, my method for keeping personal stuff and work stuff separate has been to, well, keep separate copies of ~/.emacs.d/ on work and home machines. But now that my home machine is my work machine, I figured I’d combine their configs.

To do this, I just added a -work command-line option, so that emacs -work runs in work mode. The command-switch-alist variable is useful here: it allows you to define a command-line option, and a function to call when it is encountered:

(defun my-work-setup (arg)
  "Set up work mode.  ARG is the command-line switch that triggered the call."
  ;; Do setup for work-mode
  )
(add-to-list 'command-switch-alist
  '("work" . my-work-setup))

Of course, I’ve never liked defining functions to be called only once. That’s what lambda expressions are for:

(add-to-list 'command-switch-alist
  '("work" .
    (lambda (arg)
      ;; Do setup for work-mode
      (setq my-mode 'work)
      )))

One thing to bear in mind about command-switch-alist is that each handler gets called as soon as its command-line option is seen. So let’s say you have a -work option and a -logging option, and the -logging-related code needs to know whether work mode is turned on. That means you would always have to remember to put the -work option before the -logging option, which isn’t very elegant.

A better approach is to use the command-switch-alist entries to just record that a certain option has been set. The sample code above simply sets my-mode to 'work when the -work option is set. Then do the real startup stuff after the command line has been parsed and you know all of the options that have been passed in.

Unsurprisingly, Emacs has a place to put that, emacs-startup-hook:

(defvar my-mode 'home
  "Current mode. Either home or work.")
(add-to-list 'command-switch-alist
  '("work" . (lambda (arg)
                (setq my-mode 'work))))

(defvar logging-p nil
  "True if and only if we want extra logging.")
(add-to-list 'command-switch-alist
  '("logging" . (lambda (arg)
                  (setq logging-p t))))

(add-hook 'emacs-startup-hook
  (lambda ()
    (when (eq my-mode 'work)
      (message "Work mode is turned on."))
    (when logging-p
      (message "Extra logging is turned on."))
    (when (and (eq my-mode 'work)
               logging-p)
      (message "Work mode and logging are both turned on."))))

Check the *Messages* buffer to see the output.
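In a real config, the hook would do something more useful than print messages. One possibility, sketched here under assumptions, is to load a per-mode settings file once the command line has been parsed. The helper my-mode-config-file and the file names home-config.el and work-config.el are hypothetical, invented for this example:

```elisp
(defvar my-mode 'home
  "Current mode. Either home or work.")  ; same variable as above

(defun my-mode-config-file ()
  "Return the path of the settings file for the current `my-mode'.
The naming scheme (home-config.el, work-config.el) is hypothetical."
  (expand-file-name (format "%s-config.el" my-mode)
                    user-emacs-directory))

(add-hook 'emacs-startup-hook
          (lambda ()
            ;; Second argument t: don't signal an error if the file is missing.
            (load (my-mode-config-file) t)))
```

Because the hook runs after all of the options have been handled, it doesn't matter where -work appears on the command line.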