Mass-replacing patterns in Emacs

Some people have project cars—cars that don’t work, that they put up on blocks or in the garage, and then lovingly restore to working order. I have some project books: e-texts that I got on a compilation CD in the 1990s, and finally decided to convert to EPUBs.

I’m using the Standard Ebooks style manual, since they seem to know what they’re doing. And these style guidelines can get pretty specific. For instance, if you have a numeric or date range, you should use an en dash surrounded by Unicode word joiner characters (U+2060).

None of this is a problem if you’re using Emacs, of course. You can enter weird characters with M-x insert-char, which lets you pick a character by its Unicode name or code point. And of course M-x query-replace-regexp will find two numbers separated by an ASCII hyphen.
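As a rough sketch, the same replacement can be written in Emacs Lisp. The function name my/en-dash-ranges is made up for illustration, and wrapping the en dash in U+2060 on both sides is my reading of the style rule above. Unlike the query- commands, this replaces every match without asking.

```elisp
;; Hypothetical helper, not part of Emacs: rewrite digit-hyphen-digit
;; as digit, WORD JOINER, EN DASH, WORD JOINER, digit.
(defun my/en-dash-ranges (text)
  "Return TEXT with each hyphen between two digits replaced by a joined en dash."
  (replace-regexp-in-string
   ;; Capture the digit on each side so it can be put back.
   "\\([[:digit:]]\\)-\\([[:digit:]]\\)"
   ;; \u2060 is WORD JOINER, \u2013 is EN DASH.
   "\\1\u2060\u2013\u2060\\2"
   text))

;; Example call; the result looks like "1914–1918" with an invisible
;; word joiner on each side of the dash.
(my/en-dash-ranges "1914-1918")
```

Because each match consumes the digits on both sides, a chain of single digits like 1-2-3 only gets its first hyphen converted in one pass, which is one more reason to prefer the interactive query version when unsure.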

The clever part, though, was the realization that even though I have about 900 source XHTML files, they only come to about 35 MB, and in the 2020s, that’s not a lot (though it would have been science fiction back when I got that CD). So why not just load them all?
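One way to load them all, sketched in Emacs Lisp: the helper name my/load-xhtml-files and the example path are placeholders, not anything from my actual setup. It relies on directory-files-recursively, which needs Emacs 25 or later.

```elisp
;; Hypothetical helper: visit every .xhtml file under DIR without
;; displaying any of them.
(defun my/load-xhtml-files (dir)
  "Visit every .xhtml file under DIR and return the list of buffers."
  (mapcar #'find-file-noselect
          ;; The regexp is matched against each file's base name;
          ;; \\' anchors it to the end of the name.
          (directory-files-recursively dir "\\.xhtml\\'")))

;; Example call; the path is a placeholder.
;; (my/load-xhtml-files "~/ebooks/src/")
```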

Once we do that, we can use occur to look for patterns. It’s like grep for Emacs. In this case, we want to search for patterns in all of the source buffers, so we’ll use multi-occur-in-matching-buffers, which requires us to specify one pattern for the buffers to search (\.xhtml$, matched against each buffer’s file name) and another for the string or pattern to search for in those buffers ([[:digit:]]-[[:digit:]]).
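The same search can be run from Lisp, for instance to bind it to a key. This is only a sketch: the command name my/occur-digit-hyphens is invented, and I’ve used \' (end of string) instead of $ in the file-name pattern, which is slightly stricter. Note that backslashes are doubled inside a Lisp string.

```elisp
;; Hypothetical command wrapping the search described above.
(defun my/occur-digit-hyphens ()
  "List every digit-hyphen-digit line in all buffers visiting .xhtml files."
  (interactive)
  (multi-occur-in-matching-buffers
   "\\.xhtml\\'"                 ; which buffers: file name ends in .xhtml
   "[[:digit:]]-[[:digit:]]"))   ; what to look for in those buffers
```

Run it with M-x my/occur-digit-hyphens after the files are loaded; the matches show up in *Occur* as usual.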

That brings up a buffer called *Occur* with all of the matching lines. And here’s the next cool bit: you can press e to switch to occur-edit mode: if you make changes to the *Occur* buffer, those changes propagate back to the source buffers. Which means that I can use standard tools like replace-string, query-replace, replace-regexp, and query-replace-regexp to either make changes in bulk if I’m sure of what I’m doing, or one at a time if I’m not. When you’re done, C-c C-c leaves occur-edit mode.

One difficulty: what if some lines match what you want, and some don’t, but the “bad” lines match some other pattern? For instance, in my case, the CD publishers used an ASCII dash followed by a space for what should have been an em dash. But searching for “- ” (a dash followed by space) cluttered up my *Occur* buffer with a lot of HTML comments.

No problem: *Occur* is just another Emacs buffer, so all I had to do was use replace-regexp to delete the lines with HTML comments, which cleared away a lot of distracting noise. (And no, deleting whole lines in *Occur* doesn’t delete the corresponding lines in the source buffers.)
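For what it’s worth, the built-in flush-lines command does the same cleanup in one step, so it’s an alternative to the replace-regexp I used. Here is a small sketch that runs it on a temporary buffer standing in for *Occur*; the helper name my/drop-comment-lines and the sample lines are invented.

```elisp
;; Hypothetical helper: delete every line of TEXT that contains the
;; start of an HTML comment, the way flush-lines would in *Occur*.
(defun my/drop-comment-lines (text)
  "Return TEXT without the lines that contain \"<!--\"."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Called from Lisp, flush-lines works from point to end of buffer.
    (flush-lines "<!--")
    (buffer-string)))

;; Invented sample in the style of *Occur* output.
(my/drop-comment-lines
 (concat "     12:<!-- chapter 3 - draft -->\n"
         "     40:<p>He paused- then spoke.</p>\n"))
```

Interactively, that’s just M-x flush-lines followed by <!-- while in the editable *Occur* buffer.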