Removing Magic
So this was one of those real-life mysteries.
I like crossword puzzles. And in particular, I like indie crossword puzzles, because they tend to be more inventive and less censored than ones that run in newspapers. So I follow several crossword designers on Twitter.
Yesterday, one of them mentioned that people were having a problem with his latest puzzle. I tried downloading it on my iPad, and yeah, it wouldn’t open in Across Lite. Other people were saying that their computers thought the file was in PostScript format. I dumped the HTTP header with
lynx -head -dump http://url.to/crossword.puz
and found the header
Content-type: application/postscript
which was definitely wrong for a .puz file. What’s more, other .puz files in the same directory were showing up as
Content-type: application/octet-stream
as they should.
I mentioned all this to the designer, which led to us chatting back and forth to see what the problem was. And eventually I had the proverbial aha moment.
.puz files begin with a two-byte checksum. In this particular case, those two bytes turned out to be 0x25 and 0x21. Or, in ASCII, “%!”. And as it turns out, PostScript files begin with “%!”, according to Unix’s magic file.
So evidently what happened was: the hosting server didn’t have a default type for files ending in .puz. Not terribly surprising, since that’s not really a widely-used format. So since it didn’t recognize the filename extension, it did the next-best thing and looked at the first few bytes of the file (probably with file or something equivalent) to see if it could make an educated guess. It saw the checksum as “%!” and decided it was a PostScript file.
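Here’s a rough sketch of that kind of fallback logic, as I imagine it. I don’t know what the hosting server actually runs, so the tables and the guess_content_type function below are made up purely for illustration: check the extension first, and only if that fails, sniff the first bytes.

```python
import os

# Hypothetical extension table. Note that .puz is deliberately missing,
# which is the assumed situation on the hosting server.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".ps": "application/postscript",
}

# Hypothetical magic table: (bytes expected at offset 0, MIME type).
# The two-byte PostScript entry mirrors the loose "%!" rule.
MAGIC_PREFIXES = [
    (b"%PDF", "application/pdf"),
    (b"%!", "application/postscript"),
]

DEFAULT_TYPE = "application/octet-stream"


def guess_content_type(filename, head):
    """Guess a MIME type from the extension, then from the leading bytes."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_TYPES:
        return EXTENSION_TYPES[ext]
    # Unknown extension: fall back to sniffing the start of the file.
    for prefix, mime in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return mime
    return DEFAULT_TYPE


# A .puz file whose checksum bytes happen to be 0x25 0x21 ("%!").
unlucky = bytes([0x25, 0x21]) + b"rest of puzzle"
# A .puz file with some other checksum (bytes chosen arbitrarily).
typical = bytes([0x9C, 0x4E]) + b"rest of puzzle"

print(guess_content_type("crossword.puz", unlucky))  # application/postscript
print(guess_content_type("other.puz", typical))      # application/octet-stream
```

The unlucky file gets labeled as PostScript only because nothing in the extension table claims .puz first.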
The obvious fix was to change something about the file: rewrite a clue, add a note, change the copyright statement, anything to change the contents of the file, and thus the checksum.
The more permanent solution was to add a .htaccess file to the puzzle file directory, with
AddType application/octet-stream .puz
assuming that the hosting provider used Apache or something compatible.
This didn’t take immediately; I think the provider cached this metadata for a few hours. But eventually things cleared up.
I’m not sure what the lesson is, here. “Don’t use two-byte checksums at offset 0”, maybe?
I actually agree with “Don’t use two-byte checksums at offset 0”! More generally, don’t use mystical, quasi-proprietary, impossible-to-change binary file formats with or without checksums at the beginning. This is one of the reasons why ipuz (www.ipuz.org) was created. It’s a flexible, modern format based on JSON, designed with expandability and extensibility from day one. And most crossword editing tools support ipuz export.
And thanks for figuring the mystery out. BTW, the web site should have been looking for %!PS, not just %!. It’s a 4-byte signature, like %PDF, not a 2-byte signature. So it’s actually a bug, not just a misdirected feature.
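To illustrate the difference, here is a minimal sketch of the two checks side by side. The function names are hypothetical, and the bytes after the checksum in puz_head are placeholders rather than real .puz contents.

```python
def looks_like_postscript_loose(head):
    """Two-byte check: matches anything that starts with '%!'."""
    return head.startswith(b"%!")


def looks_like_postscript_strict(head):
    """Four-byte check: requires the full '%!PS' signature."""
    return head.startswith(b"%!PS")


# Checksum bytes 0x25 0x21 followed by placeholder bytes (not real .puz data).
puz_head = bytes([0x25, 0x21]) + b"\x00\x00"
# A conventional PostScript header line.
ps_head = b"%!PS-Adobe-3.0\n"

print(looks_like_postscript_loose(puz_head))   # True: the false positive
print(looks_like_postscript_strict(puz_head))  # False
print(looks_like_postscript_strict(ps_head))   # True
```

The stricter check would still misfire on a file that happened to start with all four bytes, but that is far less likely than a two-byte collision.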
Upon reflection, I agree: offset 0 is a good place to put magic bytes, which is exactly why a checksum sitting there got the .puz file misidentified as PostScript.
You’d think so, but I did run across a magic file that only looked for “%!”. I haven’t checked, but I suspect that back in the 80s, when people still sometimes wrote PostScript by hand, it was acceptable to just use “%!”.
It doesn’t surprise me that people created files starting with just %!. They probably saw it in other files and copied it. % starts a comment, so what follows is irrelevant except for detecting file type. IIRC, %!PS was supposed to indicate an EPS file, not a PostScript file. EPS files are a subset of PS files that would never be created by hand. I actually wrote a lot of PostScript in the ’80s and early ’90s (more than I’d like to admit) and I never put %! at the start of a file.
Brought to you by the same people who thought, “hey, we can save two bytes by omitting the ‘19’ from years! It’s not like that part is ever going to change.”