Category: Things I’ve Learned

Literate Lists

I’ve written before about literate programming, and how one of its most attractive features is that you can write code with the primary goal of conveying information to a person, and only secondarily of telling a computer what to do. For instance, there’s a bit in my .bashrc that adds directories to $PATH, and it isn’t as reader-friendly as I’d like:

for dir in \
    /usr/sbin \
    /opt/sbin \
    /usr/local/sbin \
    /some/very/specific/directory \
    ; do
    PATH="$dir:$PATH"
done

I’d like to be able to add a comment to each directory entry, explaining why I want it in $PATH, but sh syntax won’t let me: there’s just no way to interleave strings and comments this way. So far, I’ve documented these directories in a comment above the for loop, but that’s not exactly what I’d like. In fact, I’d like to do something like:

$PATH components

  • /usr/sbin
  • /usr/local/bin
for dir in \
    {{path-components}} \
    ; do
    PATH="$dir:$PATH"
done

Or even:

$PATH components

| Directory      | Comments                                                                          |
|----------------+-----------------------------------------------------------------------------------|
| /usr/sbin      | sbin directories contain sysadminny stuff, and should go before bin directories.  |
| /usr/local/bin | Locally-installed utilities take precedence over vendor-installed ones.           |
for dir in \
    {{path-components}} \
    ; do
    PATH="$dir:$PATH"
done

Spoiler alert: both are possible with org-mode.

Lists

The key is to use Library of Babel code blocks: these allow you to execute org-mode code blocks and use the results elsewhere. Let’s start by writing the code that we want to be able to write:

#+name: path-list
- /usr/bin
- /opt/bin
- /usr/local/bin
- /sbin
- /opt/sbin
- /usr/local/sbin

#+begin_src bash :noweb no-export :tangle list.sh
  for l in \
      <<org-list-to-sh(l=path-list)>> \
      ; do
      PATH="$l:$PATH"
  done
#+end_src

Note the :noweb argument to the bash code block, and the <<org-list-to-sh()>> call in noweb brackets. This is a function we need to write. It’ll (somehow) take an org list as input and convert it into a string that can be inserted in this fragment of bash code.

This function is a Babel code block that we will evaluate, and which will return a string. We can write it in any supported language we like, such as R or Python, but for the sake of simplicity and portability, let’s stick with Emacs lisp.

Next, we’ll want a test rig to actually write the org-list-to-sh function. Let’s start with:

#+name: org-list-to-sh
#+begin_src emacs-lisp :var l='nil
  l
#+end_src

#+name: test-list
- First
- Second
- Third

#+CALL: org-list-to-sh(l=test-list) :results value raw

The begin_src block at the top defines our function. For now, it simply takes one parameter, l, which defaults to nil, and returns l. Then there’s a list, to provide test data, and finally a #+CALL: line, which contains a call to org-list-to-sh and some header arguments, which we’ll get to in a moment.

If you press C-c C-c on the #+CALL line, Emacs will evaluate the call and write the result to a #+RESULTS block underneath. Go ahead and experiment with the Lisp code and any parameters you might be curious about.

The possible values for the :results header are listed under “Results of Evaluation” in the Org-Mode manual. There are a lot of them, but the one we care the most about is value: we’re going to execute code and take its return value, not its printed output. But this is the default, so it can be omitted.
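To see the difference, here’s a toy block (not part of our path code) that prints one string and returns another:

#+begin_src emacs-lisp :results value
  (progn
    (princ "this goes to the output")
    "this is the value")
#+end_src

With :results value, the #+RESULTS: block contains “this is the value”; change the header to :results output and you get “this goes to the output” instead.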

If you tangle this file with C-c C-v C-t, you’ll see the following in list.sh:

for l in \
    ((/usr/bin) (/opt/bin) (/usr/local/bin) (/sbin) (/opt/sbin) (/usr/local/sbin)) \
    ; do
    PATH="$l:$PATH"
done

It looks as though our org-mode list got turned into a Lisp list. As it turns out, it did, but that’s not the whole story. Let’s change the source of the org-list-to-sh function to illustrate what’s going on:

    #+name: org-list-to-sh
    #+begin_src emacs-lisp :var l='nil :results raw
      (format "aaa %s zzz" l)
    #+end_src

    Now, when we tangle list.sh, it contains

        aaa ((/usr/bin) (/opt/bin) (/usr/local/bin) (/sbin) (/opt/sbin) (/usr/local/sbin)) zzz \

    So the return value from org-list-to-sh was turned into a string, and that string was inserted into the tangled file. This is because we chose :results raw in the definition of org-list-to-sh. If you play around with other values, you’ll see why they don’t work: vector wraps the result in extraneous parentheses, scalar adds extraneous quotation marks, and so on.

    Really, what we want is a plain string, generated from Lisp code and inserted in our sh code as-is. So we’ll need to change the org-list-to-sh code to return a string, and use :results raw to insert that string unchanged in the tangled file.

    We saw above that org-list-to-sh sees its parameter as a list of lists of strings, so let’s concatenate those strings, with space between them:

#+name: org-list-to-sh
#+begin_src emacs-lisp :var l='nil :results raw
  (mapconcat 'identity
             (mapcar (lambda (elt) (car elt))
                     l)
             " ")
#+end_src

    This yields, in list.sh:

    for l in \
        /usr/bin /opt/bin /usr/local/bin /sbin /opt/sbin /usr/local/sbin \
        ; do
        PATH="$l:$PATH"
    done

which looks pretty nice. It would be nicer still to break that list of strings across multiple lines, and to quote each one (in case any directory name contains spaces), but I’ll leave that as an exercise for the reader. One possible approach is sketched below.
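For the curious, here’s one shape such a solution could take: quote each directory with shell-quote-argument, and join the entries with backslash-newline so the tangled sh stays readable. This is a sketch, not tested beyond the basic idea:

#+name: org-list-to-sh
#+begin_src emacs-lisp :var l='nil :results raw
  (mapconcat (lambda (elt)
               ;; Quote each directory in case it contains spaces.
               (shell-quote-argument (car elt)))
             l
             ;; Backslash-continue the sh line after each entry.
             " \\\n    ")
#+end_src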

    Tables

That takes care of converting an org-mode list to a sh string. But earlier I said it would be even better to define the $PATH components in an org-mode table, with directories in the first column and comments in the second. This turns out to be easy, given what we’ve already done with lists. Let’s add a test table to our org-mode code, and some code that just returns its input:

    #+name: echo-input
    #+begin_src emacs-lisp :var l='nil :results raw
      l
    #+end_src
    
    #+name: test-table
    | *Name*   | *Comment*        |
    |----------+------------------|
    | /bin     | First directory  |
    | /sbin    | Second directory |
    | /opt/bin | Third directory  |
    
    #+CALL: echo-input(l=test-table) :results value code
    
    #+RESULTS:

    Press C-c C-c on the #+CALL line to evaluate it, and you’ll see the results:

    #+RESULTS:
    #+begin_src emacs-lisp
    (("/bin" "First directory")
     ("/sbin" "Second directory")
     ("/opt/bin" "Third directory"))
    #+end_src

First of all, note that, just as with lists, the table is converted to a list of lists of strings, where the first string in each list is the name of the directory. So we can just reuse our existing org-list-to-sh code. Secondly, org has helpfully stripped the header line and the horizontal rule underneath it, giving us a clean set of data to work with (this seems a bit fragile, however, so in your own code, be sure to sanitize your inputs). Just convert the list of directories to a table of directories, and you’re done; the result looks something like the sketch below.
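Reusing org-list-to-sh from above, the table-driven version of the original example might look something like this (table contents abbreviated for illustration):

#+name: path-components
| *Directory*    | *Comment*                                                                |
|----------------+--------------------------------------------------------------------------|
| /usr/sbin      | sbin directories go before bin directories.                              |
| /usr/local/bin | Locally-installed utilities take precedence over vendor-installed ones. |

#+begin_src bash :noweb no-export :tangle list.sh
  for dir in \
      <<org-list-to-sh(l=path-components)>> \
      ; do
      PATH="$dir:$PATH"
  done
#+end_src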

    Conclusion

    We’ve seen how to convert org-mode lists and tables to code that can be inserted into a sh (or other language) source file when it’s tangled. This means that when our code includes data best represented by a list or table, we can, in the spirit of literate programming, use org-mode formatting to present that data to the user as a good-looking list or table, rather than just list it as code.

One final homework assignment: in the list or table that describes the path elements, it would be nice to use org-mode formatting for the directory name itself: =/bin= rather than /bin. Update org-list-to-sh to strip the formatting before converting to sh code. (A possible starting point is sketched below.)
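If you want a nudge in the right direction, one possible starting point is to strip a leading and trailing = from each cell before joining (a sketch, assuming the cells look exactly like =/bin=):

#+name: org-list-to-sh
#+begin_src emacs-lisp :var l='nil :results raw
  (mapconcat (lambda (elt)
               ;; Drop a leading and a trailing "=", if present.
               (replace-regexp-in-string "\\`=\\|=\\'" "" (car elt)))
             l
             " ")
#+end_src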

    Renewing an Overdue Docker Cert in QNAP

    Writing this down before I forget, somewhere where I won’t think to look for it the next time I need it.

    So you’re running Container Station (i.e., Docker) on a QNAP NAS, and naturally you’ve created a cert for it, because why wouldn’t you?, except that it expired a few days ago and you forgot to renew it, because apparently you didn’t have calendar technology when you originally created the cert, and now Container Station won’t renew the cert because it’s expired, and it won’t tell you that: it just passively-aggressively lets you click the Renew Certificate button, but nothing changes and the Docker port continues using the old, expired cert. What to do?

    1. Stop Container Station
    2. Log in to the NAS and delete /etc/docker/tls (or just rename it).
    3. Restart Container Station. Open it, and note the dialog box saying that the cert needs to be renewed.
    4. Under Preferences → Docker Certificate, download the new certificate.
    5. Restart Container Station to make it pick up the new cert.
    6. Unzip the cert in your local Docker certificate directory: either ~/.docker or whatever you’ve set $DOCKER_CERT_PATH to.
7. Check that you have the right cert: the cert.pem that you just unzipped should be from the same keypair that’s being served by the Docker server. The output of
  openssl x509 -noout -modulus -in cert.pem | openssl md5
  and
  openssl s_client -connect $DOCKER_HOST:$DOCKER_PORT </dev/null | openssl x509 -noout -modulus | openssl md5
  should be the same string. (The </dev/null makes s_client exit once it has fetched the cert.)
8. Check the expiration date on the new cert. Subtract 7 days, open a calendar to that date, and this time write down “Renew Docker certificate”.

A Few More Thoughts on Literate Programming

    A while back, I became intrigued by Donald Knuth’s idea of Literate Programming, and decided to give it a shot. That first attempt was basically just me writing down what I knew as quickly as I learned it, and trying to pass it off as a knowledgeable tutorial. More recently, I tried a second project, a web-app that solves Wordle, and thought I’d write it in the Literate style as well.

    The first time around, I learned the mechanics. The second time, I was able to learn one or two things about the coding itself.

(For those who don’t remember: in literate programming, you write code intertwined with prose that explains the code, and a post-processor turns the result into a pretty document for humans to read, and ugly code for computers to process.)

    1) The thing I liked the most, the part where literate programming really shines, is having the code be grouped not by function or by class, but by topic. I could introduce a <div class="message-box"></div> in the main HTML file, and in the next paragraph introduce the CSS that styles it, and the JavaScript code that manipulates it.
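In org-mode terms, that kind of topic grouping might look something like this (class names and file names invented for illustration):

#+begin_src html :tangle index.html
  <div class="message-box"></div>
#+end_src

#+begin_src css :tangle style.css
  .message-box { border: 1px solid gray; }
#+end_src

#+begin_src js :tangle script.js
  const messageBox = document.querySelector(".message-box");
#+end_src

Three languages, three tangled files, one section of the document.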

2) In the same vein, several times I rearranged the source to make the explanations flow better, so that variables or functions weren’t discussed until I had explained why they’re there and what they do, all without altering the underlying HTML or JavaScript source. In fact, this led to a stylistic quandary:

    3) I defined a few customization variables. You know, the kind that normally go at the top for easy customization:

    var MIN_FOO = 30;
    var MAX_FOO = 1500;
    var LOG_FILE = "/var/log/mylogfile.log";

    Of course, the natural tendency was to put them next to the code that they affect, somewhere in the middle of the source file. Should I have put them at the top of my source instead?

    4) Even smaller: how do you pass command-line option definitions to getopt()? If you have options -a, -b, and -c, each will normally be defined in its own section. So in principle, the literate thing to do would be to write

    getopt("{{option-a}}{{option-b}}{{option-c}}");

and have a section that defines option-a as "a". As you can see, though, defining single-letter strings isn’t terribly readable, and literate programming is all about readability.

    5) Speaking of readability, one thing that can come in really handy is the ability to generate a pretty document for human consumption. Knuth’s original tools generated TeX, of course, and it doesn’t get prettier than that.

    I used org-mode, which accepts TeX style math notation, but also allows you to embed images and graphviz graphs. In my case, I needed to calculate the entropy of a variable, so being able to use proper equations, with nicely-formatted sigmas and italicized variables, was very nice. I’ve worked in the past on a number of projects where it would have been useful to embed a diagram with circles and arrows, rather than using words or ASCII art.

    6) I was surprised to find that I had practically no comments in the base code (in the JavaScript, HTML, and CSS that were generated from my org-mode source file). I normally comment a lot. It’s not that I was less verbose. In fact, I was more verbose than usual. It’s just that I was putting all of the explanations about what I was trying to do, and why things were the way they are, in the human-docs part of the source, not the parts destined for computer consumption. Which, I guess, was the point.

    7) Related to this, I think I had fewer bugs than I would normally have gotten in a project of this size. I don’t know why, but I suspect that it was due to some combination of thinking “out loud” (or at least in prose) before pounding out a chunk of code, and of having related bits of code next to each other, and not scattered across multiple files.

    8) I don’t know whether I could tackle a large project in this way. You might say, “Why not? Donald Knuth wrote both TeX and Metafont as literate code, and even published the source in two fat books!” Well, yeah, but he’s Donald Knuth. Also, he was writing before IDEs, or even color-coded code, were available.

    I found org-mode to be the most comfortable tool for me to use for this project. But of course that effectively prevents people who don’t use Emacs (even though they obviously should) from contributing.

    One drawback of org-mode as a literate programming development environment is that you’re pretty much limited to one source file, which obviously doesn’t scale. There are other tools out there, like noweb, but I found those harder to set up, or they forced me to use (La)TeX when I didn’t want to, or the like.

    9) One serious drawback of org-mode is that it makes it nearly impossible to add cross-reference links. If you have a section like

function myFunc() {
    var thing;
    {{calculate thing}}
    return thing;
}

it would be very useful to have {{calculate thing}} be a link that you can click to jump to the definition of that chunk. But this is much harder to do in org-mode than it should be. So is labeling chunks, so that people can chase cross-references even without convenient links. Org-mode has a lot of work to be done in that regard.

    Readable Code: Variable Overload

    It’s well known that reading code is a lot harder than writing it. But I recently got some insight as to why that is.

    I was debugging someone else’s sh script. This one seemed harder to read than most. There was a section that involved figuring out a bunch of dates associated with a particular object. The author was careful to show their work, and not just have a bunch of opaque calculations in the code. I won’t quote their code, but imagine something like:

    NOW=$(date +%s)
    THING_CREATION_EPOCH=$(<get $THING creation time, in Unix time format>)
    THING_AGE_EPOCH=$(( $NOW - $THING_CREATION_EPOCH ))
    THING_AGE_DAYS=$(( $THING_AGE_EPOCH / 86400 ))
    

Now imagine this for three or four aspects of $THING: its last modification time, which other elements use $THING, things like that. The precise details don’t matter.

    Each variable makes sense. But there are four of them, just for the thing’s age since creation. And if you’re not as intimately familiar with it as someone who just wrote this code, that means you have to keep track of four variables in your head, and that gets very difficult very quickly.

Part of the problem is that it’s unclear which variables will be needed further down (or even above, in functions that use global variables), so you have to hold on to them; you can’t mentally let them go. Compare this to something like

    <Initialize some stuff>
    for i in $LIST_OF_THINGS; do
        ProcessStuff $i
    done
    <Finalize some stuff>
    

    Here, you can be reasonably sure that $i won’t be used outside the loop. Once you get past done, you can let it go. Yes, it still exists, and it’s not illegal to use $i in the finalization code, but a well-meaning author won’t do that.

    Which leads to a lesson to be learned from this: limit the number of variables that are used in any given chunk of code. You don’t want to have to remember some variable five pages away. To do this, try to break your code down into independent modules. These can be functions, classes, even just paragraph-sized chunks in some function. Ideally, you want your reader to be able to close the chapter, forget most of the details, and move on to the next bit.
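To make that concrete, here’s a minimal bash sketch of the date example from earlier, with the intermediate variables hidden inside a function (get_creation_time is a hypothetical stand-in for however the script fetches the creation time):

thing_age_days() {
    # "local" keeps these names from leaking into the rest of the script.
    local now creation_epoch
    now=$(date +%s)
    creation_epoch=$(get_creation_time "$1")
    echo $(( (now - creation_epoch) / 86400 ))
}

THING_AGE_DAYS=$(thing_age_days "$THING")

The reader now only has to hold on to THING_AGE_DAYS; now and creation_epoch can be forgotten the moment the function returns.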

    In the same vein, this illustrates one reason global variables are frowned upon: they can potentially be accessed from anywhere in the code, which means that they never really disappear and can’t be completely ignored.

    Ansible: Roles, Role Dependencies, and Variables

    I just spent some time banging my head against Ansible, and thought I’d share in case anyone else runs across it:

    I have a Firefox role that allows you to define a Firefox profile with various plugins, config settings, and the like. And I have a work-from-home (WFH) role that, among other things, sets up a couple of work profiles in Firefox, with certain proxy settings and plugins. I did this the way the documentation says to:

    dependencies:
      - name: Profile Work 1
        role: firefox
        vars:
          - profile: work1
            addons:
              - ublock-origin
              - privacy-badger17
              - solarize-fox
            prefs: >-
              network.proxy.http: '"proxy.host.com"'
              network.proxy.http_port: 1234
      - name: Profile Work 2
        role: firefox
        vars:
          - profile: work2
            addons:
              - ublock-origin
              - privacy-badger17
              - solarized-light
            prefs: >-
              network.proxy.http: '"proxy.host.com"'
              network.proxy.http_port: 1234
    

    The WFH stuff worked fine at first, but then I added a new profile.

    - name: Roles
      hosts: my-host
      roles:
        - role: wfh
        - role: firefox
          profile: third
          addons:
            - bitwarden-password-manager
            - some fancy-theme

    This one didn’t have any prefs, but Ansible was applying the prefs from the WFH role.

    Eventually, I found that the problem lay in the two vars blocks in the wfh role’s dependencies: apparently those get set as variables for the entire task or play, not just for that invocation of the firefox role. The solution turned out to be undocumented: drop the vars blocks and pull the role parameters up a level:

    dependencies:
      - name: Profile Work 1
        role: firefox
        profile: work1
        addons:
          - ublock-origin
          - privacy-badger17
          - solarize-fox
        prefs: >-
          network.proxy.http: '"proxy.host.com"'
          network.proxy.http_port: 1234
      - name: Profile Work 2
        role: firefox
        profile: work2
        addons:
          - ublock-origin
          - privacy-badger17
          - solarized-light
        prefs: >-
          network.proxy.http: '"proxy.host.com"'
          network.proxy.http_port: 1234
    

    I do like Ansible, but it’s full of fiddly stupid crap like this.

    Ansible: Running Commands in Dry-Run Mode in Check Mode

Say you have an Ansible playbook that invokes a command. That command executes when you run Ansible normally, and doesn’t execute at all when you run in check mode.

But a lot of commands, like rsync, have a -n or --dry-run argument that shows what would be done without actually making any changes. So it would be nice to combine the two.

    Let’s start with a simple playbook that copies some files with rsync:

    - name: Copy files
      tasks:
        - name: rsync the files
          command: >-
            rsync
            -avi
            /tmp/source/
            /tmp/destination/
      hosts: localhost
      become: no
      gather_facts: no
    

When you execute this playbook with ansible-playbook foo.yml, rsync runs; when you run in check mode, with ansible-playbook -C foo.yml, rsync doesn’t run.

    This is inconvenient, because we’d like to see what rsync would have done before we commit to doing it. So let’s force it to run even in check mode, with check_mode: no, but also run rsync in dry-run mode, so we don’t make changes while we’re still debugging the playbook:

    - name: Copy files
      tasks:
        - name: rsync the files
          command: >-
            rsync
            --dry-run
            -avi
            /tmp/source/
            /tmp/destination/
          check_mode: no
      hosts: localhost
      become: no
      gather_facts: no
    

    Now we just need to remember to remove the --dry-run argument when we’re ready to run it for real. And turn it back on again when we need to debug the playbook.

    Or we could do the smart thing, and try to add that argument only when we’re running Ansible in check mode. Thankfully, there’s a variable for that: ansible_check_mode, so we can set the argument dynamically:

    - name: Copy files
      tasks:
        - name: rsync the files
          command: >-
            rsync
            {{ '--dry-run' if ansible_check_mode else '' }}
            -avi
            /tmp/source/
            /tmp/destination/
          check_mode: no
      hosts: localhost
      become: no
      gather_facts: no
    

    You can check that this works with ansible-playbook -v -C foo.yml and ansible-playbook -v foo.yml.

    City Mileage

I’ve known forever that city mileage for cars is worse than highway mileage, but I never knew why. It turns out it’s a bit like riding a bike.

    When you ride a bike, you have to put in a lot of work at the beginning, getting up to speed. And after that, you can mostly coast. Assuming you’re on flat ground, you have to pedal a bit because friction is slowing you down, but it’s nowhere near how hard you had to work getting up to speed.

    And then you stop at a red light, and you throw away all the energy you had, so that when the light turns green, you have to put in another burst of work getting up to speed. And, of course, in the city you’re stopping like this all the time. Every few blocks, you throw away all your accumulated energy, and have to start over. This applies to cars the same way as to bikes, except that on your bike, you immediately sense this in your legs, not at the end of the week when the tank is empty.
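For the physics-inclined, a back-of-the-envelope number (mass and speed picked purely for illustration): the energy you throw away at each stop is your kinetic energy,

E_k = \tfrac{1}{2} m v^2

so a 1500 kg car at 50 km/h (about 14 m/s) dumps roughly 0.5 × 1500 × 14² ≈ 150 kJ into its brakes at every red light, energy that then has to be bought back from the gas tank.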

    So it’s not that highways somehow enhance your mileage. Rather, it’s cities that make for wasteful driving. It’s a bit embarrassing that it took me this long to figure it out, but better late than never.

    Grand Juries: Some Suggestions on Note-Taking

    In which I share some things I only figured out after over a month of grand-jurying.

If your grand jury is anything like mine, each session you have a pad of paper with your name on it, which you get back every week, and a docket listing the cases for that week: for each case, the case number, the name of the accused, the name of the police officer, the list of charges, and some other details.

    The docket has a lot of blank space on it, so you can take notes. The pad of paper is also good for taking notes. However, note that the pad stays, while the docket is different every week. So try to use the margins of the docket for things that only matter that week, and use the pad for notes that you’ll want to refer to in future weeks.

    The latter category includes things like:

    • Legal definitions: what’s the difference between first-, second-, and third-degree burglary? How much heroin is considered “personal use”, and how much is possession with intent to distribute?
    • Anything to do with ongoing cases. In particular, if you hear from a witness one week, it’s likely that you’ll hear from another witness in the same case another week. You’ll be happy you wrote down names and places: it’ll help you get a better idea of what went on.
    • Anything else you might need to know in weeks to come, like the phone number of the courthouse, or the names of your co-jurors.

    Now, the docket lists the cases you’ll be hearing that week. But if your case load is anything like mine, you’ll hear anywhere from twenty to forty cases per session, so by the time you’re ready to vote, you won’t be able to keep all of the different cases straight.

    This means that as the police officer (or whoever) reads the police report for each case, you need to listen for the elements of the crime the person is charged with. Put a check mark next to each one as you hear it: if the report contains “… a search of the vehicle revealed a digital scale, forty baggies, and twelve rounds of ammunition”, you can check off “possession of drug paraphernalia” and “illegal possession of ammunition”.

    If the report later says, “upon questioning, Smith said that the baggies were his, but the bullets belonged to his friend”, you can add an A (for “admitted”) next to “possession of drug paraphernalia”, in case that makes it easier to determine whether there’s probable cause.

During the reading of the report, pay attention to where the information came from: if it begins with “On April 17, defendant Smith threatened victim Jones with a pistol”, who is saying that? Witnesses? A police officer describing surveillance camera footage? As a grand juror, you’re not determining guilt or innocence, only probable cause. In practice, “no probable cause” means that a cop made it up, or something along those lines. It’s a low bar to clear, but make sure you do.

    Don’t be afraid to ask where the information comes from. If you or someone you cared about were accused of a crime, would you want them indicted by someone who simply took a cop’s word for it, and didn’t ask any questions?


    Programming Tip: Open and Close at the Same Time

    One useful programming habit I picked up at some point is: if you open or start something, immediately close it or end it. If you open a bracket, immediately write its closing bracket. If you open a file, immediately write the code to close it.

    These days, development environments take care of the niggling little details like matching parentheses and brackets for you. That’s great, but that’s just syntax. The same principle extends further, and automatic tools can’t guess what it is you want to do.

    There’s a problem in a lot of code called a resource leak. The classic example is memory leaks in C: the code asks for, and gets, a chunk of memory. But if you don’t free the memory when you’re done with it, then your program will get larger and larger — like a coffee table where a new magazine is added every month but none are ever taken away — until eventually the machine runs out of memory.

    These days, languages keep track of memory for you, so it’s easier to avoid memory leaks than it used to be. But the best way I’ve found to manage them is: when you allocate memory (or some other resource), plan to release it when you’re done.

The same principle applies to any resource: if you read or write a file, you’ll need a file handle. If you never close your handles, they’ll keep lying around, and you’ll eventually run out. So plan ahead, and write the code to free each resource as soon as you’ve allocated it:

    Once you’ve written

    open INFILE, "<", "/path/to/myfile";

    go ahead and immediately write the code to close that file:

    open INFILE, "<", "/path/to/myfile";
    close INFILE;

    and only then write the code to do stuff with the file:

open INFILE, "<", "/path/to/myfile";
while (<INFILE>)
{
    print "hello\n" if /foo/;
}
close INFILE;

    The corollary of this is, if you’ve written the open but aren’t sure where to put the close, then you may want to take a look at the structure of your code, and refactor it.

    This same principle applies in many situations: when you open a connection to a remote web server, database server, etc., immediately write the code to close the connection. If you’re writing HTML, and you’ve written <foo>, immediately write the corresponding </foo>. If you’ve sent off an asynchronous AJAX request, figure out where you’re going to receive the reply. When you throw an exception, decide where you’re going to catch it.

    And only then write the meat of the code, the stuff that goes between the opening and closing code.

    As I said, I originally came across this as a tip for avoiding memory leaks. But I’ve found that doing things this way forces me to be mindful of the structure of my code, and avoid costly surprises down the line.

    Overfitting

    One of the things I learned in math is that a polynomial of degree N can pass through N+1 arbitrary points. A straight line goes through any two points, a parabola goes through any three points, and so forth. The practical upshot of this is that if your equation is complex enough, you can fit it to any data set.
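(The standard construction behind that fact is Lagrange interpolation: given N+1 points (x_0, y_0), …, (x_N, y_N) with distinct x-values, the polynomial

p(x) = \sum_{i=0}^{N} y_i \prod_{j \neq i} \frac{x - x_j}{x_i - x_j}

has degree at most N and passes through every one of the points.)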

    That’s basically what happened to the geocentric model: it started out simple, with planets going around the Earth in circles. Except that some of the planets wobbled a bit. So they added more terms to the equations to account for the wobbles. Then there turned out to be more wobbles on top of the first wobbles, and more terms had to be added to the equations to take those into account, and so on until the theory collapsed under its own weight. There wasn’t any physical mechanism or cause behind the epicycles (as these wobbles were called). They were just mathematical artifacts. And so, one could argue that the theory was simpler when it had fewer epicycles and didn’t explain all of the data, but also was less wrong.

Take another example (adapted from Russell Glasser, who got it from his CS instructor): let’s say you and I order a pizza, and it comes with olives. I hate olives and you love them, so we want to cut it up in such a way that we both get slices of the same size, but your slices have as many of the olives as possible, and mine have as few as possible. (And don’t tell me we could just order a half-olive pizza; I’m using this as another example.)

    We could take a photo of the pizza, feed it into an algorithm that’ll find the position of each olive and come up with the best way to slice the pizza fairly, but with a maximum of olives on your slices.

The problem is, this tells us nothing about how to slice the next such pizza that we order. Unless there’s some reason to think that the olives will be laid out in a similar way on the next pizza, we can’t tell the pizza parlor how to slice it up when we place our next order.

In contrast, imagine if we’d looked at the pizza and said, “Hm. Looks like the cook is sloppy, and just tossed a handful of olives on the left side, without bothering to spread them around.” Then we could ask the parlor to slice it into wedges, and we have good odds of winding up with three slices with extra olives and three with minimal olives. Or imagine we’d found that the cook puts the olives in the middle and doesn’t spread them around. Then we could ask the parlor to slice the pizza into a grid; you take the middle pieces, and I’ll take the outside ones.

    But our original super-optimal algorithm doesn’t allow us to do that: by trying to perfectly account for every single olive in that one pizza, it doesn’t help us at all in trying to predict the next pizza.

    In The Signal and the Noise, Nate Silver calls this overfitting. It’s often tempting to overfit, because then you can say, “See! My theory of Economic Epicycles explains 29 of the last 30 recessions, as well as 85% of the changes in the Dow Jones Industrial Average!” But is this exciting new theory right? That is, does it help us figure out what the future holds; whether we’re looking at a slight economic dip, a recession, or a full-fledged depression?

    We’ve probably all heard the one about how the Dow goes up and down along with skirt hems. Or that the performance of the Washington Redskins predicts the outcome of US presidential elections. Of course, there’s no reason to think that fashion designers control Wall Street, or that football players have special insight into politics. More importantly, it goes to show that if you dig long enough, you can find some data set that matches the one you’re looking at. And in this interconnected, online, googlable world, it’s easier than ever to find some data set that matches what you want to see.

    These two examples are easy to see through, because there’s obviously no causal relationship between football and politics. But we humans are good at telling convincing stories. What if I told you that pizza sales (with or without olives) can help predict recessions? After all, when people have less spending money, they eat out less, and pizza sales suffer.

    I just made this up, both the pizza example and the explanation. So it’s bogus, unless by some million-to-one chance I stumbled on something right. But it’s a lot more plausible than the skirt or football examples, and thus we need to be more careful before believing it.

    Update: John Armstrong pointed out that the first paragraph should say “N+1”, not “N”.

    Update 2: As if on cue, Wonkette helps demonstrate the problems with trying to explain too much in this post about Glenn Beck somehow managing to tie together John Kerry’s presence or absence on a boat, his wife’s seizure, and Hillary Clinton’s answering or not answering questions about Benghazi. Probably NSFW because hey, Wonkette. But also full of Glenn Beck-ey crazy.