No, Jesse, monkeys will not type Shakespeare

This has to be the most pointless science “experiment” that I have ever come across. That article was in today’s print edition of The Times of India; the report by the original researcher, Jesse Anderson, is here. The claim is that a bunch of computer-simulated “monkeys” have typed all of Shakespeare’s works, which, it is widely claimed, is theoretically possible.

The reason it annoys me is that probability theory is already confusing enough, not just to lay people but even to experts, that there is no need for headlines like this to mess things up more. The probability of monkeys, typing insanely fast, reproducing a single page of Shakespeare accurately — let alone his entire oeuvre — is vanishingly small. If it is not likely to happen in the age of the universe, it is fair to say that it is impossible. This equally applies to virtual monkeys, on present-day computers (and any imaginable future computers).

And, if you read the TOI article, it turns out that this is not what is happening. The virtual monkeys are generating random text. Any sequence of 9 characters that happens to appear in Shakespeare is deemed to be “correct”. Once all “9-mers” in a Shakespeare work have been typed (in arbitrary order), that work is deemed to be complete.
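To make the “9-mer” criterion concrete, here is a minimal Python sketch of it as I read the description above. It is not Anderson’s actual code: the function names, the letters-only normalisation and the tiny 2-mer demo are my own assumptions, chosen so that the run finishes instantly.

```python
import random
import string


def normalise(text):
    """Keep only the letters a-z, lowercased (an assumed simplification)."""
    return "".join(c for c in text.lower() if c in string.ascii_lowercase)


def kmers(text, k):
    """Return the set of all length-k substrings ("k-mers") of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def monkey_run(target, k, rng, max_tries=1_000_000):
    """Type random k-letter strings until every k-mer of target has appeared.

    Returns the number of strings typed, or None if max_tries is exhausted.
    The order in which the k-mers turn up is ignored, as in the criterion above.
    """
    remaining = kmers(target, k)
    for tries in range(1, max_tries + 1):
        guess = "".join(rng.choice(string.ascii_lowercase) for _ in range(k))
        remaining.discard(guess)
        if not remaining:
            return tries
    return None


if __name__ == "__main__":
    target = normalise("To be, or not to be, that is the question")
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    print("strings typed before the text counted as complete:",
          monkey_run(target, 2, rng))
```

Note that the text is declared “complete” without the monkeys ever typing it in order; they only have to hit each short fragment once, somewhere, at some point.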

Let’s simplify things and reduce the Shakespeare works to the uppercase and lowercase letters; the ten digits; the space; and nine punctuation marks (single and double quotes, full stop, comma, semicolon, colon, dash, question mark, exclamation point). That gives us 72 characters. How many 9-mers can be constructed from these characters? The answer is 72^9 = 51998697814228992, or about 5.2 × 10^16. If the monkeys typed a million characters a second, they would need roughly 1648 years to get through that many 9-mers, which is about how long one should expect to wait for any one particular string of 9 characters.
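The arithmetic is easy to check. The sketch below assumes a year of 365.25 days and treats each keystroke as completing one fresh 9-character window, which is a simplification; the variable names are mine.

```python
# 52 letters (upper and lower case) + 10 digits + 1 space + 9 punctuation marks
ALPHABET_SIZE = 52 + 10 + 1 + 9
K = 9
CHARS_PER_SECOND = 1_000_000            # typing rate assumed in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

n_kmers = ALPHABET_SIZE ** K
years = n_kmers / CHARS_PER_SECOND / SECONDS_PER_YEAR

print("alphabet size:", ALPHABET_SIZE)
print("possible 9-mers:", n_kmers)
print(f"years at a million characters per second: {years:.0f}")
```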

So how do Anderson’s “monkeys” do it? By simplifying even further. Anderson considers only the 26 lowercase letters and no punctuation (not even spaces). Then there are 26^9 = 5429503678976, or about 5.4 trillion, possible 9-mers, a feasible number to explore exhaustively, which is all his monkeys are doing. Every time a 9-mer “agrees” with a 9-mer in Shakespeare, it is deemed a “hit”, and a Shakespeare work is deemed reproduced if it is entirely covered in “hits”.

In a little over a month, over 5 trillion of these 5.4 trillion 9-mers have been reproduced by the monkeys. Why 9-mers? Obviously to make it interesting. On the same computers, all possible 8-mers would have been produced in about 1-2(*) days, which is hardly very newsworthy. (And, to take a trivial example, all possible 1-mers or 2-mers would have taken a few milliseconds at most.) All possible 10-mers would have taken a couple of years(*); perhaps the media would have lost interest, or perhaps the computer time would have been too expensive.
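These estimates follow from a crude scaling argument: with 26 letters, each extra letter multiplies the number of k-mers, and hence roughly the running time, by 26. The sketch below assumes a round 30 days for the 9-mer run (the text only says “a little over a month”) and ignores the extra time that random sampling needs to mop up the last few strings, so treat its output as order-of-magnitude only.

```python
ALPHABET = 26          # lowercase letters only, as in Anderson's setup
BASELINE_K = 9
BASELINE_DAYS = 30.0   # assumption: round figure for "a little over a month"


def kmer_count(k):
    """Number of distinct strings of length k over the alphabet."""
    return ALPHABET ** k


def estimated_days(k):
    """Scale the 9-mer running time by the ratio of k-mer counts."""
    return BASELINE_DAYS * kmer_count(k) / kmer_count(BASELINE_K)


for k in (1, 2, 8, 9, 10):
    print(f"{k:2d}-mers: {kmer_count(k):>19,d} strings, "
          f"~{estimated_days(k):.3g} days")
```

Under these assumptions the 8-mer run comes out at a bit over a day and the 10-mer run at a bit over two years, in line with the figures quoted above.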

Having produced each one of the 5.4 trillion possible sequences of 9 letters, the monkeys will, by Anderson’s definition of “reproduced”, have reproduced not only all of Shakespeare, but all of the literature ever written in the English language (and other languages in the Roman script) since the beginning of time, and done that in barely a month. And if Anderson had chosen 7-mers instead of 9-mers, it would have taken only an hour or two. And by typing “a b c d e f g h i j k l m n o p q r s t u v w x y z”, I have reproduced all of Shakespeare in 1-mers: just strike off every character there against Shakespeare’s folio, ignoring case, space, punctuation and all non-letter symbols, and see what is left.

The only thought that occurs to me is — what a waste of computer resources.

(*) Edit: these numbers were corrected from the first draft.


6 Comments

  1. David Lang

     /  September 29, 2011

    it’s even worse than you make it out to be, they just did lower case letters and spaces.

  2. Commenting on the idiocy of Times of India is like trying to prove an axiom … how low will you stoop?

  3. It isn’t really a waste of computer resources, as it appears to have been run on his own laptop, and from a quick look at his report of the matter he wasn’t paid any public money to do it. But it’s a complete waste of time, because running a programme doesn’t tell you anything that you can’t learn from working out the probability figures, as you’ve done in the post. It looks like it was intended to gull the paper into giving him some publicity. It tells us nothing we didn’t know already.

  4. I just switched back to *The Hindu*.

  5. Rahul Siddharthan

     /  October 3, 2011

    gd, Amri — I don’t really want to bash TOI on this. The headline was sensationalist but the article actually gave an accurate impression of what he actually did, and the concluding paragraph is a sound enough criticism. Besides, he seems to have conned many other media outlets into publicising his work.

    CIngram — well, I should have looked more closely. If he used only his own laptop, well, it’s a waste of electricity but not much else. Maybe he is really doing an experiment to test credulity.

