An h-index for Test cricket batsmen

The h-index is a way to assess the impact of the published work of a scientist, in terms of citations. It is an attempt to get around more simple-minded citation measures and works rather well in the sense that the scientists whom you’d expect to have high h-indices, usually do; while scientists who happen to have published one or two high-impact papers, but have had otherwise unremarkable careers, won’t score very high.

Basically, the definition of the h-index is: it is the maximum number h such that the scientist has published h papers, each with at least h citations. That is, if I have published 10 papers that have each been cited 10 or more times, but have not published 11 papers which have each been cited 11 or more times, my h-index is 10.
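
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `h_index` and the citation counts are made up for the example, not taken from any real publication record.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break     # counts only get smaller from here, so stop
    return h


# Hypothetical record: ten papers cited 10 or more times, but not eleven cited 11+.
example = [25, 18, 15, 14, 12, 12, 11, 10, 10, 10, 3, 1]
print(h_index(example))  # 10
```

Sorting first means the check "do the top k papers each have at least k citations?" only has to look at the k-th paper.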

Based on a Facebook wisecrack by a friend (on the occasion of Sachin falling short of his 100th international 100), the thought occurred to me: how about ranking batsmen in cricket by an analogous score? That is, a batsman has an index h if, on h occasions, he has scored h or more runs. (As before, we take the maximum possible h.)

It turns out that the top five in this list (for Test cricket only) are basically the top five rungetters, in the same order: Sachin Tendulkar, Rahul Dravid, Ricky Ponting, Jacques Kallis, Brian Lara. The top 20 or so almost all appear in the top 20 list for rungetters. So it’s not very interesting — yet. Tendulkar’s h-score is 76 — that is, on 76 occasions he has scored 76 or more. There is a big gap between him and Dravid (69) but the others follow closely behind.

Suppose we modify it as follows: for a given n, the nh index is the largest h such that on h occasions the batsman has scored nh or more runs. For example, the 10h index would be: if on 5 occasions I have scored 50 runs or more (and I have not scored 60 runs or more on 6 occasions) I have a 10h index of 5. For n > 1, basically, I am giving more importance to higher-scoring innings, and also benefiting those who played fewer matches (most older players played far fewer games than Tendulkar and can’t remotely approach either his career aggregate, or his h-score).
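
Here is a rough Python sketch of the nh index as just defined. The function name `nh_index` and the list of innings are hypothetical, chosen only to reproduce the "5 scores of 50 or more, but not 6 scores of 60 or more" example; this is not the script used for the rankings below.

```python
def nh_index(scores, n=1):
    """Largest h such that at least h innings have n*h or more runs each."""
    ranked = sorted(scores, reverse=True)  # best innings first
    h = 0
    for rank, runs in enumerate(ranked, start=1):
        if runs >= n * rank:
            h = rank  # the top `rank` innings all reach n*rank runs
        else:
            break     # scores fall while the threshold rises, so stop
    return h


# Hypothetical innings, not real data.
innings = [120, 95, 80, 64, 51, 50, 33, 20, 12, 4, 0]
print(nh_index(innings, n=10))  # 5: five scores of 50+, but not six of 60+
print(nh_index(innings, n=1))   # 9: the plain h-index
print(nh_index(innings, n=5))   # 6
```

With n = 1 this reduces to the plain h-index, and larger n shifts the weight towards the few biggest innings.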

What is the 10h ranking of batsmen, then? It turns out to be substantially different. The top 6 batsmen are now DG Bradman, Lara, Tendulkar, V Sehwag, KC Sangakkara and DPMD Jayawardene. Bradman scored 180 or more on 18 occasions; Lara’s 10h score is 17, Tendulkar’s is 16, and the other three get 15 each. Also in the top 10 (well, tied for 10) is Garry Sobers, who ranks much lower in both career aggregate and h-index. Immediately after him is Wally Hammond, who drops off today’s lists in the aggregate as well as the above h-index.

Specifically, the top-20 list goes like this:

 h  Batsman           Score (at least)
18  DG Bradman        185
17  BC Lara           178
16  SR Tendulkar      160
15  V Sehwag          164
15  KC Sangakkara     152
15  DPMD Jayawardene  150
14  SR Waugh          150
14  RT Ponting        150
14  R Dravid          148
14  JH Kallis         148
14  GS Sobers         145
13  WR Hammond        140
13  SM Gavaskar       147
13  ML Hayden         131
13  Javed Miandad     145
13  IVA Richards      135
13  GC Smith          133
13  G Kirsten         133
12  Zaheer Abbas      126
12  Younis Khan       126

As you may expect, a 5h ranking sort of interpolates these: Tendulkar now tops again, with Bradman and Lara tied next (and close behind). Sobers and Hammond continue to rank high.

While it is always difficult to rank batsmen from different eras, it seems to me that any list of all-time-great batsmen must put Bradman at or near the top, and must include Sobers, Hammond, Sunil Gavaskar, Vivian Richards, Zaheer Abbas, Javed Miandad and other past greats in the top 20. The nh-index seems to do this, for suitable choices of n. But what is the optimal choice?

(This is based on raw batting data downloaded from cricinfo.)


UPDATE 07 Dec 2011: Gangan Prathap points, in the comments, to this 2010 paper, by him, where he proposes a “mock h index” (different from my nh-index above); his scheme ranks Bradman above four top Indian batsmen, but does not consider other international greats.


16 Comments

  1. Bala

     /  November 28, 2011

    An excellent post, as usual, and a great idea (the h-index for batsmen). As for the optimal value of n, resisting the temptation to say “42” (which, as we know, is the ultimate answer to all deep questions), I’d still say “1”. So we’re back to the original h index, and Tendulkar is numero uno once again, and all is right with the world:-)

    • Rahul Siddharthan

       /  December 1, 2011

      Bala – so, just for fun, I looked at the numbers with n=42. Now Bradman, Sehwag and Lara are on top (6 times they have scored 6*42=252 or more), followed by Abbas, Hammond, Dravid, Miandad and Jayawardene on 5. Sachin, with a bunch of others, is on 4.

      I disagree that 1 is optimal (in fact, I think it is not optimal even in publishing) but it would be an interesting exercise to work out, and justify theoretically, what is optimal in a given field.

      • Bala

         /  December 1, 2011

        I figured n = 1 might be optimal only in a vague hand-waving way,
        not after any real thought. It was more that I couldn’t see any
        obvious natural ‘scale’ to fix n, and so fell back on 1. Maybe one
        should go nonlinear and try n = sqrt{h} or something. As for
        publishing, I think there IS no optimal measure that is anywhere
        near universally applicable. After all, there are as many motivations
        and styles of doing science as there are scientists themselves.
        Of course one speaks of impact, but I think that is a corollary of
        our present-day myopic impatience with things like perspective
        and the big picture. A milieu that increasingly regards yesterday as
        ancient history can’t imagine waiting for a decade or more to assess
        the real impact of anything.

        • Rahul Siddharthan

           /  December 1, 2011

          Yes, the question of a natural “scale” is important. Also important are the quality of the opposition (in cricket), the quality of the citing papers (in research), and so on, all of which are ignored by the h-index and this generalisation. People have come up with more sophisticated metrics in the literature, but none have the simplicity of the h-index. I agree this is rather unreliable for recent research, and it may also be unfair to some older research, but that is why people talk about “impact” metrics, not “quality” metrics. The problem comes when impact is confused with quality. All things being fair, I believe that high-impact papers will, 99% of the time, be high-quality; but of course people are gaming the system now and not all highly-cited papers are actually so worthwhile.

  2. Alpan

     /  December 1, 2011

    Interesting. Have you tried doing this for bowlers (i.e., largest h such that the bowler has taken nh wickets in h matches)?

    • Rahul Siddharthan

       /  December 1, 2011

      No, I haven’t tried. Given that the maximum any bowler has ever taken in a match is 18, while many bowlers have played over 100 matches, I’d guess the choice of n needs to be much less than 1. I was thinking of trying other sports, including some that I’m not familiar with, just to see how it does in an unbiased situation (i.e., in cricket I believe Bradman should be on top, but in baseball, say, I’d have no clue; then again, I’d need a baseball expert to comment on the results).

  3. Wonderful idea. The ideal number from my point of view (I’m an Essex boy) would be n: Graham Gooch = nº 1, but I think we’d have to go non-linear (as Bala suggests) and then cheat a bit to get it to work ;-) It would be easier I think to make Jack Hobbs top for all first class cricket.

    But I wouldn’t argue with Tendulkar as nº 1. And even less would I argue with the Don. I remember reading, as a very young boy, the story of his last test series in England (told by I can’t remember who). The drama of the narrative was compelling. Reading the lines I could see before me this greatest of all batsmen doing impossible things before proving that he was just, only just, on our side of immortality. I wanted to watch him play, and couldn’t understand why my father said it wasn’t possible because he’d retired many years before. Time means nothing to a six-year-old.

    It would be good to come up with something similar for bowlers, and then perhaps to get the ICC, or at least Cricinfo, interested in the idea.

    • Rahul Siddharthan

       /  December 5, 2011

      Heh. It will take a lot more work before I’d consider getting the ICC or even Cricinfo interested!

  4. sunny

     /  December 4, 2011

    Can you please try 16.
    Here is how I calculated it. Average cricket career is 8 years. 2 Big test scores a year. I think that can be taken as good output from a batsman.

    • Rahul Siddharthan

       /  December 5, 2011

      Ok, here you go: Bradman (13), Lara (12), Sehwag/Tendulkar/Sangakkara (11), Hammond/Gavaskar/Dravid/Miandad/Jayawardene (10), and about 20 batsmen on 9. Not a bad argument, by the way. I’d have tried to fix the ideal “high score” first (centuries are too common, double-centuries perhaps too rare) and the number of times that score should be reached by a really good batsman. So I’d have thought a really good batsman should get, say, 150 about 10 times in his career. That leads to n=15 — almost the same as what you ask — with Bradman still on top, Lara still 2nd, Sangakkara now alone 3rd, and a bunch of others tied below.

  5. Alpan

     /  December 7, 2011

    Ok, here’s a really geeky approach to optimize n: a maximum entropy algorithm. Clearly if n is too large, everyone will have an h-index of 0, i.e., the same value. Similarly, if n is too small, everyone’s h-index will be approximately equal to the number of matches they have played. If we normalize by number of matches, again everyone will have almost the same h-index. Thus, an optimal value of n is one which most differentiates between batsmen. So my recommended (greedy) algorithm would run as follows:
    For each n, compute the normalized h-index for all (or a large subset of) batsmen; from the normalized h-index distribution compute the entropy; end. Pick the n value that corresponds to the largest entropy.

    I don’t have the data, so not sure how well this works!

    • Rahul Siddharthan

       /  December 7, 2011

      Alpan, what do you mean by “normalised h-index”? The sum of h over all batsmen should be 1? That doesn’t really make sense to me.

      One could calculate the probability of a random batsman having h, for each value of h (not normalised): then the sum over h of p(h) would be 1 and one can calculate and maximise the entropy of this distribution. That is not hard to do, but it turns out that the maximum entropy value of n is 1 (or 0, if we allow 0).
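
A minimal Python sketch of that entropy calculation, for anyone who wants to try it. Everything here is an assumption for illustration: the names `h_entropy` and `best_n` are made up, the four `careers` are toy numbers rather than the cricinfo data, and the natural logarithm is an arbitrary choice of base.

```python
import math
from collections import Counter


def nh_index(scores, n):
    """Largest h such that at least h innings have n*h or more runs each."""
    h = 0
    for rank, runs in enumerate(sorted(scores, reverse=True), start=1):
        if runs < n * rank:
            break
        h = rank
    return h


def h_entropy(careers, n):
    """Shannon entropy (in nats) of p(h), the fraction of batsmen with each h."""
    values = [nh_index(scores, n) for scores in careers.values()]
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())


def best_n(careers, candidates):
    """Candidate n whose h distribution has the largest entropy."""
    return max(candidates, key=lambda n: h_entropy(careers, n))


# Toy careers (hypothetical innings lists), only to exercise the functions.
careers = {
    "A": [200, 150, 90, 40, 10],
    "B": [80, 75, 60, 30, 5, 2],
    "C": [45, 30, 12, 8],
    "D": [300, 20, 3],
}
for n in (1, 10, 1000):
    print(n, round(h_entropy(careers, n), 3))
```

For a very large n every batsman gets h = 0, so the entropy is zero, which is the degenerate case Alpan describes.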

  6. Is there a place for a mock h-index?
    Gangan Prathap
    Scientometrics (2010) 84:153–165

    Although the h-index was introduced in a publication-citation context, it can easily be
    applied to other source-item relations, as was done for library classification (Liu and
    Rousseau 2009). An obvious application is its application to sports. It is usual to use
    averages, e.g., batting average = C/P, where C is the total number of runs scored in a
    career of P innings played, say, borrowing an example from cricket. This is meaningful,
    where the distribution is Gaussian. However, if it is Lotkaian or Paretian, as is often the
    case, a composite measure of correction for quality like (C^2/P)^(1/3) may be profitably used to
    rank such performers. Figure 3 shows an example compiled quite tediously a few years ago
    showing the performances of four leading Indian batsmen and how they compared with a
    legendary great, Sir Donald Bradman of Australia. As seen, the distribution is identical to
    what will obtain in bibliometrics, with tall cores and long tails. If a conventional h-index
    evaluation is done, then Tendulkar ranks as the best in this list, and Bradman ranks last,
    only because in those days, cricketers got to play fewer games in an active career
    (Table 6). No self-respecting cricket enthusiast will accept such a conclusion. The h-index
    procedure fails to recognize the very tall core and the very short tail implied by Bradman’s
    career. However, the mock h procedure captures this very faithfully, restoring Bradman
    rightfully to the top of the list.
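
The composite measure quoted above, (C^2/P)^(1/3), is simple to compute. The sketch below is only an illustration of that formula: the function name `mock_h` and the career figures are hypothetical, not taken from the paper's tables.

```python
def mock_h(total_runs, innings):
    """Mock h-index (C^2 / P)^(1/3) for C total runs scored in P innings."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    return (total_runs ** 2 / innings) ** (1 / 3)


# Hypothetical career: 8000 runs in 64 innings (an average of 125).
print(round(mock_h(8000, 64), 2))  # 100.0
```

Since C^2/P = C x (C/P), the measure is the cube root of the career aggregate times the batting average, which is why a short career with a very high average can still score well.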

    • Rahul Siddharthan

       /  December 7, 2011

      Thanks – your mock h looks very interesting. But it would have been nice if you had looked at a few more batsmen, at least the rest of the current top 5, i.e. Lara, Ponting, Kallis. Also, your h-score for Bradman (44) looks different from what I get (48): how did you calculate it? For Gavaskar we agree, for Ganguly I get 51 where you have 47. Tendulkar and Dravid are, of course, still playing (and Dravid has had a good run since you published your paper) so some difference is expected.
