An h-index for Test cricket batsmen

The h-index is a way to assess the impact of the published work of a scientist, in terms of citations. It is an attempt to get around more simple-minded citation measures and works rather well in the sense that the scientists whom you’d expect to have high h-indices, usually do; while scientists who happen to have published one or two high-impact papers, but have had otherwise unremarkable careers, won’t score very high.

The h-index is defined as the largest number h such that the scientist has published h papers, each with at least h citations. For example, if I have published 10 papers that have each been cited 10 or more times, but have not published 11 papers that have each been cited 11 or more times, my h-index is 10.

Based on a Facebook wisecrack by a friend (on the occasion of Sachin falling short of his 100th international 100), the thought occurred to me: how about ranking batsmen in cricket by an analogous score? That is, a batsman has an index h if, on h occasions, he has scored h or more runs. (As before, we take the maximum possible h.)
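For anyone who wants to try this, here is a minimal sketch of the calculation in Python. It assumes a batsman's innings are available as a plain list of integer scores; the function name `h_index` and the sample scores are made up for illustration and are not the Cricinfo data.

```python
def h_index(scores):
    """Return the largest h such that at least h of the scores are >= h."""
    # Sort from highest to lowest so that the k-th entry is the k-th best score.
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank      # the top `rank` scores are all >= rank
        else:
            break         # later scores are lower still, so we can stop
    return h


# Hypothetical innings, purely for illustration.
innings = [0, 12, 3, 45, 7, 101, 4, 9, 30, 5]
print(h_index(innings))  # 6: six scores of 6 or more, but not seven of 7 or more
```

Sorting first means the answer can be read off in a single pass, since the k-th best score is the only one that needs to be compared with k.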

It turns out that the top five in this list (for Test cricket only) are basically the top five rungetters, in the same order: Sachin Tendulkar, Rahul Dravid, Ricky Ponting, Jacques Kallis, Brian Lara. The top 20 or so almost all appear in the top 20 list for rungetters. So it’s not very interesting — yet. Tendulkar’s h-score is 76 — that is, on 76 occasions he has scored 76 or more. There is a big gap between him and Dravid (69) but the others follow closely behind.

Suppose we modify it as follows: the nh index is that value of h, for a given n, such that on h occasions the batsman has scored nh or more runs. For example, the 10h index would be: if on 5 occasions I have scored 50 runs or more (and I have not scored 60 runs or more on 6 occasions) I have a 10h index of 5. For n > 1, basically, I am giving more importance to higher-scoring innings, and also benefiting those who played fewer matches (most older players played far fewer games than Tendulkar and can’t remotely approach either his career aggregate, or his h-score).
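The generalised index can be sketched in Python along the same lines. This is only an illustration, assuming the scores come as a list of integers and that n is non-negative; the name `nh_index` and the sample innings are hypothetical.

```python
def nh_index(scores, n):
    """Return the largest h such that at least h of the scores are >= n * h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        # Scores fall as rank rises while the threshold n * rank rises,
        # so the first failure is final.
        if score >= n * rank:
            h = rank
        else:
            break
    return h


# Hypothetical innings: five scores of 50 or more, but not six of 60 or more.
innings = [50, 55, 61, 70, 52, 20, 8]
print(nh_index(innings, 10))  # 5
print(nh_index(innings, 1))   # 7, the plain h-index
```

Setting n = 1 recovers the ordinary h-index, so the same function covers both rankings.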

What is the 10h ranking of batsmen, then? It turns out to be substantially different. The top 6 batsmen are now DG Bradman, Lara, Tendulkar, V Sehwag, KC Sangakkara and DPMD Jayawardene. Bradman scored 180 or more on 18 occasions; Lara’s 10h score is 17, Tendulkar’s is 16, and the other three get 15 each. Also in the top 10 (well, tied for 10th) is Gary Sobers, who ranks much lower both in career aggregate and in h-index. Immediately after him is Wally Hammond, who drops off today’s lists in the aggregate as well as the above h-index.

Specifically, the top-20 list goes like this:

h	Batsman	Score (at least)
18	DG Bradman	185
17	BC Lara	178
16	SR Tendulkar	160
15	V Sehwag	164
15	KC Sangakkara	152
15	DPMD Jayawardene	150
14	SR Waugh	150
14	RT Ponting	150
14	R Dravid	148
14	JH Kallis	148
14	GS Sobers	145
13	WR Hammond	140
13	SM Gavaskar	147
13	ML Hayden	131
13	Javed Miandad	145
13	IVA Richards	135
13	GC Smith	133
13	G Kirsten	133
12	Zaheer Abbas	126
12	Younis Khan	126

As you may expect, a 5h ranking sort of interpolates these: Tendulkar now tops again, with Bradman and Lara tied next (and close behind). Sobers and Hammond continue to rank high.

While it is always difficult to rank batsmen from different eras, it seems to me that any list of all-time-great batsmen must put Bradman at or near the top, and must include Sobers, Hammond, Sunil Gavaskar, Vivian Richards, Zaheer Abbas, Javed Miandad and other past greats in the top 20. The nh-index seems to do this, for suitable choices of n. But what is the optimal choice?

(This is based on raw batting data downloaded from cricinfo.)

UPDATE 07 Dec 2011: Gangan Prathap points, in the comments, to this 2010 paper, by him, where he proposes a “mock h index” (different from my nh-index above); his scheme ranks Bradman above four top Indian batsmen, but does not consider other international greats.

16 Comments

by Rahul Siddharthan on November 28, 2011 • Permalink

https://horadecubitus.wordpress.com/2011/11/28/an-h-index-for-test-cricket-batsmen/

Bala
/ November 28, 2011

An excellent post, as usual, and a great idea (the h-index for batsmen). As for the optimal value of n, resisting the temptation to say “42” (which, as we know, is the ultimate answer to all deep questions), I’d still say “1”. So we’re back to the original h index, and Tendulkar is numero uno once again, and all is right with the world:-)

- Rahul Siddharthan
  / December 1, 2011
  
  Bala – so, just for fun, I looked at the numbers with n=42. Now Bradman, Sehwag and Lara are on top (6 times they have scored 6*42=252 or more), followed by Abbas, Hammond, Dravid, Miandad and Jayawardene on 5. Sachin, with a bunch of others, is on 4.
  
  I disagree that 1 is optimal (in fact, I think it is not optimal even in publishing) but it would be an interesting exercise to work out, and justify theoretically, what is optimal in a given field.
  
  - Bala
    / December 1, 2011
    
    I figured n = 1 might be optimal only in a vague hand-waving way,
    not after any real thought. It was more that I couldn’t see any
    obvious natural ‘scale’ to fix n, and so fell back on 1. Maybe one
    should go nonlinear and try n = sqrt{h} or something. As for
    publishing, I think there IS no optimal measure that is anywhere
    near universally applicable. After all, there are as many motivations
    and styles of doing science as there are scientists themselves.
    Of course one speaks of impact, but I think that is a corollary of
    our present-day myopic impatience with things like perspective
    and the big picture. A milieu that increasingly regards yesterday as
    ancient history can’t imagine waiting for a decade or more to assess
    the real impact of anything.
    
    - Rahul Siddharthan
      / December 1, 2011
      
      Yes, the question of a natural “scale” is important. Also important is the quality of the opposition (in cricket), the citing papers (in research), etc — which is ignored by the h-index and this generalisation. People have come up with more sophisticated metrics in literature, but none have the simplicity of the h-index. I agree this is rather unreliable for recent research, and it may also be unfair to some older research — but that is why people talk about “impact” metrics, not “quality” metrics. The problem comes when impact is confused with quality. All things being fair, I believe that high-impact papers will, 99% of the time, be high-quality; but of course people are gaming the system now and not all highly-cited papers are actually so worthwhile.
      
Alpan
/ December 1, 2011

Interesting. Have you tried doing this for bowlers (i.e., largest h such that the bowler has taken nh wickets in h matches)?

- Rahul Siddharthan
  / December 1, 2011
  
  No, I haven’t tried. Given that the maximum any bowler has ever taken in a match is 18, while many bowlers have played over 100 matches, I’d guess the choice of n needs to be much less than 1. I was thinking of trying other sports, including some that I’m not familiar with, just to see how it does in an unbiased situation (i.e., in cricket I believe Bradman should be on top, but in baseball, say, I’d have no clue; but then I’ll need a baseball expert to comment on the results).
  
CIngram
/ December 2, 2011

Wonderful idea. The ideal number from my point of view (I’m an Essex boy) would be n: Graham Gooch = nº 1, but I think we’d have to go non-linear (as Bala suggests) and then cheat a bit to get it to work ;-) It would be easier I think to make Jack Hobbs top for all first class cricket.

But I wouldn’t argue with Tendulkar as nº 1. And even less would I argue with the Don. I remember reading, as a very young boy, the story of his last test series in England (told by I can’t remember who). The drama of the narrative was compelling. Reading the lines I could see before me this greatest of all batsmen doing impossible things before proving that he was just, only just, on our side of immortality. I wanted to watch him play, and couldn’t understand why my father said it wasn’t possible because he’d retired many years before. Time means nothing to a six-year-old.

It would be good to come up with something similar for bowlers, and then perhaps to get the ICC, or at least Cricinfo, interested in the idea.

- Rahul Siddharthan
  / December 5, 2011
  
  Heh. It will take a lot more work before I’d consider getting the ICC or even Cricinfo interested!
  
sunny
/ December 4, 2011

Can you please try 16.
Here is how I calculated it. Average cricket career is 8 years. 2 Big test scores a year. I think that can be taken as good output from a batsman.

- Rahul Siddharthan
  / December 5, 2011
  
  Ok, here you go: Bradman (13), Lara (12), Sehwag/Tendulkar/Sangakkara (11), Hammond/Gavaskar/Dravid/Miandad/Jayawardene (10), and about 20 batsmen on 9. Not a bad argument, by the way. I’d have tried to fix the ideal “high score” first (centuries are too common, double-centuries perhaps too rare) and the number of times that score should be reached by a really good batsman. So I’d have thought a really good batsman should get, say, 150 about 10 times in his career. That leads to n=15 — almost the same as what you ask — with Bradman still on top, Lara still 2nd, Sangakkara now alone 3rd, and a bunch of others tied below.
  
Alpan
/ December 7, 2011

Ok, here’s a really geeky approach to optimize n: a maximum entropy algorithm. Clearly if n is too large, everyone will have an h-index of 0, i.e., the same value. Similarly, if n is too small, everyone’s h-index will be approximately equal to the number of matches they have played. If we normalize by number of matches, again everyone will have almost the same h-index. Thus, an optimal value of n is one which most differentiates between batsmen. So my recommended (greedy) algorithm would run as follows:
For each n, compute the normalized h-index for all (or a large subset of) batsmen; from the normalized h-index distribution compute the entropy; end. Pick the n value that corresponds to the largest entropy.

I don’t have the data, so not sure how well this works!

- Rahul Siddharthan
  / December 7, 2011
  
  alpan — what do you mean by “normalised h-index”? The sum of h over all batsmen should be 1? That doesn’t really make sense to me.
  
  One could calculate the probability of a random batsman having h, for each value of h (not normalised): then the sum over h of p(h) would be 1 and one can calculate and maximise the entropy of this distribution. That is not hard to do, but it turns out that the maximum entropy value of n is 1 (or 0, if we allow 0).
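A rough sketch of that entropy calculation in Python, for anyone who wants to repeat it. It assumes each batsman's career is a list of integer scores and takes p(h) to be the fraction of batsmen with each value of h; the function names and the toy careers are hypothetical, and this is not the script used for the numbers quoted here.

```python
import math
from collections import Counter


def nh_index(scores, n):
    """Largest h such that at least h of the scores are >= n * h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= n * rank:
            h = rank
        else:
            break
    return h


def entropy_of_h(careers, n):
    """Shannon entropy (in nats) of the empirical distribution p(h)."""
    hs = [nh_index(scores, n) for scores in careers]
    total = len(hs)
    # p(h) = fraction of batsmen whose nh-index equals h; these sum to 1.
    return -sum((c / total) * math.log(c / total) for c in Counter(hs).values())


def best_n(careers, candidates):
    """Candidate n whose h distribution has the largest entropy."""
    return max(candidates, key=lambda n: entropy_of_h(careers, n))


# Toy careers, invented for illustration only.
careers = [[100, 100, 100], [50, 50], [10], [200] * 5]
print(best_n(careers, [1, 1000]))
```

With a very large n every batsman gets h = 0, so the entropy collapses to zero, which is the degenerate case described in the comment above.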
  
Gangan Prathap
/ December 7, 2011

Is there a place for a mock h-index?
Gangan Prathap
Scientometrics (2010) 84:153–165

Although the h-index was introduced in a publication-citation context, it can easily be
applied to other source-item relations, as was done for library classification (Liu and
Rousseau 2009). An obvious application is its application to sports. It is usual to use
averages, e.g., batting average = C/P, where C is the total number of runs scored in a
career of P innings played, say, borrowing an example from cricket. This is meaningful,
where the distribution is Gaussian. However, if it is Lotkaian or Paretian, as is often the
case, a composite measure of correction for quality like (C^2/P)^(1/3) may be profitably used to
rank such performers. Figure 3 shows an example compiled quite tediously a few years ago
showing the performances of four leading Indian batsmen and how they compared with a
legendary great, Sir Donald Bradman of Australia. As seen, the distribution is identical to
what will obtain in bibliometrics, with tall cores and long tails. If a conventional h-index
evaluation is done, then Tendulkar ranks as the best in this list, and Bradman ranks last,
only because in those days, cricketers got to play fewer games in an active career
(Table 6). No self-respecting cricket enthusiast will accept such a conclusion. The h-index
procedure fails to recognize the very tall core and the very short tail implied by Bradman’s
career. However, the mock h procedure captures this very faithfully, restoring Bradman
rightfully to the top of the list.

- Rahul Siddharthan
  / December 7, 2011
  
  Thanks – your mock h looks very interesting. But it would have been nice if you had looked at a few more batsmen, at least the rest of the current top 5, i.e. Lara, Ponting, Kallis. Also, your h-score for Bradman (44) looks different from what I get (48): how did you calculate it? For Gavaskar we agree, for Ganguly I get 51 where you have 47. Tendulkar and Dravid are, of course, still playing (and Dravid has had a good run since you published your paper) so some difference is expected.
  
