The problem with the ICC and cricket data

On the 11th of June 2022 Renata de Sousa of Brazil was bowled by Asmita Kohli of Germany while playing a T20 International in Rwanda. The scorecards on both Cricinfo and CricketArchive agree on this, as does the commentary on Cricinfo. The problem is that it isn't true. Kohli did not bowl de Sousa; she had her caught at cover. The scorecards are incorrect (not for the first time), and nobody seems to have noticed. So how did I identify the issue?

I've recently added new validation checks for my Cricsheet ball-by-ball data to identify deliveries where the batter and non-striker are different from what would be expected. Each of these checks takes into account the runs (whether extras or off the bat) and wickets on each delivery, and adjusts the expected batters in response. If the batter and non-striker on a delivery are different from expected then an issue is reported and I investigate further. Most reported issues aren't actual errors, and generally occur because of short runs (as I've learned during this experimentation, there are a lot of short runs in international cricket), but every once in a while a real issue turns up. The de Sousa dismissal is one of those real issues.
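To give a feel for the idea, here is a minimal sketch of the strike-rotation part of such a check. It is written in Python for illustration and is not the actual Cricsheet code. The `Delivery` class and its `runs_run` and `last_of_over` fields are hypothetical, and it assumes you already know how many runs the batters physically ran on each ball (so boundaries and wide or no-ball penalties are excluded). Wickets are ignored entirely here.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    # Hypothetical record for one ball; not the Cricsheet data format.
    striker: str
    non_striker: str
    runs_run: int = 0           # runs physically run by the batters (assumption)
    last_of_over: bool = False  # True if this ball completes the over


def expected_next_pair(d):
    """Return the (striker, non_striker) expected on the following ball.

    Wickets are ignored in this sketch.
    """
    striker, non_striker = d.striker, d.non_striker
    if d.runs_run % 2 == 1:
        # An odd number of runs leaves the batters at opposite ends.
        striker, non_striker = non_striker, striker
    if d.last_of_over:
        # The bowling end changes, so the other batter takes strike.
        striker, non_striker = non_striker, striker
    return striker, non_striker


def find_mismatches(deliveries):
    """List (index, expected, actual) for each ball whose batters look wrong."""
    issues = []
    for i in range(len(deliveries) - 1):
        expected = expected_next_pair(deliveries[i])
        following = deliveries[i + 1]
        actual = (following.striker, following.non_striker)
        if expected != actual:
            issues.append((i + 1, expected, actual))
    return issues
```

Anything this returns is only a prompt to investigate, not proof of an error. A short run, for example, would show up as a mismatch even though the data is correct.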

As mentioned, I take wickets into account as part of the new validation, and one of the checks I've implemented is to recognise that when a player is dismissed in certain ways (bowled, lbw, stumped, hit wicket, or hit the ball twice) the next batter in should end up at the same end as the dismissed player. This means that the new batter should face the next delivery if the over continues, or be the non-striker if the over ends with the wicket. If on that next delivery the batter or non-striker is different then something has gone awry. This is how I identified the issue with the de Sousa dismissal. While de Sousa was listed as bowled (in the middle of the over), the new batter did not face the following delivery. Instead the original non-striker (Avery) was facing, which prompted further investigation.
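The rule above can be sketched in a few lines. Again this is an illustrative Python sketch rather than the real script. The function name, its parameters, and the dismissal strings are all assumptions made for the example, and it assumes the dismissed player was the striker, which is the case for these kinds of dismissal.

```python
# Dismissals after which the new batter should take the dismissed player's end.
# The exact strings are an assumption for this example.
SAME_END_DISMISSALS = {"bowled", "lbw", "stumped", "hit wicket", "hit the ball twice"}


def new_batter_position_ok(kind, non_striker, last_of_over,
                           next_striker, next_non_striker):
    """Return False if the batters on the next ball contradict the dismissal.

    `non_striker` is the surviving batter at the time of the wicket.
    """
    if kind not in SAME_END_DISMISSALS:
        # Batters may have crossed (e.g. caught, run out); nothing to flag here.
        return True
    if last_of_over:
        # Over ended with the wicket: the survivor should now be on strike.
        return next_striker == non_striker and next_non_striker != non_striker
    # Over continues: the new batter faces, the survivor stays at the other end.
    return next_non_striker == non_striker and next_striker != non_striker
```

For the de Sousa wicket as originally recorded, `new_batter_position_ok("bowled", "Avery", False, "Avery", "new batter")` returns `False`, which is the kind of flag that led me to the video. Once the dismissal is changed to caught, the same call returns `True`.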

Thankfully the full match, including the suspect dismissal, is available to watch online as the entire tournament it was part of was streamed and archived. This allowed me to easily find the wicket and see that JE Ronalds took the catch at cover to dismiss de Sousa. As batters can cross during a catch it made sense that the non-striker could be on strike for the next ball, and I confirmed that was correct in the video. Issue resolved, and the data updated to match what had actually happened.

I'd like to make clear that I'm not criticising the Kwibuka tournament. An international tournament featuring 8 countries outside the "full members" from 3 different continents with live streaming coverage and a full archive of all of the matches? That's absolutely outstanding! It's also massively more impressive than anything the ICC do. The fact that the video is available is the sole reason I can actually correct the issue. Errors can creep in, and I don't blame the organisers for that. I blame the ICC.

Cricsheet, the project I've run since 2009, provides ball-by-ball match data for just over 12,000 cricket matches (as I write). I really want to continue to provide this ball-by-ball data for all levels of international cricket. However, as shown, it's becoming hard to fully trust the information for many matches. This issue is not helped by the fact that the ICC's own coverage is getting worse. The ICC generally don't make the matches from their events available after they are played (if they even manage to get streaming working at all), don't include the full match coverage on their own streaming site, and often don't even mention official internationals on their own website. With the Kwibuka tournament I had the chance to find and correct the error; with an ICC event I often just have an unresolvable issue.

The example I've raised here is a small enough error, but it could have been detected automatically if the ICC cared to do so. I'm one person using a hacked-together Ruby script to check ball-by-ball data, and I detected the possible issue. The ICC are the body in charge of international cricket, and could and should be doing this themselves, but they don't appear to care.

I've manually re-scored a number of recent matches (while hunting down ball-by-ball issues) but this is dispiriting to have to do. It angers me that the ICC don't ensure that proper standards are being followed, don't put the most basic checks in place for matches, and don't seem to do anything to correct ongoing issues. I see other types of issues on a regular basis, such as matches having incorrect innings totals, balls (and the consequent runs) being assigned to the wrong batters, and random runs being added to totals. For a sport that claims to value records so much, it's appalling that the organisation that "governs and administrates the game" lets errors enter the record books without even the most rudimentary validation.

What can be done about this example problem, and the general issue of inaccurate data? A fair question, and not one I necessarily have all (or any) of the answers for. I'm sure there are people who would be better qualified than I am to solve this conundrum (if the will were there), but I do have a few suggestions that may help. I'm not a natural writer, and this has already taken far longer to write than I would have liked, so I'll leave those suggestions for another post and write them up in the next few weeks.
