Wednesday, 22 November 2017

An xG Timeline for Sevilla 3 Liverpool 3.

Expected goals is the most visible public manifestation of a data driven approach to analyzing a variety of footballing scenarios.

As with any metric (or subjective assessment, so beloved of Soccer Saturday) it is certainly flawed, but useful. It can be applied at a player or team level and can be used as the building block to both explain past performance or track and predict future levels of attainment.

Expected goals is at its most helpful when aggregated over a longer period of time to identify the quality of a side's process, and it may more accurately predict the course of future outcomes, rather than relying on the statistically noisier conclusions that arise from simply taking scorelines at face value.

However, it is understandable that xG is also frequently used to give a more nuanced view of a single game, despite the intrusion of heaps of randomness and the frequent tactical revisions that occur because of the state of the game.

Simple addition of the xG values for each goal attempt readily provides a process driven comparison against a final score, but this too has obvious, if easily mitigated flaws.

Two high quality chances, within seconds of each other, can hardly be seen as independent events, although a simple summation of xG values will fail to make the distinction.

There were two prime examples from Liverpool's entertaining 3-3 draw in Sevilla, last night.

Both Firmino goals followed on within seconds of another relatively high quality chance, the first falling to Wijnaldum, the second to Mane.

Liverpool may have been overwhelming their hosts in the first half hour, and they were alert enough to have Firmino on hand to pick up the pieces from two high quality failed chances, but a simple summation of these highly related chances must overstate Liverpool's dominance to a degree.

The easy way around this problem is to simulate highly dependent scoring events as such, preventing two goals from being scored from two chances separated by only a second or two.

It's also become commonplace to expand on the information provided by the cumulative xG "scoreline" by simulating all attempts in a game, with due allowance for connected events, to quote how frequently each team wins an iteration of this shooting contest and how often the game ends stalemated.
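This kind of shooting-contest simulation can be sketched in a few lines. The function and the xG values below are illustrative assumptions, not the actual model behind the figures quoted in this post; the key detail is that connected chances are passed as one group, from which at most one goal can be scored.

```python
import random

def simulate_match(home_groups, away_groups, n_sims=10_000, seed=42):
    """Simulate a match as a shooting contest.

    Attempts are passed as groups of xG values; attempts in the same
    group are connected (e.g. a rebound seconds after a shot), so at
    most one goal can come from each group.
    """
    rng = random.Random(seed)
    home_w = away_w = draw = 0
    for _ in range(n_sims):
        # any() short-circuits: once one attempt in a group scores,
        # the rest of the group is never "taken"
        h = sum(any(rng.random() < xg for xg in g) for g in home_groups)
        a = sum(any(rng.random() < xg for xg in g) for g in away_groups)
        if h > a:
            home_w += 1
        elif a > h:
            away_w += 1
        else:
            draw += 1
    return home_w / n_sims, draw / n_sims, away_w / n_sims

# Illustrative only: a connected double chance plus a lone good chance,
# versus a single penalty-quality attempt for the hosts.
p_win, p_draw, p_lose = simulate_match([[0.35, 0.30], [0.25]], [[0.76]])
```

Grouping the connected pair means its combined scoring probability is 1 - (1 - 0.35)(1 - 0.30), rather than allowing two goals from what was really one passage of play.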

Here's the xG shot map and cumulative totals from last night's match from the InfoGolApp.

There's a lot of useful information in the graphic. Liverpool outscored Sevilla in xG, they had over half a dozen high quality chances, some connected, compared to a single penalty and other, lower quality efforts for the hosts.

Once each attempt is simulated and the possible outcomes summed, Liverpool win just under 60% of these shooting contests, Sevilla 18%, with the remainder drawn.

Simulation is an alternative way of presenting xG outputs, rather than as simple totals. It accounts for connected events and for the variance inherent in lots of lower quality attempts compared to fewer, better chances, and it also describes the most likely match outcomes in a probabilistic way that some may be more comfortable with.

Liverpool "winning" 2.95-1.82 xG may be a more intuitive piece of information for some (although as we've seen it may be flawed by failing to adequately describe distributions and multiple, connected events), compared to Liverpool "winning" nearly six out of ten such contests.

None of this is ground breaking, I've been blogging about this type of application for xG figures for years, but there's no real reason why we need to wait until the final whistle to run such simulations of the attempts created in a game.

xG timelines have been used to show the accumulation of xG by each team as the game progresses, but suffer particularly from a failure to highlight connected chances.

In a simulation based alternative, I've run 10,000 attempt simulations of all attempts that had been taken up to a particular stage in last night's game.

I've then plotted the likelihood that either Liverpool or Sevilla would be leading, or the game would be level, based on the outcome of those attempt simulations.
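A minimal sketch of that running calculation follows. The helper name `state_probs` and the xG values are assumptions for illustration only; each entry in the event list is one connected group of chances, and the timeline is built by re-running the simulation after each new group.

```python
import random

def state_probs(attempt_groups, n_sims=10_000, seed=1):
    """P(home leads), P(level), P(away leads), given all connected
    attempt groups taken so far (at most one goal per group)."""
    rng = random.Random(seed)
    lead = level = 0
    for _ in range(n_sims):
        h = sum(any(rng.random() < xg for xg in g)
                for team, g in attempt_groups if team == "home")
        a = sum(any(rng.random() < xg for xg in g)
                for team, g in attempt_groups if team == "away")
        lead += h > a
        level += h == a
    return lead / n_sims, level / n_sims, 1 - (lead + level) / n_sims

# Illustrative xG values, not the real game data:
events = [
    ("home", [0.35, 0.15]),  # 1': Wijnaldum header + Firmino follow-up, connected
    ("away", [0.20]),        # 19': Nolito
    ("away", [0.15]),        # 19': Ben Yedder
]
# Re-run after each new attempt group to build the timeline
timeline = [state_probs(events[:i]) for i in range(1, len(events) + 1)]
```

Each point on the plotted timeline is then simply the (lead, level, trail) triple after the most recent attempt group.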

Liverpool's first dual attempt event came in the first minute: Wijnaldum's misplaced near post header, immediately followed by Firmino's far post shot.

Simulated as a single event, there's around a 45% chance Liverpool lead, 55% chance the game is still level and (not having had an attempt yet) a 0% chance Sevilla are ahead.

If you re-run the now four attempt simulation following Nolito's & Ben Yedder's efforts after 19 minutes, a draw is marginally the most likely current state of the game, followed by a lead for either team.

A flurry of high quality chances then made the Reds near 90% favourites to reach half time with a lead, enabling the half time question of whether Liverpool deservedly led to be answered with a near emphatic yes.

Sevilla's spirited, if generally low quality second half comeback does eat into Liverpool's likelihood of leading throughout the second half, but it was still a match that the visitors should have returned from with an average of around two UCL points.

Sunday, 22 October 2017

Excitement Quotas in the Premier League.

Excitement at a sporting event is a subjective measurement.

It doesn't quite equate to brilliance, as a 7-2 thrashing has to be appreciated for the excellence of the performance of one of the teams, but as the score differential climbs, morbid fascination takes over, at least for the uncommitted.

Nor does it tally with technical expertise. A delicately crafted passing movement doesn't quite set the pulse racing like a half scuffed close range shot that deflects off the keeper's knee and loops agonisingly over the bar with the game on the line.

You can attempt to quantify excitement using a couple of benchmark requirements.

The game should contain a fair number of dramatic moments that potentially might have changed the course of the outcome or actually do lead to a significant alteration to the score.

It's easy to measure the change in win probability associated with an actual goal.

A goal that breaks a tied game in the final minutes will advance the chances of the scoring team by a significant amount, whilst the seventh goal in a 7-2 win merely rubs salt into the goal difference of the defeated side.

Spurned chances at significant junctures are only slightly more difficult to quantify.

You can take a probabilistic view and attach the likelihood that a chance was taken based on the chance's expected goals figure to the effect that an actual goal would have had on the winning chances of each side.
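As a sketch of that calculation (the function name and all numbers below are illustrative assumptions, not the actual model behind the rankings): a goal is credited with the full win-probability swing it caused, while a miss is credited with the swing a goal would have caused, weighted by the chance's xG.

```python
def excitement_contribution(xg, wp_before, wp_if_goal, scored):
    """Win-probability swing credited to one attempt, from the
    attacking team's point of view.

    A goal earns the full swing; a miss earns the swing a goal would
    have caused, weighted by how likely the chance was to be scored (xG).
    """
    swing = abs(wp_if_goal - wp_before)
    return swing if scored else xg * swing

# A 0.4 xG chance spurned at 0-0 late on, where a goal would have
# lifted the attacker's win probability from 0.30 to 0.85:
print(round(excitement_contribution(0.4, 0.30, 0.85, scored=False), 2))  # 0.22
```

Summing these contributions over every attempt in a match gives a single "excitement" figure that rewards both late, game-swinging goals and big chances spurned while the result was still live.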

Summing the actual and probabilistic changes in win probability for each goal attempt in each match played in the 2016/17 Premier League season gives the five most "in the balance", chance laden matches from that season.

Top Five Games for Excitement 2016/17 Premier League

No surprise to see the Swansea/Palace game as the season's most exciting encounter, with Palace staging a late comeback, before an even later Swansea response claimed all three points in a nine goal thriller.

Overall, I've ranked each of the 380 matches from 2016/17 in order of excitement, as measured by the actual and potential outcomes of the chances created by each team in the game.

Bournemouth's games had the biggest share of late, game swinging goals, along with the most unconverted endeavour when the match was still in the balance.

While Tottenham, despite playing in the season's second most exciting game, a very late 3-2 win over West Ham, more typically romped away with games, leaving the thrill seekers looking for a match with more competitive balance to tune into.

Middlesbrough fans not only saw their side relegated, but they did so in rather bland encounters, as well.

Saturday, 14 October 2017

Player Projections. It's All About The Distribution Part 15

A couple of football analytics' little obsessions are correlations and extrapolations.

Many player metrics have been deemed flawed because they fail to correlate from one season to the next, but there are probably good reasons why the diminished sample sizes available for individuals lead to poor season on season correlation.

Simple random variation, injuries, a change of team mates or role within a club, and atypically small sample sizes often lead to see-sawing rate measurements; inevitably, players also age, and so can be on a very different career trajectory to others within the sample.

The problems associated with neglecting the age profile of a group of players when attempting to identify trends for use in future projections are easily demonstrated by looking at the playing time (as a proxy for ability) enjoyed by players aged 20 or 30 while members of a Premier League squad, and how that time altered in their 21st and 31st years.

The 30 year oldies played Premier League minutes equivalent to 15 full matches, falling to 12 matches in their 31st year. So they were still valued enough to play fairly regularly, but perhaps due to the onset of decline in their abilities they featured, on average, less than they had done.

The reverse, as you may have expected, was true for the younger players. They won the equivalent of seven full games in their 20th year and nine the following season.

It seems clear that if you want to project a player's abilities from one season to the next and playing time provides a decent talent proxy, you should expect improvement from the youngster and decline from the older pro.

However, as with many such problems, we might be guilty of attempting to impose a linear relationship onto a population that is much better defined by a distribution of possible outcomes.

The table above shows the range of minutes played by 21 and 31 year olds who had played 450 minutes or fewer in the previous season as 20 or 30 year old players.

As before, we may describe the change in playing time as an average. In this subset, the older players played very slightly more than they had as 30 year olds, improving from the equivalent of two full games to 2.2.

The younger players jump from 1.8 games to 3.6.

However, just as cumulative xG figures can hide very different distributions, particularly of big chances which subtly alter our expectation for different teams, the distribution of playing minutes that comprise the average change of playing time can be both heavily skewed and vary between the two groups.

Over three quarters of the 30 year olds didn't get on the field at all during the next Premier League season, and likewise around two thirds of the younger ones.

21% of young players played a similar amount of time to the previous season, between one and 450 minutes, compared to just 14% of the older ones. And 17% of youngsters exceeded the total from the previous season, as did just 10% of the veterans.

So if you use the baseline rate of increased playing time as a flat rate across all players that fall into these two categories in the future, you might be slightly disappointed, because overwhelmingly the experience of such players is one where they fail to play even a minute in the following season.
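The gap between the average uplift and the typical outcome can be shown with a toy sample whose shares mirror those quoted above for the older group (the minute totals themselves are made up for the sketch):

```python
import statistics

# Hypothetical next-season minutes for 100 players who logged <=450
# minutes in their age-30 season: ~76% play nothing, ~14% a similar
# amount, ~10% more than before.
minutes = [0] * 76 + [250] * 14 + [700] * 10

print(statistics.mean(minutes))    # a modest average uplift...
print(statistics.median(minutes))  # ...but the typical player gets nothing
```

A flat-rate projection applies the mean to everyone; the median shows that the most common single outcome is zero minutes, with the average propped up by a small minority who play on regularly.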

Knowing that there is, on average, an upside for these two groups of players, based on historical precedent, is a start; but knowing that 3 out of 4 of the oldies and 2 out of 3 of the youngsters in an historical sample didn't merit one minute's worth of play is also a fairly important, if not overriding, input.