
Saturday, 23 September 2017

30 Year old Messi is Likely in Decline.

'Tis the season for small sample sized hyperbole to be liberally launched on an expectant audience and the latest recipient of the "If he continues at this rate" award for unrealistic dreamland is none other than Lionel Messi.

While Ronaldo has been kicking his heels and the occasional Real Betis player, Messi has single-handedly (with the help of 10 teammates) launched Barcelona seven points clear of their perennial rivals from Madrid.

Messi turned 30 in the close season, is playing in his 14th La Liga season and is undoubtedly one of the two best players of the last decade.

But he is still human and bound by the natural athletic decline that eventually sets in for every footballer.

Players improve with maturity and experience, peak, usually in their late twenties and then begin an inexorable decline, albeit from differing peaks.

Messi's post birthday, six game return in the UCL and La Liga, but discounting a two legged Spanish Super Cup defeat at the hands of Ronaldo's Madrid, has been spectacular, even by his standards.

It has spawned at least one article, liberally salted with stats to enhance credibility, eagerly anticipating the untold riches to come.

Unfortunately, five or six games is so small that you will inevitably get extremes of performance, either very good or very bad.

This is particularly true if you selectively top and tail the games to eliminate a comprehensive defeat at the hands of your nearest rivals, devoid of any Messi goals from open play, but conclude with a three goal open play performance from the Argentine.

Small samples are noisy, unbalanced and rarely definitively indicative of what will happen in the longer term or even just a single season.
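A minimal sketch of why a handful of games throws up extremes. It assumes goals arrive as a Poisson process with a constant per game rate, which is a simplification, and the 0.8 goals per game figure is purely illustrative rather than anyone's actual scoring rate.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_rates(rate, games, trials, seed=1):
    """Simulate `trials` runs of `games` matches; return goals per game for each run."""
    rng = random.Random(seed)
    return [
        sum(poisson_draw(rate, rng) for _ in range(games)) / games
        for _ in range(trials)
    ]


# Hypothetical true scoring rate of 0.8 goals per game (illustrative only).
short_runs = sample_rates(0.8, games=6, trials=5000)
full_seasons = sample_rates(0.8, games=38, trials=5000)

print("spread of goals per game over 6 games: ", round(statistics.pstdev(short_runs), 3))
print("spread of goals per game over 38 games:", round(statistics.pstdev(full_seasons), 3))
```

The underlying rate never changes between runs, yet the six game samples scatter far more widely around it than the 38 game ones.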

Barcelona has played Alaves, Eibar, Getafe, Espanyol and Betis, and only the last of these is currently higher than 13th.

As a data point it is all but useless to project Messi's 2017/18 season.

Individual careers are statistically noisy. Injury, shifted positional play and team mate churn are just some of the factors that can make for an atypical seasonal return, even before we try to decide which metric is sufficiently robust to reflect individual performance.

If we use goals and assists to judge Messi up to his 30th birthday, his delta, the change in non penalty goals and assists per 90 from the previous season to the current one, begins to trend negative when Messi was 27, so our guesstimate is that this was when he peaked.

If we include 2017/18's small sample sized explosion as a fully developed rate for this upcoming season, the trendline still becomes negative this year.

If we regress this current hot rate towards Messi's most recent deltas, as we should, Messi's peak stretches to his 28th birthday.

But by his own standards he has likely peaked.
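The delta approach can be sketched as follows. This is one reading of the method, not a reproduction of the original calculation: it fits an ordinary least squares line through the season on season deltas and reports the age at which that line crosses zero. The career arc used to exercise it is made up and is not Messi's record.

```python
def season_deltas(rates):
    """Change in per-90 output from each season to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def age_where_delta_turns_negative(ages, rates):
    """Age at which the fitted delta trendline crosses zero.

    Each delta is tagged with the age of the later of the two seasons.
    """
    slope, intercept = fit_line(ages[1:], season_deltas(rates))
    return -intercept / slope


# Made-up career arc (hypothetical numbers), used only to exercise the functions.
ages = list(range(20, 30))
rates = [1.0 - 0.02 * (age - 24) ** 2 for age in ages]
print(round(age_where_delta_turns_negative(ages, rates), 2))
```

Adding a hot small sample season as one more point in `rates` shifts the crossing point, which is why the choice of how heavily to regress that season matters.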

Open play goals and expected goals for the last three seasons and the first five games of 2017/18 tell a similar story of gentle decline, even allowing for Messi's recent spurt of scoring.


Actual, non penalty, open play goals/90 are trending downwards, as are Messi's xG per 90 on a 10 game rolling average.
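A 10 game rolling per 90 figure can be sketched like this. The function name and the choice to weight by minutes played, rather than averaging each game's own per 90 rate, are assumptions made for illustration.

```python
def rolling_per90(values, minutes, window=10):
    """Per-90 rate over each run of `window` consecutive games.

    `values` holds a per-game total (goals or xG) and `minutes` the
    minutes played in the same games, in chronological order.
    """
    rates = []
    for end in range(window, len(values) + 1):
        total = sum(values[end - window:end])
        played = sum(minutes[end - window:end])
        rates.append(90 * total / played)
    return rates


# Twelve hypothetical full games with one goal in each give three windows.
print(rolling_per90([1] * 12, [90] * 12))
```

Because each window shares nine games with its neighbour, a single three goal game lifts ten consecutive points on the plot, which is worth remembering when eyeballing where a trendline starts or finishes.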

The actual trendline is also probably shallower because of the narrative driven choice of his three open play goal spree against Eibar providing the doorstop.

That Messi consistently over performs the average player's xG isn't surprising, but the peaks, like the one he's currently enjoying, are often driven by a glut of relegation threatened sides turning up in Barcelona's lumpy quality of schedule.

Enjoy the blips, but don't draw conclusions based on so little evidence.

Data from Infogolapp.

Sunday, 10 September 2017

Messi and Ronaldo. Expected Goals Makers, Takers or a Bit of Both.

With the increased availability of granular data, there has been a similar influx of advanced metrics, both for players and sides across a wider range of domestic leagues.

And while performance based numbers, often to a couple of decimal places, are the raw material for much of the analytically based content, their attractiveness and clarity of meaning rarely extend beyond the spreadsheet.

It therefore falls to visualisations to convey some of the rich seams of information available in such manipulated data sets in a clear and easily digestible format, such as Ted Knutson's  hugely popular radars.

Expected goals remain the flavour of the month, although BBC pundits are still immune, imploring players to "do better" with opportunities that are scored fewer than one time in 10.

A team or individual's attacking contribution can be neatly summarised by their expected goals and assists, standardised at least to a per 90 figure, with respect given to those who have achieved their numbers over a larger sample size compared to noisy small sample interlopers, ripe for regression.


Here's the xG/90 and xA/90 for the 70 largest cumulative, goal involvement achievers from La Liga's 2016/17 season.

Data is from @InfogolApp and has been restricted to open play chances and assists.

Messi and Ronaldo are among a clutch of players who have broken away from the main body of the plot, although they are also quite a distance removed from each other.

Messi was involved in around 0.85 xG+xA per 90 and Ronaldo around 0.65.

However, the former, while slightly under-performing against the latter in getting on the end of xG scoring chances, more than compensated by creating over double the amount of expected assists per 90.

So a simple scatter plot can begin to reveal fundamental differences between even the most high profile of players.

More information can be extracted by simply running a straight line between a particular player's point on a scatter graph and the origin.

Moving down such a line, you'll encounter players who, in the season under scrutiny, achieved ratios for xG and xA that closely resemble those of the line owning player.

The magnitude of their cumulative performance is less than those players that are further away from the origin, but their shot/assist characteristics will be consistent with any near neighbours.
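The line through the origin idea can be sketched by turning each player's point into a size and a direction. The axis orientation (xG/90 across, xA/90 up), the five degree tolerance and the player values are all assumptions for illustration, not figures from the plot.

```python
import math


def profile(xg90, xa90):
    """Return (total involvement per 90, angle in degrees from the xG axis).

    An angle near 0 marks a taker, near 90 a maker.
    """
    return xg90 + xa90, math.degrees(math.atan2(xa90, xg90))


def same_line(player, others, tolerance_deg=5.0):
    """Names of players whose xG:xA mix lies close to the given player's line."""
    _, target = profile(*player)
    return [
        name
        for name, (xg90, xa90) in others.items()
        if abs(profile(xg90, xa90)[1] - target) <= tolerance_deg
    ]


# Hypothetical players: one shares the 2:1 taker mix, the other is a maker.
squad = {"smaller taker": (0.2, 0.1), "maker": (0.1, 0.4)}
print(same_line((0.4, 0.2), squad))
```

Players returned by `same_line` share the shape of the contribution; the total from `profile` then separates the heavy hitters from the bit part players on the same line.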

Messi was a more sharing team mate in open play in 2016/17, whereas Ronaldo headed the line of takers, rather than makers.

Friday, 8 September 2017

Shot Blocking and the State of the Game.

It has long been appreciated that the dynamics of a game subtly alter as time elapses, scorelines change or remain the same and pre match expectations are met, exceeded or under shot.

This shifting environment has traditionally been investigated using the simple measure of the current score.

This has been unfortunately labelled as game state, when simply "score differential" would have succinctly described the underlying benchmark being applied, without hinting at a more nuanced approach than just subtracting one score from another.

As I blogged here, the problem is most acute when lumping the not uncommon, stalemated matches together.


Consider a game between a strong favourite and an outsider that finishes goalless.

Whereas the latter more than matches their pregame expectation, the former falls disappointingly short of theirs.

The average expectation at any point in a game can be represented in a number of ways, but perhaps the most intuitive is an estimation of the average number of points a team will pick up based on the relative strengths of themselves and their opponent, at the current scoreline and with the time that remains.

The plot above shows the relative movement of the expected points for a strong favourite playing weaker opposition to a 0-0 conclusion.

The favourite would expect to average around 2.5 points per match up at kick off, decaying exponentially to one actual point at full time.

So at any point in the match we can measure the favourite's current expectation compared to their pregame benchmark and use this to describe their own level of satisfaction with the state of the game.
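A minimal sketch of an expected points calculation of this kind. It assumes each side scores as an independent Poisson process at a constant rate over 90 minutes, which may differ from the model behind the plot, and the scoring rates of 2.4 and 0.6 goals per game are hypothetical stand-ins for a strong favourite and an outsider.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when `mean` goals are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_points(rate_for, rate_against, minutes_played, goal_diff=0, max_goals=15):
    """Average points a team collects from the current scoreline and time left."""
    remaining = max(0.0, 1 - minutes_played / 90)
    mean_for = rate_for * remaining
    mean_against = rate_against * remaining
    win = draw = 0.0
    for scored in range(max_goals + 1):
        for conceded in range(max_goals + 1):
            chance = poisson_pmf(scored, mean_for) * poisson_pmf(conceded, mean_against)
            final_diff = goal_diff + scored - conceded
            if final_diff > 0:
                win += chance
            elif final_diff == 0:
                draw += chance
    return 3 * win + draw


# Favourite's expectation, and its shift from kick off, as a 0-0 game goes on.
pregame = expected_points(2.4, 0.6, minutes_played=0)
for minute in (0, 30, 60, 90):
    now = expected_points(2.4, 0.6, minute)
    print(minute, round(now, 2), round(now - pregame, 2))
```

Swapping the two rates gives the outsider's view, where the same goalless scoreline produces a steadily improving figure.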

Game state would be preferable, but that's already taken.

The same is true for the outsider. Their state of the game gradually increases compared to their much reduced pregame expectation.

Although the game is scoreless throughout for each side, things are getting progressively worse for the favourite and better for their opponents.

We can use these shifting state of the game environments to see if they have an effect on in game actions.

Intuitively you would expect the team doing less well compared to their expectations to gradually commit more resources to attack, in turn forcing their opponents onto the defensive.

This may increase shot volume for the former, but it is also likely that these attempts, particularly from open play will fall victim to more defensive actions, such as blocks.

The reverse would seem likely to be true for the weaker team. Although their shot count may fall, with less defensive duties being carried out by their opponents, their sparser shot count may evade more defensive interventions, again such as blocks.


Here's what the modelled fate of a shot from regular play from just outside the penalty area in a fairly central position looks like between two unequal teams as the match progresses.

Data is from a Premier League season via @infogolApp

In building the model, the decay in initial expectation has been used to describe the state of the game for the attacking team when each individual shot was attempted, rather than simply using score differential.

Initially the weaker team is less likely to have their shot blocked, although it is probably more accurate to say that the favoured side is more likely to suffer this fate.

As the game progresses, the better team sees a slight increase in the likelihood that a shot from just outside the box is blocked, perhaps suggesting that their opponents are initially heavily committed to a defensive structure.

The weaker side has a lower initial likelihood that such a shot is blocked, again implying a more normal amount of defensive pressure early in the game. But as the match progresses this likelihood that their shots are blocked falls even more.

This nuanced model appears to be illustrating the classic potential for a prolonged rearguard action from an underdog, followed by a late smash and grab opening goal, mitigated by the relative shot counts from each team.


Tuesday, 5 September 2017

Premier League Defensive Profiles.

Heat maps and the like have been around for ages as a way of visualising the sphere of a particular player's influence.

However, it's always nice to have some numerical input to work with, so I've used the Opta event data that powers InfoGol's xG and in-running app to develop metrics that describe how teams and individuals contribute over a season.

Defensive metrics have lagged well behind goals and assists, so I looked at that neglected side of the ball.

Unlike goal attempts, counting defensive stats tends to be a fairly futile exercise. No one willingly wants to keep making last ditch tackles, and racking up ever higher defensive events is more often the sign of a team in trouble.

There's also the disparity in possession time which gives the possession poor team more chances to accrue defensive events.

Therefore, pitch position, rather than bulk events seems an obvious alternative.

Allowing a side lots of touches deep in your territory is intuitively a bad idea and the higher up the field a side is willing or able to engage their opponent would appear preferable.

Measurements have been calculated from the Opta X, Y point of an event to the centre of a team's own goal line.

Thus a tackle or clearance made on the half way line will be further from this point of reference if it is made near the touchline compared to if it is made on the centre spot.
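A minimal sketch of the measurement. It assumes event coordinates run from 0 to 100 along each axis with the team's own goal line at x = 0, and it assumes a pitch of 115 by 74 yards; real pitches vary, so both constants are illustrative.

```python
import math

# Assumed pitch dimensions in yards (real pitches differ slightly).
PITCH_LENGTH_YARDS = 115.0
PITCH_WIDTH_YARDS = 74.0


def distance_from_own_goal(x_pct, y_pct):
    """Yards from an event at (x_pct, y_pct) to the centre of the team's own goal line."""
    along = x_pct / 100 * PITCH_LENGTH_YARDS
    across = (y_pct - 50) / 100 * PITCH_WIDTH_YARDS
    return math.hypot(along, across)


print(round(distance_from_own_goal(50, 50), 1))  # centre spot
print(round(distance_from_own_goal(50, 0), 1))   # half way line at the touchline
```

The second figure is the larger of the two, which is the touchline effect described above.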

This allows for defensive event profiles for both a team and also their opponents.


A quick eye test appears to show that the more successful Premier League teams do their defending further away from their own goal than the lesser sides are either willing or able to do.

The idea that doing defensive stuff higher up the pitch is the product of a good team is further developed by plotting where a side defends on average and where they allow their opponents to defend, again on average.


The relegated teams from 2016/17 mostly suffered the double whammy of choosing or having to defend an average of around 34 yards from the centre of their own goal line, compared to nearly 40 yards for some of the top 6, and they also allowed their opponents the luxury of making defensive actions around 38 yards from their own goal line.

Notably Pulis again muscles into an area apparently reserved for relegation fodder with his defensive voodoo.

At a player level it's a trivial problem to find the average pitch position where he makes a defensive action and then find how closely or far flung each individual action is from this average point.

These numbers can then be used as the average position for a player's defensive contribution, measured from the centre of his own goal and also how widely this area extends to.
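The two player level numbers can be sketched as below. This takes one reading of the method, measuring depth as the distance from the goal centre to the average position and spread as the mean distance of each action from that average position. Events are assumed to be already converted to yards with the own goal centre at (0, 0), and the four events are hypothetical.

```python
import math


def defensive_profile(events):
    """Return (depth, spread) in yards for a list of (along, across) events.

    depth  : distance from the own goal centre to the average event position
    spread : average distance between each event and that average position
    """
    count = len(events)
    centre_along = sum(along for along, _ in events) / count
    centre_across = sum(across for _, across in events) / count
    depth = math.hypot(centre_along, centre_across)
    spread = sum(
        math.hypot(along - centre_along, across - centre_across)
        for along, across in events
    ) / count
    return depth, spread


# Four hypothetical defensive actions arranged around a point 45 yards out.
actions = [(40, -10), (40, 10), (50, -10), (50, 10)]
depth, spread = defensive_profile(actions)
print(round(depth, 1), round(spread, 1))
```

A deep lying, disciplined player shows a small depth and a small spread, while a roaming midfielder shows larger values of both.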

N'Golo Kante's an obvious candidate to see if this simple exercise again passes the eye test.

In 2016/17 the average pitch position for Kante's defensive actions was 45 yards from his own goal.

The average distance between this average position and all the defensive actions he made was 23 yards.

The latter was greater than the average for all defensive midfielders as a group.

We could perhaps say that Kante was relatively advanced in his defensive actions (he was seven yards further up field than his former team mate Nemanja Matic) and his field of influence was also more expansive compared again to Matic and his peers.

Charlie Adam, by contrast appears more constrained by the role required from him. In 2016/17 he tackled deeper than both Kante and Matic and strayed less far afield.

He more resembled a disciplined central defender in his defensive foraging and in doing so remained roughly where his energy bar lands on the pitch around the 70th minute.



Wednesday, 23 August 2017

Chance Quality From 1999.

Back in the late 90s when Gazza's career was on the wane and what might become football analytics was mainly done in public on gambling newsgroups, shot numbers were the new big thing.

"Goal expectation", calculated from a weighted and smoothed average from a side's actual number of goals from their last x number of matches, was often the raw material to use to work out the chances of Premier League high flyers, Leeds beating mid table Tottenham.

Shot numbers (which included headers) then became the new ingredient to throw into the mix and a team's shooting efficiency quickly became a go to stat.

Multi stage precursors to goal expectation models were further developed when shot data became available which was broken down into blocks, misses and on target attempts.

To score, a side had to avoid having their shots blocked, then get them on target and finally beat David James.

This new data allowed you to attach team specific probabilities to each stage of progression towards a goal and arrive at a probabilistic estimate of a team's conversion rate per attempt.

Unlike today's xG number, the figure told you nothing specific about a single shot, nor was it particularly useful in helping to describe the outcome of a single game, even with double digit attempts.

Aggregated over a larger series of matches by necessity, this nuanced conversion rate, that included information about a side's ability to avoid blocks, get their efforts on target and thereafter into the goal, allowed you to deduce something about a side's preferred attacking and defensive style.

Also if that preference persisted over seasons, this team specific conversion rate could be used alongside each team's raw shot count in the recent past to create a novel, up to date and hopefully predictive set of defensive and attacking performance ratings.

Paper and pencil only lasts slightly longer than today's hard drive, so unfortunately I don't have any "goal expectation" figures for Liverpool circa 2002.

However, with the additional, detailed data from 2017, I decided to re-run these turn of the century, slightly flawed goal expectation models to see if these old school, team specific conversion rates offer anything in today's more data rich climate.

To distinguish them from today's xG I've re named the output as "chance quality".


Chance quality is an averaged likelihood that a side would negotiate the three stages needed to score.
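The three stages can be sketched from raw counts as follows. The function names and the season totals are hypothetical, and this is only the bare multiplication; any weighting or smoothing applied in the original models is not reproduced here.

```python
def stage_rates(attempts, blocked, on_target, goals):
    """Team-specific probability of clearing each of the three stages."""
    unblocked = attempts - blocked
    return {
        "avoid_block": unblocked / attempts,
        "on_target": on_target / unblocked,
        "beat_keeper": goals / on_target,
    }


def chance_quality(rates):
    """Likelihood that a single attempt survives all three stages."""
    quality = 1.0
    for probability in rates.values():
        quality *= probability
    return quality


# Hypothetical season: 500 attempts, 125 blocked, 150 on target, 50 goals.
rates = stage_rates(attempts=500, blocked=125, on_target=150, goals=50)
print(rates)
print(round(chance_quality(rates), 3))
```

When all three rates come from the same raw counts the product collapses to goals per attempt, so the extra information lies in the individual stage rates, which show where in the sequence a side gains or loses its edge.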

Arsenal had the highest average chance quality per attempt in 2015/16.

The Gunners were amongst the most likely to avoid having their attempts blocked, those that weren't blocked were most likely to be on target and those that were on target were most likely to result in a goal.

Leicester, in their title winning season also created high quality chances per attempt, but Tottenham appeared to opt for quantity versus quality. They were mid table for avoiding blocks and finding their target, but their on target attempts were, on average among the least likely to result in a goal.

Only Palace of the surviving sides were less likely to score with an on target attempt than Spurs.

 

Here's the same chance quality per attempt, but for attempts allowed, rather than created by the non relegated teams from the 2015/16 season.

The final two columns compare the estimated goal totals for each team using their shot count in that season and their conversion, chance quality from the previous year, to their actual values.

The thinking back in 2000 was that conversion rate from a previous season remained fairly consistent into the next season and so multiplying a side's chance quality by the number of shots they subsequently took or allowed would give a less statistically noisy estimate of their true scoring abilities.

Here's the correlation between the estimated and actual totals using chance quality from 2015/16 and shot numbers from 2016/17 to predict actual goals from 2016/17.


 


There does appear to be a correlation between average chance quality in a previous year, attempts made the next season and actual goals scored or allowed.

The correlation is stronger on the defensive side of the ball, perhaps suggesting less tinkering with the back 3, 4 or 5.
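The estimate and the check against actual totals can be sketched like this. The three teams' figures are hypothetical placeholders, not values from the tables, and a plain Pearson correlation is assumed as the measure.

```python
import math


def estimated_goals(chance_quality, attempts):
    """Goal estimate from last season's chance quality and this season's attempts."""
    return chance_quality * attempts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical teams: (previous season chance quality, attempts, actual goals).
teams = [(0.12, 560, 70), (0.10, 480, 45), (0.09, 400, 40)]
estimates = [estimated_goals(quality, attempts) for quality, attempts, _ in teams]
actuals = [goals for _, _, goals in teams]
print([round(value, 1) for value in estimates], round(pearson(estimates, actuals), 2))
```

The same two functions apply to the defensive side by feeding in chance quality allowed and attempts conceded.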

With full match video extremely rare in 2000, it might have been tempting to assume chance quality had remained relatively similar for most sides and any discrepancy between actual and predicted was largely a product of randomness.

Fortunately, greater access to granular data, availability of extensive match highlights and Pulisball, as a primitive benchmark for tactical extremes, have made it easier to recognise that tactical approaches and chance quality often vary, particularly if there is managerial change.

In this post I compared the distribution of xG for Stoke under Pulis' iron grip (fewer, but high chance quality attempts) and his successor Mark Hughes (higher attempt volumes, but lower quality attempts).

Subsequently, under Hughes, Stoke have tended to morph towards the Hughes ideal and away from Pulis' more occasional six yard box offensive free for all.

So a change of manager could lead to a genuine increase or decrease in average chance quality, which in turn might well alter a side's number of attempts. And any use of an updated version of chance quality should come with this important caveat.

For anyone who wants to party like it's 1999, here's the average chance quality per attempt from the 2016/17 season using this pre-Twitter methodology allied to present day location and shot type information.



Use them as a decent multiplier along with shot counts to produce a proxy for the more detailed cumulative xG now available during the upcoming season or as a new data point to assist in describing a side's tactical evolution across seasons.

In 2016/17, Crystal Palace improved their chance quality compared to 2015/16 with half a season of Allardyce and Arsenal maintained their reputation for trying to walk the ball into the net.

All data is from infogolApp, where 2017 expected goals are used to predict and rate the performance of teams in a variety of leagues and competitions.

Monday, 14 August 2017

Liverpool's Split Personality

Everyone likes a good mystery and Constantinos Chappas provided the raw material for a great one when he posted this breakdown of Liverpool's points per game performance in 2016/17 against the six teams from Everton and above and against the remaining 13 sides.


It's a great piece of work from Constantinos and Liverpool's split personality when playing very well against title contenders and Everton compared to when they do less well against lower class teams has generated much speculation.

The explanations have generally fallen into two mutually exclusive groups, either narrative based tactical flaws of Klopp and Liverpool or odds based simulations that attempt to explain away the split as mere randomness.

It is unlikely that either approach will wholly account for Liverpool's apparent failure to dispatch mid and lower table teams with the authority they appeared to preserve for the league's stronger sides.

Football is awash with randomness as well as tactical nuances, so it seems much more likely that a combination of factors will have contributed to the 2016/17 season.

It's a simple task to simulate multiple seasons, often using bookmaker's odds as a proxy for team strength to arrive at the chances that a side, not necessarily Liverpool might exhibit a split personality.

However, it's a stretch to then conclude that either chance was the overriding factor or it can be excluded as a cause merely because this likelihood falls above or below an arbitrary level of certainty.

There is so much data swirling around football at the moment, particularly ExpG, that it seems helpful to use these numbers to shed some light on Constantinos' intriguing observation.

Rather than a pregame bookmaker's estimate of a side's chance, we have access to ExpG figures for all of Liverpool's 2016/17 matches.

ExpG have arisen from the tactical and talent based interaction that took place on the field and spread over 90+ minutes of all 38 games they perhaps provide a larger sample of events with which to explain a series of game outcomes, rather than simply using 38 individual sets of match odds, however skillfully assembled.

One aspect of a low scoring sport, such as football, where ExpG struggles is how teams adopt different approaches to achieve the aim of winning the most available number of points.

A side may take a fairly comfortable lead early in a contest and then choose to commit more to defence against a weaker or numerically deficient opponent.

An extreme case was Burnley's win over Chelsea, where early actual goals allowed the visitors to concede large amounts of ExpG and just few enough actual ones to handsomely lose the ExpG contest, but win the match.

ExpG figures are inevitably tainted by actual real events, such as goals and red cards, but they are still at their most useful when used in conjunction with simulations to attempt to describe the range and likelihood of particular events occurring.

Scoring first (and 2nd and 3rd, along with Chelsea going down to 10 men) was a big assistance to Burnley and Andrew Beasley has written about the importance of the first goal here, for Pinnacle.

If we look at the size of the ExpG figures for all goal attempts in a game and the order in which they arrived, there may be enough data that is not distorted by actual events to estimate which side was most likely to open the scoring, allowing them then to be able to more readily dictate how the game evolves.


In games against the 13 lowest finishing teams, Liverpool took the initial lead 16 times, compared to a most likely figure of 15.

With the interaction of attempts allowed and taken, Liverpool ended up 1-0 to the good or bad or goalless throughout about as often as their process deserved.


They fared much better against the top teams.

In those 12 games Liverpool took the 1-0 lead nine times compared to a most likely expectation of just six based on the ExpG in their games.

It was around a 7% chance that an average team repeats this if Liverpool carve out and allow the chances for them.
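A first goal calculation of this kind can be sketched as below. It assumes every attempt is an independent trial that succeeds with probability equal to its ExpG, and works through the attempts in the order they arrived. The attempt values and per game probabilities are hypothetical, and the season level helper assumes games are independent of each other.

```python
def first_scorer_probabilities(attempts):
    """Chance that each team opens the scoring, plus the chance of no goal.

    `attempts` is a chronological list of (team, expg) pairs for one game.
    """
    chances = {}
    still_goalless = 1.0
    for team, expg in attempts:
        chances[team] = chances.get(team, 0.0) + still_goalless * expg
        still_goalless *= 1 - expg
    chances["no goal"] = still_goalless
    return chances


def lead_count_distribution(per_game_chances):
    """Probability of taking the first lead in exactly 0, 1, 2, ... of the games."""
    distribution = [1.0]
    for chance in per_game_chances:
        updated = [0.0] * (len(distribution) + 1)
        for count, mass in enumerate(distribution):
            updated[count] += mass * (1 - chance)
            updated[count + 1] += mass * chance
        distribution = updated
    return distribution


# Hypothetical three-attempt game, then a hypothetical 12 game set.
print(first_scorer_probabilities([("home", 0.3), ("away", 0.1), ("home", 0.05)]))
season = lead_count_distribution([0.5] * 12)
most_likely = max(range(len(season)), key=season.__getitem__)
print(most_likely, round(sum(season[9:]), 3))
```

The position of the largest entry in the distribution is the "most likely figure" and the sum of the upper tail gives the chance of matching or beating an observed count.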

It's understandable to look to the heights that may be achieved, rather than the lowly foothills left behind.

But based on Liverpool's 2016/17 process from an ExpG and first goal perspective, perhaps their relatively disappointing record against lower grade sides is not the outlier, but rather their exceptional top 6 results are.

Scoring fewer first goals than they actually did in these top of the table clashes would likely decrease their ppg in these games, while inevitably increasing those of their six challengers.

This would shift the top six group gradually to the right in the initial plot and Liverpool slightly more substantially to the left until they perhaps formed a more homogenous group with no outlier.

It's traditional to wind up with "nothing to see, randomness wins again", particularly when one set of data is taken from a small, extreme inducing sample of just 12 inter connected matches per team.

But we now have the data, a place to look and video to see if there is some on pitch, if possibly transient cause to the effect of Liverpool finding the net first in big games or if the usual suspect in Constantinos'  mystery does indeed turn out to be the major guilty party.

All data from @InfoGolApp

Tuesday, 8 August 2017

"It's All about The Distribution Part 2"

First the disclaimer, this isn't a "smart after the event" explanation for Leicester's title season.

It is a list of the occasional, nasty or pleasant surprises that can occur and the limitations of trying to second guess these when using a linear, ratings based model.

Models built around numbers and averages do work extremely well for the majority of teams in the majority of seasons.

But as the financial world found to the cost of others, neglecting distributions, especially ones that appear normal, but hide fatter than usual tails can leave you unprepared for the once in a lifetime event.

The previous post looked at a hypothetical five team scenario, where the lowest rated, but under exposed side had a much better chance of winning a contest than implied by the respective ratings, simply because the distribution of potential ratings was markedly different for this side.

Again, full disclosure, this model wasn't from football, it was a five runner race run at Uttoxeter and Team 5 was actually a very lightly raced horse against exposed rivals.

I assumed that the idea that distributions of potential performance sometimes matter also carries over into football and the obvious example of an unconsidered team taking a league by storm was Leicester's 2015/16 title winning season.

I went back to 2014/15 and produced some very simple expected goals ratings for all 20 sides going into the 2015/16 season.

I also looked at how diverse and spread out the performance ratings from 2014/15 were for each side.

The three teams whose performances had fluctuated most, and so might be considered as having a bit more meat in their distribution tails and be less likely to adhere to their "average" expectations, were champions Chelsea, West Ham and Leicester.

I then set up a distribution for each team based around their average rating and the standard deviation from their individual game by game performances in 2014/15.

I then drew from these tailored distributions as a basis to simulate each game in the 2015/16 season, Leicester's winning season.
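A minimal sketch of building and drawing from such a tailored distribution. It assumes a normal distribution defined by the mean and sample standard deviation of a side's game by game ratings, which is one plausible reading of the method, and both rating histories are hypothetical.

```python
import random
import statistics


def rating_distribution(game_ratings):
    """Mean and standard deviation of a side's game-by-game ratings."""
    return statistics.mean(game_ratings), statistics.stdev(game_ratings)


def share_above(game_ratings, threshold, draws=10000, seed=7):
    """Share of draws from the side's distribution that exceed `threshold`."""
    mean, spread = rating_distribution(game_ratings)
    rng = random.Random(seed)
    hits = sum(rng.gauss(mean, spread) > threshold for _ in range(draws))
    return hits / draws


# Two hypothetical sides with the same average rating but different consistency.
steady = [1.0, 1.1, 0.9, 1.0]
in_and_out = [0.2, 1.8, 0.4, 1.6]
print(share_above(steady, 1.5), share_above(in_and_out, 1.5))
```

Both sides carry the same single number rating, yet only the in and out side produces title challenging draws with any regularity.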

And this is how the Foxes and their fellow in and out teams fared in simulations that take from a distribution, rather than a rating.


Leicester project as a top half team, who were as likely to finish in the top two as they were to be relegated and West Ham put themselves about all over the place, but predominately in the top half, which is where they ended up.

Chelsea have a minute chance of ending up tenth, so kudos to Mourinho for breaking this particular model.

There are some really interesting figures emerging today, both for teams and players and usually it's fine to run with the average.

But these averages live in distributions and when these distributions throw up something inevitable, if unexpected, as the bankers found out, someone has to pay.

"It's All About The Distribution".

You've got five teams.

One is consistently the best team, their recruitment is spot on with a steady stream of younger replacements ready and able to take over when their stars peak and wane.

Then we've got two slightly inferior challengers, again the model of consistency, with few surprises, either good or bad.

The lowest two rated teams complete the group of five.

The marginally superior of these also turns in performances that only waver slightly from their baseline average.

For the final team, however we have very limited information about their abilities, partly due to a constantly changing line up and new acquisitions.

The current team has been assembled from a variety of unfashionable leagues and we only have a handful of results by which to judge them.

So we group together the initial results of similarly, newly assembled teams to create a larger sample size to describe what we might get from such a team.

Instead of a distribution that resembles the four, more established teams, we get one that is much more inconsistent. Some such teams did well, others very badly.

The distribution of performances for the first four sides is typical of teams from this mini league, whereas the distribution we have chosen to represent the potential upside and downside of this unexposed side is not.

Team 5's distribution has a flatter peak and fatter tails, both good and bad.

The average "ratings" of the five teams are shown below.



Team 5 has the lowest average rating, but by far the largest standard deviation based on the individual ratings of the particular cohort of sides we have chosen to represent them.

As Team 5 is the lowest rated, they're obviously going to finish bottom of the table, a lot, but just to confirm things we could run a simulation based on the distribution of performances for all five teams.

First we need to produce a distribution that mimics the range of performances for the 5 teams and we'll draw a random number from that distribution to decide the outcome of a series of contests.

The highest performance number drawn takes the spoils.

Run 10,000 simulated contests and Team 5 does come last more frequently than any other side. Roughly half the tournaments finish with Team 5 in last position.

However, because their profiled performances are inconsistent and populated by a few very good performances, they actually come first more frequently than might be expected from their average performance rating.

In 10,000 simulations, Team 5 comes first 22% of the time, bettered only by Team 1, whose random draw of ratings based on their more conventional distribution of potential performances grants them victory 36% of the time.

Not really what you'd expect simply from eyeballing the raw ratings.
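The five team contest can be sketched as follows. The ratings and standard deviations are hypothetical stand-ins for the missing table, and a wide normal distribution is used for Team 5 in place of the flatter peaked, fatter tailed one described above, so the percentages it prints will not match the ones quoted.

```python
import random


def simulate_contests(teams, contests=10000, seed=42):
    """Count how often each team draws the highest and the lowest performance.

    `teams` is a list of (average rating, standard deviation) pairs.
    """
    rng = random.Random(seed)
    wins = [0] * len(teams)
    lasts = [0] * len(teams)
    for _ in range(contests):
        draws = [rng.gauss(mean, spread) for mean, spread in teams]
        wins[draws.index(max(draws))] += 1
        lasts[draws.index(min(draws))] += 1
    return wins, lasts


# Hypothetical ratings: four consistent sides and one low rated, volatile Team 5.
TEAMS = [(100, 3), (97, 3), (97, 3), (94, 3), (92, 12)]
wins, lasts = simulate_contests(TEAMS)
for number, (won, last) in enumerate(zip(wins, lasts), start=1):
    print(f"Team {number}: first {won / 100:.1f}%, last {last / 100:.1f}%")
```

Shrinking Team 5's standard deviation to match the others removes most of their wins, which isolates the spread, rather than the average, as the source of the upside.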

Team 5, based on the accumulated record of teams that have similar limited data, are likely to be sometimes very bad, but occasionally they can produce excellent results.

Such as Leicester when they were transitioning into a title winning team?

As someone once said at an OptaProForum.......

"It's all about the distribution"

......and simple averages can sometimes miss sub populations that could be almost anything.

Straight line assumptions, extrapolated from mere averages will always omit the inevitable uncertainty that surrounds such teams or players, where data is scarce and distribution tails might be fatter than normal.

Friday, 4 August 2017

What Might Leicester Get from Kelechi Iheanacho?

Hidden behind Neymar's unveiling in Paris was Kelechi Iheanacho's departure from Manchester City to last season's Champions League quarter finalists, Leicester City.

There's probably no need to measure the height of Iheanacho's transfer fee in piles of tenners, but it does amount to a substantial investment in young talent for the East Midlands side and an opportunity for Kelechi to gain larger amounts of playing time, especially from kick off.

His stats are impressive for a young player.

Any playing time at such a raw age, particularly at a regular title contender is impressive and during his 1275 minutes he's scored 12 from 50 shots (24% conversion rate, without the need for a calculator) and provided 4 assists.

Many appearances have been from the subs bench and it is well known that scoring generally accelerates as the game progresses, so he'll have had a slight boost from that.

He's not really been thrown in solely against the Premier League minnows.

The weighted expected goals conceded by the teams he has faced is only slightly above the league average and he's scored against teams such as Stoke, Spurs, Stoke, Manchester United, Bournemouth, Stoke, Swansea and Southampton.

Nothing too much to worry about him being a flat track bully, although he does quite like Stoke.

In simpler, pre expected goals times, you would take his 24% conversion rate and regress it fairly heavily towards the league average rate to get a more realistic future expectation.

Devoid of any shot location context, Iheanacho's conversion rate since 2015/16 is second only to Llorente at Swansea, another 50 odd attempt player and just ahead of renowned goalscorer, Gary Cahill.

Small samples often lead to unrepresentative extremes and if any media outlet is still quoting raw conversion rates in this enlightened era, they'll probably be disappointed in the long run.

Higher volume shooters over the two seasons Iheanacho's been around in the Premier League are peaking at around 18% conversion rates and as a group, players with 40 or more attempts are converting around 1 in ten.

Regressing his 24% rate by around 50% wouldn't have been out of order and back in the day you would probably pitch him in at around a 17% conversion rate, which is still elite, and wait for more data.
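The back of an envelope version of that regression is sketched below. The 10% baseline is the "1 in ten" rate quoted above for players with 40 or more attempts, and the 50% weight is the rough figure from the text rather than anything formally estimated.

```python
def regress_to_mean(observed_rate, baseline_rate, weight_on_baseline):
    """Shrink an observed rate towards a baseline by the given weight (0 to 1)."""
    return (1 - weight_on_baseline) * observed_rate + weight_on_baseline * baseline_rate

goals, shots = 12, 50
raw_rate = goals / shots  # the 24% raw conversion rate
regressed_rate = regress_to_mean(raw_rate, baseline_rate=0.10, weight_on_baseline=0.5)

print(f"raw {raw_rate:.0%}, regressed {regressed_rate:.0%}")
```

A heavier weight on the baseline would be appropriate for a smaller shot sample, and a lighter one as the attempts pile up.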

Nowadays, lots of Heisenberg expG models are attempting to extract the truth from lots of noisy data produced by players whose fitness peaks and troughs, along with their team mates and opponents.

Most will put Iheanacho's cumulative expected goals from his 50 attempts at around 9 expG compared to his actual total of 12 goals.

Act is > ExpG, case solved, he's an above average finishing capture.

But this doesn't account for natural randomness in a process or outrageous good fortune (such as the ball hitting you on the back and looping into the net against Swansea in December 2015).


Here's the range of simulated successful outcomes for an average finisher, assuming he could have got onto the end of Iheanacho's 50 attempts.

There's roughly a 14% chance an average Premier League finisher scores as many or more goals than the 12 that Leicester's new signing managed at Manchester City and his highlighted 24% strike rate slightly pales under the scrutiny of shot type and location.
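A rough sketch of this kind of simulation follows. The individual shot probabilities are invented (the real attempt by attempt ExpG values aren't listed here) and have simply been chosen to total around 9 ExpG over 50 attempts, so the answer won't exactly match the figure quoted above.

```python
import random

# Hypothetical ExpG values for 50 attempts, totalling roughly 9 expected goals.
# The real per-shot values are not given in the post; these are assumptions.
shot_probs = [0.45] * 10 + [0.21] * 10 + [0.08] * 30

def prob_at_least(shot_probs, target_goals, n_sims=20_000, seed=7):
    """Share of simulations in which an average finisher scores >= target_goals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        goals = sum(1 for p in shot_probs if rng.random() < p)
        if goals >= target_goals:
            hits += 1
    return hits / n_sims

p_match = prob_at_least(shot_probs, target_goals=12)
print(f"Total ExpG: {sum(shot_probs):.1f}")
print(f"Chance an average finisher scores 12 or more: {p_match:.1%}")
```

Each attempt is treated as an independent coin toss weighted by its ExpG, which is the usual simplifying assumption in this type of exercise.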

It's also wise to see if your Heisenberg model at least roughly matches the actual distribution of output from the many guinea pigs who are run through it.... and Iheanacho is initially a pretty poor fit.

The chance that his actual distribution of goals from his attempts is consistent with the model used in the simulations, is only around 1 in 1000.

In these cases it is well worth looking at each attempt, the outcome and the attached expG value.

The problem with Iheanacho fitting the model is that two of his goals come from very low probability chances (the aforementioned back deflected goal at Swansea) and the remaining ten come from virtually the ten most likely goal scoring opportunities he received.

He's scored one long range shot against Southampton, one with his back against the Swans and then nails almost every high quality chance with an expG above 0.4 that he's presented with.

Mitigate for the fluke and the model fit becomes more forgiving.

Delving into the attempts, looking at the outcomes and seeing where the (imperfect) model breaks down can tell us a lot more about Leicester's £25 million purchase than merely saying "he over-performs his ExpG".

He may thrive on quality chances, he certainly has done in his short time in the Premier League.

Over the previous two campaigns, Manchester City created the second highest proportion of the high quality chances that Iheanacho excels at converting.

Around 7% of Manchester City's created attempts have an ExpG in excess of 0.4 in my model.

Leicester are third in this list over the last two seasons, also with around 7% of their chances being high quality ones, suggesting he's a decent fit for the Foxes.

However, numerically, Manchester City are much more prolific both overall and in this creative area. Their play makers carve out five such highest quality chances every four games, compared to just three for Leicester.

Iheanacho may be able to bridge that gap between the two Cities by his positional nous and undoubted pace, but he'll also be competing with Leicester's main beneficiary of these high quality chances, a quarter of which fell to Jamie Vardy.

In short, just a few caveats to one of the upcoming season's major purchases by a team outside the top six.

Friday, 21 July 2017

Shots, Blocks And Game State

In this post I described a way to quantify game state by reference to how well or badly a side was doing in relation to their pregame expectations.

So rather than simply using the current scoreline to define game state, it gave a much more nuanced description of the state of the game, particularly in those frequent phases of a match when the sides were level.

It also incorporates time remaining into the calculation. 

A team level after 10 minutes might be in a very different situation compared to the same score differential, but with ten minutes remaining. How they and their opponents played out the subsequent time may be very different in the two scenarios.

At a simplistic level, those teams in a happy place may be more content to prioritise actions that maintain the status quo, such as defend more, while those who'd wish to alter the state of the game might put more resources into attack than had previously been the case.

It seems logical that a more defensive approach should result in that team accumulating more products of a packed defence, such as blocked shots, while any chances they do create may be met with increasingly fewer defenders.

I took a look at the correlation between blocks and clear cut or so called big chances and the prevailing state of the game and there was a significant relationship between them.

A side in a poor state of the game had more chance of their goal attempts being blocked and this increased as their game state deteriorated.

Similarly, a side in a positive state of the game was more likely to create a chance that was deemed a big chance.

This appears to fit with the hypothesis of content teams packing their defence more, and increasing the likelihood that they block an attempt and if they do scoot off upfield, they're more likely to be met with a depleted defence.

However, correlation doesn't prove causation etc etc. 

In the case of a side being more likely to create big chances, there may be a confounding factor that is causing both the good state of the game and the big chances. (Think raincoats, wet pavements and weather).

That factor is possibly team quality.

The top six account for 30% of the Premier League, but took 48% of the wins, 43% of the goals scored and 45% of the league points won.

They're a league within a league, more likely to be in a very good game state and they also accounted for 43% of the league's big chances.

Team quality may be the causative agent for a good game state and for creating big chances, which correlates the two without either being causative agents of the other.

So I stripped out all games involving the big six to get a more closely matched initial contest, but the correlation persisted.

Teams in a good place against sides of similar core abilities were more likely to create very good chances and more likely to find defensive bodies to block the anticipated onslaught from their opponents.

As a tentative conclusion, intuitive events that you might expect to be more likely to occur as strategies subtly alter do appear to be identifiable in the data.

Data from InfoGolApp

Saturday, 15 July 2017

Lloris, the Best with Room to Improve?

Expected goals, saves or assists are now a common currency with which to evaluate players and teams, with an over achievement often being sufficient to label a player as above average and/or lucky, depending on the required narrative.

By presenting simple expected goals versus actual goals scored, much of the often copious amount of information that has been tortured to arrive at two simple numbers is hidden from the view of the audience.

Really useful additional data is sometimes omitted, even simple shot volume and the distribution in shot quality over the sample.

The latter is particularly salient in attempting to estimate the shot stopping abilities of goal keepers.

Unlike shot takers, it is legitimate to include post shot information when modelling a side's last line of defence.

Extra details, such as shot strength, placement and other significant features, like deflections and swerve on the ball, can hugely impact on the likelihood that a shot will end up in the net.

A strongly hit, swerving shot, that is heading for the top corner of the net is going to have a relatively high chance of scoring compared to a weakly struck effort from distance.

Therefore, the range of probabilistic success rates for a keeper based shot model is going to be wider than for a mere shooter's expected goals model, not least because the former only contains shots that are on target.

We've seen that the distribution of the likely success of chances can have an effect on the range of actual goals that might be scored, even when the cumulative expected goals of those chances is the same.

To demonstrate, a keeper may face two shots, one eminently savable, with a probability of success of say 0.01 and one virtually unstoppable, with a p of 0.99. Compare this scenario to a keeper who also faces two shots, each with a 0.5 probability of success.

Both have a cumulative expectation of conceding one goal, but if you run the sims or do the maths, there's a 50% chance the latter concedes exactly 1 goal and a near 98% chance for the former.

The overall expectation is balanced by the former having a very small chance of allowing exactly 2 goals, compared to 25% for the keeper facing two coin toss attempts.
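If you'd rather do the maths than run the sims, the exact distribution of goals conceded can be built up one shot at a time. The sketch below uses the two pairs of probabilities from the example above and assumes, as the example does, that the shots are independent.

```python
def goal_distribution(shot_probs):
    """Exact probability of conceding 0, 1, 2, ... goals from independent shots."""
    dist = [1.0]  # before any shot: certain to have conceded zero goals
    for p in shot_probs:
        new = [0.0] * (len(dist) + 1)
        for goals, mass in enumerate(dist):
            new[goals] += mass * (1 - p)   # shot saved
            new[goals + 1] += mass * p     # shot scored
        dist = new
    return dist

extreme = goal_distribution([0.01, 0.99])  # one easy save, one near certainty
coin_toss = goal_distribution([0.5, 0.5])  # two 50/50 attempts

for label, dist in (("0.01 & 0.99", extreme), ("0.5 & 0.5", coin_toss)):
    print(label, [f"{p:.2%}" for p in dist])
```

Both keepers are expected to concede one goal, but the spread of outcomes around that expectation is very different, which is the whole point of keeping the shot by shot detail.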

Much of this information about the shot volume and distribution of shot difficulty faced by a keeper can be retained by simulating numerous iterations of the shots faced to see how the hypothetical average keeper, upon whom these models are initially built, would have fared, and then seeing where on that distribution of possible outcomes a particular keeper's actual performance lies.

Hugo Lloris has faced 366 non penalty shots and headers on goal over the last 3 Premier League seasons.

Those attempts range from ones that would result in a score once in 1,400 attempts to near certainties with probabilities of 0.99.


An average keeper might expect to concede a number of goals centred around 120, based on the quality and quantity of chances faced by Lloris.

Spurs' keeper allowed just 96 non penalty, non own goals and no simulation based on the average stopping ability of Premier League keepers did this well. The best the average benchmark achieved begins to peter out around 100 goals.

Therefore, an assessment of the shot stopping qualities of a keeper might better be expressed as the percentage of average keeper simulations that result in as many or fewer goals being scored than the keeper's actual record.

This method incorporates both the volume and quality of attempts faced.




The table above shows the percentage of average keeper simulations of all attempts faced by Premier League keepers since 2014 that equalled or bettered the actual performance of that particular keeper.

For example, there's only a 2.5% chance, assuming a reasonably accurate model, that an average keeper replicates or betters Cech's 2014-17 record and they would be expected to equal or better Bravo's in perpetuity.

Lloris' numbers are extremely unlikely to be replicated by chance by an average keeper and it seems reasonable to surmise that some of his over achievement is because of above average shot stopping talent.

Lloris over performs the average model across the board, saving more easy attempts compared to the model's estimates and repeating this through to the most difficult ones.

Vertical distance from goal is a significant variable of any shot model and Lloris performs to average keeper benchmark save rates, but with the ball moved around 20% closer to the goal.

Intriguingly, this exceptional over performance is partly counter balanced by an apparent less than stellar return when faced with shots across his body.

Modelling Lloris when an opponent attempts to hit the far post produces a variable that has a larger effect on the likelihood of a goal than is the case in the average keeper model.

Raw figures alone hint at an area for improvement in Lloris' already stellar shot stopping.

Players who got an attempt on target while going across Lloris' body converted 35% of the time, compared to the league average of 32%. He goes from the top of the tree overall to around average in these types of shots.

An average keeper gets more than a look in in this subset and the average model equals or beats Lloris' far post, on target actual outcome around 22% of the time. That's still ok, but perhaps suggests that even the very best have room to improve.

Below I've stitched together a handful of Lloris' attempts to keep out far post, cross shots to give some visual context.



For more recent good work, check out Will and Sam's twitter feed and Paul's blog & podcasts.

Data from Infogol

Thursday, 13 July 2017

Gylfi, "On me head, son"

Expected assists looks at the process of chance creation from the viewpoint of the potential goal creator.

An assisted goal is a collaboration between the player making the vital final pass and his colleague who tries to beat the keeper, but over a season these sample sizes tend to be small.

Manchester City's Kevin De Bruyne topped the actual assist charts in 2016/17 with 18, but these numbers may have benefited from a statistically noisy bout of hot finishing or suffered from team mates who frequently sliced wildly into the crowd.

Therefore, it makes sense to use the probabilistic likelihood of success in the 85 additional instances when the Belgian carved out a chance that went begging.

Here's the top ten expected chance creators from the 2016/17 Premier League, along with their actual returns, courtesy of the recipients of these key passes.



The list contains the kind of players you'd expect to see when trawling the Premier League for creative talent.

The expected assists are based on a model derived from the historical performance of every assisted goal attempt from previous Premier League seasons.

So De Bruyne's over performance may reflect the above average talent, not just of himself, but also his team mates or it could be that creating and finishing talent is tightly grouped in the top tier of English and Welsh football and randomness accounts for the majority of the disconnect between actual and ExpA over a single season.

Swansea's Gylfi Sigurdsson, a constant topic of transfer speculation, lies 3rd in both expected and actual assists, with 9 ExpA and 13 actual ones. This backs up the Icelander's importance to the Swans, where he was involved in nearly a quarter of Swansea's ExpG in 2016/17.

His relatively large over performance, compared to his ExpA cumulative total of just under 9 may suggest he is particularly adept at presenting chances to his team mates.

However, a simple random hot streak from both or either participant in the goal attempt should not be ruled out.

In 9% of simulations, an average assister/assisted combination would score 13 or more goals from the 77 opportunities crafted by Sigurdsson.


Neither is there anything untoward in the fit of the model to Sigurdsson's 77 potential assists. Lower quality chances are converted at a lower rate than those which had a higher expectation of producing a goal.

So far there's nothing to set off warning bells for any potential purchaser, Sigurdsson appears to be legitimately a top echelon goal creator, albeit one who may have run slightly hot in 2016/17.

But if we make some direct comparisons to say De Bruyne, differences begin to emerge.

De Bruyne's ExpA per key pass is 0.15 compared to 0.11 for Sigurdsson, which suggests that De Bruyne is, on average, creating higher quality opportunities.

The profile of the position of the recipients of Sigurdsson's key passes is also strikingly different from those of the Manchester City player.


De Bruyne is supplying chances for a much larger proportion of attacking minded players, such as out and out strikers, wingers and attacking midfielders.

Whereas, over 50% of Sigurdsson's key passes are picking out defenders, notably central defenders and that usually means headed chances, from set pieces.

This appears to be confirmed by the final column in the first graphic of this post. Only a third of Sigurdsson's assists arrived at the feet of a team mate, well below the figures for the remaining nine assisters in the table.

All of whom check in with at least 67% of their potential assists being finished off with the boot.

Gylfi's penchant for set play deliveries to a defender's head also features in Ted's article on the transfer speculation surrounding Sigurdsson in The Independent as part of Ted's grand tour of the British press.

Despite Sigurdsson's apparent niche assistance role, at least in 2016/17, his ExpA per potential assist does still hold up well.

He's below De Bruyne, as we've seen, but is above the remaining eight players in the top ten, bar Fabregas and an anonymous Stoke player, who we want to keep.

So although he does deliver aerial passes to generally less skilled finishers, his relatively impressive ExpA per key pass does suggest that he can put the ball into extremely dangerous areas and with accuracy to find a team mate.

Also his actual assists from headed chances of 8 compared to an expected total of just over 5 suggests he may be more skilled at such deliveries than is the average case, although such small samples inevitably prevent random chance being eliminated as the main causative agent in any over performance.

Overall, Gylfi Sigurdsson may be worth a great deal of money.....to a side that is set up to benefit most from his particular creative skill set.

But those teams may be few in number and principal among them are his current employers.

All data via Infogol

Wednesday, 12 July 2017

Stoke Score More August Goals Than Andrew Cole

Hugely amusing tweet* doing the rounds, yesterday.


All great fun in the world of football bants and also an excellent case study in how to use "stats" to purvey a misleading impression that's likely to get picked up, circulated and no doubt recycled in September when the Premier League's fixture computer love affair with the Potters pitches them to the foot of the table.

So let's do a bit of due diligence.

Cole played 44 games to reach his 25 goals, playing, as he did, in the 42 game Premier League era, when they sometimes managed to cram six games in during the opening month.

Stoke scored their 23 goals in 28 matches.

So even this simple addition of context floors the deliberately provocative tweet.

Cole scored 0.57 Premier League goals/game in August, which is eclipsed by Stoke's 0.82 August goals/game.

The comeback would probably be "one is a team of 11 to 14 players".

But 1 of those 14 is a goal keeper, and keepers, with the exception of Stoke City ones, generally don't score.

Four or five are defenders who don't score a lot, which limits the fair comparison to Stoke players, from the August months of the Premier League era, who played in a similarly advanced role to Cole's position at Newcastle, Manchester United, Blackburn, Fulham, Manchester City and Portsmouth.

Designated Stoke forwards scored 13 of their 23 goals, so their scoring rate falls below Cole's 0.57 goals per game to 0.46 goals per game.

Stoke played an average of two out and out strikers over their Premier League existence, so we'll halve that rate to 0.23 goals per game.

This puts Cole well back in the lead, allowing the rip to be taken out of the Potters again....?

However, we haven't considered the goal environments.

Stoke played against a batch of sides in August who conceded an average of 1.35 goals per game, as did Cole a decade earlier.

No change, there.

Cole's teams scored an average of 1.80 goals per game, meaning he played for sides who had a lot of attacking intent.

His 0.57 goals per game was around 30% of the baseline figure for his team.

The homogenised Premier League Stoke striker scored 22% of the 1.06 goals per game Stoke have averaged in the Premier League.
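For anyone checking the working, the chain of arithmetic above fits in a few lines. Every input is a figure quoted in the text; only the variable names are mine.

```python
# All inputs are figures quoted in the post.
cole_goals, cole_games = 25, 44
stoke_goals, stoke_forward_goals, stoke_games = 23, 13, 28

cole_rate = cole_goals / cole_games                     # Cole's August goals per game
stoke_team_rate = stoke_goals / stoke_games             # whole Stoke team
stoke_forward_rate = stoke_forward_goals / stoke_games  # designated forwards only
stoke_striker_rate = stoke_forward_rate / 2             # shared between two strikers

# Share of the team's scoring rate, using the rounded per game figures from the text.
cole_share = 0.57 / 1.80
stoke_striker_share = 0.23 / 1.06

print(f"Cole {cole_rate:.2f}, Stoke team {stoke_team_rate:.2f}, "
      f"Stoke forwards {stoke_forward_rate:.2f}, per striker {stoke_striker_rate:.2f}")
print(f"Cole share {cole_share:.0%}, Stoke striker share {stoke_striker_share:.0%}")
```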

Those strikers included Dave Kitson, Mama Sidibe (legend), Ricky Fuller (legend), James Beattie, Kenwyne Jones and Peter Crouch.

Bottom line, Andrew Cole scored a higher proportion of goals for his club than this mismatch of ageing, journeymen footballers did in their defensively structured, mid table team........in one particular month.

Ha ha.

*NOT

Thursday, 6 July 2017

Game State Outliers

Newcastle's 2011/12 season remains one of the most interesting of recent times.

They scored just four more goals than Norwich, but gained 18 more league points and allowed two fewer goals than Stoke and won 20 more points.

Their meagre +5 goal difference was inferior to the three teams who finished immediately below them in the final table and a 5th place finish was partly down to the hugely efficient way in which they conceded and scored their goals.

The ability to leak goals only when a game was already lost and score at the most advantageous times proved transient and the following season Newcastle's elevation to the top tier of the Premier League stalled as they barely finished above Sunderland and relegation.

In this post I looked to define game state in terms of not simply the current score, but also the equally important factor of time elapsed.

The current state of the game for a side is a combination of the score line, the relative abilities of each side and how long remains for either team to achieve a favourable final outcome.

As an example, take Stoke's home game with Everton.

The matchup was fairly even, Everton the better team being balanced by the Bet365 stadium and after 6 minutes the hosts had around a 37% chance of winning and a 25% chance of drawing.

That equates to expected league points of around 1.4.

A minute later Peter Crouch scores to put Stoke 1-0 up and their expected points with 93% of the game remaining and a goal to the good, rises to 2.1 league points.
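The expected points figure is just three points for a win and one for a draw, weighted by their probabilities. A minimal sketch using the Stoke numbers above (the 2.1 after the goal is taken as quoted, since the post goal win and draw probabilities aren't given):

```python
def expected_points(p_win, p_draw):
    """Expected league points: 3 for a win, 1 for a draw, 0 for a defeat."""
    return 3 * p_win + 1 * p_draw

before_goal = expected_points(p_win=0.37, p_draw=0.25)
after_goal = 2.1  # quoted in the post; the underlying probabilities are not given

print(f"Before Crouch's goal: {before_goal:.2f} expected points")
print(f"Improvement from the goal: {after_goal - before_goal:.2f} expected points")
```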

The goal's welcome, but mitigated by the large amount of time remaining and the evenly matched teams.

No VAR and the game ends 1-1.


The plot above has averaged the increase in expected points per goal scored in an attempt to see which sides were scoring goals that most advanced their potential expected league points, either by design or raw chance, combined with their core ability.

It shouldn't be surprising to see the better teams having the lowest average expected points improvement per goal in the Premier League.

They are more likely to win matches by large margins and the 4th goal in a rout will add little to the team's expected league points, which will already be close to 3.

However, even among the top teams there are variations.

Spurs have the lowest expected points increase per goal scored, partly due to wide margin wins against the lesser sides, while Chelsea, with a similar number of goals, found themselves celebrating a score with, on average, a more tangible game state reward.

Hull appeared to occasionally put themselves into relatively decent positions, despite meagre scoring, while Sunderland, not only scored fewer goals, but also frequently only netted when the spoils had largely been won by their opponents.


The same point may be better illustrated by plotting the success rate (a combination of wins and draws for each team) against their expected points increase per goal scored.

Chelsea are apparent outliers from the line of best fit, scoring goals that advance their game state, on average, by more than their fellow top sides.

Again this might suggest that they are employing slightly different in game tactics compared to others.

Perhaps one that deserts further attacking intent for a more defensive outlook once they find themselves in a favourable match position, as do Manchester United....Or perhaps there is an element of random good fortune in when they are scoring their goals, a la Newcastle 2011/12.

Both Championship enigmas, promoted Huddersfield and their beaten playoff rivals, Reading, show anomalies from the seasonal norm when we examine their change of expected points based on goals and time elapsed.

Huddersfield fly high above the general line of best fit for a side of their scoring capacity, fed by a glut of goals where the time factor had nearly ebbed away. Again, tactically and skill driven or transient good fortune or a bit of both?

Reading showed an uncanny ability to know instantly when they were beaten by "selectively" leaking many of their 64 goals allowed in a handful of games, "allowing" themselves to spread the remainder of their concessions more thinly and remain competitive in a large number of their matches.

A "trait" that will be eagerly anticipated for their 2017/18 season.

Thursday, 29 June 2017

Big Chance or No Big Chance.

There has been a fair bit of comment recently around big chances and their inclusion or not in shot based expected goals models.

Big chances are, as the name suggests, a partly subjective addition to the Opta data feed which describes a goal attempt.

Along with undeniable parameters, such as shot location, type and pre-shot build-up details, the big chance attempts to add information, such as the level of defensive pressure or the positioning of the keeper.

While such information may enhance any conclusion about the quality of an individual chance and assist in converting a purely outcome based approach to team evaluation to a more probabilistic, process based one, it may become prey to cognitive biases, such as outcome biases.

I thought I'd quickly build two models, using the Opta data feed we use to power the Infogol app and see how each performs when put to some of the common uses of an ExpG model.

One model uses big chances (BC), whilst the other does not (NBC).

Such models are primarily used either as descriptive of past matches and/or predictive of future performances.

Typically, pre-shot data is collected from a previous season or number of seasons and the relationship between this data and a discrete outcome, such as whether a goal is scored, is found using logistic regression.

We can then use the results of the previously modelled regression to assign the probability that any future chance will result in a goal based on recent historical precedent.
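As a toy illustration of that fit then apply process, here's a bare bones logistic regression written from scratch. The two features (shot distance and a header flag), the synthetic shots and the "true" coefficients used to generate them are all invented for the example; this isn't the Opta feed or either of the models built for this post.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, outcomes, lr=0.5, epochs=1000):
    """Fit weights by batch gradient descent; each row starts with 1.0 (intercept)."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, outcomes):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

def expg(weights, distance, header):
    """Goal probability for a new attempt under the fitted weights."""
    return sigmoid(weights[0] + weights[1] * distance + weights[2] * header)

# Synthetic "previous season": distance in tens of yards, header as 0/1.
# The generating coefficients below are assumptions made up for this sketch.
rng = random.Random(42)
rows, outcomes = [], []
for _ in range(500):
    distance = rng.uniform(0.3, 3.0)
    header = 1.0 if rng.random() < 0.2 else 0.0
    true_p = sigmoid(0.5 - 1.5 * distance - 0.7 * header)
    rows.append([1.0, distance, header])
    outcomes.append(1 if rng.random() < true_p else 0)

weights = fit_logistic(rows, outcomes)
print("Fitted weights (intercept, distance, header):", [round(w, 2) for w in weights])
print(f"ExpG, shot from 6 yards:  {expg(weights, 0.6, 0.0):.2f}")
print(f"ExpG, shot from 25 yards: {expg(weights, 2.5, 0.0):.2f}")
```

A real model would have many more variables, and a big chance flag would simply be one more 0/1 column in each row.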

The advantage of using ExpG models is that shots are much more numerous than goals and hopefully the process of chance creation with an attached probabilistic measurement of success will better describe a side's underlying abilities compared to actual goals, which are perhaps more prone to random streaks.

                     Cumulative ExpG Totals for 2015/16 Modelled from 2014/15 Opta Data.



Here are the cumulative ExpG totals for the 2015/16 Premier League, modelled using data from the previous season. These types of figures are often used as a basis to predict the future performance of a side.

The top model doesn't use big chances as a parameter, but the second does and while there is some variation between models, the correlation measured in Exp GD is strong between the two models.


For those wishing to use an ExpG approach to produce a probabilistic estimation of team quality, there seems little difference in larger sample sizes between a big or non big chance based model.

It would appear that, in the long term at least, chance quality information is also retrieved from non big chance Opta parameters and more importantly is distributed to individual teams in a similar way to a big chance model.

In short, both models give Exp GD of similar values for most sides.

However, cumulative totals can give near identical values, but be very different at the granular level.

Model BC may assign a much bigger probability to excellent opportunities and smaller ones to weaker opportunities, while model NBC may do the polar opposite and the errors in the latter may fortuitously balance out to give near equal cumulative totals.

The first model would describe future reality better than the second.

To test both models, I arranged the goal attempts for all 20 teams in ascending chance quality, divided these into groups and then compared the actual number of goals scored in each of these subsets to the number predicted by each model.
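The sort, group and compare step can be sketched as below. The attempts are synthetic and the summary number is a Hosmer-Lemeshow style statistic, which is my assumption about the kind of goodness of fit test involved rather than a statement of exactly which test was run for the post.

```python
import random

def bin_attempts(attempts, n_bins):
    """attempts: (expg, scored) pairs. Returns (n, expected goals, actual goals) per bin."""
    ordered = sorted(attempts, key=lambda a: a[0])  # ascending chance quality
    size = len(ordered) / n_bins
    bins = []
    for b in range(n_bins):
        chunk = ordered[round(b * size): round((b + 1) * size)]
        expected = sum(p for p, _ in chunk)
        actual = sum(scored for _, scored in chunk)
        bins.append((len(chunk), expected, actual))
    return bins

def hosmer_lemeshow(bins):
    """Sum over bins of (actual - expected)^2 / (n * mean_p * (1 - mean_p))."""
    stat = 0.0
    for n, expected, actual in bins:
        mean_p = expected / n
        stat += (actual - expected) ** 2 / (n * mean_p * (1 - mean_p))
    return stat

# Hypothetical attempts whose outcomes are drawn from the model's own
# probabilities, so the fit should usually look acceptable.
rng = random.Random(3)
attempts = []
for _ in range(400):
    p = rng.choice([0.03, 0.07, 0.12, 0.30, 0.55])
    attempts.append((p, 1 if rng.random() < p else 0))

bins = bin_attempts(attempts, n_bins=5)
for n, expected, actual in bins:
    print(f"n={n:3d}  predicted={expected:5.1f}  actual={actual:3d}")
print(f"Goodness of fit statistic: {hosmer_lemeshow(bins):.2f}")
```

A large statistic means predicted and actual goals disagree badly in at least some of the groups, which corresponds to a small p value.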

                      How Well Does the Predicted Distribution of Outcomes Match Reality.



(Green = acceptable match, brown = poor match).

The results of this goodness of fit test are shown above.

Where the probabilistic model prediction for each subset largely agrees with the actual distribution of outcomes for 2015/16, we get a large p value. There's a decent chance that the variation we see between prediction and reality is just down to chance.

Using the usual 5% threshold, there are two teams from the model constructed without big chances where the actual distribution of outcomes is so far removed from the predictions that chance may be largely ruled out as the cause.

In this case, Liverpool and Stoke.

The model constructed with big chances included as a variable has three teams where chance looks an unlikely candidate for the variation seen in the two distributions. Liverpool (again), Everton and Swansea.

So while cumulative ExpG values tend to show only small variations between a BC and a non BC model, differences do emerge at a more granular level and these differences for this season and these two models do not appear to be systematically in favour of the BC or non BC model.

In short, ExpG is a product of a model and all models vary and these differences and the conclusions we draw may be most evident in smaller shot samples.

Saturday, 24 June 2017

You Don't Need Goals to Change Game State

I’ve written previously about the concept of game state and how a side prioritises their attacking and defensive resources.

It is well known that trailing sides often increase their attacking output when they are behind compared to when they were either level or ahead and this in turn impacts on the amount of defending their opponents are obliged to do.

Dependent upon the relative abilities of the two competing teams, a side seeking to get back on level terms often takes more shots and also accrues more products of attacking play, such as corners than was previously the case.

However, game state, when defined simply as the current score line, does seem limiting and I’ve previously quoted the example of a top side playing out a goalless draw with a lesser team.

While the level scoreline would be increasingly welcome to the lower rated team as the game progressed, the opposite would apply for the better side in the matchup.

Therefore, quantifying “game state” should perhaps be done in terms that include the changing expectations of each team due to the passage of time and scoreline, rather than simply the scoreline.

I’ve suggested using the expected points each side would get on average from a match as a suitable baseline with which to begin measuring the evolving state of the game.

Here’s an example.

Chelsea entertains Everton and based on pregame home win/draw/away win estimations, Chelsea would expect to average 2.1 points compared to around 0.71 points for the visitors from the fixture.

40 minutes into a still goalless game, these numbers have respectively fallen to 1.9 and risen to 0.81. After 67 minutes there is still no goal; Chelsea are faring even less well (1.66) and Everton are up to an average expectation of 0.90 points.

There have been no goals, but the state of the game is constantly drifting away from Chelsea’s expectations and surpassing Everton’s “par for the course”.
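As a rough sketch of how such in-play expected points might be produced, the Python below assumes each side scores as an independent Poisson process at a constant rate, which is my simplification for illustration rather than the model behind the figures quoted above. The goal rates (2.0 and 0.8 per 90 minutes) and the function names are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k goals when the expected number is `mean`."""
    return exp(-mean) * mean ** k / factorial(k)


def expected_points(home_rate, away_rate, minute,
                    home_goals=0, away_goals=0, max_goals=10):
    """Average points each side can expect from the current position.

    home_rate and away_rate are assumed goals per 90 minutes; the time
    remaining scales them down. Returns (home_xpts, away_xpts).
    """
    remaining = max(0.0, (90 - minute) / 90)
    home_mean = home_rate * remaining
    away_mean = away_rate * remaining
    p_home = p_draw = p_away = 0.0
    # Sum over every combination of goals still to be scored.
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            final_home = home_goals + h
            final_away = away_goals + a
            if final_home > final_away:
                p_home += p
            elif final_home == final_away:
                p_draw += p
            else:
                p_away += p
    # Three points for a win, one for a draw.
    return 3 * p_home + p_draw, 3 * p_away + p_draw


def game_state_shift(current_xpts, pregame_xpts):
    """Proportional change in expected points compared to kick-off."""
    return (current_xpts - pregame_xpts) / pregame_xpts


# Hypothetical goal rates for a strong home side and a weaker visitor.
pregame_home, pregame_away = expected_points(2.0, 0.8, minute=0)
for minute in (0, 40, 67):
    home_xpts, away_xpts = expected_points(2.0, 0.8, minute)
    print(minute, round(home_xpts, 2), round(away_xpts, 2),
          round(game_state_shift(home_xpts, pregame_home), 2))
```

With the scores level, the favourite's figure drifts down and the underdog's drifts up as the minutes tick by, which is the effect described in the Chelsea and Everton example.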

Chelsea's game state environment is gradually becoming less palatable to them and Everton's more so, simply through the passage of time and if this feeds through into the relative approaches of the sides, it should be seen in the match data.

Here’s a memorable 0-0 from 2016/17 when Burnley took a point in a stalemate at Old Trafford.

The hosts’ average expected points total started at around 2.3 points at kick-off, compared to 0.55 points for the visitors, but it had fallen by over 10% when half time arrived without a score. So a gradual erosion of expectations, rather than a precipitous decline.

Burnley’s modest expectation was up by over 50% on their original figure with 20 minutes remaining and, with United’s now having tumbled by nearly a quarter compared to kick-off, United’s shot count began increasing as Burnley’s stalled.

        How Manchester United Piled on the Attempts as Burnley Frustrated them at OT.


This switch towards a more overtly attacking stance from the side leaking initial expectation as time elapses in a level match forces their opponent to adopt a more defensive outlook and appears to be mirrored, on average, in all such matches from the 2016/17 Premier League season.

72% of the goal attempts taken when the scoreline was level in 2016/17 were taken by the side whose expected points had slipped below their pregame estimation. Perhaps an important consideration when nearly half of all goal attempts from 2016/17 came while the scores were level.

Across all score lines, the inferior team in a match who had managed to improve their pre-game position, either through remaining level or taking a lead, attempted 31% of shots while that position persisted, but such sides upped this to nearly 46% against superior opponents when their current points expectation fell below their initial expectation.

These figures tally with intuition about how games develop, even in the absence of goals.

Therefore, the amount of change in a team’s pregame expectation may be a viable extension to the more commonly applied mere scoreline when assessing game state, particularly when we are still awaiting an initial goal.

For example, it is commonly assumed that increased shot volume from a side that finds themselves in a disadvantageous game state is partially balanced by a more packed defence.

This may lead to the expected goals from identical pitch locations being lower when defensive pressure is greater.

To try to test this, I included a variable for game state within an expected goals model for the 2016/17 Premier League, based around this continuous, time elapsed and score dependent calculation, rather than merely using the current scoreline.

Overall, a team playing with a current expected points total that had dipped well below their pre-game expectations, converted chances at a lower rate than identical chances where game state was much less of a factor.

In addition, as teams played with a poorer game state, their goal attempts were also more likely to be blocked by defenders than in similar situations when their game state environment wasn't as dire.

As an example, a side who had improved their position compared to pre-game by around 40% of their initial points expectation might convert a decent shot from the heart of the penalty area around 44% of the time.

But when faced with the same chance when their points expectation had fallen by a similarly large amount, they appear to only convert the opportunity 37% of the time.

This may be due to fewer defenders being around in the first instance as their opponents perhaps chased a goal of their own compared to the second situation when defence might be a higher priority for their opponents.

Thursday, 15 June 2017

Early Season Strength of Schedule

With the major European leagues currently enjoying their summer holidays, it is left to a handful of competitions to provide club based action until early August.

One such league is Brazil's Serie A, a fascinating mix of player and managerial churn, exciting skillful youngsters, paired with former internationals, slowly winding down their illustrious careers and lots of shooting from distance.

Tonight sees the completion of week seven of the twenty team league, so while we have accumulated some new information about the 2017/18 version of teams such as Santos, Sao Paulo, Corinthians and lesser known sides, such as Gremio and Bahia, that information comes courtesy of an unbalanced schedule.

Prior to week seven, Flamengo had played three of the current bottom four and no side from the top half of the table, whereas Vasco da Gama had faced the current top two and only two sides outside the top ten.

The challenges faced by these two sides were likely to vary in their degree of difficulty.

Delving deeper into each side's most recent games, including matches from 2016/17, may be a more reliable indicator of their respective future prospects, but it is understandable that a six game season to date also invites comment in isolation.

Predicting the future arc of a team's season is always welcome, but celebrating achievement over a shorter time frame, even if some of it has come from a sprinkling of unsustainable randomness also deserves attention.

How can advanced stats and strength of schedule adjustments assist?

It's natural to look firstly at the record of the side in question, but it is their opponents that possess the richest seam of data from 2017/18's fledgling season.

Vasco has played Palmeiras, Bahia, Sport, Fluminense, Corinthians and Gremio prior to last night and in turn each of their opponents has also played five other opponents in addition to Vasco.

Combined, Vasco's opponents have played 36 games, nearly a full season and have played every side in Serie A at least once, bar Corinthians.

We have a ton of accumulated data from goals to expected goals for Vasco's opponents, but only six games of data for Vasco themselves and the same is true for the remaining 19 teams.

It's natural to expect that even this limited, if recent, achievement does contain some signal relating to future performance. Ben Cronin over at Pinnacle has written this article about the correlations between Premier League position after six games and final position, and the FT's John Burn-Murdoch also tweeted this excellent visualisation correlating current league position during the 2013/14 season with finishing position in May.

To adjust for strength of schedule, we might take expected goal differential, rather than league position, as the performance related output for each team and utilise the interrelated collateral form lines that are created after a few weeks of the season.

Team A may not have played team B yet, but they may have played team C, who have played team B.

We are left with 20 simultaneous equations, with a side's opponents on one side and their actual expected goal differential output on the other. Solve these and we have new expected goals differentials that more fully represent the difficulty of each team's schedule.

In short, it is the basis for so called power ratings.
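Here's a minimal sketch of one way to solve such a system in Python. It assumes the simplest form of power rating, where a side's adjusted rating is its average expected goal differential per game plus the average rating of the opponents it has faced, and it solves the equations by repeated substitution rather than matrix algebra. That formulation, the function name and the tiny three game schedule are my assumptions for illustration, not the exact method or data used for the Brazilian rankings.

```python
from collections import defaultdict


def schedule_adjusted_ratings(games, damping=0.5, tol=1e-10, max_iter=10000):
    """Schedule adjusted ratings from (team, opponent, xgd) records.

    xgd is the expected goal differential from the first team's point of
    view. Each rating is defined as a side's average xgd plus the average
    rating of its opponents, which gives one equation per team. The
    equations are solved iteratively, with ratings centred on zero.
    Returns (raw_average_xgd, adjusted_ratings).
    """
    margins = defaultdict(list)
    opponents = defaultdict(list)
    for team, opponent, xgd in games:
        margins[team].append(xgd)
        margins[opponent].append(-xgd)
        opponents[team].append(opponent)
        opponents[opponent].append(team)

    raw = {team: sum(m) / len(m) for team, m in margins.items()}
    ratings = dict(raw)
    for _ in range(max_iter):
        updated = {}
        for team in raw:
            faced = opponents[team]
            opponent_average = sum(ratings[o] for o in faced) / len(faced)
            target = raw[team] + opponent_average
            # Damping stops the ratings oscillating on awkward schedules.
            updated[team] = damping * ratings[team] + (1 - damping) * target
        mean = sum(updated.values()) / len(updated)
        updated = {team: r - mean for team, r in updated.items()}
        change = max(abs(updated[team] - ratings[team]) for team in raw)
        ratings = updated
        if change < tol:
            break
    return raw, ratings


# Hypothetical schedule: A has not met C or D, but is linked to them via B.
games = [("A", "B", 0.5), ("B", "C", 1.0), ("C", "D", 0.5)]
raw, adjusted = schedule_adjusted_ratings(games)
for team in sorted(adjusted, key=adjusted.get, reverse=True):
    print(team, round(raw[team], 2), round(adjusted[team], 2))
```

The schedule needs to link every team to every other, directly or through common opponents, for the ratings to be comparable across the whole league.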



Here's how Serie A teams were ranked by expected goals differential prior to week seven and how that ranking changed when we allowed for the sometimes heavily unbalanced schedules played.

Vasco were ranked 13th on expected goal differential, but jumped into the top 10 to 9th when their harsh early schedule was applied.

Ponte Preta dropped four places to 15th in view of an apparently benign group of initial opponents.

In theory this seems fine, but does schedule strength add anything to our knowledge of a side going forward if we choose to limit ourselves to data from just this single season?

As Ben and John have admirably demonstrated, there is a correlation between league position at various stages of the season and finishing position.

Here's a limited (due to workload) example from a previous Premier League season using simply goal differential rather than expected goals.

13 games into the 2013/14 season, Spurs were ranked 13th by goal difference, 10th when strength of previous schedule was applied and 9th in the actual table. They finished 6th.

Their position in the table after 13 games best predicted their finishing spot, followed by strength of schedule adjusted goal difference and lastly actual goal difference.

Across the league as a whole, though, strength of schedule adjusted goal difference from week 13 did best of the three. League position and actual goal difference after 13 games each produced a ranked correlation of 0.77 with final position, but this rose to 0.80 when strength of schedule corrections were applied and the teams re-ranked after 13 matches each.
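For anyone wanting to repeat the exercise, a ranked correlation of this kind can be calculated as below. I'm assuming Spearman's rank correlation with no tied ranks; the function name and the five team rankings are made up for illustration and are not the 2013/14 figures.

```python
def spearman_rank_correlation(ranks_a, ranks_b):
    """Spearman's rank correlation for two rankings of the same teams.

    Both lists hold each team's rank in the same team order, and the
    shortcut formula used here assumes there are no tied ranks.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same teams")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two teams")
    squared_gaps = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * squared_gaps / (n * (n ** 2 - 1))


# Hypothetical five team league: rank after 13 games and final rank.
rank_after_13_games = [2, 1, 3, 4, 5]
final_rank = [1, 2, 3, 4, 5]
print(spearman_rank_correlation(rank_after_13_games, final_rank))
```

A value of 1 means the early ranking matched the final order exactly, while a value near 0 means it told us next to nothing.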

In short, there is signal in limited early season data and as a means of predicting final finishing position there may be some improvement if we rank by a schedule adjusted performance indicator.

All Brazilian data from InfAppoGol