Let’s talk about Expected Goals (XG) and data within sport. There has been an unbelievable amount written about XG in recent weeks and how Aston Villa and Sunderland will revert back to the mean. Aston Villa should be in 16th place apparently. Not 3rd. Pure luck that it hasn’t happened.
I just want to throw some things out there before writing this piece:
1. As a professional gambler, data within sport is a hugely important tool to me. It is the cornerstone of any and all analysis carried out. However, it needs context, and that is what is so sadly lacking in so much of what I see currently.
2. Aston Villa will regress back to the mean on this current run. It is unsustainable. It is just a question of how far.
3. The biggest football syndicates are using models several iterations ahead of XG. They are clearly not for public consumption and hugely valuable in the data arms race.
XG is treated as absolute gospel in certain corners of places like Twitter. It really shouldn’t be. It is a guide to help and is just one small part of the puzzle, in the same way that things like ‘Touches in the box’ or ‘Field Tilt’ can help to provide a little more context on a performance. XG is significantly more informative than something like shots on goal, total goalkeeper saves, number of clean sheets or possession percentage, but it is still only one input.
We have seen so many examples of XG lacking context just this week. Let’s start with that Arsenal vs Aston Villa match last night. The XG varies depending on which source you use, as they are slightly different models. Fotmob have it as 3.04 to 2.67, Understat 4.41 to 2.91 and FBREF 2.9 to 2.5. You might think Villa were unfortunate to lose 4-1 based on those numbers. They weren’t. The numbers need context. A huge amount of Villa’s XG came in injury time. Utterly meaningless numbers at that stage in the match. One thing we have seen with Emery’s teams is that he really doesn’t care if a team runs up a score on them. We have seen that time and time again. Just look at the substitutions he made. So much importance is given to numbers that lack any context. Not by the important people but by the masses.
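To make the point about model disagreement concrete, here is a minimal Python sketch that takes only the three pairs of numbers quoted above and shows how far apart the sources are. It assumes the first number in each pair belongs to Arsenal and the second to Villa, and the function name `summarise` is just for illustration.

```python
from statistics import mean

# XG figures quoted above for Arsenal vs Aston Villa, in the order given.
# Assumption: the first number in each pair is Arsenal, the second is Villa.
quoted_xg = {
    "Fotmob": (3.04, 2.67),
    "Understat": (4.41, 2.91),
    "FBREF": (2.9, 2.5),
}


def summarise(pairs):
    """Return each side's mean and spread (max - min) plus per-source margins."""
    first = [a for a, _ in pairs.values()]
    second = [b for _, b in pairs.values()]
    return {
        "first_mean": round(mean(first), 2),
        "second_mean": round(mean(second), 2),
        "first_spread": round(max(first) - min(first), 2),
        "second_spread": round(max(second) - min(second), 2),
        # Margin = first side's XG minus second side's XG, per source.
        "margins": {src: round(a - b, 2) for src, (a, b) in pairs.items()},
    }


summary = summarise(quoted_xg)
print(summary)
```

On these figures the three models disagree by about 1.5 XG on one side alone, and the margin between the teams ranges from roughly 0.4 to 1.5 depending on who you ask. That is before any context about when in the match the chances came.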
I talked about it on the podcast this week, but take Chelsea’s first goal against Villa last week. If James is credited with the goal from the corner it would be about 0.02 XG. Pedro gets the slightest touch on it and it becomes 0.93 XG. Nothing has materially changed on the pitch. It was going in. We see these sorts of examples all the time.
One of my favourite examples is the Aston Villa vs Bournemouth game earlier this season. Villa won 4-0. XG was 1.7-1.61 or 1.88-2.01 or 1.7-1.6, depending on the source. The number of people implying Villa were lucky to win that game on the back of those numbers was remarkable. Nothing tells me someone doesn’t actually watch football more than making those observations. If you don’t believe me, go back and watch the match in its entirety.
I was fortunate enough to watch Steph Curry play live recently. I think he is a good example to use as we think about the XG problem. Let’s imagine we are looking at his three-point percentage (3pt %). Let’s assume his career average is about 42%. Let’s imagine he has a purple patch where he is making 80% of his 3’s. Given the highest in history is about 45% from three over a sustained period, he is due a significant regression. It will happen, but we don’t know to what level as there are so many factors that can affect it. Maybe he is just running hot. Maybe he is playing against teams with below average defenders. Maybe he has different teammates creating different opportunities. Maybe he has improved his technique. Maybe his shot selection has improved. Maybe the ball has changed.
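One rough way to put numbers on "we know he will regress, but not how far" is a simple shrinkage estimate: blend the hot streak with the career average, treating the career average as if it were worth a certain number of earlier attempts. The sketch below is only an illustration. The 42% figure is the assumption from the text, while `prior_strength` and the streak lengths are hypothetical values I have picked to show the effect.

```python
def shrunk_rate(makes, attempts, prior_rate, prior_strength):
    """Blend an observed rate with a prior rate.

    The prior is weighted as if it were `prior_strength` earlier attempts
    made at `prior_rate`.
    """
    return (makes + prior_rate * prior_strength) / (attempts + prior_strength)


career_rate = 0.42    # assumed career average, as in the text
prior_strength = 500  # hypothetical: how many attempts the prior is "worth"

# Hypothetical hot streaks: 80% over 50 attempts vs 80% over 500 attempts.
short_streak = shrunk_rate(40, 50, career_rate, prior_strength)
long_streak = shrunk_rate(400, 500, career_rate, prior_strength)

print(round(short_streak, 3), round(long_streak, 3))
```

With these made-up inputs the short streak is pulled back to roughly 45%, while the same 80% sustained over ten times as many shots only comes back to about 61%. The estimate depends entirely on how much weight you give the prior, and it says nothing about why the player is hot, which is exactly the part that needs context.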
We assume that 45% is the upper baseline as we have never seen better than that. One day a player might blow all these numbers away. Maybe someone so tall he just shoots over everyone. With someone like Curry his numbers are distorted by how much pressure he is constantly under. He probably shoots 80%+ in practice with no one close to him. Other factors around him matter hugely. Who knows what his true benchmark may be going forward.
The point is similar to Villa in that we know there will be a regression, but there are so many factors at play that we have no idea how big that regression will be. People keep looking at relatively small sample sizes of 19 matches this season. They will cherry-pick the sample size to fit the narrative. If we extrapolate it out further, Emery has consistently overachieved his XG at Villa over 166 matches. Aston Villa are a comfortable 4th place in points won since Emery joined over three years ago.
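To show why 19 matches and 166 matches are not the same kind of evidence, here is a small Python sketch of how the uncertainty around an average per-match figure shrinks with sample size. It assumes matches are independent and uses a hypothetical per-match standard deviation of 1.2 goals for "goals minus XG". Neither assumption comes from real data.

```python
import math


def standard_error(per_match_sd, matches):
    """Standard error of a per-match average, assuming independent matches."""
    return per_match_sd / math.sqrt(matches)


per_match_sd = 1.2  # hypothetical spread of (goals - XG) in a single match

se_19 = standard_error(per_match_sd, 19)
se_166 = standard_error(per_match_sd, 166)
ratio = se_19 / se_166  # equals sqrt(166 / 19), whatever the sd is

print(round(se_19, 3), round(se_166, 3), round(ratio, 2))
```

Whatever spread you assume, the 19-match average is about three times noisier than the 166-match one, because the ratio only depends on the square root of 166 / 19. An overperformance that survives 166 matches is much harder to wave away as variance.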
He has overachieved against XG his entire career. Over a large sample size. At some stage people might realise that what he does and how he achieves outcomes aren’t always captured by XG alone. There are many more factors at play. Here are some of them.
He is often very passive when his team takes the lead. We see Villa sit back and conserve energy time and time again once in front. That doesn’t lead to positive outcomes on XG for a match. Game state really matters to him. He isn’t seemingly interested in running up a score against teams. Whilst he has a good bench, when depth is stretched it really isn’t great. Just compare the benches of Aston Villa and Arsenal yesterday. That depth affects XG.
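A simple way to see the game state effect is to split a team’s XG by whether they were level, leading or trailing when each shot was taken, rather than looking at one match total. The sketch below uses an entirely hypothetical shot log, and the field names `minute`, `xg` and `goal_diff` are my own, not those of any real data provider.

```python
from collections import defaultdict

# Entirely hypothetical shot log for one team in one match; not real data.
# goal_diff is that team's lead (+) or deficit (-) when the shot was taken.
shots = [
    {"minute": 12, "xg": 0.35, "goal_diff": 0},
    {"minute": 27, "xg": 0.10, "goal_diff": 0},
    {"minute": 41, "xg": 0.05, "goal_diff": 1},
    {"minute": 66, "xg": 0.04, "goal_diff": 1},
    {"minute": 78, "xg": 0.08, "goal_diff": 2},
]


def game_state(goal_diff):
    """Label the game state from the shooting team's point of view."""
    if goal_diff > 0:
        return "leading"
    if goal_diff < 0:
        return "trailing"
    return "level"


def xg_by_state(shot_log):
    """Sum XG separately for each game state."""
    totals = defaultdict(float)
    for shot in shot_log:
        totals[game_state(shot["goal_diff"])] += shot["xg"]
    return {state: round(total, 2) for state, total in totals.items()}


state_totals = xg_by_state(shots)
print(state_totals)
```

In this made-up match most of the XG comes while the scores are level and very little once the team is in front. A single match total hides that pattern, which is the sort of thing you only pick up by adding game state to the numbers or by watching the match.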
One thing he does is clog up the centre of the pitch. He closes down the passing lanes through the middle. Forces you out wide or to shoot from long distance. What you end up with is lots of blocked shots or contested headers, both of which I think are poorly captured by these XG models. The way teams like Villa break forward at pace isn’t always captured well either. They can be in hugely threatening positions without getting a shot away. That registers as 0 XG. Add high-quality goalkeepers and high-quality strikers, and the list of reasons why XG doesn’t always capture things brilliantly is endless.
As touched on previously, Emery has overperformed on XG throughout his career to a significant level over a significant number of matches. There is another manager in the Premier League whose baseline number is very similar. Simply never talked about though. Someone called Pep.
The point is that so many people use XG as an absolute data source instead of using it as one of many. I used to berate Martin O’Neill for his opinions on XG. It was never about XG per se but his opinions on the use of data in football generally. It adds so much to the game when utilised correctly.
Aston Villa and Emery will regress. You cannot defy the numbers. It’s just a question of where to. Given he has the 4th most points since he joined Villa, it seems unlikely that it will be to the 15th/16th best team in the League as some suggest. I would urge people to use both the data (lots of it) and their own eyes at times. So much important stuff isn’t captured by data alone and definitely not in just XG. Everything needs context.