Scraping Data from the Web

 

On a personal note, I think one of the biggest drawbacks of loving AFL and wanting to analyse it is that good data just isn’t readily available. A few main things are holding back AFL analysis in Australia. Setting aside the obvious data access issues (https://thearcfooty.com/2016/09/05/the-afl-needs-to-go-beyond-the-box-score/), another drawback is that even basic data is hard to get hold of.

If you want “boxscores”, short of emailing the great folks at http://afltables.com/afl/afl_index.html or http://www.footywire.com/, there just isn’t a readily available source. The only people who can get the data are those who learn to web-scrape or get someone else to do it for them.

This has many drawbacks for the AFL community at large, which I won’t go into here.

So what I propose to do here is go through a short example: getting one round of data from Wikipedia and one game’s worth of data from footywire. It won’t be the most efficient way of getting the data, but that is not the aim of this exercise. The aim is to show a few different ways you can get the data and clean it up, so that you at home can answer the questions you have about footy.

The Set up:

First you need to install R and RStudio.

You can download them using the links below.

https://www.r-project.org/

https://www.rstudio.com/

Step 1: Start a new script

Once you have installed them both, open RStudio and start a new script (File > New File > R Script). From there we need to install the packages to scrape the data we want!

Step 2: Installing the relevant packages

In the Packages tab (in the bottom right-hand pane by default) you should be able to see the Install button. This helps you install all the packages you will need below.

Enter the packages you need one by one and click Install!

Once you have done this for stringr, XML, rvest, tidyr and dplyr, you can load them with library (see below) or just tick the check box next to each package in the Packages tab.

library(stringr)
library(XML)
library(rvest)
library(tidyr)
library(dplyr)
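If you would rather not click through the installer, here is a minimal sketch of doing the same thing from the console. The names needed and missing_pkgs are made up for this example, and the install step is guarded so it only runs in an interactive session.

```r
# Packages used in this post
needed <- c("stringr", "XML", "rvest", "tidyr", "dplyr")

# Which of them are not in the local library yet?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

# Install only what is missing (needs an internet connection)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

After that, the library() calls above should load without complaint.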


Step 3: Go to a website and get some data

Let’s say you want to scrape Wikipedia for AFL data. This example uses the 2016 season, and for the first part I will use Wikipedia to get one round of AFL data.

 

#the first step is getting the html (data) in from the page ("https://en.wikipedia.org/wiki/2016_AFL_season")
afl_season<-read_html("https://en.wikipedia.org/wiki/2016_AFL_season", encoding = "UTF-8")
#once you have all the html, you want to find the tables
tables<-html_table(afl_season, fill = TRUE)
tables #this will print all the tables in your console window

 

When you do that, the console prints every table on the page, each one preceded by its index in double square brackets. What you want to focus on is that index: a [[19]] in front of the Round 18 data means that Round 18 is table 19!

As an example, let’s start from the beginning and get Round 1 of AFL data, which is table 2.

 
table.example<-tables[[2]] ##just looking at round 1
names(table.example) #see the variable names 
str(table.example) #see the data structure
View(table.example) #view the data!!!
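Counting tables by eye in the console gets old quickly. As a rough sketch of an alternative, you could search the list for the table that mentions the round you want. The tables below are made-up stand-ins for what html_table() returns (not real Wikipedia data), and has_round1 and round1_index are names invented for this example.

```r
# Stand-in for the list that html_table() returns: made-up tables
tables <- list(
  data.frame(X1 = "Season summary", stringsAsFactors = FALSE),
  data.frame(X1 = c("Round 1", "Thursday, 24 March (7:20 pm)"), stringsAsFactors = FALSE),
  data.frame(X1 = c("Round 2", "Friday, 1 April (7:50 pm)"), stringsAsFactors = FALSE)
)

# TRUE for every table that has a cell reading exactly "Round 1"
has_round1 <- sapply(tables, function(tbl) any(grepl("^Round 1$", unlist(tbl))))

# position of the first matching table in the list
round1_index <- which(has_round1)[1]

# double square brackets pull that one table out of the list
table.example <- tables[[round1_index]]
```

On the real page the matching cell may be worded differently, so check what tables[[round1_index]] gives you before trusting it.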

 

That’s great. What we can see is that X1 seems to contain the dates, X2 contains the winning team name and score, and so on.

The numbers on the left correspond to the rows; think of them as rows in an Excel sheet. So really our data starts in row 3.

So now we have the data. All we have to do is get it into the format we want, and then we can do all the fun analysis type things like, I don’t know, building your own ELO.

Step 4: Cleaning the data (the really fun stuff)

Our table here is table.example, but it contains rows we do not want (rows 1, 2 and 12), so we delete them.

 

df1 <- table.example[-c(1, 2, 12), ] # drop rows 1, 2 and 12
View(df1)

# we also want some sort of label on our data for round 1
df1$round <- "Round 1" # add a label so I know what round it is

 

From here we can see we don’t really want column X3; in other words, we want to select columns X1, X2, X4 and X5 plus our newly created round.

 

df2<-select(df1, X1, X2, X4,X5,round) #just getting the columns that I want

To clean the data we use what are called “regular expressions”. I am following the tutorial here: https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html

I want to separate the time from the day of the game, which I can do like below.

# [[:graph:]] has to sit inside an outer [] to be read as a character class by every regex engine
df3 <- extract(df2, X1, into = c("Date", "Time"), "([^(]+)\\s+\\(([[:graph:]]+).") # separates the date and the time
View(df3)
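If the pattern looks like line noise, it can help to try it on a single string first. Here is a small base R sketch. The cell is a made-up value in the style of column X1, not copied from the real table, and the pattern is written with [[:graph:]] so that it also runs with base R’s regex functions.

```r
# Made-up cell in the style of column X1
cell <- "Thursday, 24 March (7:20 pm)"

# group 1: everything before the bracket
# group 2: the first run of non-space characters inside the bracket
pattern <- "([^(]+)\\s+\\(([[:graph:]]+)."

# regexec() finds the match; regmatches() returns the full match
# followed by each captured group
parts <- regmatches(cell, regexec(pattern, cell, perl = TRUE))[[1]]

date_part <- parts[2]
time_part <- parts[3]
```

Note that the second group stops at the first space, so in this example the “pm” is not captured. If you need it you would have to widen that group.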

 

So now if we view df3, the table isn’t looking too bad. I would probably want to create new columns for the winning and losing teams’ goals, behinds and totals, and separate the ground from the crowd size. Then we have one round’s worth of scores.


# getting the winning score
df4 <- extract(df3, X2, into = c("winning team", "winning Score"), "([^(]+)\\s+\\(([0-9]+).")
# what this does is take the numbers, but only the numbers inside the () #prettycoolay

Next we want to get the losing team’s score.

df5 <- extract(df4, X4, into = c("losing team", "losing Score"), "([^(]+)\\s+\\(([0-9]+).") # getting the losing score
View(df5)

Now let’s separate the venue from the crowd.

df6 <- separate(df5, X5, into = c("Venue", "Crowd"), sep = "[(]crowd:", remove = TRUE, convert = FALSE)
df6$Crowd <- gsub("\\)", "", df6$Crowd) # strip the closing bracket left behind
View(df6)
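At this point Crowd is still text, most likely with a leading space and a thousands comma. A minimal sketch of turning it into a number, using made-up values rather than the real crowds:

```r
# Made-up crowd strings in the shape left behind by the step above
crowd_raw <- c(" 75,706", " 28,505")

# drop everything that is not a digit, then convert to numeric
crowd <- as.numeric(gsub("[^0-9]", "", crowd_raw))
```

The same line should work on df6$Crowd, assuming every cell actually contains a crowd figure.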

Now let’s separate out the behinds.

df7 <- separate(df6, "winning team", into = c("winning.team", "winning.behinds"), sep = "\\.")
df8 <- separate(df7, "losing team", into = c("losing.team", "losing.behinds"), sep = "\\.")
View(df8)

Now for the finale, the GOALS!

# split where a letter is followed by a digit, e.g. between the team name and its goals
df9 <- separate(df8, winning.team, c("winning.team", "winning.goals"), "(?<=[a-z]) ?(?=[0-9])")
df10 <- separate(df9, losing.team, c("losing.team", "losing.goals"), "(?<=[a-z]) ?(?=[0-9])")
View(df10)
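A quick way to check that the goals, behinds and totals have been split correctly is to remember that a goal is worth 6 points and a behind 1. Here is a sketch that pulls all the pieces out of one made-up score cell in a single step and runs that check. The cell and the column names are invented for the example.

```r
# Made-up cell in the style of column X2
cell <- "Richmond 14.8 (92)"

# team name, goals, behinds and total, each in its own group
pattern <- "^(.*[a-z])\\s+([0-9]+)\\.([0-9]+)\\s+\\(([0-9]+)\\)$"
parts <- regmatches(cell, regexec(pattern, cell, perl = TRUE))[[1]]

score <- data.frame(
  team = parts[2],
  goals = as.integer(parts[3]),
  behinds = as.integer(parts[4]),
  total = as.integer(parts[5]),
  stringsAsFactors = FALSE
)

# goals are worth 6 points and behinds 1, so this should be TRUE
score$total == 6 * score$goals + score$behinds
```

If that check ever comes back FALSE on real data, the split has gone wrong somewhere (or the table has a typo).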


Now I understand that this might seem a bit overwhelming, especially if you are new to R. In my next blog post I will try to go through the lines and break them down into smaller chunks.

But hopefully this is enough to get the juices flowing.

 


Building a free team rating system for AFL

Building a team rating system for AFL has become awfully popular of late. My personal favourite is MatterofStats, which you can read about here: http://www.matterofstats.com. But it is not just Tony at MOS; there has been great content coming from http://plussixoneblog.com/, https://thearcfooty.com/, http://www.theroar.com.au/author/ryanbuckland7/ and https://hurlingpeoplenow.wordpress.com/, just to name a few.

So if you are reading this blog, you are probably in the wrong space… But if you want to start building out your own ELO rating system for AFL, well then stay tuned.

The following is a step-by-step guide to building a basic ELO rating system using R.

So to begin with you need, well, R, which you can download from https://cran.r-project.org/bin/windows/base/. The next thing I would recommend is an easier UI for R. Most people use RStudio, which you can download from https://www.rstudio.com/.

To build an ELO system we need data. For this I use http://afltables.com/afl/stats/biglists/bg3.txt

Now for the fun stuff. Once you have the data and R installed on your computer, you can go about building your very own AFL ELO model. Yep, no strings attached, for free, from free websites! Maybe you want to make some money, or maybe you just want to win a tipping comp with mates.

Below is some R script which will hopefully get you started. Once you get going, reach out if you have any questions.

getwd() # this is your working directory; it's where you should save your data/code
setwd("insert your directory here") # this is if you are organised and want to save your data somewhere else
install.packages("PlayerRatings") # this is the R package that someone very friendly built
library(PlayerRatings) # this loads the package so we can build out the ratings
install.packages("dplyr") # this is part of the hadleyverse and will help you manipulate your data
library(dplyr)

## now you will need the data; for this basic step I will assume you have already downloaded and cleaned it
# however, if you haven't, you can download a manipulated dataset built from afltables here:
# https://drive.google.com/open?id=0B2903kNbc39daC1VSUktanZPZFk
# this dataset has been manipulated to make running the ELO as quick as possible
# when you download the dataset (public afl data), move it out of your downloads folder and into the folder that gets printed when you run getwd()

Now that you have done the above, I want to get you excited, and the easiest way is to just run something and see it work. It’s tactile; it’s there, ready for you to see and interpret. It’s there for you to digest and critique. It’s there, and you can manipulate it any way you want.

Say you want to use scoring shots instead of points scored. Or say you want to use a different number of games to train your ELO. You can do it all here. Let’s begin.

Assuming you ran the above R script and it worked, to get a quick ELO rating all you have to do is run the code below.

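df <- read.csv("public afl data.csv")

# elo() expects four columns in this order:
# time period, first team, second team, result for the first team (1 = win, 0 = loss, 0.5 = draw)
x <- select(df, Week, HomeTeam, AwayTeam, Score)
x$Score <- as.numeric(x$Score)
x$HomeTeam <- as.character(x$HomeTeam)
x$AwayTeam <- as.character(x$AwayTeam)
x$Week <- as.numeric(x$Week)
elo(x) # prints the ratings table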
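If you are starting from your own scraped scores rather than the prepared file, the one fiddly bit is the result column. Here is a minimal sketch, assuming you have the points for each side. The data are made up, and HomePoints and AwayPoints are hypothetical column names that do not come from the dataset above.

```r
# Made-up results; HomePoints and AwayPoints are hypothetical column names
games <- data.frame(
  Week = c(1, 1, 2),
  HomeTeam = c("Richmond", "Melbourne", "Sydney"),
  AwayTeam = c("Carlton", "Geelong", "Collingwood"),
  HomePoints = c(92, 80, 75),
  AwayPoints = c(83, 80, 101),
  stringsAsFactors = FALSE
)

# result from the home team's point of view: 1 = win, 0.5 = draw, 0 = loss
games$Score <- ifelse(games$HomePoints > games$AwayPoints, 1,
                      ifelse(games$HomePoints < games$AwayPoints, 0, 0.5))

# keep the four columns in the order the rating function wants them
x <- games[, c("Week", "HomeTeam", "AwayTeam", "Score")]
```

From there x has the same shape as the x built from the prepared file.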

And there you have it: your own ELO rating system. Well, someone else’s, but you can edit from here.

Let’s say you found this blog because you are a bit of a numbers nut. Being a numbers nut, you think to yourself, hey, I think it makes more sense to have a higher/lower k factor than what I usually hear people use.

Well then, let me get you started.

elo(x, status = NULL, init = 2200, gamma = 5, kfac = 1,
 history = FALSE, sort = TRUE)

Play around with kfac and see what happens as you increase it from 0 to 5 to 10 to 20 to 25 etc. For those of you who want to know more about the parameters you can now edit, please read https://cran.r-project.org/web/packages/PlayerRatings/PlayerRatings.pdf
Play around with all the parameters and see which values you think make the most sense.
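To get a feel for what the k factor does before you start fiddling, here is a sketch of the textbook ELO update for a single game. This is the standard formula written out by hand, not the package’s internals (which may differ, for example in how gamma is applied), and expected_score and update_rating are names made up for this example.

```r
# Expected score for team A against team B (standard Elo logistic curve)
expected_score <- function(rating_a, rating_b) {
  1 / (1 + 10^((rating_b - rating_a) / 400))
}

# New rating for team A after one game
# result is 1 for a win, 0.5 for a draw, 0 for a loss
update_rating <- function(rating_a, rating_b, result, kfac) {
  rating_a + kfac * (result - expected_score(rating_a, rating_b))
}

# Two evenly rated teams, team A wins: compare a small and a large k factor
small_k <- update_rating(2200, 2200, result = 1, kfac = 5)
large_k <- update_rating(2200, 2200, result = 1, kfac = 25)
c(small_k = small_k, large_k = large_k)
```

The bigger the k factor, the more a single result moves the rating, which is the trade-off you are tuning when you play with kfac.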

There you go. Go forth, numbers nuts, and build out your own ELO system using free software and data.

Remember, numbers are there to help narrate the story you wish to tell. So now that you can do it, please go ahead. I’d love to see some more ELOs floating around!



			

How different is finals football to home and away footy?

We are entering what will perhaps be the closest finals series in recent memory. Up until the very last game, the order of the finals series was still being shaped. It wasn’t until halfway through the very last game of the home and away season that the Hawks locked in their top 4 spot. Amazing.

Finals footy: hard, contested footy. Lots of tackles and hard ball gets, the pressure is on and there’s just nothing like it. The atmosphere of the crowd, the knowing that there might be a next week to go to the footy. Finals footy holds a special place in people’s memories. But just how different are finals from home and away games?

We have all heard the narrative that finals footy is won at the coal face: that finals are won by winning the clearances, the hard contested ball and the one percenters, and that the games will be closer and lower scoring. But do these tales play out in the numbers, or are they more fairy tales, like the Dockers’ 2016 premiership hopes?


Looking at AFL games from 2002 to 2015, we can see some pretty interesting narratives. Finals are not only closer on average (a margin of 32 points vs 36 for the home and away season) but also lower scoring (86 vs 92 on average).

Teams are scoring less, margins are smaller, and this is reflected in an 18% drop in the inside 50 differential (the difference in scoring opportunities).
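For anyone who wants to check the arithmetic, here is a small sketch of a percentage change from home and away to finals, using the two pairs of averages quoted above. The function name pct_change is made up, and I am assuming the other percentages in this post are calculated the same way.

```r
# Percentage change going from the home and away average to the finals average
pct_change <- function(home_and_away, finals) {
  100 * (finals - home_and_away) / home_and_away
}

margin_change <- pct_change(36, 32) # average margin
score_change <- pct_change(92, 86)  # average score

round(c(margin = margin_change, score = score_change), 1)
```

So the average margin drops by roughly 11% and the average score by roughly 6.5%.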

Interestingly, the tackle differential of the winning team decreases by 58%, even though 4% more tackles are made in these high-pressure games.

But what about the coal face: the clearances, contested ball and the one percenters? Here we see some pretty interesting things.

Teams that win finals increase their contested possession differential by 19%, even though total contested possessions only increase by 2%. Their clearance differential increases by 35%, even though there are only 4% more total clearances in a game, and their one percenter differential increases by 83%, while total one percenters in finals only increase by 8%.

All up, this shapes up as an incredibly exciting finals series. Can’t wait.

Not so clear

When watching an AFL game it’s easy to get swept up in the talk that team x is dominating the clearances and that this explains why they are up in the game or have won it. But does it really, and can we assess this?

Let’s start with a simple premise: if we covered up the final score, could we infer who won the game just from the game statistics?

Clearly, in doing this we have to strip out the obvious, i.e. goals and scoring shots. We don’t want to state the obvious, i.e. that teams that score more goals win more. If we keep the variables that make up the score, we can quite confidently say who is going to win, which isn’t that fun…

What about the less obvious? Do teams that win the contested possession count win more often than teams that lose it? Do teams that win the clearances win more than others? What about the tackle count, the uncontested possession count, or teams that make fewer mistakes (clangers)?

Let’s look at some of the commentator favorites like clearances and contested possessions. It would be great to look at meters gained, but you know, Champion isn’t into releasing that kind of thing…

The purpose of model selection is to choose, from all possible models, a model with desirable properties. Usually this involves minimising a criterion that combines description loss with a penalty for model complexity.

We want a trade-off between descriptive power and complexity. We don’t want our model to overfit (useless), or to have so many variables that, while accuracy is good, usability is poor.

Description loss is usually measured by -2 × the log-likelihood, and model complexity by the number of parameters in the model.
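As a sketch of how those two pieces are commonly combined into one number to minimise, here is the usual penalised-likelihood form. The symbols are introduced here for illustration: L is the likelihood of the fitted model, p its number of parameters and λ the size of the penalty per parameter.

```latex
\mathrm{IC}(\lambda) \;=\; \underbrace{-2 \log L(\hat{\theta})}_{\text{description loss}} \;+\; \lambda \, \underbrace{p}_{\text{model complexity}}
```

Setting λ = 2 gives AIC, and setting λ = log n (with n the number of observations) gives BIC. A bigger λ punishes extra variables harder, so it favours smaller models.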

Visually, what we would like to see is whether a variable is important when modelling. Take clearances, for example. How often do you hear, when watching the footy, that a team is up in the clearances and that this explains why they are winning?

Let’s model this…

The variables I have chosen to model are the “commentator favorites”:

  1. Clearances
  2. Uncontested Possessions
  3. Contested possessions
  4. Tackles
  5. Clangers
  6. Inside 50s/Rebound 50s (How often the ball comes in vs comes out of its attacking 50m)
  7. Rebound 50s/Inside 50s (How often the team gets the ball out of its defensive 50m)

The “input variables” are the differences, i.e. if team a gets 40 clearances and team b gets 20, team a’s input value for clearances is 20 and team b’s is -20, etc.

The outcome variable in this case is 1 if the team wins and 0 if the team loses.
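To make that set-up concrete, here is a rough sketch in base R. It first builds the differential for the 40 vs 20 example, then fits a logistic regression with and without clearances on simulated data and reads off the description loss and number of parameters for each. The simulated numbers and column names are invented for illustration; this is not the real data set or the code behind the plots.

```r
# Two rows per game, one per team: the 40 vs 20 example from above
game <- data.frame(team = c("a", "b"), clearances = c(40, 20))
# each team's differential is its own count minus its opponent's
game$clearance_diff <- game$clearances - rev(game$clearances)

# Simulated stand-in for the real data (hypothetical column names)
set.seed(42)
n <- 300
sim <- data.frame(
  clearance_diff = round(rnorm(n, mean = 0, sd = 8)),
  tackle_diff = round(rnorm(n, mean = 0, sd = 10))
)
# outcome: 1 if the team wins, 0 if it loses
sim$win <- rbinom(n, size = 1,
                  prob = plogis(0.04 * sim$clearance_diff + 0.03 * sim$tackle_diff))

# logistic regression with and without clearances
with_clearances <- glm(win ~ clearance_diff + tackle_diff, family = binomial, data = sim)
without_clearances <- glm(win ~ tackle_diff, family = binomial, data = sim)

# description loss = -2 x log-likelihood; complexity = number of coefficients
description_loss <- function(fit) -2 * as.numeric(logLik(fit))
n_parameters <- function(fit) length(coef(fit))

comparison <- data.frame(
  model = c("with clearances", "without clearances"),
  description_loss = c(description_loss(with_clearances),
                       description_loss(without_clearances)),
  parameters = c(n_parameters(with_clearances), n_parameters(without_clearances))
)
print(comparison)
```

In this sketch the smaller model is nested inside the bigger one, so adding clearances can only lower the description loss. The interesting comparisons are between different models with the same number of parameters, where leaving a variable out can come out ahead.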

Rplot01

 

Looking at the plot above, we can see that for every model size (2, 3, 4, 5, 6 and 7 parameters) there exists a model that has LESS description loss when you EXCLUDE clearances from the model building process.

How do contested vs uncontested possessions fare in this description loss vs number of parameters battle?

From these graphs it would seem that, for a given number of regression parameters, you are better off without contested possessions when faced with a choice between contested and uncontested possessions!

However, the very best model, i.e. the one with the lowest description loss, is one that contains contested possessions, which just shows how interesting interaction effects can be!

We can see this a bit closer below.

Rplot02

Rplot03.jpeg

 

These graphs have some nice implications going forward:

  1. These graphs are based on all games from 2003-2015, including finals. It would be interesting to see whether the characteristics of finals wins differ from those of home and away wins by sub-setting the data.
  2. They give a nice visual representation of whether a variable should be included or excluded.
  3. They are a good way to see the trade-off between complexity and description loss. As a personal aside, I prefer less to more.

The R code used for these plots was edited from the code provided with the paper below which, like all good scripts, is reproducible; links are provided in the paper.

Murray, K., Heritier, S. and Müller, S., 2013. Graphical tools for model selection in generalized linear models. Statistics in medicine, 32(25), pp.4438-4451.

 

 

Sentiment Analysis of a couple of AFL games

Probably should work on shorter titles, but hey, it’s my blog, my rules!

Code used can be found here.

So I was at the #aflgiantseagles game watching my team, the Eagles, play, and boy was it a great game. I watched, devastated, as Chris Masten turned the ball over, but thankfully that was soon followed by euphoria as Naitanui kicked that wonderful last-kick winner off the wrong side of the boot, if you don’t mind.

My friend and I were sitting in the GWS members’ stand, and one of the members commented near the end, as the game was lost (or so I thought), that we seemed pretty upbeat for losers (boy, we showed them), and that it would be different if we were Tigers fans right now.

So let’s see: I wonder how tweets during Tigers games differ from tweets during Eagles games?

Well, we can see that below.

As for any questions about why I chose this package: to be honest, I bookmarked this site here a while ago. Long story short, I use Pocket to save things I’d like to try. I was working my way through the reading list and feeling a bit guilty about not posting anything. Hence the post.

Here are some pics from the footy.

Online Blogger

My first post!

This is an area where I will put my various attempts at statistical modelling with AFL data.

Stay tuned for player analysis and game analysis. I plan on sharing all code used…