Background and Data

In 2011, Iain Fyffe wrote an article in Hockey Prospectus that introduced the Central Scouting Integrator, or Cescin (an acronym taking the first two letters of each word). The goal was to create a mechanism that takes the separate rankings that the Central Scouting Service (CSS) produces and combines them so they are comparable. CSS is an arm of the NHL, and it creates rankings of players who will be eligible for the next NHL Entry Draft. CSS ranks players within four categories: European Goalies (EU G), European Skaters (EU SK), North American Goalies (NA G), and North American Skaters (NA SK), but it does not combine the rankings. Consequently, there are four players ranked 1, four players ranked 2, and so on. Cescin attempted to devise a method by which the lists could be aggregated based upon the past draft order of each category. Fyffe used data from the previous four drafts (2007-2010) to get a sense of where each rank on a list was drafted, on average. Here is a link to the current CSS rankings from the NHL site: CSS Rankings.

From Fyffe’s original article, here are the multipliers for those categories. Effectively, these are linear prediction equations for estimating where a prospective player will be drafted based on their Final Rank.

EU G: Cescin = 38.18 × Final Rank

EU SK: Cescin = 6.27 × Final Rank

NA G: Cescin = 13.29 × Final Rank

NA SK: Cescin = 1.35 × Final Rank

Here’s a link to a PDF of the original article, which we couldn’t find on the web, so we created one from a paper copy that Schuckers had: Fyffe Original Article

So the way that Cescin would be used is this: if we have the 3rd-ranked NA G, we would expect them to be drafted around 3×13.29=39.87, or about the 40th selection. Know the Category and the Final Rank and you can predict where a player is selected. Note that, effectively, Fyffe has created linear equations with y-intercepts of zero.
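As a quick illustration, here is a minimal R sketch of that calculation using the multipliers above; the lookup vector and helper function are ours, not from Fyffe's article.

# Fyffe's multipliers by CSS category (taken from the equations above)
fyffe_mult <- c("EU G" = 38.18, "EU SK" = 6.27, "NA G" = 13.29, "NA SK" = 1.35)

# Predicted draft slot = multiplier for the category times the Final Rank
cescin_pred <- function(category, final_rank) unname(fyffe_mult[category]) * final_rank

cescin_pred("NA G", 3)   # 39.87, i.e. roughly the 40th pick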

The goals of this article are to create a complete public dataset for doing analysis of NHL draft data and to update Fyffe’s work using this larger dataset. We’ve recently put together a dataset that merges the CSS rankings and the NHL Draft selection data from 2003 to 2019. That data can be downloaded from CSS and NHL Draft Combined.

This dataset is a work in progress and is meant to be. In particular, the data is created by joining the CSS rankings with data from the NHL draft; where a player was selected is the Overall variable in the dataset. Who is selected in the NHL draft is fairly easy data to collect since it is on the NHL site and many others. The CSS rankings are on the NHL page from 2008 onwards, but the earlier data is harder to obtain. Sites like The Draft Analyst are invaluable for past history in an electronically digestible form. Note that these sites have additional variables, not included in this analysis, that might be useful for work related to the NHL draft. We had some of these data available from Schuckers’ 2016 MIT Sloan Sports Analytics Conference paper, which used the Cescin as a predictor variable for NHL outcomes: 2016 MIT Sloan Paper. The hardest part of the joining is matching names that are often spelled differently or that change, such as Michael to Mike, Sergey to Sergei, or Axelson to Axelsson. The basis for all of this was the senior thesis of Amanda Butterfield, a Statistics major who plays on St. Lawrence’s Division I women’s hockey team. She began the work of trying to automate some of the matching. Automation was not completely possible, and the creation of this file was a sometimes arduous process. We were only interested in matching within a particular year, so if a player was ranked by CSS in 2007 but not drafted until 2008, they are not matched in our data. We did a full join, so the players who were not matched are still in the dataset. That is, players who were ranked by CSS but not drafted and players who were drafted but not ranked by CSS are both in this dataset. Additionally, the data contains a Cescin value, from Fyffe’s original paper, for each player.
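To give a flavor of the matching problem, here is a small, purely illustrative sketch (not Butterfield's actual code) using base R's adist() to find the closest candidate for each name so a human can review the pairs; the names themselves are made up.

# Purely illustrative: made-up names showing the spelling variants described above
css_names   <- c("Sergey Ivanov", "Mike Johansson", "Karl Axelson")
draft_names <- c("Sergei Ivanov", "Michael Johansson", "Karl Axelsson")

# Levenshtein distances between every CSS name and every draft-list name
d <- adist(css_names, draft_names, ignore.case = TRUE)
best <- apply(d, 1, which.min)        # closest draft-list name for each CSS name
data.frame(css = css_names, draft = draft_names[best],
           distance = d[cbind(seq_along(best), best)])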

The intention here is to have this be a dataset that gets updated. If you have information on any matches that we missed, data that is incorrect, or additional data that could be added, please email me with the information.

# Packages used: dplyr (data manipulation), ggplot2 (plots), knitr/kableExtra (tables)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)

players<-read.csv("http://myslu.stlawu.edu/~msch/NHLDraft_CSS_Cescin_2020c.csv")

# Players without a CSS Final Rank get the placeholder Cescin value of 1500 (see below)
players$cescin[is.na(players$FINAL.RANK)]=1500
#qplot(FINAL.RANK, Overall, data=players)
players<-players %>%
  mutate(Category=paste(Region,Pos)) %>%   # e.g. "EU G", "NorAm SK"
  mutate(recent=(Year>2011))               # flag for the 2012-2019 drafts

Below we are going to plot some of the data and look at the relationships. We limit the plots to the first 224 overall picks, which is what we will have once Seattle joins the party. Further, players who did not have a Cescin value because they were not in the CSS rankings were given a value of 1500, which is what Schuckers did in the Draft by Numbers paper cited above. That number is much higher than any of the Cescin values obtained from the data. Some players that CSS includes in its lists do not have a Final Rank, and some players in the first ranking are not in the second; part of this is because CSS does two rankings, a Midterm Rank and a Final Rank. This is problematic since these players are presumably ranked lower than their cohort who do have rankings, but all of them were given the Cescin value of 1500. An analysis of Midterm Rank vs Final Rank would be a valuable one.

Here is a basic breakdown of the number of players in each category in our data by year (the NA NA column collects players for whom no Region or Pos is recorded). In total, there are 7968 players in this dataset (as of January 29, 2020). For 2003 we have very little data for EU SK; what we have is the most complete version we could find at the moment, and we hope that someone has something better somewhere in their archives. Additional years of historical data would also be useful for completeness. For the analyses that follow, since we are interested in predicting draft position, Overall, from CSS Final Rank, we have removed players who did not have a ranking in the year they were drafted. This is somewhat problematic, as we will see below.
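The breakdown can be reproduced with a call along these lines (an assumption on our part, mirroring the table() call shown for the drafted players further down):

# Counts of players by draft year and CSS category
table(players$Year, players$Category)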

##       
##        EU G EU SK NA NA NorAm G NorAm SK
##   2003   12    25    98      30      207
##   2004   13   161    63      30      275
##   2005   10   144    36      30      250
##   2006   18   165    18      30      222
##   2007   16   175    43      30      210
##   2008   15   175    37      30      170
##   2009   12   171    43      25      215
##   2010    9   131    38      23      238
##   2011   10   159    40      30      249
##   2012   11   146    43      35      249
##   2013   10   142    32      35      250
##   2014   14   157    33      30      252
##   2015   10   129    32      41      237
##   2016   13   161    34      38      244
##   2017   13   150    38      31      250
##   2018   18   141    29      38      256
##   2019   12   144    30      36      246

Here is the breakdown of drafted players (picks 1 to 224) by Category and Year. The only trend that we see here is that the number of European Skaters (EU SK’s) has increased a bit over the last five years or so. Because there is only a finite number of slots, that likely means there will be a lower value for Cescin for that group going forward. The group that seems to have decreased as a result is the North American Skaters (NorAm SK’s).

drafted<-players %>%
  filter(!is.na(Overall)) %>%
  filter(Overall<225)

table(drafted$Year,drafted$Category)
##       
##        EU G EU SK NorAm G NorAm SK
##   2003    6    24      13      120
##   2004    6    58      17      115
##   2005    2    40      17      131
##   2006    7    50      17      122
##   2007    3    24      13      128
##   2008    5    27      15      127
##   2009    5    32      12      118
##   2010    4    28      13      127
##   2011    6    31      10      123
##   2012    5    24      14      125
##   2013    2    35      14      128
##   2014    9    34       9      125
##   2015    5    41      12      122
##   2016    4    40      12      121
##   2017    6    43      12      118
##   2018    8    50      16      114
##   2019    4    54      12      117

Relationships

This section will look at some of the relationships among these observations. The first thing we want to do is to look at the relationship between Final Rank and Overall. That is shown below with a plot for each category as well as the smoothed curve. Here we’ve used a loess smoother with a span of 0.75. What is really obvious is that the relationship is not linear.

gg<-ggplot(players, aes(x=FINAL.RANK, y=Overall, colour=Category))+
  facet_wrap( ~ Category, ncol=3)+
  geom_point(alpha=0.1)+
  xlim(c(0,250))+
  ylim(c(0,224))+
  geom_smooth(color="black",method="loess",span=0.75)+
  facet_grid(vars(Region),vars(Pos))+
  ggtitle("Overall vs Final Rank (2003-2019)")
gg

Next, we want to consider Cescin values from Fyffe’s article. To illustrate the relationship, the plot below shows Overall vs. the Cescin value for each player. It seems clear that assuming a linear relationship between these two variables is not appropriate; the smoothed relationships (black lines) in the plot suggest some non-linearity. Cescin does spread out the x-axis values for the four categories of players.

gg1<-ggplot(players, aes(x=cescin, y=Overall, colour=Category))+
  facet_wrap( ~ Category, ncol=3)+
  geom_point(alpha=0.1)+
  xlim(c(0,600))+
  ylim(c(0,224))+
  geom_smooth(color="black",method="loess",span=0.75)+
  facet_grid(vars(Region),vars(Pos))+
  ggtitle("Overall vs Cescin (2003-2019)")
gg1

Here is the plot of the Overall vs Final Rank data using only the last 8 years of data, i.e., since the 2012 draft. What you can see, and what should be obvious, is that because there are only so many draft picks, there is a ceiling on how large the predicted value for Overall can grow. EU G’s seem to have a bit of a downward trend among the higher Final Ranks, and perhaps NorAm G’s do also.

gg<-ggplot(players[players$recent,], aes(x=FINAL.RANK, y=Overall, colour=Category))+
  facet_wrap( ~ Category, ncol=3)+
  geom_point(alpha=0.1)+
  xlim(c(0,250))+
  ylim(c(0,224))+
  geom_smooth(color="black",method="loess",span=0.75)+
  facet_grid(vars(Region),vars(Pos))+
  ggtitle("Overall vs Final Rank (2012-2019)")
gg

Below is the plot with the same variables but only data from 2003 to 2011. While it is hard to judge from these plots, the basic functional form of the relationship between Final Rank and Overall seems to be about the same for the more recent period above (2012 to 2019) and the older period below (2003 to 2011). European Goalies (EU G) might be an exception to that.

gg<-ggplot(players[!players$recent,], aes(x=FINAL.RANK, y=Overall, colour=Category))+
  facet_wrap( ~ Category, ncol=3)+
  geom_point(alpha=0.1)+
  xlim(c(0,250))+
  ylim(c(0,224))+
  geom_smooth(color="black",method="loess",span=0.75)+
  facet_grid(vars(Region),vars(Pos))+
  ggtitle("Overall vs Final Rank (2003-2011)")
gg

We next look at the positions (D, F, and G) by Region, specifically to focus on whether or not the relationship between Overall and Final Rank differs between D’s and F’s. Visually, from the graph below, we don’t see much evidence of a difference.

gg<-ggplot(players[!is.na(players$Position2),], aes(x=FINAL.RANK, y=Overall, colour=Category))+
  facet_wrap( ~ Position2, ncol=3)+
  geom_point(alpha=0.1)+
  xlim(c(0,250))+
  ylim(c(0,224))+
  geom_smooth(color="black",method="loess",span=0.75)+
  facet_grid(vars(Region),vars(Position2))+
  ggtitle("Overall vs Final Rank by Position (2003-2019)")
gg

Returning to the full data, the next plot shows the relationship between Overall and Final Rank using a linear fit. This doesn’t pass the eye test, to be sure.

gg2<-ggplot(players, aes(x=FINAL.RANK, y=Overall, colour=Category))+ 
  facet_wrap( ~ Category, ncol=3)+
  geom_point(alpha=0.10)+
  xlim(c(0,250))+ 
  ylim(c(0,224)) +
  geom_smooth(color= "black",method="lm")+
  facet_grid(vars(Region),vars(Pos))+ggtitle("Overall vs Final Rank (2003-2019), linear")

gg2

Here is the linear fit for the same data with the line forced to go through the origin (0,0), i.e., without an intercept in the model. This is better, though the smoothed fits above suggest that the model does not do as well for larger values of Final Rank.
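In an R model formula, the "- 1" term is what drops the intercept; for a single category, the through-the-origin fit sketched in the plot corresponds to something like the line below (illustrative only; the pooled, by-Category models are fit in the next section).

# Through-the-origin fit for one category, for illustration
lm(Overall ~ FINAL.RANK - 1, data = subset(players, Category == "NorAm SK"))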

gg3<-ggplot(players, aes(x=FINAL.RANK, y=Overall, colour=Category))+
  geom_point(alpha=0.1)+
  stat_smooth(formula=y~x-1,method="lm",color="black")+
  facet_wrap(~Category)+
  xlim(c(0,250))+
  ylim(c(0,224))+
  facet_grid(vars(Region),vars(Pos),
             scales="fixed")+
  ggtitle("Overall vs Final Rank (2003-2019), linear fit no intercept")
  
gg3

Finally, we return to the plots by Position using a linear fit without an intercept for data from 2012 onward. The resulting plot below suggests slopes/coefficients that are approximately the same for F’s and D’s.

temp.players<- players[!is.na(players$Position2)&players$recent,]
gg3<-ggplot(temp.players, aes(x=FINAL.RANK, y=Overall, colour=Category))+
  geom_point(alpha=0.1)+
  facet_wrap(~Position2)+
  stat_smooth(formula=y~x-1,method="lm",color="black")+
  xlim(c(0,250))+
  ylim(c(0,224))+
  facet_grid(vars(Region),vars(Position2),
             scales="fixed")+
  labs(title="Overall vs Final Rank by Position (2012-2019),\n linear fit no intercept")+ 
     theme(plot.title = element_text(hjust = 0.5))
  
gg3

Statistical Models

Next we ran a series of statistical models to formally analyze the quality of model fit for the relationship between Overall and Final Rank by Category. Linear, quadratic, and nonparametric regressions were considered, as well as models with and without y-intercepts. Given the additional variability in Overall for larger values of Final Rank, we also considered Poisson and quasi-Poisson regression, which can deal with that variation. Initially, we ran a wide variety of models using the mean absolute deviation, MAD, of their residuals as our criterion for average fit. If you show the code you’ll be able to see these models. We also did some out-of-sample prediction using recent years as folds. Further, we contemplated using an isotonic or monotonic regression for these models, but ultimately chose not to for reasons outlined below.
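To make the criterion concrete, here is a rough helper equivalent to what the code below computes inline for each model fit on the original scale (the square-root models back-transform their predictions before computing MAD); the function name is ours.

# Sketch of the fit criterion: mean absolute deviation (MAD) of predictions from
# the observed draft slot, over rows with Overall, FINAL.RANK, and Category present
mad_fit <- function(model, data) {
  keep <- !(is.na(data$Overall) | is.na(data$FINAL.RANK) | is.na(data$Category))
  mean(abs(predict(model, type = "response") - data$Overall[keep]))
}
# e.g., something like mad_fit(reg0, players) for the first model below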

After doing that model building, it occurred to us that this modelling was problematic for these data. These data are truncated/censored in a statistical sense, which means that we don’t have observed values for many of these players: either the player was not drafted and therefore has no value for when they were selected, Overall, or they were not ranked by CSS for a particular year and so there is no value for the Final Rank variable. We can drop all of the players who are missing a value on one of the variables, but that biases our analyses. Looking at the data, a player with a very high (poor) Final Rank can still be drafted within the available picks, but no one can be drafted later than the final selection; those players simply go undrafted and have no Overall value. This makes us think the nonlinearity at the higher values of Final Rank is not real but an artifact of the censoring. As a consequence, we’ve gone with Fyffe’s original model, a linear model with no intercept, since that seems to best fit the data in the neighborhood of the origin, that is, near (0,0).

# Candidate models; for each we record the mean absolute deviation (MAD) of
# in-sample predictions from the observed draft position
mad_table<-NULL
reg0<-lm(Overall~FINAL.RANK:Category-1,data=players)
summary(reg0)
#plot(reg0)
mad_table<-data.frame(Name="reg0",MAD=mean(abs(predict(reg0)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK))])),Description="Linear Model each Category, no intercept")
reg1<-lm(Overall~FINAL.RANK*Category,data=players)
summary(reg1)
#plot(reg1)
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg1",MAD=mean(abs(predict(reg1)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description="Linear Model each Category"))
#try box-cox transformation
#boxcox(reg0,lambda=seq(-2, 2, length = 20))
#boxcox(reg1,lambda=seq(-2,2,length=20))
reg2=lm(sqrt(Overall)~FINAL.RANK*Category,data=players)
#plot(reg2)
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg2",MAD=
mean(abs(predict(reg2)^2-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description="Sqrt (Y) ~ Linear for each Category"))
#
reg3=lm(sqrt(Overall)~FINAL.RANK*Category+I(FINAL.RANK^2):Category,data=players)
summary(reg3)
#plot(reg3)
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg3",
MAD=mean(abs(predict(reg3)^2-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),
Description="Sqrt(Y) ~ Quadratic for each Category"))
#
reg4=glm(Overall~FINAL.RANK*Category+I(FINAL.RANK^2):Category,data=players,family=quasipoisson(link="identity"))
summary(reg4)
#plot(reg4)
#
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg4",MAD=mean(abs(predict(reg4)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description="QuasiPoisson Regression, Identity Link, Quadratic for each Category"))
#
reg5=glm(Overall~FINAL.RANK*Category,data=players,family=quasipoisson(link="identity"))
summary(reg5)
#plot(reg5)
mad_table<-mad_table %>%
bind_rows(data.frame(Name="reg5",MAD=
mean(abs(predict(reg5)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description=
  "QuasiPoisson Regression, Identity Link, Linear for each Category"))
#
reg6=glm(Overall~FINAL.RANK*Category,data=players,family=quasipoisson (link="log"))
summary(reg6)
#plot(reg6)
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg6", MAD=
mean(abs(predict(reg6,type="response")-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])), Description="QuasiPoisson, Log link, Linear for each Category"))
#
reg7=glm(formula = Overall ~ FINAL.RANK:Category + I(FINAL.RANK^2):Category - 1, family = quasipoisson(link = "identity"), data = players)
summary(reg7)
#plot(reg7)
mean(abs(reg7$residuals))
mean(abs(predict(reg7)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))]))
#
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg7",
MAD=mean(abs(predict(reg7,type="response")-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description="QuasiPoisson, Identity link, Quadratic for each Category, no intercept"))
#
reg8=glm(formula = Overall ~ FINAL.RANK:Category  - 1, family = quasipoisson(link = "identity"), data = players)
summary(reg8)
#plot(reg8)
mean(abs(predict(reg8)-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))]))
#
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg8",MAD=
mean(abs(predict(reg8,type="response")-players$Overall[!(is.na(players$Overall)|is.na(players$FINAL.RANK)|is.na(players$Category))])),Description="QuasiPoisson, Identity link, Linear for each Category, no intercept"))
#
players<-players %>%
  arrange(Year,Category,FINAL.RANK)
#
# In-sample loess fits, one per Category, over a grid of span values;
# store the combined MAD for each span
span_mad=seq(from = 0.3, to = 0.8, by = 0.05)
mad_out=rep(NA,length(span_mad))
#
    for(i in 1:length(span_mad))
    {
    reg9eug=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="EU G",],span=span_mad[i])
 #   
    reg9eusk=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="EU SK",],span=span_mad[i])
#    
    reg9nag=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="NorAm G",],span=span_mad[i])
 #   
    reg9nask=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="NorAm SK",],span=span_mad[i])
#    
    mad_out[i]<-(sum(abs(reg9eug$residuals))+
      sum(abs(reg9eusk$residuals))+
     sum(abs(reg9nag$residuals))+
      sum(abs(reg9nask$residuals)))/length(players$Overall[!is.na(players$Overall)])
    }
#
players<-players %>%
  arrange(Year,Category,FINAL.RANK)%>%
  filter(!is.na(Overall) & !is.na(FINAL.RANK))   # keep rows where both values are observed
#
# Out-of-sample check: leave one draft year (2014-2019) out, fit loess per
# Category on the remaining years, and sum the absolute prediction errors
span_mad=seq(from = 0.5, to = 0.9, by = 0.05)
years_mat=2014:2019
mad_out2=matrix(NA,nrow= length(span_mad),ncol=length(years_mat))
#
for (j in 1:length(years_mat))
{
    for(i in 1:length(span_mad))
    {
    reg9eug=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="EU G"&players$Year!=years_mat[j],],span=span_mad[i])
#
reg9eusk=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="EU SK"&
        players$Year!=years_mat[j],],span=span_mad[i])
#    
  reg9nag=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="NorAm G"&
    players$Year!=years_mat[j],],span=span_mad[i])
#    
  reg9nask=loess(formula=Overall ~ FINAL.RANK, data=players[players$Category=="NorAm SK"&players$Year!=years_mat[j],],span=span_mad[i])
#    
  mad_out2[i,j]<-
  sum(abs(predict(reg9eug,newdata=players[players$Category=="EU G"&players$Year==years_mat[j],])-players$Overall[players$Category=="EU G"&players$Year==years_mat[j]]),na.rm=TRUE)+
    sum(abs(predict(reg9eusk,newdata=players[players$Category=="EU SK"&players$Year==years_mat[j],])-players$Overall[players$Category=="EU SK"&players$Year==years_mat[j]]),na.rm=TRUE)+
    sum(abs(predict(reg9nag,newdata=players[players$Category=="NorAm G"&players$Year==years_mat[j],])-players$Overall[players$Category=="NorAm G"&players$Year==years_mat[j]]),na.rm=TRUE)+
    sum(abs(predict(reg9nask,newdata=players[players$Category=="NorAm SK"&players$Year==years_mat[j],])-players$Overall[players$Category=="NorAm SK"&players$Year==years_mat[j]]),na.rm=TRUE)
    }
}
#
mad_table<-mad_table %>%
  bind_rows(data.frame(Name="reg9",
MAD=mad_out[11],Description="Non parametric Regression via LOESS"))

The next step after settling on the model was to look at whether there have been changes in recent years. Below we’ve also added Fyffe’s multipliers for comparison. From the table we can see that there have been some changes. Since the 2012 NHL Entry Draft, European Goalies (EU G) are taken earlier in the draft, as are European Skaters (EU SK), while North American Goalies (NorAm G) are taken slightly later relative to their Final Rank. North American Skaters (NorAm SK) have a multiplier that is almost identical across the two time periods. We’ve not done a formal calculation with the standard errors, but none of these differences seem to be more than two standard errors apart.

When comparing to Fyffe’s numbers, we can see that his values for EU players were considerably higher, for both EU G’s and EU SK’s. Some of that is no doubt due to methodology, since our estimates for the earlier period were also higher than for the recent period, though not as high as Fyffe’s, especially for those two groups. The multipliers for NorAm G and NorAm SK are similar to Fyffe’s. Note that the data that leads to the curvature in the smoothed plots potentially also makes the slopes/coefficients smaller.

#Compare 2003-2011 to 2012-2019, 
regfinalold=glm(formula = Overall ~ FINAL.RANK:Category  - 1, family = quasipoisson(link = "identity"), data = players[players$recent==0,])
regfinalnew=glm(formula = Overall ~ FINAL.RANK:Category  - 1, family = quasipoisson(link = "identity"), data = players[players$recent==1,])

#
table.coef<-data.frame(bind_rows(round(regfinalold$coefficients,2),round(regfinalnew$coefficients,2))) 
table.coef<-rbind(table.coef,c(38.18,6.27,13.25,1.35))
names(table.coef)<-c("EU G","EU SK","NorAm G","NorAm SK")
row.names(table.coef)<-c("2003-2011","2012-2019","Fyffe")
#
table.coef %>%
  kable() %>%
  kable_styling()
            EU G  EU SK  NorAm G  NorAm SK
2003-2011  24.16   3.30    11.03      1.27
2012-2019  21.03   2.86    12.63      1.29
Fyffe      38.18   6.27    13.25      1.35
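A rough check on the two-standard-error statement above can be read straight off the model summaries, for example:

# Point estimates and standard errors for the period-specific multipliers
summary(regfinalold)$coefficients[, c("Estimate", "Std. Error")]
summary(regfinalnew)$coefficients[, c("Estimate", "Std. Error")]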

Discussion

The primary goal of this project was to create a merged dataset of Central Scouting Service Rankings and NHL Entry Draft data. As linked above, that data is now available. The long term goal is to update that data each year as it becomes available. As we said above, we’re happy to update the current data if errors are found.

The second goal was to replicate Iain Fyffe’s Cescin analysis with newer data. We’ve done that to some extent. There seem to be some changes to the coefficients that his analysis developed, particularly in the case of EU players. More sophisticated analyses of these data, attacking the problem of the data’s missingness, are probably warranted at some point, and we’d be happy to collaborate with someone on that project. Dealing appropriately with missing or unobserved data is challenging. As part of that analysis, some flexibility in the functional form (linear, quadratic, spline) also seems worth investigating further.
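For anyone picking that up, one possible direction (an illustrative sketch only, not something we have run or validated here) is a right-censored, Tobit-style regression via the survival package, treating CSS-ranked players who went undrafted as censored at the final pick rather than dropping them.

library(survival)

# Illustrative sketch: 'full' is the merged dataset *before* rows with missing
# Overall were dropped, so ranked-but-undrafted players remain in the data.
full <- read.csv("http://myslu.stlawu.edu/~msch/NHLDraft_CSS_Cescin_2020c.csv")
full <- full %>%
  mutate(Category = paste(Region, Pos)) %>%
  filter(!is.na(FINAL.RANK)) %>%                # need a Final Rank to predict from
  mutate(drafted = !is.na(Overall),
         pick = ifelse(drafted, Overall, 224))  # censor undrafted players at the last
                                                # pick (224 here; varies by draft year)

tobit_fit <- survreg(Surv(pick, drafted) ~ FINAL.RANK:Category - 1,
                     data = full, dist = "gaussian")
summary(tobit_fit)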

Thanks to Timo Seppa and Sam Ventura for reading and commenting on a draft of this article.