\documentclass[submit]{smj}

\usepackage{booktabs} 
\usepackage{subfig}

\Author{Gunther Schauberger\Affil{1,}\Affil{2}
        and Andreas Groll\Affil{3}}
\AuthorRunning{Gunther Schauberger and Andreas Groll}


\Affiliations{
\item Chair of Epidemiology,
Department of Sport and Health Sciences,
      Technical University of Munich,
      Germany 
\item Department of Statistics, 
      Ludwig-Maximilians-Universit\"at M\"unchen,
      Germany

\item Faculty of Statistics,
      Technische Universit\"at Dortmund, 
      Germany
}   

\CorrAddress{Gunther Schauberger, 
Chair of Epidemiology,
      Department of Sport and Health Sciences, 
      Technical University of Munich,
			Georg-Brauchle-Ring 56,
			80992 M\"unchen,
      Germany}
\CorrEmail{gunther.schauberger@tum.de}
\CorrPhone{(+49)\;89\;289\;24955}
\CorrFax{(+49)\;89\;289\;24953}


\Title{Predicting matches in international football tournaments with random forests}
\TitleRunning{Predicting matches in international football tournaments with random forests}


\Keywords{
Random forests, Football, FIFA World Cups, Poisson regression, Regularisation
}


\Abstract{
Many approaches that analyse and predict results of international matches in football are based on 
statistical models incorporating several potentially influential covariates with respect to a national team's success, such as the bookmakers' ratings or the FIFA ranking. 
Based on all matches from the four previous FIFA World Cups 2002 -- 2014, we compare 
the predictive performance of the most common {\em regression models} that are based on the teams' covariate information with that of an alternative modelling class, the so-called {\em random forests}.
Random forests can be seen as a mixture between machine learning and statistical modelling and are known for their high predictive power. Here, we consider two different types of random forests depending on the choice of response. One type predicts the precise numbers of goals while the other considers the three match outcomes {\em win, draw} and {\em loss} using special algorithms for ordinal responses. {\color{red}To account for the specific data structure of football matches, in particular at FIFA World Cups, the random forest methods are slightly altered compared to their standard versions and adapted to this specific application.}
}

\begin{document}
%\SweaveOpts{concordance=TRUE}

<<echo=FALSE,message=FALSE,warning=FALSE,results ='hide'>>=
library(xtable)
library(ggplot2)
library(gridExtra)
library(grid)
@

\maketitle


\section{Introduction}

In the last decade, an increasing interest in the modelling and prediction of major 
international football events has emerged. As a consequence, many different statistical techniques
and approaches have been applied and adapted to deal with different types of football data.
Among these, an essential class of models is based on regression methods, which incorporate covariate information of the opposing teams. 

In particular, Poisson regression models have gained a lot of attention, where the numbers of goals of both competing teams can be directly linked to a set of influence variables. Early references in this 
context are, for example, \citet{Lee:97} and \citet{Dyte:2000}.
The latter researchers focus on scores in international football matches, treating each
team's goals as (conditionally) independent Poisson variables depending on two influence variables, namely the team's FIFA ranking and the match venue. More recently, \citet{GroAbe:2013} and \citet{GroSchTut:2015} have further extended these Poisson models by incorporating a large set of potential influence variables as well as (either random or fixed) team-specific ability parameters. By using different regularisation techniques %, in particular, different types of $L_1$-penalization, 
they discovered a sparse set of relevant covariates, which were then used to predict the European championship (EURO) 2012 and FIFA World Cup 2014 winners, respectively. In both cases, the actual tournament winner was identified as the most likely one by the model.

Note that, implicitly, all of these models treat the two numbers of goals scored in a match as independent
(conditioned on covariate information). First approaches to account for possible dependencies
between the scores by using adjusted Poisson models were proposed by \citet{DixCol:97} 
and \citet{RueSal:2000}. Alternatively, the bivariate Poisson distribution 
allows one to explicitly model (positive) dependence within the Poisson framework. One of the first works
dealing with this distribution in the context of football data is \citet{Mah:82}.
Furthermore, an extensive study on the use of the bivariate Poisson distribution for the modelling of football data can be found in \citet{KarNtz:2003}. However, in \citet{GrollEtAl2018} it has been shown with the help of gradient boosting techniques that (at least for their setting of European championship data) no additional modelling of the covariance structure is necessary: for suitably designed 
linear predictors, which are based on highly informative covariates, two (conditionally) independent Poisson distributions are adequate.

We also want to mention a completely different approach, 
which is solely based on the easily available source of 
``prospective'' information contained in bookmakers' odds;
see \citet{Leit:2010a} and their follow-up papers. 
They obtain winning probabilities for each team 
by aggregating winning odds from several online bookmakers
and then using inverse tournament simulation to compute team-specific 
abilities by paired comparison models. 
%Using this technique the effects of the tournament draw are stripped. 
Based on these abilities, pairwise probabilities 
for each possible game at the corresponding tournament can be calculated
and, finally, the whole tournament can be simulated.

In this work, we pursue a different approach and investigate an alternative
tool for the prediction of the outcomes of football matches, namely random (decision) forests --
an ensemble learning method for classification, regression and other tasks proposed by \citet{Breiman:2001a}.
The method stems from the machine learning and data mining community and operates by first constructing a multitude of so-called decision trees (see, e.g., \citealp{qui:1986}; \citealp{BreiFrieOls:84}) on a training data set. {\color{red} For prediction, the predictions from the individual trees are summarized, either by taking the mode of the predicted classes (in classification) or by averaging the predicted values (in regression).} This way, random forests reduce the tendency to overfit and the variance compared to regular decision trees, and, hence, are a popular and powerful tool for prediction. {\color{red} Therefore, random forests might also be a promising alternative for the prediction of football matches.} In the present work, we use random forests both for metric responses (i.e., the number of goals) and for ordinal responses (i.e., win-draw-loss) as well as the combination of both. On a data set containing all matches of the FIFA World Cups 2002 -- 2014 we compare the predictive performance of these different types of random forests with conventional regression methods for count data, such as Poisson generalized linear models (GLMs).


% Of course, as the resulting winning probabilities are very closely connected to the underlying odds,
% it is practically impossible to predict an underdog team as the tournament winner.
% Consequently, while for the UEFA EUROs 2008 and 2012 as well as for the FIFA World 2010 
% the approach successfully predicted the tournament 
% winner, which was always the team favored by the bookmakers, the results for the FIFA World Cup 2014 and the UEFA Euro 2016 were inferior (for details, see \citealp{Zeil:2014} and \citealp{Zeil:2016}).

The rest of the manuscript is structured as follows: in Section~\ref{data} we describe the underlying data set covering all matches of the four preceding FIFA World Cups 2002 -- 2014. Next, in Section~\ref{modeling} we explain how random forests can be used as prediction tools for the outcomes of football matches.
Alternative regression-based methods are summarized in Section~\ref{alternatives}.
The main differences between the outputs of random forests and regression models for football predictions are highlighted in Section~\ref{sec:heuristic:comp}. Both modelling alternatives are then compared with regard to their predictive performance in Section~\ref{comparison}. Finally, we conclude in Section~\ref{conclusion}.


\section{Data}
\label{data}

In this section, we provide a brief description of the underlying data
set covering all matches of the four preceding FIFA World Cups 2002 -- 2014 together with
several potential influence variables. In general, we use essentially %exactly 
the same set of covariates that is introduced in \citet{GroSchTut:2015}. For each participating team, most of these covariates are observed 
shortly before the start of the respective World Cup (e.g., the FIFA ranking) 
or for the same year of the World Cup (e.g., the GDP per capita). Therefore, the covariate values of the teams may vary from one World Cup to another. Several of the variables contain information about the recent performance and sportive success of national teams, as it is reasonable to assume that 
the current form of a national team has an influence on the team's success in the upcoming
tournament. Besides these sportive variables, certain economic factors as well as variables describing the structure of a team's squad are also collected. A detailed description of these variables can be found in \citet{GroSchTut:2015}.




\begin{itemize}
\item {\it GDP per capita.} To account for the general 
	increase of the gross domestic product (GDP) during 2002 -- 2014, a ratio of the GDP per capita of the respective country and the worldwide average GDP per capita is used (source: \url{http://unstats.un.org/unsd/snaama/dnllist.asp}).\vspace{-0.2cm}
\item {\it Population.} The population size is 
	used as a ratio with the respective global population to account for the general growth of the world population (source: \url{http://data.worldbank.org/indicator/SP.POP.TOTL}).\vspace{-0.2cm}
	\item {\it \red{ODDSET probabilities}.} Bookmaker odds provided by the German state betting agency ODDSET are converted into winning probabilities. Therefore, the variable reflects probabilities for each team to win the respective World Cup\footnote{The possibility of betting on the World Champion before the start of the tournament is rather novel. ODDSET, for example, offered the bet for the first time at the FIFA World Cup 2002.}.\vspace{-0.2cm}
\item {\it FIFA Rank.} The FIFA ranking provides a ranking system for all national 
	teams measuring the performance of the team over the last four years (source: \url{http://de.fifa.com/worldranking/index.html}). \vspace{-0.2cm}
\item {\it Host.} A dummy variable 
indicating whether or not a national team is a hosting country.\vspace{-0.2cm} 
\item {\it Continent.} A dummy variable indicating if a national team is from the same 
	continent as the host of the World Cup (including the host itself).\vspace{-0.2cm}
\item {\it Confederation.} This categorical variable comprises the confederation of the respective team with (in principle) six possible values: Africa (CAF);
    Asia (AFC); Europe (UEFA); North, Central America and Caribbean (CONCACAF); Oceania (OFC); South America (CONMEBOL). The confederations OFC and AFC had to be merged because in the data set only one team (New Zealand, 2006) from OFC participated in a World Cup.\vspace{-0.2cm}
\item {\it (Second) maximum number of teammates.} For each squad, both the maximum 
	and second maximum number of teammates playing together in the same national club are counted. \vspace{-0.2cm}
\item {\it Average age.} The average age of each squad is collected.\vspace{-0.2cm} 
\item {\it Number of Champions League (Europa League) players.}  
	As a measurement of the success of the players on club level, the number of 
	players in the semi finals (taking place only a few weeks before the 
	respective World Cup) of the UEFA Champions 
	League (CL) and UEFA Europa League are counted.\vspace{-0.2cm} 
\item {\it Number of players abroad.} For each squad, the number of players 
	playing in clubs abroad (in the season previous to the respective World Cup) is counted.\vspace{-0.2cm}
\item{\it Factors describing the team's coach:}
For the coach of each national team, the {\it Age} and the duration of his {\it Tenure} are observed. Furthermore, a dummy variable is included 
indicating whether or not the coach has the same {\it Nationality} as his team.
	\end{itemize}
	
\red{In total, this adds up to 16 variables which were collected separately for each World Cup and each participating team.} For illustration, Table~\ref{data1} shows exemplarily for the first four matches of the FIFA World Cup 2002 the results (\ref{tab:results}) and (parts of) the covariates (\ref{tab:covar}) of the respective teams. In the remainder of this section, this data excerpt will be used to illustrate how the final data sets are constructed.

	\begin{table}[h]
\small
\caption{\label{data1} Exemplary table showing the results of four matches and parts of the covariates of the involved teams.}
\centering
\subfloat[Table of results \label{tab:results}]{
\begin{tabular}{lcr}
  \hline
 &  &  \\ 
  \hline
FRA \includegraphics[width=0.4cm]{flags/fra.png} & 0:1 &  \includegraphics[width=0.4cm]{flags/sen.png} \;SEN\\
URU \includegraphics[width=0.4cm]{flags/uru.png} & 1:2 &  \includegraphics[width=0.4cm]{flags/den.png} \;DEN\\
FRA \includegraphics[width=0.4cm]{flags/fra.png} & 0:0 &  \includegraphics[width=0.4cm]{flags/uru.png} \;URU\\
DEN \includegraphics[width=0.4cm]{flags/den.png} & 1:1 &  \includegraphics[width=0.4cm]{flags/sen.png} \;SEN\\
  \vdots & \vdots & \vdots  \\
  \hline
\end{tabular}}
\hspace*{0.8cm}
\subfloat[Table of covariates \label{tab:covar}]{
\begin{tabular}{llrrrrr}
  \hline
World Cup & Team &  Age & Rank & Oddset &   \ldots \\ 
  \hline
2002 & France  & 28.3 & 1 & 0.149 & \ldots \\ 
2002 &  Uruguay & 25.3 & 24 & 0.009 & \ldots \\ 
2002 &  Denmark & 27.4 & 20 & 0.012 & \ldots\\ 
2002 &  Senegal & 24.3 & 42 & 0.006 & \ldots\\ 
  \vdots & \vdots & \vdots & \vdots & \vdots  &  $\ddots$ \\
   \hline
\end{tabular}
}
\end{table}

For the modelling techniques introduced in the following sections, all of the metric 
covariates are incorporated in the form of differences. For example, the final variable {\it Rank} will be the difference between  the FIFA ranks of both teams. The categorical variables {\it Host},
{\it Continent}, {\it Confederation} and {\it Nationality}, however, are included as separate variables for both competing teams.
For variable {\it Confederation}, for example, this results in two columns of the corresponding design matrix denoted by 
{\it Confed} and {\it Confed.oppo}, where {\it Confed} is referring to the confederation of the first-named team
and {\it Confed.oppo} to the one of its opponent.

In general, we will consider two different types of response variables, which lead to two fundamentally different data sets. For the first type, the number of goals is directly used as response variable. Therefore, each match corresponds to two different observations, one per team. The second type uses the ordinal variable with categories 1 (\textit{win}), 2 (\textit{draw}) and 3 (\textit{loss}) from the perspective of the first-named team. Therefore, in this case each match represents one observation in the data set. The covariate differences are also computed from the perspective of the first-named team. For illustration, the resulting data structures for the exemplary matches from Table~\ref{data1} are displayed in Table~\ref{data2}, separately for the count data response in Table~\ref{data_goals} and for the ordinal response in Table~\ref{data_ord}.
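The construction of both data structures from a single match record can be sketched as follows. This is a minimal, illustrative Python sketch with hypothetical field names (the actual data preparation in this paper is done in \texttt{R}); it produces the two goal-level rows and the single ordinal row for one match, with all metric covariates entering as differences from the perspective of the first-named team.

```python
# Illustrative sketch (hypothetical field names): one match record is turned
# into two rows for the count-data response and one row for the ordinal response.

def match_to_rows(team, oppo, goals, goals_oppo, cov, cov_oppo):
    """cov / cov_oppo: dicts of metric covariates (e.g. Age, Rank, Oddset)."""
    # covariate differences from the perspective of the first-named team
    diff = {k: cov[k] - cov_oppo[k] for k in cov}
    diff_oppo = {k: -v for k, v in diff.items()}
    goal_rows = [
        {"Goals": goals, "Team": team, "Opponent": oppo, **diff},
        {"Goals": goals_oppo, "Team": oppo, "Opponent": team, **diff_oppo},
    ]
    # ordinal response: 1 = win, 2 = draw, 3 = loss (first-named team's view)
    result = 1 if goals > goals_oppo else (2 if goals == goals_oppo else 3)
    ordinal_row = {"Result": result, "Team": team, "Opponent": oppo, **diff}
    return goal_rows, ordinal_row
```

Applied to the opening match France 0:1 Senegal, this reproduces the corresponding rows of Tables~\ref{data_goals} and \ref{data_ord} (e.g., {\it Age} difference 4.00, {\it Rank} difference $-41$, ordinal result 3).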



\begin{table}[!h]
\small
\centering
\caption{Exemplary tables illustrating the data structure for both response types.}\label{data2}
\subfloat[%Exemplary table illustrating the data 
Data structure for count data response (goals). \label{data_goals}]{
\begin{tabular}{rllrrrr}
  \hline
Goals & Team & Opponent & Age & Rank & Oddset &  ... \\ 
  \hline
0 & France & Senegal & 4.00 & -41 & 0.14 &  ...  \\ 
  1 & Senegal & France & -4.00 & 41 & -0.14 &  ...  \\ 
  1 & Uruguay & Denmark & -2.10 & 4 & -0.00 &  ...  \\ 
  2 & Denmark & Uruguay & 2.10 & -4 & 0.00 &  ...  \\ 
    0 & France & Uruguay & 3.00 & -23 & 0.14 &  ... \\ 
  0 & Uruguay & France & -3.00 & 23 & -0.14 &  ... \\ 
  1 & Denmark & Senegal & 3.10 & -22 & 0.01 &  ...  \\ 
  1 & Senegal & Denmark & -3.10 & 22 & -0.01 &  ... \\ 
	 \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &  $\ddots$ \\
   \hline
\end{tabular}
}

\subfloat[%Exemplary table illustrating the data 
Data structure for ordinal response (1: win; 2: draw; 3: loss). \label{data_ord}]{
\begin{tabular}{rllrrrr}
  \hline
Result & Team & Opponent & Age & Rank & Oddset &  ... \\ 
  \hline
3 & France & Senegal & 4.00 & -41 & 0.14 &  ...  \\ 
  3 & Uruguay & Denmark & -2.10 & 4 & -0.00 &  ...  \\ 
    2 & France & Uruguay & 3.00 & -23 & 0.14 &  ... \\ 
  2 & Denmark & Senegal & 3.10 & -22 & 0.01 &  ...  \\ 
	 \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &  $\ddots$ \\
   \hline
\end{tabular}
}
\end{table}



\section{Modelling football results using random forests}
\label{modeling}
In this work we propose to use random forests as prediction tools for the outcomes of football matches. Before introducing specific strategies for the application of random forests to football data, we start with a general introduction to the basic ideas of random forests. 


\subsection{Random forests}
Random forests were introduced by \citet{Breiman:2001a} as an extension of the method proposed by \citet{Ho:98}. The underlying principle of random forests is the aggregation of a (large) number of classification or regression trees and, therefore, the method can be used both for classification and for regression purposes. The single trees are grown independently of each other. To obtain a final prediction, the predictions of the single trees are aggregated, either by majority vote (for classification) or by averaging (for~regression).

Before going into further detail on the principles of random forests, we briefly sketch the main essence of classification and regression trees \citep{BreiFrieOls:84}. In general, the term classification tree is used for trees with categorical (or binary) response variables, while trees for metric responses are called regression trees. With classification and regression trees, the feature space is partitioned, and each partition has its own prediction (or its own model, see, e.g., \citealp{zeileis2008model}). \red{The partitioning of the predictor space is done recursively and can follow different criteria.} However, the main goal is always to find the split which provides the strongest difference between the two new partitions with respect to the chosen criterion. Observations within the same partition are supposed to be as similar as possible, while observations from different partitions are supposed to be very different (with respect to the response variable). The splits are performed sequentially; each partition can be further partitioned in the following step. The consecutive splitting steps can be visualized as a tree diagram.
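The split search at a single node can be sketched as follows. This is a minimal Python sketch for a single numeric predictor, using the reduction in the sum of squared errors as the split criterion; the \texttt{ctree} algorithm used below for the actual analysis is instead based on conditional inference tests.

```python
# Minimal sketch of the split search in a regression tree: every candidate cut
# point is evaluated by the total within-partition sum of squared errors, and
# the cut with the smallest total is returned.

def best_split(x, y):
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best_cut, best_sse = None, float("inf")
    for cut in sorted(set(x))[:-1]:          # exclude max: right side non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_cut, best_sse = cut, total
    return best_cut
```

Applied recursively to each resulting partition (with a suitable stopping rule), this procedure grows the full tree.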

For illustration, we fit an exemplary regression tree. We use (a part of) the data introduced in Section~\ref{data}, which we will use later for an in-depth comparison of the predictive power of the different methods. The data contain all matches from the FIFA World Cups 2002 -- 2014. As response, we consider all final scores of the teams, i.e., we have two observations per match. For simplicity, we only use three predictor variables, namely the differences between the {\it FIFA Rank}, \red{the bookmakers' probabilities {\it Oddset}} and the {\it Age} of both teams.

<<echo=FALSE, results = "hide", message=FALSE, warning=FALSE>>=
library(party)
load("../data/data.02.14.diff.rda")
data.RF <- data.02.14.diff[seq(1,nrow(data.02.14.diff)-1,by=2),]

match.index <- rep(1:nrow(data.RF),each=2)

  data.RF$Y <- rep(1,nrow(data.RF))
  for(i in 1:max(match.index)){
    goals.i <- c(data.02.14.diff[match.index==i,"Goals"])

    diff.i <- diff(goals.i)
    if(diff.i < 0){data.RF$Y[i] <- 1}
    if(diff.i==0){data.RF$Y[i] <- 2}
    if(diff.i > 0){data.RF$Y[i] <- 3}
  }
    data.RF$Y <- as.ordered(data.RF$Y)
    
    # form.rf2 <- as.formula(paste("Y ~ ",paste(names(data.RF)[4:((ncol(data.RF))-1)],collapse=" + ")))
    # 
    # tree2 <- ctree(Y~Rank+age+confed+confed.oppo, data = data.RF)
    # plot(tree2)
    
    names(data.02.14.diff)[c(15,12)] <- c("Oddset","Age")

tree1 <- ctree(Goals~Rank+Oddset+Age, data= data.02.14.diff)
pdf("figure/tree_goals.pdf", width =9)
plot(tree1)
dev.off()
@

\begin{figure}[!ht]
	\centering
		\includegraphics[width=.9\textwidth]{figure/tree_goals.pdf}
	\caption{Exemplary regression tree for FIFA World Cup data. Number of goals is used as response variable, {\it FIFA Rank}, {\it Oddset} and {\it Age} are used as predictors.}
	\label{tree_goals}
\end{figure}

Figure~\ref{tree_goals} shows the resulting regression tree using the function \texttt{ctree} from the \texttt{R}-package \texttt{party} \citep{Hotetal:2006}. The predictor space is partitioned into five partitions, and each of the predictors is used at least once for a split. Now, the tree could be used as a \red{prediction method for new observations}. In each node, the average value of the response variable of the node members is used as prediction. 

Random forests are built by growing a large number of different classification or regression trees, which are applied to new observations by combining their individual predictions. The main goal is to decrease the variance compared to single trees. Therefore, it is necessary to decrease the correlation between the single trees. For that purpose, two different randomisation steps are applied. First, the trees are not fitted to the original sample but {\color{red} to bootstrap samples or random subsamples of the data.} Second, at each node a (random) subset of the predictor variables is drawn, which is then used to find the best split. {\color{red} In contrast to regular trees, the single trees in random forests are commonly not pruned. Pruning leads to a lower variance but also increases the bias. Accordingly, an unpruned tree has the advantage of being nearly unbiased but the disadvantage of a high variance. The combination of many trees, however, compensates for these effects. Therefore, by de-correlating and combining many trees, predictions with low bias and reduced variance can be achieved.} 
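The two randomisation steps can be sketched as follows. This is a simplified, self-contained Python sketch (the hypothetical helper \texttt{fit\_tree} stands in for any single-tree learner); for brevity, the predictor subset is drawn once per tree here, rather than at every split as in Breiman's algorithm.

```python
import random

# Sketch of a regression forest: each tree sees a bootstrap sample and a random
# subset of mtry predictors; the final prediction averages over all B trees.
# `fit_tree(X, y, feats)` is a hypothetical single-tree learner returning a
# function x -> prediction.

def random_forest(X, y, fit_tree, B=500, mtry=2, seed=1):
    random.seed(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = random.sample(range(p), mtry)           # random predictor subset
        trees.append(fit_tree([X[i] for i in idx],
                              [y[i] for i in idx], feats))
    def predict(x):
        preds = [tree(x) for tree in trees]
        return sum(preds) / len(preds)                  # average over the trees
    return predict
```

With a constant-prediction stand-in for \texttt{fit\_tree}, the forest prediction simply approaches the overall mean of the response, which illustrates the averaging step.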

\subsection{Random forests for football results}
\label{rffootball}

In this section, we explain how random forests can be applied to football data. In principle, we distinguish between two fundamentally different approaches depending on the type of response (as already described in Section~\ref{data}). 
Similar to the methods introduced in the following section, the first approach uses the number of goals as response. Here, each match is represented by two rows in the data set, one per team. The response is treated as a metric variable and the forest is built using regression trees; note that no explicit distributional assumption is necessary for the application of random forests. The second type of random forest directly tries to classify the ordinal outcome (win-draw-loss; always from the perspective of the first-named team) of a match. Therefore, each match is represented by a single row in the data set. 

\subsubsection{\bf{Random forest for the prediction of the number of goals}}
\label{rfgoals}
When the metric variable {\it Number of Goals} is considered as the response, we use regression trees as the single trees, which are then combined into a random forest. The basic principle is that a predefined number $B$ of trees (e.g., $B=5000$) is fitted based on (bootstrap samples of) the training data. For the prediction of a new observation, its covariate values are dropped down each of the regression trees, resulting in $B$ predictions. The final prediction is simply the average over all $B$ predictions. This prediction can be seen as a point estimate of the expected value of the response conditional on the covariate values. 

We use two slightly different variants of random forests for this approach. First, we use the variant of the classical random forest algorithm proposed by \citet{Breiman:2001a} as implemented in the \texttt{R}-package \texttt{ranger} \citep{ranger}. The second variant is the function \texttt{cforest} from the \texttt{party} package. Here, the single trees are constructed following the principle of conditional inference trees as proposed in \citet{Hotetal:2006}. The main advantage of conditional inference trees is that they avoid selection bias in cases where the covariates have different scales, e.g., numerical vs.\ categorical with many categories. The advantages of these so-called conditional random forests over classical ones are described by \citet{strobl07} and \citet{Strobl-etal:2008}. Furthermore, the single predictions are not aggregated by simple averaging but by using observation weights as described in \citet{hothorn04}. 

However, the point estimates for the numbers of goals cannot directly be used for the prediction of the outcome of single matches or a whole tournament. Simply plugging in both predictions corresponding to one match does not deliver an integer outcome (i.e., a result) for the match. \red{For example, one might get predictions of 2.3 goals for the first and 1.1 goals for the second team.} Furthermore, as no explicit distribution is assumed for these predictions, it is not possible to randomly draw results for the respective match. Hence, similar to the regression methods described in the next section, we will use the predicted expected value for the number of goals as an estimate for the event rate $\lambda$ of a Poisson distribution $Po(\lambda)$. This way, we can randomly draw results for single matches and compute probabilities for the match outcomes \textit{win}, \textit{draw} and \textit{loss} by using two (conditionally) independent Poisson distributions for both scores.
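The conversion of two predicted rates into outcome probabilities can be sketched as follows, using the illustrative rates 2.3 and 1.1 from above. The sketch sums the products of two independent Poisson probability masses over a grid of scorelines, truncated at a maximum number of goals per team.

```python
from math import exp, factorial

# Win/draw/loss probabilities from two independent Poisson distributions with
# rates lam1 (first-named team) and lam2 (opponent); the scoreline grid is
# truncated at max_goals, so the three probabilities sum to one up to a
# negligible truncation error.

def outcome_probs(lam1, lam2, max_goals=15):
    p1 = [exp(-lam1) * lam1**k / factorial(k) for k in range(max_goals + 1)]
    p2 = [exp(-lam2) * lam2**k / factorial(k) for k in range(max_goals + 1)]
    g = range(max_goals + 1)
    win = sum(p1[i] * p2[j] for i in g for j in g if i > j)
    draw = sum(p1[i] * p2[i] for i in g)
    loss = sum(p1[i] * p2[j] for i in g for j in g if i < j)
    return win, draw, loss
```

For $\lambda_1=2.3$ and $\lambda_2=1.1$, the win probability of the first-named team clearly exceeds its loss probability, as one would expect.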

\subsubsection{\bf{Random forest for the prediction of ordinal match outcomes}}
\label{rford}
If, instead of the number of goals, the ordinal match outcomes are used as response variable, random forests specifically designed for ordinal responses are applied. They can be seen as a hybrid of regression and classification forests. In principle, forests for ordinal responses are built as regression forests where the ordinal categories are replaced by metric score values. The determination of the exact score values depends on the type of algorithm, but they can also be set by the user. The aggregation of the single trees is then carried out analogously to classification forests, i.e., the final prediction is determined by majority vote over the single regression tree predictions.

Again, two different variants are used. The first variant is again the function \texttt{cforest} from the \texttt{party} package. By default, for a three-categorical ordinal response it simply uses the values 1, 2 and 3 as score values. Equivalent to conditional forests for metric response variables, \texttt{cforest} uses observation weights to aggregate the single predictions. The second variant is an algorithm recently proposed by \citet{roman} that is implemented in the \texttt{ordinalForest} package \citep{ordfor}. In contrast to the previous method, it uses a complex pre-processing step to learn sensible values for the scores in a data-driven way. Hence, one can avoid the (rather restrictive) assumption that the differences between the single ordinal categories (i.e., between the scores) are equal.


When the ordinal response is considered, it is essential which team is the first-named team. If the order of the teams were reversed, a value of 1 would be replaced by 3 and vice versa. Of course, such an inversion also needs to be accompanied by a redefinition of the covariates. However, even though 1 and 3 are in principle interchangeable, it turns out that across all four tournaments the relative frequencies of the three results are not balanced: while $40.6\%$ of the matches are wins of the first-named team (i.e., take value 1), only $31.6\%$ of the matches are wins of the second-named team (response value 3).
The reason is that, due to FIFA's specific tournament design, there exists a certain structure with respect to first- and second-named teams: especially during the group stage of the World Cups, the first-named team is usually the team from the ``stronger'' draw pot. This can be problematic because it implies certain asymmetries in the sense that there is a higher a priori probability of category 1 compared to category 3 for each match, even if in a particular match the first-named team is expected to be worse than its opponent. For a good predictive performance it would be preferable if the ordering of first- and second-named teams were random. Therefore, we address this issue by additional randomisation based on the following steps:

{\it
\begin{enumerate}
\item Build $T=50$ versions of the training data with a randomised distribution of first- and second-named team.
\item Learn a separate random forest for each of the $T$ randomised training data sets.
\item Build a second version of the test data set which contains each match from the original test data with the inverted order of the competing teams. 
\item Using each of the $T$ random forests, obtain predictions for each match, separately for both versions \red{of the test data} and average the total of $2\cdot T$ predictions, i.e., probabilities for all three response categories, as the final prediction for the respective match. 
\end{enumerate}
}
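Steps 3 and 4 above can be sketched as follows. This is a minimal, self-contained Python sketch: \texttt{predict\_fns} stands in for the $T$ fitted forests, each returning a probability triple $(p_{\text{win}}, p_{\text{draw}}, p_{\text{loss}})$, and a prediction for the inverted team order is mapped back by swapping the win and loss probabilities before averaging.

```python
# Average the 2*T predictions for one match: every forest predicts the match
# both in its original order and with the teams inverted; an inverted
# prediction is mapped back by swapping win and loss probabilities.

def averaged_probs(predict_fns, match, match_inverted):
    probs = []
    for predict in predict_fns:
        probs.append(predict(match))
        w, d, l = predict(match_inverted)
        probs.append((l, d, w))          # swap win/loss for the inverted order
    n = len(probs)
    return tuple(sum(p[k] for p in probs) / n for k in range(3))
```

With $T$ forests, this yields exactly the $2\cdot T$ averaged predictions of step~4.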

Obviously, the random forests for ordinal match outcomes cannot directly be used for the simulation of exact match results. Therefore, we propose to combine an ordinal random forest with a random forest predicting the number of goals from Section~\ref{rfgoals}. In that case, for the simulation of a match result we first randomly draw one of the three match outcomes based on the probabilities obtained from the ordinal forest. Subsequently, we randomly draw exact scores for both teams using the predictions from a random forest for the number of goals as described in Section~\ref{rfgoals}. Then, the first match result which coincides with the drawn match outcome is accepted. For example, if in the first step we draw a win of the first-named team, we accept the first result where the first-named team scores more goals than the second-named team. 
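This combined simulation can be sketched as follows. The Python sketch first draws one of the outcomes win/draw/loss from the ordinal forest's probabilities and then draws Poisson scores from the goal forest's rates until a scoreline consistent with the drawn outcome appears (a simple inversion sampler is used for the Poisson draws).

```python
import random
from math import exp

def draw_poisson(lam, rng):
    # inversion sampling for a Poisson variate with rate lam
    u, k, p = rng.random(), 0, exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def simulate_match(p_win, p_draw, p_loss, lam1, lam2, rng=None):
    rng = rng or random.Random(0)
    # step 1: draw the match outcome from the ordinal forest's probabilities
    outcome = rng.choices(["win", "draw", "loss"],
                          weights=[p_win, p_draw, p_loss])[0]
    # step 2: accept the first Poisson scoreline consistent with the outcome
    while True:
        g1, g2 = draw_poisson(lam1, rng), draw_poisson(lam2, rng)
        if (outcome == "win" and g1 > g2) or \
           (outcome == "draw" and g1 == g2) or \
           (outcome == "loss" and g1 < g2):
            return g1, g2
```

The accepted scoreline is thus always consistent with the outcome drawn from the ordinal forest, while the exact scores reflect the rates of the goal forest.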


\section{Alternative approaches}
\label{alternatives}

We want to compare the described random forest approaches to more traditional modelling approaches which have already been used for modelling football results, at least in similar settings. In general, the most frequently used modelling approach for football results is to treat the scores of the competing teams as (conditionally) independent variables following a Poisson distribution (possibly conditioning on certain covariates). Especially the works of \citet{DixCol:97} and \citet{Mah:82} laid the foundations for this modelling approach. Therefore, all methods described in this section use the numbers of goals scored by the single teams as response variables (compare Table~\ref{data_goals}). If, as in our case, one wants to include several covariates of the competing teams in the model, it is sensible to use regularisation techniques when estimating the models to allow for variable selection and to avoid overfitting. In the following, we will apply three different regularisation approaches.

\subsubsection*{\bf Lasso}\vspace{-0.4cm}
The simplest approach is to use a conventional Lasso \citep{Tibshirani:96} penalty for the covariate parameters. In this model, the single scores are used as response variables and (conditionally on the covariates) a Poisson distribution is assumed. Each score is treated as a single observation, so that each match contributes two observations. Accordingly, for $n$ teams the respective model has the form
\begin{eqnarray}
Y_{ijk}|\boldsymbol{x}_{ik},\boldsymbol{x}_{jk}&\sim &Po(\lambda_{ijk})\,, \nonumber\\
\log(\lambda_{ijk})&=&\beta_0 + (\boldsymbol{x}_{ik}-\boldsymbol{x}_{jk})^\top\boldsymbol{\beta}+\boldsymbol{z}_{ik}^\top\boldsymbol{\gamma}+\boldsymbol{z}_{jk}^\top\boldsymbol{\delta}\,.
\label{lasso}
\end{eqnarray}
Here, $Y_{ijk}$ denotes the score of team $i$ against team $j$ in tournament $k$, where $i,j\in\{1,\ldots,n\},~i\neq j$. The metric characteristics of both competing teams are captured in the $p$-dimensional vectors $\boldsymbol{x}_{ik}, \boldsymbol{x}_{jk}$, while $\boldsymbol{z}_{ik}$ and $\boldsymbol{z}_{jk}$ capture dummy variables for the categorical covariates {\it Host}, {\it Continent}, {\it Confed} and {\it Nation.Coach} (built, for example, by reference encoding), separately for the considered teams and their respective opponents. For these variables, it is not sensible to build differences between the respective values. Furthermore, $\boldsymbol{\beta}$ is a parameter vector which captures the linear effects of all metric covariate differences, and $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ collect the effects of the dummy variables corresponding to the teams and their opponents, respectively. For notational convenience, we collect all covariate effects in the $\tilde p$-dimensional vector $\boldsymbol{\theta}^\top=(\boldsymbol{\beta}^\top, \boldsymbol{\gamma}^\top, \boldsymbol{\delta}^\top)$. 

For estimation, instead of the regular log-likelihood $l(\beta_0,\boldsymbol{\theta})$ the penalized log-likelihood 
\begin{eqnarray}
l_p(\beta_0,\boldsymbol{\theta}) = l(\beta_0,\boldsymbol{\theta}) - \lambda P(\boldsymbol{\theta})
\label{eq:lasso}\end{eqnarray}
is maximized, where $P(\boldsymbol{\theta})=\sum_{v=1}^{\tilde p}|\theta_v|$ is the ordinary Lasso penalty
and $\lambda$ is a tuning parameter; note that the intercept $\beta_0$ remains unpenalized. The optimal value of the tuning parameter $\lambda$ is determined by 10-fold cross-validation (CV). The model is fitted using the function \texttt{cv.glmnet} from the \texttt{R}-package \texttt{glmnet} \citep{FriHasTib:2008}. In contrast to the similar ridge penalty \citep{HoeKen:70}, which penalizes squared parameters instead of absolute values, the Lasso not only shrinks parameters towards zero, but can also set parameters to exactly zero. Therefore, depending on the chosen value of the tuning parameter, the Lasso also enforces variable selection. 
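The Lasso-penalised Poisson likelihood described above can be sketched numerically as follows. The paper itself uses \texttt{cv.glmnet}; the following is merely a minimal numpy stand-in based on proximal gradient descent with soft-thresholding, with an illustrative function name, an unpenalized intercept and a crude fixed step size.

```python
import numpy as np

def poisson_lasso(X, y, lam, n_iter=5000):
    """Lasso-penalised Poisson regression via proximal gradient descent.
    Minimises -loglik/n + lam * sum(|beta_j|); intercept unpenalised."""
    n, p = X.shape
    beta0, beta = np.log(y.mean() + 1e-9), np.zeros(p)
    step = 1.0 / (np.abs(X).max() ** 2 * p + 1.0)  # crude safe step size
    for _ in range(n_iter):
        mu = np.exp(beta0 + X @ beta)      # Poisson rates
        beta0 -= step * (mu - y).mean()    # gradient step for intercept
        beta = beta - step * (X.T @ (mu - y) / n)
        # soft-thresholding: shrinks and can set coefficients exactly to zero
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta0, beta
```

For small $\lambda$ the informative coefficients are recovered, while a sufficiently large $\lambda$ sets all covariate effects to exactly zero, illustrating the variable-selection property.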

In addition to the conventional Lasso solution that minimizes the 10-fold CV error, the 
\texttt{cv.glmnet} function from the \texttt{glmnet} package 
also provides a sparser solution, based on a different strategy for choosing the optimal value of the tuning parameter $\lambda$: instead of the model with the minimal CV error, one chooses the most restrictive model whose CV error is within one standard error of that minimum. We refer to this method as {\it Lasso (1se)} in the following.
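The one-standard-error rule can be sketched as follows; this is an illustrative stand-alone function, not the internal implementation of \texttt{cv.glmnet}.

```python
import numpy as np

def lambda_1se(lambdas, cv_mean, cv_se):
    """Pick the most restrictive lambda whose CV error is within one
    standard error of the minimum (the 'Lasso (1se)' rule)."""
    lambdas, cv_mean, cv_se = map(np.asarray, (lambdas, cv_mean, cv_se))
    i_min = np.argmin(cv_mean)
    threshold = cv_mean[i_min] + cv_se[i_min]   # one-SE band around the minimum
    ok = cv_mean <= threshold
    return lambdas[ok].max()                    # larger lambda = sparser model

# the minimal CV error sits at lambda = 0.1; the 1se rule picks the sparser 0.5
lam = lambda_1se([1.0, 0.5, 0.1, 0.01],
                 [0.90, 0.72, 0.70, 0.75],
                 [0.05, 0.04, 0.03, 0.03])
```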


\subsubsection*{\bf Gamboost}\vspace{-0.4cm}
Compared to the Lasso approach presented above, the Gamboost approach extends the model from linear to smooth covariate effects $f_v$ for all metric covariates, leading to $\log(\lambda_{ijk})=\beta_0 + \sum_{v=1}^pf_v(x_{vik}-x_{vjk})+\boldsymbol{z}_{ik}^\top\boldsymbol{\gamma}+\boldsymbol{z}_{jk}^\top\boldsymbol{\delta}$. In this case, suitable penalization is not as straightforward as above. Therefore, we switch from penalization to boosting and use the function \texttt{gamboost} from the package \texttt{mboost} \citep{mboost}. Boosting is an iterative fitting procedure where in each step the fit is updated only by a small increment. In our case, in every step only one of the covariates is selected (the one associated with the highest improvement of the fit) and only the respective smooth function is updated. 
Each of these updates is itself a rather smooth, only slightly non-linear function, which prevents overfitting. For all dummy variables the updates are simply linear effects, and to avoid overfitting these linear updates are also shrunk. For more details on boosting algorithms see, e.g., \citet{BueHot:2007}. Similar to the Lasso, boosting has to be tuned; here, the number of boosting steps is the main tuning parameter. Tuning is done by fitting the model to $B$ bootstrap samples (here $B=25$) and evaluating the out-of-bag prediction error based on the left-out observations. Implicitly, the selection of the number of boosting iterations also enforces variable selection, because all covariates whose smooth function (or linear parameter) has never been updated before the optimal number of boosting steps is reached are eliminated from the final model.
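The componentwise update scheme can be sketched as follows. This toy version uses squared-error loss and simple linear base learners instead of the Poisson loss and smooth functions of \texttt{gamboost}, so it only illustrates the selection-and-shrinkage mechanism; all names are illustrative.

```python
import numpy as np

def componentwise_boost(X, y, n_steps=100, nu=0.1):
    """Componentwise boosting sketch: in each step, fit every covariate
    to the current residuals by least squares, then update only the
    best-fitting one, shrunk by the step length nu."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_steps):
        best_j, best_rss, best_b = 0, np.inf, 0.0
        for j in range(p):
            b = X[:, j] @ resid / (X[:, j] @ X[:, j])   # candidate slope
            rss = ((resid - b * X[:, j]) ** 2).sum()    # fit improvement
            if rss < best_rss:
                best_j, best_rss, best_b = j, rss, b
        coef[best_j] += nu * best_b          # shrunk update of one covariate
        resid -= nu * best_b * X[:, best_j]
    return intercept, coef
```

Covariates whose coefficient is never updated stay at exactly zero, which is the implicit variable selection mentioned above.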

\subsubsection*{\bf Group Lasso}\vspace{-0.4cm}
Finally, we use a Group Lasso approach, which is a different extension of the Lasso approach presented above. It corresponds to the approach proposed by \citet{GroSchTut:2015}, where it was used to predict the FIFA World Cup 2014. Here, the linear predictor from \eqref{lasso} is extended by team-specific attack and defense effects for all competing teams and has the form $\log(\lambda_{ijk})=\beta_0 + (\boldsymbol{x}_{ik}-\boldsymbol{x}_{jk})^\top\boldsymbol{\beta} +\boldsymbol{z}_{ik}^\top\boldsymbol{\gamma}+\boldsymbol{z}_{jk}^\top\boldsymbol{\delta}+ att_i - def_j$. Note that, as pointed out by \citet{GroSchTut:2015}, the inclusion of team-specific effects renders the parameters corresponding to the {\it Continent} and {\it Confed} variables unidentifiable. Therefore, the respective terms are excluded from $\boldsymbol{z}_{ik}$, $\boldsymbol{\gamma}$, $\boldsymbol{z}_{jk}$ and $\boldsymbol{\delta}$. The attack and defense parameters are considered as fixed effects and are again estimated using a penalized likelihood approach which extends the conventional Lasso penalty term by a Group Lasso \citep{YuanLin:2006} penalty term. Altogether, the penalty term reads
\begin{eqnarray*}
P(\beta_0,\boldsymbol{\theta},\boldsymbol{att},\boldsymbol{def})=\sum_{v=1}^{\tilde p}|\theta_v| + \sqrt{2}\sum_{i=1}^n\sqrt{att_i^2+def_i^2}\,,
\end{eqnarray*} 
where $\boldsymbol{att}^\top=(att_1,\ldots,att_n)$ and $\boldsymbol{def}^\top=(def_1,\ldots,def_n)$, and the factor $\sqrt{2}$ accounts for the respective group size. 
The second part of the penalty term represents a group penalty on the team-specific effects, such that both effects corresponding to the same team form a group of parameters.
In Group Lasso, groups of parameters can be defined where variable selection is then applied to the group as a whole. Therefore, either all parameters from a certain variable group enter the model or none. If selected, the additional team-specific effects can cover effects which are constant for the respective national team across all World Cups of the training data and which are not yet covered by the covariate effects. 
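The all-or-nothing selection of a team's $(att_i, def_i)$ pair can be illustrated by the block soft-thresholding operation that underlies proximal Group Lasso algorithms. This is a generic sketch of that operation, not the estimation routine used in \citet{GroSchTut:2015}; names are illustrative.

```python
import numpy as np

def group_soft_threshold(att, defn, thresh):
    """Block soft-thresholding for the team-specific (att_i, def_i)
    pairs: each pair is either shrunk jointly or set to zero as a whole,
    mirroring the groupwise selection of the Group Lasso."""
    att, defn = np.asarray(att, float), np.asarray(defn, float)
    norms = np.sqrt(att ** 2 + defn ** 2)          # groupwise Euclidean norms
    # sqrt(2) rescales the threshold for the group size of two
    scale = np.maximum(1.0 - np.sqrt(2) * thresh / np.maximum(norms, 1e-12), 0.0)
    return att * scale, defn * scale
```

Teams whose joint effect norm falls below the threshold are removed entirely from the model, while the remaining pairs are only shrunk.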

When considering distributions for count data, an alternative to the Poisson distribution is the negative binomial distribution. In general, 
it is less restrictive than the Poisson distribution, as it drops 
the rather strict assumption that the expectation equals
the variance. For this project, we also investigated two modelling alternatives based on the assumption of negative binomially distributed responses. First, we again used the function 
\texttt{gamboost} from the \texttt{mboost} package \citep{mboost}, allowing for smooth covariate effects for all metric covariates. In contrast to the Poisson case, overdispersion is estimated in the form of an additional scale parameter. However, in our data analyses it turned out that the resulting models correspond to Poisson models because no overdispersion compared to the Poisson assumption was detected: the boosting method set the scale parameter to such a large value that 
the negative binomial model was (approximately) equivalent to the Poisson model. Second, we exploited the very flexible framework of the generalized additive model for location, scale and shape (GAMLSS; \citealp{rigby2005}), in which the second distribution parameter, i.e., the scale, can also be related to covariates. We applied a boosting approach proposed by \citet{gamboostlss:2012}, which is implemented in the \texttt{R}-package \texttt{gamboostLSS} \citep{gamboostLSS_tut} and which allows variable selection to be performed on the predictors of both mean and variance. Hence, in contrast to \texttt{gamboost}, covariates could in principle also alter the scale parameter. However, the respective parameters were never updated. Therefore, analogously to the simpler \texttt{gamboost} approach, the scale parameter turned out to be unnecessary and we again ended up with the Poisson model. Hence, the negative binomial approach is not pursued further in the remainder of this article.

\section{Heuristic comparison of model outputs}\label{sec:heuristic:comp}
There are fundamental differences between random forests and regression models with respect to their model outputs and the interpretations these models allow for. Therefore, we want to briefly elaborate on these differences in a rather general manner. As examples, we pick one random forest for the number of goals and one for the ordinal match result (both based on the \texttt{party} package) and compare them to the conventional Lasso and Gamboost approaches. Each of the other methods has strong similarities to one of these approaches and is not treated separately here. The major goal is to highlight the main differences between the outputs of random forests and regression models for football prediction. For that purpose, the four approaches are fitted to the complete data set introduced in Section~\ref{data}.


The main purpose of random forests is prediction. In contrast to regression models, they are harder to interpret because no explicit relationship between dependent and independent variables can be extracted. In particular, in contrast to the regularised regression methods, they do not perform variable selection. Nevertheless, the importance of the single variables can be measured. Typically, this is done by permuting each of the variables separately in the out-of-bag observations of each tree and measuring
the prediction accuracy. In this context, permuting a variable means that its values are randomly reassigned to the observations. If, for example, {\it Age} is permuted, the average age of the German team in 2002 could be assigned to the Brazilian team in 2010. By permuting a variable randomly, it loses whatever information it carries with respect to the response variable. One then measures the loss of prediction accuracy compared to the case where the variable is not permuted.
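The permutation scheme can be sketched as follows. The forests in this paper rely on the importance measures implemented in the \texttt{party} and \texttt{ranger} packages; this numpy stand-in with a generic prediction function and mean squared error is purely illustrative.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_rep=10):
    """Permutation importance sketch: permute one column at a time and
    measure the increase in mean squared prediction error."""
    base = np.mean((predict(X) - y) ** 2)      # error without permutation
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_rep):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # destroy the j-th signal
            losses.append(np.mean((predict(Xp) - y) ** 2))
        imp[j] = np.mean(losses) - base            # loss of accuracy
    return imp
```

Variables without predictive information yield importance values around zero (or slightly negative due to sampling noise), matching the interpretation of Figure~\ref{importance}.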

\begin{figure}[h]
	\centering
		\includegraphics[width=1.00\textwidth]{../SimCompare/importance.pdf}
	\caption{Variable importance for random forests with goals (left) and match results (right) as response variables for World Cup data from 2002 -- 2014.}
	\label{importance}
\end{figure}


Figure~\ref{importance} shows the respective values for the variable importance of each variable, separately for random forests predicting the number of goals and ordinal match outcomes.
In the case of ordinal match outcomes we average over the values from the different (permuted) data sets, see Section~\ref{rford}. It can be seen that the domains of the importance values differ strongly between {\it RF Goals} and {\it RF Result}, which is simply due to the fact that 
both models use different response types with different scales.
Besides that, the main outcomes are rather similar. The most important variables are {\it FIFA Rank}, {\it Oddset} and {\it \# CL Players}. Also, {\it GDP} and the {\it Confed} variables seem to have some explanatory power. The remaining variables show very small or even negative values with respect to their variable importance and, therefore, do not provide additional explanatory power. However, the distinction between influential and non-influential variables is rather heuristic.

In contrast to random forests, regression models can estimate explicit and interpretable relationships between the covariates and the response and, in our case of regularised regression, can explicitly discriminate between influential and non-influential variables. While the estimated relationships between the covariates and the response are strictly linear in the case of Lasso, Gamboost allows for smooth functions. Table~\ref{tab:lasso} shows all parameter estimates for the Lasso approach which are different from zero.
\begin{table}[ht]
\centering
\begin{tabular}{rrr}
  \hline
\# CL Players & Rank & Oddset \\ 
  \hline
0.0210 & -0.1773 & 0.0566 \\ 
   \hline
\end{tabular}
\caption{\label{tab:lasso}Standardized parameter estimates for Lasso on World Cup 2002--2014 data.}
\end{table}
It can be seen that the final model is rather sparse with only three selected covariates. Interestingly, these variables coincide with the three most influential variables from both forest approaches. {\it FIFA Rank} exhibits the strongest effect, followed by {\it Oddset} and {\it \# CL Players}. The {\it FIFA Rank} has a negative effect because, obviously, high values of this variable are supposed to indicate rather weak teams. 

In contrast to the simple linear model assumed for Lasso estimation, in Gamboost the effects can also be non-linear (smooth) functions. The estimated (partial) effects for all selected variables are depicted in Figure~\ref{fig:gamboost}.
\begin{figure}[!h]
	\centering
		\includegraphics[width=0.9\textwidth]{../SimCompare/gamboost.pdf}
	\caption{Partial effects for all (selected) covariates in the Gamboost approach for World Cup data from 2002 -- 2014.}
	\label{fig:gamboost}
\end{figure}
It can be seen that, with 15 variables included in the final model, the Gamboost model is clearly less sparse than the Lasso solution. However, only the variables {\it \# CL Players} and {\it Legionnaires} explicitly show non-linear effects. Here, the effect sizes are somewhat harder to determine. Nevertheless, comparing the domains of the effects again shows that {\it Oddset}, {\it FIFA Rank} and {\it \# CL Players} are (among) the most important variables.
 
Overall, it turns out that, although the results obtained by either random forests or regression models need to be interpreted fundamentally differently, a coinciding set of major influence variables can be identified for both approaches. In particular, the sportive success of national teams 
in matches of the FIFA World Cups 2002 -- 2014 is mainly determined by the {\it Oddset}, the {\it FIFA Rank} and the {\it \# CL Players}.


 
% \section{Comparison of prediction methods for football matches}
\section{Comparison of predictive performance}
\label{comparison}
In the following, we perform an in-depth comparison of the predictive power of all methods introduced in Sections~\ref{rffootball} and \ref{alternatives}. In particular, we are interested in whether the random forest approaches or the more traditional regression approaches perform better. This comparison is based on the FIFA World Cup 2002 -- 2014 data set introduced 
in Section~\ref{data}. We apply the following procedure:\vspace{-0.6cm}

{\it
\begin{enumerate}
\item Form a training data set containing three out of four World Cups.\vspace{-0.3cm}
\item Fit each of the methods to the training data.\vspace{-0.3cm}
\item Predict the left-out World Cup using each of the prediction methods.\vspace{-0.3cm}
\item Iterate steps 1--3 such that each World Cup is once the left-out one.\vspace{-0.3cm}
\item Compare predicted and real outcomes for all prediction methods.\vspace{-0.3cm}
\end{enumerate}
}

This guarantees that each match from the total data set is once part of the test data. Therefore, we get out-of-sample predictions for all matches. In step~{\it 5}, different performance measures for the quality of the predictions are investigated, separately for the prediction of the (ordinal) match outcomes and the number of goals.

\subsection{Prediction of match outcomes}
In the following, let $\tilde y_1,\ldots,\tilde y_N$ be the true ordinal match outcomes, i.e., $\tilde y_i\in\{1,2,3\}$, for all $N$ matches from the four considered World Cups. Additionally, let $\hat\pi_{1i},\hat\pi_{2i},\hat\pi_{3i},~i=1,\ldots,N$, be the predicted probabilities for the match outcomes obtained by one of the different methods presented in Sections~\ref{rffootball} and \ref{alternatives}. While these probabilities are directly available from the ordinal random forests from Section~\ref{rford}, they need to be computed in an additional step for all methods predicting the number of goals. For both the regression methods and the random forests from Section~\ref{rfgoals} we assume that the numbers of goals follow independent Poisson distributions, where the event rates $\lambda_{1i}$ and $\lambda_{2i}$ for the scores of match $i$ are estimated by the respective predicted expected values. Let $G_{1i}$ and $G_{2i}$ denote the random variables representing the number of goals scored by the two competing teams in match $i$. Then, we can compute the probabilities via  
$\hat \pi_{1i}=P(G_{1i}>G_{2i}), \hat \pi_{2i}=P(G_{1i}=G_{2i})$ and $\hat \pi_{3i}=P(G_{1i}<G_{2i})$
based on the corresponding Poisson distributions $G_{1i}\sim Po(\hat\lambda_{1i})$ and $G_{2i}\sim Po(\hat\lambda_{2i})$ with estimates $\hat\lambda_{1i}$ and $\hat\lambda_{2i}$.
Based on these predicted probabilities, three different performance measures were used to compare the predictive power of the methods. 
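The computation of the three outcome probabilities from the two Poisson rates can be sketched as follows; this is a minimal numpy implementation that truncates the Poisson distributions at an assumed maximum of 20 goals per team.

```python
import numpy as np
from math import exp, factorial

def outcome_probs(lam1, lam2, max_goals=20):
    """P(win), P(draw), P(loss) of the first team from two independent
    Poisson distributions for the scores, truncated at max_goals."""
    pmf = lambda lam: np.array(
        [exp(-lam) * lam ** k / factorial(k) for k in range(max_goals + 1)])
    joint = np.outer(pmf(lam1), pmf(lam2))   # joint[i, j] = P(G1 = i, G2 = j)
    win = np.tril(joint, -1).sum()           # G1 > G2
    draw = np.trace(joint)                   # G1 == G2
    loss = np.triu(joint, 1).sum()           # G1 < G2
    return win, draw, loss
```

For realistic event rates the truncation error is negligible and the three probabilities sum to one.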

A classical performance measure for categorical responses is the multinomial {\it likelihood}, which for a single match outcome is defined as $\hat \pi_{1i}^{\delta_{1\tilde y_i}} \hat \pi_{2i}^{\delta_{2\tilde y_i}} \hat \pi_{3i}^{\delta_{3 \tilde y_i}}$, with $\delta_{r\tilde y_i}$ denoting Kronecker's delta. It  reflects the probability of a correct prediction. Hence, a large value of the multinomial likelihood reflects a good fit.

Furthermore, to later calculate the classification rate of each method we consider whether match $i$ was correctly classified using the indicator function $\mathbb{I}(\tilde y_i=\underset{r\in\{1,2,3\}}{\mbox{arg\,max }}(\hat\pi_{ri}))$.
Again, a large value of the classification rate reflects a good fit.

\citet{Gneitingetal:2007} proposed to use the so-called {\it rank probability score} (RPS) as a performance measure which, in contrast to both measures introduced above, explicitly accounts for the ordinal structure of the responses. 
For our purpose, it can be defined as $\frac{1}{3-1} \sum\limits_{r=1}^{3-1}\left( \sum\limits_{l=1}^{r}\hat \pi_{li} - \delta_{l\tilde y_i}\right)^{2}$. As the RPS is an error measure, here a low value represents a good fit.
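The three performance measures can be computed for a single match as follows; this is an illustrative sketch using 0-based outcome coding (0 = win, 1 = draw, 2 = loss) instead of the paper's $\{1,2,3\}$.

```python
import numpy as np

def match_scores(y_true, pi):
    """Multinomial likelihood, classification indicator and rank
    probability score (RPS) for one match.
    y_true : observed outcome in {0, 1, 2}
    pi     : predicted probabilities for the three outcomes"""
    pi = np.asarray(pi, float)
    lik = pi[y_true]                          # multinomial likelihood
    correct = int(np.argmax(pi) == y_true)    # classification indicator
    obs = np.zeros(3)
    obs[y_true] = 1.0
    # RPS compares cumulative predicted and observed distributions,
    # so it respects the ordering win > draw > loss
    rps = ((np.cumsum(pi) - np.cumsum(obs))[:2] ** 2).sum() / 2.0
    return lik, correct, rps
```

A perfect prediction yields likelihood 1, a correct classification and an RPS of 0; averaging these quantities over all matches gives the summary measures reported in Table~\ref{tab:3way}.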

As a natural benchmark for these predictive performance measures the predictions based on bookmakers' odds can be considered. For this purpose, we collected the so-called ``three-way'' odds\footnote{Three-way odds consider only the match tendency with possible results \emph{victory of team 1}, \emph{draw} or \emph{defeat of team 1} and are usually fixed some days before the corresponding match takes place.} for (almost) all matches of the FIFA World Cups 2002 -- 2014. The three-way odds were obtained from the website \url{http://www.betexplorer.com/}. Unfortunately, for 6 matches from the FIFA World Cup 2006 no odds were available. Hence, the results from Table~\ref{tab:3way} are based on 250 matches only.
By taking the three quantities $\tilde \pi_{ri}=1/\mbox{odds}_{ri}, r\in\{1,2,3\}$, of a match $i$ and by normalizing with $c_i:=\sum_{r=1}^{3}\tilde \pi_{ri}$ in order to adjust for the bookmaker's margins, the odds can be directly transformed into probabilities using $\hat \pi_{ri}=\tilde \pi_{ri}/c_i$\footnote{The transformed 
probabilities only serve as an approximation, based on the assumption that the bookmaker's margins follow a discrete uniform distribution on the three possible match tendencies.}. Using these predicted probabilities $\hat \pi_{ri}$, we can evaluate the three performance measures for (ordinal) match outcomes introduced above also for the information contained in bookmakers' odds.
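The normalisation of the inverse odds can be sketched in two lines; the function name is illustrative.

```python
import numpy as np

def odds_to_probs(odds):
    """Convert three-way bookmaker odds into outcome probabilities by
    inverting the odds and normalising away the bookmaker's margin."""
    inv = 1.0 / np.asarray(odds, float)   # raw inverse odds (sum > 1)
    return inv / inv.sum()                # normalise so they sum to one
```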

Table~\ref{tab:3way} displays the results for these (ordinal) performance measures 
for all methods introduced in Section~\ref{rffootball} and \ref{alternatives} as well as for the bookmakers' odds, averaged over 250 matches from the four FIFA World Cups 2002 -- 2014. 
It turns out that in terms of the mean multinomial likelihood all forest-based methods and also the conventional Lasso achieve a fit that is close to the one obtained by the bookmakers, which here fulfill their role as a benchmark. With respect to the classification rate, the four forest-based methods clearly outperform all other approaches, in this case remarkably even the bookmakers. Again, the conventional Lasso performs best among the regression methods, 
with a performance equal to that of the bookmakers.
Finally, in terms of the RPS the four forest-based methods again perform best, yielding clearly lower values than all regression approaches. In particular, the two random forests that directly model the number of goals achieve error rates very close to those of the bookmakers, which here again serve as the benchmark. The regression methods (except for Lasso~(1se)), on the contrary, all yield very similar RPS values.

To sum up, all methods based on random forests provide very satisfactory results, which are 
either close to or even better than those obtained by the bookmakers, who serve as a natural benchmark. Altogether, the random forests that directly model the number of goals slightly outperform those for ordinal responses. Among the regression approaches, the conventional Lasso clearly performs best and overall seems to be a convincing competitor to the forest-based methods.

<<echo=FALSE, results = 'asis', message=FALSE,cache=FALSE>>=
# load(file = "../SimCompare/compare_diff.RData")
load(file = "RCode_Data_Schauberger_Groll/compare_methods.RData")
source('RCode_Data_Schauberger_Groll/Methods/help_funs.R')


load("RCode_Data_Schauberger_Groll/Data/odds_payouts_sorted.RData")
load("RCode_Data_Schauberger_Groll/Data/odds.RData")
load("RCode_Data_Schauberger_Groll/Data/wc.data.02.14.rda")    
n.rep <- 100
    
## Extract results for prediction of goals
all.quad <- apply(pred.goals, 2, loss.quad, y = rep(wc.data$Goals, each = n.rep))
all.diff <- apply(pred.goals, 2, goal.diff.quad, y = rep(wc.data$Goals,each = n.rep),n.rep = n.rep)

## Extract results for prediction of win/draw/loss probabilities
all.mult <- sapply(pred.probs, loss.mult, y = wc.data$Goals)
all.err <- unlist(lapply(pred.probs, err.class, y = wc.data$Goals))
all.err2 <- unlist(lapply(pred.probs, err.class2, y = wc.data$Goals))
all.rps <- sapply(pred.probs,rps,y=wc.data$Goals)

## Extract results for comparison of betting returns
all.bets <- matrix(unlist(lapply(pred.probs, bet.crit, goals = wc.data$Goals)), nrow = 4)


mult.b <- data.0214odds$true.odds
no.na <- !is.na(mult.b)
mult.b <- mult.b[no.na]

rps.b <- data.0214odds$rps
rps.b <- rps.b[no.na]

mult <- cbind(all.mult[no.na,],mult.b)
rps <- cbind(all.rps[no.na,], rps.b)

colnames(mult)[ncol(mult)] <- "Bookmakers"
colnames(rps)[ncol(rps)] <- "Bookmakers"

means <- apply(mult,2,mean)
mean.rps <- apply(rps,2,mean)


means_tab <- rbind(means,c(all.err2,corr.class),mean.rps)[,-9]

means_tab <- formatC(means_tab,format="f",digits = 3)
rownames(means_tab) <- c("Likelihood","Class. Rate","RPS")
colnames(means_tab) <- c("RF Goals (party)", "RF Goals (ranger)", "RF Result (party)",
                         "RF Result (ordinalForest)", "Lasso", "Lasso (1se)", "Group Lasso","Gamboost",
                         "Bookmakers")

print(xtable(t(means_tab),align="lrrr"),
  floating=FALSE,
  include.rownames = TRUE,
  include.colnames = TRUE,
  hline.after=NULL,
  add.to.row=list(pos=list(0,1,2,3,4,5,6,7,8,9),
  command=c('\\toprule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\bottomrule\n')),
  file="ordtab.tex",
  sanitize.text.function = function(x){x})
@

% \begin{table}
% \small
% \caption{\label{tab:3way}Mean values of different loss functions separately for all methods}
% \centering
% \input{ordtab}
% \end{table}


\begin{table}[!h]
\small
\caption{\label{tab:3way}Comparison of different prediction methods for ordinal outcome based on multinomial likelihood, classification rate and ranked probability score (RPS).}
\centering
\input{ordtab}
\end{table}

\subsection{Prediction of exact numbers of goals}

Besides the ordinal match outcome (\textit{win}, \textit{draw}, \textit{loss}), we are also interested in the performance of the considered methods with respect to the prediction of the exact number of goals. This is important, for example, if one wants to predict the whole tournament course or the winning probabilities for a FIFA World Cup before the start of the tournament. The reason is that in order to identify the teams that qualify for the knockout stage, one has to determine the precise final group standings. To be able to do so, the precise results of the matches in the group 
stage play a crucial role\footnote{The final group standings are determined by (1) the number of
points, (2) the goal difference and (3) the number of scored goals.
If several teams coincide with respect to all of these three criteria, a
separate chart is calculated based on the matches between the coinciding
teams only. Here, again the final standing of the teams is
determined following criteria (1)--(3). If still no distinct decision can
be taken, the decision is induced by lot.}.

<<echo=FALSE, results = 'asis', message=FALSE,cache=FALSE>>=
goals_tab <- cbind(colMeans(all.diff),colMeans(all.quad))[-9,]

goals_tab <- formatC(goals_tab,format="f",digits = 3)
colnames(goals_tab) <- c("Goal Difference","Goals")

rownames(goals_tab) <- c("RF Goals (party)", "RF Goals (ranger)", "RF Result (party)",
                         "RF Result (ordinalForest)", "Lasso", "Lasso (1se)", "Group Lasso","Gamboost")

print(xtable(goals_tab,align="lrr"),
  floating=FALSE,
  include.rownames = TRUE,
  include.colnames = TRUE,
  hline.after=NULL,
  add.to.row=list(pos=list(0,1,2,3,4,5,6,7,8),
  command=c('\\toprule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\bottomrule\n')),
  file="goaltab.tex",
  sanitize.text.function = function(x){x})
@



For this reason, we also evaluate the performance of all introduced methods
with regard to the quadratic error between the observed and predicted
number of goals for each match and each team, as well as between the observed and predicted goal difference. Let now $y_{ijk}$, for $i,j=1,\ldots,n$ and $k\in\{2002,2006,2010,2014\}$,
denote the observed number of goals scored by team $i$ against team $j$ in tournament $k$ and
$\hat y_{ijk}$ a corresponding predicted value, obtained by one of the methods from Sections~\ref{rffootball} and~\ref{alternatives}. For those methods that combine ordinal and metric random forests we do not directly obtain a fixed predicted number of goals, because the predicted number of goals depends on the drawn match outcome. Therefore, for every method we randomly simulate 100 results (0:1, 1:2, 2:2, \ldots) for each match we want to predict. Then we calculate the two quadratic errors $(y_{ijk}-\hat y_{ijk})^2$ (yielding two errors per simulated match, i.e., 200 errors per predicted match) and $\left((y_{ijk}-y_{jik})-(\hat y_{ijk}-\hat y_{jik})\right)^2$
(yielding one error per simulated match, i.e., 100 errors per predicted match) for all $N$ matches of the four FIFA World Cups 2002 -- 2014. Finally, per method we average over these errors. Note that in this case the odds provided by the bookmakers cannot be used for comparison. So in contrast to Table~\ref{tab:3way}, where six matches had to be left out due to missing bookmaker 
information, we can now calculate 
both (mean) quadratic errors based on all $N=256$ matches.
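The two quadratic error measures can be sketched for a single match as follows; the function name and the per-match averaging are illustrative, with the method-level results obtained by averaging further over all matches.

```python
import numpy as np

def quad_errors(obs, sims):
    """Mean quadratic errors for the goals of both teams and for the
    goal difference, averaged over simulated results for one match.
    obs : observed result (g1, g2)
    sims: simulated results, shape (n_sim, 2)"""
    sims = np.asarray(sims, float)
    g1, g2 = obs
    err_goals = np.mean((sims - [g1, g2]) ** 2)   # per-team goal errors
    err_diff = np.mean(((g1 - g2) - (sims[:, 0] - sims[:, 1])) ** 2)
    return err_goals, err_diff
```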

\begin{table}[!h]
\small
\caption{\label{tab:goals}Comparison of different prediction methods for the exact number of goals and the goal difference based on mean quadratic error.}
\centering
\input{goaltab}
\end{table}

Table~\ref{tab:goals} summarizes the corresponding results.
In general, the overall trend that the random forest methods outperform the regression-based approaches is confirmed. However, in contrast to the results from the previous subsection, the disparities between the forest-based methods are less evident here. The clearly best-performing method with respect to both quadratic errors is now based on a random forest for ordinal responses, namely the one implemented in the \texttt{party} package. The second-best approach is a random forest that directly models the number of goals, namely the \texttt{cforest} from the \texttt{party} package. Note, though, that both ordinal approaches use the \texttt{party} package to draw final results as described in Section~\ref{rford}. Therefore, it seems that the prediction of the number of goals works rather well with the random forests that are specifically designed for that purpose, but that they can still be slightly improved by combining them with a random forest prediction of the ordinal match outcome.
Altogether, it is not possible to clearly identify a best-performing class of random forests with respect to the two different types of responses (ordinal or metric).
The only regression approach that can compete with the forest-based methods with regard to the two quadratic errors is the Group Lasso, while the conventional Lasso performs rather badly here. 
%\vspace*{-0.1cm}

\subsection{Comparison of betting returns}\label{sec:betting}
<<echo=FALSE, results = 'asis', message=FALSE,cache=FALSE>>=

all.bets <- matrix(unlist(lapply(pred.probs, bet.crit, goals=wc.data$Goals)),nrow=4)
# colnames(all.bets) <- colnames(all.bets.cleaned)
all.bets.0 <- all.bets[3,-9,drop=FALSE]*100
rownames(all.bets.0) <- "Return"
colnames(all.bets.0) <- c("RF Goals (party)", "RF Goals (ranger)", "RF Result (party)",
                         "RF Result (ordinalForest)", "Lasso", "Lasso (1se)", "Group Lasso","Gamboost")

print(xtable(t(all.bets.0),digits=2),
  floating=FALSE,
  include.rownames = TRUE,
  include.colnames = TRUE,
  hline.after=NULL,
  add.to.row=list(pos=list(0,1,2,3,4,5,6,7,8),
  command=c('\\toprule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\midrule\n','\\bottomrule\n')),
  file="bettab.tex",
  sanitize.text.function = function(x){x})
@


Further insight into the predictive performance 
of the different fitting procedures can be obtained by analysing the success of
certain betting strategies\footnote{The betting strategy presented in the following is again based on the three-way odds from the website \url{http://www.betexplorer.com/}. For this reason, again only 250 matches can be considered, as for 6 matches no odds were available.}. For every match~$i$ and each of the three possible outcomes $r\in\{1,2,3\}$ one can calculate the expected return of a bet with a betting volume of one (arbitrary) monetary unit as $E[return_{ri}]=\hat \pi_{ri}\cdot odds_{ri}-1$. This follows from the fact that with probability $\hat\pi_{ri}$ one receives a payout of $odds_{ri}$ when betting on match outcome $r$; after subtracting the stake of one unit, we end up with the expected return.
In general, one would choose the outcome with the highest expected return and only place the bet if the expected return is positive, i.e., if $\max\limits_{r\in\{1,2,3\}}E[return_{ri}] > 0$. The corresponding returns (in \%) for all methods are shown in Table~\ref{tab:bets}.
\begin{table}[!h]
\small
\caption{\label{tab:bets} Betting returns (in $\%$) for different prediction methods.}
\centering
\input{bettab}
\end{table}
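For illustration, consider a single hypothetical match (the numbers are purely illustrative and not taken from the data) with predicted probabilities $\hat{\pi}_{1i}=0.50$, $\hat{\pi}_{2i}=0.28$, $\hat{\pi}_{3i}=0.22$ and odds $odds_{1i}=2.10$, $odds_{2i}=3.40$, $odds_{3i}=4.00$. The expected returns per unit stake are
\[
E[return_{1i}] = 0.50 \cdot 2.10 - 1 = 0.05, \quad
E[return_{2i}] = 0.28 \cdot 3.40 - 1 = -0.048, \quad
E[return_{3i}] = 0.22 \cdot 4.00 - 1 = -0.12.
\]
Since $\max_{r\in\{1,2,3\}} E[return_{ri}] = 0.05 > 0$, one would place a bet of one unit on outcome $r=1$, yielding a return of $2.10-1=1.10$ units if this outcome occurs and of $-1$ unit otherwise.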
It turns out that, altogether, the differences between the returns are relatively large across the different methods. Among the random forests, those from the \texttt{party} package show rather poor results compared to the random forests from \texttt{ordinalForest} and \texttt{ranger}, which gain the highest returns among all methods. Among the regression methods, the regular Lasso gains a return of $4\%$, while Lasso (1se) leads to a loss of $7.5\%$. However, the betting results should probably not be over-interpreted. First, only one betting company is considered\footnote{In fact, the odds provided by the website \url{http://www.betexplorer.com/} do not even correspond to a real betting company, but are average odds from several bookmakers.}. Second, the margins in the betting odds are rather high; they range between $4\%$ and $13\%$ per match (on average about $7\%$). In a real betting scenario, it is more realistic that the player can choose from a variety of companies and place the bet at the company with the most favourable odds. Third, although the results are based on four World Cups, the returns still depend strongly on single matches and cannot be seen as predictions of average betting gains for future World Cups.

% Note that for this rather simple betting strategy several modifications and extensions have been
% proposed in the literature. For example, \citet{koop:2015} use different values of the threshold $\tau>0$ and showed that this way the overall mean return could be increased. However, they use constant stake sizes (one unit) for each bet. In contrast, \citet{boshnakov2017} applied a betting strategy with varying stake sizes based on the Kelly criterion \citep{Kelly:1956}. This criterion is a strategy to determine the optimal stake for single bets in order to maximize the return considering the size of the odds and the winning probability. 




\section{Concluding remarks}
\label{conclusion}

In the present work, we compared two fundamentally different, covariate-based approaches
for the modelling and prediction of matches in international football tournaments, namely random forests and regression methods. We described the methods and, on a data set containing all matches of the FIFA World Cups 2002 -- 2014, compared the predictive performance of random forests for both ordinal and metric response types
to conventional regression methods for count data, such as Poisson~GLMs.

% Before empirically comparing the different methods we describe various possibilities how random forests can be applied for the prediction of football matches. We present two different possible response variables and elaborate on the possible problems induced by these approaches. In particular, we present a possible solution on how to handle the problem that {\it win} and {\it loss} are in principle random, depending on the ordering of the teams, but that nevertheless changing the ordering of the teams leads to different predictions.

In order to evaluate the performance of the methods, several different performance measures for both ordinal match outcomes and the precise number of goals \red{were investigated. 
For the ordinal match outcomes, all methods based on random forests provided very satisfactory results, which were close to, or even outperformed, those obtained by the bookmakers (serving as a natural benchmark).} 
Moreover, the forest-based methods outperformed the regression approaches. 
Only the conventional Lasso turned out to be a convincing competitor to the forest-based methods.
Within the forest-based methods, random  forests that directly model the number of goals slightly outperformed those based on ordinal responses.

In terms of the quadratic errors for the precise number of goals, 
the overall trend that the random forest methods outperform the regression-based approaches was confirmed. \red{However,
the disparities between the forest-based methods were less clear: while the best-performing method was based on a random forest for ordinal responses, the second-best approach was
a random forest directly modelling the number of goals.} So for the precise number of goals it was not possible to clearly identify a best-performing class of random forests, i.e., forests for either ordinal or metric responses. \red{Here, the only regression approach able to compete with the forest-based methods 
was the Group Lasso}, while the conventional Lasso performed rather badly. 

Finally, we also analysed the performance of the methods in terms of the success of
a simple betting strategy. In general, we found relatively large differences between the returns across the different methods. The highest returns among the tree-based methods were obtained by the random forest from \texttt{ordinalForest}, while again the conventional Lasso performed best among the regression approaches. 

Overall, our analyses showed that
\red{random forests generally outperform the regression-based approaches slightly with respect to a variety of prediction performance measures. Only the conventional Lasso} turned out to be a promising competitor.
Based on these findings, we plan to establish a random forest-based prediction model
for simulating the FIFA World Cup 2018 tournament, which takes place in Russia.
However, as several of the underlying covariates are based on the final squads nominated for the FIFA World Cup 2018, we need to wait for the final official squad announcements, which the national coaches must 
provide by 4 June 2018. 

\bibliography{literatureFull}




\end{document}
