https://www.zeileis.org/Achim Zeileis2021-11-05T00:51:07+01:00Research homepage of Achim Zeileis, Universität Innsbruck. <br/>Department of Statistics, Faculty of Economics and Statistics. <br/>Universitätsstr. 15, 6020 Innsbruck, Austria. <br/>Tel: +43/512/507-70403Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Jekyllhttps://www.zeileis.org/news/euro2020knockout/Updated forecasts for the UEFA Euro 2020 knockout stage2021-06-25T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/After all group stage matches at the UEFA Euro 2020 we have updated the knockout stage forecasts by re-training our hybrid random forest model on the extended data. This shows that England profits most from the realized tournament draw.<p>After all group stage matches at the UEFA Euro 2020 we have updated the knockout stage forecasts by re-training our hybrid random forest model on the extended data. This shows that England profits most from the realized tournament draw.</p> <h2 id="updates">Updates</h2> <p>After the 36 matches of the group stage were completed earlier this week, we had decided to update our <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a>. As the <a href="https://www.zeileis.org/news/euro2020group/">evaluation of the group stage</a> showed that, by and large, the forecasts worked reasonably well up to this point, we kept our general strategy and just made a few updates:</p> <ul> <li>The <em>historic match abilities</em> for all teams were updated to incorporate the results from the 36 additional matches from the group stage. Given that the estimates are weighted such that the most recent results have a higher influence, this changed the estimates of the team abilities somewhat.</li> <li>The <em>average plus-minus player ratings</em> for all teams were also updated but these changed to a lesser degree given that each team only played three additional matches.</li> <li>All other covariates (bookmaker consensus, market value, etc.) were left unchanged.</li> <li>The learning data set for the hybrid random forest that combines all the predictors was extended: In addition to all the matches from the UEFA Euro 2004-2016 it now includes the group stage results from this year’s Euro.</li> <li>The resulting predicted number of goals for each team can then be used to simulate the entire knockout stage 100,000 times.</li> </ul> <p>While all the changes above have a certain influence, the biggest effect arguably comes from the last item: Because the match-ups for the round of 16 are fixed now, there is a lot less variation in the potential courses of the tournament. Specifically, it is now clear that there are more top favorites in the upper half of the tournament tableau (namely France, Spain, Italy, Belgium, Portugal) than in the lower half of the tableau (England, Germany, Netherlands). In the following it is shown in more detail what the consequences of this are.</p> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The updated results show that now England became the top favorite for the title with a winning probability of 17.4% because they are more likely to face weaker opponents provided they beat Germany in the round of 16. Our top favorite from the pre-tournament forecast was France and they rank now second with an almost unchanged winning probability of about 15.0%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>Somewhat surprisingly, Italy still has a rather low winning probability of only 7.3% whereas they are now among the top three teams according to most bookmaker odds. This is most likely due to the tournament draw: If they beat Austria in the round of 16, they meet either the FIFA top-ranked team Belgium or defending champion Portugal in the quarter final. In a potential semi-final they would have a high chance of facing either France or Spain.</p> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match. Using these, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em> in normal time. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The resulting probability that one team beats the other in a knockout match is depicted in the heatmap below. The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match results after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, we are able to simulate the entire knockout stage 100,000 times to provide “survival” probabilities for each team across the remaining stages. Teams in the upper half of the tournament tableau are shown in orange while the lower half teams are shown in blue.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <p>This shows that England has relatively low chances of surviving the round of 16 - at least compared to other top teams like France, Italy, or Netherlands who play against weaker opponents. However, provided England proceeds to the quarter final, they have a really high probability of prevailing up to the final match.</p> <p>In summary, the updates compared to the pre-tournament forecast changed but maybe not as much as expected. The most important change in information is that the remaining course of the tournament is rather clear now while the knowledge from the 36 group stage matches themselves has only moderate effects. Thus, the most exciting part of the UEFA Euro 2020 is only starting now and we can all be curious what is going to happen. Everything is still possible! (Recall that in the 2016 tournament Portugal eventually took the championship despite not winning a single group stage match and ranking third in their group.)</p>2021-06-25T00:00:00+02:00https://www.zeileis.org/news/euro2020group/Evaluation of the UEFA Euro 2020 group stage forecast2021-06-24T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/A look back on the group stage of the UEFA Euro 2020 to check whether our hybrid machine learning forecasts based were any good...<p>A look back on the group stage of the UEFA Euro 2020 to check whether our hybrid machine learning forecasts based were any good...</p> <h2 id="how-surprising-was-the-group-stage">How surprising was the group stage?</h2> <p>Yesterday the group stage of the UEFA Euro 2020 was concluded with the final matches in Groups E and F so that all pairings for the round of 16 are fixed now. Therefore, today we want to do address two questions regarding our own <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a> based on a hybrid machine learning model that we have published prior to the tournament:</p> <ol> <li>How good were the predictions for the group stage? Were the actual outcomes surprising?</li> <li>How can we update the forecasts for the knockout stage starting with the round of 16 on the weekend?</li> </ol> <p>The first of these questions is answered in this post while the second question will be deferred to tomorrow’s post.</p> <p><strong>TL;DR</strong> All of our predictions worked quite well and most results were within the expected range of random variation. All tournament favorites proceeded to the round of 16 and mostly the weakest teams dropped out of the tournament. Only in Group E the final ranking was a bit more surprising with Spain ending up second behind Sweden and Poland finishing last and dropping out. At the individual match level there were a couple of games where the clearly stronger team failed to take the win, especially Hungary’s two draws in the “killer group” F were a bit surprising. But other than that the more exciting part of the tournament ist still ahead of us!</p> <h2 id="group-stage-results">Group stage results</h2> <p>First, we look at the results in terms of which teams successfully proceeded from the group stage to the round of 16. The barplot below shows all teams along with their predicted winning probability for the entire tournament, with the color highlighting elimination from the tournament prior to the knockout stage.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/barplot.png" alt="Probabilities to win the tournament with highlighting of teams advancing to the knockout stage" /></p> <p>Clearly, only teams from the lower half were eliminated with the most unexpected drop-out being Poland. Also, it may seem somewhat surprising that both the Czech Republic and Ukraine “survived” the group stage but with four out of six third-ranked teams advancing to the round of 16 this is not very unexpected.</p> <p>Looking at the rankings in each group in a bit more detail we see that most group results are as expected. Only in Group E the ranking is really a surprise with Sweden playing stronger than expected and even winning the group. On the other hand, Poland’s performance was somewhat disappointing (as already mentioned above) and Spain waited until the third game (a 5-0 win against Slovakia) to show their full potential.</p> <div class="row"> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">A <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ITA</strong> <br /> <strong>WAL</strong> <br /> <strong>SUI</strong> <br /> TUR</td> <td style="text-align: right"><strong>88.8</strong> <br /> <strong>53.7</strong> <br /> <strong>72.3</strong> <br /> 53.3</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">B <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>BEL</strong> <br /> <strong>DEN</strong> <br /> FIN <br /> RUS</td> <td style="text-align: right"><strong>91.5</strong> <br /> <strong>84.5</strong> <br /> 37.1 <br /> 52.0</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">C <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>NED</strong> <br /> <strong>AUT</strong> <br /> <strong>UKR</strong> <br /> MKD</td> <td style="text-align: right"><strong>93.4</strong> <br /> <strong>80.9</strong> <br /> <strong>57.4</strong> <br /> 32.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">D <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ENG</strong> <br /> <strong>CRO</strong> <br /> <strong>CZE</strong> <br /> SCO</td> <td style="text-align: right"><strong>94.6</strong> <br /> <strong>78.0</strong> <br /> <strong>40.8</strong> <br /> 49.8</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">E <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>SWE</strong> <br /> <strong>ESP</strong> <br /> SVK <br /> POL</td> <td style="text-align: right"><strong>59.8</strong> <br /> <strong>94.0</strong> <br /> 44.9 <br /> 66.2</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">F <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>FRA</strong> <br /> <strong>GER</strong> <br /> <strong>POR</strong> <br /> HUN</td> <td style="text-align: right"><strong>89.7</strong> <br /> <strong>85.3</strong> <br /> <strong>85.3</strong> <br /> 13.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> </div> <div class="t20 small-6 medium-3 large-2 columns"> </div> </div> <h2 id="match-results">Match results</h2> <p>After seeing that all the favorites prevailed and only relatively weak teams dropped out of the tournament, we take a closer look at the 36 individual group-stage matches to check whether we had any major surprises. The stacked bar plot below groups all match results into four categories by their expected goal difference for the stronger vs. the weaker team.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/match.png" alt="Observed match outcome vs. expected goal difference" /></p> <p>In the first bar the stronger team was expected to be only marginally better, with 0 to 0.25 more predicted goals on average. In this bar we see that the stronger team won half of the matches (4 out of 8) while the other half was either lost (3 matches) or ended in a draw (1 match). In short, the distribution of match outcomes conforms essentially exactly with the predictions.</p> <p>The same is true for the second and third bar where the expected goal difference for the stronger team was between 0.26 and 0.6 or between 0.6 and 1, respectively. The stronger team won in 70.0% out of ten and 77.7% out of nine matches, respectively, thus conforming closely with the predictions.</p> <p>Only in the last bar with the highest expected goal differences (between 1 and 2 goals), the picture is somewhat unexpected.</p> <ol> <li>There were three draws (out of nine matches), two of which by underdog Hungary against the much stronger teams France and Germany. Ultimately, Hungary nevertheless finished last in Group F.</li> <li>One of these nine matches was even lost by the clear favorite but this match was the 0-1 of Denmark vs. Finland. During this match, Danish key player Christian Eriksen suffered a cardiac arrest and had to be reanimated in the stadium before being brought to the hospital. Denmark then had to continue playing the match later that evening and were clearly still under shock. Needless to say that no forecasting model (that we are aware of) would incorporate such extreme and rare events.</li> </ol> <p>As a final evaluation we check whether the observed number of goals per team in each match conforms with the expected distribution based on the Poisson model employed. This is brought out graphically by a so-called <a href="https://dx.doi.org/10.1080/00031305.2016.1173590">hanging rootogram</a>.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/goals.png" alt="Hanging rootogram with observed and expected frequencies of number of goals" /></p> <p>The red line shows the square root of the expected frequencies while the “hanging” gray bars represent the square root of the observed frequencies. This shows that the predictions conform closely with the actual observations. There were only a few more occurrences of three goals (ten times) than expected (6.1 times) but this deviation is also within the bounds of random variation.</p>2021-06-24T00:00:00+02:00https://www.zeileis.org/news/euro2020paper/Working paper for the UEFA Euro 2020 forecast2021-06-10T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/A working paper describing the data and methods used for our probabilistic UEFA Euro 2020 forecast, published earlier this week, is available now. Additionally, details on the predicted performance of all teams during the group stage are provided.<p>A working paper describing the data and methods used for our probabilistic UEFA Euro 2020 forecast, published earlier this week, is available now. Additionally, details on the predicted performance of all teams during the group stage are provided.</p> <h2 id="overview">Overview</h2> <p>Earlier this week we had published our <a href="https://www.zeileis.org/news/euro2020/">probabilistic UEFA Euro 2020 forecast</a> that combines the expertise of football modelers from four different research teams with the flexibility of machine learning. To explain which data and methods were used exactly, we have also written a <a href="https://arxiv.org/abs/2106.05799">working paper</a>, now published in the <a href="https://arxiv.org/">arXiv.org</a> e-Print archive.</p> <p>Moreover, we take the opportunity and provide further insights that can be obtained from our forecast for the results of the group stage, that starts at the end of this week with the opening match between Italy and Turkey in Rome in Group A. More precisely, predicted probabilities for a <em>win</em>, <em>draw</em>, or <em>loss</em> in each of the 36 group stage matches are provided in interactive heatmaps for all groups.</p> <h2 id="working-paper">Working paper</h2> <p><em>Citation:</em><br /> Groll A, Hvattum LM, Ley C, Popp F, Schauberger G, Van Eetvelde H, Zeileis A (2021). “Hybrid Machine Learning Forecasts for the UEFA EURO 2020.” arXiv:2106.05799, arXiv.org e-Print archive. <a href="https://arxiv.org/abs/2106.05799">https://arxiv.org/abs/2106.05799</a></p> <p><em>Abstract:</em><br /> Three state-of-the-art statistical ranking methods for forecasting football matches are combined with several other predictors in a hybrid machine learning model. Namely an ability estimate for every team based on historic matches; an ability estimate for every team based on bookmaker consensus; average plus-minus player ratings based on their individual performances in their home clubs and national teams; and further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The proposed combined approach is used for learning the number of goals scored in the matches from the four previous UEFA EUROs 2004-2016 and then applied to current information to forecast the upcoming UEFA EURO 2020. Based on the resulting estimates, the tournament is simulated repeatedly and winning probabilities are obtained for all teams. A random forest model favors the current World Champion France with a winning probability of 14.8% before England (13.5%) and Spain (12.3%). Additionally, we provide survival probabilities for all teams and at all tournament stages.</p> <h2 id="predicted-match-probabilities-for-the-group-stage">Predicted match probabilities for the group stage</h2> <p>Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match in the group stage. As there are typically more goals in the group stage compared to the knockout stage, a different expected number of goals is fitted for the two stages by including a corresponding binary dummy variable in the regression model. While the heatmap shown in our previous blog post contained the probabilities for all possible matches in the knockout stage, we complement this information here by showing different heatmaps for all groups.</p> <p>The color scheme visualizes the winning probability of the team in the row over the team in the column. Light red or orange vs. dark green or blue signals low vs. high winning probabilities. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss.</p> <p>Interactive full-width graphics: <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.html">Group A</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.html">Group B</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.html">Group C</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.html">Group D</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.html">Group E</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.html">Group F</a>.</p> <table> <thead> <tr> <th style="text-align: center">Group A</th> <th style="text-align: center">Group B</th> <th style="text-align: center">Group C</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.png" alt="Heatmap: Match probabilities for Group A" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.png" alt="Heatmap: Match probabilities for Group B" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.png" alt="Heatmap: Match probabilities for Group C" /></a></td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align: center">Group D</th> <th style="text-align: center">Group E</th> <th style="text-align: center">Group F</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.png" alt="Heatmap: Match probabilities for Group D" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.png" alt="Heatmap: Match probabilities for Group E" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.png" alt="Heatmap: Match probabilities for Group F" /></a></td> </tr> </tbody> </table>2021-06-10T00:00:00+02:00https://www.zeileis.org/news/euro2020/Hybrid machine learning forecasts for the UEFA Euro 20202021-06-07T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Probabilistic forecasts for the UEFA Euro 2020 are obtained by using a hybrid model that combines data from four advanced statistical models through random forests. The favorite is France, followed by England and Spain.<p>Probabilistic forecasts for the UEFA Euro 2020 are obtained by using a hybrid model that combines data from four advanced statistical models through random forests. The favorite is France, followed by England and Spain.</p> <div class="row t20 b20"> <div class="small-8 medium-9 large-10 columns"> The UEFA Euro 2020 will finally take place across Europe from 11 June to 11 July 2021 (after a year of delay due to the Covid-19 pandemic). 24 of the best European teams compete to determine the new European Champion. Football fans worldwide are curious what the most likely outcome of the tournament is. Hence, we employ a machine learning approach yielding probabilistic forecasts for all possible matches which can then be used to explore the likely course of the tournament by simulation. </div> <div class="small-4 medium-3 large-2 columns"> <a href="https://www.uefa.com/uefaeuro-2020/" alt="UEFA Euro 2020 web page"><img src="https://upload.wikimedia.org/wikipedia/en/9/96/UEFA_Euro_2020_Logo.svg" alt="UEFA Euro 2020 logo" /></a> </div> </div> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The forecast is based on a conditional inference random forest learner that combines four main sources of information: An ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 19 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The random forest model is learned using the UEFA Euro tournaments from 2004 to 2016 as training data and then applied to current information to obtain a forecast for the UEFA Euro 2020. The random forest forecasts actually provide the predicted number of goals for each team in all possible matches in the tournament so that a bivariate Poisson distribution can be used to compute the probabilities for a <em>win</em>, <em>draw</em>, or <em>loss</em> in such a match. Based on these match probabilities the entire tournament can be simulated 100,000 times yielding winning probabilities for each team. The results show that the current World Champion France is also the favorite for the European title with a winning probability of 14.8%, followed by England with 13.5%, and Spain with 12.3%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>The full study has been conducted by an international team of researchers: <a href="https://www.statistik.tu-dortmund.de/groll.html">Andreas Groll</a>, <a href="https://home.himolde.no/hvattum/">Lars Magnus Hvattum</a>, <a href="https://users.ugent.be/~chley/">Christophe Ley</a>, <a href="https://www.xing.com/profile/Franziska_Popp20">Franziska Popp</a>, <a href="https://www.sg.tum.de/epidemiologie/team/schauberger/">Gunther Schauberger</a>, <a href="https://biblio.ugent.be/person/2C617710-F0EE-11E1-A9DE-61C894A0A6B4">Hans Van Eetvelde</a>, <a href="https://www.zeileis.org/">Achim Zeileis</a>. The corresponding working paper will be published on arXiv in the next couple of days. The core of the contribution is a hybrid approach that starts out from four state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to best combine these forecasts.</p> <ul> <li> <p><em>Historic match abilities:</em><br /> An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A <em>bivariate Poisson model</em> with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain <em>average</em> team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of <em>current</em> team abilities. More details can be found in <a href="https://doi.org/10.1177%2F1471082X18817650">Ley, Van de Wiele, Van Eetvelde (2019)</a>.</p> </li> <li> <p><em>Bookmaker consensus abilities:</em><br /> Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 19 international bookmakers that reflect their expert expectations for the tournament. Using the <em>bookmaker consensus model</em> of <a href="https://dx.doi.org/10.1016/j.ijforecast.2009.10.001">Leitner, Zeileis, Hornik (2010)</a>, the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to these winning probabilities.</p> </li> <li> <p><em>Average player ratings:</em><br /> To infer the contributions of individual players in a match, the <em>plus-minus player ratings</em> of <a href="https://doi.org/10.2478/ijcss-2019-0001">Hvattum (2019)</a> dissect all matches with a certain player (both on club and on national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.</p> </li> <li> <p><em>Hybrid random forests:</em><br /> Finally, machine learning is used to combine these three highly aggregated and informative variables above along with a broad range of further relevant covariates, yielding refined probabilistic forecasts for each match. Such a hybrid approach was first suggested by <a href="https://arXiv.org/abs/1806.03208">Groll, Ley, Schauberger, Van Eetvelde (2019)</a>. The task the random forest learner has to accomplish is to combine the three highly-informative team variables above with further team-specific information that may or may not be relevant to the team’s performance. The covariates considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specifc socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.</p> </li> </ul> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em>. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a <em>win</em>, <em>draw</em>, or <em>loss</em> after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, it is also straightfoward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <h2 id="odds-and-ends">Odds and ends</h2> <p>All our forecasts are probabilistic, clearly below 100%, and thus by no means certain. Especially the results in group F are hard to predict but may play a crucial role for the tournament. The reason is that this group comprises three very strong teams with current World Champion France, defending European Champion Portugal, and Germany which generally has an excellent record at international tournaments. Moreover, the runner-up in this group will play against the winner from group D with favorite England. Hence, it is likely that this will lead to a very tough knockout match in the round of 16, possibly even between the two top favorites France and England, but it is hard to predict the exact pair of teams that will face each other in this match.</p> <p>Another interesting observation is that the winning probability for Belgium is only moderately high with 8.3%. This is notable as Belgium currently leads the FIFA/Coca-Cola World Ranking and is also judged to have a much higher winning probability by the bookmaker consensus model with 12.1%.</p> <p>In any case, all of this means that even when we can quantify in terms of probabilities what is likely to happen during the UEFA Euro 2020, it is far from being predetermined. Hence, we can all look forward to finally watching this exciting tournament and hope it will bring a little bit of the joy that we have been missing over this difficult last year.</p>2021-06-07T00:00:00+02:00https://www.zeileis.org/news/ivreg/ivreg: Two-stage least-squares regression with diagnostics2021-05-31T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/The ivreg function for instrumental variables regression had first been introduced in the AER package but is now developed and extended in its own package of the same name. This post provides a short overview and illustration.<p>The ivreg function for instrumental variables regression had first been introduced in the AER package but is now developed and extended in its own package of the same name. This post provides a short overview and illustration.</p> <h2 id="package-overview">Package overview</h2> <p>The <strong>ivreg</strong> package (by <a href="https://socialsciences.mcmaster.ca/jfox/">John Fox</a>, <a href="https://wwz.unibas.ch/en/kleiber/">Christian Kleiber</a>, and <a href="https://www.zeileis.org">Achim Zeileis</a>) provides a comprehensive implementation of instrumental variables regression using two-stage least-squares (2SLS) estimation. The standard regression functionality (parameter estimation, inference, robust covariances, predictions, etc.) is derived from and supersedes the <code class="language-plaintext highlighter-rouge">ivreg()</code> function in the <a href="https://CRAN.R-project.org/package=AER"><strong>AER</strong></a> package. Additionally, various regression diagnostics are supported, including hat values, deletion diagnostics such as studentized residuals and Cook’s distances; graphical diagnostics such as component-plus-residual plots and added-variable plots; and effect plots with partial residuals.</p> <p>An overview of the package along with vignettes and detailed documentation etc. is available on its web site at <a href="https://john-d-fox.github.io/ivreg/">https://john-d-fox.github.io/ivreg/</a>. This post is an abbreviated version of the “Getting started” vignette.</p> <p>The <strong>ivreg</strong> package integrates seamlessly with other packages by providing suitable S3 methods, specifically for generic functions in the <a href="https://www.R-project.org/">base-R</a> <strong>stats</strong> package, and in the <a href="https://CRAN.R-project.org/package=car"><strong>car</strong></a>, <a href="https://CRAN.R-project.org/package=effects"><strong>effects</strong></a>, <a href="https://CRAN.R-project.org/package=lmtest"><strong>lmtest</strong></a>, and <a href="https://CRAN.R-project.org/package=sandwich"><strong>sandwich</strong></a> packages, among others. Moreover, it cooperates well with other object-oriented packages for regression modeling such as <a href="https://CRAN.R-project.org/package=broom"><strong>broom</strong></a> and <a href="https://CRAN.R-project.org/package=modelsummary"><strong>modelsummary</strong></a>.</p> <h2 id="illustration-returns-to-schooling">Illustration: Returns to schooling</h2> <p>For demonstrating the <strong>ivreg</strong> package in practice, we investigate the effect of schooling on earnings in a classical model for wage determination. The data are from the United States, and are provided in the package as <code class="language-plaintext highlighter-rouge">SchoolingReturns</code>. This data set was originally studied by David Card, and was subsequently employed, as here, to illustrate 2SLS estimation in introductory econometrics textbooks. The relevant variables for this illustration are:</p> <pre><code class="language-{r}">data("SchoolingReturns", package = "ivreg") summary(SchoolingReturns[, 1:8]) ## wage education experience ethnicity smsa ## Min. : 100.0 Min. : 1.00 Min. : 0.000 other:2307 no : 864 ## 1st Qu.: 394.2 1st Qu.:12.00 1st Qu.: 6.000 afam : 703 yes:2146 ## Median : 537.5 Median :13.00 Median : 8.000 ## Mean : 577.3 Mean :13.26 Mean : 8.856 ## 3rd Qu.: 708.8 3rd Qu.:16.00 3rd Qu.:11.000 ## Max. :2404.0 Max. :18.00 Max. :23.000 ## south age nearcollege ## no :1795 Min. :24.00 no : 957 ## yes:1215 1st Qu.:25.00 yes:2053 ## Median :28.00 ## Mean :28.12 ## 3rd Qu.:31.00 ## Max. :34.00 </code></pre> <p>A standard wage equation uses a semi-logarithmic linear regression for <code class="language-plaintext highlighter-rouge">wage</code>, estimated by ordinary least squares (OLS), with years of <code class="language-plaintext highlighter-rouge">education</code> as the primary explanatory variable, adjusting for a quadratic term in labor-market <code class="language-plaintext highlighter-rouge">experience</code>, as well as for factors coding <code class="language-plaintext highlighter-rouge">ethnicity</code>, residence in a city (<code class="language-plaintext highlighter-rouge">smsa</code>), and residence in the U.S. <code class="language-plaintext highlighter-rouge">south</code>:</p> <pre><code class="language-{r}">m_ols <- lm(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south, data = SchoolingReturns) summary(m_ols) ## Call: ## lm(formula = log(wage) ~ education + poly(experience, 2) + ethnicity + ## smsa + south, data = SchoolingReturns) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.59297 -0.22315 0.01893 0.24223 1.33190 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.259820 0.048871 107.626 < 2e-16 *** ## education 0.074009 0.003505 21.113 < 2e-16 *** ## poly(experience, 2)1 8.931699 0.494804 18.051 < 2e-16 *** ## poly(experience, 2)2 -2.642043 0.374739 -7.050 2.21e-12 *** ## ethnicityafam -0.189632 0.017627 -10.758 < 2e-16 *** ## smsayes 0.161423 0.015573 10.365 < 2e-16 *** ## southyes -0.124862 0.015118 -8.259 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3742 on 3003 degrees of freedom ## Multiple R-squared: 0.2905, Adjusted R-squared: 0.2891 ## F-statistic: 204.9 on 6 and 3003 DF, p-value: < 2.2e-16 </code></pre> <p>Thus, OLS estimation yields an estimate of 7.4% per year for returns to schooling. This estimate is problematic, however, because it can be argued that <code class="language-plaintext highlighter-rouge">education</code> is endogenous (and hence also <code class="language-plaintext highlighter-rouge">experience</code>, which is taken to be <code class="language-plaintext highlighter-rouge">age</code> minus <code class="language-plaintext highlighter-rouge">education</code> minus 6). We therefore use geographical proximity to a college when growing up as an exogenous instrument for <code class="language-plaintext highlighter-rouge">education</code>. Additionally, <code class="language-plaintext highlighter-rouge">age</code> is the natural exogenous instrument for <code class="language-plaintext highlighter-rouge">experience</code>, while the remaining explanatory variables can be considered exogenous and are thus used as instruments for themselves. Although it’s a useful strategy to select an effective instrument or instruments for each endogenous explanatory variable, in 2SLS regression all of the instrumental variables are used to estimate all of the regression coefficients in the model.</p> <p>To fit this model with <code class="language-plaintext highlighter-rouge">ivreg()</code> we can simply extend the formula from <code class="language-plaintext highlighter-rouge">lm()</code> above, adding a second part after the <code class="language-plaintext highlighter-rouge">|</code> separator to specify the instrumental variables:</p> <pre><code class="language-{r}">library("ivreg") m_iv <- ivreg(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south | nearcollege + poly(age, 2) + ethnicity + smsa + south, data = SchoolingReturns) </code></pre> <p>Equivalently, the same model can also be specified slightly more concisely using three parts on the right-hand side indicating the exogenous variables, the endogenous variables, and the additional instrumental variables only (in addition to the exogenous variables).</p> <pre><code class="language-{r}">m_iv <- ivreg(log(wage) ~ ethnicity + smsa + south | education + poly(experience, 2) | nearcollege + poly(age, 2), data = SchoolingReturns) </code></pre> <p>Both models yield the following results:</p> <pre><code class="language-{r}">summary(m_iv) ## Call: ## ivreg(formula = log(wage) ~ education + poly(experience, 2) + ## ethnicity + smsa + south | nearcollege + poly(age, 2) + ethnicity + ## smsa + south, data = SchoolingReturns) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.82400 -0.25248 0.02286 0.26349 1.31561 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.48522 0.67538 6.641 3.68e-11 *** ## education 0.13295 0.05138 2.588 0.009712 ** ## poly(experience, 2)1 9.14172 0.56350 16.223 < 2e-16 *** ## poly(experience, 2)2 -0.93810 1.58024 -0.594 0.552797 ## ethnicityafam -0.10314 0.07737 -1.333 0.182624 ## smsayes 0.10798 0.04974 2.171 0.030010 * ## southyes -0.09818 0.02876 -3.413 0.000651 *** ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments (education) 3 3003 8.008 2.58e-05 *** ## Weak instruments (poly(experience, 2)1) 3 3003 1612.707 < 2e-16 *** ## Weak instruments (poly(experience, 2)2) 3 3003 174.166 < 2e-16 *** ## Wu-Hausman 2 3001 0.841 0.432 ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4032 on 3003 degrees of freedom ## Multiple R-Squared: 0.1764, Adjusted R-squared: 0.1747 ## Wald test: 148.1 on 6 and 3003 DF, p-value: < 2.2e-16 </code></pre> <p>Thus, using two-stage least squares to estimate the regression yields a much larger coefficient for the returns to schooling, namely 13.3% per year. Notice as well that the standard errors of the coefficients are larger for 2SLS estimation than for OLS, and that, partly as a consequence, evidence for the effects of <code class="language-plaintext highlighter-rouge">ethnicity</code> and the quadratic component of <code class="language-plaintext highlighter-rouge">experience</code> is now weak. These differences are brought out more clearly when showing coefficients and standard errors side by side, e.g., using the <code class="language-plaintext highlighter-rouge">compareCoefs()</code> function from the <strong>car</strong> package or the <code class="language-plaintext highlighter-rouge">msummary()</code> function from the <strong>modelsummary</strong> package:</p> <pre><code class="language-{r}">library("modelsummary") m_list <- list(OLS = m_ols, IV = m_iv) msummary(m_list) </code></pre> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: right">OLS</th> <th style="text-align: right">IV</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">(Intercept)</td> <td style="text-align: right">5.260 <br />(0.049)</td> <td style="text-align: right">4.485 <br />(0.675)</td> </tr> <tr> <td style="text-align: left">education</td> <td style="text-align: right">0.074 <br />(0.004)</td> <td style="text-align: right">0.133 <br />(0.051)</td> </tr> <tr> <td style="text-align: left">poly(experience, 2)1</td> <td style="text-align: right">8.932 <br />(0.495)</td> <td style="text-align: right">9.142 <br />(0.564)</td> </tr> <tr> <td style="text-align: left">poly(experience, 2)2</td> <td style="text-align: right">-2.642 <br />(0.375)</td> <td style="text-align: right">-0.938 <br />(1.580)</td> </tr> <tr> <td style="text-align: left">ethnicityafam</td> <td style="text-align: right">-0.190 <br />(0.018)</td> <td style="text-align: right">-0.103 <br />(0.077)</td> </tr> <tr> <td style="text-align: left">smsayes</td> <td style="text-align: right">0.161 <br />(0.016)</td> <td style="text-align: right">0.108 <br />(0.050)</td> </tr> <tr> <td style="text-align: left">southyes</td> <td style="text-align: right">-0.125 <br />(0.015)</td> <td style="text-align: right">-0.098 <br />(0.029)</td> </tr> <tr> <td style="text-align: left">Num.Obs.</td> <td style="text-align: right">3010</td> <td style="text-align: right">3010</td> </tr> <tr> <td style="text-align: left">R2</td> <td style="text-align: right">0.291</td> <td style="text-align: right">0.176</td> </tr> <tr> <td style="text-align: left">R2 Adj.</td> <td style="text-align: right">0.289</td> <td style="text-align: right">0.175</td> </tr> <tr> <td style="text-align: left">AIC</td> <td style="text-align: right">2633.4</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">BIC</td> <td style="text-align: right">2681.5</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">Log.Lik.</td> <td style="text-align: right">-1308.702</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">F</td> <td style="text-align: right">204.932</td> <td style="text-align: right"> </td> </tr> </tbody> </table> <p>The change in coefficients and associated standard errors can also be brought out graphically using the <code class="language-plaintext highlighter-rouge">modelplot()</code> function from <strong>modelsummary</strong> which shows the coefficient estimates along with their 95% confidence intervals. Below we omit the intercept and experience terms as these are on a different scale than the other coefficients.</p> <pre><code class="language-{r}">modelplot(m_list, coef_omit = "Intercept|experience") </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2021-05-31-ivreg/modelplot.png"><img src="https://www.zeileis.org/assets/posts/2021-05-31-ivreg/modelplot.png" alt="Model plot of coefficients and confidence intervals" /></a></p>2021-05-31T00:00:00+02:00https://www.zeileis.org/news/networktree100/Network trees: networktree 1.0.0, web page, and Psychometrika paper2021-02-04T00:00:00+01:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Version 1.0.0 (and actually 1.0.1) of the R package 'networktree' with tools for recursively partitioning covariance structures is now available from CRAN, accompanied by a paper in Psychometrika, and a dedicated software web page.<p>Version 1.0.0 (and actually 1.0.1) of the R package 'networktree' with tools for recursively partitioning covariance structures is now available from CRAN, accompanied by a paper in Psychometrika, and a dedicated software web page.</p> <h2 id="psychometrika-paper">Psychometrika paper</h2> <ul> <li><em>Citation:</em> Jones PJ, Mair P, Simon T, Zeileis A (2020). “Network Trees: A Method for Recursively Partitioning Covariance Structures.” <em>Psychometrika</em>, <strong>85</strong>(4), 926-945. <a href="https://doi.org/10.1007/s11336-020-09731-4">doi:10.1007/s11336-020-09731-4</a>.</li> <li><em>Preprint version:</em> <a href="https://www.zeileis.org/papers/Jones+Mair+Simon-2020.pdf">https://www.zeileis.org/papers/Jones+Mair+Simon-2020.pdf</a></li> <li><em>OSF replication materials:</em> <a href="https://osf.io/ykq2a/">https://osf.io/ykq2a/</a></li> </ul> <h2 id="abstract">Abstract</h2> <p>In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. In this paper, we propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the structure of the covariance or correlation matrix. Psychometric networks or other correlation-based models (e.g., factor models) can be subsequently estimated from the resultant splits. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.</p> <h2 id="software--web-page">Software & web page</h2> <p>All methods discussed are implemented in the R package <code class="language-plaintext highlighter-rouge">networktree</code> that is developed on GitHub and stable versions are released on CRAN (Comprehensive R Archive Network). Version 1.0.0 accompanies the publications in Psychometrika and version 1.0.1 adds a few small enhancements and bug fixes, specifically for the plotting infrastructure. Furthermore, a nice web page with introductory examples, documentation, release notes, etc. has been produced with the wonderful <code class="language-plaintext highlighter-rouge">pkgdown</code>.</p> <ul> <li><em>CRAN release:</em> <a href="https://CRAN.R-project.org/package=networktree">https://CRAN.R-project.org/package=networktree</a></li> <li><em>Web page:</em> <a href="https://paytonjjones.github.io/networktree/">https://paytonjjones.github.io/networktree/</a></li> </ul> <h2 id="illustration">Illustration</h2> <p>The idea of psychometric networks is to provide information about the statistical relationships between observed variables. Network trees aim to reveal heterogeneities in these relationships based on observed covariates. This strategy is implemented in the R package <code class="language-plaintext highlighter-rouge">networktree</code> building on the general tree algorithms in the <code class="language-plaintext highlighter-rouge">partykit</code> package.</p> <p>For illustration, we consider a depression network - where the nodes represent different symptoms - and detect heterogeneities with respect to age and race. The data used below is provided by <a href="https://openpsychometrics.org/">https://openpsychometrics.org/</a> and was obtained using the Depression Anxiety and Stress Scale (DASS), a self-report instrument for measuring depression, anxiety, and tension or stress. It is available in the <code class="language-plaintext highlighter-rouge">networktree</code> package as <code class="language-plaintext highlighter-rouge">dass</code>. To make resulting graphics and summaries easier to interpret we use the following variable names for the depression symptoms that are measured with certain questions from the DASS:</p> <ul> <li><code class="language-plaintext highlighter-rouge">anhedonia</code> (Question 3: I couldn’t seem to experience any positive feeling at all.)</li> <li><code class="language-plaintext highlighter-rouge">initiative</code> (Question 42: I found it difficult to work up the initiative to do things.)</li> <li><code class="language-plaintext highlighter-rouge">lookforward</code> (Question 10: I felt that I had nothing to look forward to.)</li> <li><code class="language-plaintext highlighter-rouge">sad</code> (Question 13: I felt sad and depressed.)</li> <li><code class="language-plaintext highlighter-rouge">unenthused</code> (Question 31: I was unable to become enthusiastic about anything.)</li> <li><code class="language-plaintext highlighter-rouge">worthless</code> (Question 17: I felt I wasn’t worth much as a person.)</li> <li><code class="language-plaintext highlighter-rouge">meaningless</code> (Question 38: I felt that life was meaningless.)</li> </ul> <p>First, we load the data and relabel the variables for the depression symptoms:</p> <pre><code class="language-{r}">library("networktree") data("dass", package = "networktree") names(dass)[c(3, 42, 10, 13, 31, 17, 38)] <- c("anhedonia", "initiative", "lookforward", "sad", "unenthused", "worthless", "meaningless") </code></pre> <p>Subsequently, we fit a <code class="language-plaintext highlighter-rouge">networktree()</code> where the relationship between the symptoms (<code class="language-plaintext highlighter-rouge">anhedonia + initiative + lookforward + sad + unenthused + worthless + meaningless</code>) is “explained by” (<code class="language-plaintext highlighter-rouge">~</code>) the covariates (<code class="language-plaintext highlighter-rouge">age + race</code>). (As an alternative to this formula-based interface it is also possible to specify groups of dependent and split variables, respectively, through separate data frames.) The threshold for detecting significant differences in correlations is set to 1% (plus Bonferroni adjustment for testing two covariates at each step).</p> <pre><code class="language-{r}">tr <- networktree(anhedonia + initiative + lookforward + sad + unenthused + worthless + meaningless ~ age + race, data = dass, alpha = 0.01) </code></pre> <p>The resulting network tree can be easily visualized with <code class="language-plaintext highlighter-rouge">plot(tr)</code> which would display the raw correlations. As these are generally high between all depression symptoms we use a display with partial correlations (<code class="language-plaintext highlighter-rouge">transform = "pcor"</code>) instead. This brings out differences between the detected subgroups somewhat more clearly. <em>(Note that version 1.0.1 of networktree is needed for this to work correctly.)</em></p> <pre><code class="language-{r}">plot(tr, transform = "pcor") </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2021-02-04-networktree100/dasstree.png"><img src="https://www.zeileis.org/assets/posts/2021-02-04-networktree100/dasstree.png" alt="Depression network tree" /></a></p> <p>This shows that the network tree detects three subgroups. First, the correlations of the depression symptoms change across <code class="language-plaintext highlighter-rouge">age</code> - with the largest difference between “younger” and “older” persons in the sample at a split point of 30 years. Second, the correlations differ with respect to race for the older persons in the sample - with the largest difference between Arab/Black/Native American/White and Asian/Other. The differences in the symptom correlations affect various pairs of symptoms as brought out in the network display produced by the <a href="https://CRAN.R-project.org/package=qgraph">qgraph</a> package in the terminal nodes. For example, the “centrality” of <code class="language-plaintext highlighter-rouge">anhedonia</code> changes across the three detected subgroups: For the older Asian/Other persons it is partially correlated with most other symptoms while this is less pronounced for the other two subgroups.</p> <p>The networks visualized in the tree can also be extracted easily using the <code class="language-plaintext highlighter-rouge">getnetwork()</code> function. For example, the partial correlation matrix corresponding to the older Asian/Other group (node 5) can be obtained by:</p> <pre><code class="language-{r}">getnetwork(tr, id = 5, transform = "pcor") </code></pre> <p>To explore the returned object <code class="language-plaintext highlighter-rouge">tr</code> in some more detail, the <code class="language-plaintext highlighter-rouge">print()</code> method gives a printed version of the tree structure but does not display the associated parameters.</p> <pre><code class="language-{r}">tr ## Network tree object ## ## Model formula: ## anhedonia + initiative + lookforward + sad + unenthused + worthless + ## meaningless ~ age + race ## ## Fitted party: ## [1] root ## | [2] age <= 30 ## | [3] age > 30 ## | | [4] race in Arab, Black, Native American, White ## | | [5] race in Asian, Other ## ## Number of inner nodes: 2 ## Number of terminal nodes: 3 ## Number of parameters per node: 21 ## Objective function: 42301.84 </code></pre> <p>The estimated correlation parameters in the subgroups can be extracted with <code class="language-plaintext highlighter-rouge">coef(tr)</code>, here returning a 3 x 21 matrix for the 21 pairs of symptom correlations and the 3 subgroups. To show two symptom pairs with larger correlation differences we extract the correlations of <code class="language-plaintext highlighter-rouge">anhedonia</code> with <code class="language-plaintext highlighter-rouge">worthless</code> and <code class="language-plaintext highlighter-rouge">meaningless</code>, respectively. Note that these are the raw correlations and not the partial correlations displayed in the tree above.</p> <pre><code class="language-{r}">coef(tr)[, 5:6] ## rho_anhedonia_worthless rho_anhedonia_meaningless ## 2 0.5595725 0.5994682 ## 4 0.6741686 0.6339481 ## 5 0.6639088 0.7178744 </code></pre> <p>Finally, we extract the p-values of the underlying parameter instability tests to gain some insights how the tree was constructed. In each step the stability we assess whether the correlation parameters are stable across each of the two covariates <code class="language-plaintext highlighter-rouge">age</code> and <code class="language-plaintext highlighter-rouge">race</code> or whether there are significant changes. The corresponding test statistics and Bonferroni-adjusted p-values can be extracted with the <code class="language-plaintext highlighter-rouge">sctest()</code> function (for “structural change test”). For example, in Node 1 there are significant instabilities with respect to both variables but <code class="language-plaintext highlighter-rouge">age</code> has the lower p-value and is hence selected for partitioning the data:</p> <pre><code class="language-{r}">library("strucchange") sctest(tr, node = 1) ## age race ## statistic 7.151935e+01 1.781216e+02 ## p.value 1.787983e-05 3.108049e-03 </code></pre> <p>In Node 3 only <code class="language-plaintext highlighter-rouge">race</code> is significant and hence used for splitting:</p> <pre><code class="language-{r}">sctest(tr, node = 3) ## age race ## statistic 42.9352852 1.728898e+02 ## p.value 0.1447818 6.766197e-05 </code></pre> <p>And in Node 5 neither variable is significant and hence the splitting stops:</p> <pre><code class="language-{r}">sctest(tr, node = 5) ## age race ## statistic 35.1919522 22.09555 ## p.value 0.5514142 0.63279 </code></pre> <p>For more details regarding the method and the software see the Psychometrika paper and the software web page, respectively.</p>2021-02-04T00:00:00+01:00https://www.zeileis.org/news/biasreduction/Bias reduction in Poisson and Tobit regression2021-01-20T00:00:00+01:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/While it is well-known that data separation can cause infinite estimates in binary regression models, the same is also true for other models with a point mass at the bounday (typically at zero) such as Poisson and Tobit. It is shown why this happens and how it can be remedied with bias-reduced estimation, along with implementations in R.<p>While it is well-known that data separation can cause infinite estimates in binary regression models, the same is also true for other models with a point mass at the bounday (typically at zero) such as Poisson and Tobit. It is shown why this happens and how it can be remedied with bias-reduced estimation, along with implementations in R.</p> <h2 id="citation">Citation</h2> <p>Köll S, Kosmidis I, Kleiber C, Zeileis A (2021). <em>“Bias Reduction as a Remedy to the Consequences of Infinite Estimates in Poisson and Tobit Regression”</em>, arXiv:2101.07141, arXiv.org E-Print Archive. <a href="https://arXiv.org/abs/2101.07141">https://arXiv.org/abs/2101.07141</a></p> <h2 id="abstract">Abstract</h2> <p>Data separation is a well-studied phenomenon that can cause problems in the estimation and inference from binary response models. Complete or quasi-complete separation occurs when there is a combination of regressors in the model whose value can perfectly predict one or both outcomes. In such cases, and such cases only, the maximum likelihood estimates and the corresponding standard errors are infinite. It is less widely known that the same can happen in further microeconometric models. One of the few works in the area is Santos Silva and Tenreyro (2010) who note that the finiteness of the maximum likelihood estimates in Poisson regression depends on the data configuration and propose a strategy to detect and overcome the consequences of data separation. However, their approach can lead to notable bias on the parameter estimates when the regressors are correlated. We illustrate how bias-reducing adjustments to the maximum likelihood score equations can overcome the consequences of separation in Poisson and Tobit regression models.</p> <h2 id="software">Software</h2> <p>R package <code class="language-plaintext highlighter-rouge">brglm2</code> from CRAN: <a href="https://CRAN.R-project.org/package=brglm2">https://CRAN.R-project.org/package=brglm2</a></p> <p>R package <code class="language-plaintext highlighter-rouge">brtobit</code> from R-Forge: <a href="https://R-Forge.R-project.org/R/?group_id=2305">https://R-Forge.R-project.org/R/?group_id=2305</a></p> <h2 id="illustration">Illustration</h2> <p>The simplest but arguably often-encountered occurrence of data separation in practice is when there is a binary regressor such that the response y = 0 (or another boundary value) whenever the regressor is 1. If P(y = 0) is monotonically increasing in the linear predictor of the model, then the coefficient of the binary regressor will diverge to minus infinity in order to push P(y = 0) in this subgroup as close to 1 as possible.</p> <p>To illustrate this phenomenon in R for both Poisson and Tobit regression we employ a simple data-generating process: In addition to the intercept we generate a continuous regressor x<sub>2</sub> uniformly distributed on [-1, 1] and a binary regressor x<sub>3</sub>. The latter comes from a Bernoulli distribution with probability 0.25 if x<sub>2</sub> is positive and with probability 0.75 otherwise. Thus, x<sub>2</sub> and x<sub>3</sub> are correlated.</p> <p>The linear predictor employed for both Poisson and Tobit is: 1 + x<sub>2</sub> - 10 x<sub>3</sub>, where the extreme coefficient of -10 assures that there is almost certainly data separation. In the full paper linked above we also consider less extreme scenarios where separation may or may not occur. The Poisson response is then drawn from a Poisson distribution using a log link between mean and linear predictor. The Tobit response is drawn from a normal distribution censored at zero with identity link and constant variance of 2. Here, we draw two samples with 100 observations from both models:</p> <pre><code class="language-{r}">dgp <- function(n = 100, coef = c(1, 1, -10, 2), prob = 0.25, dist = "poisson") { x2 <- runif(n, -1, 1) x3 <- rbinom(n, size = 1, prob = ifelse(x2 > 0, prob, 1 - prob)) y <- switch(match.arg(tolower(dist), c("poisson", "tobit")), "poisson" = rpois(n, exp(coef[1] + coef[2] * x2 + coef[3] * x3)), "tobit" = rnorm(n, mean = coef[1] + coef[2] * x2 + coef[3] * x3, sd = sqrt(coef[4])) ) y[y <= 0] <- 0 data.frame(y, x2, x3) } set.seed(2020-10-29) d1 <- dgp(dist = "poisson") set.seed(2020-10-29) d2 <- dgp(dist = "tobit") </code></pre> <p>Both of these data sets exhibit quasi-complete separation of y with respect to x<sub>3</sub>, i.e., y is always 0 if x<sub>3</sub> is 1.</p> <pre><code class="language-{r}">xtabs(~ x3 + factor(y == 0), data = d1) ## factor(y == 0) ## x3 FALSE TRUE ## 0 47 8 ## 1 0 45 </code></pre> <p>We then compare four different modeling approaches in this situation:</p> <ul> <li>ML: Standard maximum likelihood estimation.</li> <li>BR: Bias-reduced estimation based on adjusted score equations, first suggested by David Firth and later refined by David Firth and Ioannis Kosmidis.</li> <li>ML/sub: ML estimation on the subset not affected by separation, i.e., omitting both the regressor and all observations affected.</li> <li>ML/SST: ML estimation omitting only the regressor affected by the separation but keeping all observations. This strategy is recommended more commonly (compared to ML/sub) in the literature, specifically for Poisson in a paper by Santos Silva and Tenreyro (2010, <em>Economics Letters</em>).</li> </ul> <p>For Poisson regression, all these models can be fitted with the standard <code class="language-plaintext highlighter-rouge">glm()</code> function in R. To obtain the BR estimate <code class="language-plaintext highlighter-rouge">method = "brglmFit"</code> can be plugged in using the <code class="language-plaintext highlighter-rouge">brglm2</code> package (by Ioannis Kosmidis).</p> <pre><code class="language-{r}">install.packages("brglm2") library("brglm2") m12_ml <- glm(y ~ x2 + x3, data = d1, family = poisson) m12_br <- update(m12_ml, method = "brglmFit") m1_all <- glm(y ~ x2, data = d1, family = poisson) m1_sub <- update(m1_all, subset = x3 == 0) m1 <- list("ML" = m12_ml, "BR" = m12_br, "ML/sub" = m1_sub, "ML/SST" = m1_all) </code></pre> <p>This yields the following results (shown with the wonderful <a href="https://vincentarelbundock.github.io/modelsummary/">modelsummary</a> package):</p> <pre><code class="language-{r}">library("modelsummary") msummary(m1) </code></pre> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: right">ML</th> <th style="text-align: right">BR</th> <th style="text-align: right">ML/sub</th> <th style="text-align: right">ML/SST</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">(Intercept)</td> <td style="text-align: right">0.951</td> <td style="text-align: right">0.958</td> <td style="text-align: right">0.951</td> <td style="text-align: right">0.350</td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(0.100)</td> <td style="text-align: right">(0.099)</td> <td style="text-align: right">(0.100)</td> <td style="text-align: right">(0.096)</td> </tr> <tr> <td style="text-align: left">x2</td> <td style="text-align: right">1.011</td> <td style="text-align: right">1.006</td> <td style="text-align: right">1.011</td> <td style="text-align: right">1.662</td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(0.158)</td> <td style="text-align: right">(0.157)</td> <td style="text-align: right">(0.158)</td> <td style="text-align: right">(0.144)</td> </tr> <tr> <td style="text-align: left">x3</td> <td style="text-align: right">-20.907</td> <td style="text-align: right">-5.174</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(2242.463)</td> <td style="text-align: right">(1.416)</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">Num.Obs.</td> <td style="text-align: right">100</td> <td style="text-align: right">100</td> <td style="text-align: right">55</td> <td style="text-align: right">100</td> </tr> <tr> <td style="text-align: left">Log.Lik.</td> <td style="text-align: right">-107.364</td> <td style="text-align: right">-107.869</td> <td style="text-align: right">-107.364</td> <td style="text-align: right">-169.028</td> </tr> </tbody> </table> <p>The following remarks can be made:</p> <ul> <li>Standard ML estimation using all observations leads to a large estimate of x<sub>3</sub> with even larger standard error. As a result, a standard Wald test results in no evidence against the hypothesis that x<sub>3</sub> should not be in the model, despite the fact that using x<sub>3</sub> = -10 when generating the data makes x<sub>3</sub> perhaps the most influential regressor.</li> <li>The ML/sub strategy, i.e., estimating the model without x<sub>3</sub> only for non-separated observations (with x<sub>3</sub> = 0), yields exactly the same estimates as ML.</li> <li>Compared to ML and ML/sub, BR has the advantage of returning a finite estimate and standard error for x<sub>3</sub> . Hence a Wald test can be directly used to examine the evidence against x<sub>3</sub> = 0. The other parameter estimates and the log-likelihood are close to ML. BR slightly shrinks the parameter estimates of x<sub>2</sub> and x<sub>3</sub> towards zero.</li> <li>Finally, the estimates from ML/SST, where regressor x<sub>3</sub> is omitted and all observations are used, appear to be far from the values we used to generate the data. This is due to the fact that x<sub>3</sub> is not only highly informative but also correlated with x<sub>2</sub>.</li> </ul> <p>Moreover, in more extensive simulation experiments in the paper it is shown that the BR estimates are always finite, and result in Wald-type intervals with better coverage probabilities.</p> <p>Analogous results an be obtained for Tobit regression with our <code class="language-plaintext highlighter-rouge">brtobit</code> package, currently available from R-Forge. This provides both ML and BR estimation for homoscedastic tobit models. (Some tools are re-used from our <code class="language-plaintext highlighter-rouge">crch</code> package that implements various estimation techniques , albeit not BR, for Tobit models with conditional heteroscedasticity.) Below we fit the same four models as in the Poisson case above.</p> <pre><code class="language-{r}">install.packages("brtobit", repos = "http://R-Forge.R-project.org") library("brtobit") m22_ml <- brtobit(y ~ x2 + x3, data = d2, type = "ML", fsmaxit = 28) m22_br <- brtobit(y ~ x2 + x3, data = d2, type = "BR") m2_all <- brtobit(y ~ x2, data = d2, type = "ML") m2_sub <- update(m2_all, subset = x3 == 0) m2 <- list("ML" = m22_ml, "BR" = m22_br, "ML/sub" = m2_sub, "ML/SST" = m2_all) </code></pre> <p>Because <code class="language-plaintext highlighter-rouge">brtobit</code> does not yet provide a direct interface for <code class="language-plaintext highlighter-rouge">modelsummary</code> (via <code class="language-plaintext highlighter-rouge">broom</code>) we go through the <code class="language-plaintext highlighter-rouge">coeftest()</code> results as an intermediate step. These can then be rendered by <code class="language-plaintext highlighter-rouge">modelsummary</code>:</p> <pre><code class="language-{r}">library("lmtest") m2 <- lapply(m2, coeftest) msummary(m2) </code></pre> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: right">ML</th> <th style="text-align: right">BR</th> <th style="text-align: right">ML/sub</th> <th style="text-align: right">ML/SST</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">(Intercept)</td> <td style="text-align: right">1.135</td> <td style="text-align: right">1.142</td> <td style="text-align: right">1.135</td> <td style="text-align: right">-0.125</td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(0.208)</td> <td style="text-align: right">(0.210)</td> <td style="text-align: right">(0.208)</td> <td style="text-align: right">(0.251)</td> </tr> <tr> <td style="text-align: left">x2</td> <td style="text-align: right">0.719</td> <td style="text-align: right">0.705</td> <td style="text-align: right">0.719</td> <td style="text-align: right">2.074</td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(0.364)</td> <td style="text-align: right">(0.359)</td> <td style="text-align: right">(0.364)</td> <td style="text-align: right">(0.404)</td> </tr> <tr> <td style="text-align: left">x3</td> <td style="text-align: right">-11.238</td> <td style="text-align: right">-4.218</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(60452.270)</td> <td style="text-align: right">(0.891)</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">(Variance)</td> <td style="text-align: right">1.912</td> <td style="text-align: right">1.970</td> <td style="text-align: right">1.912</td> <td style="text-align: right">3.440</td> </tr> <tr> <td style="text-align: left"> </td> <td style="text-align: right">(0.422)</td> <td style="text-align: right">(0.434)</td> <td style="text-align: right">(0.422)</td> <td style="text-align: right">(0.795)</td> </tr> <tr> <td style="text-align: left">Num.Obs.</td> <td style="text-align: right">100</td> <td style="text-align: right">100</td> <td style="text-align: right">55</td> <td style="text-align: right">100</td> </tr> <tr> <td style="text-align: left">Log.Lik.</td> <td style="text-align: right">-87.633</td> <td style="text-align: right">-88.101</td> <td style="text-align: right">-87.633</td> <td style="text-align: right">-118.935</td> </tr> </tbody> </table> <p>The results show exactly the same pattern as for the Poisson regression above: ML, BR, and ML/sub yield results close to the true coefficients for intercept, x<sub>2</sub>, and the variance while the ML/SST estimates are far from the true values. For x<sub>3</sub> only the BR estimates are finite while the ML estimates diverge towards minus infinity. Actually, the estimates would have diverged even more if we hadn’t stopped the Fisher scoring early (via <code class="language-plaintext highlighter-rouge">fsmaxit = 28</code> instead of the default <code class="language-plaintext highlighter-rouge">100</code>).</p> <p>Overall this clearly indicates that bias-reduced (BR) estimation is a convenient way to avoid infinite estimates and standard errors in these models and to enable standard inference even when data separation occurs. In contrast the common recommendation to omit the regressor associated with the separation should be avoided or applied to the non-separated subset of observations only. Otherwise it can give misleading results when regressors are correlated.</p>2021-01-20T00:00:00+01:00https://www.zeileis.org/news/colorspace200/Tools for colors and palettes: colorspace 2.0-0, web page, and JSS paper2020-12-02T00:00:00+01:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Version 2.0-0 of the R package 'colorspace' with tools for manipulating and assessing colors and palettes is now available from CRAN, accompanied by an updated web page, and a paper in the Journal of Statistical Software (JSS).<p>Version 2.0-0 of the R package 'colorspace' with tools for manipulating and assessing colors and palettes is now available from CRAN, accompanied by an updated web page, and a paper in the Journal of Statistical Software (JSS).</p> <h2 id="overview">Overview</h2> <p><a href="https://colorspace.R-Forge.R-project.org/"><img src="http://colorspace.R-Forge.R-project.org/logo_wide.png" alt="colorspace logo" /></a></p> <p>The R package colorspace provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (hue-chroma-luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space. Using the HCL color model, general strategies for three types of palettes are implemented: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes. To aid selection and application of these palettes, the package also contains scales for use with ggplot2, shiny and tcltk apps for interactive exploration, visualizations of palette properties, accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies.</p> <h2 id="jss-paper">JSS paper</h2> <p>Zeileis A, Fisher JC, Hornik K, Ihaka R, McWhite CD, Murrell P, Stauffer R, Wilke CO (2020). “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” <em>Journal of Statistical Software</em>, <strong>96</strong>(1), 1-49. <a href="https://doi.org/10.18637/jss.v096.i01">doi:10.18637/jss.v096.i01</a>.</p> <h2 id="cran-release-of-version-20-0">CRAN release of version 2.0-0</h2> <p>The release of <a href="https://CRAN.R-project.org/package=colorspace">version 2.0-0 on CRAN</a> (Comprehensive R Archive Network) concludes more than a decade of development and substantial updates since the release of version 1.0-0. The JSS paper above gives a detailed overview of the package’s features. The full list of changes over the different release is provided in the package’s <a href="https://colorspace.R-Forge.R-project.org/news/">NEWS</a>.</p> <h2 id="web-page">Web page</h2> <p>Even more details and links along with the full software manual are available on the package web page on R-Forge at <a href="https://colorspace.R-Forge.R-project.org/">https://colorspace.R-Forge.R-project.org/</a> (produced with <code class="language-plaintext highlighter-rouge">pkgdown</code>).</p>2020-12-02T00:00:00+01:00https://www.zeileis.org/news/sandwich300/Robust covariance matrix estimation: sandwich 3.0-0, web page, JSS paper2020-10-08T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Version 3.0-0 of the R package 'sandwich' for robust covariance matrix estimation (HC, HAC, clustered, panel, and bootstrap) is now available from CRAN, accompanied by a new web page and a paper in the Journal of Statistical Software (JSS).<p>Version 3.0-0 of the R package 'sandwich' for robust covariance matrix estimation (HC, HAC, clustered, panel, and bootstrap) is now available from CRAN, accompanied by a new web page and a paper in the Journal of Statistical Software (JSS).</p> <h2 id="cran-release-of-version-30-0">CRAN release of version 3.0-0</h2> <p>The sandwich package provides model-robust covariance matrix estimators for cross-sectional, time series, clustered, panel, and longitudinal data. The implementation is modular due to an object-oriented design with support for many model objects, including: <code class="language-plaintext highlighter-rouge">lm</code>, <code class="language-plaintext highlighter-rouge">glm</code>, <code class="language-plaintext highlighter-rouge">survreg</code>, <code class="language-plaintext highlighter-rouge">coxph</code>, <code class="language-plaintext highlighter-rouge">mlogit</code>, <code class="language-plaintext highlighter-rouge">polr</code>, <code class="language-plaintext highlighter-rouge">hurdle</code>, <code class="language-plaintext highlighter-rouge">zeroinfl</code>, and beyond.</p> <p>The release of <a href="https://CRAN.R-project.org/package=sandwich">version 3.0-0 on CRAN</a> (Comprehensive R Archive Network) completes the substantial updates and improvements started in the 2.4-x and 2.5-x releases: especially clustered, panel, and bootstrap covariances. In addition to the new <a href="https://pkgdown.r-lib.org/">pkgdown</a> web page and paper in the <a href="https://www.jstatsoft.org/">Journal of Statistical Software</a> (JSS), described below, the new release includes some smaller improvements in: some equations in the vignettes (suggested by Bettina Grün and Yves Croissant), the kernel weights function <code class="language-plaintext highlighter-rouge">kweights()</code> (suggested by Christoph Hanck), in the formula handling (suggested by David Hugh-Jones), in the <code class="language-plaintext highlighter-rouge">bread()</code> method for weighted <code class="language-plaintext highlighter-rouge">mlm</code> objects (suggested by James Pustejovsky). The full list of changes can be seen in the package’s <a href="https://sandwich.R-Forge.R-project.org/news/">NEWS</a>.</p> <h2 id="new-web-page">New web page</h2> <p>The package comes with a dedicated <code class="language-plaintext highlighter-rouge">pkgdown</code> website on R-Forge now: <a href="https://sandwich.R-Forge.R-project.org/">https://sandwich.R-Forge.R-project.org/</a>. This includes a nice logo, kindly provided by Reto Stauffer.</p> <p><a href="https://sandwich.R-Forge.R-project.org/"><img src="https://www.zeileis.org/assets/posts/2020-10-08-sandwich300/sandwich.png" alt="New sandwich logo" /></a></p> <p>The web page essentially uses the previous content of the package (documentation, vignettes, NEWS) but also adds a nice overview of the package to help new users to <a href="https://sandwich.R-Forge.R-project.org/articles/sandwich.html">“Get started”</a>.</p> <h2 id="jss-paper">JSS paper</h2> <p><em>Citation:</em><br /> Zeileis A, Köll S, Graham N (2020). “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” <em>Journal of Statistical Software</em>, <strong>95</strong>(1), 1-36. <a href="https://doi.org/10.18637/jss.v095.i01">doi:10.18637/jss.v095.i01</a>.</p> <p><em>Abstract:</em><br /> Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, and other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to ``the’’ clustered standard errors, there is a surprisingly wide variety of clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g., for zero-inflated, censored, or limited responses).</p> <p>In R, functions for covariances in clustered or panel models have been somewhat scattered or available only for certain modeling functions, notably the (generalized) linear regression model. In contrast, an object-oriented approach to “robust” covariance matrix estimation - applicable beyond <code class="language-plaintext highlighter-rouge">lm()</code> and <code class="language-plaintext highlighter-rouge">glm()</code> - is available in the <em>sandwich</em> package but has been limited to the case of cross-section or time series data. Starting with <em>sandwich</em> 2.4.0, this shortcoming has been corrected: Based on methods for two generic functions (<code class="language-plaintext highlighter-rouge">estfun()</code> and <code class="language-plaintext highlighter-rouge">bread()</code>), clustered and panel covariances are provided in <code class="language-plaintext highlighter-rouge">vcovCL()</code>, <code class="language-plaintext highlighter-rouge">vcovPL()</code>, and <code class="language-plaintext highlighter-rouge">vcovPC()</code>. Moreover, clustered bootstrap covariances are provided in <code class="language-plaintext highlighter-rouge">vcovBS()</code>, using model <code class="language-plaintext highlighter-rouge">update()</code> on bootstrap samples. These are directly applicable to models from packages including <em>MASS</em>, <em>pscl</em>, <em>countreg</em>, and <em>betareg</em>, among many others. Some empirical illustrations are provided as well as an assessment of the methods’ performance in a simulation study.</p>2020-10-08T00:00:00+02:00https://www.zeileis.org/news/lavaantree/Structural equation model trees with partykit and lavaan2020-09-07T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/To capture heterogeneity in structural equation models (SEMs), the model-based recursive partitioning (MOB) algorithm from partykit can be coupled with SEM estimation from lavaan.<p>To capture heterogeneity in structural equation models (SEMs), the model-based recursive partitioning (MOB) algorithm from partykit can be coupled with SEM estimation from lavaan.</p> <h2 id="motivation">Motivation</h2> <p>Structural equation models (SEMs) are a popular class of models, especially in the social sciences, to model correlations and dependencies in multivariate data, often involving latent variables. To account for individual heterogeneities in the SEM parameters sometimes finite-mixture models are used, in particular when there are no covariates available to explain the source of the heterogeneity. More recently, starting from the work of Brandmaier <em>et al.</em> (2013, <em>Psychological Methods</em>, <a href="https://doi.org/10.1037/a0030001">doi:10.1037/a0030001</a>) tree-based modeling of SEMs has also been receiving increasing interest in the literature. Based on available covariates SEM trees can capture the heterogeneity by recursively partitioning the data into subgroups. Brandmaier <em>et al.</em> also provide an R implementation for their algorithm in their <a href="https://brandmaier.de/semtree/"><em>semtree</em> package</a> available from <a href="https://CRAN.R-project.org/package=semtree">CRAN</a>.</p> <p>Their original SEM tree algorithm relied on selecting the variables for recursive partitioning based on likelihood ratio tests along with somewhat ad hoc adjustments. Recently, the group around Brandmaier proposed to use score-based tests instead that account more formally for selecting the maximal statistic across a range of possible split points (see Arnold <em>et al.</em> 2020, PsyArXiv Preprints, <a href="https://doi.org/10.31234/osf.io/65bxv">doi:10.31234/osf.io/65bxv</a>). They show that this not only improves the accuracy of the method but can also greatly alleviate the computational burden.</p> <p>The score-based tests draw on the work started by us in Merkle & Zeileis (2013, <em>Psychometrika</em>, <a href="https://doi.org/10.1007/s11336-012-9302-4">doi:10.1007/s11336-012-9302-4</a>) which in fact had already long been available in a general model-based tree algorithm (called MOB for short), proposed by us in Zeileis <em>et al.</em> (2008, <em>Journal of Computational and Graphical Statistics</em>, <a href="https:://doi.org/10.1198/106186008X319331">doi:10.1198/106186008X319331</a>) and available in the <a href="https://CRAN.R-project.org/package=partykit">R package <em>partykit</em></a> (and <a href="https://CRAN.R-project.org/package=party"><em>party</em></a> before that).</p> <p>In this blog post I show how the general <code class="language-plaintext highlighter-rouge">mob()</code> function from <em>partykit</em> can be easily coupled with the <a href="https://CRAN.R-project.org/package=lavaan"><em>lavaan</em> package</a> (Rosseel 2012, <em>Journal of Statistical Software</em>, <a href="https://doi.org/10.18637/jss.v048.i02">doi:10.18637/jss.v048.i02</a>) as an alternative approach to fitting SEM trees.</p> <h2 id="mob-model-based-recursive-partitioning">MOB: Model-based recursive partitioning</h2> <p>MOB is a very broad tree algorithm that can capture subgroups in general parametric models (e.g., probability distributions, regression models, measurement models, etc.). While it can be applied to M-type estimators in general, it is probably easiest to outline the algorithm for maximum likelihood models. The algorithm assumes that there is some data of interest along with a suitable model that can fit the data, at least locally in subgroups. And additionally there are further covariates that can be used for splitting the data to find these subgroups. It proceeds in the following steps.</p> <ol> <li>Estimate the model parameters by maximum likelihood for the observations in the current subsample.</li> <li>Test for associations (or instabilities) of the corresponding model scores and each of the covariates available for splitting.</li> <li>Split the sample along the covariate with the strongest association or instability. Choose the breakpoint with the highest improvement in the log-likelihood.</li> <li>Repeat steps 1-3 recursively in the subsamples until these become too small or there is no significant association/instability (or some other stopping criterion is reached).</li> <li><em>Optionally:</em> Reduce size of the tree by pruning branches of splits that do not improve the model fit sufficiently (e.g., based on information criteria).</li> </ol> <p>The <code class="language-plaintext highlighter-rouge">mob()</code> function in <em>partykit</em> implements this general algorithm and allows to plug in different model-fitting functions, provided they allow to extract the estimated parameters, the maximized log-likelihood, and the corresponding matrix of score (or gradient) contributions for each observation. The details are described in a vignette within the package: <a href="https://CRAN.R-project.org/web/packages/partykit/vignettes/mob.pdf">Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R</a>.</p> <h2 id="a-sem-mobster-using-lavaan">A SEM mobster using <em>lavaan</em></h2> <p>As the <em>lavaan</em> package readily provides the quantities that MOB needs as input we can easily set up a “mobster” function for SEMs. The <code class="language-plaintext highlighter-rouge">lavaan_fit()</code> function below takes a <em>lavaan</em> <code class="language-plaintext highlighter-rouge">model</code> definition and returns the actual fitting function with the interface as required by <code class="language-plaintext highlighter-rouge">mob()</code>:</p> <pre><code class="language-{r}">lavaan_fit <- function(model) { function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ..., estfun = FALSE, object = FALSE) { sem <- lavaan::lavaan(model = model, data = y, start = start) list( coefficients = stats4::coef(sem), objfun = -as.numeric(stats4::logLik(sem)), estfun = if(estfun) sandwich::estfun(sem) else NULL, object = if(object) sem else NULL ) } } </code></pre> <p>The fitting function just calls <code class="language-plaintext highlighter-rouge">lavaan()</code> using the <code class="language-plaintext highlighter-rouge">model</code>, the data <code class="language-plaintext highlighter-rouge">y</code>, and optionally the <code class="language-plaintext highlighter-rouge">start</code>-ing values, ignoring other arguments that <code class="language-plaintext highlighter-rouge">mob()</code> could handle. It then extracts the parameters <code class="language-plaintext highlighter-rouge">coef()</code>, the log-likelihood <code class="language-plaintext highlighter-rouge">logLik()</code>, and the score matrix <code class="language-plaintext highlighter-rouge">estfun()</code> using the generic functions from the corresponding packages and returns them in a list.</p> <h2 id="illustration-linear-growth-curve-model-tree">Illustration: Linear growth curve model tree</h2> <p>To illustrate fitting SEM trees with <em>partykit</em> and <em>lavaan</em>, we consider the example from the <a href="https://brandmaier.de/semtree/user-guide/using-lavaan-with-semtree/">Using <em>lavaan</em> with <em>semtree</em></a> tutorial provided by Brandmaier <em>et al.</em>. It is a linear growth curve model for data measured at five time points: <code class="language-plaintext highlighter-rouge">X1</code>, <code class="language-plaintext highlighter-rouge">X2</code>, <code class="language-plaintext highlighter-rouge">X3</code>, <code class="language-plaintext highlighter-rouge">X4</code>, and <code class="language-plaintext highlighter-rouge">X5</code>. The main parameters of interest are the intercept and the slope of the growth curves while accounting for random variations and correlations among the involved variables according to this SEM. In <em>lavaan</em> notation:</p> <pre><code class="language-{r}">growth_curve_model <- ' inter =~ 1*X1 + 1*X2 + 1*X3 + 1*X4 + 1*X5; slope =~ 0*X1 + 1*X2 + 2*X3 + 3*X4 + 4*X5; inter ~~ vari*inter; inter ~ meani*1; slope ~~ vars*slope; slope ~ means*1; inter ~~ cov*slope; X1 ~~ residual*X1; X1 ~ 0*1; X2 ~~ residual*X2; X2 ~ 0*1; X3 ~~ residual*X3; X3 ~ 0*1; X4 ~~ residual*X4; X4 ~ 0*1; X5 ~~ residual*X5; X5 ~ 0*1; ' </code></pre> <p>The model can also be visualized using the following graphic taken from the tutorial:</p> <p><a href="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/lgcm.png"><img src="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/lgcm.png" alt="Linear growth curve model" /></a></p> <p>In addition to the measurements at the five time points, the data set <a href="https://brandmaier.de/semtree/wp-content/uploads/downloads/2012/07/example1.txt">example1.txt</a> provides three covariates (<code class="language-plaintext highlighter-rouge">agegroup</code>, <code class="language-plaintext highlighter-rouge">training</code>, and <code class="language-plaintext highlighter-rouge">noise</code>) that can be used to capture individual difference in the model parameters. The data can be read and transformed to appropriate classes by:</p> <pre><code class="language-{r}">ex1 <- data.frame(read.csv( "https://brandmaier.de/semtree/wp-content/uploads/downloads/2012/07/example1.txt", sep = "\t")) ex1 <- transform(ex1, agegroup = factor(agegroup), training = factor(training), noise = factor(noise)) </code></pre> <h2 id="fitting-the-sem-tree">Fitting the SEM tree</h2> <p>Given the data, model, and mobster function are available, it is easy to fit the MOB tree with SEMs in every node of the tree. The five measurements are the dependent variables (<code class="language-plaintext highlighter-rouge">y</code>) that need to be passed to the model as a <code class="language-plaintext highlighter-rouge">"data.frame"</code>, the three covariates are the explanatory variables:</p> <pre><code class="language-{r}">library("partykit") tr <- mob(X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise, data = ex1, fit = lavaan_fit(growth_curve_model), control = mob_control(ytype = "data.frame")) </code></pre> <p>The resulting tree <code class="language-plaintext highlighter-rouge">tr</code> correctly detects the three subgroups that were simulated for the data by Brandmaier <em>et al.</em>. It can be visualized (with somewhat larger terminal nodes, all dropped to the bottom of the display):</p> <pre><code class="language-{r}">plot(tr, drop = TRUE, tnex = 2) </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/lgcmtree.png"><img src="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/lgcmtree.png" alt="Growth curve model tree" /></a></p> <p>The parameter estimates can also be extracted by <code class="language-plaintext highlighter-rouge">coef(tr)</code>:</p> <pre><code class="language-{r}">t(coef(tr)) ## 2 4 5 ## vari 0.086 0.080 0.105 ## meani 5.020 2.003 1.943 ## vars 0.500 1.627 0.675 ## means -0.144 -1.082 -0.495 ## cov -0.013 -0.041 0.028 ## residual 0.050 0.047 0.052 ## residual 0.050 0.047 0.052 ## residual 0.050 0.047 0.052 ## residual 0.050 0.047 0.052 ## residual 0.050 0.047 0.052 </code></pre> <p>The main parameters of interest are <code class="language-plaintext highlighter-rouge">meani</code>, the mean intercept, and <code class="language-plaintext highlighter-rouge">means</code>, the mean slope that both vary across the subgroups defined by <code class="language-plaintext highlighter-rouge">agegroup</code> and <code class="language-plaintext highlighter-rouge">training</code>: In node 2 the intercept is about 5 while in nodes 4 and 5 it is around 2. The slope is almost zero in node 2, about -1 in node 4, and about -0.5 in node 5. The <code class="language-plaintext highlighter-rouge">residual</code> variance is restricted to be constant across the five time points and hence repeated in the output.</p> <h2 id="further-displays">Further displays</h2> <p>By extracting the node-specific <code class="language-plaintext highlighter-rouge">meani</code> and <code class="language-plaintext highlighter-rouge">means</code> parameters, the expected growth can also be visualized in the following way:</p> <pre><code class="language-{r}">gr <- coef(tr)[, "meani"] + outer(coef(tr)[, "means"], 0:4) cl <- palette.colors(4, "Okabe-Ito")[-1] matplot(t(gr), type = "o", pch = 19, col = cl, ylab = "Expected growth", xlab = "Time", xlim = c(1, 5.2)) text(5, gr[, 5], paste("Node", rownames(gr)), col = cl, pos = 3) </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/expectedgrowth.png"><img src="https://www.zeileis.org/assets/posts/2020-09-07-lavaantree/expectedgrowth.png" alt="Expected growth" /></a></p> <p>Finally, using a custom printing function that only shows the subgroup size and the first six parameters, the tree can be nicely printed as:</p> <pre><code class="language-{r}">node_format <- function(node) { c("", sprintf("n = %s", node$nobs), capture.output(print(cbind(node$coefficients[1:6]), digits = 2L))[-1L]) } print(tr, FUN = node_format) ## Model-based recursive partitioning (lavaan_fit(growth_curve_model)) ## ## Model formula: ## X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise ## ## Fitted party: ## [1] root ## | [2] agegroup in 0 ## | n = 200 ## | vari 0.086 ## | meani 5.020 ## | vars 0.500 ## | means -0.144 ## | cov -0.013 ## | residual 0.050 ## | [3] agegroup in 1 ## | | [4] training in 0 ## | | n = 100 ## | | vari 0.080 ## | | meani 2.003 ## | | vars 1.627 ## | | means -1.082 ## | | cov -0.041 ## | | residual 0.047 ## | | [5] training in 1 ## | | n = 100 ## | | vari 0.105 ## | | meani 1.943 ## | | vars 0.675 ## | | means -0.495 ## | | cov 0.028 ## | | residual 0.052 ## ## Number of inner nodes: 2 ## Number of terminal nodes: 3 ## Number of parameters per node: 10 ## Objective function: 1330.735 </code></pre> <h2 id="concluding-remarks">Concluding remarks</h2> <p>The main purpose of this blog post was to show that it is relatively simple to fit model-based trees with custom models using the general <code class="language-plaintext highlighter-rouge">mob()</code> infrastructure from the <em>partykit</em> package. Specifically, <em>lavaan</em> makes it easy to fit SEM trees as the <em>lavaan</em> package readily provides all necessary components. As I had provided this as feedback to Arnold <em>et al.</em> and encouraged them to drill a bit deeper to better understand the differences between their adapted SEM tree algorithm and MOB, I thought I should share the code as it might be useful to others as well.</p> <p>One important difference between the new SEM tree algorithm and the current MOB implementation is the determination of the best split point. The new SEM tree also uses the scores for this while MOB is based on the log-likelihood in the subgroups and hence is slower searching splits in numeric covariates with many possible split points. While we also had experimented with score-based split point estimation in <em>party</em> this has never been released and is currently not available in <em>partykit</em>. However, we are working on making the split point selection more flexible in <em>partykit</em>.</p> <p>Of course, fitting the tree model is actually just the first step in an analysis of subgroups in a SEM. The subsequent steps for analyzing and interpreting the resulting tree model are at least as important. The work bei Brandmaier and his co-authors and their <em>semtree</em> package provide much more guidance on this.</p>2020-09-07T00:00:00+02:00