Abstract

In this project, I attempt to forecast the average strength (measured in IPFPoints) of the population of Raw powerlifters using data supplied by openpowerlifting.org. Because the data appears to be a stationary process around a linear trend, I detrended the data in three different ways and fit two to three ARIMA models to each detrended dataset, yielding a pool of eight tentative ARIMA models. I then compared the models trained on a given detrended dataset and chose the best model from each, leaving three candidate models for the forecast. I compared the three candidates by examining their residuals (since BIC cannot be compared across datasets) and their sum of squared errors (SSE) when predicting on test observations. Not surprisingly, the best models included the previous year’s observation in the same month. The two best models, Model 1 and Model 7 below, had root mean squared errors of ~6.66 and ~6.14 IPFPoints respectively. These numbers are not directly interpretable because the IPFPoints score depends on the weight and sex of an individual. However, to give a rough idea, if I were off by 7 points when estimating an 83kg (~183lb) male, that would be about a 6kg error. Since my model predicts population averages and not individual scores, such an interpretation cannot be applied directly to the forecast. Once the top models were selected, I refit them using the entire dataset and forecasted future observations. I’m waiting for openpowerlifting.org to update their numbers so I can validate the forecast against real data.

Background

My dataset is the “openpowerlifting.csv” dataset downloaded from Kaggle and generated by openpowerlifting.org. Each athlete in one powerlifting meet typically performs three attempts at the bench press, three attempts at the squat, and three attempts at the deadlift. This dataset essentially collects all lift attempts for most or all lifters in worldwide powerlifting meets since the 1960s. (I’ll likely eliminate years <1980 because data is limited for that time period.) There are different equipment divisions, equipment being what you are allowed to wear during competition. The divisions are “Single-ply”, “Multi-ply”, “Raw”, and “Wraps”. Single- and multi-ply indicate that the competitor can wear a high-elasticity suit (for squat and deadlift) and shirt (for bench), which enables the competitor to lift more weight. “Wraps” is a class where the competitor can wear knee wraps (again a piece of equipment enabling one to lift more weight). The “Raw” class means that the competitor is allowed to wear only a belt, wrist wraps, and perhaps sanctioned knee sleeves: no suit or knee wraps allowed. In addition, the dataset contains each competitor’s sex and weight class.

I’ll be looking at the performance of competitors over time. The performance metric of interest will be IPFPoints. IPFPoints are a transformation of TotalKg, where TotalKg = max(all three bench attempts) + max(all three squat attempts) + max(all three deadlift attempts). The transformation is designed to normalize TotalKg across gender and weight class so that awards can be given for the best “overall” lifter across men, women, and weight classes. It is a measure of relative performance within one’s weight and gender class: if the performance is very high with respect to the average and the variance of a given weight class, then the score is higher. The Wilks score is another similar measure, but I’m going to use IPFPoints because they were developed more recently and with a larger set of data. The metric, however, is not adequate for comparison across different divisions in all instances, so I will have to consider a more nuanced approach.
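For concreteness, here is a minimal sketch of how TotalKg could be rebuilt from the individual attempts. The attempt column names (Bench1Kg, Squat1Kg, etc.) and the convention that missed attempts are recorded as negative or missing are assumptions about the raw file, not something shown in this analysis.

# Sketch: rebuilding TotalKg from the nine attempts (column names are assumed,
# as is the convention that missed attempts are negative or NA)
library(dplyr)
d <- d %>%
  mutate(
    bestBench    = pmax(Bench1Kg, Bench2Kg, Bench3Kg, na.rm = TRUE),
    bestSquat    = pmax(Squat1Kg, Squat2Kg, Squat3Kg, na.rm = TRUE),
    bestDeadlift = pmax(Deadlift1Kg, Deadlift2Kg, Deadlift3Kg, na.rm = TRUE),
    myTotalKg    = bestBench + bestSquat + bestDeadlift
  )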

Tailoring the data

For my analysis, I’m going to restrict the data in the following ways:

1.) Get rid of years before 1980, as there is not nearly as much data collected before that time, which means that if I look at the average of each month or quarter, there may be insufficient counts…

2.) Remove the year 2019 whenever I am looking at aggregated statistics for a whole year, since 2019 is not completed in the data. The max Date is 2019-04-20.

3.) Remove all instances where the competitor did not perform all 3 lifts. In my opinion, having strength in the squat, bench press, and deadlift simultaneously is a different quality than having strength at any combination of the 3 (i.e just bench press, just deadlift, or bench press and squat, etc).

#Data constraint: keep full-meet (squat + bench + deadlift) results from 1980 onward
library(dplyr)
library(lubridate)

#d<- d %>% filter(year(Date)>1979,Event=="SBD",!(Federation %in% c("THSPA", "THSWPA")) )
d <- d %>% filter(year(Date) > 1979, Event == "SBD")

Question of interest: How have strength measures for Raw powerlifting been changing since the start of its growth spurt?

Powerlifting has gained immense popularity since around 2012. See the following chart (n=number of non-unique competitors, one person is counted each time they compete):

More specifically, this growth is primarily due to new competitors in the Raw division…

Since I’m personally interested in the Raw powerlifting division, I’d like to see how the strength of Raw powerlifters has been changing since more individuals began competing around 2012. To do this, I will summarize the constrained data by finding the average IPFPoints among Raw competitors for each month.
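As a rough sketch of how the summary below could be produced (the column names IPFPoints, Wilks, and TotalKg, and the use of floor_date() to build yearMonth, are assumptions), the monthly aggregation might look like this:

# Sketch of the monthly summary; column names are assumptions
library(dplyr)
library(lubridate)
monthly <- d %>%
  mutate(yearMonth = floor_date(Date, "month")) %>%
  group_by(yearMonth, Equipment) %>%
  summarize(
    avgIPFPoints = mean(IPFPoints, na.rm = TRUE),
    medianIPFPts = median(IPFPoints, na.rm = TRUE),
    avgWilks     = mean(Wilks, na.rm = TRUE),
    avgTotal     = mean(TotalKg, na.rm = TRUE),
    n            = n()
  )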

## # A tibble: 6 x 7
## # Groups:   yearMonth [6]
##   yearMonth  Equipment  avgIPFPoints medianIPFPts avgWilks avgTotal     n
##   <date>     <chr>             <dbl>        <dbl>    <dbl>    <dbl> <int>
## 1 1980-01-01 Single-ply         459.         457.     359.     490.   382
## 2 1980-02-01 Single-ply         459.         458.     361.     532.   348
## 3 1980-03-01 Single-ply         515.         518.     413.     608.   176
## 4 1980-04-01 Single-ply         487.         481.     383.     522.   189
## 5 1980-05-01 Single-ply         491.         488.     389.     534.   355
## 6 1980-06-01 Single-ply         558.         574.     452.     650.   170

Below is the Raw division’s average IPFPoints for each month in the constrained dataset. Red points indicate a lower sample size for that month. During and after 2012, each month’s sample size seems to be large enough to estimate the monthly average reasonably well. I opt to average the IPFPoints by month (as opposed to quarter, year, etc.) since there are plenty of competitors within each month over which I can average, which gives a sample of 88 monthly averages. Grouping by month is also not so granular and noisy that it drowns out the linear trend.
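A sketch of how such a plot might be drawn (the sample-size cutoff of 100 is only illustrative, and the monthly data frame comes from the aggregation sketch above):

# Sketch of the monthly-average plot with low-sample months flagged in red
library(ggplot2)
monthly %>%
  filter(Equipment == "Raw") %>%
  ggplot(aes(x = yearMonth, y = avgIPFPoints)) +
  geom_line() +
  geom_point(aes(color = n < 100)) +
  scale_color_manual(values = c("black", "red"), guide = "none") +
  labs(x = "Month", y = "Average IPFPoints (Raw division)")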

I perform one final constraint to eliminate years before 2012. This is the final dataframe that I will use to model and forecast avgIPFPoints…
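A sketch of that final constraint and the train/test labeling (the object and column names mirror the output below, but the exact code is an assumption):

# Keep Raw months from 2012 on; hold out the last 5 months as a test set
tsdRaw <- monthly %>%
  ungroup() %>%
  filter(Equipment == "Raw", yearMonth >= as.Date("2012-01-01")) %>%
  arrange(yearMonth) %>%
  mutate(dataSet = ifelse(row_number() > n() - 5, "test", "train"))

train <- filter(tsdRaw, dataSet == "train")
test  <- filter(tsdRaw, dataSet == "test")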

##    yearMonth Equipment avgIPFPoints medianIPFPts avgWilks avgTotal   n
## 1 2012-01-01       Raw     506.5944       505.21 346.7853 505.4663 261
## 2 2012-02-01       Raw     442.9443       438.06 305.7509 432.4376 351
## 3 2012-03-01       Raw     502.1596       507.33 344.2972 500.0479 564
## 4 2012-04-01       Raw     526.1417       522.95 358.5486 513.5604 439
## 5 2012-05-01       Raw     473.2145       479.67 327.3114 476.1177 369
## 6 2012-06-01       Raw     517.0691       513.57 352.4177 502.4561 739
##   dataSet
## 1   train
## 2   train
## 3   train
## 4   train
## 5   train
## 6   train

More properties of the dataset…

Below is the time series plot of the training data. I’ve omitted the most recent 5 observations so that I can see how well the model predicts them.

I opt not to transform this data, as I would lose a lot of interpretability by using a Box-Cox transformation where the MLE for lambda is 2.
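For reference, one way to check that lambda estimate is the profile likelihood from MASS::boxcox; the trend-only regression used here is my own assumption, since the exact call isn’t shown above.

# Profile likelihood for the Box-Cox lambda (sketch; the model used here is assumed)
MASS::boxcox(lm(avgIPFPoints ~ yearMonth, data = train), lambda = seq(-2, 3, 0.1))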

## Warning in adf.test(tsdRaw$avgIPFPoints): p-value smaller than printed p-
## value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  tsdRaw$avgIPFPoints
## Dickey-Fuller = -4.4958, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

An ADF test suggests that the data is a deterministic linear trend plus a stationary process (i.e., trend-stationary). This is also plausible from the time series plot above.

A brief overview of how I will compare models

Okay, at this point, there will be many moving parts and comparisons. It is a good idea to provide a brief overview of how I plan to compare models.

First: Since models 1 through 3 use the same detrended data (detrended), I will find a pool of 3 models using the acf, pacf, and armasubsets. I will then compare the AIC, AICc, and BIC of these three models and select the first “candidate” model, which will be compared with the second and third candidate models mentioned below.

Second: Since models 4 through 6 use the same detrended data (detrended2), I will find a pool of 3 models using the acf, pacf, and armasubsets. I will then compare the AIC, AICc, and BIC of these three models and select the second “candidate” model.

Third: Since models 7 and 8 below use the same detrended data (detrended3), I will find a pool of 2 models using the acf, pacf, and armasubsets. I will then compare the AIC, AICc, and BIC of these two models and select the third “candidate” model. I only use 2 models here because one of the top 3 models threw an error when I attempted to call sarima() (“non-finite finite-difference value”), and I couldn’t manage to remedy the error.

Fourth: Once the three candidate models are selected, I will compare their residuals and SSEs (based on the 5 observations reserved for testing). I will select the two best models from these three candidates.

Fifth: Once the top two models are selected, I will refit them using the entire dataset and produce a future forecast for both models.

Finding the three candidate ARMA(p,q) processes. Each will model one of the 3 detrended datasets.

Finding the 1st candidate model

selecting three tentative ARMA(p,q) processes

Let’s start by detrending the data, plotting the detrended data, its acf, pacf, and armasubsets…
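A sketch of that workflow, assuming the detrending regression mirrors the full-data regression reported later (the month-dummy columns are assumed to be 0/1 indicators), with acf2() from astsa and armasubsets() from TSA:

library(astsa)
library(TSA)

# Remove the deterministic linear trend (the exact regressors used here are assumed)
trendFit1 <- lm(avgIPFPoints ~ yearMonth + monthFeb + monthMay + monthJun, data = train)
detrended <- data.frame(detrend = resid(trendFit1))

plot.ts(detrended$detrend)                                 # detrended series
acf2(detrended$detrend, max.lag = 36)                      # sample ACF and PACF
plot(armasubsets(detrended$detrend, nar = 14, nma = 14))   # BIC-based subset search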

The top 3 models for the detrended data by BIC are…

  • 1.) model 1: AR(12), where the 1st and 12th AR coefficients are the only nonzero coefficients.
  • 2.) model 2: ARMA(12,12) where the 12th AR coefficient and the 1st and 12th MA coefficients are the only nonzero coefficients.
  • 3.) model 3: AR(12), where the 12th AR coefficient is the only nonzero coefficient. (A sketch of how these restrictions can be written as sarima() calls follows this list.)
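For concreteness, here is roughly how those three restricted fits can be specified via sarima()’s fixed argument (NA marks a freely estimated coefficient and the trailing 0 fixes the mean at zero; the vectors for models 2 and 3 are my reading of the descriptions above, not code taken from the analysis):

# Model 1: AR(12), only ar1 and ar12 free
m1 <- sarima(detrended$detrend, 12, 0, 0,  fixed = c(NA, rep(0, 10), NA, 0))
# Model 2: ARMA(12,12), only ar12, ma1, and ma12 free
m2 <- sarima(detrended$detrend, 12, 0, 12, fixed = c(rep(0, 11), NA, NA, rep(0, 10), NA, 0))
# Model 3: AR(12), only ar12 free
m3 <- sarima(detrended$detrend, 12, 0, 0,  fixed = c(rep(0, 11), NA, 0))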

Here is the first model fit

##      Estimate     SE t.value p.value
## ar1   -0.2600 0.1010 -2.5747  0.0119
## ar12   0.3254 0.1158  2.8090  0.0062

and the second model fit

##      Estimate     SE t.value p.value
## ar1   -0.2600 0.1010 -2.5747  0.0119
## ar12   0.3254 0.1158  2.8090  0.0062

and the third model fit

##      Estimate     SE t.value p.value
## ar1   -0.2600 0.1010 -2.5747  0.0119
## ar12   0.3254 0.1158  2.8090  0.0062

comparing AIC, AICc, and BIC for models 1-3

After fitting these models, here are the AIC, AICc, and BIC comparisons:
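A sketch of how these criteria can be pulled from the fits (recent versions of astsa store them in the $ICs component of the sarima object; older versions expose $AIC, $AICc, and $BIC separately):

# Collect the information criteria for the three tentative fits
rbind(model1 = m1$ICs, model2 = m2$ICs, model3 = m3$ICs)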

As you can see, model 1 (an AR(12) model where only the 1st and 12th AR coefficients are estimated as nonzero) performs the best across all 3 metrics on detrended. In addition, it is the simplest model, with a very nice interpretation (Y_t is a function of the previous year’s observation during the same month). So it will be chosen as the first candidate model.

Finding the 2nd candidate model

selecting three tentative ARMA(p,q) processes

The top 3 models for detrended2 in order by BIC are…

  • 1.) model 4: AR(12), where only the 4th and 12th AR coefficients are nonzero. This model had an insignificant 4th AR coefficient, and removing it would make it identical to model 5, so we keep it for now only for comparison purposes.
  • 2.) model 5: AR(12), where only the 12th AR coefficient is nonzero.
  • 3.) model 6: ARMA(12,3) where only the ar4, ar12, and ma3 coefficients are nonzero. This third model had a nonsignificant 4th AR coefficient as well, and removing it does lead to a lower AIC, AICc, and BIC. So model 6 will still be ARMA(12,3), but with only the ar12 and ma3 coefficients nonzero.

Here is the fourth model fit

##      Estimate     SE t.value p.value
## ar4   -0.0030 0.1080 -0.0281  0.9776
## ar12   0.4921 0.1109  4.4361  0.0000

and the fifth model fit

##      Estimate     SE t.value p.value
## ar12   0.4921 0.1109  4.4371       0

and the sixth model fit

##      Estimate     SE t.value p.value
## ar12   0.4525 0.1137  3.9812  0.0001
## ma3    0.2359 0.1262  1.8693  0.0652

comparing AIC, AICc, and BIC for models 4-6

Model 6 is the best model in terms of AIC and AICc. However, model 5 is the best in terms of BIC. In addition, it is the simplest and most interpretable model, with only one significant coefficient (ar12). For these reasons, model 5 is the second candidate model.

Finding the 3rd candidate model

selecting two tentative ARMA(p,q) processes

The top 2 models for the detrended3 data by BIC are…

  • 1.) model 7: ARMA(12,1), where only the ar12 and ma1 coefficients are nonzero.
  • 2.) model 8: ARMA(14,8), where only the ar1, ar8, ar12, ar14, and ma8 coefficients are nonzero.

Here is the 7th model fit

##      Estimate     SE t.value p.value
## ar12   0.2083 0.1398  1.4896  0.1402
## ma1   -0.2638 0.1007 -2.6200  0.0105

and the 8th model fit

##      Estimate     SE   t.value p.value
## ar1   -0.0926 0.0010  -97.2107       0
## ar8   -0.3011 0.0008 -381.3748       0
## ar12   0.1537 0.0022   69.2625       0
## ma1   -0.2884 0.0047  -60.7980       0
## ma8    0.4316 0.0009  498.0781       0

Comparing AIC, AICc, and BIC for models 7 and 8

Model 7 outperforms model 8 and will thus be the third candidate model.

Comparing candidate models

Okay, so models 1, 5, and 7 are the candidate models. Let’s compare them using residuals and SSE.

Comparing residuals of candidate models

# Model 1: AR(12) on detrended, only ar1 and ar12 estimated (mean fixed at 0)
m1 <- sarima(detrended$detrend, 12, 0, 0, fixed = c(NA, rep(0, 10), NA, 0))
## initial  value 2.403344 
## iter   2 value 2.297748
## iter   3 value 2.280937
## iter   4 value 2.280277
## iter   5 value 2.280270
## iter   6 value 2.280269
## iter   6 value 2.280269
## final  value 2.280269 
## converged
## initial  value 2.473486 
## iter   2 value 2.467751
## iter   3 value 2.467570
## iter   4 value 2.467570
## iter   4 value 2.467570
## iter   4 value 2.467570
## final  value 2.467570 
## converged

# Model 5: AR(12) on detrended2, only ar12 estimated (mean fixed at 0)
m5 <- sarima(detrended2$detrend, 12, 0, 0, fixed = c(rep(0, 11), NA, 0))
## initial  value 2.479005 
## iter   2 value 2.407781
## iter   3 value 2.362989
## iter   4 value 2.362605
## iter   5 value 2.362603
## iter   6 value 2.362602
## iter   6 value 2.362602
## iter   6 value 2.362602
## final  value 2.362602 
## converged
## initial  value 2.574895 
## iter   2 value 2.566981
## iter   3 value 2.566885
## iter   4 value 2.566884
## iter   4 value 2.566884
## final  value 2.566884 
## converged

# Model 7: ARMA(12,1) on detrended3, only ar12 and ma1 estimated (mean fixed at 0)
m7 <- sarima(detrended3$detrend, 12, 0, 1, fixed = c(rep(0, 11), NA, NA, 0))
## initial  value 2.247278 
## iter   2 value 2.191119
## iter   3 value 2.168355
## iter   4 value 2.168011
## iter   5 value 2.167896
## iter   6 value 2.167896
## iter   6 value 2.167896
## final  value 2.167896 
## converged
## initial  value 2.457022 
## iter   2 value 2.451722
## iter   3 value 2.451040
## iter   4 value 2.450973
## iter   5 value 2.450973
## iter   5 value 2.450973
## iter   5 value 2.450973
## final  value 2.450973 
## converged

The residuals of model 5 seem to have some autocorrelation, whereas those of models 1 and 7 do not. However, model 5’s residuals appear to be closer to normal (it has the only nonsignificant p-value in the Shapiro-Wilk tests below).

## 
##  Shapiro-Wilk normality test
## 
## data:  m1$fit$residuals
## W = 0.96819, p-value = 0.03745
## 
##  Shapiro-Wilk normality test
## 
## data:  m5$fit$residuals
## W = 0.97838, p-value = 0.1753
## 
##  Shapiro-Wilk normality test
## 
## data:  m7$fit$residuals
## W = 0.96228, p-value = 0.01557

Comparing candidate model predictions against true values using SSE

Here are the predictions for Model 1:

for Model 5:

and for Model 7:

Here is what the forecast looks like for the three models when looking at the actual data (as opposed to the detrended data):
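A sketch of how the detrended forecasts can be mapped back onto the IPFPoints scale before computing the SSE (the names linearPart1 and arimaPart match the code below; the refit with stats::arima is my own assumption, made only so that predict() can be used):

# Trend component over the 5 test months (trendFit1 is the detrending regression from earlier)
linearPart1 <- predict(trendFit1, newdata = test)

# ARMA component: refit model 1's restricted AR(12) with stats::arima so predict() works
fit1 <- arima(detrended$detrend, order = c(12, 0, 0),
              fixed = c(NA, rep(0, 10), NA, 0), transform.pars = FALSE)
arimaPart <- data.frame(m1Pred = as.numeric(predict(fit1, n.ahead = 5)$pred))

# Forecast on the original scale = extrapolated trend + forecast of the detrended series
m1Forecast <- linearPart1 + arimaPart$m1Pred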

Let’s compare the SSE between the three candidate models.

#model 1 SSE
sum(( test$avgIPFPoints  -(linearPart1+arimaPart$m1Pred) )^2)
## [1] 222.0705
sqrt((sum(( test$avgIPFPoints  -(linearPart1+arimaPart$m1Pred) )^2)/5))
## [1] 6.664391
#model 5 SSE
sum(( test$avgIPFPoints  -(linearPart2+arimaPart$m5Pred) )^2)
## [1] 375.0917
sqrt((sum(( test$avgIPFPoints  -(linearPart2+arimaPart$m5Pred) )^2)/5))
## [1] 8.661312
#model 7 SSE
sum(( test$avgIPFPoints  -(linearPart3+arimaPart$m7Pred) )^2)
## [1] 188.3379
sqrt((sum((  test$avgIPFPoints  -(linearPart3+arimaPart$m7Pred) )^2)/5))
## [1] 6.137392
149.41-142.31
## [1] 7.1
156.5-149.41
## [1] 7.09
376.5-369.4
## [1] 7.1
490.05-482.95
## [1] 7.1
#206 kg - 200 kg

Since model 5 had autocorrelation in its residuals and by far the highest SSE, the top two models for forecasting will be models 1 and 7. To give some interpretation, Model 1 had a root MSE of ~6.66 IPFPoints, and Model 7 had a root MSE of ~6.14 IPFPoints. This difference cannot be directly converted back into TotalKg or TotalLb, as IPFPoints maps to many different totals depending on the weight and sex of the individual. However, for the sake of example, if I were to estimate an 83kg (~183lb) male’s IPFPoints score and I was off by 7 points, that would translate to an error of about 6kg.

Refitting top two models with the whole dataset for forecast

Now I will fit models 1 and 7 to the whole dataset.
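A sketch of the model 1 refit on all 88 months (the regression formula matches the output below; per that output, only ar12 is left free in the full-data ARIMA fit):

# Re-estimate the trend on the full dataset, then refit the AR structure to its residuals
trendFull1 <- lm(avgIPFPoints ~ yearMonth + monthFeb + monthMay + monthJun, data = tsdRaw)
m1Full <- sarima(resid(trendFull1), 12, 0, 0, fixed = c(rep(0, 11), NA, 0))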

top 2 models’ coefficients

These are the new coefficients used to detrend the data when considering the whole dataset for model 1:

## 
## Call:
## lm(formula = avgIPFPoints ~ yearMonth + monthFeb + monthMay + 
##     monthJun, data = tsdRaw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.923  -8.606   0.048   5.819  43.003 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 292.913762  29.881636   9.802 1.63e-15 ***
## yearMonth     0.012327   0.001788   6.893 9.86e-10 ***
## monthFeb    -14.163287   4.848981  -2.921 0.004494 ** 
## monthMay    -11.955369   5.153327  -2.320 0.022801 *  
## monthJun     18.937875   5.151152   3.676 0.000418 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.95 on 83 degrees of freedom
## Multiple R-squared:  0.4843, Adjusted R-squared:  0.4594 
## F-statistic: 19.48 on 4 and 83 DF,  p-value: 2.454e-11

and the new coefficients used to detrend the data when considering the whole dataset for model 7:

## 
## Call:
## lm(formula = avgIPFPoints ~ yearMonth + month, data = tsdRaw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.467  -6.908  -0.258   6.981  40.930 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 286.088822  30.152047   9.488 1.76e-14 ***
## yearMonth     0.012288   0.001793   6.853 1.75e-09 ***
## monthFeb     -6.686111   6.483692  -1.031 0.305752    
## monthMar      6.393294   6.484324   0.986 0.327322    
## monthApr      9.502315   6.485473   1.465 0.147054    
## monthMay     -4.481848   6.711942  -0.668 0.506348    
## monthJun     26.412610   6.711247   3.936 0.000184 ***
## monthJul     15.468174   6.711012   2.305 0.023941 *  
## monthAug      6.037659   6.711223   0.900 0.371193    
## monthSep      3.297308   6.711894   0.491 0.624675    
## monthOct     13.147988   6.712982   1.959 0.053879 .  
## monthNov      6.261340   6.714558   0.933 0.354069    
## monthDec      8.133202   6.716522   1.211 0.229726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.97 on 75 degrees of freedom
## Multiple R-squared:  0.5329, Adjusted R-squared:  0.4581 
## F-statistic: 7.129 on 12 and 75 DF,  p-value: 1.759e-08

Here is the fit for the first ARIMA model when including all data points (not just training data):

##      Estimate     SE t.value p.value
## ar12   0.3485 0.1152  3.0252  0.0033
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##       ar1  ar2  ar3  ar4  ar5  ar6  ar7  ar8  ar9  ar10  ar11    ar12
##         0    0    0    0    0    0    0    0    0     0     0  0.3485
## s.e.    0    0    0    0    0    0    0    0    0     0     0  0.1152
##       xmean
##           0
## s.e.      0
## 
## sigma^2 estimated as 141.5:  log likelihood = -343.53,  aic = 691.07

and here is the fit for the 7th ARIMA model when including all data points (not just training data):

##      Estimate     SE t.value p.value
## ar12   0.2083 0.1398  1.4896  0.1402
## ma1   -0.2638 0.1007 -2.6200  0.0105
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##       ar1  ar2  ar3  ar4  ar5  ar6  ar7  ar8  ar9  ar10  ar11    ar12
##         0    0    0    0    0    0    0    0    0     0     0  0.2083
## s.e.    0    0    0    0    0    0    0    0    0     0     0  0.1398
##           ma1  xmean
##       -0.2638      0
## s.e.   0.1007      0
## 
## sigma^2 estimated as 133.6:  log likelihood = -321.2,  aic = 648.41

top 2 models’ equations

The model equations are as follows:

Let t = the yearMonth, Y_t = the avgIPFPoints at t…

Model 1

\[ Y_t= 292.91376 + 0.01233t - 14.16329(monthFeb) - 11.95537(monthMay) + 18.93788(monthJun) + 0.3485Y_{t-12} +e_t \]

\[ e_t \sim WN(0, \sigma^2 \approx 141.5 ) \]

Model 7

\[ Y_t = 286.08882 + 0.01229t -6.68611(monthFeb) + 6.39329(monthMar) + 9.50232(monthApr) -4.48185(monthMay) +26.41261(monthJun) + 15.46817(monthJul) + 6.03766(monthAug)+ 3.29731(monthSep) +13.14799(monthOct) + 6.26134(monthNov) + 8.13320(monthDec) + 0.2083045 Y_{t-12} + e_t -0.2637751 e_{t-1} \]

\[ e_t \sim WN(0, \sigma^2 \approx 133.6 ) \]

Top Models’ Forecasts

Now, let’s use these models to forecast one year into the future!
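A sketch of how the model 1 forecast could be assembled (futureMonths and the 0/1 month dummies are assumptions; trendFull1 is the full-data trend regression from the refit above):

# Next 12 calendar months after the last observed yearMonth
futureMonths <- data.frame(
  yearMonth = seq(max(tsdRaw$yearMonth), by = "month", length.out = 13)[-1]
)
futureMonths$monthFeb <- as.numeric(format(futureMonths$yearMonth, "%m") == "02")
futureMonths$monthMay <- as.numeric(format(futureMonths$yearMonth, "%m") == "05")
futureMonths$monthJun <- as.numeric(format(futureMonths$yearMonth, "%m") == "06")

# Linear trend extrapolated one year ahead
linearFuture1 <- predict(trendFull1, newdata = futureMonths)

# ARMA forecast of the detrended series, 12 steps ahead
fitFull1 <- arima(resid(trendFull1), order = c(12, 0, 0),
                  fixed = c(rep(0, 11), NA, 0), transform.pars = FALSE)
m1Pred <- as.numeric(predict(fitFull1, n.ahead = 12)$pred)

# Combined forecast on the avgIPFPoints scale
m1Forecast <- linearFuture1 + m1Pred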

##         m1Pred      m7Pred
## 1   0.82395916 -0.81420259
## 2  -1.20685674 -0.72580020
## 3   2.54800737 -0.13529816
## 4  -0.09001247  0.25975655
## 5  -2.81154032 -0.81572625
## 6   5.72924250  2.29501412
## 7   2.40727866  1.73376427
## 8  -4.47887619 -2.85648393
## 9  -3.30734790 -0.41692583
## 10  0.03245647  0.03039012
## 11  0.87835150  0.77573026
## 12  1.49014921  0.48878852
## Time Series:
## Start = 89 
## End = 100 
## Frequency = 1 
##        1        2        3        4        5        6        7        8 
## 502.1912 533.5550 523.5697 514.9152 511.4803 524.8104 517.7434 515.3937 
##        9       10       11       12 
## 510.0810 504.2231 518.4042 521.6072

Conclusion

Models 1 and 7, the two final models, were very similar in that the observation from the previous year in the same month (Y_{t-12}) was significant and positive, so the higher the average strength was one year prior to month t, the higher we expect it to be this year at month t, holding other variables constant. It would also appear that competing in the winter negatively affects strength scores, while competing in summer months positively affects them in general. It would be interesting to research the topic further, but my first guess is that this is due to more competitive meets being scheduled in the summer and less competitive meets in the winter on average. In addition, I do not really think this model can go on linearly forever. The population of lifters may get stronger over time, but that can only happen up to a point. I do think there’s a lot of room for improvement in the population as it continues to grow and training methodology keeps improving for the foreseeable future, however. I will be interested to see how my model performs over the next few years as I watch these numbers change!