Evaluating predictive models for a 28-day forecast of Walmart Sales within the states of California, Texas, Wisconsin through achieving the lowest possible Root Mean Square Error (RMSE).
Our report presents a comprehensive analysis of Walmart retail stores’ sales data from California, Texas, and Wisconsin. The purpose of our models is to understand sales trends and patterns by examining daily units sold and average prices across these states. Additionally, we build simple linear models to forecast sales in each state, evaluating their performance to refine our predictions. Our metric for comparing the accuracy of models is RMSE. Further analysis techniques led us to use forecasting methods such as LASSO, AR, MA, and ARMA. These methods led to increased performance in our prediction and a decrease in RMSE across all states.
The analysis for this study has been performed using R Studio. The models and Exploratory Data Analysis have also been executed using development tools like R Studio Several packages have been used to perform the initial and final outcome EDA for the analysis. For the initial EDA, we used tidyr, readr, dplyr, tidyverse. Packages like numpy, pandas, tidyverse, etc. have been used for data wrangling and manipulation. For the models that have been created, several packages like ‘plotly’, ‘glmnet’ were implemented.
From this analysis, we can guide strategic decisions for inventory management and promotional activities for Walmart.
Previous studies have been performed to predict sales for Walmart based on the availability of relevant historical data. Rashmi Jeswani from Rochester’s Institute of Technology College of Computing and Information Sciences analyzed the Walmart data set to predict sales (“Predicting Walmart Sales, Exploratory Data Analysis, and Walmart Sales Dashboard” 2017). This paper explored the performance of forecasts for future weekly sales for Walmart stores based on many different models including linear, lasso regression, random forest, and gradient boosting. They mentioned further analysis can be done using ARMA and ARIMA models which we have conducted in this study.
Michael Crown another data scientist in his paper “Weekly Sales Forecasts Using Non-Seasonal ARIMA Models”(2016) Autoregressive Integrated Moving Average (ARIMA) modeling was used in this paper to create one year of weekly forecasts using 2.75 years of sales data, with features for the store, department, date, weekly sales, and if the week contains a major holiday. In this paper, the metric for comparing the accuracy of models is the NRMSE. It was found that “NRMSE for individual department forecasts was 0.155. In other words, 75 percent of the department-wise forecasts had an average error that was less than 15.5% of the mean sales during that time”. We used a similar data set but One of the limitations of this data set that Crown mentioned was the limited data of 143 weeks (2.75 Years). Our data set improves on this with nearly 5 years of data.
Studies on price rigidity during holidays helped us explain why we see what we see in our price graph throughout the years. ‘Holiday Price Rigidity and Cost of Price Adjustment’ (DANIEL LEVY et. al.) examines the prediction that Due to higher store traffic, tasks become more urgent during holidays, and thus the holiday-period opportunity cost of price adjustment may increase dramatically for retail stores’ which leads to greater price rigidity. This paper used weekly retail scanner price data from a major Midwestern supermarket chain. They concluded that the menu cost theory offers the best explanation for the holiday period price rigidity. Menu cost theory is when firms or in this case Walmart hesitate to adjust prices frequently due to the expenses incurred in changing them, such as reprinting menus or updating catalogs.
Random peak in sales of daily average units in Texas on 6/15/2015, could potentially be due to the fact that there was a Tropical Storm that was forecasted to hit Texas from the South. Potentially, people decided to stock up on essentials before the storm hit. 6/16/2015 was the day of the storm and so the graph sees a dip in units. There are yearly dips in daily units sold on or before Christmas possibly because people are spending time with family and certain store locations have limited hours reducing opportunity for sales. In terms of hierarchy, California is the clear frontrunner in the amount of units sold daily, which we think is due to the high population and the difference in store locations among states. Because the population is larger in California it is not inconceivable that the daily units sold would be higher even with fewer store locations but it was surprising to find that there was a striking similarity in the trend of Wisconsin and Texas, as Texas has 509 Walmart locations and therefore there would be more scope for sales.
On initial intuition, we thought California would have had the highest average prices for goods as California has the highest cost of living for the three states but we found that Wisconsin is consistently higher. From 2011 to 2013, there was a decent amount of difference in average prices for all three states, but from 2013 onwards, the prices have converged to similar levels. We think it could be due to logistical reasons such as an increase in distribution centers, but we have no conclusive evidence for the reason for the convergence. The average prices for the three states follow similar trends of rising and falling together, around the same periods, which can be indicative of a nationalistic trend.
Outliers on each month represent a significant event or a holiday such as the one we see in December at nearly 0 Monthly Units sold for each of California, Texas, and Wisconsin. We see an increase in the median from January to August. We assume this is because some workers get the summer off and tourism is increased during the summer months. Back-to-school sales at Walmart may also be a reason we see this increase in those months but drop off in the subsequent months.
The days of the week with the most sales are Saturday and Sunday as expected as most grocery shopping trips occur on the weekend. The parabolic shape shows that people prepare for the week by shopping on the weekend and sales decrease in the middle of the week for that reason. The variance for the daily units sold for Texas was the smallest followed by California and Wisconsin with the largest variance amongst the three states. This is probably because there are far fewer Walmart locations in Wisconsin than in California and Texas so the scope for variance in daily units is larger and the distance between stores could mean that fewer people have access to Walmart as compared to other states. As Texas has the highest amount of stores of the three stores, stores are closer to people, so people are more likely to go to them.
We created a function to calculate the root squared mean error (RMSE) of an OLS model compared to the actual dataset. We then created an OLS model using the lm() function. The model predicts the values of the specified state’s unit sales, which is our dependent variable, using certain predictors (combined_month, combined_day, combined_event, snap, and price) from the training dataset. We then created plots and calculated the RMSE for each of the states. The graphs all generally follow the same trends as the actual values, however, all models tend to underestimate the values. Wisconsin and California however underestimate at a higher magnitude compared to the Texas Model which is closer, but still underestimates. The Texas model follows the actual predicted values accurately. The RMSE for the 3 states are California: 2522.48008765255, Texas: 1218.71943323679, and Wisconsin: 3220.363976644. The month and day variables were all statistically significant, as were the Snap and price variables. Certain events were also significant. Below is the summary of all coefficients, as well as the plots for the graphs.
##
## Call:
## lm(formula = CA ~ combined_month + combined_day + combined_event1 +
## combined_event2 + snap_CA + CA_Price - 1, data = combined5Models_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5486.8 -934.2 -39.8 948.1 6190.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## combined_month1 -63972.24 1830.49 -34.948 < 2e-16 ***
## combined_month2 -63562.73 1820.97 -34.906 < 2e-16 ***
## combined_month3 -63716.13 1827.28 -34.869 < 2e-16 ***
## combined_month4 -63500.34 1827.63 -34.745 < 2e-16 ***
## combined_month5 -63707.30 1818.48 -35.033 < 2e-16 ***
## combined_month6 -62566.46 1818.11 -34.413 < 2e-16 ***
## combined_month7 -62846.71 1826.40 -34.410 < 2e-16 ***
## combined_month8 -62607.36 1827.77 -34.253 < 2e-16 ***
## combined_month9 -63027.25 1830.64 -34.429 < 2e-16 ***
## combined_month10 -63364.89 1833.00 -34.569 < 2e-16 ***
## combined_month11 -64329.08 1833.23 -35.091 < 2e-16 ***
## combined_month12 -64206.08 1833.28 -35.023 < 2e-16 ***
## combined_dayMonday -4303.42 133.69 -32.190 < 2e-16 ***
## combined_dayTuesday -5356.60 130.51 -41.042 < 2e-16 ***
## combined_dayWednesday -5687.89 131.74 -43.174 < 2e-16 ***
## combined_dayThursday -5556.64 130.93 -42.441 < 2e-16 ***
## combined_dayFriday -3912.81 130.33 -30.023 < 2e-16 ***
## combined_daySaturday -478.47 130.40 -3.669 0.00025 ***
## combined_event1Chanukah End 167.08 674.92 0.248 0.80451
## combined_event1Christmas -13679.75 674.81 -20.272 < 2e-16 ***
## combined_event1Cinco De Mayo -890.34 753.75 -1.181 0.23767
## combined_event1ColumbusDay 797.92 680.16 1.173 0.24089
## combined_event1Easter -1780.71 750.74 -2.372 0.01780 *
## combined_event1Eid al-Fitr -5.19 668.76 -0.008 0.99381
## combined_event1EidAlAdha -119.86 668.14 -0.179 0.85765
## combined_event1Father's day -2052.92 757.38 -2.711 0.00678 **
## combined_event1Halloween -1593.96 674.92 -2.362 0.01829 *
## combined_event1IndependenceDay -852.12 676.07 -1.260 0.20768
## combined_event1LaborDay 3266.60 681.84 4.791 1.79e-06 ***
## combined_event1LentStart -329.97 617.90 -0.534 0.59339
## combined_event1LentWeek2 -100.92 618.38 -0.163 0.87037
## combined_event1MartinLutherKingDay 1653.70 680.43 2.430 0.01518 *
## combined_event1MemorialDay 2140.34 680.80 3.144 0.00169 **
## combined_event1Mother's day -2957.67 680.35 -4.347 1.45e-05 ***
## combined_event1NBAFinalsEnd -279.52 753.59 -0.371 0.71074
## combined_event1NBAFinalsStart -145.72 673.57 -0.216 0.82875
## combined_event1NewYear -4011.03 676.51 -5.929 3.62e-09 ***
## combined_event1OrthodoxChristmas -431.40 676.63 -0.638 0.52383
## combined_event1OrthodoxEaster -434.95 1058.20 -0.411 0.68110
## combined_event1Pesach End -398.70 675.13 -0.591 0.55489
## combined_event1PresidentsDay 1890.09 624.08 3.029 0.00249 **
## combined_event1Purim End 131.04 615.18 0.213 0.83134
## combined_event1Ramadan starts -186.89 667.22 -0.280 0.77943
## combined_event1StPatricksDay 15.82 616.31 0.026 0.97952
## combined_event1SuperBowl -1510.80 625.45 -2.416 0.01581 *
## combined_event1Thanksgiving -3670.75 680.52 -5.394 7.77e-08 ***
## combined_event1ValentinesDay -1429.44 617.85 -2.314 0.02080 *
## combined_event1VeteransDay 1417.20 675.19 2.099 0.03595 *
## combined_event2Cinco De Mayo 472.57 1824.30 0.259 0.79563
## combined_event2Easter -3408.94 1815.87 -1.877 0.06063 .
## combined_event2Father's day -2216.33 1659.20 -1.336 0.18178
## combined_event2OrthodoxEaster 1008.21 1658.92 0.608 0.54343
## snap_CA1 1193.17 74.16 16.090 < 2e-16 ***
## CA_Price 18587.47 414.03 44.894 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1482 on 1859 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9907
## F-statistic: 3763 on 54 and 1859 DF, p-value: < 2.2e-16
Many variables, including combined_month, combined_day, combined_event1, snap_CA, and CA_Price, have significant effects on the sales in CA. For instance, the coefficients for months indicate seasonal effects (e.g., June has a significantly higher coefficient), and different weekdays show different sales trends; for example, combined_dayTuesday has a large negative coefficient). The combined_event1Christmas has a large negative coefficient (-13679.75), indicating a significant drop in sales during this period, likely due to store closures or holiday impacts.
Plotting our predicted versus actual data values across a 28-day forecasting period in California, our simple OLS model portrayed a close resemblance to the actual data trend. However, it consistently displayed a tendency to underestimate the actual values. This discrepancy may be due to the omission of certain relevant explanatory variables in our model, or a need to fine-tune the intercept term to better align with the observed data. Further exploration into these factors could have potentially enhanced the predictive accuracy of our model.
##
## Call:
## lm(formula = TX ~ combined_month + combined_day + combined_event1 +
## combined_event2 + snap_TX + TX_Price - 1, data = combined5Models_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3685.2 -524.6 -19.6 492.4 6951.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## combined_month1 -27842.01 941.26 -29.580 < 2e-16 ***
## combined_month2 -27099.16 934.32 -29.004 < 2e-16 ***
## combined_month3 -27324.53 938.15 -29.126 < 2e-16 ***
## combined_month4 -27281.68 938.42 -29.072 < 2e-16 ***
## combined_month5 -27339.81 934.24 -29.264 < 2e-16 ***
## combined_month6 -26705.02 934.61 -28.574 < 2e-16 ***
## combined_month7 -26914.46 938.18 -28.688 < 2e-16 ***
## combined_month8 -26637.79 938.46 -28.385 < 2e-16 ***
## combined_month9 -27118.20 940.58 -28.831 < 2e-16 ***
## combined_month10 -27302.91 941.84 -28.989 < 2e-16 ***
## combined_month11 -27902.38 942.42 -29.607 < 2e-16 ***
## combined_month12 -27661.07 942.99 -29.333 < 2e-16 ***
## combined_dayMonday -2577.89 80.54 -32.007 < 2e-16 ***
## combined_dayTuesday -3366.79 78.63 -42.818 < 2e-16 ***
## combined_dayWednesday -3430.04 79.37 -43.216 < 2e-16 ***
## combined_dayThursday -3337.78 78.88 -42.316 < 2e-16 ***
## combined_dayFriday -2478.35 78.52 -31.564 < 2e-16 ***
## combined_daySaturday -507.09 78.56 -6.455 1.38e-10 ***
## combined_event1Chanukah End 504.45 406.61 1.241 0.214896
## combined_event1Christmas -9036.20 406.54 -22.227 < 2e-16 ***
## combined_event1Cinco De Mayo -336.64 454.12 -0.741 0.458599
## combined_event1ColumbusDay 542.48 409.77 1.324 0.185706
## combined_event1Easter 249.87 452.25 0.553 0.580666
## combined_event1Eid al-Fitr 21.17 403.13 0.053 0.958116
## combined_event1EidAlAdha 22.65 402.52 0.056 0.955141
## combined_event1Father's day -487.75 456.25 -1.069 0.285186
## combined_event1Halloween -1044.80 406.61 -2.570 0.010260 *
## combined_event1IndependenceDay 835.97 406.37 2.057 0.039811 *
## combined_event1LaborDay 2056.76 410.20 5.014 5.84e-07 ***
## combined_event1LentStart 194.09 372.26 0.521 0.602156
## combined_event1LentWeek2 104.81 372.29 0.282 0.778338
## combined_event1MartinLutherKingDay 832.87 409.92 2.032 0.042317 *
## combined_event1MemorialDay 962.80 410.13 2.348 0.019001 *
## combined_event1Mother's day -1446.04 410.09 -3.526 0.000432 ***
## combined_event1NBAFinalsEnd -679.04 453.74 -1.497 0.134680
## combined_event1NBAFinalsStart 119.69 405.81 0.295 0.768081
## combined_event1NewYear -1930.87 407.56 -4.738 2.33e-06 ***
## combined_event1OrthodoxChristmas 388.80 407.64 0.954 0.340309
## combined_event1OrthodoxEaster 508.09 638.01 0.796 0.425919
## combined_event1Pesach End 66.14 406.73 0.163 0.870840
## combined_event1PresidentsDay 664.23 375.73 1.768 0.077250 .
## combined_event1Purim End 129.60 370.70 0.350 0.726672
## combined_event1Ramadan starts -205.09 401.96 -0.510 0.609957
## combined_event1StPatricksDay 326.85 371.31 0.880 0.378829
## combined_event1SuperBowl -187.70 376.33 -0.499 0.617994
## combined_event1Thanksgiving -440.39 409.92 -1.074 0.282819
## combined_event1ValentinesDay -42.46 372.22 -0.114 0.909183
## combined_event1VeteransDay -149.25 407.59 -0.366 0.714266
## combined_event2Cinco De Mayo -929.76 1098.05 -0.847 0.397250
## combined_event2Easter -2808.13 1094.98 -2.565 0.010409 *
## combined_event2Father's day -263.83 1000.11 -0.264 0.791960
## combined_event2OrthodoxEaster 132.25 999.22 0.132 0.894720
## snap_TX1 1061.90 44.56 23.833 < 2e-16 ***
## TX_Price 8971.97 214.64 41.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 892.6 on 1859 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.9921
## F-statistic: 4477 on 54 and 1859 DF, p-value: < 2.2e-16
The coefficients for months indicate strong seasonal effects, with all months showing significantly negative coefficients, suggesting a generalized decrease from a base level, potentially overstating seasonal lows. Different weekdays also show distinct sales trends; notably, combined_dayTuesday has a substantial negative coefficient (-3366.79), indicating lower sales on these days compared to the base day (likely Sunday). The combined_event1Christmas variable shows a large negative coefficient (-9036.20), suggesting a significant drop in sales during this period, likely due to holiday-related store closures or shifts in consumer behavior.
Our actual versus predicted model for Texas’ 28-day forecasting period showed an even closer similarity than California, with the biggest discrepancies being an underestimate on May 10th and 15th. Again, the trend with the predicted OLS model was underestimating the actual trend. Texas having comparatively the more accurate predictor could be due to the fact that it has the most stores as a state, making them more accessible.
##
## Call:
## lm(formula = WI ~ combined_month + combined_day + combined_event1 +
## combined_event2 + snap_WI + WI_Price - 1, data = combined5Models_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4676.5 -953.8 -80.3 900.3 6362.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## combined_month1 -94510.56 2431.83 -38.864 < 2e-16 ***
## combined_month2 -93872.22 2422.99 -38.742 < 2e-16 ***
## combined_month3 -94884.58 2435.46 -38.960 < 2e-16 ***
## combined_month4 -95114.58 2435.01 -39.061 < 2e-16 ***
## combined_month5 -95376.40 2424.10 -39.345 < 2e-16 ***
## combined_month6 -94722.84 2423.26 -39.089 < 2e-16 ***
## combined_month7 -95059.20 2433.61 -39.061 < 2e-16 ***
## combined_month8 -95152.59 2436.50 -39.053 < 2e-16 ***
## combined_month9 -95227.95 2438.45 -39.053 < 2e-16 ***
## combined_month10 -95204.86 2438.58 -39.041 < 2e-16 ***
## combined_month11 -94964.93 2434.19 -39.013 < 2e-16 ***
## combined_month12 -94744.94 2434.89 -38.911 < 2e-16 ***
## combined_dayMonday -2164.16 126.60 -17.094 < 2e-16 ***
## combined_dayTuesday -2379.22 123.59 -19.250 < 2e-16 ***
## combined_dayWednesday -2335.49 124.77 -18.719 < 2e-16 ***
## combined_dayThursday -2199.14 123.99 -17.737 < 2e-16 ***
## combined_dayFriday -830.80 123.41 -6.732 2.22e-11 ***
## combined_daySaturday 928.26 123.49 7.517 8.68e-14 ***
## combined_event1Chanukah End 352.10 639.01 0.551 0.581696
## combined_event1Christmas -8879.30 638.93 -13.897 < 2e-16 ***
## combined_event1Cinco De Mayo -930.46 713.71 -1.304 0.192499
## combined_event1ColumbusDay 40.62 644.30 0.063 0.949741
## combined_event1Easter -2257.52 710.92 -3.175 0.001520 **
## combined_event1Eid al-Fitr -326.91 633.24 -0.516 0.605743
## combined_event1EidAlAdha 661.21 632.69 1.045 0.296123
## combined_event1Father's day -981.93 717.04 -1.369 0.171029
## combined_event1Halloween -23.67 639.08 -0.037 0.970456
## combined_event1IndependenceDay -395.64 638.71 -0.619 0.535706
## combined_event1LaborDay 1928.22 644.07 2.994 0.002792 **
## combined_event1LentStart -71.43 584.99 -0.122 0.902825
## combined_event1LentWeek2 -266.01 585.15 -0.455 0.649452
## combined_event1MartinLutherKingDay -189.65 644.46 -0.294 0.768576
## combined_event1MemorialDay 325.27 644.61 0.505 0.613900
## combined_event1Mother's day -2168.61 644.57 -3.364 0.000783 ***
## combined_event1NBAFinalsEnd -214.04 713.13 -0.300 0.764105
## combined_event1NBAFinalsStart 266.36 637.80 0.418 0.676277
## combined_event1NewYear -2368.92 639.04 -3.707 0.000216 ***
## combined_event1OrthodoxChristmas -381.57 639.17 -0.597 0.550590
## combined_event1OrthodoxEaster -1433.04 1002.94 -1.429 0.153219
## combined_event1Pesach End -676.80 639.43 -1.058 0.289993
## combined_event1PresidentsDay -351.85 590.51 -0.596 0.551354
## combined_event1Purim End -419.56 582.50 -0.720 0.471446
## combined_event1Ramadan starts 103.26 631.82 0.163 0.870189
## combined_event1StPatricksDay -103.00 583.56 -0.176 0.859923
## combined_event1SuperBowl -897.01 591.06 -1.518 0.129279
## combined_event1Thanksgiving -2816.61 644.27 -4.372 1.30e-05 ***
## combined_event1ValentinesDay -1078.42 586.38 -1.839 0.066059 .
## combined_event1VeteransDay -470.18 640.59 -0.734 0.463053
## combined_event2Cinco De Mayo 2707.43 1726.03 1.569 0.116915
## combined_event2Easter -2300.09 1720.73 -1.337 0.181486
## combined_event2Father's day -266.21 1571.76 -0.169 0.865524
## combined_event2OrthodoxEaster 15.83 1570.78 0.010 0.991959
## snap_WI1 1887.29 69.74 27.062 < 2e-16 ***
## WI_Price 23747.62 548.68 43.281 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1403 on 1859 degrees of freedom
## Multiple R-squared: 0.9801, Adjusted R-squared: 0.9796
## F-statistic: 1699 on 54 and 1859 DF, p-value: < 2.2e-16
Weekly sales trends are also evident, with days like Tuesday and Wednesday having notable negative coefficients (-2379.22 and -2335.49 respectively), suggesting lower sales on these weekdays compared to the base day (possibly Sunday). The combined_event1Christmas variable shows a significant negative impact (-8879.30), indicating a considerable drop in sales during this period as the store closes on that day. Additionally, special events such as combined_event1LaborDay and combined_event1Mother’s day demonstrate positive and negative impacts respectively, aligning with expected shifts in consumer purchasing patterns during these times.
Wisconsin showed the largest discrepancy between predicted and actual trends. The predicted model consistently underestimated the actual trend. This may be due to omitted explanatory variables. The largest discrepancy occurs on May 15th. As there are 99 total Walmart locations in Wisconsin, our line of thinking is that individuals purchase more products during their Walmart trips as locations are sparse and spread apart.
## [1] "CAlifornia OLS RMSE: 2522.4800876525"
## [1] "Texas OLS RMSE: 1218.71943323681"
## [1] "Wisconsin OLS RMSE: 3220.36397664398"
Our goal is to predict the number of units sold in each of our three states, based on various factors such as the presence of SNAP, holidays, weekday, month, and the price. Lasso regression helps us with this task by analyzing all these factors together and identifying which ones have the most significant impact on the number of units sold. It achieves this by penalizing less important factors, essentially shrinking their influence towards zero. This process simplifies the model while retaining the essential factors crucial for accurate predictions. The workflow involves determining the optimal penalty level through cross-validation of 10 folds and then building a final model to make predictions on new data. Lasso regression is sifting through the dataset to pinpoint the key drivers influencing Walmart store sales in the states, thereby aiding in better decision-making and resource allocation. We took our final model that was obtained through this lasso method, and calculated its RMSE to compare to the OLS and the other models we used.
Autoregressive (AR), Moving Average (MA), and Autoregressive Moving Average (ARMA) models are commonly used to forecast future values based on past data.The AR model looks at the relationship between the current value and previous values in the time series data. It consider how the number of units sold today is related to the number sold in the past. The MA model, on the other hand, focuses on the relationship between the current value and the residual errors from past predictions. This could involve assessing whether any unusual spikes or drops in sales can be attributed to specific factors, such as a temporary promotion or a holiday. Combining the strengths of both AR and MA models, the ARMA model provides a comprehensive approach by considering both the autoregressive and moving average components.In order to select the best AR, MA, and ARMA models, we decided to just brute force it, and select the best model based on the RMSE. We know that there is things such as the ACF and the PACF graphs that one can use to be able to have an idea on what is the best model, but again we decided to just brute force it. For both AR and MA, we did the first 50 lags, as the computation time for each model was getting significantly longer and more complicated the more lags you include. We felt that 50 was an adequate number as it covers at least one month back, but also includes most of the second month before the prediction to include any trends better. For ARMA, we did the combinations of ARMA(1:30,1:4) and ARMA(1:4,1:30), so out of these 240 models, we chose the best one based on the RMSE.
Analyzing the performance of our five prediction models OLS, AR, MA, ARMA, and Lasso by plotting them against the actual values reveals notable insights. Specifically focusing on California, we observe that both AR and ARMA exhibit a consistent and close adherence to the observed trend, effectively predicting peaks in the data. However, AR occasionally lags behind, displaying a tendency to systematically overestimate or underestimate these peaks. Moving on, MA emerges as the next best-performing model in terms of prediction accuracy, followed by Lasso, and lastly, the straightforward OLS approach.
In our analysis for California, the ARMA model achieved the lowest RMSE of 1193.415, representing a significant reduction compared to our baseline OLS model. The ARMA model reduced the RMSE by approximately 52.7% compared to the OLS baseline. Following closely was the AR model which yielded an RMSE of 1205.961 also indicating a substantial improvement, reducing RMSE by 52.2% compared to the OLS model. Conversely, the MA model exhibited a higher RMSE of 2422.327, indicating a relatively poorer predictive performance compared to AR and ARMA. It still, however, reduced the RMSE by about 4%. The Lasso had an RMSE of 2062.748, translating to a reduction of approximately 18.2% compared to OLS. Lastly, the OLS model, serving as our baseline, exhibited the highest RMSE of 2522.48.
The analysis of Texas illustrated a slightly different outcome. Specifically, in Texas, our naive OLS approach surprisingly outperformed the other models in capturing trends and peaks when predicting the units. This was closely trailed by the AR, ARMA, and Lasso models, respectively, with the MA model exhibiting the most significant deviation in accuracy compared to the other models.
In the case of Texas, our baseline OLS model boasted the lowest RMSE of 1218.719. Contrary to expectation, both the AR and ARMA models did not achieve a reduction in RMSE compared to the OLS baseline; instead, they yielded similar levels of performance, with RMSEs of 1223.122 and 1248.528, respectively. The Lasso model displayed a slightly less accurate performance, resulting in an RMSE of 1411.404, which represented an increase of approximately 13.04% compared to our OLS baseline. Notably, the MA model exhibited the poorest performance, demonstrating a substantially higher RMSE of 1984.349, translating to an increase of approximately 62.84% compared to the OLS baseline in Texas.
Examining the graph for Wisconsin revealed a pattern similar to that observed in California. Notably, both ARMA and AR models performed significantly better than the others, with AR lagging slightly behind. Following this trend, Lasso was the next best-performing model, succeeded by MA, and finally, the OLS model lagged behind the rest in terms of predictive accuracy.
For our analysis for Wisconsin, the ARMA model achieved the lowest RMSE of 1290.461, showing a sizable reduction compared to our OLS model’s RMSE value of 3220.364; this translates to a 59.9% decrease in RMSE. The AR model was a close second to ARMA with an RMSE value of 1335.579, indicating a reduction from the OLS baseline of approximately 58.5%. The subsequent models, namely the Lasso Regression, and MA models also led to reductions in RMSE compared to the OLS baseline. The Lasso model resulted in an RMSE of 1992.768, while the MA model yielded an RMSE of 2421.862. Although these values were significantly higher compared to the ARMA and AR models, they still represented notable improvements over the OLS baseline.
For California and Wisconsin which had significant RMSE from the OLS, with both being more than 2500+, we decided to go for the ARMA model as it had the lowest RMSE of the four models we did. Both states ARMA model improved significantly and so we are thus confident in using it as our Final Model. Something interesting to note is that they are not the same ARMA model. California is an ARMA(30,3) model whereas Wisconsin is an ARMA(27,2) model. This is consistent also with the fact that AR models for both states were very close to ARMA models in terms of their RMSE. Texas however is another case as the RMSE was very low to begin with, at only 1218.71943323685 for the OLS. This is significantly lower when compared to California and Wisconsin. None of our models did better than the OLS in terms of RMSE. We however still chose to go with the ARMA(3,7) model even though its RMSE was higher than the AR(31) model for Texas, keeping with the theme of using the ARMA model for our prediction.
## ar1 ar2 ar3 ar4 ar5
## 3.558667e-01 -7.271393e-01 6.487647e-01 -1.022639e-01 1.694344e-03
## ar6 ar7 ar8 ar9 ar10
## 5.679848e-02 2.318451e-01 1.846135e-02 1.037056e-01 -1.260274e-01
## ar11 ar12 ar13 ar14 ar15
## -5.986259e-02 -1.911760e-02 5.647079e-02 1.194663e-01 -2.940521e-02
## ar16 ar17 ar18 ar19 ar20
## 6.113484e-02 -1.337074e-01 -2.838727e-03 -2.966697e-02 -6.727167e-03
## ar21 ar22 ar23 ar24 ar25
## 7.973073e-02 -8.137638e-03 2.504000e-02 -7.603860e-02 -2.499667e-03
## ar26 ar27 ar28 ar29 ar30
## 4.015470e-02 7.781834e-02 2.376836e-01 6.551755e-02 1.430342e-01
## ma1 ma2 ma3 intercept
## -4.839002e-02 7.425929e-01 -4.302537e-01 1.495025e+04
## ar1 ar2 ar3 ma1 ma2
## 2.24656200 -2.24633260 0.99950387 -2.04171124 1.87940167
## ma3 ma4 ma5 ma6 ma7
## -0.61043848 -0.06548488 -0.26739280 0.34451583 -0.18934026
## intercept
## 9903.86471969
## ar1 ar2 ar3 ar4 ar5
## 1.790070e+00 -1.219855e+00 5.738935e-01 -4.875313e-01 2.318016e-01
## ar6 ar7 ar8 ar9 ar10
## 1.113154e-01 5.912248e-02 -3.028156e-01 1.839185e-01 -1.123693e-01
## ar11 ar12 ar13 ar14 ar15
## 1.225514e-01 1.002593e-03 -1.284832e-02 8.201931e-02 -2.719865e-01
## ar16 ar17 ar18 ar19 ar20
## 1.682150e-01 -1.906419e-02 5.845860e-02 -6.106738e-02 -2.041922e-02
## ar21 ar22 ar23 ar24 ar25
## 1.778512e-01 -2.183755e-01 -1.584947e-03 1.233269e-01 -5.384890e-02
## ar26 ar27 ma1 ma2 intercept
## 5.080956e-04 9.634168e-02 -1.530369e+00 7.085451e-01 9.516682e+03
By using ARMA models, we are going to be able to predict future sales the best compared to the other models we used in this analysis. It’s important to note however that the lags are not the same for each state, but they still follow a clear trend that is present throughout our analysis. California and Wisconsin have followed similar trends across the different models we used and the ARMA lags they have are similar, ARMA(30,3) for California, and ARMA(27,2) for Wisconsin. This large number of lags is significantly more compared to Texas’ ARMA(3,7) model. We saw this trend in the RMSE of the OLS, and it follows in the RMSE of these ARMA models. One explanation for this behavior could be the fact that there are significantly more stores in Texas compared to California and Wisconsin. 601 in Texas, 320 in California, and 99 in Wisconsin. (Statista, 2023) As people shop less due to lower access to Walmart stores, the shopping patterns are different in the states. California and Wisconsin customers may be shopping at a larger volume and lower frequencies, which is why the more lags, the more predictive. Since the number of stores in Texas is significantly larger, customers may be purchasing at lower volumes, and higher frequencies. This is why we are seeing fewer lags needed for the Texas model, compared to California and Wisconsin. We are also doing this analysis based on the aggregate of the entire states, but California had four stores in the dataset compared to Texas and Wisconsin which only had three stores in the dataset. We could perhaps do an average per store and then scale that number to the number of stores per state, but the best prediction would obviously be to have the data for all the stores in each state.
The next steps that we can do to further improve our model and to shoot for a lower RMSE, we can combine seasonality and the ARMA models we received. There was a clear case of seasonality in the weekday variable where we saw the peak on the weekend followed by a dropoff on Monday, with sales eventually trending towards an increase heading to the weekend. There was less obvious seasonality in the month variable, however, there does still seem to be an existence there that can be used to further enhance the model. We would do the combinations of ARMA plus weekday, ARMA plus weekday and month, and ARMA plus month. We would compare these three new models to our baseline of just and ARMA model, and pick based on the lowest RMSE of these four models for each state.
Crown, M. (2016). Weekly sales forecasts using non-seasonal arima models. http://mxcrown.com/walmart-sales-forecasting/
Jeswani, Rashmi.(2011). Predicting Walmart Sales, Exploratory Data Analysis, and Walmart Sales Dashboard: Ischool Projects.” www.rit.edu/ischoolprojects/node/104009.
LEVY, D., CHEN, H., MÜLLER, G., DUTTA, S. and BERGEN, M. (2010), Holiday Price Rigidity and Cost of Price Adjustment. Economica, 77: 172-198. https://doi.org/10.1111/j.1468-0335.2008.00738.x
“Number of Walmart Stores in the United States as of 2023, by State.” Statista, Statista Inc., 2023, https://www.statista.com/statistics/1167169/walmart-number-of-stores-by-state-us/