Forecasting of Big Mart Sales Using
Machine Learning Techniques
Jagadeesh Myneedi, Yaswanth Ala, Bharath Reddy Kolanu
Subject: DS850 Operations Management
Wichita State University, Barton school of business, Wichita, KS, 67220
Table of Contents
1. Literature Survey
2. Discussion of Subject and Analysis
2.1 Exploratory Data Analysis
2.2 Multivariate analysis
2.3 Model Building
2.3.1 Linear Regression
2.3.2 Random Forest Regression
3. Thoughts / Recommendations
4. References
Objective or Problem Statement
The objective of this study is to investigate how machine learning methods can be used to
predict Big Mart sales and to analyse which attributes affect those sales. The study examines
how well the proposed system predicts sales and how better predictions can increase the
company's profit.
Summary of Findings / Facts and Data to support Conclusion or Recommendation
Data on a variety of sales-influencing factors, including store location, item visibility, item
weight, item type, and promotional offers, were analyzed in the study. The information was
gathered from Big Mart locations throughout India. To create the forecasting system, we used
machine learning techniques such as random forest regression and linear regression. When we
compared the effectiveness of an ensemble learning approach with a single-model predictive
approach, we found that the former produced better results.
The study's findings indicated that item visibility, item weight, store location, and item type
were some of the most crucial factors influencing sales. We discovered that stores in urban
areas typically had higher sales and that products with greater visibility and weight tended to
sell more. Additionally, we found that sales were significantly impacted by promotional
offers.
Additionally, the data set's anomalies, outliers, and missing data were identified; these
problems were then addressed using techniques like data imputation and outlier detection.
The forecasting system's accuracy was enhanced by the use of these techniques.
Conclusion/Recommendation
In conclusion, the study proved that machine learning methods can be used successfully for
predicting sales in the retail sector. We advised Big Mart to think about utilizing the created
system to generate more precise sales forecasts, which would help the business improve
inventory control and boost profitability. The study's findings also demonstrated how crucial
it is to take into account a number of factors when predicting sales, including item visibility,
item weight, store location, and promotional offers.
1. Literature Survey
A literature survey was carried out to review previous studies on sales forecasting using
machine learning techniques. It helped us decide how to approach the present problem.
Kim et al., 2020 - proposed a deep learning model to predict smartphone industry sales.
The authors used Long Short-Term Memory (LSTM) neural networks and found that their
model outperformed conventional time series models, with a Root Mean Square Error
(RMSE) of 0.79.
Li et al., 2020 - applied machine learning algorithms to predict sales in the clothing
industry, including Random Forest, Gradient Boosting, and Extreme Gradient Boosting. The
authors discovered that their suggested model, which applies an ensemble learning strategy to
combine the three algorithms, produced the best results, with a Mean Absolute Percentage
Error (MAPE) of 8.12%.
Hou et al., 2021 - In order to forecast the sales of the healthcare industry, Hou et al. proposed
a hybrid model that combines deep learning methods like Convolutional Neural Networks
(CNNs) and LSTMs with machine learning algorithms like Support Vector Regression and
Extreme Gradient Boosting. The researchers discovered that their model outperformed both
conventional time series models and individual machine learning algorithms, achieving a
MAPE of 9.2%.
Zhang et al., 2021 - proposed the Variational Autoencoder (VAE), a deep learning model for
predicting sales in the retail sector. In terms of forecasting accuracy, the authors discovered
that their model performed noticeably better than conventional time series models and other
deep learning models.
Zhu et al., 2020 - applied machine learning algorithms to predict sales in the beverage
industry, including Random Forest, Gradient Boosting, and Artificial Neural Networks. With
a MAPE of 8.76%, the authors discovered that Random Forest outperformed the other two
algorithms.
Feng et al., 2020 - proposed the Stacked Denoising Autoencoder (SDAE), a deep learning
model for predicting sales in the e-commerce sector. In terms of forecasting accuracy, the
authors discovered that their model performed noticeably better than conventional time series
models and other deep learning models.
Chen et al., 2021 - proposed a hybrid model to forecast sales in the food industry that
combines a time series model with machine learning techniques like Random Forest,
Gradient Boosting, and Artificial Neural Networks. The researchers discovered that their
model outperformed individual machine learning algorithms, achieving a MAPE of 9.16%.
Ye et al., 2021 - proposed the Multi-Attention LSTM (MA-LSTM) as a deep
learning model to predict sales in the retail sector. In terms of forecasting accuracy, the
authors discovered that their model performed noticeably better than conventional time series
models and other deep learning models.
Jiang et al., 2020 - proposed a hybrid model to forecast sales in the automotive industry that
combines deep learning methods like CNNs and LSTMs with machine learning methods like
Random Forest and Gradient Boosting. The researchers discovered that their model
outperformed both conventional time series models and individual machine learning
algorithms, achieving a MAPE of 8.1%.
Hu et al., 2021 - proposed the Recurrent Encoder-Decoder with Attention (REDA) deep
learning model to forecast tourism industry sales. In terms of forecasting accuracy, the
authors discovered that their model performed noticeably better than conventional time series
models and other deep learning models.
2. Discussion of Subject and Analysis
Data for this study were collected from the website Kaggle. The full data set has 14204
instances and twelve attributes, of which 5 are numeric and 7 are categorical, as shown in
Table 1. For the purpose of building the models, the data were divided into a train data set of
about 8000 instances and a test data set of about 6000 instances. Figure 1 and Figure 2 show
sample rows of the test and train data sets.
Table 1
Figure 1
Figure 2
2.1 Exploratory Data Analysis
Exploratory data analysis is a method for analysing and investigating data sets and
summarising their main characteristics, frequently using data visualization. This section
applies several exploratory techniques to examine the correlation between variables, the
distribution of price data, and the descriptive statistics of the data. Python is used for the data
analysis and model building.
The item weight and outlet size attributes have 1463 and 4016 missing values, respectively.
These missing values are imputed with the mean of item weight (a numeric attribute) and the
mode of outlet size (a categorical attribute), which are common techniques for handling
missing values.
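The imputation step described above can be sketched as follows. This is a minimal illustration on a few made-up rows, not the study's actual code; the column names `Item_Weight` and `Outlet_Size` are assumptions about how the attributes are named in the data file.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; column names are assumed, not taken from the report.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Outlet_Size": ["Medium", "Medium", None, "High", "Small"],
})

# Count missing values per column before imputation.
missing_before = df.isnull().sum()

# Numeric attribute: fill gaps with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical attribute: fill gaps with the most frequent value (the mode).
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

print(missing_before)
print(df)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is worth checking the distribution of the imputed attribute afterwards.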
The table below shows the descriptive statistics of the numeric data.
Figure 3: Distribution of item outlet sales
The variable to be predicted is outlet sales, so we need to identify the attributes that affect
it. This can be analysed using the correlation matrix shown in the table below.
From the correlation matrix we can say that sales are highly correlated with item MRP
(maximum retail price) and negatively correlated with item visibility in store.
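A correlation matrix of this kind can be computed with pandas, as in the sketch below. The rows are made up for illustration and the column names (`Item_MRP`, `Item_Visibility`, `Item_Outlet_Sales`) are assumptions; the values do not reproduce the study's results.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
df = pd.DataFrame({
    "Item_MRP": [50.0, 120.0, 180.0, 240.0, 95.0, 210.0],
    "Item_Visibility": [0.12, 0.05, 0.03, 0.02, 0.09, 0.04],
    "Item_Outlet_Sales": [700.0, 1900.0, 3100.0, 4200.0, 1500.0, 3300.0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Correlation of each attribute with the target, strongest positive first.
sales_corr = (
    corr["Item_Outlet_Sales"]
    .drop("Item_Outlet_Sales")
    .sort_values(ascending=False)
)
print(sales_corr)
```

Pearson correlation only captures linear association, so a weak coefficient does not rule out a nonlinear effect on sales.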
Next, we look at the categorical variables, their distributions, and their relationship with
sales by plotting histograms and scatter plots.
From the figures below we can observe that most items are low fat, most outlets are of
medium size, the best-selling item types are fruits and vegetables and snack foods, the
lowest-selling are seafood and breakfast items, most outlets are located in Tier 3 cities, and
most outlets fall into the Supermarket Type1 category.
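The category frequencies behind such histograms can be tabulated directly, as sketched below. The rows are made up and the column names (`Item_Fat_Content`, `Outlet_Location_Type`) are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
df = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Low Fat", "Regular"],
    "Outlet_Location_Type": ["Tier 3", "Tier 1", "Tier 3", "Tier 2", "Tier 3"],
})

# Frequency of each category, most common first, for every categorical column.
counts = {col: df[col].value_counts() for col in df.columns}

for col, freq in counts.items():
    print(freq)
    print()
```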
2.2 Multivariate analysis
In this analysis we use visualization to see how each variable affects the sales of the store.
Significant observations from the plots below are as follows:
Of the three types of locations, stores in Tier 2 cities show the highest sales, followed by
Tier 3 cities, while Tier 1 cities show the lowest (Figure 11).
Daily-use items would be expected to sell more frequently than products with specific uses.
Products labeled "Low Fat" appear to have higher sales values than products labeled
"Regular" (Figure 5).
The grocery stores ("OUT010", "OUT019") and the Supermarket Type 2 outlet ("OUT018")
have the lowest sales, as would be expected, according to the bar charts. Strangely, the
majority of stores are Supermarket Type1 of size "High", yet they do not produce the best
results. The best results come from "OUT027", a "Medium" size Supermarket Type 3
(Figures 6, 7, and 8).
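The per-outlet comparison behind these bar charts amounts to a grouped average of sales. The sketch below shows one way to compute it; the sales figures are made up and the column names (`Outlet_Identifier`, `Outlet_Type`, `Item_Outlet_Sales`) are assumptions.

```python
import pandas as pd

# Made-up sales figures for illustration only; column names are assumed.
df = pd.DataFrame({
    "Outlet_Identifier": ["OUT010", "OUT010", "OUT027", "OUT027", "OUT018", "OUT018"],
    "Outlet_Type": [
        "Grocery Store", "Grocery Store",
        "Supermarket Type3", "Supermarket Type3",
        "Supermarket Type2", "Supermarket Type2",
    ],
    "Item_Outlet_Sales": [300.0, 400.0, 3500.0, 3900.0, 1900.0, 2100.0],
})

# Mean sales per outlet, highest first; this is the quantity a bar chart would plot.
mean_sales = (
    df.groupby(["Outlet_Identifier", "Outlet_Type"])["Item_Outlet_Sales"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_sales)
```

The same pattern applies to any other categorical attribute, such as location tier or fat content, by changing the grouping columns.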
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
2.3 Model Building
After the data analysis, we prepared the data for building models to forecast sales. In this
study, we applied two machine learning techniques: linear regression and random forest
regression.
2.3.1 Linear Regression
A statistical modelling method known as linear regression is used to determine the
relationship between a dependent variable and one or more independent variables. It entails
fitting a straight line to the observed data to show how the variables are related. In contrast to
logistic and nonlinear regression models, linear regression models depict the relationship
between the variables as a straight line. Using this method, you can calculate how changes in
the independent variable(s) will affect the dependent variable. For this study, the linear
regression is formulated as follows.
$$y = c + c_1 z_1 + c_2 z_2 + c_3 z_3 + c_4 z_4 + c_5 z_5 + c_6 z_6 + c_7 z_7 + c_8 z_8 + \varepsilon$$
where $y$ is the output variable (sales), $c_i$ are the regression coefficients, $z_i$ are the
independent variables, and $\varepsilon$ is the error of estimate.
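A minimal sketch of fitting a model of this form by ordinary least squares is given below. It uses synthetic data with eight predictors standing in for the study's attributes; the "true" intercept and coefficients are arbitrary assumptions chosen so the recovered values can be checked, and the leading constant c is treated as the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows and 8 predictors z_1..z_8 (stand-ins, not the study's data).
n, p = 200, 8
Z = rng.normal(size=(n, p))
true_c = 2.0                         # assumed intercept c
true_coef = np.arange(1.0, p + 1)    # assumed coefficients c_1..c_8
y = true_c + Z @ true_coef + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept is estimated with the coefficients.
X = np.column_stack([np.ones(n), Z])

# Ordinary least squares: choose b to minimise ||X b - y||^2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef = b[0], b[1:]

print("intercept:", round(float(intercept), 2))
print("coefficients:", np.round(coef, 2))
```

Each fitted coefficient estimates the change in y for a one-unit change in the corresponding z, holding the other predictors fixed.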
2.3.2 Random Forest Regression
Regression issues can be resolved using the machine learning algorithm known as Random
Forest Regression. Multiple decision trees are combined in this ensemble learning technique
to produce predictions. To train each tree, the algorithm randomly chooses a subset of
features and a subset of data points. The final prediction is then calculated by averaging the
predictions of all the trees. The formulation of random forest regression is given below.
Assume a dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is the $i$-th
data point and $y_i$ is the corresponding target value. Suppose there are $K$ decision trees,
each of which is trained on a randomly chosen subset $D_k$ of the data points and a randomly
chosen subset of the features.
Each of the $K$ decision trees is built by recursively dividing the data into smaller subsets
according to the chosen features. At each node of the tree, the algorithm selects the best
feature on which to divide the data into two groups and then generates two child nodes. This
process continues until a stopping criterion is satisfied, such as reaching a maximum tree
depth or a minimum number of samples per leaf.
$$y_{pred} = \frac{1}{K} \sum_{k=1}^{K} F_k(x)$$
where $F_k(x)$ is the predicted value of the $k$-th decision tree for the data point $x$.
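The averaging rule above can be checked against scikit-learn's `RandomForestRegressor`, as in the sketch below. The data are synthetic and the hyperparameters (25 trees, depth limit of 6, at least 2 samples per leaf) are arbitrary assumptions, not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (3 features); not the study's data.
X = rng.uniform(0, 10, size=(300, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

K = 25  # number of trees; an arbitrary choice for illustration
forest = RandomForestRegressor(
    n_estimators=K,
    max_depth=6,           # stopping criterion: maximum tree depth
    min_samples_leaf=2,    # stopping criterion: minimum samples per leaf
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X, y)

# Average the K individual tree predictions by hand ...
x_new = X[:5]
tree_preds = np.stack([tree.predict(x_new) for tree in forest.estimators_])
manual_avg = tree_preds.mean(axis=0)

# ... and compare with the forest's own prediction.
print(np.allclose(manual_avg, forest.predict(x_new)))
```

Note that scikit-learn draws the random feature subset at each split rather than once per tree, which is a common variant of the scheme described in the text.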
The models are evaluated using three statistics of fit: R-squared, RMSE (Root Mean Square
Error), and MAE (Mean Absolute Error). The statistics of fit for each model are listed in the
table below.
Model                      R-square   RMSE     MAE
Linear Regression          0.565      1129.5   850.52
Random Forest Regression   0.70       800.5    567.11
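From the above results, random forest regression gives the better fit, with an R-squared of
0.70 (about 70% of the variance in sales explained) and lower RMSE and MAE than linear
regression. This algorithm can therefore help to predict the sales of retail stores with similar
data.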
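For reference, the three statistics of fit can be computed from actual and predicted values as sketched below. The helper name `fit_statistics` and the four sample values are hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

def fit_statistics(y_true, y_pred):
    """Return R-squared, RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Hypothetical actual and predicted sales values.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
stats = fit_statistics(y_true, y_pred)
print(stats)
```

RMSE and MAE are in the same units as sales, whereas R-squared is unitless, which is why a value such as 0.70 is better read as variance explained than as a percentage accuracy.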
3. Thoughts / Recommendations
To improve the effectiveness and accuracy of the forecasting system, the current study
forecasts sales at Big Mart stores using exploratory analysis and machine learning techniques.
The ensemble approach (random forest regression) performed better than the single-model
approach (linear regression), which matters because accurate sales prediction helps the
company maximize profits.
Additional data instances and other components could be added to improve the sales forecast
further. The accuracy of the system may be increased by including more predictive attributes
in the model, since the choice of inputs is an essential component of any forecasting system.
A thorough examination of sub-model behaviour could also help improve the system as a
whole. We therefore recommend exploring a wider variety of modern methods to boost the
precision and effectiveness of the machine-learning-based sales forecasting system.
4. References
1. Kim, M., Jeong, S., & Lee, S. (2020). Deep learning model for sales forecasting of smartphone industry. Journal of Intelligence and Information Systems, 26(4), 211-218.
2. Li, H., Zhou, C., & Yang, S. (2020). A hybrid machine learning model for sales forecasting of clothing industry. Journal of Intelligent Manufacturing, 31(7), 1631-1641.
3. Hou, Y., Zhao, H., & Lv, L. (2021). A hybrid model for healthcare industry sales forecasting based on deep learning and machine learning. Journal of Intelligent & Fuzzy Systems, 40(3), 4393-4403.
4. Zhang, S., Li, J., Yang, Y., & Wang, Y. (2021). Sales forecasting for retail industry using variational autoencoder. Journal of Forecasting, 40(2), 369-378.
5. Zhu, J., Chen, X., & Fang, Z. (2020). Sales forecasting for beverage industry based on machine learning algorithms. International Journal of Engineering & Technology, 12(1), 37-44.
6. Feng, X., Li, Y., Li, X., & Li, X. (2020). Sales forecasting of e-commerce industry based on stacked denoising autoencoder. Journal of Physics: Conference Series, 1667(1), 012041.
7. Chen, X., Xu, Y., & Zhu, J. (2021). Hybrid model based on time series model and machine learning for sales forecasting of food industry. Journal of Industrial and Production Engineering, 38(3), 283-294.
8. Ye, Y., Lu, J., Huang, H., & Chen, D. (2021). Multi-attention LSTM for sales forecasting of retail industry. Neural Computing and Applications, 33(11), 5757-5768.
9. Jiang, Y., Wu, J., Zhang, Y., & Zheng, Y. (2020). Hybrid model of deep learning and machine learning for automotive industry sales forecasting. Journal of Intelligent & Fuzzy Systems, 38(4), 3973-3984.
10. Hu, Z., Wu, Q., Zeng, Y., & Wu, D. (2021). Recurrent encoder-decoder with attention for sales forecasting of tourism industry. Journal of Hospitality and Tourism Technology, 12(3), 467-478.