Lab: Bayesian Classifiers
1. Give a brief overview of the selected dataset and provide a link to the source. Specify the meaning of records. (Similar to the assignment for week 1).
We collected the dataset from data.world. The dataset is a PUBG game collection containing 5,000 matches played by random players.
Link: https://data.world/darrylhofer/pubgdata/workspace/file?filename=newNewPubg+-+Sheet1.csv
Meaning of records: In our dataset, each row represents a separate game or match played by a player.
2. Specify a data analytics question for your lab.
We want to predict whether a player will be in the top 10 percent of the game given, for that match, the player's match kill rank, the number of energy drinks (boosts) used, the number of assists given to teammates, the number of heals used, and the number of weapons acquired.
3. Explain your variables (need more than 4 variables in total).
Response variable: The response variable is Winplaceperc_new (win place percentage), which we derived from winplaceperc, an existing attribute in our dataset. The pre-processing techniques used to derive this new attribute are described in the following sections.
Type of variable: Categorical.
Explaining the response variable using an appropriate exploratory data analysis technique:
Fig. 1. Frequency of Winplaceperc_new
Figure 1 shows how many players are in the top 10 percent across the entire dataset. Here, 0 represents players below the 90th percentile and 1 represents the top 10 percent of players in a game. We can see that fewer than 1,000 records in our entire dataset fall in the top 10 percent.
Explanatory Variables:
Our explanatory variables are:
Match kill Rank
It shows the rank of a player in a specific match based on the number of kills he/she achieved. This attribute is a ratio type, since a player with 0 kills is ranked at the lowest possible rank (determined by the number of players in the match) and that position has a value of zero. The values in this column lie between 1 and 98 in our dataset, although the possible range is 0 to 100. The mean of the match kill rank column is 37.87. When two or more players have the same number of kills, the rank is determined by other data influencing the kill: "Distance of kill", "Headshots" (headshots are considered difficult in the game, so they are given priority), "Time of kill" (the player who achieved the kill before the other player has the advantage), and "Damage taken for the kill" (the amount of damage dealt to the enemy during the shot).
Fig. 2. Frequency of Matchkillrank
The histogram in Fig. 2 shows the frequency of match kill rank. From the graph, we can say that our dataset contains more records of players with fewer kills; in other words, players with higher ranks are less common in our dataset. There are around 95 players with a match kill rank of 16 in our dataset of 5,000 records.
Boosts
It refers to the consumption/usage of boost items by players during matches. Boost items typically include items like energy drinks, painkillers, etc. This is a ratio datatype. The values in this column lie between 0 and 13, with an average of 1.43, which means a lot of the data tends towards the lower values, mainly 0. If a player stays in the game for a longer time, there is a high chance that he is killing players or trying to win the game, and in this process he may lose energy in the game. That is where boosts come in: to regain energy, he will take some of the boosts available in the game. In other words, the more boosts a player uses, the higher the chances of a win. This is the reason we consider boosts as one of the attributes in this analysis.
Fig. 3. Frequency of Boosts
The bar chart in Figure 3 is right-skewed, which means most players use few boosts. This is consistent with the frequency graph of Winplaceperc_new shown earlier, where category 1 is small: players who use many boosts are relatively rare, which in turn matches the small number of players in the top 10% of our dataset. These are just our assumptions, and we can't conclude anything until we complete our analysis.
Assists
It refers to the assistance provided to a teammate: when a player deals damage to an enemy player but the final strike is dealt by a teammate, resulting in a kill for the teammate, it is counted as an assist.
This is a ratio datatype. Since assists require helping a teammate, there are no assists in a solo match type. The values in this column lie between 0 and 7, with an average of 0.33. If a player or the team participates in many fights and the team's kills are high, the probability of the player getting more kills also increases. A higher number of assists indicates that the team is actively participating in fights, and the probability of winning the game also increases.
Fig. 4 – Frequency of assists
Figure 4 is right-skewed, which means most players have few assists; in other words, only a small share of players in the entire dataset recorded assists. The players shown with 0 assists are negligible, and there are around 900 players with 1 assist, which is the largest group in the figure.
Heals
These are the number of heals used by a player during a match, which involves consuming items like First Aid kits, Bandages, and Med kits.
This is a ratio datatype. The maximum and minimum values are 44 and 0, respectively, and the average is 1.803.
Fig. 5 – Frequency of Heals
Figure 5 shows the frequency of heals. The graph is right-skewed, so most players use a small number of heals. From the graph, the tallest bar shown corresponds to players with 5 heals.
Weapons Acquired
It is the number of weapons a player has acquired during the match. A higher number of weapon changes indicates the player's knowledge of the weapons and of switching weapons based on the stage of the game. For example, near the end of a match the play area is very small, so a player does not need a sniper rifle; that is not the optimal use of a sniper in a small area, as it is better suited to long-range fights.
There are various slots available for a player to equip at once: 2 main weapons, 1 pocket weapon, and 1 melee weapon, for a total of 4 slots per player.
This is a ratio datatype, with a minimum value of 0 and a maximum value of 42. The average value is 4.143.
Fig. 6 – Frequency of weapons acquired.
At first glance, Figure 6 looks like a normal curve. There are more small bars on its right, but the players are roughly balanced on both sides, and in our view those small counts will not affect the analysis. Around 3,500 players acquired 4 weapons, which is the category with the greatest number of players.
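For reference, here is a minimal R sketch of how frequency charts like Figures 2 to 6 could be produced. This is our own illustration: it assumes the data frame pubg and column names used in the appendix, and the original figures may have been generated differently.
# one bar per observed value of the attribute, bar height = number of records
barplot(table(pubg$Matchkillrank), main = "Frequency of Matchkillrank",
        xlab = "Match kill rank", ylab = "Frequency")
barplot(table(pubg$Boosts), main = "Frequency of Boosts",
        xlab = "Boosts", ylab = "Frequency")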
4. Explore the relationship between your variables, discussing if they will be useful in answering your question using the selected method.
Before visualizing our data, we expect an S-shaped curve between each of our predictors and the response variable.
The following are the graphs we obtained after plotting each predictor against the response variable.
Fig. 7. Relation between Matchkillrank and Winplaceperc_new
In Fig. 7, the data points at 0 on the Y-axis are spread over a wide range, but we cannot conclude anything from the graph alone because the density of points may be higher on the right side of the graph than on the left; in that case we may still get a good S-curve. We cannot hide the fact that the graph is not exactly what we expected, but it can be used to some extent. If we look at the points where Winplaceperc_new is 1, we can see that players with better kill ranks tend to be in the top 10%, whereas for Winplaceperc_new = 0 the data points are scattered across the entire axis. So the values at 1 on the Y-axis behave as we expected, but the values at 0 are scattered.
Fig. 8. Scatter plot between Boosts and Winplaceperc_new
In the above graph, we expected Winplaceperc_new to be 1 for a higher number of boosts and 0 for a lower number of boosts. Density matters in this plot too: the data points are scattered across almost all possible values on the X-axis, but we may still get a good S-shaped curve if the density lies on the desired side of the boosts range.
Fig. 9 – Scatter plot between assists and winplaceperc_new
Figure 9 shows the distribution of assists on the X-axis. Density matters here too: we expect more density on the left (few assists) for Winplaceperc_new = 0 and more density on the right for Winplaceperc_new = 1. We can't conclude anything until we analyze further.
Fig. 10 – Scatterplot between heals and winplaceperc_new
Figure 10 shows the scatterplot between heals and Winplaceperc_new. We expected an S-curve, thinking that players with more heals would have a winning place of 1 and players with fewer heals a winning place of 0. The graph did not look as we expected, but we cannot conclude anything until we complete the analysis.
Fig. 11 – Scatterplot between weaponsacquired and winplaceperc_new
Figure 11 shows the distribution of weapons acquired on the X-axis with respect to Winplaceperc_new. It appears to follow an S-curve. We thought players with more weapons would have a higher chance of Winplaceperc_new being 1; the plot did not look quite as we expected and in fact turned out opposite to our expectation, but it is still a good variable to use, and we will need to see its performance during the analysis.
So, after looking at our confusion matrix, we will be able to say whether the considered attributes are good for our analysis or not. After visualizing Figures 7, 8, 9, 10, and 11, we are still hoping to get a good model.
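A minimal R sketch of how scatter plots like those in Figures 7 to 11 could be drawn, assuming Winplaceperc_new is still coded as 0/1 at this stage (our own illustration; the original figures may have been produced differently):
# each point is one match; the response only takes the values 0 and 1
plot(pubg$Boosts, pubg$Winplaceperc_new,
     xlab = "Boosts", ylab = "Winplaceperc_new",
     main = "Boosts vs Winplaceperc_new")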
5. Specify any data pre-processing needed for your analysis. Make sure to use data pre-processing vocabulary.
R Studio was not recognizing the data types properly. Variables such as walk distance and ride distance were read as strings in R because of the commas used as thousands separators in those columns (12,340 instead of 12340). So we went back to Excel, changed the cell type from General to Number, and loaded the sheet into R again. This time R recognized these columns as numerical values.
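An alternative that avoids the round trip through Excel would be to strip the commas directly in R. This is only a sketch, and the column names Walkdistance and Ridedistance are our assumptions:
# remove thousands separators, then convert the strings to numeric
pubg$Walkdistance <- as.numeric(gsub(",", "", pubg$Walkdistance))
pubg$Ridedistance <- as.numeric(gsub(",", "", pubg$Ridedistance))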
The output column (Winplaceperc_new) is derived from the existing column Winplaceperc. For our analysis, we need the response variable to be categorical (1 and 0). The existing column Winplaceperc has values between 0.1 and 0.99, and our aim is to identify whether a player is in the top 10% of the game.
Since the existing values are percentages, we converted the decimal values greater than or equal to 0.90 to 1 and the rest of the values to 0, storing the result in a new column named Winplaceperc_new [binarization].
Since the predictors have different orders of magnitude, they are standardized.
We changed Winplaceperc_new to yes/no (instead of 1/0) to avoid issues with the R
functions we selected.
There are no missing values or outliers in the dataset. So, there are no data pre-
processing techniques used other than the above-mentioned conversions.
As usual, before performing our analysis in R, we converted the entire file from .xlsx to .csv (Comma-Separated Values).
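A minimal R sketch of the conversions described above, assuming the original column is named Winplaceperc (the exact code we used may differ):
# binarization: top 10% (Winplaceperc >= 0.90) becomes 1, everything else becomes 0
pubg$Winplaceperc_new <- ifelse(pubg$Winplaceperc >= 0.90, 1, 0)
# recode 1/0 as yes/no and store as a factor for the naive Bayes functions
pubg$Winplaceperc_new <- factor(ifelse(pubg$Winplaceperc_new == 1, "yes", "no"))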
6. Specify the learning algorithm selected to answer your question.
We used the Naïve Bayes algorithm to predict whether a player is in the top 10% of the game.
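For reference, the classification rule naïve Bayes applies here is:
P(class | Matchkillrank, Boosts, Assists, Heals, Weaponsacquired) ∝ P(class) × P(Matchkillrank | class) × P(Boosts | class) × P(Assists | class) × P(Heals | class) × P(Weaponsacquired | class),
where class is either "top 10%" or "not top 10%", and the predicted class is the one with the larger product. This is the same rule we apply by hand in question 11.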
7. Discuss your model selection approach and report your model.
There are no parameters to select in a naïve bayes model.
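One note on the laplace = 1 setting we use later: to the best of our understanding of the e1071 package (worth verifying against its documentation), Laplace smoothing estimates each conditional probability for a categorical attribute as
P(value | class) = (number of class records with that value + 1) / (number of class records + number of distinct values of the attribute),
which keeps a probability from becoming exactly zero when a value never appears within a class.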
Reporting the model
Matchkillrank        1            2            3            4            5            6            7            8            9
Winplaceperc = 0     0.005407232  0.008110848  0.007096992  0.010476512  0.009124704  0.011828320  0.013180128  0.012166272  0.014869888
Winplaceperc = 1     0.068661972  0.054577465  0.044014085  0.045774648  0.033450704  0.044014085  0.029929577  0.026408451  0.007042254
Table 1 – Partial conditionally independent probabilities table for attribute "Matchkillrank"
Boosts               0             1             2             3             4             5             6             7             8
Winplaceperc = 0     0.5079972184  0.1995827538  0.1387343533  0.0709318498  0.0403337969  0.0212100139  0.0118219750  0.0048678720  0.0017385257
Winplaceperc = 1     0.0701030928  0.0948453608  0.1422680412  0.1649484536  0.1855670103  0.1278350515  0.0948453608  0.0494845361  0.0309278351
Table 2 - Partial conditionally independent probabilities table for attribute "Boosts"
Assists              0             1             2             3             4             5             7
Winplaceperc = 0     0.7870338097  0.1704426629  0.0324154758  0.0073196236  0.0013942140  0.0010456605  0.0003485535
Winplaceperc = 1     0.5209205021  0.2615062762  0.1380753138  0.0460251046  0.0230125523  0.0062761506  0.0041841004
Table 3 - Partial conditionally independent probabilities table for attribute "Assists"
Heals                0             1             2             3             4             5             6             7             8
Winplaceperc = 0     0.5325259516  0.1816608997  0.0868512111  0.0491349481  0.0287197232  0.0328719723  0.0235294118  0.0190311419  0.0086505190
Winplaceperc = 1     0.1322645291  0.1863727455  0.1563126253  0.1102204409  0.0921843687  0.0741482966  0.0380761523  0.0340681363  0.0220440882
Table 4 - Partial conditionally independent probabilities table for attribute "Heals"
Weaponsacquired      0             1             2             3             4             5             6             7             8
Winplaceperc = 0     0.0086625087  0.0963270963  0.1735966736  0.2051282051  0.1756756757  0.1309771310  0.0918225918  0.0450450450  0.0370755371
Winplaceperc = 1     0.0020202020  0.0060606061  0.0282828283  0.1010101010  0.1595959596  0.2141414141  0.1898989899  0.1090909091  0.0707070707
Table 5 - Partial conditionally independent probabilities table for attribute "Weapons Acquired"
Winplaceperc = 0     Winplaceperc = 1
2862                 471
Table 6 – Class counts used to compute the prior probabilities
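A short R sketch (our own illustration) showing how the prior probabilities follow from the class counts in Table 6:
# class counts from Table 6: 2862 records with Winplaceperc = 0, 471 with Winplaceperc = 1
prior_not_top10 <- 2862 / (2862 + 471)   # approximately 0.859
prior_top10     <- 471  / (2862 + 471)   # approximately 0.141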
8. Briefly describe the software package and functions used to implement the classification model (detailed code should be included in the appendix).
Packages used:
caTools: caTools is an R package that provides several utility functions, including functions for data splitting. We used it to divide our dataset into training and testing datasets.
e1071: We used this library for naïve Bayes modelling.
Functions and parameters used:
set.seed(): We used set.seed() to fix the random number generation so that the split is reproducible.
sample.split(): We used this function to split the data into training and testing datasets. We passed two parameters to it: the first is the response variable column of the dataset, and the second (SplitRatio) is the proportion of records used for training. We split our data in a 2/3 ratio.
as.factor(): We passed one parameter to this function. We used it to convert that vector or variable into a factor (categorical data, which is what the naïve Bayes model expects).
naiveBayes(): We used this function to build the naïve Bayes model from the parameters passed. Three parameters are passed to this function: the first is the formula, with the response variable separated from the explanatory variables by a ~ and the explanatory variables joined with + signs; the second is the dataset in which we are considering these variables; and the third turns on Laplace smoothing, i.e. laplace = 1.
predict(): We used this function to predict on the test dataset after training on the training dataset. We passed two parameters to it: the first is the output of the naiveBayes() function, and the second is the test set produced by the split.
table(): We used this function to create the confusion matrix. We used two parameters for this function: the first is the actual response column from the test dataset, and the second is the output of the predict() function.
9. Evaluate the quality of the model.
             Predicted 0    Predicted 1
Actual 0     1275           156
Actual 1     80             156
Table 7 – Confusion matrix for the naïve Bayes model
Accuracy    0.86
TPR         0.66
FPR         0.11
Table 8 – Performance metrics derived from the confusion matrix
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= (1275 + 156) / (1275 + 156 + 80 + 156)
= 1431 / 1667
= 0.86
True Positive Rate = (TP) / (TP + FN)
= 156 / (156 + 80)
= 0.66
False Positive Rate = FP / (FP + TN)
= 156 / (156 + 1275)
= 0.11
We achieved a good accuracy of 86% in predicting whether a player will be in the top 10% or not. This means that 86% of the time our model correctly classifies players based on our explanatory variables.
Our model has a true positive rate of 66%: when a player actually is in the top 10% of the game, the model correctly identifies that positive instance 66% of the time.
Our model has a false positive rate of 11%: when predicting whether a player is in the top 10% of the game, our model produces false positives about 11% of the time.
10. Explain if you would recommend the model to answer your question and support decisions.
Except for predicting the positive instances, our model performed noticeably well on all the other metrics, and even the TPR is not bad in our case. It is true that our model has flaws in predicting whether a player belongs to the top 10% of the game, but it does its job in the prediction. We cannot recommend this model to accurately predict whether a player belongs to the top 10% or not, but it is useful for players who want to improve their gameplay. For instance, if they play multiple games and check their results against this model before playing the actual tournament, they will get an idea of whether the number of kills or weapons acquired is sufficient to get into the top 10% of the game.
11. Regardless of the assessed quality, give one example of how a person without R or python would use the model to answer a specific prediction question.
Question: Will a player fall in the top 10 percent if he has a match kill rank of 8, 4 boosts, 5 assists, 3 heals, and 4 acquired weapons?
Given conditional probabilities for each attribute value under "Top 10%":
P(Kill Rank = 8 | Top 10) = 0.045774648
P(Boosts = 4 | Top 10) = 0.1855670103
P(Assists = 5 | Top 10) = 0.0062761506
P(Heals = 3 | Top 10) = 0.0921843687
P(Weapons = 4 | Top 10) = 0.1595959596
Assuming conditional independence (the naïve Bayes assumption):
P(Attributes | Top 10) ∝ P(Kill Rank = 8 | Top 10) × P(Boosts = 4 | Top 10) × P(Assists = 5 | Top 10) × P(Heals = 3 | Top 10) × P(Weapons = 4 | Top 10)
= 0.045774648 × 0.1855670103 × 0.0062761506 × 0.0921843687 × 0.1595959596
Next, the corresponding probability for not being in the top 10%:
P(Attributes | Not Top 10) ∝ P(Kill Rank = 8 | Not Top 10) × P(Boosts = 4 | Not Top 10) × P(Assists = 5 | Not Top 10) × P(Heals = 3 | Not Top 10) × P(Weapons = 4 | Not Top 10)
Given conditional probabilities for each attribute value under "Not Top 10%":
P(Kill Rank = 8 | Not Top 10) = 0.010476512
P(Boosts = 4 | Not Top 10) = 0.0403337969
P(Assists = 5 | Not Top 10) = 0.0010456605
P(Heals = 3 | Not Top 10) = 0.0491349481
P(Weapons = 4 | Not Top 10) = 0.1756756757
So:
P(Attributes | Not Top 10) ∝ 0.010476512 × 0.0403337969 × 0.0010456605 × 0.0491349481 × 0.1756756757
Now, calculate the final (unnormalized) posterior probabilities:
P(Top 10 | Attributes) ∝ P(Attributes | Top 10) × P(Top 10)
P(Not Top 10 | Attributes) ∝ P(Attributes | Not Top 10) × P(Not Top 10)
Given prior probabilities:
P(Top 10) = 471 / (2862 + 471) = 0.14131413
P(Not Top 10) = 2862 / (2862 + 471) = 0.85868587
Comparing the two unnormalized scores, the value for Top 10 is the larger one, so for this combination of attribute values the model predicts that the player will be in the top 10% of the game.
APPENDIX
Code used in R:
pubg <- read.csv("newNewPubg - sheet2.csv")   # load the pre-processed data
# treat each predictor as a categorical (factor) variable
pubg$Matchkillrank = as.factor(pubg$Matchkillrank)
pubg$Boosts = as.factor(pubg$Boosts)
pubg$Assists = as.factor(pubg$Assists)
pubg$Heals = as.factor(pubg$Heals)
pubg$Weaponsacquired = as.factor(pubg$Weaponsacquired)
library(caTools)
set.seed(100)   # fix the random seed so the split is reproducible
# 2/3 of the records go to training, 1/3 to testing
split = sample.split(pubg$Winplaceperc_new, SplitRatio = 2/3)
pubg_train = subset(pubg, split == TRUE)
pubg_test = subset(pubg, split == FALSE)
#NAIVE BAYES: Objective: predict whether a player is in top 10 percent or not based on "Matchkillrank", "Boosts", "Assists", "Heals", "Weaponsacquired".
library(e1071)
nb_winplace = naiveBayes(Winplaceperc_new ~ Matchkillrank + Boosts + Assists + Heals + Weaponsacquired, data = pubg_train, laplace = 1)
nb_winplace$tables   #look at the conditional probability tables
nb_winplace$apriori  #class counts per response value; dividing by their sum gives the prior probabilities
nb_pred = predict(nb_winplace, newdata = pubg_test)
#confusion matrix analysis
confusionTnb <- table(pubg_test$Winplaceperc_new, nb_pred)
Fig. 12
Figure 12 gives us the probability tables of the given variables.
Fig. 13
Figure 13 is the screenshot of the apriori table obtained in R studio.
Fig. 14
Figure 14 is the screenshot of the confusion matrix from R studio.