Lab: Bayesian Classifiers

1. Give a brief overview of the selected dataset and provide a link to the source. Specify the meaning of records. (Similar to the assignment for week 1.)

We collected the dataset from data.world. The dataset is a PUBG game collection containing 5,000 matches started by random players.

Link: https://data.world/darrylhofer/pubgdata/workspace/file?filename=newNewPubg+-+Sheet1.csv

Meaning of records: In our dataset, each row represents a separate game or match played by a player.

2. Specify a data analytics question for your lab.

We want to predict whether a player will be in the top 10 percent of a game, given the player's match kill rank, the number of energy drinks (boosts) used, the number of assists given to teammates, the number of heals used in that match, and the number of weapons acquired in the match.

3. Explain your variables (need more than 4 variables in total).

Response variable: The response variable is Winplaceperc_new (win place percentage), which we derived from Winplaceperc, an existing attribute in our dataset. The pre-processing steps used to derive this new attribute are described in the following sections. Type of variable: categorical.

Explaining the response variable using an appropriate exploratory data analysis technique:

Fig. 1. Frequency of Winplaceperc_new

Figure 1 shows how many players are in the top 10 percent across the entire dataset. Here, 0 represents players below the 90th percentile and 1 represents players in the top 10% of a game. We can see that fewer than 1,000 players in our entire dataset are in the top 10%.
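For illustration, the derivation of the response variable and the Figure 1 frequency plot could be reproduced in R roughly as follows. This is only a sketch: it assumes the sheet still contains the raw Winplaceperc column, and the file name is taken from the appendix.

# sketch: derive the binary response and plot its class frequencies (cf. Fig. 1)
pubg <- read.csv("newNewPubg - sheet2.csv")

# binarization: 1 = top 10% (win place percentage >= 0.90), 0 = below 90%
pubg$Winplaceperc_new <- ifelse(pubg$Winplaceperc >= 0.90, 1, 0)

# frequency of the two classes, as summarized in Figure 1
barplot(table(pubg$Winplaceperc_new),
        main = "Frequency of Winplaceperc_new",
        xlab = "Winplaceperc_new (0 = below 90%, 1 = top 10%)",
        ylab = "Count")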
Explanatory variables: Our explanatory variables are:

Match kill rank
This is the rank of a player in a specific match based on the number of kills he or she achieved. We treat it as a ratio variable, since a player with 0 kills is placed at the lowest possible rank for the match and that position has a value of zero. In our dataset this column ranges from 1 to 98, although values from 0 to 100 are possible, with a mean of 37.87. When two or more players have the same number of kills, the rank is determined by other factors influencing the kills: distance of the kill, headshots (headshots are considered difficult in the game, so they are given priority), time of the kill (the player who achieved the kill first has the advantage), and the damage dealt for the kill (the amount of damage done to the enemy during the fight).

Fig. 2. Frequency of Matchkillrank

The histogram in Figure 2 shows the frequency of match kill rank. From the graph, we can say that the dataset contains more players with low kill ranks; in other words, players with higher ranks are rarer in our dataset. Around 95 of the 5,000 records have a match kill rank of 16.

Boosts
This refers to the consumption of boost items (such as energy drinks and painkillers) by players during matches. It is a ratio variable. In our dataset the values lie between 0 and 13 with an average of 1.43, which means most of the data is concentrated at the lower values, mainly 0. If a player stays in the game longer, there is a high chance he is killing players or trying to win the game, and in the process he loses energy. To regain energy, he takes the boosts available in the game. In other words, the more boosts a player uses, the higher the chances of a win. This is why we consider boosts as one of the attributes in this analysis.

Fig. 3. Frequency of Boosts

The bar chart in Figure 3 is right-skewed, meaning that most players use few boosts. This is consistent with the frequency graph of Winplaceperc_new above, where category 1 has few players: players who use many boosts are rare, which in turn means few players reach the top 10% in our dataset. These are only our assumptions, and we cannot conclude anything until we complete the analysis.

Assists
An assist is credited when a player deals damage to an enemy but the final strike that results in the kill is made by a teammate. It is a ratio variable. Since assists only occur when a teammate is helped, there are no assists in solo match types. In our dataset the values lie between 0 and 7 with an average of 0.33. If a player or the team participates in many fights and the team's kill count is high, the probability of the player getting more kills also increases. Higher assist values indicate that the team is actively participating in fights, which should also increase the probability of winning.
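Frequency plots like Figures 2 and 3 (and Figures 4-6 below) can be produced with base R graphics. A minimal sketch, assuming the column names used in the appendix code:

# frequency of match kill rank (cf. Fig. 2)
hist(pubg$Matchkillrank, breaks = 50,
     main = "Frequency of Matchkillrank", xlab = "Match kill rank")

# frequency of boosts and assists (cf. Figs. 3 and 4)
barplot(table(pubg$Boosts), main = "Frequency of Boosts", xlab = "Boosts used")
barplot(table(pubg$Assists), main = "Frequency of Assists", xlab = "Assists")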
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Fig. 4 – Frequency of assists

Figure 4 is right-skewed, which means most players have few assists; in other words, relatively few players in our dataset recorded assists. There is a negligible number of players with 0 assists, and around 900 players with 1 assist, which is the highest count.

Heals
This is the number of heals taken by a player during a match, which involves consuming items such as first aid kits, bandages, and med kits. It is a ratio variable. The minimum and maximum values are 0 and 44, respectively, and the average is 1.803.

Fig. 5 – Frequency of Heals

Figure 5 shows the frequency of heals. The graph is right-skewed, so there are more players with a small number of heals. From the graph, we can also see a relatively large number of players with 5 heals.

Weapons acquired
This is the number of weapons a player acquired during the match. Frequent weapon changes indicate the player's knowledge of weapons and the ability to switch weapons based on the stage of the game; for example, at the end of a match the play area is very small, so a player does not need a sniper rifle (a sniper is not optimal in a small area and is better suited to long-range fights). There are several slots a player can equip at once: 2 main weapons, 1 pocket weapon, and 1 melee weapon, for a total of 4 slots. This is a ratio variable, with a minimum value of 0, a maximum value of 42, and an average of 4.143.

Fig. 6 – Frequency of weapons acquired

At first look, Figure 6 appears to follow a normal curve. There are a few smaller bars on its right, but the players are almost equally distributed on both sides, and those small counts should not affect our analysis. Around 3,500 players acquired 4 weapons, which is the category with the greatest number of players.

4. Explore the relationship between your variables, discussing if they will be useful in answering your question using the selected method.

Before visualizing the data, we expect an S-shaped curve between each of our predictors and the response variable. The following are the graphs we obtained after plotting each predictor against the response variable.
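These predictor-versus-response plots can be generated with a short loop in base R; a minimal sketch, assuming the column names used in the appendix code and that the predictors are still numeric at this stage:

# scatter plots of each predictor against the binary response (cf. Figs. 7-11)
predictors <- c("Matchkillrank", "Boosts", "Assists", "Heals", "Weaponsacquired")
for (p in predictors) {
  plot(pubg[[p]], pubg$Winplaceperc_new,
       xlab = p, ylab = "Winplaceperc_new",
       main = paste("Scatter plot of", p, "vs Winplaceperc_new"))
}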
Fig. 7. Relation between Matchkillrank and Winplaceperc_new

In Figure 7, the data points at 0 on the Y-axis are spread over a wide range, but we cannot conclude anything from the graph alone, because the density of points may be higher on the right side of the graph than on the left; in that case we may still get a good S-curve. The graph is not exactly what we expected, but it can be considered useful to some extent. If we look at the value 1 for Winplaceperc_new, we can see that players with better kill ranks are in the top 10%, whereas for the value 0 the data points are scattered across the entire axis. So the points at 1 on the Y-axis are as expected, while the points at 0 are scattered.

Fig. 8. Scatter plot between Boosts and Winplaceperc_new

In Figure 8, we expected Winplaceperc_new to be 1 for higher numbers of boosts and 0 for lower numbers. Density matters in this plot too: the data points are scattered across almost all possible values on the X-axis, but we may still get a good S-shaped curve if the density is concentrated on the expected side of the boosts range.

Fig. 9 – Scatter plot between assists and Winplaceperc_new

Figure 9 shows the distribution of assists on the X-axis. Here, too, the density matters. We expect more density at low assist values for class 0 and more density at higher assist values for class 1, but we cannot conclude anything until we analyze further.

Fig. 10 – Scatterplot between heals and Winplaceperc_new

Figure 10 is the scatterplot between heals and Winplaceperc_new. We expected an S-curve, thinking that players with more heals would have a winning place of 1 and players with fewer heals a winning place of 0. The graph does not look quite as expected, but we cannot conclude anything until we complete the analysis.

Fig. 11 – Scatterplot between weaponsacquired and Winplaceperc_new

Figure 11 shows the distribution of weapons acquired on the X-axis with respect to Winplaceperc_new. It appears to be an S-curve. We thought players with more weapons would have a higher chance of Winplaceperc_new being 1; the pattern turned out to be the opposite of our expectation, but it is still a good variable to use in the analysis, and we need to see its performance during the analysis. After looking at our confusion matrix, we will be able to say whether the selected attributes are good for our analysis or not. After visualizing Figures 7 through 11, we are still hoping to get a good model.

5. Specify any data pre-processing needed for your analysis. Make sure to use data pre-processing vocabulary.

RStudio was not recognizing the data types properly: variables such as walk distance and ride distance were read as strings in R because of the commas in those columns (12,340 instead of 12340). So we went back to Excel, changed the cell type from General to Number, and loaded the sheet into R again; this time R recognized these columns as numerical values.

The output column (Winplaceperc_new) is derived from the existing column Winplaceperc. For our analysis, the response variable needs to be categorical (1 and 0). The existing column Winplaceperc has values between 0.1 and 0.99, and our aim is to identify whether a player is in the top 10% of the game. Since the existing values are
percentages, we converted the decimal values greater than or equal to 0.90 to 1 and the rest of the values to 0, storing the result in a new column named Winplaceperc_new (binarization).

Since the predictors have different orders of magnitude, they are standardized. We changed Winplaceperc_new to yes/no (instead of 1/0) to avoid issues with the R functions we selected. There are no missing values or outliers in the dataset, so no data pre-processing techniques were used other than the conversions mentioned above. As usual, before performing the analysis in R we converted the file from .xlsx to .csv (comma-separated values).

6. Specify the learning algorithm selected to answer your question.

We used the Naïve Bayes algorithm to predict whether a player is in the top 10% of the game.

7. Discuss your model selection approach and report your model.

There are no parameters to select in a Naïve Bayes model.

Reporting the model:

Table 1 – Partial conditional probability table for attribute "Matchkillrank"

Matchkillrank   Winplaceperc = 0   Winplaceperc = 1
1               0.005407232        0.068661972
2               0.008110848        0.054577465
3               0.007096992        0.044014085
4               0.010476512        0.045774648
5               0.009124704        0.033450704
6               0.011828320        0.044014085
7               0.013180128        0.029929577
8               0.012166272        0.026408451
9               0.014869888        0.007042254

Table 2 – Partial conditional probability table for attribute "Boosts"

Boosts          Winplaceperc = 0   Winplaceperc = 1
0               0.5079972184       0.0701030928
1               0.1995827538       0.0948453608
2               0.1387343533       0.1422680412
3               0.0709318498       0.1649484536
4               0.0403337969       0.1855670103
5               0.0212100139       0.1278350515
6               0.0118219750       0.0948453608
7               0.0048678720       0.0494845361
8               0.0017385257       0.0309278351

Table 3 – Partial conditional probability table for attribute "Assists"

Assists         Winplaceperc = 0   Winplaceperc = 1
0               0.7870338097       0.5209205021
1               0.1704426629       0.2615062762
2               0.0324154758       0.1380753138
3               0.0073196236       0.0460251046
4               0.0013942140       0.0230125523
5               0.0010456605       0.0062761506
7               0.0003485535       0.0041841004
Table 4 – Partial conditional probability table for attribute "Heals"

Heals           Winplaceperc = 0   Winplaceperc = 1
0               0.5325259516       0.1322645291
1               0.1816608997       0.1863727455
2               0.0868512111       0.1563126253
3               0.0491349481       0.1102204409
4               0.0287197232       0.0921843687
5               0.0328719723       0.0741482966
6               0.0235294118       0.0380761523
7               0.0190311419       0.0340681363
8               0.0086505190       0.0220440882

Table 5 – Partial conditional probability table for attribute "Weapons acquired"

Weaponsacquired Winplaceperc = 0   Winplaceperc = 1
0               0.0086625087       0.0020202020
1               0.0963270963       0.0060606061
2               0.1735966736       0.0282828283
3               0.2051282051       0.1010101010
4               0.1756756757       0.1595959596
5               0.1309771310       0.2141414141
6               0.0918225918       0.1898989899
7               0.0450450450       0.1090909091
8               0.0370755371       0.0707070707

Table 6 – Class counts used for the prior probabilities

Winplaceperc = 0   Winplaceperc = 1
2862               471

8. Briefly describe the software package and functions used to implement the classification model (detailed code should be included in the appendix).

Packages used:

caTools: an R package that provides several utility functions for data splitting. We used it to divide our dataset into training and testing sets.

e1071: we used this library for Naïve Bayes modelling.

Functions and parameters used:

set.seed(): sets the seed of the random number generator so that the random split is reproducible.

sample.split(): splits the data into training and testing sets. We passed two parameters: the first is the response variable column of the dataset and the second is the ratio in which to split it. We split our data with a 2/3 training ratio.

as.factor(): takes one parameter and converts the given vector or variable into a factor (categorical data, which is what the Naïve Bayes model expects).
naiveBayes(): performs the Naïve Bayes analysis on the parameters passed. We passed three parameters: the first is a formula with the response variable and the explanatory variables separated by a ~, where the explanatory variables are joined by plus signs (+); the second is the dataset in which these variables are found; and the third turns on Laplace smoothing, i.e., laplace = 1.

predict(): predicts the test dataset after the model is trained on the training dataset. We passed two parameters: the first is the output of the naiveBayes() function and the second is the held-out test set.

table(): creates the confusion matrix. We passed two parameters: the first is the actual response variable in the test dataset and the second is the output of the predict() function.

9. Evaluate the quality of the model.

Table 7 – Confusion matrix for the Naïve Bayes model

               Predicted 0   Predicted 1
Actual 0       1275          156
Actual 1       80            156

Table 8 – Confusion matrix analysis for the Naïve Bayes model

Accuracy   0.86
TPR        0.66
FPR        0.11

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1275 + 156) / (1275 + 156 + 80 + 156) = 1431 / 1667 = 0.86
True positive rate = TP / (TP + FN) = 156 / (156 + 80) = 0.66
False positive rate = FP / (FP + TN) = 156 / (156 + 1275) = 0.11

We achieved a good accuracy of 86% in predicting whether a player will be in the top 10% or not, meaning that 86% of the time our model correctly classifies players based on our explanatory variables. Our model has a true positive rate of 66%: of the players who are actually in the top 10% of the game, the model correctly identifies 66%.
Our model has a false positive rate of 11%: of the players who are actually not in the top 10%, the model incorrectly predicts 11% as being in the top 10%.

10. Explain if you would recommend the model to answer your question and support decisions?

Apart from correctly predicting the positive instances, our model performed noticeably well on all the remaining metrics, and even the TPR is not bad in our case. It is true that the model has flaws in predicting whether a player belongs to the top 10% of the game, but it does its job in the prediction. We cannot recommend this model to accurately predict whether a player belongs to the top 10%, but it is useful for players who want to improve their gameplay. For instance, if they play multiple games and observe their results with this model before playing an actual tournament, they will get an idea of whether their number of kills or weapons acquired is sufficient to get into the top 10% of a game.

11. Regardless of the assessed quality, give one example of how a person without R or Python would use the model to answer a specific prediction question.

Question: Will a player fall in the top 10 percent if he has a match kill rank of 8, 4 boosts, 5 assists, 3 heals, and 4 acquired weapons?

Given probabilities for each attribute value for "Top 10%":

P(Kill Rank = 8 | Top 10) = 0.045774648
P(Boosts = 4 | Top 10) = 0.1855670103
P(Assists = 5 | Top 10) = 0.0062761506
P(Heals = 3 | Top 10) = 0.0921843687
P(Weapons = 4 | Top 10) = 0.1595959596

Assuming independence (the Naive Bayes assumption):

P(Attributes | Top 10) = P(Kill Rank = 8 | Top 10) × P(Boosts = 4 | Top 10) × P(Assists = 5 | Top 10) × P(Heals = 3 | Top 10) × P(Weapons = 4 | Top 10)
                       = 0.045774648 × 0.1855670103 × 0.0062761506 × 0.0921843687 × 0.1595959596

Next, calculate the corresponding likelihood for not being in the top 10%:

P(Attributes | Not Top 10) = P(Kill Rank = 8 | Not Top 10) × P(Boosts = 4 | Not Top 10) × P(Assists = 5 | Not Top 10) × P(Heals = 3 | Not Top 10) × P(Weapons = 4 | Not Top 10)

Given probabilities for each attribute value for "Not Top 10%":

P(Kill Rank = 8 | Not Top 10) = 0.010476512
P(Boosts = 4 | Not Top 10) = 0.0403337969
P(Assists = 5 | Not Top 10) = 0.0010456605
P(Heals = 3 | Not Top 10) = 0.0491349481
P(Weapons = 4 | Not Top 10) = 0.1756756757
Calculate the above probability:

P(Attributes | Not Top 10) = 0.010476512 × 0.0403337969 × 0.0010456605 × 0.0491349481 × 0.1756756757

Now, calculate the final (unnormalized) posterior scores:

P(Top 10 | Attributes) ∝ P(Attributes | Top 10) × P(Top 10)
P(Not Top 10 | Attributes) ∝ P(Attributes | Not Top 10) × P(Not Top 10)

Given prior probabilities:

P(Top 10) = 471 / (2862 + 471) = 0.14131413
P(Not Top 10) = 2862 / (2862 + 471) = 0.85868587

Multiplying everything out gives a score of roughly 1.1 × 10^-7 for Top 10 and roughly 3.3 × 10^-9 for Not Top 10. Since the Top 10 score is larger, with these probabilities the model classifies the given player as being in the top 10% of the game.

APPENDIX

Code used in R:

pubg <- read.csv("newNewPubg - sheet2.csv")

# convert the predictors to factors (categorical variables) for naiveBayes()
pubg$Matchkillrank = as.factor(pubg$Matchkillrank)
pubg$Boosts = as.factor(pubg$Boosts)
pubg$Assists = as.factor(pubg$Assists)
pubg$Heals = as.factor(pubg$Heals)
pubg$Weaponsacquired = as.factor(pubg$Weaponsacquired)

# split the data into training (2/3) and testing (1/3) sets
library(caTools)
set.seed(100)
split = sample.split(pubg$Winplaceperc_new, SplitRatio = 2/3)
pubg_train = subset(pubg, split == TRUE)
pubg_test = subset(pubg, split == FALSE)

# NAIVE BAYES
# Objective: predict whether a player is in the top 10 percent or not based on
# "Matchkillrank", "Boosts", "Assists", "Heals", "Weaponsacquired".
library(e1071)
nb_winplace = naiveBayes(Winplaceperc_new ~ Matchkillrank + Boosts + Assists +
                           Heals + Weaponsacquired,
                         data = pubg_train, laplace = 1)

nb_winplace$tables   # look at the conditional probability tables
nb_winplace$apriori  # class counts in the training data (used for the priors)

# predict the test set and build the confusion matrix
nb_pred = predict(nb_winplace, newdata = pubg_test)
confusionTnb <- table(pubg_test$Winplaceperc_new, nb_pred)
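The appendix code stops after building confusionTnb. As a possible continuation (a sketch, not part of the original submission), the Table 8 metrics could be computed from that matrix as follows, assuming the actual classes are in the rows and the predictions in the columns, with the positive class (top 10%) as the second factor level:

# compute accuracy, TPR and FPR from the confusion matrix
# rows = actual classes, columns = predicted classes
accuracy <- sum(diag(confusionTnb)) / sum(confusionTnb)
TPR <- confusionTnb[2, 2] / sum(confusionTnb[2, ])  # TP / (TP + FN)
FPR <- confusionTnb[1, 2] / sum(confusionTnb[1, ])  # FP / (FP + TN)
round(c(accuracy = accuracy, TPR = TPR, FPR = FPR), 2)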
Fig. 12
Figure 12 shows the conditional probability tables of the given variables.

Fig. 13
Figure 13 is a screenshot of the apriori table obtained in RStudio.

Fig. 14
Figure 14 is a screenshot of the confusion matrix from RStudio.
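As a complement to the hand calculation in question 11, a reader who does have R could obtain the same kind of answer directly from the fitted model. The following is only a sketch (not part of the original submission): the new record is wrapped in factor() with the training levels so that predict() matches the values against the correct columns of the probability tables, and it assumes these values occur in the training data.

# predict a single hypothetical player with the fitted Naive Bayes model
new_player <- data.frame(
  Matchkillrank   = factor(8, levels = levels(pubg_train$Matchkillrank)),
  Boosts          = factor(4, levels = levels(pubg_train$Boosts)),
  Assists         = factor(5, levels = levels(pubg_train$Assists)),
  Heals           = factor(3, levels = levels(pubg_train$Heals)),
  Weaponsacquired = factor(4, levels = levels(pubg_train$Weaponsacquired))
)
predict(nb_winplace, newdata = new_player)                # predicted class
predict(nb_winplace, newdata = new_player, type = "raw")  # class probabilities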