F23_HW6_arule_student

School

The University of Tennessee, Knoxville

Course

474

Subject

Computer Science

Date

Dec 6, 2023

Fall 2023 Homework 6 - Association Rules
Casey Lyons
Oct 31

Important Note: For this assignment, some of the R output may be awkwardly formatted. Please go to your Word document after knitting, highlight the code/output in R chunks that “look bad” by spilling over onto the next line, and decrease the font size (I suggest 6pt) so it fits on one line. Even then, some chunks are still going to look ugly, but that’s ok.

Towards a Recommendation Engine for Music
********************************************

The file musiclikesmod.csv contains information on users of last.fm (https://www.last.fm/), an online service that lets you stream music. It also makes recommendations on what songs you might like based on your listening history (and that of your friends, if you share data). By setting up the data appropriately, we can perform a “market basket analysis” to determine what combinations of artists are often enjoyed by the same listener.

To convert this into a market basket analysis problem, we will treat each listener as a “cart”. Each listener has a list of bands they like, and these bands can be thought of as “items in the cart”. Perhaps association rules exist like “If a listener likes Lady Gaga and Kelly Clarkson, then they may also like Katy Perry”.

In the data (which comes from 2011, so think of the music scene back then), there are 1821 “carts” (listeners), each of which contains one or more of 2459 unique items (bands). Let’s see what we can learn about listening habits!

Run the following code (making sure this .Rmd file and the musiclikesmod.csv file are in the same folder; double-check by doing Save As and saving this document into your 474 folder where you’ve moved the data file). This code will make a plot of the most frequently appearing bands. You’ll notice that some spelling liberties have been applied (P!nk is Pnk, Ke$ha is Keha, Beyoncé is Beyonc, etc.); basically all special characters and spaces have been removed.
library(arules)

# Read in the data as transactions: one listener (basket) per line, items separated by commas
MUSIC <- read.transactions("musiclikesmod.csv", sep = ",", format = "basket")

# Look at the 20 most frequently appearing items (bands)
itemFrequencyPlot(MUSIC, topN = 20, type = "relative", horiz = TRUE, cex.names = 0.6)
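To make the “listeners as carts” framing concrete, here is a minimal base-R sketch on five made-up listeners. The listener IDs and their likes are invented for illustration and do not come from musiclikesmod.csv, and `has_all` is a hypothetical helper, not an arules function. The sketch computes support, coverage, confidence, and lift for the example rule {LadyGaga, KellyClarkson} => {KatyPerry} directly from their definitions, and also shows the difference between “contains all of these items” and “contains at least one of these items”.

```r
# Hypothetical toy data: each list element is one listener's "cart" of bands.
carts <- list(
  u1 = c("LadyGaga", "KellyClarkson", "KatyPerry"),
  u2 = c("LadyGaga", "KellyClarkson"),
  u3 = c("LadyGaga", "KatyPerry"),
  u4 = c("Rush", "PinkFloyd"),
  u5 = c("LadyGaga", "KellyClarkson", "KatyPerry", "Rush")
)

# Hypothetical helper: TRUE for each cart that contains every item in `items`
has_all <- function(items) {
  vapply(carts, function(cart) all(items %in% cart), logical(1))
}

lhs <- c("LadyGaga", "KellyClarkson")
rhs <- "KatyPerry"

support    <- mean(has_all(c(lhs, rhs)))      # fraction of carts containing lhs and rhs
coverage   <- mean(has_all(lhs))              # fraction of carts containing lhs
confidence <- support / coverage              # share of lhs carts that also contain rhs
lift       <- confidence / mean(has_all(rhs)) # confidence relative to the baseline share of rhs

# Fraction of carts containing at least one of Rush or PinkFloyd
any_of <- mean(vapply(carts,
                      function(cart) any(c("Rush", "PinkFloyd") %in% cart),
                      logical(1)))

print(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift, any_of = any_of))
```

In this toy data the rule holds for 2 of the 3 listeners who like both left-hand-side artists, so the confidence is 2/3; since 3 of 5 listeners like KatyPerry overall, the lift is (2/3)/(3/5), a little above 1.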
Question 1:

a. What fraction of listeners like TheCure? How many total listeners liked LadyGaga? Use itemFrequency for these two questions. What fraction liked at least one of Rush or PinkFloyd? Hint: you’ll need to find the number of transactions that have one or both of these items, then divide by the number of transactions.

# Fraction of the 1821 listeners who like `TheCure`:
itemFrequency(MUSIC)["TheCure"]
## TheCure
##   0.109

# How many total listeners liked LadyGaga?
itemFrequency(MUSIC, type = "absolute")["LadyGaga"]
## LadyGaga
##      580

# What fraction liked at least one of `Rush` or `PinkFloyd`?
mean(MUSIC %in% c("Rush", "PinkFloyd"))
## [1] 0.139

Grading: 3 pts total, 1 pt each.

b. Build a ruleset RULES_1c using a minimum support of 100/1821 and a minimum level of confidence of 75%. Do not put any restrictions on the lengths of the rules. Remove the redundant and non-significant ones, then print out the top 5 rules in terms of lift to the screen.

# Your code:
RULES_1c <- apriori(MUSIC, parameter = list(support = 100 / 1821, confidence = 0.75))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##        0.75    0.1    1 none FALSE            TRUE       5  0.0549      1     10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 100
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2459 item(s), 1821 transaction(s)] done [0.02s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.01s].
## writing ... [12290 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

RULES_1c <- RULES_1c[!is.redundant(RULES_1c)]
RULES_1c <- RULES_1c[is.significant(RULES_1c, MUSIC)]
inspect(sort(RULES_1c, by = "lift", decreasing = TRUE)[1:5])
##     lhs                                                       rhs          support confidence coverage lift count
## [1] {BritneySpears, SelenaGomeztheScene}                   => {MileyCyrus} 0.0549  0.877      0.0626   6.07 100
## [2] {KatyPerry, SelenaGomeztheScene}                       => {MileyCyrus} 0.0571  0.874      0.0653   6.05 104
## [3] {AshleyTisdale, KatyPerry, Rihanna}                    => {MileyCyrus} 0.0566  0.858      0.0659   5.94 103
## [4] {AshleyTisdale, BritneySpears, KatyPerry}              => {MileyCyrus} 0.0588  0.856      0.0686   5.93 107
## [5] {BritneySpears, KatyPerry, Keha, Rihanna, TaylorSwift} => {MileyCyrus} 0.0549  0.855      0.0643   5.92 100

Grading: 2 pt.

c. You should find that one of the top rules in terms of lift is {BritneySpears,SelenaGomeztheScene} => {MileyCyrus}. How many listeners like this set of 3 artists? Write a line of R code that produces this number (you may need length and which along with %ain%).

# Your code:
length(which(MUSIC %ain% c("BritneySpears", "SelenaGomeztheScene", "MileyCyrus")))
## [1] 100

Grading: 2 pt.

d. The rule {HilaryDuff,Rihanna} => {AshleyTisdale} has a confidence of 65%. First, explain in English what the rule {HilaryDuff,Rihanna} => {AshleyTisdale} means. Then, carefully interpret the confidence.

Response: If a listener likes HilaryDuff and Rihanna, then they may also like AshleyTisdale. Confidence = 0.65: Among all listeners who like both HilaryDuff and Rihanna, 65% also like AshleyTisdale.

Grading: 2 pts.

e. The lift of the rule {HilaryDuff,Rihanna} => {AshleyTisdale} is about 7.2. Interpret this value.

# Support of the rhs:
itemFrequency(MUSIC)["AshleyTisdale"]
## AshleyTisdale
##        0.0895

Response: Overall, the probability that a listener likes Ashley Tisdale is 8.95%. However, among listeners who like Hilary Duff and Rihanna, the probability of liking Ashley Tisdale increases by a factor of 7.2 (the lift) to about 64.44% (the confidence, roughly 65%).

Grading: 2 pt.

Question 2

a. Remake the rules from MUSIC, but use the default parameters for everything (i.e., remove the argument parameter = list(supp = , conf = )). Remove the redundant and non-significant ones.
# Your code:
RULES_2 <- apriori(MUSIC, parameter = list)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 182
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2459 item(s), 1821 transaction(s)] done [0.01s].
## sorting and recoding items ... [41 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [522 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(RULES_2)
## set of 522 rules
##
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6
##  37 198 207  69  11
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00    3.00    4.00    3.65    4.00    6.00
##
## summary of quality measures:
##     support        confidence       coverage          lift          count
##  Min.   :0.101   Min.   :0.800   Min.   :0.103   Min.   :2.55   Min.   :183
##  1st Qu.:0.106   1st Qu.:0.859   1st Qu.:0.117   1st Qu.:3.25   1st Qu.:193
##  Median :0.114   Median :0.905   Median :0.126   Median :3.53   Median :207
##  Mean   :0.120   Mean   :0.896   Mean   :0.134   Mean   :3.49   Mean   :219
##  3rd Qu.:0.128   3rd Qu.:0.935   3rd Qu.:0.144   3rd Qu.:3.80   3rd Qu.:233
##  Max.   :0.226   Max.   :0.995   Max.   :0.271   Max.   :4.57   Max.   :411
##
## mining info:
##   data ntransactions support confidence                                    call
##  MUSIC          1821     0.1        0.8 apriori(data = MUSIC, parameter = list)

Grading: 2 pt

b. How many rules have levels of confidence between 0.90 and 0.95 (inclusive)?

# Your code:
length(RULES_2[quality(RULES_2)$confidence >= 0.90 & quality(RULES_2)$confidence <= 0.95])
## [1] 209

Grading: 2 pt

c. How many rules have levels of confidence of 0.85 and above and lifts between 3 and 4 (inclusive)?

# Your code:
length(RULES_2[quality(RULES_2)$confidence > 0.85 & quality(RULES_2)$lift >= 3 & quality(RULES_2)$lift <= 4])
## [1] 271

Grading: 2 pt

d. Print to the screen the rule that has the highest lift, which you’ll find is Katy Perry + Miley Cyrus -> Kesha. The output shows that this rule applies to 194 users, and that the rule has a confidence of 0.851.

# Your code:
inspect(sort(RULES_2, by = "lift", decreasing = TRUE)[1])
##     lhs                        rhs    support confidence coverage lift count
## [1] {KatyPerry, MileyCyrus} => {Keha} 0.107   0.851      0.125    4.57 194

Grading: 2 pt
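As a sanity check on how the quality measures of a rule relate to one another, the following base-R sketch recomputes support and coverage for {KatyPerry, MileyCyrus} => {Keha} from the count, confidence, and lift printed in the output. The four input numbers are copied from this document; the support of Keha on its own is not reported here, so `rhs_support` is only the value implied by confidence / lift (both of which are rounded), not a figure read from the data.

```r
# Values copied from the inspect() output for {KatyPerry, MileyCyrus} => {Keha}
n          <- 1821    # number of listeners (transactions)
count      <- 194     # listeners who like all three artists
confidence <- 0.851   # rounded, as printed
lift       <- 4.57    # rounded, as printed

support  <- count / n             # share of listeners with lhs and rhs; about 0.107
coverage <- support / confidence  # share of listeners with the lhs;     about 0.125

# Implied (not reported) share of listeners who like Keha overall,
# since lift = confidence / support(rhs)
rhs_support <- confidence / lift

print(round(c(support = support, coverage = coverage, rhs_support = rhs_support), 3))
```

The recomputed support and coverage agree with the printed 0.107 and 0.125 up to rounding, which is a quick way to confirm that a rule's row has been read correctly.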