HW2_5350_SparkMLlib_SP24
School
University of North Texas *
Course
MISC
Subject
Information Systems
Date
Apr 3, 2024
Pages
2
DSCI 5350: HW2 Handout
Spark MLlib with Databricks
By Dr. Scott Hamilton
Based on material from Learning Spark

This handout introduces you to HW 2. You are expected to work on this HW, prepare a report, and turn it in on Canvas by the due date.

The purpose of this exercise is to have you go through the basic steps to load structured data into a Databricks cloud notebook, clean the data, and run a machine learning linear regression model using Apache Spark with the PySpark API. The main message of this exercise is that you can produce a machine learning application in a cloud computing environment using the processing and storage available on the cloud platform.

Download Airbnb Data for a Chosen City

Direct your browser to the Inside Airbnb data repository website at http://insideairbnb.com/get-the-data.html. Scroll down to ‘Data Downloads’ and select a city that you would like to investigate. For HW2 you are going to download the ‘listings.csv.gz’ file and save it to your computer.

Refer to the Databricks notebook and code published online and available at the link (https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1671077683580240/4014362052352744/5635557025976360/latest.html).
HW2 ASSIGNMENT

Prepare a typed report (your name and “HW2” in the header, please!). Include: (1) screenshots S1-S6 listed below and (2) answers to questions Q1-Q2 listed below.

Screenshots/Exhibits:
S1. A plot of variables using the display() command in Databricks.
S2. An output statement about the split of the dataset into training and test data.
S3. A univariate regression model with one feature changed from the provided example.
S4. A linear regression line printout showing your variable selection for the univariate model.
S5. A printout of RMSE and R^2 values for your univariate model.
S6. A printout of RMSE and R^2 values for the full model.
S7 (Bonus). A printout of RMSE and R^2 values for the model tested on data from another city.
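To make the two indicators requested in S5-S7 concrete, the plain-Python sketch below computes RMSE and R^2 by hand for two sets of predictions. All numbers are made up for illustration and do not come from any Airbnb dataset, and the helper names `rmse` and `r_squared` are hypothetical. In the notebook itself these values would normally come from Spark's evaluator rather than hand-written functions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical prediction error, in the label's units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of the label's variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up nightly prices and two hypothetical sets of model predictions.
actual_prices = [100.0, 150.0, 200.0, 250.0]
closer_preds = [110.0, 140.0, 210.0, 240.0]   # every prediction is off by 10
looser_preds = [150.0, 150.0, 200.0, 200.0]   # two predictions are off by 50

for name, preds in [("closer", closer_preds), ("looser", looser_preds)]:
    print(name, "RMSE =", round(rmse(actual_prices, preds), 3),
          "R^2 =", round(r_squared(actual_prices, preds), 3))
```

A lower RMSE means the predictions sit closer to the actual prices on average, while an R^2 nearer to 1 means the model explains more of the variation in price.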
Questions:
Q1. Were your predictions improved over the univariate model by including more variables in your linear regression? In your own words, describe how the indicators (RMSE and R^2) showed the difference between the two models.
Q2. Through Spark and Databricks you were able to create a basic machine learning prediction model. This is a first step toward data mining with big data analytics techniques. Use your predictions and regression equation to propose a business application that interprets the model you just created. This should be 1-2 paragraphs max, in your own words, showing your understanding of the model and how it might be used to make business decisions.

A note on copyright

This handout may contain references to trademarks, registered trademarks, patented solutions and proprietary products. VirtualBox is a trademark of Oracle. Quickstart is a trademark of Cloudera. Avro, Flume, HBase, Hive, Oozie, Pig, Solr, Spark, and Sqoop are trademarks of The Apache Software Foundation. Databricks Community edition is a trademark of Databricks Platform Services. These companies are neither authors nor publishers of this handout and are not responsible for its content. For terms of use, warranty information, and liability information, please refer to the user agreement that is applicable to you.