assignment2-bigdata

November 3, 2023

[ ]: Answer Here:

     Strengths of a heatmap:
     1) A heatmap can show the relationships within categorical data, time-series data, or
     any other kind of data whose pairwise relationships can be encoded as colors.
     2) The color scale of a heatmap conveys both the strength and the direction of each
     relationship.

     Weaknesses of a heatmap:
     1) A heatmap may not give a full view of the data; its main purpose is to illustrate
     the correlations between variables, so it can hide distributional features and
     outliers.
     2) A heatmap can illustrate the strength of a relationship, but it cannot establish
     causality: a strong correlation does not imply causation.

     Strengths of a pairplot:
     1) Pairplots are very helpful when examining relationships among multiple variables.
     For every pair of variables in the dataset they offer scatter plots, with histograms
     on the diagonal, which is quite helpful for identifying patterns and distributions.
     2) Pairplots are useful for locating anomalies: outliers are often easier to spot in
     the pairwise scatter plots.

     Weaknesses of a pairplot:
     1) Pairplots are best suited to numerical data. Text or categorical columns may not
     work well with them and need further preparation.
     2) For large datasets with many variables, a pairplot becomes cluttered and difficult
     to read; it usually works best with datasets of moderate size. (A minimal sketch of
     both plot types follows the next code cell.)

[10]: import pandas as pd
      import plotly.express as px

      # Load the dataset from the CSV file
      df = pd.read_csv("supermarket_sales.csv")

      # Build a pivot table with City as the index, Product line as the columns,
      # and the sum of Unit price as the values
      pivot_table = pd.pivot_table(df, values="Unit price", index="City",
                                   columns="Product line", aggfunc="sum")

      # Calculate the total sales for each city
      pivot_table["Total Sales"] = pivot_table.sum(axis=1)

      # Create a heatmap with axis labels and hover information
      fig = px.imshow(pivot_table, color_continuous_scale="Blues",
                      labels={"x": "Product Line", "y": "City", "color": "Total Sales"})

      # Set the title
      fig.update_layout(title="Sales Heatmap by City and Product Line")

      # Show the plot
      fig.show()
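The comparison in the answer above is easier to see with a concrete example. The
following cell is a minimal sketch, not part of the original assignment: it draws a
correlation heatmap and a pairplot with seaborn, assuming supermarket_sales.csv has at
least a few numeric columns (selected automatically here rather than by name).

[ ]: # Hedged sketch: a correlation heatmap and a pairplot of the same data.
     # Assumes supermarket_sales.csv contains numeric columns.
     import pandas as pd
     import seaborn as sns
     import matplotlib.pyplot as plt

     df = pd.read_csv("supermarket_sales.csv")
     numeric_df = df.select_dtypes(include="number")  # keep only numeric columns

     # Heatmap: color encodes the strength and direction of each correlation,
     # but says nothing about distributions, outliers, or causality.
     sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", center=0)
     plt.title("Correlation Heatmap (numeric columns)")
     plt.show()

     # Pairplot: a scatter plot for every pair of variables with histograms on
     # the diagonal; good for spotting outliers, cluttered for wide datasets.
     sns.pairplot(numeric_df)
     plt.show()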
[29]: import pandas as pd
      import plotly.figure_factory as ff

      categories = ['Category A', 'Category B', 'Category C']
      start_dates = ['2023-02-01', '2022-03-01', '2021-04-01']
      finish_dates = ['2023-07-01', '2022-11-01', '2021-08-01']

      # Create a DataFrame with the column names create_gantt expects
      data = {'Task': categories, 'Start': start_dates, 'Finish': finish_dates}
      df = pd.DataFrame(data)

      # Create the Gantt chart figure
      fig = ff.create_gantt(df, index_col='Task', show_colorbar=True, group_tasks=True)
      fig.update_layout(title='Gantt Chart')

      # Set the marker color to red for all tasks
      fig.update_traces(marker=dict(color='red'))

      # Show the plot
      fig.show()

1) Matplotlib - the fundamental library for making static charts in Python. It is
compatible with companion toolkits: mplot3d (in mpl_toolkits) adds 3D charting, and
mplcursors can add hover interactivity to its charts.
2) Seaborn - built on top of Matplotlib, Seaborn offers a high-level interface for
producing visually appealing and informative statistical visualizations. Its own
interactive features are limited, but it can be combined with other libraries to add
interactive elements.
3) Plotly - a popular library for making interactive visualizations. It supports many
chart types, such as heatmaps, bar charts, scatter plots, and more, and its charts come
with tooltips, zoom, and pan capabilities by default.
4) Bokeh - an effective library for building dynamic, web-based visualizations. It
provides a wealth of features for interactive dashboards, including server-based apps
and widgets for user interaction.
5) Altair - a declarative statistical visualization library with a concise syntax that
makes interactive charts easy to build. Because it is built on Vega-Lite, sophisticated
visuals take very little code (see the sketch after this list).
6) Holoviews - a clear, declarative syntax for building interactive visualizations, with
flexible rendering: it works with Bokeh, Matplotlib, Plotly, and other backends.
7) Folium - a Python package for creating interactive Leaflet maps. With support for
customized markers, pop-ups, and tooltips, it is ideal for visualizing geographical
data.
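To illustrate the Altair point above, here is a hypothetical sketch, not from the
original notebook: the toy DataFrame and column names are invented, but the chained
declarative encoding plus .interactive() for zoom and pan is the standard Altair
pattern.

[ ]: # Hypothetical sketch: Altair's declarative syntax on a toy DataFrame.
     import altair as alt
     import pandas as pd

     toy = pd.DataFrame({"x": [1, 2, 3, 4],
                         "y": [3, 1, 4, 2],
                         "group": ["a", "a", "b", "b"]})

     # One chained expression maps columns to visual channels;
     # .interactive() enables zoom and pan when rendered in a notebook.
     chart = alt.Chart(toy).mark_point().encode(x="x", y="y", color="group").interactive()
     chart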
[32]: import pandas as pd
      import matplotlib.pyplot as plt

      df = pd.read_csv("hotel_bookings-1.csv")

      # Count the occurrences of each unique value in 'market_segment'
      count_of_bookings = df['market_segment'].value_counts().reset_index()
      count_of_bookings.columns = ['market_segment', 'count']

      # Create a bar plot
      plt.figure(figsize=(12, 6))
      plt.bar(count_of_bookings['market_segment'], count_of_bookings['count'],
              color='b', alpha=0.7)

      # Customize the plot
      plt.title("Distribution of Market Segments")
      plt.xlabel("Market Segment")
      plt.ylabel("Count of Bookings")
      plt.xticks(rotation=60)

      # Create a custom legend: plot empty bars so that each segment gets its
      # own labeled legend entry
      for i, label in enumerate(count_of_bookings['market_segment']):
          plt.bar([], [], color='b', label=f'{label}: {count_of_bookings["count"][i]}')
      plt.legend(loc='upper right', title="Market Segment")

      # Show the plot
      plt.show()
[33]: import pandas as pd
      import matplotlib.pyplot as plt

      df = pd.read_csv("hotel_bookings-1.csv")

      # Count the occurrences of each unique value in 'market_segment'
      count_of_bookings = df['market_segment'].value_counts().reset_index()
      count_of_bookings.columns = ['market_segment', 'count']

      # Sort the data by count in descending order
      count_of_bookings = count_of_bookings.sort_values(by='count', ascending=False)

      # Create a bar plot
      plt.figure(figsize=(12, 6))
      bars = plt.bar(count_of_bookings['market_segment'], count_of_bookings['count'],
                     color='b', alpha=0.7)

      # Highlight the highest and second-highest segments in green and red
      bars[0].set_color('green')
      bars[1].set_color('red')

      # Customize the plot
      plt.title("Distribution of Market Segments")
      plt.xlabel("Market Segment")
      plt.ylabel("Count of Bookings")
      plt.xticks(rotation=60)

      # Add the total count on top of each bar
      for i, count in enumerate(count_of_bookings['count']):
          plt.text(i, count, str(count), ha='center', va='bottom')

      # Print the market segments with the highest and lowest booking counts
      highest_segment = count_of_bookings.iloc[0]['market_segment']
      lowest_segment = count_of_bookings.iloc[-1]['market_segment']
      print("Market segment with the highest booking count:", highest_segment)
      print("Market segment with the lowest booking count:", lowest_segment)

      # Show the plot
      plt.show()
plt . ylabel( "Count of Bookings" ) plt . xticks(rotation =60 ) # Add total count on top of each bar for i, count in enumerate (count_of_bookings[ 'count' ]): plt . text(i, count, str (count), ha = 'center' , va = 'bottom' ) # Print the market segments with the highest and lowest booking counts highest_segment = count_of_bookings . iloc[ 0 ][ 'market_segment' ] lowest_segment = count_of_bookings . iloc[ -1 ][ 'market_segment' ] print ( "Market segment with the highest booking count:" , highest_segment) print ( "Market segment with the lowest booking count:" , lowest_segment) # Show the plot plt . show() Market segment with the highest booking count: Online TA Market segment with the lowest booking count: Undefined [57]: import seaborn as sns import matplotlib.pyplot as plt # Set the style for the plots sns . set(style = "whitegrid" ) 5
      # Create a figure with one subplot per feature
      fig, axes = plt.subplots(1, diabetes_df.shape[1], figsize=(16, 4))

      # Create individual box plots for each feature, with outliers marked in violet
      for i, column in enumerate(diabetes_df.columns):
          sns.boxplot(x=diabetes_df[column], ax=axes[i], color="blue",
                      flierprops=dict(markerfacecolor="violet", markersize=5),
                      medianprops=dict(color="red"))
          axes[i].set_title(f'Box Plot for {column}')

      # Set a common y-label and a main title
      fig.text(0.04, 0.5, "Feature Value", va='center', rotation='vertical')
      plt.suptitle("Box Plots of Diabetes Dataset Features", fontsize=16)

      # Adjust the layout and show the plot
      plt.tight_layout()
      plt.subplots_adjust(top=0.8)
      plt.show()

[56]: import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt
      from sklearn.datasets import load_diabetes

      # Load the diabetes dataset
      diabetes_data = load_diabetes()
      diabetes_df = pd.DataFrame(data=diabetes_data.data,
                                 columns=diabetes_data.feature_names)

      # Set the style for the plots
      sns.set(style="whitegrid")

      # Create the violin plot with a green central region
      plt.figure(figsize=(12, 6))
      sns.violinplot(data=diabetes_df, palette="YlGn", inner="quartile")
      # Set labels and title
      plt.xlabel("Feature Name")
      plt.ylabel("Feature Value")
      plt.title("Violin Plot of Diabetes Dataset Features")

      # Show the plot
      plt.show()

[62]: import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt
      from sklearn.datasets import load_diabetes

      # Load the diabetes dataset
      diabetes_data = load_diabetes()
      diabetes_df = pd.DataFrame(data=diabetes_data.data,
                                 columns=diabetes_data.feature_names)

      # Choose two features for the scatter plot
      feature1 = 'age'  # Replace with the first feature you want to plot
      feature2 = 'bmi'  # Replace with the second feature you want to plot

      # Create a scatter plot
      plt.figure(figsize=(8, 6))
      sns.set_style("whitegrid")

      # Custom scatter plot parameters
      scatter_kwargs = {
          'marker': 'D',        # Use diamonds as markers
          's': 40,              # Marker size
          'color': 'crimson',   # Color of the data points
          'edgecolor': 'gold',  # Edge color
      }

      # Create the scatter plot
      sns.scatterplot(x=diabetes_df[feature1], y=diabetes_df[feature2],
                      **scatter_kwargs)

      # Set the background color
      plt.gca().set_facecolor('#F1D7DE')

      # Set labels and title
      plt.xlabel(feature1)
      plt.ylabel(feature2)
      plt.title(f"Scatter Plot between {feature1} and {feature2}")

      # Show the plot
      plt.show()
An RDD (Resilient Distributed Dataset) is a core data structure in Apache Spark that
represents a distributed collection of data. RDD operations fall into two primary
categories, transformations and actions, and both are essential to distributed data
processing in Spark. The main distinctions between them are:

RDD transformations:
1) Lazy evaluation - a transformation on an RDD produces a new RDD but does not compute
anything immediately. Spark merely records the transformation so that it can build an
efficient execution plan later.
2) Immutability - because RDDs are immutable, applying a transformation creates a new
RDD rather than altering the existing one.
3) Parallel execution - transformations can run concurrently across an RDD's partitions,
which makes them well suited to distributed data processing.

RDD actions:
1) Eager evaluation - actions are the operations that trigger the previously recorded
transformations to run. The real computation happens when an action is called; actions
either return values to the driver or write data to an external storage system.
2) Side effects - actions typically carry out side effects, such as writing data to a
file or sending results back to the driver program.

(A small sketch of the lazy/eager split follows.)
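To make the lazy/eager distinction concrete, here is a minimal sketch, not part of the
original assignment. It assumes pyspark is installed (as in the cell below); the
variable and app names are illustrative.

[ ]: # Hedged sketch: transformations build a lineage, actions execute it.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("rdd_lazy_demo").getOrCreate()
     sc = spark.sparkContext

     rdd = sc.parallelize(range(10))               # distributed collection
     squares = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
     evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

     # Actions trigger the recorded transformations to execute:
     print(evens.count())    # eager: runs the lineage, prints 5
     print(evens.collect())  # eager: prints [0, 4, 16, 36, 64]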
[64]: !pip install pyspark

      Collecting pyspark
        Using cached pyspark-3.5.0.tar.gz (316.9 MB)
        Preparing metadata (setup.py) … done
      Requirement already satisfied: py4j==0.10.9.7 in
      /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
      Building wheels for collected packages: pyspark
        Building wheel for pyspark (setup.py) … done
        Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl
        size=317425344
        sha256=454913b408749c1b27c011a0646565300896ee3479125e5944decaf1ba78c965
        Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9a
        c9241e5e44a01940da8fbb17fc
      Successfully built pyspark
      Installing collected packages: pyspark
      Successfully installed pyspark-3.5.0

[65]: # Import SparkSession
      from pyspark.sql import SparkSession

[66]: spark = SparkSession.builder.appName('Assignment2_Amulya').getOrCreate()

[72]: spark
[72]: <pyspark.sql.session.SparkSession at 0x7baf4a7d9450>

[76]: df = spark.read.csv("/content/movies-1.csv", header=True, inferSchema=True)
      df1 = spark.read.csv("/content/ratings-1-1.csv", header=True, inferSchema=True)

[77]: df.show()
      +-------+--------------------+--------------------+
      |movieId|               title|              genres|
      +-------+--------------------+--------------------+
      |      1|    Toy Story (1995)|Adventure|Animati…|
      |      2|      Jumanji (1995)|Adventure|Childre…|
      |      3|Grumpier Old Men …|      Comedy|Romance|
      |      4|Waiting to Exhale…|Comedy|Drama|Romance|
      |      5|Father of the Bri…|              Comedy|
      |      6|         Heat (1995)|Action|Crime|Thri…|
      |      7|      Sabrina (1995)|      Comedy|Romance|
      |      8| Tom and Huck (1995)|  Adventure|Children|
      |      9| Sudden Death (1995)|              Action|
      |     10|    GoldenEye (1995)|Action|Adventure|…|
      |     11|American Presiden…|Comedy|Drama|Romance|
      |     12|Dracula: Dead and…|       Comedy|Horror|
      |     13|        Balto (1995)|Adventure|Animati…|
      |     14|        Nixon (1995)|               Drama|
      |     15|Cutthroat Island …|Action|Adventure|…|
      |     16|       Casino (1995)|         Crime|Drama|
      |     17|Sense and Sensibi…|       Drama|Romance|
      |     18|   Four Rooms (1995)|              Comedy|
      |     19|Ace Ventura: When…|              Comedy|
      |     20|  Money Train (1995)|Action|Comedy|Cri…|
      +-------+--------------------+--------------------+
      only showing top 20 rows

[78]: df1.show()

      +------+-------+------+
      |userId|movieId|rating|
      +------+-------+------+
      |     1|      1|   4.0|
      |     1|      3|   4.0|
      |     1|      6|   4.0|
      |     1|     47|   5.0|
      |     1|     50|   4.6|
      |     1|     70|   3.0|
      |     1|    101|   5.0|
      |     1|    110|   4.0|
      |     1|    151|   4.8|
      |     1|    157|   4.7|
      |     1|    163|   4.6|
      |     1|    216|   4.9|
      |     1|    223|   3.0|
      |     1|    231|   4.6|
      |     1|    235|   4.0|
      |     1|    260|   4.6|
      |     1|    296|   3.0|
      |     1|    316|   3.0|
      |     1|    333|   4.6|
      |     1|    349|   4.0|
      +------+-------+------+
      only showing top 20 rows

[79]: df.count()
[79]: 9742

[80]: df1.count()
[80]: 100836

[81]: df1.rating >= 3
[81]: Column<'(rating >= 3)'>

[86]: from pyspark.sql.functions import col

      # Keep only the movies whose genres include "Comedy"
      comedy_movies_df = df.filter(col("genres").contains("Comedy"))

      # Calculate the count of Comedy genre records
      comedy_count = comedy_movies_df.count()

      # Calculate the total count of all records
      total_count = df.count()

      # Calculate the percentage of Comedy genres
      comedy_percentage = (comedy_count / total_count) * 100

      # Display the count and percentage
      print(f"Count of Comedy Genre Records: {comedy_count}")
      print(f"Percentage of Comedy Genres: {comedy_percentage:.2f}%")

      Count of Comedy Genre Records: 3756
      Percentage of Comedy Genres: 38.55%

Answer Here: Global views, also known as Global Temporary Views in Apache Spark, are a
means to create and manage temporary views of DataFrames or SQL tables that can be
shared across several Spark sessions or contexts within the same application. These
views let you construct a structured schema on top of your data and run SQL queries
against it. A Global Temporary View has a predetermined lifespan: it can be used until
it is explicitly removed or until the Spark application that created it ends.

Creating a Global Temporary View uses the createOrReplaceGlobalTempView method
available on a DataFrame:

    dataFrame.createOrReplaceGlobalTempView("viewName")

Lifecycle of Global Temporary Views:
1) Creation - you create a Global Temporary View with the createOrReplaceGlobalTempView
method.
2) Availability - queries can be run against the Global Temporary View within the Spark
session in which it was created; like any other DataFrame or table, it supports SQL
queries.
3) Cross-session access - within the same Spark application, multiple Spark sessions or
contexts can access a Global Temporary View (it lives in the reserved global_temp
database). This gives it a global scope, unlike a standard session-scoped view.
4) Termination - a Global Temporary View remains usable until it is explicitly removed
with a DROP statement:

    DROP VIEW IF EXISTS global_temp.viewName;

5) Application termination - when the Spark application that created them closes, any
remaining Global Temporary Views are deleted automatically, so views are cleaned up
and none are left dangling. (A small cross-session sketch follows.)
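As a hedged sketch of the cross-session behavior, not from the original notebook: a
view registered in one session is visible from a fresh session of the same application
through the global_temp database. The view name demo_view and the toy data are
hypothetical.

[ ]: # Hypothetical sketch: cross-session access to a global temporary view.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("gtv_demo").getOrCreate()
     df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

     # Register the view; it is stored in the reserved global_temp database
     df.createOrReplaceGlobalTempView("demo_view")

     # A brand-new session in the same application can still query it
     other_session = spark.newSession()
     other_session.sql("SELECT * FROM global_temp.demo_view").show()

     # Explicitly drop the view once it is no longer needed
     spark.sql("DROP VIEW IF EXISTS global_temp.demo_view")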
[103]: df1.createOrReplaceGlobalTempView("global_rating_view")

       result = spark.sql("SELECT * FROM global_temp.global_rating_view")
       result.show()

       # Identify the user who has given the most ratings
       most_rated_users = spark.sql("""
           SELECT userId, COUNT(*) AS rating_count
           FROM global_temp.global_rating_view
           GROUP BY userId
           ORDER BY rating_count DESC
           LIMIT 1
       """)

       # Show the user(s) with the most ratings
       most_rated_users.show()

       +------+-------+------+
       |userId|movieId|rating|
       +------+-------+------+
       |     1|      1|   4.0|
       |     1|      3|   4.0|
       |     1|      6|   4.0|
       |     1|     47|   5.0|
       |     1|     50|   4.6|
       |     1|     70|   3.0|
       |     1|    101|   5.0|
       |     1|    110|   4.0|
       |     1|    151|   4.8|
       |     1|    157|   4.7|
       |     1|    163|   4.6|
       |     1|    216|   4.9|
       |     1|    223|   3.0|
       |     1|    231|   4.6|
       |     1|    235|   4.0|
       |     1|    260|   4.6|
       |     1|    296|   3.0|
       |     1|    316|   3.0|
       |     1|    333|   4.6|
       |     1|    349|   4.0|
       +------+-------+------+
       only showing top 20 rows

       +------+------------+
       |userId|rating_count|
       +------+------------+
       |   414|        2698|
       +------+------------+

[110]: common_column = "movieId"

       # Join the two DataFrames on the common column to create a new DataFrame
       joined_df = df1.join(df, on=common_column, how="inner")

       # Create a Global Temporary View for the joined DataFrame
       joined_df.createOrReplaceGlobalTempView("my_combined_view")

       # Use SQL to query and view the data in the Global Temporary View
       result1 = spark.sql("SELECT * FROM global_temp.my_combined_view")

       # Show the data
       result1.show()

       +-------+------+------+--------------------+--------------------+
       |movieId|userId|rating|               title|              genres|
       +-------+------+------+--------------------+--------------------+
       |      1|     1|   4.0|    Toy Story (1995)|Adventure|Animati…|
       |      3|     1|   4.0|Grumpier Old Men …|      Comedy|Romance|
       |      6|     1|   4.0|         Heat (1995)|Action|Crime|Thri…|
       |     47|     1|   5.0|Seven (a.k.a. Se7…|   Mystery|Thriller|
       |     50|     1|   4.6|Usual Suspects, T…|Crime|Mystery|Thr…|
       |     70|     1|   3.0|From Dusk Till Da…|Action|Comedy|Hor…|
       |    101|     1|   5.0|Bottle Rocket (1996)|Adventure|Comedy|…|
       |    110|     1|   4.0|   Braveheart (1995)|    Action|Drama|War|
       |    151|     1|   4.8|      Rob Roy (1995)|Action|Drama|Roma…|
       |    157|     1|   4.7|Canadian Bacon (1…|          Comedy|War|
       |    163|     1|   4.6|    Desperado (1995)|Action|Romance|We…|
       |    216|     1|   4.9|Billy Madison (1995)|              Comedy|
       |    223|     1|   3.0|       Clerks (1994)|              Comedy|
       |    231|     1|   4.6|Dumb & Dumber (Du…|    Adventure|Comedy|
       |    235|     1|   4.0|      Ed Wood (1994)|        Comedy|Drama|
       |    260|     1|   4.6|Star Wars: Episod…|Action|Adventure|…|
       |    296|     1|   3.0| Pulp Fiction (1994)|Comedy|Crime|Dram…|
       |    316|     1|   3.0|     Stargate (1994)|Action|Adventure|…|
       |    333|     1|   4.6|    Tommy Boy (1995)|              Comedy|
       |    349|     1|   4.0|Clear and Present…|Action|Crime|Dram…|
       +-------+------+------+--------------------+--------------------+
       only showing top 20 rows

[115]: results = spark.sql("""
           SELECT genres, AVG(rating) AS avg_rating
           FROM global_temp.my_combined_view
           GROUP BY genres
           ORDER BY avg_rating DESC
       """)

       # Show the result
       results.show()

       +--------------------+----------+
       |              genres|avg_rating|
       +--------------------+----------+
       |Action|Horror|Mys…|       4.6|
       |Comedy|Drama|Fant…|       4.6|
       |Animation|Childre…|       4.6|
       |Adventure|Comedy|…|       4.6|
       |Drama|Fantasy|Mus…|       4.6|
       |Drama|Horror|Romance|       4.6|
       |Adventure|Romance…|       4.6|
       |Animation|Crime|D…|       4.6|
       |Animation|Drama|F…|       4.6|
       |Action|Crime|Dram…|       4.6|
       |Fantasy|Mystery|W…|       4.6|
       |Action|Comedy|Dra…|       4.6|
       |Adventure|Drama|F…|       4.6|
       |Comedy|Crime|Dram…|       4.6|
       |Animation|Drama|S…|       4.6|
       |Comedy|Horror|Mys…|       4.6|
       |Comedy|Crime|Fantasy|       4.6|
       |Comedy|Crime|Dram…|      4.55|
       |   Animation|Romance|      4.55|
       |Children|Drama|Ro…|      4.55|
       +--------------------+----------+
       only showing top 20 rows