Assignment 1
1. In your own words, provide a definition or description for each of the following:

(a) Random Variable

A random variable is a variable whose numerical value is determined by the outcome of a random process, with the likelihood of each value described by a probability distribution.

(b) Correlation

Correlation is a measure of the relationship between two variables. The relationship can be direct (both variables tend to increase together) or inverse (one tends to increase as the other decreases).

(c) Probability Density Function

A probability density function is a function that gives the relative likelihood of a continuous random variable taking on a particular value; probabilities are obtained by integrating it over an interval.

(d) Marginal Distribution

A marginal distribution is the distribution of a single variable, obtained from a joint distribution by summing (or integrating) out the other variables.

(e) Statistical Independence

Statistical independence describes the relationship between multiple variables: if changing one variable has no effect on the distribution of the others, the variables are said to be statistically independent.

2. In your own words, provide brief responses to the following:

(a) What problem was MapReduce introduced to solve?

MapReduce was introduced to process very large datasets. It does this through parallel processing across multiple machines.

(b) Explain the differences between the Map and Reduce steps.

Map: the input data is broken into smaller pieces, each processed independently by a map function. The map function emits key-value pairs as intermediate output.

Reduce: the intermediate pairs are grouped by key, and a reduce function aggregates the values within each group, producing the final aggregated values.
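To make the two steps concrete, here is a minimal single-machine sketch of the classic word-count pattern; the function names are illustrative, not from any particular framework:

from collections import defaultdict

# Map: emit a (key, value) pair for every word in an input split.
def map_step(document):
    for word in document.split():
        yield (word.lower(), 1)

# Reduce: aggregate all values grouped under the same key.
def reduce_step(key, values):
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: the framework groups intermediate values by key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_step(doc):
        grouped[key].append(value)

results = [reduce_step(key, values) for key, values in grouped.items()]
print(results)  # includes ('the', 3) and ('fox', 2)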
(c) What is skew in MapReduce?

Skew is an uneven distribution of work across the nodes in a cluster. It can arise when some keys are associated with much more data than other keys, so the tasks handling those keys take far longer to finish.

3. For the given joint distribution of X and Y, the code below computes (a) the marginal distributions, (b) the expected values and variances, (c) the covariance, (d) the conditional distribution of X given Y = 1, and (e) a new distribution with independent variables.

import numpy as np

# Define the joint probability distribution as a 2D NumPy array
# (rows index Y = 0, 1; columns index X = 0, 1, 2)
joint_distribution = np.array([[0.3, 0.15, 0.25],
                               [0.0, 0.25, 0.05]])

# (a) Compute the marginal distributions of X and Y
marginal_X = np.sum(joint_distribution, axis=0)
marginal_Y = np.sum(joint_distribution, axis=1)

# (b) Compute the expected values and variances of X and Y
X_values = np.array([0, 1, 2])
Y_values = np.array([0, 1])
mean_X = np.sum(X_values * marginal_X)
mean_Y = np.sum(Y_values * marginal_Y)
var_X = np.sum(X_values ** 2 * marginal_X) - mean_X ** 2
var_Y = np.sum(Y_values ** 2 * marginal_Y) - mean_Y ** 2

# (c) Compute the covariance of X and Y
cov_XY = np.sum(np.outer(X_values - mean_X, Y_values - mean_Y)
                * joint_distribution.T)

# (d) Calculate the conditional distribution of X given Y = 1
conditional_X_given_Y1 = joint_distribution[1, :] / marginal_Y[1]

# (e) Create a new distribution with independent variables
a = 1 - (marginal_X[1] + marginal_X[2])
d = 1 - (conditional_X_given_Y1[1] + conditional_X_given_Y1[2])
b = marginal_X[0] * marginal_Y[1]
c = marginal_X[2] * marginal_Y[1]
e = conditional_X_given_Y1[0] * marginal_Y[0]
f = conditional_X_given_Y1[2] * marginal_Y[0]

# Print the results
print("(a) Marginal distribution of X:", marginal_X)
print("    Marginal distribution of Y:", marginal_Y)
print("(b) Mean of X:", mean_X)
print("    Variance of X:", var_X)
print("    Mean of Y:", mean_Y)
print("    Variance of Y:", var_Y)
print("(c) Covariance of X and Y:", cov_XY)
print("(d) Conditional distribution of X given Y=1:", conditional_X_given_Y1)
print("(e) New distribution with independent variables:")
print("X/Y\t0\t1\t2")
print(f"0\t{a:.4f}\t{b:.4f}\t{c:.4f}")
print(f"1\t{d:.4f}\t{e:.4f}\t{f:.4f}")

Output:

(a) Marginal distribution of X: [0.3 0.4 0.3]
    Marginal distribution of Y: [0.7 0.3]
(b) Mean of X: 1.0
    Variance of X: 0.6000000000000001
    Mean of Y: 0.3
    Variance of Y: 0.21
(c) Covariance of X and Y: 0.049999999999999996
(d) Conditional distribution of X given Y=1: [0. 0.83333333 0.16666667]
(e) New distribution with independent variables:
X/Y     0       1       2
0       0.3000  0.0900  0.0900
1       0.0000  0.0000  0.1167
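For part (e), note that a standard way to build a joint distribution with the same marginals but statistically independent X and Y is the outer product of the marginals. This is an editorial sketch of that construction, not part of the submitted approach:

import numpy as np

marginal_X = np.array([0.3, 0.4, 0.3])  # P(X=0), P(X=1), P(X=2)
marginal_Y = np.array([0.7, 0.3])       # P(Y=0), P(Y=1)

# Under independence, P(X=x, Y=y) = P(X=x) * P(Y=y).
independent_joint = np.outer(marginal_Y, marginal_X)  # rows index Y, columns index X
print(independent_joint)
# [[0.21 0.28 0.21]
#  [0.09 0.12 0.09]]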
4. Load in the Los Angeles Air Quality (LA AQ) dataset and construct all of the pairwise scatterplots of the numeric columns.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the LA AQ dataset
la_aq = pd.read_csv('LA_AQ.csv')

# Extract the numeric columns
numeric_columns = la_aq.select_dtypes(include='number')

# Construct all of the pairwise scatterplots
for col1 in numeric_columns.columns:
    for col2 in numeric_columns.columns:
        if col1 != col2:
            fig, ax = plt.subplots()
            ax.scatter(la_aq[col1], la_aq[col2], alpha=0.5)
            ax.set_xlabel(col1)
            ax.set_ylabel(col2)
            ax.set_title(f'{col1} vs. {col2}')
            plt.show()
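As a side note, the same pairwise views could also be produced in one figure with seaborn's pairplot; a minimal sketch, assuming the same LA_AQ.csv file:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

la_aq = pd.read_csv('LA_AQ.csv')
numeric = la_aq.select_dtypes(include='number')

# One grid with every pairwise scatterplot (histograms on the diagonal).
sns.pairplot(numeric)
plt.show()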
numeric_df = la_aq.select_dtypes(include=['number'])
means = numeric_df.mean()
cov_matrix = numeric_df.cov()
correlation_matrix = numeric_df.corr()

print(means)
print(cov_matrix)
print(correlation_matrix)

Output:

Wind                7.500000
Solar.radiation    73.857143
CO                  4.547619
NO                  2.190476
NO2                10.047619
O3                  9.404762
HC                  3.095238
dtype: float64

Covariance matrix:

                     Wind  Solar.radiation        CO        NO       NO2        O3        HC
Wind             2.500000        -2.780488 -0.378049 -0.463415 -0.585366 -2.231707  0.170732
Solar.radiation -2.780488       300.515679  3.909408 -1.386760  6.763066 30.790941  0.623693
CO              -0.378049         3.909408  1.522067  0.673635  2.314750  2.821719  0.141696
NO              -0.463415        -1.386760  0.673635  1.182346  1.088269 -0.810685  0.176539
NO2             -0.585366         6.763066  2.314750  1.088269 11.363531  3.126597  1.044135
O3              -2.231707        30.790941  2.821719 -0.810685  3.126597 30.978513  0.594657
HC               0.170732         0.623693  0.141696  0.176539  1.044135  0.594657  0.478513

Correlation matrix:

                     Wind  Solar.radiation        CO        NO       NO2        O3        HC
Wind             1.000000        -0.101442 -0.193803 -0.269543 -0.109825 -0.253593  0.156098
Solar.radiation -0.101442         1.000000  0.182793 -0.073569  0.115732  0.319124  0.052010
CO              -0.193803         0.182793  1.000000  0.502152  0.556584  0.410929  0.166032
NO              -0.269543        -0.073569  0.502152  1.000000  0.296898 -0.133952  0.234704
NO2             -0.109825         0.115732  0.556584  0.296898  1.000000  0.166642  0.447768
O3              -0.253593         0.319124  0.410929 -0.133952  0.166642  1.000000  0.154451
HC               0.156098         0.052010  0.166032  0.234704  0.447768  0.154451  1.000000

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset and keep the numeric columns
data = pd.read_csv('LA_AQ.csv')
numeric = data.select_dtypes(include='number')

# Calculate covariances and correlations
covariance_matrix = numeric.cov()
correlation_matrix = numeric.corr()

# Create a heatmap for the covariances
sns.set(font_scale=1.2)  # Adjust font size for better readability
plt.figure(figsize=(10, 6))
sns.heatmap(covariance_matrix, annot=True, cmap='coolwarm', fmt=".2f",
            linewidths=0.5)
plt.title('Covariance Matrix')
plt.show()

# Create a heatmap for the correlations
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",
            linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
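As a quick numeric check on the observations recorded below, the off-diagonal correlations can also be ranked directly; a minimal sketch, assuming the correlation_matrix computed above:

import numpy as np

# Blank out the diagonal, then flatten and sort the remaining entries.
off_diag = correlation_matrix.mask(np.eye(len(correlation_matrix), dtype=bool))
strongest = off_diag.unstack().dropna().abs().sort_values(ascending=False)
print(strongest.head(6))  # each pair appears twice because the matrix is symmetric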
(a) What do you observe from the scatterplots?
CO seems to have strong relationships with NO and NO2, and NO2 seems to have a moderate relationship with HC. Wind has almost no relationship with anything, and nothing else shows a strong relationship.

(b) Compute the sample means, covariances, and correlations from this data.

See the output above.

(c) Create heatmaps for the covariances and correlations from part (b) separately. What do you observe about these visualizations?

Basically everything I saw in the scatterplots is what I see in the heatmaps. They are just much easier to read, since there is a single figure to examine instead of dozens of separate plots.

5. Design a MapReduce algorithm to take a very large file of integers as input and produce as output the average of all the integers. You do not have to write any code or implement your algorithm, just describe the method you would use. You may decide how the integers are stored in the input and should describe this in your solution.

I would store the integers one per line in the input file and begin by reading the file line by line, emitting key-value pairs as output. The framework would then group together pairs with the same key; for each key I would collect the integer values into a list and compute their sum and count. For example, I might end up with five 3s, six 4s, and seventeen 2s. These per-value sums and counts would then be combined under a single key called "all", which contains two integers, sum and count. The average equals the sum divided by the count, so the final output is the average of all the integers. (A minimal sketch of this flow appears after the last question below.)

6. Make a copy of CoLab 1 here: link and determine which letter is at the beginning of the most words in the provided dataset.

S is the letter at the beginning of the most words in the dataset.
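Question 5 asks only for a description, but a minimal single-machine sketch of the flow described there may make it concrete. It is slightly simplified (the (value, 1) pairs go straight to the single "all" key rather than grouping equal values first), and the names are illustrative, not from a real MapReduce framework:

# Map: for each input line, emit ("all", (value, 1)).
def map_step(line):
    yield ("all", (int(line), 1))

# Reduce: combine the (sum, count) pairs grouped under the "all" key.
def reduce_step(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count

lines = ["3", "3", "4", "2", "2"]  # stand-in for the very large input file

# Shuffle: group intermediate pairs by key (done by the framework).
grouped = {}
for line in lines:
    for key, pair in map_step(line):
        grouped.setdefault(key, []).append(pair)

for key, pairs in grouped.items():
    print(reduce_step(key, pairs))  # ('all', 2.8)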