1_senator_pca - Jupyter Notebook

School

University of California, Berkeley

Course

54

Subject

Computer Science

Date

Feb 20, 2024

Type

pdf

Pages

6

Uploaded by DeanWombat2487

PCA and senate voting data

In this problem, we are given the data matrix $X$ with entries in $\{-1, 0, 1\}$, where each row corresponds to a senator and each column to a bill. We first import this data, print some relevant values, and normalize it as necessary to ready it for further computation.

To run this code, you'll need a number of standard Python libraries, all of which are installable via a package manager such as pip or conda. We highly recommend using a virtual environment (https://realpython.com/python-virtual-environments-a-primer/) for this class and in general. Lastly, ensure that all data files (senator_pca_data_matrix.csv and senator_pca_politician_labels.txt) are located in the same folder as the notebook.

Places where you will need to modify this code are enclosed in a TODO block. You should not need to modify code outside these blocks to complete the problems. Questions that you are expected to answer in text are marked in red. For solution files, solutions will be presented in blue.

In [1]:
# import the necessary packages for data manipulation, computation and PCA
import pandas as pd
import numpy as np
import scipy as sp
from numpy import linalg as LA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib inline
np.random.seed(7)

In [2]:
# import the data matrix
senator_df = pd.read_csv('senator_pca_data_matrix.csv')
affiliation_file = open('senator_pca_politician_labels.txt', 'r')
affiliations = [line.split('\n')[0].split(' ')[1] for line in affiliation_file]
X = np.array(senator_df.values[:, 3:].T, dtype='float64')  # transpose so that rows are senators
print('X.shape: ', X.shape)
n = X.shape[0]  # number of senators
d = X.shape[1]  # number of bills

X.shape: (100, 542)

We observe that the number of rows, $n$, is the number of senators and is equal to 100. The number of columns, $d$, is the number of bills and is equal to 542.

In [3]:
# this is just used for plotting, feel free to ignore
assert set(affiliations) == {"Red", "Blue", "Yellow"}
# assign a marker and hatch to each affiliation
markers = [("Red", "o", "/"), ("Blue", "^", "-"), ("Yellow", "D", "+")]

A row of $X$ consists of 542 entries, each equal to -1 (senator voted against), 1 (senator voted for), or 0 (senator abstained), one for each bill.

In [4]:
# print an example row of the data matrix
typical_row = X[0]
print(typical_row.shape)
print(typical_row)
# print an example column of the data matrix
typical_column = X[:, 0]
print(typical_column.shape)
print(typical_column)

(542,)
[ 1. 1. 1. -1. -1. 1. 1. 1. 1. -1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. -1. -1. -1. 1. 1. -1. -1. -1. -1. 1. 1. 1. -1. 1. 1. -1. 1. 1. -1. 1. 1. 1. 1. -1. 1. -1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 0. -1. 1. 1. 1. -1. -1. 1. 1. -1. -1. 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. -1. -1. 1. 1. 1. -1. -1. -1. -1. -1. -1. 1. -1. 1. 1. -1. -1. -1. 1. -1. 1. -1. 1. 0. 0. 1. 1. -1. 1. 1. -1. 1. 1. -1. 1. -1. -1. 1. 1. 1. 1. 0. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 0. 1. 0. -1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. -1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. -1. 1. -1. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. 1. -1. 1. -1. -1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. -1. -1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. -1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. -1. -1. 0. 0. 0. 0. 0. 0. 0. 1. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. -1. -1. 1. 1. 1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. 1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1.]
(100,)
[ 1. 1. 1. 1. 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. -1. 1. 1. -1. 1. 1. -1. 1. -1. 1. 1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. -1. -1. -1. -1. 1. -1. -1. 1. 1. -1. -1. -1. -1. -1. 1. 1. -1. -1. 1. 1. -1. -1. -1. -1. -1. 1. 1. 1. 1. 1. -1. -1. -1. 1. -1. -1. 1. -1. -1. 1. 1. 1. -1. -1. -1. 1. 1. -1. 1. -1. 1. 1. 1. -1. -1. -1. -1. -1. 1. 1. 1. -1. -1. -1.]

A column of $X$ consists of 100 entries in {-1, 0, 1}, one for each senator that voted on the bill.

In [5]:
# compute the mean vote on each bill
X_mean = np.mean(X, axis=0)
plt.plot(X_mean)
plt.title('means of each column of X')
plt.xlabel('column/bill')
plt.ylabel('mean vote')
plt.show()

We observe that the mean of the columns is not zero, so we center the data by subtracting the mean of each bill's vote from its respective column.

In [6]:
# center the data matrix
X_original = X.copy()  # save a copy for part (d) and (e)
X = X - np.mean(X, axis=0)

a) Maximizing the empirical variance

In this problem, you are asked to find a unit-norm vector $a$ maximizing the empirical variance of the scores $\langle x_i, a \rangle$, where $x_i$ is the $i$-th row of $X$. We first provide a function to calculate the scores, $f(X, a) = Xa$.
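As a side check on the centering step and the variance objective, the following minimal sketch uses a small synthetic matrix (the names `X_toy`, `X_toy_centered`, and `C`, and the 1/n normalization, are assumptions for illustration, not part of the assignment). It suggests that once the columns are centered, the variance of the scores along a unit-norm direction can be written as a quadratic form in that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the vote matrix: 8 "senators", 5 "bills", entries in {-1, 0, 1}.
X_toy = rng.choice([-1.0, 0.0, 1.0], size=(8, 5))

# Center each column; broadcasting subtracts the (5,) mean vector from every row.
X_toy_centered = X_toy - X_toy.mean(axis=0)

# An arbitrary unit-norm direction in bill space.
a = rng.standard_normal(5)
a = a / np.linalg.norm(a)

scores = X_toy_centered @ a  # same computation as the score function X @ a
# Covariance matrix with 1/n normalization (an assumption that matches ndarray.var()).
C = X_toy_centered.T @ X_toy_centered / X_toy_centered.shape[0]

print(np.allclose(X_toy_centered.mean(axis=0), 0.0))   # checks that column means are zero
print(np.isclose(scores.var(), np.mean(scores ** 2)))  # checks that the scores are zero-mean
print(np.isclose(scores.var(), a @ C @ a))             # checks variance against a^T C a
```

Because `ndarray.var()` divides by n by default, the covariance matrix here is scaled by 1/n as well; a 1/(n-1) convention would change both sides by the same factor and leave the maximizing direction unchanged.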
In [7]:
# define score function
def f(X, a):
    return X @ a

Before we calculate the $a$ that maximizes variance, let's observe what the scalar projections on a random direction look like.
In [8]:
# generate a random direction and normalize the vector
a_rand = np.random.rand(d)
a_rand = a_rand / np.linalg.norm(a_rand)
# compute associated scores along a_rand
scores_rand = f(X, a_rand)
# visualize the scores along a_rand, coloring them by party affiliation
for aff, marker, _ in markers:
    plt.scatter(
        scores_rand[np.array(affiliations) == aff],
        np.zeros_like(scores_rand[np.array(affiliations) == aff]),
        c=aff,
        marker=marker,
        edgecolors="black",
        linewidth=0.5,
        label=aff
    )
plt.legend()
plt.title('projections along random direction a_rand')
plt.xlabel('$\\langle x_i, a\\rangle$')
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
plt.show()
print('variance along random direction a_rand: ', scores_rand.var())

variance along random direction a_rand: 9.267454390893336

Note here that projecting along the random vector a_rand does not explain much variance at all: data points are clustered together and intermixed across parties. It is clear that this ...
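To put the variance along a random direction in context, here is a hedged sketch on synthetic data (the matrix `Z`, the number of random directions, and the 1/n covariance normalization are assumptions for illustration). It compares the variance of the scores along many random unit-norm directions with the largest eigenvalue of the covariance matrix, which bounds the variance along any unit-norm direction from above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centered data: 50 points in 10 dimensions with unequal spread per axis.
Z = rng.standard_normal((50, 10)) * np.linspace(3.0, 0.5, 10)
Z = Z - Z.mean(axis=0)

# Covariance with 1/n normalization, matching the default of ndarray.var().
C = Z.T @ Z / Z.shape[0]
top_eigenvalue = np.linalg.eigvalsh(C)[-1]  # eigvalsh returns eigenvalues in ascending order

# Variance of the scores along 1000 random unit-norm directions.
directions = rng.standard_normal((1000, 10))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
variances = (Z @ directions.T).var(axis=0)

print("largest variance over random directions:", variances.max())
print("largest eigenvalue of C:", top_eigenvalue)
```

In a sketch like this, random directions typically fall well short of the bound, which is one way to see why a randomly chosen direction is unlikely to separate the data well.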
In [9]:
################################################################################
### TODO: Calculate a_1, the first principal component of X.
# Hint: The PCA package imported from sklearn.decomposition will be useful here,
# in particular the function pca.fit(). What should the dimensions of a_1 be?

### end TODO
################################################################################

a_1 = a_1 / np.linalg.norm(a_1)
# compute and visualize the scores along a_1
scores_a_1 = f(X, a_1)
for aff, marker, _ in markers:
    plt.scatter(
        scores_a_1[np.array(affiliations) == aff],
        np.zeros_like(scores_a_1[np.array(affiliations) == aff]),
        c=aff,
        marker=marker,
        edgecolors="black",
        linewidth=0.5,
        label=aff
    )
plt.legend()
plt.title('projections along first principal component a_1')
plt.xlabel('$\\langle x_i, a \\rangle$')
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
plt.show()
print('variance along first principal component: ', scores_a_1.var())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 9
      1 ################################################################################
      2 ### TODO: Calculate a_1, the first principal component of X.
      3 # Hint: The PCA package imported from sklearn.decomposition will be useful here,
   (...)
      6 ### end TODO
      7 ################################################################################
----> 9 a_1 = a_1 / np.linalg.norm(a_1)
     10 # compute and visualize the scores along a_1
     11 scores_a_1 = f(X, a_1)

NameError: name 'a_1' is not defined

If you computed $a_1$ correctly, you should observe that the variance is much higher than along the random direction, and that the blue and red dots are now separated into two clusters. This makes sense: the first principal component is the direction along which the data varies most, and that is often along party lines. You just found a mathematical model for partisanship!
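The claim that the direction of largest variance often lines up with group structure can be illustrated on synthetic data. The sketch below is an illustration under stated assumptions rather than the assignment's solution: it builds a hypothetical matrix `V` of two opposed voting blocs plus noise, and it uses NumPy's SVD instead of the sklearn PCA class that the hint points to. The leading right singular vector of a centered matrix is a direction of maximum variance, and sklearn's PCA would be expected to return the same direction up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two blocs of 20 "senators" voting in opposite ways on 30 "bills", plus noise.
bloc_pattern = rng.choice([-1.0, 1.0], size=30)
signs = np.repeat([1.0, -1.0], 20)
V = signs[:, None] * bloc_pattern[None, :] + 0.3 * rng.standard_normal((40, 30))
V = V - V.mean(axis=0)  # center each column

# Thin SVD of the centered matrix; rows of Vt are orthonormal directions in bill space.
_, singular_values, Vt = np.linalg.svd(V, full_matrices=False)
direction = Vt[0]  # unit-norm direction associated with the largest singular value

scores = V @ direction
print("norm of direction:", np.linalg.norm(direction))
print("variance along direction:", scores.var())
print("sigma_1^2 / n:", singular_values[0] ** 2 / V.shape[0])
print("blocs separated by sign:", np.all(np.sign(scores[:20]) == -np.sign(scores[20:])))
```

The variance along the leading direction equals the squared largest singular value divided by the number of rows, which ties the SVD view to the empirical variance being maximized.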