1_senator_pca - Jupyter Notebook

School: University of California, Berkeley
Subject: Computer Science
Date: Feb 20, 2024
Uploaded by DeanWombat2487

PCA and senate voting data

In this problem, we are given the data matrix $X$ with entries in $\{-1, 0, 1\}$, where each row corresponds to a senator and each column to a bill. We first import this data, print some relevant values, and normalize it as necessary to ready it for further computation.

To run this code, you'll need a number of standard Python libraries, all of which can be installed with a standard Python package manager. We highly recommend using a virtual environment (https://realpython.com/python-virtual-environments-a-primer/) for this class and in general. Lastly, ensure that all data files (senator_pca_data_matrix.csv and senator_pca_politician_labels.txt) are located in the same folder as the notebook.

Places you will need to modify this code are enclosed in a TODO block. You should not need to modify code outside these blocks to complete the problems. Questions that you are expected to answer in text are marked in red. For solution files, solutions will be presented in blue.

In [1]:
    # import the necessary packages for data manipulation, computation and PCA
    import pandas as pd
    import numpy as np
    import scipy as sp
    from numpy import linalg as LA
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    %matplotlib inline
    np.random.seed(7)

In [2]:
    # import the data matrix
    senator_df = pd.read_csv('senator_pca_data_matrix.csv')
    affiliation_file = open('senator_pca_politician_labels.txt', 'r')
    affiliations = [line.split('\n')[0].split(' ')[1] for line in affiliation_file]
    X = np.array(senator_df.values[:, 3:].T, dtype='float64')  # transpose so that rows are senators
    print('X.shape: ', X.shape)
    n = X.shape[0]  # number of senators
    d = X.shape[1]  # number of bills

    # this is just used for plotting, feel free to ignore
    assert set(affiliations) == {"Red", "Blue", "Yellow"}
    # assign a marker and hatch to each affiliation
    markers = [("Red", "o", "/"), ("Blue", "^", "-"), ("Yellow", "D", "+")]

X.shape: (100, 542)

We observe that the number of rows, $n$, is the number of senators and is equal to 100. The number of columns, $d$, is the number of bills and is equal to 542.
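The loading step does two things that are easy to miss: it takes the second space-separated token of each label line as the affiliation, and it transposes the vote table so that senators become rows. The sketch below reproduces both on made-up data. The label-file layout, the senator names, and the tiny vote table are assumptions for illustration only; the real files are not shown here.

```python
import io
import numpy as np

# Hypothetical label file contents (the real file layout is an assumption).
fake_label_file = io.StringIO("Senator_A Red\nSenator_B Blue\nSenator_C Yellow\n")

# Same parsing expression as the notebook: drop the trailing newline,
# then keep the second space-separated token of each line.
affiliations = [line.split("\n")[0].split(" ")[1] for line in fake_label_file]

# Hypothetical vote table stored as bills-by-senators (2 bills, 3 senators).
votes_by_bill = np.array([[1.0, -1.0, 0.0],
                          [1.0, 1.0, -1.0]])

# Transposing makes each row a senator and each column a bill.
X_toy = votes_by_bill.T
n_toy, d_toy = X_toy.shape  # number of senators, number of bills

print(affiliations, X_toy.shape)
```

With this layout, `X_toy[i]` holds every vote cast by senator `i`, and `X_toy[:, j]` holds every senator's vote on bill `j`.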

In [3]:
    # print an example row of the data matrix
    typical_row = X[0]
    print(typical_row.shape)
    print(typical_row)

(542,)
[ 1. 1. 1. -1. -1. 1. 1. 1. 1. -1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. -1. -1. -1. 1. 1. -1. -1. -1. -1. 1. 1. 1. -1. 1. 1. -1. 1. 1. -1. 1. 1. 1. 1. -1. 1. -1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 0. -1. 1. 1. 1. -1. -1. 1. 1. -1. -1. 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. -1. -1. 1. 1. 1. -1. -1. -1. -1. -1. -1. 1. -1. 1. 1. -1. -1. -1. 1. -1. 1. -1. 1. 0. 0. 1. 1. -1. 1. 1. -1. 1. 1. -1. 1. -1. -1. 1. 1. 1. 1. 0. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 0. 1. 0. -1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. -1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. -1. 1. -1. 1. -1. -1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. -1. 1. -1. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. 1. 1. -1. 1. -1. -1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. -1. -1. 1. -1. 1. 1. 1. 1. 1. -1. 1. -1. 1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. -1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. -1. -1. 0. 0. 0. 0. 0. 0. 0. 1. 1. -1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. -1. 1. 1. 1. 1. -1. -1. 1. 1. 1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. 1. 1. 1. -1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. -1. -1. 1. 1. 1. 1. -1. -1. 1. 1. -1. 1. 1. 1. 1. 1. 1.]

A row of $X$ consists of 542 entries, each -1 (senator voted against), 1 (senator voted for), or 0 (senator abstained), one for each bill.

In [4]:
    # print an example column of the data matrix
    typical_column = X[:, 0]
    print(typical_column.shape)
    print(typical_column)

(100,)
[ 1. 1. 1. 1. 1. 1. 1. -1. 1. -1. 1. -1. 1. -1. -1. -1. 1. 1. -1. 1. 1. -1. 1. -1. 1. 1. 1. -1. -1. 1. 1. 1. -1. 1. 1. 1. -1. -1. -1. -1. 1. -1. -1. 1. 1. -1. -1. -1. -1. -1. 1. 1. -1. -1. 1. 1. -1. -1. -1. -1. -1. 1. 1. 1. 1. 1. -1. -1. -1. 1. -1. -1. 1. -1. -1. 1. 1. 1. -1. -1. -1. 1. 1. -1. 1. -1. 1. 1. 1. -1. -1. -1. -1. -1. 1. 1. 1. -1. -1. -1.]

A column of $X$ consists of 100 entries in {-1, 0, 1}, one for each senator that voted on the bill.

In [5]:
    # compute the mean vote on each bill
    X_mean = np.mean(X, axis=0)
    plt.plot(X_mean)
    plt.title('means of each column of X')
    plt.xlabel('column/bill')
    plt.ylabel('mean vote')
    plt.show()

We observe that the mean of the columns is not zero, so we center the data by subtracting the mean of each bill's vote from its respective column.

In [6]:
    # center the data matrix
    X_original = X.copy()  # save a copy for part (d) and (e)
    X = X - np.mean(X, axis=0)

a) Maximizing the empirical variance

In this problem, you are asked to find a unit-norm vector $a$ maximizing the empirical variance of the scores. We first provide a function to calculate the scores, $f(X, a) = Xa$.
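To make the centering step and the quantity being maximized concrete, here is a minimal sketch on a small made-up vote matrix. The matrix `X_toy`, the direction `a`, and the covariance name `C` are hypothetical and not part of the notebook. It checks that, once the columns are centered, the variance of the scores along a unit vector equals the quadratic form $a^\top C a$, where $C$ is the empirical covariance normalized by the number of rows (matching the default of `np.var`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vote matrix: 6 senators, 4 bills, entries in {-1, 0, 1}.
X_toy = rng.integers(-1, 2, size=(6, 4)).astype(float)

# Center each column (bill) by subtracting its mean vote.
X_centered = X_toy - X_toy.mean(axis=0)

# An arbitrary unit-norm direction in bill space.
a = np.ones(4) / np.linalg.norm(np.ones(4))

# Scores are the scalar projections of each senator onto a.
scores = X_centered @ a

# Empirical covariance with 1/n normalization (same convention as np.var).
C = X_centered.T @ X_centered / X_centered.shape[0]

var_direct = scores.var()     # variance of the scores
var_quadratic = a @ C @ a     # the same number written as a quadratic form

print(var_direct, var_quadratic)
```

Writing the variance as $a^\top C a$ is what turns "find the direction of largest variance" into an eigenvalue problem for $C$.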


In [7]:
    # define score function
    def f(X, a):
        return X @ a

Before we calculate the $a$ that maximizes variance, let's observe what the scalar projections on a random direction look like.

In [8]:
    # generate a random direction and normalize the vector
    a_rand = np.random.rand(d)
    a_rand = a_rand / np.linalg.norm(a_rand)

    # compute associated scores along a_rand
    scores_rand = f(X, a_rand)

    # visualize the scores along a_rand, coloring them by party affiliation
    for aff, marker, _ in markers:
        mask = np.array(affiliations) == aff
        plt.scatter(
            scores_rand[mask],
            np.zeros_like(scores_rand[mask]),
            c=aff,
            marker=marker,
            edgecolors="black",
            linewidth=0.5,
            label=aff
        )
    plt.legend()
    plt.title('projections along random direction a_rand')
    plt.xlabel('$\\langle x_i, a\\rangle$')
    cur_axes = plt.gca()
    cur_axes.axes.get_yaxis().set_visible(False)
    plt.show()
    print('variance along random direction a_rand: ', scores_rand.var())

variance along random direction a_rand: 9.267454390893336

Note here that projecting along the random vector does not explain much variance at all: data points are clustered together and intermixed across parties. It is clear that this …
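One way to see why a random direction tends to explain little variance: for centered data, the variance along any unit vector can never exceed the largest eigenvalue of the empirical covariance matrix. The sketch below checks this numerically on synthetic data. The data, the sample sizes, and all variable names are assumptions for illustration. It also draws directions from a standard normal rather than `np.random.rand`, since uniform draws on [0, 1) only produce directions with nonnegative entries.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic centered data: 50 points in 20 dimensions (not the senate data).
X_toy = rng.normal(size=(50, 20))
X_toy = X_toy - X_toy.mean(axis=0)

# Largest eigenvalue of the empirical covariance (eigvalsh sorts ascending).
C = X_toy.T @ X_toy / X_toy.shape[0]
lam_max = np.linalg.eigvalsh(C)[-1]

# Variance of the scores along many random unit directions.
variances = []
for _ in range(200):
    a = rng.normal(size=20)        # Gaussian draw, direction can have any sign pattern
    a = a / np.linalg.norm(a)      # normalize to unit length
    variances.append((X_toy @ a).var())

max_random = max(variances)
print(max_random, lam_max)
```

In high dimensions a random direction is typically far from the top eigenvector, so its variance is usually well below this upper bound.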

In [9]:
    ################################################################################
    ### TODO: Calculate a_1, the first principal component of X.
    # Hint: The PCA package imported from sklearn.decomposition will be useful here,
    # in particular the function pca.fit(). What should the dimensions of a_1 be?

    ### end TODO
    ################################################################################

    a_1 = a_1 / np.linalg.norm(a_1)
    # compute and visualize the scores along a_1
    scores_a_1 = f(X, a_1)

    for aff, marker, _ in markers:
        mask = np.array(affiliations) == aff
        plt.scatter(
            scores_a_1[mask],
            np.zeros_like(scores_a_1[mask]),
            c=aff,
            marker=marker,
            edgecolors="black",
            linewidth=0.5,
            label=aff
        )
    plt.legend()
    plt.title('projections along first principal component a_1')
    plt.xlabel('$\\langle x_i, a \\rangle$')
    cur_axes = plt.gca()
    cur_axes.axes.get_yaxis().set_visible(False)
    plt.show()
    print('variance along first principal component: ', scores_a_1.var())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 9
      1 ################################################################################
      2 ### TODO: Calculate a_1, the first principal component of X.
      3 # Hint: The PCA package imported from sklearn.decomposition will be useful here,
   (...)
      6 ### end TODO
      7 ################################################################################
----> 9 a_1 = a_1 / np.linalg.norm(a_1)
     10 # compute and visualize the scores along a_1
     11 scores_a_1 = f(X, a_1)

NameError: name 'a_1' is not defined

If you computed $a_1$ correctly, you should observe that the variance is much higher than along the random direction a_rand, and that the blue and red dots are now spread into two clusters. This makes sense: the first principal component is the direction along which the data varies most, and that is often along party lines. You just found a mathematical model for partisanship!
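As an illustration of what a first principal component looks like, here is a sketch on synthetic two-party data that uses NumPy's SVD instead of the `sklearn` route the hint points to. The party-line model, the noise level, and every variable name below are assumptions made for the example, not the notebook's solution or data. For centered data, the first right singular vector is the unit direction of largest variance, and the variance along it equals the squared top singular value divided by the number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_party, d_toy = 20, 30

# Hypothetical "party line": the way party 1 tends to vote on each bill.
party_line = rng.choice([-1.0, 1.0], size=d_toy)
# First 20 senators follow the party line, last 20 vote the opposite way.
party_sign = np.repeat([1.0, -1.0], n_per_party)

# Each senator's votes are the party position plus some individual noise.
noise = 0.5 * rng.normal(size=(2 * n_per_party, d_toy))
X_toy = party_sign[:, None] * party_line[None, :] + noise
X_toy = X_toy - X_toy.mean(axis=0)  # center each column

# Rows of Vt are principal directions, sorted by decreasing singular value.
_, singular_values, Vt = np.linalg.svd(X_toy, full_matrices=False)
a_1 = Vt[0]                         # unit-norm vector of length d_toy

scores = X_toy @ a_1
var_a_1 = scores.var()
top_eigenvalue = singular_values[0] ** 2 / X_toy.shape[0]

print(var_a_1, top_eigenvalue)
```

In this toy setting the scores of the two synthetic parties land on opposite sides of zero, which is the same two-cluster picture the notebook describes for the senate data. With scikit-learn, a fitted `PCA` object exposes the same directions through its `components_` attribute, up to a possible sign flip.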
