problem17
University of South Carolina, Course B420 (Computer Science), Dec 6, 2023
Problem 17: Spectral graph partitioning
Version 1.3
Changelog:
- 1.3 (Dec 5, 2019): Added another, more informative error message.
- 1.2 (Dec 3, 2019): Added more hints and more informative error messages.
- 1.1 (Dec 2, 2019): Added more examples; no changes to code or test cells.
- 1.0: Initial version.
In this problem, you'll consider the data mining task of "clustering" a graph. That is, given a graph or network of relationships, can you identify distinct
communities within it? This problem assesses your Pandas and Numpy skills.
Exercises.
There are six exercises, numbered 0-5, worth a total of ten (10) points. However, only Exercises 0-4 require you to write code. If you have done
them correctly, then Exercise 5, which consists only of a hidden test cell, should pass when submitted to the autograder.
Regarding dependencies and partial credit:
- Exercise 1 (1 point) depends on a correct Exercise 0 (2 points).
- Exercises 2 (2 points) and 3 (2 points) are independent; neither depends on Exercise 0 or 1.
- Exercise 4 (2 points) relies on all earlier exercises.
- Exercise 5 (1 point) depends on Exercise 4.

As always, it is possible that you will pass Exercises 0-4 but not pass Exercise 5 if there is a subtle bug that the test cells happen not to catch, so be prepared for that possibility!
Setup

The main modules you'll need are pandas, numpy, and scipy. The rest are auxiliary functions for loading data, plotting results, and test code support. Start by running the following cell.
# Main modules you'll need:
import numpy as np
import scipy as sp
import pandas as pd
from pandas import DataFrame

# Support code for data loading, code testing, and visualization:
import sys
sys.path.insert(0, 'resource/asnlib/public')
from cse6040utils import tibbles_are_equivalent, pandas_df_to_markdown_table, hidden_cell_template_msg, cspy

from matplotlib.pyplot import figure, subplots
%matplotlib inline

from networkx.convert_matrix import from_pandas_edgelist
from networkx.drawing.nx_pylab import draw
from networkx import DiGraph
from networkx.drawing import circular_layout

# Location of input data:
def dataset_path(base_filename):
    return f"resource/asnlib/publicdata/{base_filename}"
Background: Relationship networks and partitioning

Suppose we have data on the following five people, stored in the `nodes` data frame (run the next cell):
nodes = DataFrame({'name': ['alice', 'bob', 'carol', 'dave', 'edith'],
                   'age': [35, 18, 27, 57, 41]})
nodes
Also suppose we have some information on their relationships, as might be captured in a social network or database of person-to-person transactions. In particular, if some person `a` follows another person `b`, we say `a` is the source and `b` is the target. For the people listed above, suppose these relationships are stored in a data frame named `edges` (run this code cell):
edges = DataFrame({'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
                   'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol']})
edges
We can visualize these relationships as a directed graph or directed network, where the people are shown as nodes (circles) and the follows-relationships are shown as edges from source to target. Run the next code cell to see the relationships in our example.
G = from_pandas_edgelist(edges, source='source', target='target', create_using=DiGraph())
figure(figsize=(4, 4))
draw(G, arrows=True, with_labels=True, pos=circular_layout(G),
     node_size=1200, node_color=None, font_color='w',
     width=2, arrowsize=20)
Observation 0: From the `edges` data frame, recall that alice follows edith; therefore, there is an arrow (or edge) pointing from alice to edith. Since alice and dave both follow one another, there is a double-headed arrow between them.

Observation 1: One can arguably say there are two distinct "groups" in this picture. One group consists of alice, dave, and edith, who have at least one one-way follow-relationship among all pairs of them. Similarly, bob and carol have a one-way relationship between them. However, there is only one relationship between someone from the first group and someone from the second group.

In this problem, we might ask a data mining question, namely: can we automatically identify these clusters or groups, given only the known relationships? The method you will implement is known as spectral graph partitioning or spectral graph clustering, which is formulated as a linear algebra problem.
Exercises

Exercise 0 (2 points -- 0.5 exposed, 1.5 hidden). For our analysis, we won't care whether `a` follows `b` or `b` follows `a`, only that there is some interaction between them. To do so, let's write some code to "symmetrize" the edges. That is, if there is a directed edge from node `a` to node `b`, then symmetrizing will ensure there is also a directed edge from `b` to `a`, unless one already exists. (Recall Notebook 10!)
For example, a symmetrized version of the `edges` data frame from above would look like the following:

source  target
alice   dave
alice   edith
bob     carol
carol   bob
carol   dave
dave    alice
dave    carol
dave    edith
edith   alice
edith   dave
Complete the function, symmetrize_df(edges), below, so that it symmetrizes `edges`. Assume that `edges` is a pandas `DataFrame` with source and target columns. Your function should return a new pandas `DataFrame` with the edges symmetrized. Your function should also reset the index, so that the output is a proper "tibble."

Note 0: The order of the edges in the output does not matter.

Note 1: You may assume the input data frame has columns named 'source' and 'target'.

Note 2: You should drop any duplicate edges. In the example, the edges ('dave', 'alice') and ('alice', 'dave') already exist in the input. Therefore, observe that they appear in the output, but only once each.

Note 3: Your function should work even if there is a "self-edge," i.e., an edge from a node to itself. The example above does not contain such a case, but the hidden test might check it.
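Before the pandas-based version, the symmetrization idea can be sketched with plain Python sets; this is only an illustration of the logic (the edge list here is a made-up fragment), not the required `DataFrame` solution:

```python
# Symmetrize a list of directed edges using a set, which also
# collapses duplicate edges automatically:
edges_list = [('alice', 'dave'), ('dave', 'alice'), ('bob', 'carol')]
sym = set()
for a, b in edges_list:
    sym.add((a, b))  # keep the original direction
    sym.add((b, a))  # add the reversed direction

# alice<->dave appears once per direction; bob->carol gains carol->bob:
assert sym == {('alice', 'dave'), ('dave', 'alice'),
               ('bob', 'carol'), ('carol', 'bob')}
```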
def symmetrize_df(edges):
    assert 'source' in edges.columns
    assert 'target' in edges.columns
    from pandas import concat
    edges_transpose = edges.rename(columns={'source': 'target', 'target': 'source'})
    edges_all = concat([edges, edges_transpose], sort=False) \
                .drop_duplicates() \
                .reset_index(drop=True)
    return edges_all

# Demo of your function:
symmetrize_df(edges)
# Test cell: `ex0_symmetrize_df__visible` (1 point)
edges_input = DataFrame({'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
                         'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol']})
edges_output = symmetrize_df(edges_input)

# The following comment block suggests there is hidden content in this cell, but there really isn't.
###
### AUTOGRADER TEST - DO NOT REMOVE
###

edges_output_soln = pd.read_csv(dataset_path('symmetrize_soln.csv'))
assert tibbles_are_equivalent(edges_output, edges_output_soln), \
    "Your solution does not produce the expected output."

print("\n(Passed.)")
# Test cell: `ex0_symmetrize_df__hidden`
print(hidden_cell_template_msg())
###
### AUTOGRADER TEST - DO NOT REMOVE
###
Exercise 1 (1 point). Given the nodes and edges of a graph as pandas `DataFrame`s (i.e., like `nodes` and `edges` above), complete the function below so that it returns a corresponding Scipy sparse matrix in coordinate (COO) format.

That is, your function should do the following:
- The sparse matrix will hold source-to-target relationships. A value of 1.0 (floating-point) in row i and column j will indicate that person i is connected to person j.
- Use `nodes.index` to determine the numerical ID (row or column) of each node. For example, Alice's ID is 0, Bob's ID is 1, Carol's ID is 2, and so on. Therefore, in the sparse matrix, Alice will correspond with row 0 and column 0, Bob with row 1 and column 1, and so on.
- Ensure that the dimensions of your sparse matrix are exactly n-by-n, where n is the largest ID plus one. In the example, Edith has the largest ID of 4; therefore, the sparse matrix should have dimensions of 5-by-5.

For example, recall that Alice and Dave are connected by an edge. Therefore, the sparse matrix should have an entry of 1.0 in the (0, 3) position, since Alice's ID is 0 and Dave's is 3.

Hint 0: To help you get started, we've provided a couple of lines that (a) determine the output matrix's dimension, `n`; and (b) create a dictionary, `name_to_id`, which you can use to look up a node's ID given its name.

Hint 1: Consider combining Hint 0(b) with the pandas `.replace()` function, which, as you will recall from an earlier notebook, works on either `DataFrame` or `Series` objects.

Hint 2: You don't need to handle symmetry in your function. That is, if `edges` was not symmetrized, then there may be a nonzero at location (i, j) but not (j, i). Both should appear only if both are present in `edges`.

Hint 3: Recall Notebook 11.
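As a warm-up for Hint 1, here is a standalone sketch of how `Series.replace` maps names to integer IDs via a dictionary; the series and mapping here are just illustrative:

```python
import pandas as pd

s = pd.Series(['alice', 'dave', 'alice'])
demo_name_to_id = {'alice': 0, 'dave': 3}  # hypothetical name-to-ID mapping
ids = s.replace(demo_name_to_id)           # element-wise dictionary lookup
assert ids.tolist() == [0, 3, 0]
```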
def df_to_coo(nodes, edges):
    """Returns a Scipy coordinate sparse matrix, given an input graph in a data frame representation."""
    assert 'name' in nodes.columns
    assert 'source' in edges.columns
    assert 'target' in edges.columns
    from scipy.sparse import coo_matrix

    # Use this to construct your sparse matrix!
    n = nodes.index.max() + 1
    name_to_id = dict(zip(nodes['name'], nodes.index))

    values = [1.0] * len(edges)
    rows = edges['source'].replace(name_to_id)
    cols = edges['target'].replace(name_to_id)
    return coo_matrix((values, (rows, cols)), shape=(n, n))

nodes
# Demo cell: This function calls your function and draws a picture
# of the resulting sparse matrix, which you can use for debugging.
# In this case, it takes the original (unsymmetrized) `edges`
# data frame, symmetrizes it, and then calls your function. Thus,
# the sparse matrix you see should appear symmetric.
A = df_to_coo(nodes, symmetrize_df(edges))
figure(figsize=(4, 4))
cspy(A, cbar=False)
# Test cell: `ex1_df_to_coo`
def check1(n=5, k=None):
    assert n < 26, "Currently limited to a max dimension of 26"
    from numpy.random import randint
    from numpy import array, vectorize, issubdtype, floating
    def gen_name(c):
        return chr(ord('A') + c)
    nodes = DataFrame({'name': array([gen_name(c) for c in range(n)])})
    name_to_id = dict(zip(nodes['name'], nodes.index))
    if k is None:
        k = randint(0, 5*n)
    if k > 0:
        sources = vectorize(gen_name)(randint(n, size=k))
        targets = vectorize(gen_name)(randint(n, size=k))
    else:
        sources = []
        targets = []
    edges = symmetrize_df(DataFrame({'source': sources, 'target': targets}))
    edges_by_ids = set([(name_to_id[a], name_to_id[b])
                        for a, b in zip(edges['source'], edges['target'])])
    try:
        A = df_to_coo(nodes, edges)
        assert A.nnz == len(edges_by_ids), \
            f"Expected {len(edges_by_ids)} edges, but you produced {A.nnz}."
        assert issubdtype(A.dtype, floating), \
            f"Value type is `{A.dtype}`, which is incorrect."
        A_edge_set = set()
        for i, j, v in zip(A.row, A.col, A.data):
            assert v == 1.0, \
                f"All values should equal 1.0, but you have {v} at position ({i}, {j})."
            A_edge_set |= {(i, j)}
        assert edges_by_ids == A_edge_set, \
            f"Your matrix has incorrect edges in it:\n" \
            f"* Expected: {edges_by_ids}\n" \
            f"* You: {A_edge_set}\n"
    except:
        print("=== Failing test case ===")
        print("* Nodes:")
        display(nodes)
        print("* Edges:")
        display(edges)
        raise

for _ in range(10):
    check1()

# Check empty test case
check1(k=0)

print("(Passed.)")
Exercise 2 (2 points). Suppose you are given a vector (1-D Numpy array). Write a function that returns a pair of 1-D Numpy arrays: one containing the positions (i.e., indices or locations) of all positive entries (strictly greater than zero), and one containing all remaining positions (less than or equal to zero).

For example, if one runs

x = np.array([1, -1, -2, 2, -3, 3, 4, -4, -5, 5])
k_gtz, k_lez = partition_inds_by_sign(x)

then

assert (k_gtz == np.array([0, 3, 5, 6, 9])).all()  # Positive positions (> 0)
assert (k_lez == np.array([1, 2, 4, 7, 8])).all()  # Non-positive (<= 0)

Note 0: Your function must return a pair of 1-D Numpy arrays. Each 1-D array must be truly 1-D. That is, it must have .ndim == 1; a 2-D array where one of the dimensions equals 1 will not be considered correct.

Note 1: Your function should not be too slow. The test cell first checks a bunch of small examples for correctness; it then runs a larger test for efficiency. As usual, the autograder must complete both for you to get any credit.

Hint: See numpy.where. If you use it, pay close attention to its return value to ensure you extract the right information from it.
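Regarding the hint about `numpy.where`'s return value: called with a single condition, it returns a tuple of index arrays, one per input dimension, so for a 1-D input you take element `[0]` of that tuple. A quick sketch:

```python
import numpy as np

x_demo = np.array([1, -1, -2, 2])
result = np.where(x_demo > 0)  # a 1-tuple of index arrays for 1-D input
idx = result[0]                # the actual 1-D index array
assert isinstance(result, tuple) and len(result) == 1
assert idx.ndim == 1 and idx.tolist() == [0, 3]
```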
def partition_inds_by_sign(x):
    from numpy import where
    K_pos = where(x > 0)[0]   # positions of strictly positive entries
    K_neg = where(x <= 0)[0]  # positions of non-positive entries
    return K_pos, K_neg

# Demo:
x = np.array([1, -1, -2, 2, -3, 3, 4, -4, -5, 5])
k_gtz, k_lez = partition_inds_by_sign(x)
print(f"x[{k_gtz}] == {x[k_gtz]}")  # Correct output: `x[[0 3 5 6 9]] == [1 2 3 4 5]`
print(f"x[{k_lez}] == {x[k_lez]}")  # Correct output: `x[[1 2 4 7 8]] == [-1 -2 -3 -4 -5]`
a = np.array([1, -1, -2, 2, -3, 3, 4, -4, -5, 5])
np.where(a > 0)[0]
# Test cell: `ex2_partition_inds_by_sign`
def check2(n=8):
    from random import randrange
    from numpy import array, issubdtype, integer, ndarray
    x = []
    kp = []
    kn = []
    for i in range(n):
        x.append(randrange(-10, 10))
        (kp if x[-1] > 0 else kn).append(i)
    x, kp, kn = array(x), array(kp, dtype=int), array(kn, dtype=int)
    try:
        k_gtz, k_lez = partition_inds_by_sign(x)
        assert isinstance(k_gtz, ndarray) and isinstance(k_lez, ndarray), \
            "You did not return Numpy arrays."
        assert issubdtype(k_gtz.dtype, integer) and issubdtype(k_lez.dtype, integer), \
            f"Both Numpy element types must be integers (yours are {k_gtz.dtype} and {k_lez.dtype})."
        assert (k_gtz == kp).all(), "Indices of positive values is incorrect."
        assert (k_lez == kn).all(), "Indices of non-positive values is incorrect."
    except:
        print("=== Error on test case! ===")
        print(f"Input: {x}")
        print(f"Your solution: {k_gtz}, {k_lez}")
        print(f"True solution: {kp}, {kn}")
        assert False, "Please double-check your approach on this test case."

print("Checking for correctness on small inputs...")
for _ in range(100):
    check2()

print("Checking for efficiency on a larger input...")
check2(2000000)

print("(Passed.)")
Permutation vectors. A permutation vector is a list of indices that corresponds to a rearrangement of the elements of another vector.

For example, above we determined where the positive elements are and where the non-positive elements are. If we concatenate these into a single vector, the result is a permutation vector that can be used to place all the positive elements first, followed by the non-positive elements. For example:

print(f"x == {x}")
k = np.concatenate((k_gtz, k_lez))
print(f"k == {k}")
y = x[k]  # permute!
print(f"y = x[k] == {y}")  # elements rearranged
Inverse permutations. It's sometimes useful to invert a permutation. In the previous example, suppose you are given y and want to recover x. The inverse permutation vector, k_inv, produces x = y[k_inv].

For example, here is a code snippet that creates an array, x, and then permutes it into a new array, y, using the permutation vector k:

x = np.array([1, -1, -2, 2, -3, 3, 4, -4, -5, 5])  # input
k = np.array([0, 3, 5, 6, 9, 1, 2, 4, 7, 8])       # some desired permutation of `x`
y = x[k]                                           # the permuted array
print(y)  # should print: `[ 1  2  3  4  5 -1 -2 -3 -4 -5]`

If we wish to recover x from y, we can use the following permutation, k_inv, which is the inverse permutation of k:

k_inv = np.array([0, 5, 6, 1, 7, 2, 3, 8, 9, 4])  # the inverse of permutation `k`
print(y[k_inv])  # should be the same as `x`

If it's helpful, here are those same arrays drawn as a picture:

[Figure: Permutations and their inverses]
Exercise 3 (2 points). Complete the function k_inv = invert_perm(k), below, so that it returns the inverse of the permutation vector k. For the preceding example:

invert_perm(k) == np.array([0, 5, 6, 1, 7, 2, 3, 8, 9, 4])

If it would be helpful, here is a picture of your task, namely, to calculate k_inv given k:

[Figure: Permutations and their inverses (2)]
def invert_perm(k):
    from numpy import ndarray, issubdtype, integer
    assert isinstance(k, ndarray), "Input permutation should be a Numpy array."
    assert issubdtype(k.dtype, integer), "Input permutation should have integer elements."
    assert (k >= 0).all(), "Input permutation should contain non-negative values."
    from numpy import empty, arange
    k_inv = empty(len(k), dtype=int)
    k_inv[k] = arange(len(k))  # scatter: position k[i] of the inverse records i
    return k_inv

# Demo
k_inv = invert_perm(k)
y = x[k]
print(f"x == {x}")
print(f"k == {k}")
print(f"y = x[k] == {y}")
print(f"k_inv == {k_inv}")
print(f"y[k_inv] == {y[k_inv]}")  # If all goes well, should equal `x`
assert (y[k_inv] == x).all(), "Demo failed!"
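As a sanity check on the scatter-assignment idiom used in `invert_perm`, note that it agrees with `np.argsort`, another common way to invert a permutation; here is a minimal standalone sketch:

```python
import numpy as np

k_demo = np.array([0, 3, 5, 6, 9, 1, 2, 4, 7, 8])
k_demo_inv = np.empty_like(k_demo)
k_demo_inv[k_demo] = np.arange(len(k_demo))  # scatter: position k[i] records i

# argsort of a permutation is its inverse:
assert (k_demo_inv == np.argsort(k_demo)).all()

# Permuting and then applying the inverse recovers the original array:
x_demo = np.array([1, -1, -2, 2, -3, 3, 4, -4, -5, 5])
assert (x_demo[k_demo][k_demo_inv] == x_demo).all()
```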
# Scratch work (exploring ways to invert a permutation):
k = np.array([0, 3, 5, 6, 9, 1, 2, 4, 7, 8])
np.array([(i, k) for i, k in enumerate(k)])

inds = np.arange(len(k))
[inds[e] for e in k]

k_inv = np.zeros_like(k)
np.arange(len(k))[k]
# Test cell: `ex3_invert_perm`
def check3(n=8):
    from numpy import arange, ndarray, issubdtype, integer
    from numpy.random import permutation
    try:
        x = arange(n)
        p = permutation(n)
        p_inv = invert_perm(p)
        assert isinstance(p_inv, ndarray) and (p_inv.ndim == 1), \
            "You did not return a 1-D Numpy array."
        assert issubdtype(p_inv.dtype, integer), \
            f"Numpy element type must be integer, not {p_inv.dtype}."
        y = x[p]
        z = y[p_inv]
        assert (x == z).all()
    except:
        print("=== Failed test case ===")
        print(f"Input: x == {x}")
        print(f"A random permutation: p == {p}")
        print(f"Permuted `x`: y == x[p] == {y}")
        print(f"Your inverse permutation: p_inv == {p_inv}")
        print(f"Result of y[p_inv] should be `x`, and yours is {z}")
        raise

for _ in range(100):
    check3()

print("(Passed.)")
Spectral partitioning (or clustering)

With the preceding building blocks, you are now ready to implement a technique known as spectral graph partitioning.

Without going into too many details, the technique proceeds as follows:
1. Form a special matrix called the graph Laplacian. This is a matrix L = D - A, where A is a symmetrized matrix of relationships like what you worked with above, and D is a diagonal matrix consisting of the degrees of every node. (Recall from an earlier notebook that the degree of a node is the number of edges that touch it.)
2. Compute the second smallest eigenvalue (in magnitude) and its corresponding eigenvector. This eigenvector is known as the Fiedler vector.
3. Use the signs of the components of the Fiedler vector to identify two groups of nodes. That is, form one group from the components with positive sign, and the other from the components with non-positive sign.

We will provide you with code to perform steps 1 and 2. In Exercise 4 below, you need to write code to carry out step 3.

Calculating the Fiedler vector. Here is the code to compute the Fiedler vector, given a (symmetrized) matrix of relationships, A. In particular, the function calc_fiedler_vector(A) returns the Fiedler vector. The cell below also prints the Fiedler vector for the example network from above.
def calc_degrees_matrix(A):
    from scipy.sparse import spdiags
    n = min(A.shape)
    return spdiags(A.sum(axis=1).reshape(n), diags=0, m=A.shape[0], n=A.shape[1])

def calc_fiedler_vector(A):
    from scipy.sparse.linalg import eigsh
    D = calc_degrees_matrix(A)
    L = D - A  # Form Laplacian
    _, V = eigsh(L, k=2, which='SM')  # Determine 2nd smallest eigenpair
    return V[:, 1]

# Demo:
v = calc_fiedler_vector(A)
print(v)  # e.g.: [ 0.41931948 -0.702415   -0.3379981   0.20177414  0.41931948]
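To make steps 1-3 concrete, here is a small self-contained sketch that forms the Laplacian of the five-node example with dense Numpy arrays and recovers the two groups from the sign of the Fiedler vector. It uses `numpy.linalg.eigh` instead of the sparse solver, and, as noted below, the eigenvector's overall sign may flip, so either group may come out "positive":

```python
import numpy as np

# Symmetrized adjacency matrix of the example
# (alice=0, bob=1, carol=2, dave=3, edith=4):
A_dense = np.array([[0, 0, 0, 1, 1],
                    [0, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [1, 0, 1, 0, 1],
                    [1, 0, 0, 1, 0]], dtype=float)
D_dense = np.diag(A_dense.sum(axis=1))  # node degrees on the diagonal
L_dense = D_dense - A_dense             # graph Laplacian, L = D - A

# Basic Laplacian properties: symmetric, and every row sums to zero:
assert (L_dense == L_dense.T).all()
assert np.allclose(L_dense.sum(axis=1), 0)

w, V = np.linalg.eigh(L_dense)  # eigenvalues returned in ascending order
fiedler = V[:, 1]               # eigenvector of the 2nd-smallest eigenvalue

# The sign pattern splits {alice, dave, edith} from {bob, carol}:
group = set(np.where(fiedler > 0)[0].tolist())
assert group in ({0, 3, 4}, {1, 2})
```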
Suppose we use the Fiedler vector to create groups, partitioning by sign using your code from Exercise 2. Observe what happens.

k_gtz, k_lez = partition_inds_by_sign(v)
print("=== Group 0 ===")
display(nodes.loc[k_gtz])
print("\n=== Group 1 ===")
display(nodes.loc[k_lez])
Notice that alice, dave, and edith appear in one group, and bob and carol in the other! That corresponds with what we said was our intuitive idea of the "natural" clusters in the network.
Hint for the next (and last) exercise. Given a coordinate sparse matrix, A, recall that you can access its row indices, column indices, and values using the following attributes of A (run the code cell).

print("COO row indices:", A.row)
print("COO column indices:", A.col)
print("COO values:", A.data)
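If it helps, here is a standalone sketch of those attributes on a tiny hand-built COO matrix; the index and value arrays here are arbitrary examples:

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
vals = np.array([1.0, 2.0, 3.0])
A_tiny = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# The index/value arrays passed to the constructor come back out unchanged:
assert A_tiny.row.tolist() == [0, 1, 2]
assert A_tiny.col.tolist() == [2, 0, 1]
assert A_tiny.data.tolist() == [1.0, 2.0, 3.0]
```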
Exercise 4 (2 points). Given a coordinate sparse matrix A and a Fiedler vector v, complete the function reorder_by_fiedler(A, v) below so that it does the following:
- Uses the Fiedler vector v to create a permutation vector that groups positive values first, followed by non-positive values.
- Creates a new coordinate sparse matrix, A_perm, which is A with its rows and columns permuted according to the permutation vector.

Permuting the rows and columns means simultaneously reordering them according to a permutation vector. You can do so by simply "relabeling" the indices in just the right way. In particular, let's call the permutation vector k. Then, in the permuted matrix A_perm, the row index that was k[i] in A should become row index i in A_perm. Similarly, any column index k[j] in A should become column index j in A_perm.

For example, suppose the Fiedler vector of the example matrix A above had [0, 3, 4] as positive components and [1, 2] as non-positive, meaning the concatenated permutation vector is [0, 3, 4, 1, 2]. Then, reading the permutation vector from left-to-right, the row index that was 0 in A should remain 0 in the permuted matrix A_perm; the row index that was 3 in A should be changed to 1 in A_perm; the row index that was 4 in A should become 2 in A_perm; the row index that was 1 should become 3; and the final row index of 2 should become 4. An identical relabeling should be applied to the column indices.

[Figure: spy plots of a matrix symmetrically reordered by the permutation vector `[0, 3, 4, 1, 2]`.]

Notice how the nonzeros are grouped differently, with some clustering happening along the diagonal blocks after permutation. That's a visual signal that we've identified some clusters.

Note: The eigensolver used to compute the Fiedler vector can sometimes "flip" the signs. The results are essentially the same, but it may mean that the permuted result will have the diagonal blocks you see above reversed.

Hint: The example you see above is tricky, because it turns out the permutation vector [0, 3, 4, 1, 2] is its own inverse! In developing your solution, you'll need to think carefully about whether you should use a permutation vector, its inverse, both, or neither.
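The relabeling rule above can be sketched directly with index arithmetic; this is only an illustration on a few hypothetical COO row indices, not the full solution:

```python
import numpy as np

k_ex = np.array([0, 3, 4, 1, 2])      # permutation from the example above
k_ex_inv = np.empty_like(k_ex)
k_ex_inv[k_ex] = np.arange(len(k_ex)) # its inverse (here, k_ex is its own inverse)
assert (k_ex_inv == k_ex).all()

old_rows = np.array([0, 3, 4, 1])     # hypothetical COO row indices
new_rows = k_ex_inv[old_rows]         # relabel: old index k[i] becomes i
assert new_rows.tolist() == [0, 1, 2, 3]
```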
def reorder_by_fiedler(A, v):
    from scipy.sparse import coo_matrix
    assert isinstance(A, type(coo_matrix((1, 1))))
    # Permutation that lists positive components of `v` first:
    k_gtz, k_lez = partition_inds_by_sign(v)
    k = np.concatenate((k_gtz, k_lez))
    k_inv = invert_perm(k)
    # Relabel indices: old index k[i] becomes new index i:
    rows_perm = k_inv[A.row]
    cols_perm = k_inv[A.col]
    return coo_matrix((A.data, (rows_perm, cols_perm)), shape=A.shape)

A_perm = reorder_by_fiedler(A, v)
figure(figsize=(4, 4))
cspy(A_perm, cbar=False)
# Test cell: `ex4_reorder_by_fiedler`
def check4(n=8):
    from random import randrange
    from numpy import array, arange, where
    import scipy.sparse as sparse
    v = []
    kp = []
    kn = []
    for i in range(n):
        v.append(randrange(-10, 10))
        (kp if v[-1] > 0 else kn).append(i)
    v, p = array(v), array(kp + kn)
    A = sparse.random(n, n, 0.25)
    A.data = arange(1, len(A.data)+1)
    try:
        A_perm = reorder_by_fiedler(A, v)
        assert isinstance(A_perm, type(A)), \
            f"You returned an object of type {type(A_perm)}, not {type(A)}."
        assert A.shape == A_perm.shape, \
            f"You returned a sparse matrix with dimensions {A_perm.shape} instead of {A.shape}."
        assert A.nnz == A_perm.nnz, \
            f"Your result has {A_perm.nnz} nonzeros instead of {A.nnz}."
        for i, j, a_perm_i_j in zip(A_perm.row, A_perm.col, A_perm.data):
            i0, j0 = p[i], p[j]
            k0 = where((A.row == i0) & (A.col == j0))[0]
            a_i0_j0 = A.data[k0]
            assert a_perm_i_j == a_i0_j0, \
                f"Entry ({i}, {j}) of your solution does not appear to be correct."
    except:
        print("=== Error on test case! ===")
        print(f"Input matrix:")
        print(f"* rows = {A.row}")
        print(f"* cols = {A.col}")
        print(f"* vals = {A.data}")
        print(f"Fiedler vector: {v}")
        print(f"Permutation: {p}")
        assert False, "Please double-check your approach on this test case."

print("Checking for correctness...")
for _ in range(100):
    check4()

print("(Passed.)")
Application: Political blogs analysis

As a final step, let's try this method on a real dataset. If you already did the above exercises correctly, you do not have to write more code. However, there is one more test cell that must pass to get full points on this problem; see Exercise 5, below.

Political blogs dataset. The dataset consists of political blogs and their links to one another. Let's load these data and take a look.
blog_nodes = pd.read_csv(dataset_path('polblogs3_nodes.csv'))
blog_edges = pd.read_csv(dataset_path('polblogs3_edges.csv'))

display(blog_nodes.head())
print(f"==> The `blog_nodes` data frame lists {len(blog_nodes)} blog sites.")

display(blog_edges.head())
print(f"==> The `blog_edges` data frame contains {len(blog_edges)} directed links " \
      f"among sites (~ {len(blog_edges)/len(blog_nodes):.1f} links/site).")
# Reported output: 754 blog sites and 25042 directed links (~ 33.2 links/site).
The blog_nodes data frame above has a column named 'pol', which is the political affiliation of the blog: a +1 means a conservative (politically right-leaning) site, and a -1 means a liberal (left-leaning) site.
Let's see what happens if we partition the graph corresponding to these sites and their links to one another.
A_blogs = df_to_coo(blog_nodes, blog_edges)
v_blogs = calc_fiedler_vector(A_blogs)
A_blogs_perm = reorder_by_fiedler(A_blogs, v_blogs)

f, (ax1, ax2) = subplots(1, 2, figsize=(10, 5))
cspy(A_blogs, ax=ax1, cbar=False, s=1)
cspy(A_blogs_perm, ax=ax2, cbar=False, s=1)
If everything is working, you should see two spy plots. The left-hand plot does not appear to have much structure; however, the right-hand plot, which is the same graph permuted by the Fiedler vector, should have some strong block-diagonal structure, indicating the presence of clusters.

Indeed, let's see which sites correspond to positive entries of the Fiedler vector, and which correspond to negative entries.

kp_blogs, kn_blogs = partition_inds_by_sign(v_blogs)
len(kp_blogs), len(kn_blogs)  # reported output: (408, 346)
The next cell displays a random sample of the two groups:

display(blog_nodes.loc[kp_blogs].sample(10))
display(blog_nodes.loc[kn_blogs].sample(10))
You should see that one of the samples has mostly +1 political affiliations, whereas the other has mostly -1 affiliations. Indeed, if we calculate the statistics on the pol column of the two groups, you can confirm this finding:

display(blog_nodes.loc[kp_blogs].describe())
display(blog_nodes.loc[kn_blogs].describe())
You should see the mean and median values near +1 in one group and -1 in the other.
Exercise 5 (1 point). You do not need to write any more code. This "exercise" is simply an end-to-end test of all the code you wrote in Exercises 0-4. If those are working correctly and are "efficient enough," then the hidden test cell that appears immediately below should pass the autograder.

# Test cell: `ex5_end_to_end_check`
print(hidden_cell_template_msg())
###
### AUTOGRADER TEST - DO NOT REMOVE
###
Fin! You've reached the end of this part. Don't forget to restart and run all cells again to make sure it's all working when run in sequence; and make sure your work passes the submission process. If the autograder seems to fail, be sure to inspect the grading report for more information. Good luck!

Reference: The US political blogs dataset used in this problem was recorded in 2005 by L. A. Adamic and N. Glance, "The political blogosphere and the 2004 US Election," in the Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). The full dataset is available here, but in this problem, we use a subset of that data.