
Question

Can you do questions 3, 4, 5, and 6, please? Thank you 

3. Classification with DBSCAN of generated data
Let's start by generating a dataset of 500 observations in 2D space. We will use a function
built into scikit-learn to produce circular point clouds.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
from sklearn.utils import shuffle
# Let's generate a scatter plot composed of two circles.
# The cloud contains 500 observations ('n_samples'), noised by
# Gaussian noise of standard deviation 0.1 ('noise').
# The ratio between the radius of the small circle and that of
# the large circle is 0.3 ('factor').
data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)
print(data.shape)

# Random permutation of the rows of the matrix (we mix the observations)
data, labels = shuffle(data, labels)

# Point cloud display
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
Question 1: How many groups does this dataset have?
Question 2: Perform a clustering of this dataset using k-means. What can we expect? What
do you notice?
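For reference, a minimal sketch of the k-means step (assuming k = 2, one cluster per circle, and using scikit-learn's standard KMeans):
from sklearn.cluster import KMeans

# Run k-means with k=2 (one cluster per circle, in principle)
km_predictions = KMeans(n_clusters=2, random_state=0).fit_predict(data)

# k-means builds convex clusters around centroids, so it cannot
# separate two concentric rings: each cluster mixes points from both.
plt.scatter(data[:, 0], data[:, 1], c=km_predictions)
plt.show()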
Since the two circles are separated by an area with no data, a density-based method seems
appropriate. We can create a clustering model using DBSCAN by importing it from
scikit-learn:
from sklearn.cluster import DBSCAN
db = DBSCAN()
The constructor arguments of DBSCAN are as follows:
■ eps: the radius of the neighborhood, i.e. the maximum distance between two
observations for them to be considered neighbors of each other,
■ min_samples: the minimum number of neighbors that a core point must have,
■ metric: the distance to use (by default, the Euclidean distance).
You can call the following methods:
■ .fit(X): performs an automatic classification using the DBSCAN method on the
observation matrix X. The results are stored in the .labels_ attribute.
■ .fit_predict(X): same as .fit(X) but returns the group labels directly.
The following attributes are available after calling the .fit() method:
■ core_sample_indices_: the indices of the core points.
■ labels_: the group numbers of the points in the observation matrix.
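For example (the parameter values below are arbitrary, purely for illustration):
# Hypothetical parameter values, for illustration only
db_example = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
db_example.fit(data)
print(db_example.core_sample_indices_[:10])  # first few core point indices
print(db_example.labels_[:10])               # group label of each observation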
Question 3: What are the default values for important DBSCAN parameters in scikit-learn
(ε and m)?
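One quick way to check is to inspect an unconfigured instance (get_params() is the standard scikit-learn accessor):
# Print scikit-learn's default hyperparameters for DBSCAN
print(DBSCAN().get_params())
# Recent versions report eps=0.5 and min_samples=5, among others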
Let's apply an automatic classification by DBSCAN to our dataset. As with k-means,
this can be done in two steps, by calling .fit() and then accessing the labels_
attribute, or in a single operation using the .fit_predict() method:
predictions = db.fit_predict(data)
# Equivalent to:
# db.fit(data)
# predictions = db.labels_

# Display of the scatter plot colored by the predictions
plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()
Question 4: What do you notice? Which parameter probably needs to be adjusted to
improve this result?
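Before turning to the heuristic below, one can also probe the sensitivity to eps by hand (the grid of values here is arbitrary):
# Arbitrary eps values, for illustration: watch how the result changes
for eps in [0.05, 0.1, 0.2, 0.5]:
    preds = DBSCAN(eps=eps).fit_predict(data)
    n_clusters = len(set(preds)) - (1 if -1 in preds else 0)
    print(f"eps={eps}: {n_clusters} clusters, {np.sum(preds == -1)} noise points")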
To refine our analysis, we will apply Schubert's heuristic, which exploits the
k-distance graph of the observation cloud. For now we keep min_samples fixed at its
default value, i.e. 5. We must therefore draw the graph of the 4-distances for our
observation matrix.
from sklearn.neighbors import NearestNeighbors
# Fit a 4-nearest-neighbor index, then query the distances to each
# point's neighbors (kneighbors returns distances and indices)
nn = NearestNeighbors(n_neighbors=4).fit(data)
distances, _ = nn.kneighbors(data)
Question 5: Using the NearestNeighbors documentation in scikit-learn, explain what
the code above does.
We can now draw the 4-distance graph. To do this, we only keep the distance from each
point to its fourth neighbor, then we sort this list in descending order.
# Keep only the distance to each point's fourth neighbor, sorted in
# descending order
distances_triees = np.sort(distances[:, -1])[::-1]
plt.plot(distances_triees)
plt.xlabel("Number of points")
plt.ylabel("4-distance")
plt.show()
Question 6: From the 4-distance graph, determine an appropriate eps value for this
dataset using the heuristic described above. Reapply DBSCAN with these settings.
Display the resulting point cloud.
Question 7: How many groups do you get? What are observations with label -1?
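A sketch of this re-run, assuming the elbow of the curve is read at roughly eps ≈ 0.1 (the exact value must be taken from your own graph):
# The eps value below is hypothetical: replace it with the elbow
# read off your 4-distance graph
db = DBSCAN(eps=0.1, min_samples=5)
predictions = db.fit_predict(data)

plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()

# DBSCAN labels noise points (outliers) with -1
print("Clusters found:", len(set(predictions)) - (1 if -1 in predictions else 0))
print("Noise points:", np.sum(predictions == -1))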
4. Classification with DBSCAN of "Iris" dataset
The Iris dataset is a classic in statistical learning. It includes 150 observations of plants
according to four attributes:
■ sepal length,
■ sepal width,
■ petal length,
■ petal width.
Observations fall into one of three classes, corresponding to the three species of iris: Setosa,
Versicolour, or Virginica. More details can be found in the documentation. Iris is built
into scikit-learn and is available from the sklearn.datasets submodule:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
Question 8: How many observations does this dataset contain?
To simulate the presence of outliers, we will randomly generate 20 noisy points, drawn
from a uniform distribution between the minimum and maximum values of each column
of the observation matrix:
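The code for this step is missing from the transcription; a plausible reconstruction (the variable names outliers and X_noisy are assumptions):
# Reconstruction of the missing snippet: 20 points drawn uniformly
# between each column's minimum and maximum (names are assumptions)
rng = np.random.default_rng(0)
outliers = rng.uniform(low=X.min(axis=0), high=X.max(axis=0),
                       size=(20, X.shape[1]))
X_noisy = np.vstack([X, outliers])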
Question 9: Perform a principal component analysis and visualize the Iris dataset
projected along its first two principal axes.
Answer: (I have provided the answer so you can use it for question 9)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.show()
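The transcription stops here; given the section title, a natural next step would be to run DBSCAN on the noisy data and visualize the result in the PCA plane (eps below is just scikit-learn's default, not a tuned value):
# Hypothetical continuation: cluster the noisy Iris data with DBSCAN
db_iris = DBSCAN(eps=0.5, min_samples=5)
iris_predictions = db_iris.fit_predict(X_noisy)

# Project the noisy data onto its first two principal axes and color
# by the DBSCAN labels (noise points get label -1)
X_noisy_pca = PCA(n_components=2).fit_transform(X_noisy)
plt.scatter(X_noisy_pca[:, 0], X_noisy_pca[:, 1], c=iris_predictions)
plt.show()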