Question

Can you do questions 3, 4, 5 and 6? They are really hard and I couldn't figure them out. Thanks!

Question 3: What are the default values of the important DBSCAN parameters (ε and m) in scikit-learn?
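(As a quick check, not part of the original handout: in current scikit-learn versions the defaults are eps=0.5 and min_samples=5, which can be read directly off a freshly constructed estimator.)

from sklearn.cluster import DBSCAN

db = DBSCAN()  # no arguments: default parameters
print(db.eps)          # 0.5
print(db.min_samples)  # 5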
Let's apply automatic classification with DBSCAN to our dataset. As with k-means, this can be done in two steps, by calling fit() and then reading the labels_ attribute, or in a single operation with the fit_predict() method:
db = DBSCAN()  # default parameters (see Question 3)
predictions = db.fit_predict(data)
# equivalent to:
# db.fit(data)
# predictions = db.labels_

# Display the scatter plot colored by the predictions
plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()
Question 4: What do you notice? Which parameter probably needs to be adjusted to improve this result?
To refine our analysis, we will apply Schubert's heuristic, which exploits the k-distance graph of the point cloud. For now, we keep min_samples fixed at its default value, i.e. 5; the heuristic then looks at each point's distance to its (min_samples − 1) = 4th nearest neighbor. We must therefore plot the graph of the 4-distances for our observation matrix.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=4).fit(data)
distances, indices = nn.kneighbors(data)
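As a side note (not in the original handout), kneighbors() returns two arrays of shape (n_samples, n_neighbors): the distances to the nearest neighbors and their row indices in the fitted data. Because the query points here are the training points themselves, each point is returned as its own nearest neighbor, at distance 0:

print(distances.shape)   # (500, 4)
print(indices.shape)     # (500, 4)
print(distances[0, 0])   # 0.0: each point is its own nearest neighbor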
Question 5: Using the NearestNeighbors documentation in scikit-learn, explain what the code above does.
We can now draw the 4-distance graph. To do this, we keep only the distance from each point to its fourth neighbor, then sort this list in descending order.
distances_triees = np.sort(distances[:, -1])[::-1]
plt.plot(distances_triees)
plt.xlabel("Number of points")
plt.ylabel("$4$-distance")
plt.show()
Question 6: From the 4-distance graph, determine an appropriate eps value for this dataset using the heuristic seen in class (read off the elbow of the curve). Reapply DBSCAN with these settings and display the resulting point cloud.
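A minimal sketch of what the answer might look like, reusing the imports above and assuming the elbow of the 4-distance curve is read at roughly 0.2 (an assumed value; use whatever your own plot shows):

db = DBSCAN(eps=0.2, min_samples=5)  # eps=0.2 is an assumed reading of the elbow
predictions = db.fit_predict(data)

# DBSCAN labels noise points -1; they appear as their own color here
plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()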
1. Introduction
The objective of this final exam is to illustrate the practical application of automatic classification with DBSCAN. First, we will observe the behavior of the algorithm on a synthetic case where k-means fails. We will then see how to use classical heuristics to determine the algorithm's parameters and how to interpret the resulting partition. Finally, we will apply DBSCAN to a real dataset and illustrate its use for outlier detection.
2. Density-based classification in scikit-learn
As we saw in the assignments, scikit-learn offers many automatic classification methods.
For this final exam, we are interested in DBSCAN (Density-Based Spatial Clustering of
Applications with Noise).
Like k-means, DBSCAN takes an n_samples x n_features observation matrix and produces a vector of n_samples cluster labels, one per observation. Unlike k-means, however, DBSCAN cannot classify new data: the model must be retrained with the new observations included. The description of the DBSCAN implementation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html.
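To make the last point concrete (a sketch, not from the handout): unlike KMeans, the fitted DBSCAN estimator exposes no predict() method, so assigning new observations requires refitting on the enlarged dataset.

from sklearn.cluster import DBSCAN, KMeans

print(hasattr(KMeans(), "predict"))  # True
print(hasattr(DBSCAN(), "predict"))  # False: refit on old + new data instead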
3. Classification of generated data with DBSCAN
Let's start by generating a dataset of 500 observations in 2D space. We will use a function
built into scikit-learn to produce circular point clouds.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
from sklearn.utils import shuffle
# Let's generate a scatter plot composed of two circles.
# The cloud contains 500 observations (`n_samples`), perturbed by
# Gaussian noise of standard deviation 0.1 (`noise`).
# The ratio between the radius of the small circle and that of the
# large circle is 0.3 (`factor`).
data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)
print(data.shape)

# Random permutation of the rows of the matrix (we shuffle the observations)
data, labels = shuffle(data, labels)

# Point cloud display
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
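Not part of the handout, but a quick way to see the k-means failure the introduction alludes to: on two concentric circles, k-means with k = 2 splits the plane into two half-planes instead of separating the rings. A minimal sketch, reusing the data generated above:

from sklearn.cluster import KMeans

# n_init is set explicitly so the snippet behaves the same across sklearn versions
km_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(data)
plt.scatter(data[:, 0], data[:, 1], c=km_labels)
plt.title("k-means on concentric circles")
plt.show()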