Question

Can you do questions 3, 4, 5 and 6? They are really hard and I couldn't figure them out. Thanks!

Question 3: What are the default values of the important DBSCAN parameters (ε and m) in scikit-learn?
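(As a quick check, not part of the original handout: in current scikit-learn versions the defaults are eps=0.5 and min_samples=5, which can be read directly off a freshly constructed estimator.)

from sklearn.cluster import DBSCAN

db = DBSCAN()  # no arguments: default parameters
print(db.eps)          # 0.5
print(db.min_samples)  # 5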
Let's apply automatic classification with DBSCAN to our dataset. As with k-means, this can be done in two steps, by calling fit() and then reading the labels_ attribute, or in a single operation with the fit_predict() method:
db = DBSCAN()  # default parameters (see Question 3)
predictions = db.fit_predict(data)
# equivalent to:
# db.fit(data)
# predictions = db.labels_

# Display the scatter plot colored by the predictions
plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()
Question 4: What do you notice? Which parameter probably needs to be adjusted to improve this result?
To refine our analysis, we will apply Schubert's heuristic, which exploits the k-distance graph of the point cloud. For now, we keep min_samples fixed at its default value, i.e. 5; the heuristic then looks at each point's distance to its (min_samples − 1) = 4th nearest neighbor. We must therefore plot the graph of the 4-distances for our observation matrix.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=4).fit(data)
distances, indices = nn.kneighbors(data)
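As a side note (not in the original handout), kneighbors() returns two arrays of shape (n_samples, n_neighbors): the distances to the nearest neighbors and their row indices in the fitted data. Because the query points here are the training points themselves, each point is returned as its own nearest neighbor, at distance 0:

print(distances.shape)   # (500, 4)
print(indices.shape)     # (500, 4)
print(distances[0, 0])   # 0.0: each point is its own nearest neighbor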
Question 5: Using the NearestNeighbors documentation in scikit-learn, explain what the code above does.
We can now draw the 4-distance graph. To do this, we keep only the distance from each point to its fourth neighbor, then sort this list in descending order.
distances_triees = np.sort(distances[:, -1])[::-1]
plt.plot(distances_triees)
plt.xlabel("Number of points")
plt.ylabel("$4$-distance")
plt.show()
Question 6: From the 4-distance graph, determine an appropriate eps value for this dataset using the heuristic seen in class (read off the elbow of the curve). Reapply DBSCAN with these settings and display the resulting point cloud.
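A minimal sketch of what the answer might look like, reusing the imports above and assuming the elbow of the 4-distance curve is read at roughly 0.2 (an assumed value; use whatever your own plot shows):

db = DBSCAN(eps=0.2, min_samples=5)  # eps=0.2 is an assumed reading of the elbow
predictions = db.fit_predict(data)

# DBSCAN labels noise points -1; they appear as their own color here
plt.scatter(data[:, 0], data[:, 1], c=predictions)
plt.show()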
1. Introduction
The objective of this final exam is to illustrate the practical application of automatic classification with DBSCAN. First, we will observe the behavior of the algorithm on a synthetic case where k-means fails. We will then see how to use classical heuristics to determine the algorithm's parameters and how to interpret the resulting partition. Finally, we will apply DBSCAN to a real dataset and illustrate its use for outlier detection.
2. Density-based classification in scikit-learn
As we saw in the assignments, scikit-learn offers many automatic classification methods.
For this final exam, we are interested in DBSCAN (Density-Based Spatial Clustering of
Applications with Noise).
Like k-means, DBSCAN takes an n_samples x n_features observation matrix and produces a vector of n_samples cluster labels, one per observation. Unlike k-means, however, DBSCAN cannot classify new data: the model must be retrained with the new observations included. The description of the DBSCAN implementation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html.
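To make the last point concrete (a sketch, not from the handout): unlike KMeans, the fitted DBSCAN estimator exposes no predict() method, so assigning new observations requires refitting on the enlarged dataset.

from sklearn.cluster import DBSCAN, KMeans

print(hasattr(KMeans(), "predict"))  # True
print(hasattr(DBSCAN(), "predict"))  # False: refit on old + new data instead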
3. Classification of generated data with DBSCAN
Let's start by generating a dataset of 500 observations in 2D space. We will use a function
built into scikit-learn to produce circular point clouds.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
from sklearn.utils import shuffle
# Let's generate a scatter plot composed of two circles.
# The cloud contains 500 observations (`n_samples`), perturbed by
# Gaussian noise of standard deviation 0.1 (`noise`).
# The ratio between the radius of the small circle and that of the
# large circle is 0.3 (`factor`).
data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)
print(data.shape)

# Random permutation of the rows of the matrix (we shuffle the observations)
data, labels = shuffle(data, labels)

# Point cloud display
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
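Not part of the handout, but a quick way to see the k-means failure the introduction alludes to: on two concentric circles, k-means with k = 2 splits the plane into two half-planes instead of separating the rings. A minimal sketch, reusing the data generated above:

from sklearn.cluster import KMeans

# n_init is set explicitly so the snippet behaves the same across sklearn versions
km_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(data)
plt.scatter(data[:, 0], data[:, 1], c=km_labels)
plt.title("k-means on concentric circles")
plt.show()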