Question

Can you please do questions 1 and 2? Thank you

3. Classification with DBSCAN of generated data
Let's start by generating a dataset of 500 observations in 2D space. We will use a function
built into scikit-learn to produce circular point clouds.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
from sklearn.utils import shuffle

# Let's generate a scatter plot composed of two circles.
# The cloud contains 500 observations ('n_samples'), perturbed by
# Gaussian noise of standard deviation 0.1 ('noise').
# The ratio between the radius of the small circle and that of the
# large circle is 0.3 ('factor').
data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)
print(data.shape)

# Random permutation of the rows of the matrix (we mix the observations)
data, labels = shuffle(data, labels)

# Point cloud display
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
Question 1: How many groups does this dataset have?
Question 2: Perform a clustering of this dataset using k-means. What can we expect? What
do you notice?
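As a hint for Question 2, here is a minimal sketch of what k-means does on this dataset (the choice of n_clusters=2 and the random_state value are my own illustrative assumptions, not prescribed by the exercise). Because k-means builds convex clusters around centroids, it cannot follow the ring shapes: each of its clusters ends up mixing points from both circles.

```python
# Illustrative sketch: k-means with k=2 on the two-circles data.
# k-means partitions the plane into convex (Voronoi) cells, so it
# roughly cuts the data with a straight line through the middle;
# each resulting cluster contains points from BOTH rings.
import sklearn.datasets
from sklearn.cluster import KMeans

data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
pred = km.fit_predict(data)

# Show which true circles appear in each k-means cluster:
for k in (0, 1):
    print("cluster", k, "contains true labels:", set(pred_labels := labels[pred == k].tolist()))
```

Plotting `data` colored by `pred` instead of `labels` makes the failure visible: the coloring splits the figure into two half-planes rather than two rings.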
Since the two circles are separated by an area with no data, a density-based method seems appropriate. We can create a clustering model using DBSCAN by importing it from scikit-learn:
from sklearn.cluster import DBSCAN
db = DBSCAN()
The constructor arguments of DBSCAN are as follows:
- eps: the radius of the neighborhood, i.e. the maximum distance between two observations for them to be considered neighbors of each other;
- min_samples: the minimum number of neighbors that a core point must have;
- metric: the distance to use (by default, the Euclidean distance).
You can call the following methods:
- .fit(X): performs an automatic classification of the observation matrix X using the DBSCAN method. The results are stored in the .labels_ attribute.
- .fit_predict(X): same as .fit(X), but returns the group labels directly.
The following attributes are available after calling the .fit() method:
- core_sample_indices_: the indices of the core points.
- labels_: the group numbers of the points in the observation matrix (noise points are labeled -1).
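Putting the pieces above together, here is a minimal sketch of DBSCAN applied to the generated data. The values eps=0.2 and min_samples=5 are illustrative guesses, not values prescribed by the exercise; eps in particular may need tuning depending on the noise level.

```python
# Hedged sketch: DBSCAN on the two-circles data.
# eps=0.2 and min_samples=5 are assumed, illustrative settings.
import sklearn.datasets
from sklearn.cluster import DBSCAN

data, labels = sklearn.datasets.make_circles(n_samples=500, noise=0.1,
                                             factor=0.3, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
pred = db.fit_predict(data)      # equivalent to db.fit(data) then db.labels_

# Points labeled -1 are noise; the remaining labels are the clusters found.
n_clusters = len(set(pred.tolist()) - {-1})
print("clusters found:", n_clusters)
print("number of core points:", len(db.core_sample_indices_))
```

Plotting `data` colored by `pred` shows how the density-based grouping compares with the two true rings; if the rings are merged or fragmented, adjusting eps is the first thing to try.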