Given an RDD that contains the following points:
('37.1242586781', '-119.424203758')
('37.7374074392', '-121.790061298')
('45.0929780547', '-117.473248072')
('34.9786739919', '-111.700983823')
('35.8736386531', '-120.421476019')
('34.0912863966', '-118.144076285')
('35.6720438532', '-120.416438975')
('36.284116521', '-119.030325609')
('44.4092246446', '-122.891970809')
('38.4770167448', '-122.296973442')
('38.4296857989', '-121.418135072')
('37.3822266494', '-122.016158235')
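Assuming the points sit in a text file with one tuple-like line each (the filename points.txt below is hypothetical), a minimal parsing sketch could look like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="kmeans")

# Each line is assumed to look like: ('37.1242586781', '-119.424203758')
lines = sc.textFile("points.txt")

def parse_point(line):
    # Strip the parentheses and quotes, then convert both fields to floats.
    x, y = line.strip("()").replace("'", "").split(",")
    return (float(x), float(y))

points = lines.map(parse_point).cache()
```

Caching the parsed RDD matters here because the iterative loop below scans it once per iteration.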
Implement an iterative k-means algorithm in Spark (Python) to compute k-means for a set of points read from a file into an RDD. Do not use the K-means implementation in Spark's MLlib to solve the problem. Follow this pattern:
1. Choose k = 5 random points as the starting centers.
2. Find all points closest to each center (use groupByKey or reduceByKey; a helper for this is sketched after this list).
3. Find the new center (mean) of each cluster.
4. If the centers changed by more than convergeDist (e.g. convergeDist = 0.1), iterate again starting from step 2; otherwise, terminate.
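One straightforward way to implement step 2 is a pair of small helpers; the names squared_distance and closest_center below are illustrative choices, not part of any Spark API:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def closest_center(point, centers):
    """Index of the center nearest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))
```

Mapping each point to (closest_center(p, centers), (p, 1)) produces keyed pairs that reduceByKey can sum per cluster, which is what the full loop below relies on.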
For example, use (currentcenter_x_1, currentcenter_y_1), (currentcenter_x_2, currentcenter_y_2), ..., (currentcenter_x_5, currentcenter_y_5) to denote the current 5 center points of the 5 clusters, and use (newcenter_x_1, newcenter_y_1), (newcenter_x_2, newcenter_y_2), ..., (newcenter_x_5, newcenter_y_5) to denote the new center of each cluster. Then, in step 4, calculate the total squared distance between the current centers and the new centers, e.g.:
tempDist = (currentcenter_x_1 - newcenter_x_1)^2 + (currentcenter_y_1 - newcenter_y_1)^2
         + (currentcenter_x_2 - newcenter_x_2)^2 + (currentcenter_y_2 - newcenter_y_2)^2
         + ...
         + (currentcenter_x_5 - newcenter_x_5)^2 + (currentcenter_y_5 - newcenter_y_5)^2
Update the centers of the clusters using the new center points.
If tempDist > convergeDist, iterate again starting from step 2; otherwise, terminate the loop.
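Putting the steps together, here is one minimal sketch of the iterative loop, reusing the points RDD and the helpers sketched above; the seed and variable names are illustrative assumptions, and this is one way to follow the pattern rather than a reference implementation:

```python
K = 5
convergeDist = 0.1

# Step 1: choose K = 5 random points as the starting centers.
centers = points.takeSample(False, K, seed=42)

tempDist = float("inf")
while tempDist > convergeDist:
    # Step 2: key each point by its closest center, carrying (point, 1)
    # so reduceByKey can sum coordinates and counts in a single pass.
    assigned = points.map(lambda p: (closest_center(p, centers), (p, 1)))
    sums = assigned.reduceByKey(
        lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1]), a[1] + b[1])
    )

    # Step 3: the new center of each cluster is the coordinate mean.
    new_centers = sums.mapValues(
        lambda s: (s[0][0] / s[1], s[0][1] / s[1])
    ).collectAsMap()

    # Step 4: total squared distance between current and new centers.
    tempDist = sum(
        squared_distance(centers[i], new_centers[i]) for i in new_centers
    )

    # Update the current centers; a cluster that received no points
    # simply keeps its previous center.
    for i, c in new_centers.items():
        centers[i] = c

print("Final centers:", centers)
```

Using reduceByKey rather than groupByKey keeps the per-cluster sums combined on each partition before the shuffle, which is the usual reason to prefer it here.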