Clustering is the process of grouping objects so that objects in the same group are more similar to each other than to objects in other groups. We first partition the data set into groups based on similarity and then assign labels to the groups.
A good clustering method should satisfy the following requirements:
1. Scalability – The method should be able to scale in order to handle a large set of data.
2. Ability to handle different data types
3. Ability to deal with outliers and noise
4. Usability
5. Insensitivity to the order of input records
K-means is a simple algorithm for data clustering. It partitions a given data set into a fixed number of clusters (say k clusters), chosen a priori.
The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different starting locations lead to different results. A good choice is to place them as far away from each other as possible. The next step is to take each point in the data set and associate it with the nearest centroid.
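One possible way to place the starting centroids far apart is a farthest-first pass, followed by assigning each point to its nearest centroid. The sketch below is an illustration under stated assumptions, not a prescribed method: the function names `farthest_first_centroids` and `nearest_centroid`, the choice of the first point as the initial centroid, the Euclidean distance, and the sample points are all hypothetical.

```python
import math

def farthest_first_centroids(points, k):
    """Pick k starting centroids that are spread far apart."""
    centroids = [points[0]]  # assumption: begin with the first point
    while len(centroids) < k:
        # Choose the point whose nearest already-chosen centroid is farthest away.
        next_centroid = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centroids),
        )
        centroids.append(next_centroid)
    return centroids

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Hypothetical two-dimensional sample data.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 9.0)]
centroids = farthest_first_centroids(points, 3)
labels = [nearest_centroid(p, centroids) for p in points]
print(centroids)
print(labels)
```

Starting from a different first point can give a different set of centroids, which is the sensitivity to initial placement described above.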
When no point is pending, the first step is complete and an initial grouping is done. At this point, we recalculate k new centroids as the barycenters (means) of the clusters resulting from the previous step. With these k new centroids, each point in the data set is reassigned to its nearest new centroid. These two steps, assignment and recalculation, form a loop.
As a result of this loop, the k centroids change their location step by step until no more changes occur. In other words, the centroids no longer move, and the algorithm has converged.
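The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance, starting centroids supplied by the caller, and an iteration cap; the function name `k_means`, the `max_iter` parameter, the handling of empty clusters, and the sample data are assumptions made for the example.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Run k-means from the given starting centroids.

    Returns the final centroids and the cluster index of each point.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: bind every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean (barycenter) of its cluster.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Stop when no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical sample data with two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_centroids, labels = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(final_centroids)
print(labels)
```

On this data the centroids settle after the second pass, once the assignment step produces the same grouping twice in a row.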
Online Analytical Processing (OLAP) is a technology that is used to organize large business databases and support business intelligence. It performs a multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
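To make the idea of multidimensional analysis concrete, the sketch below aggregates a tiny set of sales records along chosen dimensions, similar in spirit to an OLAP roll-up. It is a toy illustration only: the `sales` records, the dimension names, and the `roll_up` function are hypothetical, and real OLAP systems work on far larger data with dedicated storage and query engines.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
sales = [
    ("North", "Laptop", "Q1", 1200),
    ("North", "Phone", "Q1", 800),
    ("North", "Laptop", "Q2", 1500),
    ("South", "Laptop", "Q1", 900),
    ("South", "Phone", "Q2", 700),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for *dims, amount in facts:
        key = tuple(dims[i] for i in positions)
        totals[key] += amount
    return dict(totals)

# View the same data at two levels of detail.
by_region = roll_up(sales, ["region"])
by_region_quarter = roll_up(sales, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```

Keeping fewer dimensions gives a coarser summary, while keeping more gives a finer one over the same underlying facts.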