Data Mining


Clustering is the process of grouping objects so that objects in the same group are more similar to each other than to objects in other groups. We first partition the data set into groups based on similarity and then assign labels to the groups.
K-means is a simple algorithm used for data clustering. It classifies a given data set into a certain number of clusters (say k clusters) that is fixed a priori.

The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different starting locations lead to different results. A good choice is therefore to place them as far away from each other as possible. The next step is to take each point in the data set and associate it with the nearest centroid.
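
The advice to spread the starting centroids apart can be sketched as a farthest-first selection. This is only one possible reading of "as far away from each other as possible". The function name farthest_first and the choice of the first data point as the starting centroid are assumptions made for illustration, not part of the original description.

```python
import math

def farthest_first(points, k):
    """Pick k spread-out starting centroids from a list of coordinate tuples.

    Hypothetical helper: starts from the first point (an arbitrary choice)
    and repeatedly adds the point farthest from the centroids chosen so far.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # For each point, measure the distance to its closest chosen centroid,
        # then take the point for which that distance is largest.
        next_centroid = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centroids),
        )
        centroids.append(next_centroid)
    return centroids
```

A different starting point can give a different set of centroids, which is the sensitivity to initial placement described above.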

When no point is pending, the first step is complete and an initial grouping is done. At this point, we recalculate k new centroids as the barycenters (means) of the clusters resulting from the previous step. Once we have these k new centroids, each point in the data set is reassigned to its nearest new centroid. These two steps, assignment and recalculation, form a loop.

As a result of this loop, the k centroids change their location step by step until no more changes occur. In other words, the centroids do not move anymore.
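
The loop described above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. The function name kmeans, the max_iter safety limit, the use of Euclidean distance, and the rule of leaving an empty cluster's centroid where it was are all assumptions that the description does not specify.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Run k-means on a list of coordinate tuples from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean (barycenter) of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)

        # Stop when the centroids do not move anymore.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Calling kmeans(points, starting_centroids) returns the final centroids together with the points grouped by cluster, so labels can then be assigned to the groups.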

Regression is a predictive modeling technique used to determine the strength of the relationship between a dependent variable and an independent variable. The independent variable is called the predictor variable, and its value is gathered through experiments. The dependent variable is called the response variable, and its value is derived from the predictor variable.

Y = aX + b

Linear regression (X is the predictor variable and Y is the response variable)
Regression is used for the following:
- It is used to identify significant relationships between variables.
- It allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities.
There are two types of variables in a regression equation. The one whose value is found from experiments is called the predictor variable. The one whose value is to be calculated is the response variable. In linear regression, the strength of the relationship between the response variable and the predictor variable is calculated.
Y = aX + b
Y is the response variable
X is the predictor variable
a and b are constants
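
As an illustration of how the constants a and b might be obtained from data, here is a minimal sketch that assumes ordinary least squares as the fitting method and the Pearson correlation coefficient r as the measure of strength. The text above names neither, so both are assumptions, as is the function name fit_line.

```python
import math

def fit_line(xs, ys):
    """Fit Y = aX + b by ordinary least squares and return (a, b, r).

    Assumes at least two points and that neither xs nor ys is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    a = sxy / sxx                    # slope
    b = mean_y - a * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation: strength of the linear relationship
    return a, b, r
```

Given the fitted constants, the response for a new predictor value X is computed as a * X + b. A value of r close to 1 or -1 indicates a strong linear relationship, and a value close to 0 indicates a weak one.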