Data Mining


Multiple regression is an extension of simple linear regression used to model the relationship between more than two variables. In simple linear regression we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.

  y = a1x1 + a2x2 + ... + anxn + b
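
As an illustration, the sketch below fits a multiple regression of this form in R with lm(); the built-in mtcars data set and the choice of mpg as the response with disp, hp and wt as predictors are assumptions made only for the example.

# Multiple regression with lm() on the built-in mtcars data set.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Fit mpg as a linear function of the three predictor variables.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Intercept (b) and coefficients (a1, a2, a3) of the fitted equation.
print(coef(model))

# Significance of each predictor and the overall fit.
summary(model)
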
Sometimes a column contains only a Boolean value (True/False or 0/1), and you need to know the relationship between this response variable and the predictor variables. Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1. It measures the probability of a binary response as the value of the response variable, based on the mathematical equation relating it to the predictor variables.

  y = 1/(1 + e^-(a + b1x1 + b2x2 + b3x3 + ...))
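
A minimal sketch of fitting this model in R uses glm() with a binomial family; taking the built-in mtcars data and its binary am column (transmission type, 0/1) as the response is only an assumed example.

# Logistic regression with glm() on the built-in mtcars data set.
input <- mtcars[, c("am", "cyl", "hp", "wt")]

# family = binomial fits the logistic model shown above.
model <- glm(am ~ cyl + hp + wt, data = input, family = binomial)
summary(model)

# Predicted probabilities that am = 1 for the first few rows.
head(predict(model, type = "response"))
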
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes of the graph represent an event or choice, and the edges represent the decision rules or conditions. Decision trees are widely used in machine learning and data mining applications, including in R.
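
As a sketch, a decision tree can be grown in R with the rpart package (assumed to be installed); the built-in iris data set is used here purely for illustration.

# Classification tree with rpart on the built-in iris data set.
library(rpart)

# Predict Species from the four flower measurements.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Decision rules at each node, and a plot of the tree.
print(fit)
plot(fit)
text(fit)

# Classify the training rows with the fitted tree.
pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)
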
In the random forest approach, a large number of decision trees are created. Every observation is fed into every decision tree, and the most common outcome across the trees is used as the final output for that observation. A new observation is classified by feeding it into all the trees and taking a majority vote of their predictions.

An error estimate is made for the cases that were not used while building each tree. This is called the OOB (out-of-bag) error estimate and is reported as a percentage.
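
A minimal sketch of both ideas in R uses the randomForest package (assumed to be installed) with the built-in iris data set; the printed model summary includes the OOB error estimate as a percentage.

# Random forest with 500 trees on the built-in iris data set.
library(randomForest)

set.seed(42)  # make the bootstrap sampling reproducible
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

# The printed summary reports the OOB estimate of the error rate as a percentage.
print(fit)

# Importance of each predictor across the forest.
print(importance(fit))

# New observations are classified by a majority vote of all trees.
predict(fit, head(iris))
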
Data cleaning is the act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and make the data more consistent.
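
A minimal sketch of these steps in R is shown below on a small hypothetical data frame; the column names and fill-in rules are assumptions made only for the example.

# A tiny made-up data frame with case/spacing issues, duplicates and a missing value.
df <- data.frame(
  name = c("Alice", "alice ", "Bob", "Bob", NA),
  age  = c(30, 30, NA, 25, 40),
  stringsAsFactors = FALSE
)

# Normalise spelling by lower-casing and trimming whitespace.
df$name <- trimws(tolower(df$name))

# Remove duplicate entries.
df <- df[!duplicated(df), ]

# Add missing data, here by filling ages with the median of the observed ages.
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

print(df)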