Statistics introduces you to the statistical techniques used in Data Science and gives insight into how to use data to solve a problem. The two types of Statistics used in Data Science are:
1. Descriptive Statistics: It is used for analysing, summarising and organising data. Descriptive Statistics can be computed on population data or on sample data.
2. Inferential Statistics: It draws conclusions about a population from sample data. Inferential Statistics takes a sample from the population and uses it to reach a conclusion about the population as a whole.
In Data Science, Statistics is mainly used in Exploratory Analysis, which involves the Mean, Median and Mode, Standard Deviation, Z-score and Outliers. Distributions in Statistics describe all the possible values or intervals of the data and how often they occur. A Statistical Hypothesis is an assumption about the relationship between two or more variables in the whole data, and Hypothesis Testing uses sample data to decide whether or not to reject that Hypothesis.
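The exploratory measures listed above can be sketched with Python's standard statistics module. The data values are made up for illustration, and the cut-off of 2 standard deviations for an outlier is only one common rule of thumb, not something this article prescribes.

```python
import statistics

# Hypothetical sample values, made up for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9, 30]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
std_dev = statistics.pstdev(data)  # population standard deviation

# Z-score: how many standard deviations a value lies from the mean.
z_scores = [(x - mean) / std_dev for x in data]

# Assumption: treat any value with |z| > 2 as an outlier.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", std_dev)
print("outliers:", outliers)
```

Here the value 30 pulls the mean well above the median, which is a typical sign that an outlier is present.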
Linear Regression
Regression:
Regression is a predictive modelling technique which investigates the relationship between a Dependent variable and one or more Independent variables.
Linear Equation:
It expresses a linear relationship between variables X and Y.
Y=mX+c
- X represents any given score on the X-axis.
- Y represents the score on the Y-axis predicted from X.
- c is the Y-intercept: the value of Y when X = 0, which is where the line crosses the Y-axis.
- m is the slope constant, which shows how much Y changes when X increases by one unit.
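As a small illustration of the equation Y = mX + c, the sketch below evaluates the line for a few values of X. The function name predict and the slope and intercept values are hypothetical, chosen only for the example.

```python
def predict(x, m, c):
    """Return Y for a given X on the line Y = mX + c."""
    return m * x + c


# Hypothetical slope and intercept, chosen only for illustration.
m, c = 2.0, 1.0

at_zero = predict(0, m, c)                  # X = 0 gives the Y-intercept c
at_three = predict(3, m, c)
step = predict(4, m, c) - predict(3, m, c)  # a one-unit step in X changes Y by the slope

print(at_zero, at_three, step)
```

Evaluating the line at X = 0 returns the intercept, and the difference between two neighbouring values of X returns the slope.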
Linear Regression Line:
Plot the independent variable on the x-axis and the dependent variable on the y-axis. If the dependent variable on the y-axis increases as the independent variable on the x-axis increases, we get a positive Linear Regression Line.
If the dependent variable on the y-axis decreases as the independent variable on the x-axis increases, we get a negative Linear Regression Line.
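How well a line fits the data can be measured with the mean squared error:
MSE = (1/n) * sum of (yi - yi(hat))^2 over all i
Where n is the number of values, yi is the actual values and yi(hat) is the predicted values.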
R-Square:
R-Square is a statistical measure of how close the data are to the fitted regression line. In general, the closer the R-Square value is to 1, the better the line fits the data.
It compares the squared distance of the actual values from the predicted values with their squared distance from the mean:
R-Square = 1 - [sum of (yi - yi(hat))^2] / [sum of (yi - y(bar))^2]
Where yi is the actual values, yi(hat) is the predicted values and y(bar) is the mean value of the actual values.
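A minimal sketch of the R-Square calculation follows, assuming the common definition of 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """R-Square, assuming the definition 1 - SS_res / SS_tot."""
    y_bar = sum(actual) / len(actual)
    # Squared distance between actual and predicted values.
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # Squared distance between actual values and their mean.
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values, for illustration only.
actual = [3, 5, 7, 9]
predicted = [2, 6, 7, 9]

score = r_squared(actual, predicted)
print(score)
```

Predictions that match the actual values exactly give an R-Square of 1, and the value falls as the predictions move away from the data.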
Best Fit Linear Regression Line:
The best fit Linear Regression Line is the one with the least error. There are several methods to find the best fit line.
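One such method is ordinary least squares, which picks the slope and intercept that minimise the squared error. The article does not say which method it has in mind, so the sketch below is an assumption, and the name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit Y = mX + c by ordinary least squares (closed-form solution)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = y_mean - m * x_mean
    return m, c


# Hypothetical data lying exactly on the line Y = 2X + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, c = fit_line(xs, ys)
print(m, c)
```

Because the sample points lie exactly on a line, the fit recovers that line's slope and intercept; with noisy data it would return the line with the smallest squared error instead.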
Mean Squared Error:
In mean squared error we find the error between the actual values and the predicted values, square each error and take the mean of the squares.
error = Y(actual) - Y(predicted)
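The calculation can be sketched in a few lines of Python. The function name mean_squared_error and the sample values are hypothetical, used only to show the steps.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared errors Y(actual) - Y(predicted)."""
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    return sum(e ** 2 for e in errors) / len(errors)


# Hypothetical actual and predicted values, for illustration only.
actual = [3, 5, 7, 9]
predicted = [2, 6, 7, 9]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Squaring keeps positive and negative errors from cancelling out and penalises large errors more heavily than small ones.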