If you just began studying statistics, one of the first concepts that you will learn is regression analysis.
What Is Regression?
Simply put, a regression is a mathematical relationship between a dependent variable (the outcome) and one or more independent variables (the predictors).
The relationship is not known in advance. It has to be estimated from data, based on the effect that the predictors have on the outcome.
Let’s imagine that you are looking to determine if the time of day has a significant impact on the purchase of a snack. You could map your model the following way:
Number of Snacks Purchased = b0 + b1(Time of Day) + ε
Looking At The Variables
In this example, we are interested in the number of snacks, so this is your outcome variable. We treat it as a continuous variable here, although strictly speaking it is a count that can only take non-negative whole numbers. The predictor variable is the time of day, which is also a continuous variable.
The Other Stuff
When you look at the equation we created for the model, there are still some factors that you need to understand:
b0 = the intercept, interpreted as the prediction when the predictor is zero. In our example, it would be the predicted number of snacks sold when the time of day is 00:00 hours.
b1 = the slope, which describes the relationship between the snacks sold and the time of day: the change in the predicted number of snacks for each one-unit increase in the time of day.
ε = the error term that is always present in a regression. For example, people buy snacks for a variety of reasons other than the time of day. If the sample is truly random, ε takes those natural errors into account.
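To make the three pieces of the model concrete, here is a minimal Python sketch. The coefficient values, the function names and the normally distributed error are all assumptions made up for illustration; they are not estimates from real snack data.

```python
import random

# Hypothetical coefficients, chosen only for illustration.
B0 = 2.0  # intercept: predicted snacks at 00:00 hours
B1 = 0.5  # slope: change in predicted snacks per additional hour


def predict_snacks(hour, b0=B0, b1=B1):
    """Deterministic part of the model: b0 + b1 * (time of day)."""
    return b0 + b1 * hour


def simulate_snacks(hour, noise_sd=1.0, rng=None):
    """Prediction plus a random error term (assumed normal here)."""
    rng = rng or random.Random(0)
    return predict_snacks(hour) + rng.gauss(0.0, noise_sd)


# At hour 0 the prediction is exactly the intercept b0.
print(predict_snacks(0))
# Moving one hour later changes the prediction by exactly the slope b1.
print(predict_snacks(13) - predict_snacks(12))
# A simulated observation differs from the prediction by the error term.
print(simulate_snacks(12))
```

The first two printed values recover b0 and b1, while the simulated value shows how ε moves an individual observation away from the line.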
A simple regression looks only at the relationship between a single predictor and a single outcome. Before fitting anything, you should look at a plot of the data first:
These blue dots that you see in the image above represent the data in the form of Cartesian points.
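To calculate the intercept and the coefficient, you can use the ordinary least squares formulas, where n is the sample size:
b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
a = (Σy - bΣx) / n
b = the slope (b1 in the model above). It compares how X and Y vary together, based on the products of the two data sets (numerator), with the variance of X alone (denominator).
a = the intercept (b0 in the model above). It is the difference between the sum of Y and the slope times the sum of X, divided by the sample size.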
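As a rough illustration of this calculation, the sketch below fits a simple regression in plain Python using the mean-centered form of the ordinary least squares formulas. The function name and the data are hypothetical; the numbers were made up so that they fall exactly on a line.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # How X and Y vary together, and how X varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: time of day (hours) and snacks purchased.
hours = [8, 10, 12, 14, 16]
snacks = [6, 7, 8, 9, 10]

intercept, slope = fit_simple_regression(hours, snacks)
print(intercept, slope)
```

Because the made-up points lie on a perfect line, the error term is zero here. With real data the points scatter around the fitted line and the same function returns the line that minimizes the squared errors.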
Regression analysis includes not only simple regression but multiple regression as well.
As we already mentioned above, many more factors than just the time of day affect the purchase of snacks. In these cases, we need to use multiple regression, since it takes into account that the predictors may also affect one another. The general model for multiple linear regression is:
Y = b0 + b1X1 + b2X2 + … + bnXn + ε
Multiple linear regression estimates the effect of each predictor while holding the other predictors constant, which matters when the predictors are related to one another. It is even possible for two or more predictors to interact, meaning that the effect of one depends on the value of another.
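A minimal sketch of how the coefficients of a multiple linear regression might be estimated is shown below. It solves the least-squares normal equations with a small Gaussian elimination routine in plain Python. The function names and the data are hypothetical, and in practice you would normally rely on a statistics library rather than hand-written linear algebra.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bn] for Y = b0 + b1*X1 + ... + bn*Xn."""
    # Prepend a column of ones so that b0 acts as the intercept.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data that follows Y = 1 + 2*X1 + 3*X2 exactly.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]

coefficients = fit_multiple_regression(rows, ys)
print(coefficients)
```

An interaction between two predictors can be modeled in the same framework by adding their product, X1 * X2, as an extra column of predictors before fitting.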