Correlation and Regression
Introduction
In statistics, it is important to understand the relationship between two variables. This can reveal how one variable changes as another changes. Two key concepts that help us understand these relationships are "correlation" and "regression". These concepts allow us to investigate whether variables are related to each other and how strongly. Let's discuss these interesting topics in depth!
Correlation
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables, usually denoted as X and Y. It tells us whether the variables move together (and if they do, whether they move in the same or opposite directions) without implying a cause-effect relationship.
Understanding correlation
When two variables are correlated, it means there is a predictable pattern in the changes that occur between them. The correlation can be positive, negative, or zero.
- Positive correlation: As one variable increases, the other increases as well. For example, the relationship between the amount of time studied and the score obtained on an exam might exhibit a positive correlation.
- Negative correlation: As one variable increases, the other decreases. An example of this could be the relationship between the number of movies watched per week and time spent studying.
- No correlation (zero correlation): A change in one variable is not associated with any predictable change in the other. For example, the relationship between eye color and intelligence level is expected to show no correlation.
Visual example of correlation
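In a scatter plot, the correlation between two variables shows up in the shape of the point cloud. Points that rise from left to right indicate a positive correlation, points that fall from left to right indicate a negative correlation, and a cloud with no visible trend indicates no correlation.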
Expressing correlation mathematically
The most commonly used correlation coefficient is the Pearson correlation coefficient, denoted by r. The formula to calculate it is as follows:
r = Σ((X_i - X̄)(Y_i - Ȳ)) / √(Σ(X_i - X̄)² * Σ(Y_i - Ȳ)²)
Where:
- X_i and Y_i are the individual data points.
- X̄ is the mean of the X values and Ȳ is the mean of the Y values.
- The range of r is from -1 to +1.
If r = 1, it indicates a perfect positive linear relationship. If r = -1, it is a perfect negative linear relationship. When the value of r is close to 0, there is little or no linear relationship, although a nonlinear relationship may still exist.
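The formula can be sketched in Python as follows. This is a minimal illustration rather than a library routine: the function name `pearson_r` is hypothetical, and the two small datasets are made up to show the extreme values of r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from the means.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data: Y = 2X is a perfect positive linear relationship,
# Y = -X is a perfect negative one.
xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))      # 1.0
print(pearson_r(xs, [-1, -2, -3, -4]))  # -1.0
```

The guard for a constant variable is a design choice in this sketch: with zero spread in X or Y the denominator is 0, so r has no defined value.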
Example
Consider a simple dataset with two variables:
- X: 1, 2, 3, 4, 5
- Y: 2, 4, 5, 4, 5
To determine the correlation between X and Y, you need to apply the formula specified above.
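Applying the formula to this dataset step by step might look like the sketch below. The variable names (`s_xy`, `s_xx`, `s_yy`) are chosen here for illustration and do not come from the text.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Step 1: the means.
x_mean = sum(X) / len(X)  # 3.0
y_mean = sum(Y) / len(Y)  # 4.0

# Step 2: the three sums that appear in the formula.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))  # 6.0
s_xx = sum((x - x_mean) ** 2 for x in X)                       # 10.0
s_yy = sum((y - y_mean) ** 2 for y in Y)                       # 6.0

# Step 3: r = 6 / sqrt(10 * 6)
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.4f}")  # r = 0.7746
```

A value of about 0.77 points to a fairly strong positive linear relationship between X and Y in this dataset.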
Regression
While correlation measures the strength and direction of the relationship between two variables, regression is about predicting one variable based on another. It predicts the dependent variable (often denoted as Y) using the independent variable (X).
Understanding regression
Regression helps us understand how the typical value of a dependent variable changes when one independent variable is changed while any other independent variables are held constant. Its simplest form is linear regression with a single independent variable, which is represented as a straight line.
Linear regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. The equation of a line is usually presented as:
Y = a + bX
Where:
- Y is the dependent variable we are trying to predict.
- X is the independent variable that we are using for prediction.
- a is the intercept, the value of Y when X = 0.
- b is the slope, which represents the change in Y for a one-unit change in X.
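To make the roles of the intercept and slope concrete, here is a minimal Python sketch. The helper `predict` and the coefficient values are hypothetical, picked only for illustration.

```python
def predict(a, b, x):
    """Value of Y on the line Y = a + bX at X = x."""
    return a + b * x

# Hypothetical coefficients, not taken from any dataset in this article.
a, b = 1.0, 2.0

print(predict(a, b, 0))                     # 1.0, the intercept a
print(predict(a, b, 4) - predict(a, b, 3))  # 2.0, the slope b
```

Evaluating the line at X = 0 returns the intercept, and moving X by one unit changes the prediction by exactly the slope.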
Visual example of regression
A regression line can be drawn through the data points of a scatter plot.
This line is called the line of best fit or regression line. It is found by the least squares method, which minimizes the sum of the squared vertical distances between the data points and the line.
Finding the regression line mathematically
The formulas to calculate the slope b and intercept a are given as:
b = Σ((X_i - X̄)(Y_i - Ȳ)) / Σ((X_i - X̄)²)
a = Ȳ − bX̄
These formulas arise from minimizing the sum of the squared differences between the observed Y values and the values predicted by the line.
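The minimization can be written out as a short derivation. The symbol S for the sum of squared differences is introduced here for convenience and does not appear elsewhere in the text. Setting both partial derivatives of S to zero gives the two formulas:

```latex
\begin{aligned}
S(a, b) &= \sum_{i} \left(Y_i - a - bX_i\right)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i} \left(Y_i - a - bX_i\right) = 0
  \;\Longrightarrow\; a = \bar{Y} - b\bar{X} \\
\frac{\partial S}{\partial b} &= -2 \sum_{i} X_i \left(Y_i - a - bX_i\right) = 0
  \;\Longrightarrow\; \sum_{i} (X_i - \bar{X})\left[(Y_i - \bar{Y}) - b\,(X_i - \bar{X})\right] = 0 \\
b &= \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i} (X_i - \bar{X})^2}
\end{aligned}
```

The third line substitutes a = Ȳ − bX̄ and uses the fact that the bracketed terms sum to zero, which allows X_i to be replaced by X_i − X̄.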
Example
Use the dataset from the correlation example, with X: [1, 2, 3, 4, 5] and Y: [2, 4, 5, 4, 5].
- First calculate X̄ and Ȳ.
- Then, using the above formulas, determine b and a.
After calculation:
b = 6 / 10 = 0.6
a = 4 − 0.6 × 3 = 2.2
Thus, the regression equation becomes Y = 2.2 + 0.6X.
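The same calculation can be checked with a few lines of Python. This is a sketch under the assumption that the slope and intercept formulas are applied directly. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when all X values are equal")
    b = s_xy / s_xx          # slope
    a = y_mean - b * x_mean  # intercept
    return a, b

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a, b = fit_line(X, Y)
print(f"Y = {a:.1f} + {b:.1f}X")  # Y = 2.2 + 0.6X
```

The results are formatted to one decimal place because floating-point arithmetic may leave a tiny rounding error in the last digits.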
Key differences and summary
- Purpose: Correlation measures the direction and strength of a relationship. Regression, by contrast, models the relationship and predicts one variable from another.
- Dependence: Correlation treats both variables alike and makes no assumption about which one depends on the other. Regression designates one variable as dependent and the other as independent.
- Symmetry: Correlation is symmetric because corr(X, Y) = corr(Y, X). Regression is not: the line Y = a + bX obtained by regressing Y on X is generally not the same line as X = c + dY obtained by regressing X on Y.
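The symmetry point can be checked numerically on the dataset used in the examples, X: [1, 2, 3, 4, 5] and Y: [2, 4, 5, 4, 5]. The sketch below uses hypothetical helper names (`sums`, `pearson_r`, `fit_line`) and applies the formulas from this article directly.

```python
import math

def sums(xs, ys):
    """Means and deviation sums shared by the correlation and regression formulas."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return x_mean, y_mean, s_xy, s_xx, s_yy

def pearson_r(xs, ys):
    _, _, s_xy, s_xx, s_yy = sums(xs, ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def fit_line(xs, ys):
    """Intercept and slope of the least-squares line predicting ys from xs."""
    x_mean, y_mean, s_xy, s_xx, _ = sums(xs, ys)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Correlation: swapping the variables does not change r.
print(math.isclose(pearson_r(X, Y), pearson_r(Y, X)))  # True

# Regression: swapping the variables gives a different line.
a, b = fit_line(X, Y)  # Y on X
c, d = fit_line(Y, X)  # X on Y
print(f"Y = {a:.2f} + {b:.2f}X")  # Y = 2.20 + 0.60X
print(f"X = {c:.2f} + {d:.2f}Y")  # X = -1.00 + 1.00Y
```

Solving Y = 2.2 + 0.6X for X would give a slope of 1 / 0.6, about 1.67, not the slope of 1.00 obtained by regressing X on Y. The two regression lines coincide only when the correlation is perfect (r = 1 or r = -1).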
In conclusion, correlation and regression provide valuable insight into the relationships between variables. Understanding these concepts is crucial for data analysis in many fields and provides an important foundation for advanced statistical modeling.