Correlation and Regression

Introduction

In statistics, it is important to understand the relationship between two variables. Doing so can reveal how one variable changes as another changes. Two key concepts that help us understand these relationships are "correlation" and "regression". These concepts allow us to investigate whether variables are related to each other and how strongly. Let's discuss these interesting topics in depth!

Correlation

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables, usually denoted as X and Y. It tells us whether variables move together (and if they do, whether they move in the same or opposite directions) without implying a cause-effect relationship.

Understanding correlation

When two variables are correlated, it means there is a predictable pattern in the changes that occur between them. The correlation can be positive, negative, or zero.

Positive correlation: As one variable increases, the other increases as well. For example, the relationship between the amount of time studied and the score obtained on an exam might exhibit a positive correlation.
Negative correlation: As one variable increases, the other decreases. An example of this could be the relationship between the number of movies watched per week and time spent studying.
No correlation (zero correlation): No predictable pattern connects the variables. For example, the relationship between eye color and intelligence level is expected to show no correlation.

Visual example of correlation

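In a scatter plot, each (X, Y) pair is drawn as a point, so the correlation between two variables shows up as a visual pattern. Points that rise from left to right suggest a positive correlation, points that fall from left to right suggest a negative correlation, and a shapeless cloud of points suggests no correlation.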

Expressing correlation mathematically

The most commonly used correlation coefficient is the Pearson correlation coefficient, denoted by r. The formula to calculate it is as follows:

r = Σ((X_i - X̄)(Y_i - Ȳ)) / √(Σ(X_i - X̄)² · Σ(Y_i - Ȳ)²)

Where:

X_i and Y_i are the individual data points (the i-th observed values of X and Y).
X̄ is the mean of the X values and Ȳ is the mean of the Y values.
The range of r is from -1 to +1.

If r = 1, it indicates a perfect positive linear relationship. If r = -1, it indicates a perfect negative linear relationship. When the value of r is close to 0, there is little or no linear relationship, although the variables may still be related in a nonlinear way.
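
A value of r close to 0 speaks only to linear patterns. As a small illustration (the data and the helper name `pearson_r` are made up for this sketch and are not part of the lesson's example), the points below lie exactly on the curve Y = X², yet the formula above gives r = 0:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, following the formula above."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator

# Hypothetical data: Y is completely determined by X, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # 4, 1, 0, 1, 4

r_curve = pearson_r(xs, ys)
print(r_curve)   # 0.0
```

The positive and negative products of deviations cancel out, so r cannot detect this curved relationship.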

Example

Consider a simple dataset with two variables:

X: 1, 2, 3, 4, 5
Y: 2, 4, 5, 4, 5

To determine the correlation between X and Y, you need to apply the formula specified above.
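
The calculation can be sketched step by step in Python. The variable names (`s_xy`, `s_xx`, `s_yy`) are chosen here for illustration only; each one corresponds to one of the sums in the formula.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

x_bar = sum(X) / len(X)   # mean of X: 3.0
y_bar = sum(Y) / len(Y)   # mean of Y: 4.0

# Numerator: sum of the products of the deviations from the means
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 6.0

# Under the square root: sums of the squared deviations
s_xx = sum((x - x_bar) ** 2 for x in X)   # 10.0
s_yy = sum((y - y_bar) ** 2 for y in Y)   # 6.0

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))   # 0.775
```

A value of about 0.775 points to a fairly strong positive linear relationship between X and Y in this dataset.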

Regression

While correlation measures the strength and direction of the relationship between two variables, regression is about predicting one variable based on another. It predicts the dependent variable (often denoted as Y) using the independent variable (X).

Understanding regression

Regression helps us understand how the typical value of the dependent variable changes when one independent variable changes while any other independent variables are held constant. Its simplest form is linear regression with a single independent variable, which is represented by a straight line.

Linear regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. The equation of a line is usually presented as:

Y = a + bX

Where:

Y is the dependent variable we are trying to predict.
X is the independent variable that we are using for prediction.
a is the intercept, the value of Y when X=0.
b is the slope, which represents the change in Y for a one unit change in X.
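
To see what the intercept and slope mean in practice, here is a minimal sketch. The function name `predict` and the values a = 1.5 and b = 2.0 are hypothetical, picked only to make the arithmetic easy to follow.

```python
def predict(x, a, b):
    """Value of Y on the line Y = a + bX for a given X."""
    return a + b * x

# Hypothetical intercept and slope, for illustration only
a, b = 1.5, 2.0

at_zero = predict(0, a, b)                    # equals the intercept a
step = predict(4, a, b) - predict(3, a, b)    # equals the slope b

print(at_zero)   # 1.5
print(step)      # 2.0
```

Setting X = 0 returns the intercept, and moving X up by one unit changes the predicted Y by exactly the slope.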

Visual example of regression

In a scatter plot, the regression line is drawn through the cloud of data points.

This line is called the line of best fit or regression line. It is chosen to minimize the sum of the squared vertical distances between the data points and the line, an approach known as the least squares method.
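
The least squares idea can be checked numerically. The sketch below uses the dataset from the correlation example and compares a few candidate lines. The pair a = 2.2, b = 0.6 is the line this lesson arrives at in its regression example; the other three pairs are arbitrary alternatives chosen for comparison, and the function name is illustrative.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

def sum_squared_errors(a, b):
    """Sum of squared vertical distances from the points to Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

# (intercept, slope) pairs; only the first is the fitted line
candidates = [(2.2, 0.6), (2.0, 0.6), (2.2, 0.8), (1.0, 1.0)]

for a, b in candidates:
    print(f"a={a}, b={b}: sum of squared errors = {sum_squared_errors(a, b):.2f}")

# The candidate with the smallest total squared error
best = min(candidates, key=lambda ab: sum_squared_errors(*ab))
print(best)   # (2.2, 0.6)
```

Among these candidates, the fitted line has the smallest sum of squared errors (2.40, against 2.60, 4.60, and 4.00 for the others).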

Finding the regression line mathematically

The formulas to calculate the slope b and intercept a are given as:

b = Σ((X_i - X̄)(Y_i - Ȳ)) / Σ((X_i - X̄)²)
a = Ȳ − bX̄

These formulas arise from minimizing the sum of the squared differences between the observed Y values and the values predicted by the line.
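
For readers who have met derivatives, the minimization can be sketched as follows. This outline goes beyond what the lesson itself derives, and S(a, b) is a name introduced here for the sum of squared differences.

```latex
\begin{align*}
S(a, b) &= \sum_i \bigl(Y_i - a - bX_i\bigr)^2 \\
\frac{\partial S}{\partial a} &= -2\sum_i \bigl(Y_i - a - bX_i\bigr) = 0
  \quad\Longrightarrow\quad a = \bar{Y} - b\bar{X} \\
\frac{\partial S}{\partial b} &= -2\sum_i X_i\bigl(Y_i - a - bX_i\bigr) = 0
  \quad\Longrightarrow\quad \sum_i X_i\bigl[(Y_i - \bar{Y}) - b(X_i - \bar{X})\bigr] = 0 \\
b &= \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
\end{align*}
```

The third line substitutes a = Ȳ − bX̄. The last line follows because deviations from a mean sum to zero, so X_i can be replaced by (X_i − X̄) in that sum without changing its value.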

Example

Use the same dataset as in the correlation example, with X: [1, 2, 3, 4, 5] and Y: [2, 4, 5, 4, 5].

First calculate X̄ and Ȳ.
Then, using the formulas above, determine b and a.

After calculation:

b = 0.6
a = 2.2
Y = 2.2 + 0.6X

Thus, your regression equation becomes Y = 2.2 + 0.6X.
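
The same result can be reproduced with a few lines of Python. This sketch applies the slope and intercept formulas directly. The prediction at X = 6 is a hypothetical input added for illustration, since that value lies outside the observed data.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

x_bar = sum(X) / len(X)   # 3.0
y_bar = sum(Y) / len(Y)   # 4.0

# Slope: sum of products of deviations divided by sum of squared X deviations
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
# Intercept: forces the line through the point (x_bar, y_bar)
a = y_bar - b * x_bar

print(round(b, 2), round(a, 2))   # 0.6 2.2

# Hypothetical prediction for a new X value
predicted = a + b * 6
print(round(predicted, 2))   # 5.8
```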

Key differences and summary

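Purpose: Correlation measures the direction and strength of a relationship. Regression, by contrast, models the relationship and predicts one variable from another.
Dependence: Correlation treats both variables alike and does not designate either one as dependent. Regression designates one variable as dependent (Y) and the other as independent (X), although fitting a regression line does not by itself establish cause and effect.
Symmetry: Correlation is symmetric because corr(X, Y) = corr(Y, X). Regression is not symmetric: the line for predicting Y from X (Y = a + bX) is generally not the same as the line for predicting X from Y (X = c + dY).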
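
The symmetry point can be checked on the lesson's dataset. In this sketch the helper names `pearson_r` and `fit_line` are illustrative, not taken from any library. It computes the correlation in both orders and fits both regression lines.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

def pearson_r(us, vs):
    """Pearson correlation coefficient of two equal-length lists."""
    u_bar, v_bar = sum(us) / len(us), sum(vs) / len(vs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)

def fit_line(us, vs):
    """Least squares (intercept, slope) for predicting vs from us."""
    u_bar, v_bar = sum(us) / len(us), sum(vs) / len(vs)
    slope = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / sum(
        (u - u_bar) ** 2 for u in us
    )
    return v_bar - slope * u_bar, slope

r_xy, r_yx = pearson_r(X, Y), pearson_r(Y, X)
a, b = fit_line(X, Y)   # Y = a + bX
c, d = fit_line(Y, X)   # X = c + dY

print(round(r_xy, 3), round(r_yx, 3))   # 0.775 0.775
print(round(a, 2), round(b, 2))         # 2.2 0.6
print(round(c, 2), round(d, 2))         # -1.0 1.0
```

Solving Y = 2.2 + 0.6X for X gives roughly X = -3.67 + 1.67Y, which is not the fitted line X = -1 + 1Y. The two regressions answer different prediction questions, while the correlation is the same in both orders.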

In conclusion, correlation and regression provide valuable insight into the relationships between variables. Understanding these concepts is crucial for data analysis in many fields and provides an important foundation for advanced statistical modeling.
