Grade 11



Correlation and Regression


Introduction

In statistics, it is important to understand the relationship between two variables. This can reveal how changes in one variable are associated with changes in another. Two key concepts that help us understand these relationships are "correlation" and "regression". These concepts allow us to investigate whether variables are related to each other, how strongly, and how one can be predicted from the other. Let's discuss these interesting topics in depth!

Correlation

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables, usually denoted as X and Y. It tells us whether variables move together (and if they do, whether they move in the same or opposite directions) without implying a cause-effect relationship.

Understanding correlation

When two variables are correlated, there is a predictable pattern in how they change together. The correlation can be positive, negative, or zero.

  • Positive correlation: As one variable increases, the other increases as well. For example, the relationship between the amount of time studied and the score obtained on an exam might exhibit a positive correlation.
  • Negative correlation: As one variable increases, the other decreases. An example of this could be the relationship between the number of movies watched per week and time spent studying.
  • No correlation (zero correlation): No predictable pattern connects the variables. For example, the relationship between eye color and intelligence level is expected to show no correlation.

Visual example of correlation

In a scatter plot, the correlation between two variables is displayed visually:

[Figure: three scatter plots, labeled "Positive correlation", "Negative correlation", and "No correlation"]

Expressing correlation mathematically

The most commonly used correlation coefficient is the Pearson correlation coefficient, denoted by r. The formula to calculate it is as follows:

r = Σ((X_i - X̄)(Y_i - Ȳ)) / √(Σ(X_i - X̄)² * Σ(Y_i - Ȳ)²)

Where:

  • X_i and Y_i are the individual data points.
  • X̄ is the mean of the X values and Ȳ is the mean of the Y values.
  • The range of r is from -1 to +1.

If r = 1, it indicates a perfect positive linear relationship. If r = -1, it is a perfect negative linear relationship. When the value of r is close to 0, there is little or no linear relationship, although the variables may still be related in a non-linear way.
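The formula can be written out as a short Python function. This is only a minimal sketch: the name pearson_r and the two small datasets are made up for illustration (Y = 2X and Y = -2X), chosen so that the perfect cases r = 1 and r = -1 appear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of the deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: every point lies exactly on a straight line.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # rising line, r = 1
r_down = pearson_r([1, 2, 3, 4], [-2, -4, -6, -8])  # falling line, r = -1
print(r_up, r_down)
```

Note that the function divides by zero if all X values (or all Y values) are identical, because r is undefined when one variable does not vary.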

Example

Consider a simple dataset with two variables:

  • X: 1, 2, 3, 4, 5
  • Y: 2, 4, 5, 4, 5

To determine the correlation between X and Y, you need to apply the formula specified above.
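As a sketch of how that calculation might go, the Python snippet below applies the formula to this dataset step by step. The variable names (x_bar, s_xy, and so on) are illustrative choices, not notation from the text.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Step 1: the means.
x_bar = sum(X) / len(X)  # 3.0
y_bar = sum(Y) / len(Y)  # 4.0

# Step 2: the sums of deviation products and squared deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # 6.0
s_xx = sum((x - x_bar) ** 2 for x in X)                      # 10.0
s_yy = sum((y - y_bar) ** 2 for y in Y)                      # 6.0

# Step 3: the correlation coefficient, 6 / sqrt(10 * 6).
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))  # 0.775
```

A value of about 0.775 indicates a fairly strong positive linear relationship between X and Y.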

Regression

While correlation measures the strength and direction of the relationship between two variables, regression is about predicting one variable based on another. It predicts the dependent variable (often denoted as Y) using the independent variable (X).

Understanding regression

Regression helps us understand how the typical value of the dependent variable changes when one independent variable is changed while any other independent variables are held constant. Its simplest form is linear regression with a single independent variable, which is represented as a straight line.

Linear regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. The equation of a line is usually presented as:

Y = a + bX

Where:

  • Y is the dependent variable we are trying to predict.
  • X is the independent variable that we are using for prediction.
  • a is the intercept, the value of Y when X=0.
  • b is the slope, which represents the change in Y for a one unit change in X.
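To make the roles of the intercept and slope concrete, here is a small Python sketch. The function name predict and the coefficient values a = 1.0 and b = 2.0 are hypothetical, chosen only for illustration.

```python
def predict(x, a, b):
    """Value of Y on the line Y = a + bX for a given X."""
    return a + b * x

# Hypothetical coefficients, for illustration only.
a, b = 1.0, 2.0

at_zero = predict(0, a, b)               # the intercept: Y when X = 0
step = predict(4, a, b) - predict(3, a, b)  # the slope: change in Y per unit of X
print(at_zero, step)  # 1.0 2.0
```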

Visual example of regression

A regression line drawn through the data points of a scatter plot looks as follows:

[Figure: scatter plot with the line of best fit]

The red line is called the line of best fit or regression line. It minimizes the sum of the squared vertical distances between the data points and the line. This approach is known as the least squares method.

Finding the regression line mathematically

The formulas to calculate the slope b and intercept a are given as:

b = Σ((X_i - X̄)(Y_i - Ȳ)) / Σ((X_i - X̄)²)
a = Ȳ − bX̄

These formulas arise from minimizing the sum of the squared differences between the observed Y values and the values predicted by the line.
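The minimizing property can be checked numerically. The Python sketch below uses the dataset from the correlation example; the helper names fit_line and sse and the nudge size of 0.1 are assumptions made for illustration. It fits the line with the formulas above, then confirms that shifting the intercept or slope slightly in any direction makes the sum of squared errors larger.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for the line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared differences between observed and predicted Y values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a, b = fit_line(X, Y)
best = sse(X, Y, a, b)

# Sum of squared errors for lines nudged slightly away from the fitted one.
nudged = [
    sse(X, Y, a + da, b + db)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]
print(all(s > best for s in nudged))  # True
```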

Example

Use the first dataset again, with X: [1, 2, 3, 4, 5] and Y: [2, 4, 5, 4, 5].

  • First, calculate X̄ and Ȳ (here X̄ = 3 and Ȳ = 4).
  • Then, using the above formula, determine b and a.

After calculation:

b = 6 / 10 = 0.6 (since Σ((X_i - X̄)(Y_i - Ȳ)) = 6 and Σ((X_i - X̄)²) = 10)
a = 4 − 0.6 × 3 = 2.2

Thus, your regression equation becomes Y = 2.2 + 0.6X.
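One way to see what this equation does is to compare its predictions with the observed values. The Python sketch below takes a = 2.2 and b = 0.6 from the example; the names predicted and residuals are illustrative choices.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Coefficients taken from the worked example above.
a, b = 2.2, 0.6

# Predicted Y for each observed X, and the residual (observed minus predicted).
predicted = [a + b * x for x in X]
residuals = [y - y_hat for y, y_hat in zip(Y, predicted)]

print([round(p, 2) for p in predicted])  # [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(sum(residuals), 6) == 0)     # True
```

The residuals of a least squares line with an intercept always sum to zero (up to rounding), so the line passes through the point (X̄, Ȳ) = (3, 4).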

Key differences and summary

  • Purpose: Correlation measures the direction and strength of a relationship. Regression, by contrast, models the relationship and predicts one variable from another.
  • Dependence: Correlation does not designate either variable as dependent. Regression treats one variable as dependent (Y) and the other as independent (X).
  • Symmetry: Correlation is symmetric because corr(X, Y) = corr(Y, X). Regression is not: the line obtained by regressing Y on X (Y = a + bX) generally differs from the line obtained by regressing X on Y (X = c + dY).
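The symmetry point can be illustrated with the dataset used in the examples. The Python sketch below is a minimal illustration; the helper name deviation_sums is hypothetical. It computes the correlation in both orders, then fits both regression lines.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

def deviation_sums(us, vs):
    """Return (mean of us, mean of vs, S_uv, S_uu, S_vv)."""
    u_bar = sum(us) / len(us)
    v_bar = sum(vs) / len(vs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return u_bar, v_bar, s_uv, s_uu, s_vv

x_bar, y_bar, s_xy, s_xx, s_yy = deviation_sums(X, Y)
_, _, s_yx, _, _ = deviation_sums(Y, X)

# Correlation gives the same value in either order.
r_xy = s_xy / math.sqrt(s_xx * s_yy)
r_yx = s_yx / math.sqrt(s_yy * s_xx)

# Regressing Y on X: Y = a + bX.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regressing X on Y: X = c + dY.
d = s_xy / s_yy
c = x_bar - d * y_bar

print(round(r_xy, 3), round(r_yx, 3))  # 0.775 0.775
print(round(a, 2), round(b, 2))        # 2.2 0.6
print(round(c, 2), round(d, 2))        # -1.0 1.0
```

Simply solving Y = 2.2 + 0.6X for X would give a slope of 1 / 0.6 ≈ 1.67, not the fitted slope d = 1.0, so the two regression lines are genuinely different.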

In conclusion, correlation and regression provide valuable insight into the relationships between variables. Understanding these concepts is crucial for data analysis in many fields and provides an important foundation for advanced statistical modeling.

