A Simple Explanation of Covariance and Correlation


One of the most common questions that arises when working with data is how the variables depend on one another: how they vary together and how they are linked.

Definitions

1.) "Covariance is a measure of how much two random variables vary together."
2.) "Correlation is a statistic that measures the degree to which two variables move in relation to each other."

Both covariance and correlation describe how two variables relate to each other. Covariance can range from minus infinity to positive infinity. A positive value means that the two variables tend to move in the same direction. A negative value means that they tend to move in opposite directions. Correlation additionally measures the strength of the relationship. It ranges from -1 to +1, and its sign has the same meaning as the sign of the covariance.
These two concepts are useful when doing feature engineering.

Example

We now have a basic idea of both terms, so it is time to move on to an example. I will use the popular IRIS dataset. I will select only one species, Iris-versicolor, which is commonly known as the Northern blue flag flower.

First, let us take a look at the sample data frame.

cov_cor.py
import pandas as pd
import math
data = pd.read_csv('../input/iris/Iris.csv')
data = data[data['Species'] == 'Iris-versicolor'] 
data.head()

Let us choose PetalLengthCm and PetalWidthCm for the analysis. First, let us visualize the data. I will use regplot from the seaborn library, which combines a scatter plot with a fitted linear regression line.

cov_cor.py
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.regplot(x="PetalLengthCm", y="PetalWidthCm", data=data)

Looking at the plot, we can see that the two variables appear to be highly correlated. Let us calculate the values to check whether this assumption is right.

Covariance

This is the formula for calculating the covariance:

$$cov_{x,y}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}$$

cov_{x,y} = covariance between variables x and y
x_i = i-th data value of x
y_i = i-th data value of y
x̅ = mean of x
y̅ = mean of y
N = number of data values
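
As a small worked example with made-up numbers (not taken from the Iris data), take x = (1, 2, 3) and y = (2, 4, 6), so that x̅ = 2, y̅ = 4 and N = 3:

$$cov_{x,y}=\frac{(1-2)(2-4)+(2-2)(4-4)+(3-2)(6-4)}{3-1}=\frac{2+0+2}{2}=2$$

The result is positive because y increases whenever x increases.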

Let us implement this using Python.

cov_cor.py
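def calc_covariance(x, y):

    # Means of the two variables
    x_mean = sum(x) / float(len(x))
    y_mean = sum(y) / float(len(y))
    # Deviation of every value from its mean
    x_sub = [i - x_mean for i in x]
    y_sub = [i - y_mean for i in y]
    # Sum of the products of paired deviations
    dividend = sum([x_sub[i] * y_sub[i] for i in range(len(x_sub))])
    # Dividing by N - 1 gives the sample covariance
    divisor = len(x) - 1
    cov = dividend / divisor

    return cov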

print(calc_covariance(data.PetalLengthCm, data.PetalWidthCm))

We got a positive value, which means that both variables move in the same direction.

Correlation

There are several ways to calculate correlation. The Pearson correlation coefficient is probably the most widely used measure of the linear relationship between two normally distributed variables, and is therefore often just called the "correlation coefficient".

Now let us calculate the correlation:

$$r_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\sum_{i=1}^{N} (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^{N}(y_i - \overline{y})^2}}$$

  • The dividend is the covariance multiplied by N - 1.
  • The divisor is the product of the square roots of the summed squared deviations of x and y, which equals the product of the two sample standard deviations multiplied by N - 1.
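
A short derivation may make the link to covariance explicit. Here s_x and s_y denote the sample standard deviations of x and y (symbols introduced for this sketch), and N is the number of data values. Dividing both the dividend and the divisor by N - 1 gives:

$$r_{x,y} = \frac{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \overline{x})^2} \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(y_i - \overline{y})^2}} = \frac{cov_{x,y}}{s_x s_y}$$

In other words, correlation is the covariance rescaled by the spread of each variable, which is why it always lies between -1 and +1.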

Let us implement this using Python.

cov_cor.py
def calc_correlation(x, y):

    # Deviation of every value from its mean
    x_mean = sum(x) / float(len(x))
    y_mean = sum(y) / float(len(y))
    x_sub = [i - x_mean for i in x]
    y_sub = [i - y_mean for i in y]
    # Dividend: sum of the products of paired deviations
    dividend = sum([x_sub[i] * y_sub[i] for i in range(len(x_sub))])
    # Sums of squared deviations (their square roots form the divisor)
    x_sq_dev_sum = sum([x_sub[i]**2.0 for i in range(len(x_sub))])
    y_sq_dev_sum = sum([y_sub[i]**2.0 for i in range(len(y_sub))])
    divisor = math.sqrt(x_sq_dev_sum) * math.sqrt(y_sq_dev_sum)
    correlation = dividend / divisor

    return correlation

print(calc_correlation(data.PetalLengthCm, data.PetalWidthCm))

Here, our variables are strongly and positively correlated, since the value is positive and close to 1.

In this section, we studied covariance and correlation. We saw how both are expressed mathematically, and we implemented them in Python.

*This article was written by @nuwan, a member of @qualitia_cdev.