Write a Python function that calculates the covariance matrix from a list of vectors. Assume that each vector in the input list holds the observations of one feature, and that all vectors have the same length.
Example:
input: vectors = [[1, 2, 3], [4, 5, 6]]
output: [[1.0, 1.0], [1.0, 1.0]]
reasoning: The dataset has two features with three observations each. The covariance between each pair of features (including covariance with itself) is calculated and returned as a 2x2 matrix.
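The arithmetic behind this example can be written out as a short worked calculation, assuming the sample covariance (the sum of products of deviations divided by the number of observations minus one). The feature means are \(\bar{x} = 2\) and \(\bar{y} = 5\), so both features have the deviations \((-1, 0, 1)\):
\[
\text{cov}(X, Y) = \frac{(-1)(-1) + (0)(0) + (1)(1)}{3 - 1} = \frac{2}{2} = 1
\]
Because the two features have identical deviations, the same calculation gives \(\text{cov}(X, X) = \text{cov}(Y, Y) = 1\), which is why every entry of the output is 1.0.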
Calculate Covariance Matrix
The covariance matrix is a fundamental concept in statistics. It summarizes how each pair of variables in a dataset varies together, which makes it essential for understanding the relationships between those variables.
For a dataset with \(m\) features, the covariance matrix is an \(m \times m\) square matrix where each element \((i, j)\) is the covariance between the \(i^{th}\) and \(j^{th}\) features. The sample covariance of two features is defined by the formula:
\[
\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}
\]
Where:
- \(X\) and \(Y\) are two random variables (features),
- \(x_i\) and \(y_i\) are individual observations of \(X\) and \(Y\),
- \(\bar{x}\) (x-bar) and \(\bar{y}\) (y-bar) are the means of \(X\) and \(Y\),
- \(n\) is the number of observations.
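As a minimal sketch of the formula above for a single pair of features, the following function computes the sample covariance directly. The name `sample_covariance` and the input checks are illustrative assumptions, not part of the original exercise.

```python
def sample_covariance(x: list[float], y: list[float]) -> float:
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    # Hypothetical helper for illustration; the checks are an added assumption.
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum the products of paired deviations, then divide by n - 1.
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# y falls as x rises, so the covariance comes out negative.
print(sample_covariance([2, 4, 6], [9, 5, 1]))  # -8.0
```

Here the deviations are \((-2, 0, 2)\) and \((4, 0, -4)\), so the sum of products is \(-16\) and dividing by \(n - 1 = 2\) gives \(-8\).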
In the covariance matrix:
- The diagonal elements (where \(i = j\)) indicate the variance of each feature.
- The off-diagonal elements show the covariance between different features.
The matrix is symmetric because the covariance between \(X\) and \(Y\) equals the covariance between \(Y\) and \(X\), that is, \(\text{cov}(X, Y) = \text{cov}(Y, X)\).
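The two properties above can be checked numerically. The sketch below is only an illustration: `pair_covariance` and the sample `features` data are made up for this check, and the diagonal is compared against `statistics.variance` from the Python standard library, which also divides by \(n - 1\).

```python
import math
import statistics


def pair_covariance(x, y):
    # Hypothetical helper: sample covariance of two equal-length lists.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


# Illustrative data: three features with three observations each.
features = [[2.0, 4.0, 6.0], [9.0, 5.0, 1.0], [1.0, 0.0, 2.0]]
matrix = [[pair_covariance(a, b) for b in features] for a in features]

# Symmetry: entry (i, j) matches entry (j, i).
is_symmetric = all(
    math.isclose(matrix[i][j], matrix[j][i])
    for i in range(len(features))
    for j in range(len(features))
)
# Diagonal: entry (i, i) matches the sample variance of feature i.
diagonal_is_variance = all(
    math.isclose(matrix[i][i], statistics.variance(features[i]))
    for i in range(len(features))
)
print(is_symmetric, diagonal_is_variance)  # True True
```

Symmetry is also what allows an implementation to compute only the entries with \(j \geq i\) and mirror them across the diagonal.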
def calculate_covariance_matrix(vectors: list[list[float]]) -> list[list[float]]:
    # Each inner list holds all observations of one feature.
    n_features = len(vectors)
    n_observations = len(vectors[0])
    covariance_matrix = [[0.0 for _ in range(n_features)] for _ in range(n_features)]
    means = [sum(feature) / n_observations for feature in vectors]
    for i in range(n_features):
        # The matrix is symmetric, so compute only j >= i and mirror the result.
        for j in range(i, n_features):
            # Sample covariance: divide by n_observations - 1.
            covariance = sum(
                (vectors[i][k] - means[i]) * (vectors[j][k] - means[j])
                for k in range(n_observations)
            ) / (n_observations - 1)
            covariance_matrix[i][j] = covariance_matrix[j][i] = covariance
    return covariance_matrix