Summarizing Data
Single Sample
Given a set of observations, $x_1, x_2, \ldots, x_n$,
- The sample mean is a simple and informative measure of centrality. It is denoted by $\bar{x}$ and given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Geometric mean and harmonic mean
- The geometric mean is useful for averaging growth factors. For example, if $g_1, \ldots, g_n$ are growth factors, the geometric mean, $\bar{g} = \left(\prod_{i=1}^{n} g_i\right)^{1/n}$, is a good summary statistic of the "average growth factor". This is because the growth factor obtained by applying $\bar{g}$ over $n$ periods equals the compound growth factor $g_1 g_2 \cdots g_n$, that is: $\bar{g}^{\,n} = \prod_{i=1}^{n} g_i$.
- The harmonic mean is useful for averaging rates or speeds. For example, assume that you are on a brisk hike, walking 5 km up a mountain and then 5 km back down. Say your speed going up is $v_1$ and your speed going down is $v_2$. What is your "average speed" for the whole journey? You travel up for $5/v_1$ hours and down for $5/v_2$ hours, and hence your total travel time is $5/v_1 + 5/v_2$. Hence the average speed is $\frac{10}{5/v_1 + 5/v_2} = \frac{2}{1/v_1 + 1/v_2}$, which is the harmonic mean of $v_1$ and $v_2$.
- Note that for any dataset of positive observations, $\bar{x}_h \le \bar{x}_g \le \bar{x}$, where $\bar{x}_h$ is the harmonic mean and $\bar{x}_g$ is the geometric mean. Here, the inequalities become equalities only if all observations are equal.
- A different breed of descriptive statistics is based on order statistics. This term is used to describe the sorted sample, sometimes denoted by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$
- Based on the order statistics, we can define a variety of statistics such as the minimum, $x_{(1)}$, the maximum, $x_{(n)}$, and the median, which in the case of $n$ being odd is $x_{((n+1)/2)}$ and in the case of $n$ being even is the arithmetic mean of $x_{(n/2)}$ and $x_{(n/2+1)}$.
- Related statistics are the $\alpha$-quantiles for $\alpha \in [0, 1]$, where the $\alpha$-quantile is effectively $x_{([\alpha n])}$, with $[\cdot]$ denoting a rounding of $\alpha n$ to the nearest element of $\{1, \ldots, n\}$. For $\alpha = 0.25$ and $\alpha = 0.75$, these values are known as the first quartile and third quartile, respectively.
- The interquartile range (IQR) is the difference between these two quartiles, and the range is $x_{(n)} - x_{(1)}$; both are measures of dispersion. The most popular and useful measure of dispersion is the sample variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (dividing by $n$ instead of $n-1$ yields the population variance).
- If all observations are constant then $s^2 = 0$; otherwise $s^2 > 0$, and the bigger it is, the more dispersion we have in the data.
- A related quantity is the sample standard deviation, $s = \sqrt{s^2}$.
- Standard error: the standard error of the sample mean is $s/\sqrt{n}$, which estimates the variability of $\bar{x}$ itself.
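As a minimal sketch, the formulas above can be checked directly on small made-up arrays (the values below are illustrative, not from the temperature dataset used later):

```julia
using Statistics

# Geometric mean of growth factors: applying gbar for n periods
# reproduces the compound growth factor
g = [1.10, 0.95, 1.07]                 # hypothetical yearly growth factors
gbar = prod(g)^(1/length(g))
@assert gbar^length(g) ≈ prod(g)

# Harmonic mean as "average speed": 5 km up at v1, 5 km down at v2
v1, v2 = 6.0, 4.0                      # illustrative speeds (km/h)
avg_speed = 10 / (5/v1 + 5/v2)         # total distance / total time
@assert avg_speed ≈ 2 / (1/v1 + 1/v2)  # equals the harmonic mean of v1, v2

# Harmonic <= geometric <= arithmetic (positive data)
x = [2.0, 4.0, 8.0]
hm = length(x) / sum(1 ./ x)
gm = prod(x)^(1/length(x))
@assert hm <= gm <= mean(x)

# Median: middle order statistic (odd n), average of the two middle ones (even n)
println(median([7, 1, 5, 3, 9]))       # sorted [1,3,5,7,9] -> 5.0
println(median([7, 1, 5, 3]))          # sorted [1,3,5,7]   -> 4.0

# Sample variance (n-1 divisor) vs population variance (n divisor)
xbar, n = mean(x), length(x)
@assert var(x) ≈ sum((x .- xbar).^2) / (n - 1)
@assert var(x, corrected=false) ≈ sum((x .- xbar).^2) / n
```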
- Compute in Julia
var(a::Array): sample variance
std(a::Array): sample standard deviation
iqr(a::Array): interquartile range
percentile(a::Array, n::Int): get the nth percentile of data
quantile(a::Array, q::Float64): get the q quantile of data
summarystats(a::Array): get the basic descriptive statistics
using CSV, Statistics, StatsBase, DataFrames
data = CSV.read("../data/temperatures.csv", DataFrame)[:,4]
println("Sample Mean: ", mean(data))
println("Harmonic <= Geometric <= Arithmetic ",
(harmmean(data), geomean(data), mean(data)))
println("Sample Variance: ", var(data))
println("Sample Standard Deviation: ", std(data))
println("Minimum: ", minimum(data))
println("Maximum: ", maximum(data))
println("Median: ", median(data))
println("95th percentile: ", percentile(data, 95))
println("0.95 quantile: ", quantile(data, 0.95))
println("Interquartile range: ", iqr(data), "\n")
summarystats(data)
Observations in Pairs
Data is configured in the form of pairs, $(x_1, y_1), \ldots, (x_n, y_n)$.
- We often consider the sample covariance, which is given by $\hat{\sigma}_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- A positive covariance indicates a positive linear relationship, meaning that when $x$ is larger than its mean, we expect $y$ to be larger than its mean, and similarly when $x$ is small then $y$ tends to be small as well.
- A negative covariance indicates a negative linear relationship, meaning that when $x$ is large then $y$ tends to be small.
- If the covariance is $0$ or near $0$, it is an indication that no such relationship holds.
- However, like the variance, the covariance is not a normalized quantity. For this reason, we define another useful statistic, the sample correlation coefficient, $r = \hat{\sigma}_{xy}/(s_x s_y)$, where $s_x$ and $s_y$ are the sample standard deviations.
- It holds that $-1 \le r \le 1$. The sign of $r$ agrees with the sign of $\hat{\sigma}_{xy}$; however, importantly, its magnitude is also meaningful. Having $r$ near $0$ implies little or no linear relationship, while $|r|$ closer to $1$ implies a stronger linear relationship, either positive or negative depending on the sign of $r$.
- It is often useful to represent the variances and covariances in the sample covariance matrix: $\hat{\Sigma} = \begin{pmatrix} s_x^2 & \hat{\sigma}_{xy} \\ \hat{\sigma}_{xy} & s_y^2 \end{pmatrix}$
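A minimal numeric check of these definitions, using small made-up paired samples (not the temperature data used below):

```julia
using Statistics

# Made-up paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Sample covariance from the definition
xbar, ybar, n = mean(x), mean(y), length(x)
covxy = sum((x .- xbar) .* (y .- ybar)) / (n - 1)
@assert covxy ≈ cov(x, y)             # matches the built-in

# Sample correlation coefficient: normalized covariance
r = covxy / (std(x) * std(y))
@assert r ≈ cor(x, y)
@assert -1 <= r <= 1                  # always bounded

# Sample covariance matrix of the pair
Sigma = [std(x)^2 covxy; covxy std(y)^2]
@assert Sigma ≈ cov([x y])            # cov of the n x 2 data matrix
```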
- Compute in Julia
using DataFrames, CSV, Statistics
data = CSV.read("../data/temperatures.csv", DataFrame)
brisT = data.Brisbane
gcT = data.GoldCoast
sigB = std(brisT)
sigG = std(gcT)
covBG = cov(brisT, gcT)
meanVect = [mean(brisT), mean(gcT)]
covMat = [sigB^2 covBG;
covBG sigG^2]
outfile = open("../data/mvParams.jl", "w")
write(outfile, "meanVect = $meanVect \ncovMat = $covMat")
close(outfile)
println(read("../data/mvParams.jl", String))
Observations in Vectors
Consider data that consists of $n$ vectors, each of length $p$. The $i$'th data vector represents a tuple of values, $(x_{i1}, \ldots, x_{ip})$. In this case, the data can be represented by an $n \times p$ data matrix, $X$, where the rows are observations (data vectors) and each column represents a different variable, feature, or attribute.
- In summarizing the data, a few basic objects arise. These include the sample mean vector, sample standard deviation vector, sample covariance matrix, and the sample correlation matrix.
- The sample mean vector, $\bar{\mathbf{x}}$, is simply a vector of length $p$ where the $j$'th entry, $\bar{x}_j$, is the sample mean of the $j$'th column of $X$.
- The sample standard deviation vector has $j$'th entry, $s_j$, which is the sample standard deviation of the $j$'th column of $X$.
- With these, we often standardize the data by creating a new matrix $Z$ with entries, $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$
- It can be created via $Z = (X - \mathbf{1}\bar{\mathbf{x}}^\top)D^{-1}$,
- Where $\mathbf{1}$ is a column vector of 1's ($n$ rows)
- $\bar{\mathbf{x}}$ is the mean vector
- $D$ is a diagonal matrix created from the standard deviation vector, $D = \mathrm{diag}(s_1, \ldots, s_p)$
- The standardized data has the attribute that each column, $\mathbf{z}_j$, has a sample mean of $0$ and a unit standard deviation. Hence, the first- and second-order information of the $j$'th feature is lost when moving from the data matrix $X$ to the standardized matrix $Z$. Nevertheless, relationships between features are still captured in $Z$ and can be easily calculated. Most notably, the sample correlation between feature $j$ and feature $k$, denoted $r_{jk}$, is simply calculated via $r_{jk} = \frac{1}{n-1}\mathbf{z}_j^\top \mathbf{z}_k$
- Without resorting to standardization, it is often of interest to calculate the sample covariance matrix, $\hat{\Sigma}$, whose $(j,k)$'th entry is the sample covariance between the $j$'th and $k$'th columns of $X$: $\hat{\Sigma} = \frac{1}{n-1}(X - \mathbf{1}\bar{\mathbf{x}}^\top)^\top(X - \mathbf{1}\bar{\mathbf{x}}^\top)$
- Proof: the $(j,k)$'th entry of this matrix product is $\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$, which is exactly the sample covariance of features $j$ and $k$.
- It can also be created via $\hat{\Sigma} = \frac{1}{n-1} Y^\top Y$, where $Y = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)X$ is the de-meaned data.
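The matrix identities above can be sketched on a tiny made-up data matrix (3 observations, 2 features; the values are arbitrary):

```julia
using Statistics, LinearAlgebra

# Tiny made-up data matrix: n = 3 observations, p = 2 features
X = [1.0 10.0;
     2.0 30.0;
     6.0 20.0]
n, p = size(X)

xbar = vec(mean(X, dims=1))            # sample mean vector
s    = vec(std(X, dims=1))             # sample standard deviation vector

# Standardization: Z = (X - 1 xbar') D^-1
Z = (X .- xbar') * inv(Diagonal(s))
@assert all(abs.(mean(Z, dims=1)) .< 1e-10)  # each column has mean 0
@assert all(std(Z, dims=1) .≈ 1.0)           # and unit standard deviation

# Covariance matrix via de-meaning: Y = (I - 11'/n) X, Sigma = Y'Y/(n-1)
Y = (I - ones(n, n)/n) * X
@assert Y' * Y / (n - 1) ≈ cov(X)

# Correlation matrix from standardized data: R = Z'Z/(n-1)
@assert Z' * Z / (n - 1) ≈ cor(X)
```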
- Compute in Julia
zscore(a::Array): convert an array to z-scores
using Statistics, StatsBase, LinearAlgebra, DataFrames, CSV
df = CSV.read("../data/3featureData.csv",DataFrame ,header=false)
n,p = size(df)
println("Number of features: ", p)
println("Number of observations: ", n)
X = Matrix{Float64}(df)
println("Dimensions of data matrix: ", size(X))
xbarA = (1/n) * X' * ones(n)
xbarB = [mean(X[:,i]) for i in 1:p]
xbarC = sum(X, dims=1) ./ n
println("\nAlternative calculations of (sample) mean vector: ")
@show(xbarA), @show(xbarB), @show(xbarC)
Y = (I - ones(n,n)/n) * X
println("Y is the de-meaned data: ", mean(Y, dims=1))
covA = (X .- xbarA')'*(X .- xbarA')/(n-1)
covB = Y'*Y/(n-1)
covC = [cov(X[:,i],X[:, j]) for i in 1:p, j in 1:p]
covD = [cor(X[:,i],X[:, j]) * std(X[:,i]) * std(X[:,j]) for i in 1:p, j in 1:p]
covE = cov(X)
println("\nAlternative calculations of (sample) covariance matrix: ")
@show(covA), @show(covB), @show(covC), @show(covD), @show(covE)
ZmatA = [(X[i,j] - xbarA[j])/sqrt(covA[j,j]) for i in 1:n, j in 1:p]
ZmatB = (X .- xbarA') * sqrt.(Diagonal(covA))^(-1)
ZmatC = hcat([zscore(X[:,j]) for j in 1:p]...)
println("\nAlternative computations of z-score yield same matrix: ",
(maximum(norm(ZmatA - ZmatB)), maximum(norm(ZmatC - ZmatB)), maximum(norm(ZmatA - ZmatC))))
Z = ZmatA
corA = covA ./ [std(X[:,i])*std(X[:,j]) for i in 1:p, j in 1:p]
corB = covA ./ (std(X, dims=1)' * std(X, dims=1))
corC = [cor(X[:,i], X[:,j]) for i in 1:p, j in 1:p]
corD = Z' * Z ./ (n-1)
corE = cov(Z)
corF = cor(X)
println("\nAlternative calculations of (sample) correlation matrix: ")
@show(corA), @show(corB), @show(corC), @show(corD), @show(corE), @show(corF)