Summarizing Data
Single Sample
Given a set of observations, $x_1, x_2, \ldots, x_n$,
- The sample mean is a simple and informative measure of centrality. It is denoted by $\bar{x}$ and given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Geometric mean and harmonic mean
- The geometric mean is useful for averaging growth factors. For example, if $g_1, \ldots, g_n$ are growth factors, the geometric mean, $\bar{g} = \left(\prod_{i=1}^{n} g_i\right)^{1/n}$, is a good summary statistic of the "average growth factor". This is because the growth factor obtained by applying $\bar{g}$ over $n$ periods equals the compound growth factor $g_1 g_2 \cdots g_n$, that is: $\bar{g}^{\,n} = \prod_{i=1}^{n} g_i$.
- The harmonic mean is useful for averaging rates or speeds. For example, assume that you are on a brisk hike, walking 5 km up a mountain and then 5 km back down. Say your speed going up is $v_1$ and your speed going down is $v_2$. What is your "average speed" for the whole journey? You travel up for $5/v_1$ hours and down for $5/v_2$ hours, and hence your total travel time is $5/v_1 + 5/v_2$. Hence the average speed is $\frac{10}{5/v_1 + 5/v_2} = \frac{2}{1/v_1 + 1/v_2}$, which is the harmonic mean of $v_1$ and $v_2$.
- Note that for any dataset of positive observations, $\bar{x}_h \le \bar{x}_g \le \bar{x}$, where $\bar{x}_h$ is the harmonic mean and $\bar{x}_g$ is the geometric mean. Here, the inequalities become equalities only if all observations are equal.
- A different breed of descriptive statistics is based on order statistics. This term is used to describe the sorted sample, sometimes denoted by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$
- Based on the order statistics, we can define a variety of statistics such as the minimum, $x_{(1)}$, the maximum, $x_{(n)}$, and the median, which in the case of $n$ being odd is $x_{((n+1)/2)}$ and in the case of $n$ being even is the arithmetic mean of $x_{(n/2)}$ and $x_{(n/2+1)}$.
- Related statistics are the $\alpha$-quantiles for $\alpha \in [0, 1]$, where the $\alpha$-quantile is effectively $x_{([\alpha n])}$, with $[\cdot]$ denoting a rounding of $\alpha n$ to the nearest element of $\{1, \ldots, n\}$. For $\alpha = 0.25$ and $\alpha = 0.75$, these values are known as the first quartile and third quartile, respectively.
- The interquartile range (IQR) is the difference between these two quartiles, and the range is $x_{(n)} - x_{(1)}$; both are measures of dispersion. The most popular and useful measure of dispersion is the sample variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (dividing by $n$ instead of $n-1$ yields the population variance).
- If all observations are constant then $s^2 = 0$; otherwise $s^2 > 0$, and the bigger it is, the more dispersion we have in the data.
- A related quantity is the sample standard deviation, $s = \sqrt{s^2}$.
- Standard error: the standard error of the sample mean is $s/\sqrt{n}$, which estimates the variability of $\bar{x}$ itself.
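As a minimal sketch, the formulas above can be checked directly on small made-up arrays (the values below are illustrative, not from the temperature dataset used later):

```julia
using Statistics

# Geometric mean of growth factors: applying gbar for n periods
# reproduces the compound growth factor
g = [1.10, 0.95, 1.07]                 # hypothetical yearly growth factors
gbar = prod(g)^(1/length(g))
@assert gbar^length(g) ≈ prod(g)

# Harmonic mean as "average speed": 5 km up at v1, 5 km down at v2
v1, v2 = 6.0, 4.0                      # illustrative speeds (km/h)
avg_speed = 10 / (5/v1 + 5/v2)         # total distance / total time
@assert avg_speed ≈ 2 / (1/v1 + 1/v2)  # equals the harmonic mean of v1, v2

# Harmonic <= geometric <= arithmetic (positive data)
x = [2.0, 4.0, 8.0]
hm = length(x) / sum(1 ./ x)
gm = prod(x)^(1/length(x))
@assert hm <= gm <= mean(x)

# Median: middle order statistic (odd n), average of the two middle ones (even n)
println(median([7, 1, 5, 3, 9]))       # sorted [1,3,5,7,9] -> 5.0
println(median([7, 1, 5, 3]))          # sorted [1,3,5,7]   -> 4.0

# Sample variance (n-1 divisor) vs population variance (n divisor)
xbar, n = mean(x), length(x)
@assert var(x) ≈ sum((x .- xbar).^2) / (n - 1)
@assert var(x, corrected=false) ≈ sum((x .- xbar).^2) / n
```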
- Compute in Julia
var(a::Array): sample variance
std(a::Array): sample standard deviation
iqr(a::Array): interquartile range
percentile(a::Array, n::Int): get the nth percentile of data
quantile(a::Array, q::Float64): get the q quantile of data
summarystats(a::Array): get the basic descriptive statistics
using CSV, Statistics, StatsBase, DataFrames
data = CSV.read("../data/temperatures.csv", DataFrame)[:,4]
println("Sample Mean: ", mean(data))
println("Harmonic <= Geometric <= Arithmetic ",
(harmmean(data), geomean(data), mean(data)))
println("Sample Variance: ", var(data))
println("Sample Standard Deviation: ", std(data))
println("Minimum: ", minimum(data))
println("Maximum: ", maximum(data))
println("Median: ", median(data))
println("95th percentile: ", percentile(data, 95))
println("0.95 quantile: ", quantile(data, 0.95))
println("Interquartile range: ", iqr(data), "\n")
summarystats(data)
Observations in Pairs
Data is configured in the form of pairs, $(x_1, y_1), \ldots, (x_n, y_n)$.
- We often consider the sample covariance, which is given by $\hat{\sigma}_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- A positive covariance indicates a positive linear relationship, meaning that when $x$ is larger than its mean, we expect $y$ to be larger than its mean, and similarly when $x$ is small then $y$ tends to be small as well.
- A negative covariance indicates a negative linear relationship, meaning that when $x$ is large then $y$ tends to be small.
- If the covariance is $0$ or near $0$, it is an indication that no such relationship holds.
- However, like the variance, the covariance is not a normalized quantity. For this reason, we define another useful statistic, the sample correlation coefficient, $r = \hat{\sigma}_{xy}/(s_x s_y)$, where $s_x$ and $s_y$ are the sample standard deviations.
- It holds that $-1 \le r \le 1$. The sign of $r$ agrees with the sign of $\hat{\sigma}_{xy}$; however, importantly, its magnitude is also meaningful. Having $r$ near $0$ implies little or no linear relationship, while $|r|$ closer to $1$ implies a stronger linear relationship, either positive or negative depending on the sign of $r$.
- It is often useful to represent the variances and covariances in the sample covariance matrix: $\hat{\Sigma} = \begin{pmatrix} s_x^2 & \hat{\sigma}_{xy} \\ \hat{\sigma}_{xy} & s_y^2 \end{pmatrix}$
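A minimal numeric check of these definitions, using small made-up paired samples (not the temperature data used below):

```julia
using Statistics

# Made-up paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Sample covariance from the definition
xbar, ybar, n = mean(x), mean(y), length(x)
covxy = sum((x .- xbar) .* (y .- ybar)) / (n - 1)
@assert covxy ≈ cov(x, y)             # matches the built-in

# Sample correlation coefficient: normalized covariance
r = covxy / (std(x) * std(y))
@assert r ≈ cor(x, y)
@assert -1 <= r <= 1                  # always bounded

# Sample covariance matrix of the pair
Sigma = [std(x)^2 covxy; covxy std(y)^2]
@assert Sigma ≈ cov([x y])            # cov of the n x 2 data matrix
```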
- Compute in Julia
using DataFrames, CSV, Statistics
data = CSV.read("../data/temperatures.csv", DataFrame)
brisT = data.Brisbane
gcT = data.GoldCoast
sigB = std(brisT)
sigG = std(gcT)
covBG = cov(brisT, gcT)
meanVect = [mean(brisT), mean(gcT)]
covMat = [sigB^2 covBG;
covBG sigG^2]
outfile = open("../data/mvParams.jl", "w")
write(outfile, "meanVect = $meanVect \ncovMat = $covMat")
close(outfile)
println(read("../data/mvParams.jl", String))
Observations in Vectors
Consider data that consists of $n$ vectors, each of length $p$. The $i$'th data vector represents a tuple of values, $(x_{i1}, \ldots, x_{ip})$. In this case, the data can be represented by an $n \times p$ data matrix, $X$, where the rows are observations (data vectors) and each column represents a different variable, feature, or attribute.
- In summarizing the data, a few basic objects arise. These include the sample mean vector, sample standard deviation vector, sample covariance matrix, and the sample correlation matrix.
- The sample mean vector, $\bar{\mathbf{x}}$, is simply a vector of length $p$ where the $j$'th entry, $\bar{x}_j$, is the sample mean of the $j$'th column of $X$.
- The sample standard deviation vector has $j$'th entry, $s_j$, which is the sample standard deviation of the $j$'th column of $X$.
- With these, we often standardize the data by creating a new matrix $Z$ with entries, $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$
- It can be created via $Z = (X - \mathbf{1}\bar{\mathbf{x}}^\top)D^{-1}$,
- Where $\mathbf{1}$ is a column vector of 1's ($n$ rows)
- $\bar{\mathbf{x}}$ is the mean vector
- $D$ is a diagonal matrix created from the standard deviation vector, $D = \mathrm{diag}(s_1, \ldots, s_p)$
- The standardized data has the attribute that each column, $\mathbf{z}_j$, has a sample mean of $0$ and a unit standard deviation. Hence, the first- and second-order information of the $j$'th feature is lost when moving from the data matrix $X$ to the standardized matrix $Z$. Nevertheless, relationships between features are still captured in $Z$ and can be easily calculated. Most notably, the sample correlation between feature $j$ and feature $k$, denoted $r_{jk}$, is simply calculated via $r_{jk} = \frac{1}{n-1}\mathbf{z}_j^\top \mathbf{z}_k$
- Without resorting to standardization, it is often of interest to calculate the sample covariance matrix, $\hat{\Sigma}$, whose $(j,k)$'th entry is the sample covariance between the $j$'th and $k$'th columns of $X$: $\hat{\Sigma} = \frac{1}{n-1}(X - \mathbf{1}\bar{\mathbf{x}}^\top)^\top(X - \mathbf{1}\bar{\mathbf{x}}^\top)$
- Proof: the $(j,k)$'th entry of this matrix product is $\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$, which is exactly the sample covariance of features $j$ and $k$.
- It can also be created via $\hat{\Sigma} = \frac{1}{n-1} Y^\top Y$, where $Y = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)X$ is the de-meaned data.
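The matrix identities above can be sketched on a tiny made-up data matrix (3 observations, 2 features; the values are arbitrary):

```julia
using Statistics, LinearAlgebra

# Tiny made-up data matrix: n = 3 observations, p = 2 features
X = [1.0 10.0;
     2.0 30.0;
     6.0 20.0]
n, p = size(X)

xbar = vec(mean(X, dims=1))            # sample mean vector
s    = vec(std(X, dims=1))             # sample standard deviation vector

# Standardization: Z = (X - 1 xbar') D^-1
Z = (X .- xbar') * inv(Diagonal(s))
@assert all(abs.(mean(Z, dims=1)) .< 1e-10)  # each column has mean 0
@assert all(std(Z, dims=1) .≈ 1.0)           # and unit standard deviation

# Covariance matrix via de-meaning: Y = (I - 11'/n) X, Sigma = Y'Y/(n-1)
Y = (I - ones(n, n)/n) * X
@assert Y' * Y / (n - 1) ≈ cov(X)

# Correlation matrix from standardized data: R = Z'Z/(n-1)
@assert Z' * Z / (n - 1) ≈ cor(X)
```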
- Compute in Julia
zscore(a::Array): convert an array to z-scores
using Statistics, StatsBase, LinearAlgebra, DataFrames, CSV
df = CSV.read("../data/3featureData.csv",DataFrame ,header=false)
n,p = size(df)
println("Number of features: ", p)
println("Number of observations: ", n)
X = Matrix{Float64}(df)
println("Dimensions of data matrix: ", size(X))
xbarA = (1/n) * X' * ones(n)
xbarB = [mean(X[:,i]) for i in 1:p]
xbarC = sum(X, dims=1) ./ n
println("\nAlternative calculations of (sample) mean vector: ")
@show(xbarA), @show(xbarB), @show(xbarC)
Y = (I - ones(n,n)/n) * X
println("Y is the de-meaned data: ", mean(Y, dims=1))
covA = (X .- xbarA')'*(X .- xbarA')/(n-1)
covB = Y'*Y/(n-1)
covC = [cov(X[:,i],X[:, j]) for i in 1:p, j in 1:p]
covD = [cor(X[:,i],X[:, j]) * std(X[:,i]) * std(X[:,j]) for i in 1:p, j in 1:p]
covE = cov(X)
println("\nAlternative calculations of (sample) covariance matrix: ")
@show(covA), @show(covB), @show(covC), @show(covD), @show(covE)
ZmatA = [(X[i,j] - xbarA[j])/sqrt(covA[j,j]) for i in 1:n, j in 1:p]
ZmatB = (X .- xbarA') * sqrt.(Diagonal(covA))^(-1)
ZmatC = hcat([zscore(X[:,j]) for j in 1:p]...)
println("\nAlternative computations of z-score yield same matrix: ",
(maximum(norm(ZmatA - ZmatB)), maximum(norm(ZmatC - ZmatB)), maximum(norm(ZmatA - ZmatC))))
Z = ZmatA
corA = covA ./ [std(X[:,i])*std(X[:,j]) for i in 1:p, j in 1:p]
corB = covA ./ (std(X, dims=1)' * std(X, dims=1))
corC = [cor(X[:,i], X[:,j]) for i in 1:p, j in 1:p]
corD = Z' * Z ./ (n-1)
corE = cov(Z)
corF = cor(X)
println("\nAlternative calculations of (sample) correlation matrix: ")
@show(corA), @show(corB), @show(corC), @show(corD), @show(corE), @show(corF)