Summarizing Data

Single Sample

Given a set of observations, $x_1, x_2, \ldots, x_n$:

  • The sample mean is the most commonly used measure of centrality. It is denoted by $\bar{x}$ and given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
  • Geometric mean and harmonic mean
    • The geometric mean is useful for averaging growth factors. For example, if $x_1, \ldots, x_n$ are growth factors, the geometric mean, $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$, is a good summary statistic of the "average growth factor". This is because the growth factor obtained by applying $\bar{x}_g$ for $n$ consecutive periods equals the growth factor obtained by applying $x_1, \ldots, x_n$ in sequence: $\bar{x}_g^{\,n} = \prod_{i=1}^{n} x_i$.
    • The harmonic mean is useful for averaging rates or speeds. For example, assume that you are on a brisk hike, walking 5 km up a mountain and then 5 km back down. Say your speed going up is $v_u$ and your speed going down is $v_d$ (in km/h). What is your "average speed" for the whole journey? You travel up for $5/v_u$ hours and down for $5/v_d$ hours, and hence your total travel time is $5/v_u + 5/v_d$. Hence the average speed is $\frac{10}{5/v_u + 5/v_d} = \frac{2}{1/v_u + 1/v_d}$, which is exactly the harmonic mean of $v_u$ and $v_d$.
  • Note that for any dataset, $\bar{x}_h \le \bar{x}_g \le \bar{x}$, where $\bar{x}_h$ denotes the harmonic mean. Here, the inequalities become equalities only if all observations are equal.
  • A different breed of descriptive statistics is based on order statistics. This term is used to describe the sorted sample, sometimes denoted by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
    • Based on the order statistics, we can define a variety of statistics such as the minimum, $x_{(1)}$, the maximum, $x_{(n)}$, and the median, which in the case of $n$ being odd is $x_{((n+1)/2)}$, and in the case of $n$ being even is the arithmetic mean of $x_{(n/2)}$ and $x_{(n/2+1)}$.
    • Related statistics are the $\alpha$-quantiles for $\alpha \in [0, 1]$, where the $\alpha$-quantile is effectively $x_{([n\alpha])}$, and $[n\alpha]$ denotes a rounding of $n\alpha$ to the nearest element of $\{1, \ldots, n\}$. For $\alpha = 0.25$ and $\alpha = 0.75$, these values are known as the first quartile and third quartile, respectively.
  • The interquartile range (IQR) is the difference between these two quartiles, and the range is $x_{(n)} - x_{(1)}$; both are measures of dispersion. When dealing with measures of dispersion, the most popular and useful measure is the sample variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (the analogous quantity with $n$ in the denominator instead of $n-1$ is called the population variance).
    • If all observations are constant then $s^2 = 0$; otherwise, $s^2 > 0$, and the bigger it is, the more dispersion we have in the data.
    • A related quantity is the sample standard deviation, $s = \sqrt{s^2}$.
    • The standard error of the sample mean is $s/\sqrt{n}$.
  • Compute in Julia
    • var(a::Array): Sample variance
    • std(a::Array): Sample standard deviation
    • iqr(a::Array): Interquartile range
    • percentile(a::Array, n::Int): Get the nth percentile of data
    • quantile(a::Array, q::Float64): Get the q-th quantile of the data
    • summarystats(a::Array): Get the basic descriptive statistics
using CSV, Statistics, StatsBase, DataFrames
data = CSV.read("../data/temperatures.csv", DataFrame)[:,4]

println("Sample Mean: ", mean(data))
println("Harmonic <= Geometric <= Arithemetic ",
    (harmmean(data), geomean(data), mean(data)))
println("Sample Variance: ", var(data))
println("Sample Standard Deviation: ", std(data))
println("Minimum: ", minimum(data))
println("Maximum: ", maximum(data))
println("Median: ", median(data))
println("95th percentile: ", percentile(data, 95))
println("0.95 quantile: ", quantile(data, 0.95))
println("Interquartile range: ", iqr(data), "\n")

summarystats(data)
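
To make the growth-factor and average-speed interpretations above concrete, here is a minimal numerical check; the growth factors and hiking speeds below are made-up values for illustration:

using StatsBase

growthFactors = [1.10, 0.95, 1.20]     # hypothetical per-period growth factors
gm = geomean(growthFactors)
# Compounding the geometric mean n times reproduces the overall growth factor
println(gm^length(growthFactors) ≈ prod(growthFactors))

vUp, vDown = 4.0, 6.0                  # hypothetical speeds (km/h) up and down
avgSpeed = 10 / (5/vUp + 5/vDown)      # total distance divided by total time
println(avgSpeed ≈ harmmean([vUp, vDown]))   # equals the harmonic mean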

Observations in Pairs

Data is configured in the form of pairs, $(x_1, y_1), \ldots, (x_n, y_n)$.

  • We often consider the sample covariance, which is given by $\hat{\sigma}_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.
    • A positive covariance indicates a positive linear relationship, meaning that when $x_i$ is larger than its mean, we expect $y_i$ to be larger than its mean, and similarly, when $x_i$ is small then $y_i$ tends to be small.
    • A negative covariance indicates a negative linear relationship, meaning that when $x_i$ is large then $y_i$ tends to be small.
    • If the covariance is $0$ or near $0$, it is an indication that no such relationship holds.
  • However, like the variance, the covariance is not a normalized quantity. For this reason, we define another useful statistic, the sample correlation coefficient, $\hat{\rho}_{xy} = \frac{\hat{\sigma}_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of the $x$ and $y$ observations.
    • $-1 \le \hat{\rho}_{xy} \le 1$. The sign of $\hat{\rho}_{xy}$ agrees with the sign of $\hat{\sigma}_{xy}$; however, importantly, its magnitude is also meaningful. Having $\hat{\rho}_{xy}$ near $0$ implies little or no linear relationship, while $|\hat{\rho}_{xy}|$ closer to $1$ implies a stronger linear relationship, which is either positive or negative depending on the sign of $\hat{\rho}_{xy}$.
  • It is often useful to represent the variances and covariances in the sample covariance matrix: $\hat{\Sigma} = \begin{pmatrix} s_x^2 & \hat{\sigma}_{xy} \\ \hat{\sigma}_{xy} & s_y^2 \end{pmatrix}$.
  • Compute in Julia
using DataFrames, CSV, Statistics

data = CSV.read("../data/temperatures.csv", DataFrame)
brisT = data.Brisbane
gcT = data.GoldCoast

sigB = std(brisT)
sigG = std(gcT)
covBG = cov(brisT, gcT)

meanVect = [mean(brisT), mean(gcT)]
covMat = [sigB^2 covBG;
    covBG sigG^2]

outfile = open("../data/mvParams.jl", "w")
write(outfile, "meanVect = $meanVect \ncovMat = $covMat")
close(outfile)
println(read("../data/mvParams.jl", String))
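
The block above saves the mean vector and covariance matrix; a natural companion statistic is the sample correlation coefficient. This minimal sketch reuses brisT, gcT, covBG, sigB, and sigG from the code above and checks the result against the built-in cor():

rhoBG = covBG / (sigB * sigG)    # sample correlation from covariance and standard deviations
println("Sample correlation: ", rhoBG)
println("Agrees with cor(): ", rhoBG ≈ cor(brisT, gcT))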

Observations in Vectors

Consider data that consists of vectors. The $i$'th data vector represents a tuple of $p$ values, $(x_{i1}, \ldots, x_{ip})$. In this case, the data can be represented by an $n \times p$ data matrix, $X$, where the rows are observations (data vectors) and each column represents a different variable, feature, or attribute.

  • In summarizing the data $X$, a few basic objects arise. These include the sample mean vector, sample standard deviation vector, sample covariance matrix, and the sample correlation matrix.
    • The sample mean vector, $\bar{\mathbf{x}}$, is simply a vector of length $p$ where the $j$'th entry, $\bar{x}_j$, is the sample mean of the $j$'th column of $X$.
    • The sample standard deviation vector has $j$'th entry, $s_j$, which is the sample standard deviation of the $j$'th column of $X$.
  • With these, we often standardize the data by creating a new matrix $Z$ with entries $z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$.
    • It can be created via $Z = (X - \mathbf{1}\,\bar{\mathbf{x}}^{\top})\,D^{-1}$,
      • where $\mathbf{1}$ is a column vector of 1's ($n$ rows),
      • $\bar{\mathbf{x}}$ is the sample mean vector,
      • and $D$ is a diagonal matrix created from the standard deviation vector, i.e., $D_{jj} = s_j$.
    • The standardized data $Z$ has the attribute that each column, $j$, has a sample mean of $0$ and a unit standard deviation. Hence, the first- and second-order information of the $j$'th feature is lost when moving from the data matrix $X$ to the standardized matrix $Z$. Nevertheless, relationships between features are still captured in $Z$ and can be easily calculated. Most notably, the sample correlation between feature $i$ and feature $j$, denoted by $\hat{\rho}_{ij}$, is simply calculated via $\hat{\rho}_{ij} = \frac{1}{n-1}\sum_{k=1}^{n} z_{ki} z_{kj}$, i.e., the sample correlation matrix is $R = \frac{1}{n-1} Z^{\top} Z$.
    • Without resorting to standardization, it is often of interest to calculate the sample covariance matrix, $\hat{\Sigma} = \frac{1}{n-1}\,(X - \mathbf{1}\,\bar{\mathbf{x}}^{\top})^{\top}(X - \mathbf{1}\,\bar{\mathbf{x}}^{\top})$, whose $(i,j)$'th entry is the sample covariance between features $i$ and $j$ (a short derivation is given after this list).
  • Compute in Julia
    • zscore(a::Array): convert array to z-score
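
Regarding the derivation referenced above: writing $Y = X - \mathbf{1}\,\bar{\mathbf{x}}^{\top}$ for the de-meaned data (so $y_{ki} = x_{ki} - \bar{x}_i$), a sketch of the proof is:

\[
\Big[\frac{1}{n-1}\,Y^{\top}Y\Big]_{ij}
  = \frac{1}{n-1}\sum_{k=1}^{n} y_{ki}\,y_{kj}
  = \frac{1}{n-1}\sum_{k=1}^{n} (x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j),
\]

which is exactly the sample covariance between features $i$ and $j$. This is also why covB = Y'*Y/(n-1) in the code below agrees with the other computations.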
using Statistics, StatsBase, LinearAlgebra, DataFrames, CSV
df = CSV.read("../data/3featureData.csv",DataFrame ,header=false)
n,p = size(df)
println("Number of features: ", p)
println("Number of observations: ", n)
X = Matrix{Float64}(df)
println("Dimensions of data matrix: ", size(X))

xbarA = (1/n) * X' * ones(n)
xbarB = [mean(X[:,i]) for i in 1:p]
xbarC = sum(X, dims=1) ./ n
println("\nAlernative calculations of (sample) mean vector: ")
@show(xbarA), @show(xbarB), @show(xbarC)

Y = (I - ones(n,n)/n) * X
println("Y is the de-meaned data: ", mean(Y, dims=1))

covA = (X .- xbarA')'*(X .- xbarA')/(n-1)
covB = Y'*Y/(n-1)
covC = [cov(X[:,i],X[:, j]) for i in 1:p, j in 1:p]
covD = [cor(X[:,i],X[:, j]) * std(X[:,i]) * std(X[:,j]) for i in 1:p, j in 1:p]
covE = cov(X)
println("\nAlernative calculations of (sample) covariance matrix: ")
@show(covA), @show(covB), @show(covC), @show(covD), @show(covE)

ZmatA = [(X[i,j] - xbarA[j])/sqrt(covA[j,j]) for i in 1:n, j in 1:p]
ZmatB = (X .- xbarA') * sqrt.(Diagonal(covA))^(-1)
ZmatC = hcat([zscore(X[:,j]) for j in 1:p]...)
println("\nAlernative computation of z-score yieds same matrix: ",
    (maximum(norm(ZmatA - ZmatB)), maximum(norm(ZmatC - ZmatB)), maximum(norm(ZmatA - ZmatC))))
Z = ZmatA

corA = covA ./ [std(X[:,i])*std(X[:,j]) for i in 1:p, j in 1:p]
corB = covA ./ (std(X, dims=1)' * std(X, dims=1))
corC = [cor(X[:,i], X[:,j]) for i in 1:p, j in 1:p]
corD = Z' * Z ./ (n-1)
corE = cov(Z)
corF = cor(X)
println("\nAlernative calculations of (sample) correlation matrix: ")
@show(corA), @show(corB), @show(corC), @show(corD), @show(corE), @show(corF)
