Plots for Multivariate and High-Dimensional Data

Consider the vectors of observations, $x_1, \ldots, x_n$, where $n$ is the number of observations and each $x_i$ has $p$ components, with $p$ the number of variables, or features. In cases where $p$ is large, the data is called high dimensional.

Scatter Plot Matrix

  • The basic scatter plot is very common in data visualization. In its simplest form, when considering pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$, it is a plot of these coordinates on the Cartesian plane.
  • When moving from pairs to higher dimensions, each observation is represented as the vector or tuple $(x_{i1}, \ldots, x_{ip})$. If $p = 3$, one may still try to illustrate a point cloud, but for higher dimensions this isn't possible. In this case, one of the most popular plots for visualizing relationships is the scatter plot matrix. It consists of taking each possible pair of variables and plotting a scatter plot for that pair. With $p$ variables, there are $p^2$ total plots, where $p$ of the plots are redundant because they plot a variable against itself, and the other $p^2 - p$ plots come in pairs that show the same two variables with the axes swapped, leaving $(p^2 - p)/2$ distinct plots. Hence, for example, if $p = 4$, there are $6$ important plots in the scatter plot matrix even though the matrix has $16$ plots in total.
  • Plot in Julia
using RDatasets, Plots, Measures; pyplot()

data = dataset("datasets", "iris")
println("Number of rows: ", nrow(data))

insertSpace(name) = begin
    i = findlast(isuppercase, name)
    name[1:i-1] * " " * name[i:end]
end

featureNames = insertSpace.(string.(names(data)))[1:4]
println("Names of features:\n\t", featureNames)

speciesNames = unique(data.Species)
speciesFreqs = [sn => sum(data.Species .== sn) for sn in speciesNames]
println("Frequency per species:\n\t",speciesFreqs)

default(msw=0, ms=3)

scatters = [
    scatter(data[:,i], data[:,j], c=[:blue :red :green], group=data.Species,
        xlabel=featureNames[i], ylabel=featureNames[j], legend = i==1 && j==1)
    for i in 1:4, j in 1:4]

plot(scatters..., size=(1200,800), margin=4mm)
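
The panel count described above can be checked with a few lines of plain Julia. This is only a sketch of the arithmetic: it assumes $p = 4$ (the four iris features used in the listing), and the variable names are illustrative.

```julia
# Count the panels of a p-by-p scatter plot matrix (assumed p = 4, as for iris).
p = 4
total = p^2          # every ordered pair (i, j), including i == j
diagonal = p         # panels that plot a variable against itself

# Unordered pairs (i, j) with i < j: the panels that carry distinct information.
distinct = [(i, j) for i in 1:p for j in i+1:p]

println("total = ", total, ", diagonal = ", diagonal,
        ", distinct = ", length(distinct))
# prints: total = 16, diagonal = 4, distinct = 6
```

The list `distinct` could also be used to loop over only the upper triangle of the matrix if the duplicated panels are not wanted.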

Heat Map with Marginals

  • The heat map consists of a grid of shaded cells. Another name for it is a matrix plot. The colors of the cells indicate the magnitude, where typically, the "warmer" the color, the higher the value.
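  • In cases of pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$, the bivariate data can be constructed into a bivariate histogram in a manner similar to the univariate histogram. In the bivariate case, we partition the Cartesian plane, $\mathbb{R}^2$, into a grid of bins $B_{i,j}$ for $i = 1, \ldots, \ell$ and $j = 1, \ldots, m$. Then we count the frequency of observations per bin via $f_{i,j} = \sum_{k=1}^{n} \mathbf{1}\{(x_k, y_k) \in B_{i,j}\}$.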
  • Plot in Julia
    • marginalhist() from StatsPlots
using StatsPlots, Distributions, CSV, DataFrames, Measures; gr()

realData = CSV.read("../data/temperatures.csv", DataFrame)

N = 10^5
include("../data/mvParams.jl")
biNormal = MvNormal(meanVect, covMat)
syntheticData = DataFrame(Matrix{Float64}(rand(biNormal, N)'), :auto)
rename!(syntheticData, [:x1 => :Brisbane, :x2 => :GoldCoast])

default(c = cgrad([:blue,:red]),
    xlabel="Brisbane Temperature",
    ylabel="Gold Coast Temperature")

p1 = marginalhist(realData.Brisbane, realData.GoldCoast, bins=10:45)
p2 = marginalhist(syntheticData.Brisbane, syntheticData.GoldCoast, bins=10:0.5:45)

plot(p1, p2, size=(1000,500), margin=10mm)
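
The per-bin counting behind a bivariate histogram can be sketched without any plotting package. The function name `binCounts`, the toy data, and the half-open bin convention below are assumptions made for illustration; `marginalhist` may bin differently.

```julia
# Count observations per rectangular bin.
# Assumed convention: bin i along an axis covers [edges[i], edges[i+1]).
function binCounts(xs, ys, xEdges, yEdges)
    counts = zeros(Int, length(xEdges) - 1, length(yEdges) - 1)
    # Index of the bin containing v, or 0 if v lies outside the grid.
    binIndex(edges, v) =
        first(edges) <= v < last(edges) ? searchsortedlast(edges, v) : 0
    for (x, y) in zip(xs, ys)
        i, j = binIndex(xEdges, x), binIndex(yEdges, y)
        if i > 0 && j > 0
            counts[i, j] += 1
        end
    end
    return counts
end

# Hypothetical toy data: five (x, y) pairs on a 2-by-2 grid of unit bins.
xs = [0.2, 0.7, 1.4, 1.6, 0.1]
ys = [0.3, 1.2, 0.4, 1.9, 0.9]
counts = binCounts(xs, ys, 0.0:1.0:2.0, 0.0:1.0:2.0)
println(counts)
```

Each entry `counts[i, j]` is the frequency for one cell of the grid, which is the quantity a heat map then shades.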

Andrews Plot

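  • The idea is to represent a data vector via a real-valued function. For any individual vector, such a transformation is not generally useful, but when comparing groups of vectors, it may yield a way to visualize structural differences in the data.
  • The specific transformation rule that we present here creates a plot known as the Andrews plot. Here, for the $i$'th data vector $x_i = (x_{i1}, \ldots, x_{ip})$, we create the function $f_i(\cdot)$ defined on $[-\pi, \pi]$ via, $f_i(t) = \frac{x_{i1}}{\sqrt{2}} + x_{i2}\sin(t) + x_{i3}\cos(t) + x_{i4}\sin(2t) + x_{i5}\cos(2t) + \cdots$
    • with the last term involving a $\sin(\cdot)$ if $p$ is even and a $\cos(\cdot)$ if $p$ is odd. Then, for $i = 1, \ldots, n$, the functions $f_1(\cdot), \ldots, f_n(\cdot)$ are plotted.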
  • Plot in Julia
using RDatasets, StatsPlots; pyplot()

iris = dataset("datasets", "iris")
@df iris andrewsplot(:Species, cols(1:4),
    line=(fill=[:blue :red :green]), legend=:topleft)
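
To make the transformation concrete, the following is a minimal sketch that evaluates the Andrews function for a single data vector in plain Julia. The helper name `andrews` is hypothetical, and this is not how `andrewsplot` is implemented internally.

```julia
# Andrews function of a data vector x = (x_1, ..., x_p), evaluated at t.
# Term k >= 2 uses sin for even k and cos for odd k, with frequency div(k, 2).
function andrews(x, t)
    value = x[1] / sqrt(2)
    for k in 2:length(x)
        freq = div(k, 2)                 # 1, 1, 2, 2, 3, 3, ...
        value += x[k] * (iseven(k) ? sin(freq * t) : cos(freq * t))
    end
    return value
end

# First observation of the iris data set (the four numeric features).
x = [5.1, 3.5, 1.4, 0.2]
ts = range(-pi, pi, length=5)
curve = [andrews(x, t) for t in ts]      # one curve, sampled at 5 points
println(curve)
```

Plotting `andrews(x, t)` over a fine grid of `t` values for every observation, colored by group, gives the kind of picture the listing above produces.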
