Working with Data Frames

Introducing the Data Frame

  • The dataset has four fields: Name, Date, Grade and Price.
  • Any variable in a dataset can be classified as either a numerical variable or categorical vairable.
    • Categorical vairbles can be further split into nominal variables, ordinal variables.
  • In Julia
    • size(df::DataFrame): return the size of DataFrame
    • names(df::DataFrame): return the col names of DataFrame
    • first(df::DataFrame, n::Int): return the first n rows of DataFrame
    • describe(df::DataFrame): describe the DataFrame
using DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", DataFrame, copycols = true)

println(size(data), "\n")
println(names(data), "\n")
println(first(data,6), "\n")
println(describe(data), "\n")

Referencing Data

using DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", DataFrame, copycols = true)

println("Grade of person 1: ", data[1,3],
    ", ", data[1,:Grade],
    ", ", data.Grade[1], "\n")
println(data[[1,2,4],:], "\n")
println(data[13:15, :Name], "\n")
println(data.Name[13:15], "\n")
println(data[13:15, [:Name]])

Modifying Data

  • Functions performed on a data frame will return a copy of that data frame. In other words, no underlying change will be made to the data frame object, but rather a shallow copy will be made and returned as output.
  • By default, the columns of a data frame are immutable, which means that the values within them cannot be changed. In order to make changes to a column, the column must first be mutable. One way to do this is by including copycols=true when a data frame is created from a csv file. This argument has the effect of making all columns mutable. Another way is by using ! when referencing the rows of a data frame. For example, df[!, :X] references the underlying data in column :X, while df[:,:X] simply references a shallow copy.
using DataFrames, CSV
data1 = CSV.read("../data/purchaseData.csv", DataFrame)
data2 = CSV.read("../data/purchaseData.csv", DataFrame, copycols=true)

try data1[1,:name] = "YARDEN" catch; @warn "Cannot: data1 is immutable" end;

data2[1,:Name] = "YARDEN"
println("\n",first(data2,3),"\n")

data1[!, :Price] ./= 1000
rename!(data1, :Price=>Symbol("Price(000's)"))
println(first(data1,3), "\n")

replace!(data2[!,:Grade], ["E"=>"F", "D"=>"E"]...)
println(first(data2,3),"\n")

Copying a Data Frame

using DataFrames, CSV
data1 = CSV.read("../data/purchaseData.csv", DataFrame, copycols=true)
println("Original value: ", data1.Name[1], "\n")

data2 = data1
data2.Name[1] = "EMILY"
@show data1.Name[1]

data1 = CSV.read("../data/purchaseData.csv", DataFrame, copycols=true)
data2 = copy(data1)
data2.Name[1] = "EMILY"
@show data1.Name[1]
println()

data1 = DataFrame()
data1.X = [[0,1], [100, 101]]
data2 = copy(data1)
data2.X[1][1] = -1
@show data1.X[1][1]

data1.X = [[0,1], [100, 101]]
data2 = deepcopy(data1)
data2.X[1][1] = -1
@show data1.X[1][1];

Handling Missing Entries

  • skipmissing(): skip all missing values
  • coalesce.(dfc::DataFrameColumns, string::String): replace all missing with string
  • dropmissing(df::DataFrame, colName::Symbol): drop all cases which colName is missing value
  • completecases(df::DataFrame): judge whether a case contains missing or not
using Statistics, DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", DataFrame, copycols=true)

println(mean(data.Price),"\n")
println(mean(skipmissing(data.Price)), "\n")
println(coalesce.(data.Grade, "QQ")[1:4], "\n")
println(first(dropmissing(data, :Price), 4), "\n")
println(sum(ismissing.(data.Name)), "\n")
println(findall(completecases(data))[1:4])

Reshaping, Joining, and Manipulating Data Frames

  • Joins' style like SQL
    • innerjoin()
    • outerjoin()
    • rightjoin()
    • leftjoin() SQL Joins
using Statistics, DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", DataFrame, copycols=true)

newCol = DataFrame(Validated=ones(Int, size(data,1)))
newRow = DataFrame([["JOHN", "JACK"] [123456, 909595]],[:Name, :PhoneNo])
newData = DataFrame(Name=["JOHN", "ASHELY", "MARYANNA"],
    Job=["Lawyer", "Doctor", "Lawyer"])

data = hcat(data, newCol)
println(first(data,3), "\n")

data = vcat(data, newRow, cols=:union)
println(last(data,3), "\n")

data = innerjoin(data, newData, on = :Name, matchmissing=:equal)
println(data,"\n")

select!(data, [:Name, :Job])
println(data, "\n")

unique!(data, :Job)
println(data, "\n")

Useful Operations for Data Frames

using DataFrames, CSV, Dates, Statistics, CategoricalArrays
data = dropmissing(CSV.read("../data/purchaseData.csv", DataFrame, copycols=true))

data[!, :Date] = Date.(data[!,:Date], "d/m/y")
println(first(sort(data, :Date), 3), "\n")

println(first(filter(row -> row[:Price] > 5000, data), 3), "\n")

data[!,:Grade] = categorical(data[!,:Grade])
println(first(data, 3), "\n")

println(
    combine(groupby(data, :Grade), 
    :Price => (x->length(x)) => :NumSold, 
    :Price=> (x->mean(x)) => :AvgPrice)
)

A Cleaning and Imputation Example

  • Instead of simply deleting rows, one way of dealing with missing values is to use imputation. This involves substituting missing entries with velues based on either the data observed or according to some other logic.
  • Example
using DataFrames, CSV, Statistics
data = CSV.read("../data/purchaseData.csv", DataFrame)

rowsKeep = .!(ismissing.(data.Grade) .& ismissing.(data.Price))
data = data[rowsKeep, :]

replace!(x -> ismissing(x) ? "QQ" : x, data.Name)
replace!(x -> ismissing(x) ? "31/06/2008" : x, data.Date)

grPr = combine(groupby(dropmissing(data) ,:Grade),
    :Price => (x -> round(mean(x), digits=-3)) => :AvgPrice)

d = Dict(grPr[:,1] .=> grPr[:,2])
nearIndx(v,x) = findmin(abs.(v.-x))[2]
for i in 1:nrow(data)
    if ismissing(data[i, :Price])
        data[i, :Price] = d[data[i, :Grade]]
    end
    if ismissing(data[i, :Grade])
        data[i, :Grade] = grPr[nearIndx(grPr[:,2], data[i, :Price]), :Grade]
    end
end

println(first(data,5),"\n")
println(describe(data))

results matching ""

    No results matching ""