4 DataFrames.jl

    Here, the dots mean that this could be a very long table and we only show a few rows. While analyzing data, we often come up with interesting questions about the data, also known as data queries. For large tables, a computer can answer these kinds of questions much more quickly than you could by hand. Some examples of these so-called queries for this data could be:

    • Which TV show has the highest rating?
    • Which TV shows were produced in the United States?
    • Which TV shows were produced in the same country?

    But, as a researcher, you will find that real science often starts with multiple tables or data sources. For example, suppose we also have someone else’s ratings for the TV shows (Table 2):

    Table 2: Ratings.
    name             rating
    Game of Thrones  7
    Friends          6.4

    Now, questions that we could ask ourselves include:

    • What is Game of Thrones’ average rating?
    • Who gave the highest rating for Friends?
    • What TV shows were rated by you but not by the other person?

    In the rest of this chapter, we will show you how you can easily answer these questions in Julia. To do so, we first show why we need a Julia package called DataFrames.jl. In the next sections, we show how you can use this package and, finally, we show how to write fast data transformations (Section ).

    Let’s look at a table of grades like the one in Table 3:

    Table 3: Grades for 2020.
    name   age  grade_2020
    Bob    17   5.0
    Sally  18   1.0
    Alice  20   8.5
    Hank   19   4.0

    So far, this book has only handled Julia’s basics. These basics are great for many things, but not for tables. To show that we need more, let’s try to store the tabular data in arrays:
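A minimal sketch of such a function, assuming it returns a named tuple with one vector per column of Table 3. That return shape is an assumption, chosen because the later listings destructure the result of `grades_array()` and access `grades_array().name`:

```julia
# Sketch: keep each column of Table 3 in its own vector and
# return the columns together as a named tuple (assumed shape).
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)  # shorthand for (name = name, age = age, grade_2020 = grade_2020)
end

columns = grades_array()
columns.name  # the vector holding all names
```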

    Now, the data is stored in so-called column-major form, which is cumbersome when we want to get data from a row:

    function second_row()
        # Each column is a separate vector, so we index every one of them.
        name, age, grade_2020 = grades_array()
        i = 2
        row = (name[i], age[i], grade_2020[i])
    end
    second_row()

    ("Sally", 18, 1.0)

    Or, if you want the grade for Alice, you first need to figure out which row Alice is in:

    function row_alice()
        names = grades_array().name
        # Index of the first element equal to "Alice".
        i = findfirst(names .== "Alice")
    end
    row_alice()

    and then we can get the value:
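A minimal sketch of that second step. The helper name `value_alice` is hypothetical, and `grades_array` is assumed to return a named tuple of column vectors holding the data from Table 3:

```julia
# Assumed definition: one vector per column of Table 3.
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

# Step 1: find the row in which Alice is stored.
function row_alice()
    names = grades_array().name
    findfirst(names .== "Alice")
end

# Step 2 (hypothetical helper): use that row index on the grade column.
function value_alice()
    grades = grades_array().grade_2020
    i = row_alice()
    grades[i]
end

value_alice()
```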

    8.5

    DataFrames.jl can easily solve these kinds of issues. You can start by loading DataFrames.jl with using:

    using DataFrames

    With DataFrames.jl, we can define a DataFrame to hold our tabular data:

    names = ["Sally", "Bob", "Alice", "Hank"]
    grades = [1, 5, 8.5, 4]
    # Each keyword argument becomes a column with that name.
    df = DataFrame(; name=names, grade_2020=grades)

    which gives us a variable df containing our data in table format.
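To see how this addresses the row and lookup problems from the array version, here is a small sketch. It assumes the column names `name` and `grade_2020`; the variables `second` and `alice_grade` are introduced only for illustration:

```julia
using DataFrames

name = ["Sally", "Bob", "Alice", "Hank"]
grade_2020 = [1, 5, 8.5, 4]
df = DataFrame(; name, grade_2020)

# A whole row by position: no need to index each column separately.
second = df[2, :]

# A value by condition: select the rows where name is "Alice",
# take the grade_2020 column, and unwrap the single match.
alice_grade = only(df[df.name .== "Alice", :grade_2020])
```

Here `df[2, :]` yields a row object whose fields can be read as `second.name` and `second.grade_2020`, and `only` throws an error if the condition does not match exactly one row.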

    Let’s do the same thing as before but now in a function:

    function grades_2020()
        name = ["Sally", "Bob", "Alice", "Hank"]
        grade_2020 = [1, 5, 8.5, 4]
        # The column names are taken from the variable names.
        DataFrame(; name, grade_2020)
    end
    grades_2020()

    Table 4: Grades 2020.
    name   grade_2020
    Sally  1.0
    Bob    5.0
    Alice  8.5
    Hank   4.0

    Note that name and grade_2020 are destroyed after the function returns; that is, they are only available inside the function. There are two other benefits of doing this. First, it is now clear to the reader where name and grade_2020 belong: they belong to the grades of 2020. Second, it is easy to determine what the output of grades_2020() would be at any point in the book. For example, we can now assign the data to a variable df:

    df = grades_2020()

    name   grade_2020
    Sally  1.0
    Bob    5.0
    Alice  8.5
    Hank   4.0

    Change the contents of df:

    df = DataFrame(name = ["Malice"], grade_2020 = ["10"])

    And still recover the original data without any problem:

    df = grades_2020()

    name   grade_2020
    Sally  1.0
    Bob    5.0
    Alice  8.5
    Hank   4.0
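The same holds when a DataFrame is modified in place rather than replaced. The sketch below assumes `grades_2020` builds its vectors inside the function body, using the data from Table 4, so every call returns fresh data:

```julia
using DataFrames

# Assumed definition, using the data from Table 4.
function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end

df = grades_2020()
df.grade_2020[1] = 10.0   # change Sally's grade in our copy

fresh = grades_2020()     # a new call builds new vectors
fresh.grade_2020[1]       # unaffected by the change to df
```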

    Of course, this assumes that the function is not re-defined. We promise to not do that in this book, because it is a bad idea exactly for this reason. Instead of “changing” a function, we will make a new one and give it a clear name.

    So, back to the DataFrame constructor. As you might have seen, the way to create a DataFrame is simply to pass vectors as arguments into the constructor. You can come up with any valid Julia vector and it will work as long as the vectors have the same length. Duplicates, Unicode symbols and any sort of numbers are fine:

    DataFrame(σ = ["a", "a", "a"], δ = [π, π/2, π/3])

    σ  δ
    a  3.141592653589793
    a  1.5707963267948966
    a  1.0471975511965976
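The same-length requirement can be checked with a small sketch. The column names `label` and `value` and the variables `ok` and `failed` are made up for this illustration, and the error is caught generically instead of relying on a specific exception type:

```julia
using DataFrames

# Vectors of equal length are accepted.
ok = DataFrame(label = ["a", "a", "a"], value = [1, 2.5, 3])

# Vectors of different lengths are rejected by the constructor.
failed = try
    DataFrame(label = ["a", "a"], value = [1, 2, 3])
    false
catch err
    println("constructor rejected the input: ", typeof(err))
    true
end
```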

    Typically, in your code, you would create a function which wraps around one or more DataFrames.jl functions. For example, we can make a function to get the grades for one or more names, selected here by their row indices:

    function grades_2020(names::Vector{Int})
        df = grades_2020()
        # names holds row indices; `:` keeps all columns.
        df[names, :]
    end
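As a usage sketch, the method can be called with row indices. The second function, `grades_2020_by_name`, is a hypothetical variant added only to illustrate selecting by the actual names; it is not part of DataFrames.jl:

```julia
using DataFrames

# Assumed definition, using the data from Table 4.
function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end

# Select rows by index, keeping all columns.
function grades_2020(names::Vector{Int})
    df = grades_2020()
    df[names, :]
end

subset = grades_2020([3, 4])  # rows 3 and 4: Alice and Hank

# Hypothetical variant: select rows whose name appears in `wanted`.
function grades_2020_by_name(wanted::Vector{String})
    df = grades_2020()
    # Ref(wanted) treats the vector as a single value while broadcasting over df.name.
    df[in.(df.name, Ref(wanted)), :]
end

by_name = grades_2020_by_name(["Alice", "Hank"])
```

Giving the variant its own name, instead of adding yet another method to `grades_2020`, keeps it obvious at the call site which kind of selection is meant.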

    This way of using functions to wrap around basic functionality in programming languages and packages is quite common. Basically, you can think of Julia and DataFrames.jl as providers of building blocks. They provide very generic building blocks which allow you to build things for your specific use case, like this grades example. By using the blocks, you can make a data analysis script, control a robot or whatever you like to build.