Basics of working with data tables

Author

Jeet Sukumaran

Proficiencies
  • Background concepts:
    • What tabular data is and why it is the dominant format for scientific datasets
    • Common file formats for tabular data: CSV, TSV, and Excel (.xlsx)
    • How to identify a file’s format from its extension and from inspecting its raw content
  • Core skills:
    • Loading the DataFrames, CSV, and XLSX packages
    • Creating a DataFrame manually from column vectors
    • Reading a CSV file into a DataFrame using CSV.read
    • Reading an Excel worksheet into a DataFrame using XLSX.readtable
    • Inspecting a DataFrame: size, nrow, ncol, names, describe, first, last
    • Selecting one or more columns by name and by position
    • Filtering rows using logical conditions with boolean indexing
    • Adding a new column to a DataFrame
    • Selecting subsets of columns and rows simultaneously

1 What is tabular data?

Datasets are often organized as tables: rectangular grids of values where each row represents one observation (a specimen, a time-point, a site, a sequencing read) and each column represents one variable or attribute measured for every observation (body mass, collection date, latitude, base-call quality score).

Tip: Rows are observations; columns are variables

This convention — one row per observation, one column per variable — is called tidy data. It is the standard expected by virtually every data-analysis and plotting library you will encounter. Keeping your own data in this form from the start will save a great deal of reformatting later.

A small example of a tabular dataset:

specimen_id  species                 body_mass_g  collection_year
BM_001       Peromyscus maniculatus  21.4         2019
BM_002       Peromyscus leucopus     23.1         2019
BM_003       Mus musculus            18.7         2020

Three rows, four columns. Each row is one specimen; each column is one thing we recorded about every specimen.

2 File formats for tabular data

Note: Sample files in different formats
  • Comma-separated value (“CSV”) files, with extension .csv: specimens_10.csv
  • Tab-separated values (“TSV”) files, with extension .tsv: specimens_10.tsv
  • Excel files: specimens_10.xlsx

Tabular datasets are stored in a variety of file formats. The format you receive will determine which Julia function you use to load it, so it is worth being able to identify them.

2.1 CSV and its variants

The most common plain-text format for tabular data is CSV (comma-separated values). A CSV file stores the table as lines of text. The first line is typically a header listing the column names; every subsequent line is one row of data. Values within each line are separated by a delimiter — most often a comma (,), giving the format its name, though tabs (\t) and semicolons (;) are also common.

Note: Inspecting a CSV file directly

Because CSV files are plain text, you can read them in any text editor or print their first few lines in the shell:

head -5 specimens.csv

or in Julia:

open("specimens.csv") do f
    for _ in 1:5
        println(readline(f))
    end
end

This is useful for confirming the delimiter, checking whether there is a header row, and spotting any obvious formatting issues before you try to parse the file.

A CSV file corresponding to the table above would look like this:

specimen_id,species,body_mass_g,collection_year
BM_001,Peromyscus maniculatus,21.4,2019
BM_002,Peromyscus leucopus,23.1,2019
BM_003,Mus musculus,18.7,2020

A TSV (tab-separated values) file is identical in structure, but with tab characters between values rather than commas:

specimen_id⟶species⟶body_mass_g⟶collection_year
BM_001⟶Peromyscus maniculatus⟶21.4⟶2019

Both extensions — .csv and .tsv — are strong signals of the format, but always confirm by opening the file.

2.2 Excel files

Excel workbooks (.xlsx, and the older .xls) do not store tables as plain text: .xlsx is a zipped XML format, and .xls is binary. A single workbook can contain multiple worksheets. When working with Excel data in Julia, you will therefore need to specify both the file path and which worksheet you want.

3 Setting up your Julia environment for tabular data

Before we can work with tabular data, we need to install DataFrames.jl, which gives us the DataFrame object type. In addition, we will install CSV.jl and XLSX.jl to read data stored in some common file formats. Installation is done through Julia’s built-in package manager, Pkg.

Installations can be done either globally, across your entire Julia ecosystem, or restricted to specific project environments.

3.1 Install globally

A global install makes the package available in any Julia session on your machine, regardless of which directory you are in or which project environment you have activated.

using Pkg
Pkg.add(["DataFrames", "CSV", "XLSX"])

You only need to do this once per Julia version. After installation, using DataFrames will work in any session.

3.2 Install for a project

If you are working inside a project directory — one that contains a Project.toml file — activate the project environment first, then add the packages. This records the dependency in Project.toml and pins exact versions in Manifest.toml, so anyone who clones the project gets the same package versions you used.

using Pkg
Pkg.activate(".")
Pkg.add(["DataFrames", "CSV", "XLSX"])

Once the environment is active, all subsequent Pkg operations in that session apply to the project rather than to the global environment.
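Either way, you can confirm what ended up installed in the currently active environment with Pkg.status:

```julia
using Pkg

# Lists every package in the active environment, with versions
Pkg.status()
```

If DataFrames, CSV, and XLSX appear in the listing, the environment is ready.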

4 The DataFrame type

In Julia, the standard representation of a data table is a DataFrame object from the DataFrames package.

A DataFrame holds a collection of named columns, where each column is a typed Vector. This means every value in a given column must be of the same type (all integers, all floating-point numbers, all strings, and so on), but different columns can have different types — exactly as in a real dataset.

using DataFrames

4.1 Creating a DataFrame manually from column vectors

The most direct way to construct a DataFrame is to pass named column vectors to the DataFrame constructor:

using DataFrames

specimens = DataFrame(
    specimen_id      = ["BM_001", "BM_002", "BM_003"],
    species          = ["Peromyscus maniculatus", "Peromyscus leucopus", "Mus musculus"],
    body_mass_g      = [21.4, 23.1, 18.7],
    collection_year  = [2019, 2019, 2020]
)
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020

Each keyword argument becomes a column: the argument name is the column name, and the vector on the right is the column’s data. All vectors must have the same length.

Tip: Column types are inferred automatically

Julia infers the element type of each column from the vector you supply. ["BM_001", "BM_002", "BM_003"] becomes a Vector{String} column; [21.4, 23.1, 18.7] becomes a Vector{Float64} column. You can confirm the types of all columns at once with describe(df), discussed below.
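Beyond describe, the element type of each column can be read off directly. A minimal sketch, re-creating a pared-down version of the specimens table from above:

```julia
using DataFrames

specimens = DataFrame(
    specimen_id = ["BM_001", "BM_002", "BM_003"],
    body_mass_g = [21.4, 23.1, 18.7],
)

# Pair each column name with the element type of its underlying vector
for (name, col) in pairs(eachcol(specimens))
    println(name, " => ", eltype(col))
end
```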

5 Reading tabular data from files

In practice you will almost never type your data in by hand. Most of the time you will load it from a file written by an instrument, a database export, a collaborator’s spreadsheet, or a public data repository.

5.1 Reading CSV files with CSV.jl

The CSV package provides the CSV.read function, which parses a delimited text file and, when combined with DataFrame, returns the result as a DataFrame:

using CSV, DataFrames

specimens = CSV.read("specimens.csv", DataFrame)

CSV.read takes two required arguments: the path to the file, and the sink type that determines what kind of object the parsed data is placed into. Passing DataFrame as the sink is by far the most common usage.

Note: Common keyword arguments for CSV.read

CSV.read accepts a large number of optional keyword arguments for handling real-world files that do not follow a perfectly clean format:

# File uses semicolons as the delimiter
CSV.read("data.csv", DataFrame; delim = ';')

# File uses tabs as the delimiter
CSV.read("data.tsv", DataFrame; delim = '\t')

# Skip the first two lines (e.g., a metadata header)
CSV.read("data.csv", DataFrame; skipto = 3)

# Treat "NA" and "" as missing values
CSV.read("data.csv", DataFrame; missingstring = ["NA", ""])

# File has no header row; supply column names manually
CSV.read("data.csv", DataFrame; header = ["id", "mass", "year"])

When a file refuses to parse cleanly, these keyword arguments are the first place to look.

For a TSV file, specify the delimiter explicitly:

specimens = CSV.read("specimens.tsv", DataFrame; delim = '\t')

5.2 Reading Excel files with XLSX.jl

The XLSX package provides access to Excel workbooks. The XLSX.readtable function reads one worksheet and returns it in a form that can be passed directly to the DataFrame constructor:

using XLSX, DataFrames

specimens = DataFrame(XLSX.readtable("specimens.xlsx", "Sheet1"))

The second argument to XLSX.readtable is the name of the worksheet to read.

Tip: Listing the worksheets in a workbook

If you do not know the names of the worksheets in an Excel file, you can inspect them:

xf = XLSX.readxlsx("specimens.xlsx")
XLSX.sheetnames(xf)

This returns a Vector{String} listing the names of all worksheets in the workbook.

6 Inspecting a DataFrame

Before doing anything else with a newly loaded dataset, it is good practice to inspect it: confirm that the right number of rows and columns were read, that the column names are what you expect, and that the types were inferred correctly.

6.1 Dimensions

nrow(specimens)   # number of rows
3
ncol(specimens)   # number of columns
4
size(specimens)   # (nrow, ncol) as a tuple
(3, 4)

6.2 Column names and types

names(specimens)  # returns a Vector{String} of column names
4-element Vector{String}:
 "specimen_id"
 "species"
 "body_mass_g"
 "collection_year"
describe(specimens)
4×7 DataFrame
 Row │ variable         mean     min           median  max                     nmissing  eltype
     │ Symbol           Union…   Any           Union…  Any                     Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────
   1 │ specimen_id               BM_001                BM_003                         0  String
   2 │ species                   Mus musculus          Peromyscus maniculatus         0  String
   3 │ body_mass_g      21.0667  18.7          21.4    23.1                           0  Float64
   4 │ collection_year  2019.33  2019          2019.0  2020                           0  Int64

describe returns a summary DataFrame with one row per column, showing the column name, element type, and basic summary statistics (minimum, mean, median, maximum, number of missing values). It is one of the most useful first-look tools available.

6.3 Previewing rows

first(specimens, 5)   # first 5 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
last(specimens, 3)    # last 3 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
Note: Displaying the full DataFrame

When a DataFrame has many columns, Julia will automatically abbreviate the display. To print every column without truncation:

show(specimens; allcols = true)

To print every row, rather than a truncated preview:

show(specimens; allrows = true)

7 Selecting columns

Once you have a DataFrame in memory, the most fundamental operation is selecting the columns you want to work with.

7.1 Selecting a single column as a Vector

Use the . accessor or square-bracket syntax with a Symbol or String:

specimens.body_mass_g          # returns a Vector{Float64}
3-element Vector{Float64}:
 21.4
 23.1
 18.7
specimens[!, :body_mass_g]     # identical result
3-element Vector{Float64}:
 21.4
 23.1
 18.7
specimens[!, "body_mass_g"]    # also identical
3-element Vector{Float64}:
 21.4
 23.1
 18.7
Tip: ! vs : as the row selector

In DataFrames indexing, the first argument selects rows and the second selects columns. Using ! as the row selector means all rows, without copying — you get a direct reference to the underlying column vector. Using : also selects all rows, but returns a copy. For read-only inspection, ! is fine and slightly more efficient. When you intend to modify the values in the returned vector and do not want to affect the original DataFrame, use :.
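The practical consequence is easiest to see by writing into the returned vectors. A small sketch, using a hypothetical one-column table:

```julia
using DataFrames

df = DataFrame(mass = [21.4, 23.1, 18.7])

view_col = df[!, :mass]   # reference to the underlying column vector
copy_col = df[:, :mass]   # independent copy of the column

view_col[1] = 0.0         # writes through to df itself
copy_col[2] = 0.0         # modifies only the copy; df is untouched
```

After these two assignments, df.mass[1] is 0.0 but df.mass[2] is still 23.1.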

7.2 Selecting multiple columns as a new DataFrame

Pass a Vector of column names to get a new DataFrame containing only those columns:

specimens[!, ["specimen_id", "body_mass_g"]]
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7

Or use the equivalent select function, which has a richer syntax:

select(specimens, :specimen_id, :body_mass_g)
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7

select does not modify the original DataFrame; it returns a new one.

8 Filtering rows

To extract the subset of rows that satisfy some condition, use boolean indexing.

8.1 Boolean indexing

Constructing a logical (boolean) condition over a column produces a BitVector (a compact array of true/false values), one value per row:

specimens.collection_year .== 2019
3-element BitVector:
 1
 1
 0

Passing this boolean vector as the row index returns only the rows where the condition is true:

specimens[specimens.collection_year .== 2019, :]
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
Note: Broadcasting with .

The . before comparison operators (.==, .>, .<=, and so on) is Julia’s broadcasting syntax. It applies the operator element-wise across the entire vector, rather than trying to compare the whole vector to a scalar at once. You will need this . prefix whenever applying a scalar comparison across a column.

# ✓ correct: element-wise comparison
specimens.body_mass_g .> 20.0

# ✗ will error: comparing a Vector to a scalar without broadcasting
specimens.body_mass_g > 20.0

Combining multiple conditions uses the element-wise operators .& (and) and .| (or), with each condition wrapped in parentheses, because & binds more tightly than the comparison operators:

specimens[(specimens.body_mass_g .> 20.0) .& (specimens.collection_year .== 2019), :]
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019

8.2 Filtering with filter

The filter function provides a more readable alternative for row filtering. Its first argument is a Pair (built with =>) mapping a column name, given as a Symbol, to a predicate function that returns true or false; its second argument is the DataFrame:

filter(:body_mass_g => m -> m > 20.0, specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
filter(:species => s -> startswith(s, "Peromyscus"), specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019

filter returns a new DataFrame and does not modify the original.

9 Adding and modifying columns

9.1 Adding a new column

Assign a new vector to a column name that does not yet exist:

specimens.body_mass_kg = specimens.body_mass_g ./ 1000
specimens
3×5 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year  body_mass_kg
     │ String       String                  Float64      Int64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019        0.0214
   2 │ BM_002       Peromyscus leucopus            23.1             2019        0.0231
   3 │ BM_003       Mus musculus                   18.7             2020        0.0187
Tip: In-place modification vs. returning a new DataFrame

Assignment with df.new_column = ... modifies the DataFrame in place, changing the original object. Many DataFrames functions — select, filter, transform — return a new DataFrame instead of modifying the original. This is the standard functional style and is generally preferred because it makes it easy to trace where each transformation happened. Use in-place assignment sparingly and deliberately.
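By Julia convention, the mutating counterparts of these functions carry a ! suffix (select!, transform!, filter!). A quick contrast, using a small throwaway table:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

kept = select(df, :a)   # returns a new one-column DataFrame; df is unchanged
select!(df, :a)         # drops column :b from df itself
```

After select!, df has a single column; the earlier select call left it with both.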

9.2 Adding a column derived from existing columns using transform

The transform function is the idiomatic way to derive new columns from existing ones without modifying the original:

transform(specimens, :body_mass_g => (m -> m ./ 1000) => :body_mass_kg)
3×5 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year  body_mass_kg
     │ String       String                  Float64      Int64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019        0.0214
   2 │ BM_002       Peromyscus leucopus            23.1             2019        0.0231
   3 │ BM_003       Mus musculus                   18.7             2020        0.0187

The column specification follows the pattern source_column => transformation_function => new_column_name.
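When the transformation is a scalar function of a single value, wrapping it in ByRow lifts it to operate row by row, which avoids writing explicit broadcasting inside the anonymous function. A sketch, on a pared-down table:

```julia
using DataFrames

specimens = DataFrame(body_mass_g = [21.4, 23.1, 18.7])

# ByRow(f) applies f to each value of :body_mass_g in turn
converted = transform(specimens, :body_mass_g => ByRow(m -> m / 1000) => :body_mass_kg)
```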

9.3 Selecting and transforming simultaneously

select and transform have in-place variants (select!, transform!) that modify the DataFrame directly. Alternatively, steps can be composed as a pipeline with Julia’s |> operator:

specimens |>
    df -> filter(:collection_year => y -> y == 2020, df) |>
    df -> select(df, :specimen_id, :species, :body_mass_g)
1×3 DataFrame
 Row │ specimen_id  species       body_mass_g
     │ String       String        Float64
─────┼───────────────────────────────────────
   1 │ BM_003       Mus musculus         18.7

10 Exercises


Exercise 1

Create the following DataFrame manually using the DataFrame constructor:

gene   chromosome  length_bp  gc_content
BRCA1  17          81189      0.421
TP53   17          19149      0.488
FOXP2  7           2473559    0.389
GAPDH  12          4741       0.641

Once constructed:

  1. Print the dimensions of the DataFrame.
  2. Use describe to inspect the column types.
  3. Select only the gene and gc_content columns.
  4. Filter to retain only genes on chromosome 17.

Exercise 2

Download the publicly available Palmer Penguins dataset as a CSV file.

  1. Load it into a DataFrame using CSV.read.
  2. Inspect it with first, last, and describe.
  3. How many rows and columns does it have?
  4. Filter the dataset to retain only rows for the species "Adelie".
  5. From that filtered subset, select only bill_length_mm, bill_depth_mm, and body_mass_g.
  6. Add a new column, bill_ratio, defined as bill_length_mm / bill_depth_mm.
  • © Jeet Sukumaran

Please share or adapt under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).