Basics of working with data tables

Author

Jeet Sukumaran

Proficiencies
  • Background concepts:
    • What tabular data is and why it is the dominant format for scientific datasets
    • Common file formats for tabular data: CSV, TSV, and Excel (.xlsx)
    • How to identify a file’s format from its extension and from inspecting its raw content
  • Core skills:
    • Loading the DataFrames, CSV, and XLSX packages
    • Creating a DataFrame manually from column vectors
    • Reading a CSV file into a DataFrame using CSV.read
    • Reading an Excel worksheet into a DataFrame using XLSX.readtable
    • Inspecting a DataFrame: size, nrow, ncol, names, describe, first, last
    • Selecting one or more columns by name and by position
    • Filtering rows using logical conditions with boolean indexing
    • Adding a new column to a DataFrame
    • Selecting subsets of columns and rows simultaneously

1 What is tabular data?

Datasets are often organized as tables: rectangular grids of values where each row represents one observation (a specimen, a time-point, a site, a sequencing read) and each column represents one variable or attribute measured for every observation (body mass, collection date, latitude, base-call quality score).

Tip: Rows are observations; columns are variables

This convention — one row per observation, one column per variable — is called tidy data. It is the standard expected by virtually every data-analysis and plotting library you will encounter. Keeping your own data in this form from the start will save a great deal of reformatting later.

A small example of a tabular dataset:

specimen_id  species                 body_mass_g  collection_year
BM_001       Peromyscus maniculatus  21.4         2019
BM_002       Peromyscus leucopus     23.1         2019
BM_003       Mus musculus            18.7         2020

Three rows, four columns. Each row is one specimen; each column is one thing we recorded about every specimen.

2 File formats for tabular data

Note: Sample files in different formats
  • Comma-separated value (“CSV”) files, with extension .csv: specimens_10.csv
  • Tab-separated values (“TSV”) files, with extension .tsv: specimens_10.tsv
  • Excel files: specimens_10.xlsx

Tabular datasets are stored in a variety of file formats. The format you receive will determine which Julia function you use to load it, so it is worth being able to identify them.

2.1 CSV and its variants

The most common plain-text format for tabular data is CSV (comma-separated values). A CSV file stores the table as lines of text. The first line is typically a header listing the column names; every subsequent line is one row of data. Values within each line are separated by a delimiter — most often a comma (,), giving the format its name, though tabs (\t) and semicolons (;) are also common.

Note: Inspecting a CSV file directly

Because CSV files are plain text, you can read them in any text editor or print their first few lines in the shell:

head -5 specimens.csv

or in Julia:

open("specimens.csv") do f
    for _ in 1:5
        println(readline(f))
    end
end

This is useful for confirming the delimiter, checking whether there is a header row, and spotting any obvious formatting issues before you try to parse the file.

A CSV file corresponding to the table above would look like this:

specimen_id,species,body_mass_g,collection_year
BM_001,Peromyscus maniculatus,21.4,2019
BM_002,Peromyscus leucopus,23.1,2019
BM_003,Mus musculus,18.7,2020

A TSV (tab-separated values) file is identical in structure, but with tab characters between values rather than commas:

specimen_id⟶species⟶body_mass_g⟶collection_year
BM_001⟶Peromyscus maniculatus⟶21.4⟶2019

Both extensions — .csv and .tsv — are strong signals of the format, but always confirm by opening the file.

2.2 Excel files

Excel workbooks (.xlsx, and the older .xls) do not store tables as plain text: .xlsx is a zipped XML format, and .xls is binary. A single workbook can contain multiple worksheets. When working with Excel data in Julia, you will therefore need to specify both the file path and which worksheet you want.

3 Setting up your Julia environment for tabular data

Before we can work with tabular data, we need to install DataFrames.jl, which gives us the DataFrame object type. In addition, we will install CSV.jl and XLSX.jl to read data stored in some common file formats. Installation is done through Julia’s built-in package manager, Pkg.

Installations can be done either globally, across your entire Julia ecosystem, or restricted to specific project environments.

3.1 Install globally

A global install makes the package available in any Julia session on your machine, regardless of which directory you are in or which project environment you have activated.

using Pkg
Pkg.add(["DataFrames", "CSV", "XLSX"])

You only need to do this once per Julia version. After installation, using DataFrames will work in any session.

3.2 Install for a project

If you are working inside a project directory — one that contains a Project.toml file — activate the project environment first, then add the packages. This records the dependency in Project.toml and pins exact versions in Manifest.toml, so anyone who clones the project gets the same package versions you used.

using Pkg
Pkg.activate(".")
Pkg.add(["DataFrames", "CSV", "XLSX"])

Once the environment is active, all subsequent Pkg operations in that session apply to the project rather than to the global environment.
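Either way, you can confirm what ended up installed in the currently active environment with Pkg.status:

```julia
using Pkg

# Lists every package in the active environment, with versions
Pkg.status()
```

If DataFrames, CSV, and XLSX appear in the listing, the environment is ready.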

4 The DataFrame type

In Julia, the standard representation of a data table is a DataFrame object from the DataFrames package.

A DataFrame holds a collection of named columns, where each column is a typed Vector. This means every value in a given column must be of the same type (all integers, all floating-point numbers, all strings, and so on), but different columns can have different types — exactly as in a real dataset.

using DataFrames

4.1 Creating a DataFrame manually from column vectors

The most direct way to construct a DataFrame is to pass named column vectors to the DataFrame constructor:

using DataFrames

specimens = DataFrame(
    specimen_id      = ["BM_001", "BM_002", "BM_003"],
    species          = ["Peromyscus maniculatus", "Peromyscus leucopus", "Mus musculus"],
    body_mass_g      = [21.4, 23.1, 18.7],
    collection_year  = [2019, 2019, 2020]
)
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020

Each keyword argument becomes a column: the argument name is the column name, and the vector on the right is the column’s data. All vectors must have the same length.

Tip: Column types are inferred automatically

Julia infers the element type of each column from the vector you supply. ["BM_001", "BM_002", "BM_003"] becomes a Vector{String} column; [21.4, 23.1, 18.7] becomes a Vector{Float64} column. You can confirm the types of all columns at once with describe(df), discussed below.
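Beyond describe, the element type of each column can be read off directly. A minimal sketch, re-creating a pared-down version of the specimens table from above:

```julia
using DataFrames

specimens = DataFrame(
    specimen_id = ["BM_001", "BM_002", "BM_003"],
    body_mass_g = [21.4, 23.1, 18.7],
)

# Pair each column name with the element type of its underlying vector
for (name, col) in pairs(eachcol(specimens))
    println(name, " => ", eltype(col))
end
```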

5 Reading tabular data from files

In practice you will almost never type your data in by hand. Most of the time you will load it from a file written by an instrument, a database export, a collaborator’s spreadsheet, or a public data repository.

5.1 Reading CSV files with CSV.jl

The CSV package provides the CSV.read function, which parses a delimited text file and, when combined with DataFrame, returns the result as a DataFrame:

using CSV, DataFrames

specimens = CSV.read("specimens.csv", DataFrame)

CSV.read takes two required arguments: the path to the file, and the sink type that determines what kind of object the parsed data is placed into. Passing DataFrame as the sink is by far the most common usage.

Note: Common keyword arguments for CSV.read

CSV.read accepts a large number of optional keyword arguments for handling real-world files that do not follow a perfectly clean format:

# File uses semicolons as the delimiter
CSV.read("data.csv", DataFrame; delim = ';')

# File uses tabs as the delimiter
CSV.read("data.tsv", DataFrame; delim = '\t')

# Skip the first two lines (e.g., a metadata header)
CSV.read("data.csv", DataFrame; skipto = 3)

# Treat "NA" and "" as missing values
CSV.read("data.csv", DataFrame; missingstring = ["NA", ""])

# File has no header row; supply column names manually
CSV.read("data.csv", DataFrame; header = ["id", "mass", "year"])

When a file refuses to parse cleanly, these keyword arguments are the first place to look.

For a TSV file, specify the delimiter explicitly:

specimens = CSV.read("specimens.tsv", DataFrame; delim = '\t')

5.2 Reading Excel files with XLSX.jl

The XLSX package provides access to Excel workbooks. The XLSX.readtable function reads one worksheet and returns it in a form that can be passed directly to the DataFrame constructor:

using XLSX, DataFrames

specimens = DataFrame(XLSX.readtable("specimens.xlsx", "Sheet1"))

The second argument to XLSX.readtable is the name of the worksheet to read.

Tip: Listing the worksheets in a workbook

If you do not know the names of the worksheets in an Excel file, you can inspect them:

xf = XLSX.readxlsx("specimens.xlsx")
XLSX.sheetnames(xf)

This returns a Vector{String} listing the names of all worksheets in the workbook.

6 Inspecting a DataFrame

Before doing anything else with a newly loaded dataset, it is good practice to inspect it: confirm that the right number of rows and columns were read, that the column names are what you expect, and that the types were inferred correctly.

6.1 Dimensions

nrow(specimens)   # number of rows
3
ncol(specimens)   # number of columns
4
size(specimens)   # (nrow, ncol) as a tuple
(3, 4)

6.2 Column names and types

names(specimens)  # returns a Vector{String} of column names
4-element Vector{String}:
 "specimen_id"
 "species"
 "body_mass_g"
 "collection_year"
describe(specimens)
4×7 DataFrame
 Row │ variable         mean     min           median  max                     nmissing  eltype
     │ Symbol           Union…   Any           Union…  Any                     Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────
   1 │ specimen_id               BM_001                BM_003                         0  String
   2 │ species                   Mus musculus          Peromyscus maniculatus         0  String
   3 │ body_mass_g      21.0667  18.7          21.4    23.1                           0  Float64
   4 │ collection_year  2019.33  2019          2019.0  2020                           0  Int64

describe returns a summary DataFrame with one row per column, showing the column name, element type, and basic summary statistics (minimum, mean, median, maximum, number of missing values). It is one of the most useful first-look tools available.

6.3 Previewing rows

first(specimens, 5)   # first 5 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
last(specimens, 3)    # last 3 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
Note: Displaying the full DataFrame

When a DataFrame has many columns, Julia will automatically abbreviate the display. To print every column without truncation:

show(specimens; allcols = true)

To print every row, rather than a truncated preview:

show(specimens; allrows = true)

7 Selecting columns

Once you have a DataFrame in memory, the most fundamental operation is selecting the columns you want to work with.

7.1 Selecting a single column as a Vector

Use the . accessor or square-bracket syntax with a Symbol or String:

specimens.body_mass_g          # returns a Vector{Float64}
3-element Vector{Float64}:
 21.4
 23.1
 18.7
specimens[!, :body_mass_g]     # identical result
3-element Vector{Float64}:
 21.4
 23.1
 18.7
specimens[!, "body_mass_g"]    # also identical
3-element Vector{Float64}:
 21.4
 23.1
 18.7
Tip: ! vs : as the row selector

In DataFrames indexing, the first argument selects rows and the second selects columns. Using ! as the row selector means all rows, without copying — you get a direct reference to the underlying column vector. Using : also selects all rows, but returns a copy. For read-only inspection, ! is fine and slightly more efficient. When you intend to modify the values in the returned vector and do not want to affect the original DataFrame, use :.
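The practical consequence is easiest to see by writing into the returned vectors. A small sketch, using a hypothetical one-column table:

```julia
using DataFrames

df = DataFrame(mass = [21.4, 23.1, 18.7])

view_col = df[!, :mass]   # reference to the underlying column vector
copy_col = df[:, :mass]   # independent copy of the column

view_col[1] = 0.0         # writes through to df itself
copy_col[2] = 0.0         # modifies only the copy; df is untouched
```

After these two assignments, df.mass[1] is 0.0 but df.mass[2] is still 23.1.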

7.2 Selecting multiple columns as a new DataFrame

Pass a Vector of column names to get a new DataFrame containing only those columns:

specimens[!, ["specimen_id", "body_mass_g"]]
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7

Or use the equivalent select function, which has a richer syntax:

select(specimens, :specimen_id, :body_mass_g)
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7

select does not modify the original DataFrame; it returns a new one.

8 Filtering rows

To extract the subset of rows that satisfy some condition, use boolean indexing.

8.1 Boolean indexing

Constructing a logical (boolean) condition over a column produces a BitVector (a compact array of true/false values), one value per row:

specimens.collection_year .== 2019
3-element BitVector:
 1
 1
 0

Passing this boolean vector as the row index returns only the rows where the condition is true:

specimens[specimens.collection_year .== 2019, :]
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
Note: Broadcasting with .

The . before comparison operators (.==, .>, .<=, and so on) is Julia’s broadcasting syntax. It applies the operator element-wise across the entire vector, rather than trying to compare the whole vector to a scalar at once. You will need this . prefix whenever applying a scalar comparison across a column.

# ✓ correct: element-wise comparison
specimens.body_mass_g .> 20.0

# ✗ will error: comparing a Vector to a scalar without broadcasting
specimens.body_mass_g > 20.0

Combining multiple conditions uses the element-wise operators .& (and) and .| (or), with each condition wrapped in parentheses, because & binds more tightly than the comparison operators:

specimens[(specimens.body_mass_g .> 20.0) .& (specimens.collection_year .== 2019), :]
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019

8.2 Filtering with filter

The filter function provides a more readable alternative for row filtering. Its first argument is a Pair (built with =>) mapping a column name, given as a Symbol, to a predicate function that returns true or false; its second argument is the DataFrame:

filter(:body_mass_g => m -> m > 20.0, specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
filter(:species => s -> startswith(s, "Peromyscus"), specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼───────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019

filter returns a new DataFrame and does not modify the original.

9 Adding and modifying columns

9.1 Adding a new column

Assign a new vector to a column name that does not yet exist:

specimens.body_mass_kg = specimens.body_mass_g ./ 1000
specimens
3×5 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year  body_mass_kg
     │ String       String                  Float64      Int64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019        0.0214
   2 │ BM_002       Peromyscus leucopus            23.1             2019        0.0231
   3 │ BM_003       Mus musculus                   18.7             2020        0.0187
Tip: In-place modification vs. returning a new DataFrame

Assignment with df.new_column = ... modifies the DataFrame in place, changing the original object. Many DataFrames functions — select, filter, transform — return a new DataFrame instead of modifying the original. This is the standard functional style and is generally preferred because it makes it easy to trace where each transformation happened. Use in-place assignment sparingly and deliberately.
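By Julia convention, the mutating counterparts of these functions carry a ! suffix (select!, transform!, filter!). A quick contrast, using a small throwaway table:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

kept = select(df, :a)   # returns a new one-column DataFrame; df is unchanged
select!(df, :a)         # drops column :b from df itself
```

After select!, df has a single column; the earlier select call left it with both.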

9.2 Adding a column derived from existing columns using transform

The transform function is the idiomatic way to derive new columns from existing ones without modifying the original:

transform(specimens, :body_mass_g => (m -> m ./ 1000) => :body_mass_kg)
3×5 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year  body_mass_kg
     │ String       String                  Float64      Int64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019        0.0214
   2 │ BM_002       Peromyscus leucopus            23.1             2019        0.0231
   3 │ BM_003       Mus musculus                   18.7             2020        0.0187

The column specification follows the pattern source_column => transformation_function => new_column_name.
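When the transformation is a scalar function of a single value, wrapping it in ByRow lifts it to operate row by row, which avoids writing explicit broadcasting inside the anonymous function. A sketch, on a pared-down table:

```julia
using DataFrames

specimens = DataFrame(body_mass_g = [21.4, 23.1, 18.7])

# ByRow(f) applies f to each value of :body_mass_g in turn
converted = transform(specimens, :body_mass_g => ByRow(m -> m / 1000) => :body_mass_kg)
```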

9.3 Selecting and transforming simultaneously

select and transform have in-place variants (select!, transform!) that modify the DataFrame directly. Alternatively, steps can be composed as a pipeline with Julia’s |> operator:

specimens |>
    df -> filter(:collection_year => y -> y == 2020, df) |>
    df -> select(df, :specimen_id, :species, :body_mass_g)
1×3 DataFrame
 Row │ specimen_id  species       body_mass_g
     │ String       String        Float64
─────┼───────────────────────────────────────
   1 │ BM_003       Mus musculus         18.7

10 Exercises


Exercise 1

Create the following DataFrame manually using the DataFrame constructor:

gene   chromosome  length_bp  gc_content
BRCA1  17          81189      0.421
TP53   17          19149      0.488
FOXP2  7           2473559    0.389
GAPDH  12          4741       0.641

Once constructed:

  1. Print the dimensions of the DataFrame.
  2. Use describe to inspect the column types.
  3. Select only the gene and gc_content columns.
  4. Filter to retain only genes on chromosome 17.

Exercise 2

Download the publicly available Palmer Penguins dataset as a CSV file.

  1. Load it into a DataFrame using CSV.read.
  2. Inspect it with first, last, and describe.
  3. How many rows and columns does it have?
  4. Filter the dataset to retain only rows for the species "Adelie".
  5. From that filtered subset, select only bill_length_mm, bill_depth_mm, and body_mass_g.
  6. Add a new column, bill_ratio, defined as bill_length_mm / bill_depth_mm.
  • © Jeet Sukumaran

Please share or adapt under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).