What tabular data is and why it is the dominant format for scientific datasets
Common file formats for tabular data: CSV, TSV, and Excel (.xlsx)
How to identify a file’s format from its extension and from inspecting its raw content
Core skills:
Loading the DataFrames, CSV, and XLSX packages
Creating a DataFrame manually from column vectors
Reading a CSV file into a DataFrame using CSV.read
Reading an Excel worksheet into a DataFrame using XLSX.readtable
Inspecting a DataFrame: size, nrow, ncol, names, describe, first, last
Selecting one or more columns by name and by position
Filtering rows using logical conditions with boolean indexing
Adding a new column to a DataFrame
Selecting subsets of columns and rows simultaneously
1 What is tabular data?
Datasets are often organized as tables: rectangular grids of values where each row represents one observation (a specimen, a time-point, a site, a sequencing read) and each column represents one variable or attribute measured for every observation (body mass, collection date, latitude, base-call quality score).
Tip: Rows are observations; columns are variables
This convention — one row per observation, one column per variable — is called tidy data. It is the standard expected by virtually every data-analysis and plotting library you will encounter. Keeping your own data in this form from the start will save a great deal of reformatting later.
A small example of a tabular dataset:
specimen_id  species                 body_mass_g  collection_year
BM_001       Peromyscus maniculatus  21.4         2019
BM_002       Peromyscus leucopus     23.1         2019
BM_003       Mus musculus            18.7         2020
Three rows, four columns. Each row is one specimen; each column is one thing we recorded about every specimen.
2 File formats for tabular data
Note: Sample files in different formats
Comma-separated value (“CSV”) files, with extension .csv: specimens_10.csv
Tab-separated values (“TSV”) files, with extension .tsv: specimens_10.tsv
Tabular datasets are stored in a variety of file formats. The format you receive will determine which Julia function you use to load it, so it is worth being able to identify them.
2.1 CSV and its variants
The most common plain-text format for tabular data is CSV (comma-separated values). A CSV file stores the table as lines of text. The first line is typically a header listing the column names; every subsequent line is one row of data. Values within each line are separated by a delimiter — most often a comma (,), giving the format its name, though tabs (\t) and semicolons (;) are also common.
Note: Inspecting a CSV file directly
Because CSV files are plain text, you can read them in any text editor or print their first few lines in the shell:
head -5 specimens.csv
or in Julia:
open("specimens.csv") do f
    for _ in 1:5
        println(readline(f))
    end
end
This is useful for confirming the delimiter, checking whether there is a header row, and spotting any obvious formatting issues before you try to parse the file.
A CSV file corresponding to the table above would look like this:
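```
specimen_id,species,body_mass_g,collection_year
BM_001,Peromyscus maniculatus,21.4,2019
BM_002,Peromyscus leucopus,23.1,2019
BM_003,Mus musculus,18.7,2020
```

The first line is the header; each subsequent line is one row, with values separated by commas.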
Both extensions — .csv and .tsv — are strong signals of the format, but always confirm by opening the file.
2.2 Excel files
Excel workbooks (.xlsx, and the older .xls) store tables in a binary format: you cannot read them as plain text. They can contain multiple worksheets in a single file. When working with Excel data in Julia, you will need to specify both the file path and which worksheet you want.
3 Setting up your Julia environment for tabular data
Before we can work with tabular data, we need to install DataFrames.jl, which provides the DataFrame object type. In addition, we will install CSV.jl and XLSX.jl so we can access data stored in some common file formats. Installation is done through Julia's built-in package manager, Pkg.
Installations can be done either globally, across your entire Julia ecosystem, or restricted to specific project environments.
3.1 Install globally
A global install makes the package available in any Julia session on your machine, regardless of which directory you are in or which project environment you have activated.
using Pkg
Pkg.add(["DataFrames", "CSV", "XLSX"])
You only need to do this once per Julia version. After installation, using DataFrames will work in any session.
3.2 Install for a project
If you are working inside a project directory — one that contains a Project.toml file — activate the project environment first, then add the packages. This records the dependency in Project.toml and pins exact versions in Manifest.toml, so anyone who clones the project gets the same package versions you used.
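A minimal sketch of that workflow, assuming the project lives in the current directory:

```julia
using Pkg

# Activate the project environment in the current directory
Pkg.activate(".")

# Record DataFrames, CSV, and XLSX in this project's Project.toml
Pkg.add(["DataFrames", "CSV", "XLSX"])
```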
Once the environment is active, all subsequent Pkg operations in that session apply to the project rather than to the global environment.
4 The DataFrame type
In Julia, the standard representation of a data table is a DataFrame object from the DataFrames package.
A DataFrame holds a collection of named columns, where each column is a typed Vector. This means every value in a given column must be of the same type (all integers, all floating-point numbers, all strings, and so on), but different columns can have different types — exactly as in a real dataset.
using DataFrames
4.1 Creating a DataFrame manually from column vectors
The most direct way to construct a DataFrame is to pass named column vectors to the DataFrame constructor:
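For example, the specimens table from Section 1 can be built like this:

```julia
using DataFrames

specimens = DataFrame(
    specimen_id     = ["BM_001", "BM_002", "BM_003"],
    species         = ["Peromyscus maniculatus", "Peromyscus leucopus", "Mus musculus"],
    body_mass_g     = [21.4, 23.1, 18.7],
    collection_year = [2019, 2019, 2020],
)
```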
Each keyword argument becomes a column: the argument name is the column name, and the vector on the right is the column’s data. All vectors must have the same length.
Tip: Column types are inferred automatically
Julia infers the element type of each column from the vector you supply. ["BM_001", "BM_002", "BM_003"] becomes a Vector{String} column; [21.4, 23.1, 18.7] becomes a Vector{Float64} column. You can confirm the types of all columns at once with describe(df), discussed below.
5 Reading tabular data from files
In practice you will almost never type your data in by hand. Most of the time you will load it from a file written by an instrument, a database export, a collaborator’s spreadsheet, or a public data repository.
5.1 Reading CSV files with CSV.jl
The CSV package provides the CSV.read function, which parses a delimited text file and, when combined with DataFrame, returns the result as a DataFrame:
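A self-contained sketch; the temporary file here stands in for a real dataset such as the sample file specimens_10.csv:

```julia
using CSV, DataFrames

# Write a tiny CSV to a temporary file so the example runs anywhere
path, io = mktemp()
write(io, "specimen_id,species,body_mass_g,collection_year\n" *
          "BM_001,Peromyscus maniculatus,21.4,2019\n")
close(io)

# Parse the file and place the result in a DataFrame
specimens = CSV.read(path, DataFrame)
```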
CSV.read takes two required arguments: the path to the file, and the sink type that determines what kind of object the parsed data is placed into. Passing DataFrame as the sink is by far the most common usage.
Note: Common keyword arguments for CSV.read
CSV.read accepts a large number of optional keyword arguments for handling real-world files that do not follow a perfectly clean format:
# File uses semicolons as the delimiter
CSV.read("data.csv", DataFrame; delim = ';')

# File uses tabs as the delimiter
CSV.read("data.tsv", DataFrame; delim = '\t')

# Skip the first two lines (e.g., a metadata header)
CSV.read("data.csv", DataFrame; skipto = 3)

# Treat "NA" and "" as missing values
CSV.read("data.csv", DataFrame; missingstring = ["NA", ""])

# File has no header row; supply column names manually
CSV.read("data.csv", DataFrame; header = ["id", "mass", "year"])
When a file refuses to parse cleanly, these keyword arguments are the first place to look.
5.2 Reading Excel files with XLSX.jl
The XLSX package provides access to Excel workbooks. The XLSX.readtable function reads one worksheet and returns it in a form that can be passed directly to the DataFrame constructor:
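A self-contained sketch; the temporary file and the worksheet name "Sheet1" are placeholders, not files shipped with this lesson:

```julia
using XLSX, DataFrames

# Write a tiny one-sheet workbook to a temporary path so the example runs anywhere
path = tempname() * ".xlsx"
XLSX.writetable(path, "Sheet1" => DataFrame(id = ["BM_001"], mass = [21.4]))

# Read the worksheet back and convert the result to a DataFrame
df = DataFrame(XLSX.readtable(path, "Sheet1"))
```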
To see which worksheets a workbook contains, open it with XLSX.readxlsx and call XLSX.sheetnames on the result; this returns a Vector{String} listing the names of all worksheets in the workbook.
6 Inspecting a DataFrame
Before doing anything else with a newly loaded dataset, it is good practice to inspect it: confirm that the right number of rows and columns were read, that the column names are what you expect, and that the types were inferred correctly.
6.1 Dimensions
nrow(specimens) # number of rows
3
ncol(specimens) # number of columns
4
size(specimens) # (nrow, ncol) as a tuple
(3, 4)
6.2 Column names and types
names(specimens) # returns a Vector{String} of column names
describe returns a summary DataFrame with one row per column, showing the column name, element type, and basic summary statistics (minimum, mean, median, maximum, number of missing values). It is one of the most useful first-look tools available.
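For instance, on a two-column version of the specimens data:

```julia
using DataFrames

specimens = DataFrame(specimen_id = ["BM_001", "BM_002", "BM_003"],
                      body_mass_g = [21.4, 23.1, 18.7])

# One summary row per column: name, mean, min, median, max, missings, eltype
describe(specimens)
```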
6.3 Previewing rows
first(specimens, 5) # first 5 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
last(specimens, 3) # last 3 rows
3×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
   3 │ BM_003       Mus musculus                   18.7             2020
Note: Displaying the full DataFrame
When a DataFrame has many columns, Julia will automatically abbreviate the display. To print every column without truncation:
show(specimens; allcols = true)
To print every row as well:
show(specimens; allrows = true)
7 Selecting columns
Once you have a DataFrame in memory, the most fundamental operation is selecting the columns you want to work with.
7.1 Selecting a single column as a Vector
Use the . accessor or square-bracket syntax with a Symbol or String:
specimens.body_mass_g # returns a Vector{Float64}
3-element Vector{Float64}:
21.4
23.1
18.7
specimens[!, :body_mass_g] # identical result
3-element Vector{Float64}:
21.4
23.1
18.7
specimens[!, "body_mass_g"] # also identical
3-element Vector{Float64}:
21.4
23.1
18.7
Tip: ! vs. : as the row selector
In DataFrames indexing, the first argument selects rows and the second selects columns. Using ! as the row selector means all rows, without copying — you get a direct reference to the underlying column vector. Using : also selects all rows, but returns a copy. For read-only inspection, ! is fine and slightly more efficient. When you intend to modify the values in the returned vector and do not want to affect the original DataFrame, use :.
7.2 Selecting multiple columns as a new DataFrame
Pass a Vector of column names to get a new DataFrame containing only those columns:
specimens[!, ["specimen_id", "body_mass_g"]]
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7
Or use the equivalent select function, which has a richer syntax:
select(specimens, :specimen_id, :body_mass_g)
3×2 DataFrame
 Row │ specimen_id  body_mass_g
     │ String       Float64
─────┼──────────────────────────
   1 │ BM_001              21.4
   2 │ BM_002              23.1
   3 │ BM_003              18.7
select does not modify the original DataFrame; it returns a new one.
8 Filtering rows
To extract the subset of rows that satisfy some condition, use boolean indexing.
8.1 Boolean indexing
Constructing a logical (boolean) condition over a column produces a Vector{Bool} — a vector of true/false values, one per row:
specimens.collection_year .== 2019
3-element BitVector:
1
1
0
Passing this boolean vector as the row index returns only the rows where the condition is true:
specimens[specimens.collection_year .== 2019, :]
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
Note: Broadcasting with .
The . before comparison operators (.==, .>, .<=, and so on) is Julia’s broadcasting syntax. It applies the operator element-wise across the entire vector, rather than trying to compare the whole vector to a scalar at once. You will need this . prefix whenever applying a scalar comparison across a column.
# ✓ correct: element-wise comparison
specimens.body_mass_g .> 20.0

# ✗ will error: comparing a Vector to a scalar without broadcasting
specimens.body_mass_g > 20.0
Combining multiple conditions uses the element-wise operators .& (and) and .| (or), with each condition wrapped in parentheses:
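A self-contained sketch using the specimens data from earlier in the chapter:

```julia
using DataFrames

specimens = DataFrame(specimen_id = ["BM_001", "BM_002", "BM_003"],
                      body_mass_g = [21.4, 23.1, 18.7],
                      collection_year = [2019, 2019, 2020])

# Rows collected in 2019 AND heavier than 22 g
heavy_2019 = specimens[(specimens.collection_year .== 2019) .&
                       (specimens.body_mass_g .> 22.0), :]
```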
The filter function provides a more readable alternative for row filtering. Its first argument is a Pair written column_name => predicate, where the column name is a Symbol and the predicate is a function returning true or false; its second argument is the DataFrame:
filter(:body_mass_g => m -> m > 20.0, specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
filter(:species => s -> startswith(s, "Peromyscus"), specimens)
2×4 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year
     │ String       String                  Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019
   2 │ BM_002       Peromyscus leucopus            23.1             2019
filter returns a new DataFrame and does not modify the original.
9 Adding and modifying columns
9.1 Adding a new column
Assign a new vector to a column name that does not yet exist:
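For example, adding a body_mass_kg column computed from the existing grams column (a sketch using the specimens data from earlier):

```julia
using DataFrames

specimens = DataFrame(specimen_id = ["BM_001", "BM_002", "BM_003"],
                      body_mass_g = [21.4, 23.1, 18.7])

# Broadcasted ./ divides every element; assigning to a new name creates the column
specimens.body_mass_kg = specimens.body_mass_g ./ 1000
```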
Tip: In-place modification vs. returning a new DataFrame
Assignment with df.new_column = ... modifies the DataFrame in place, changing the original object. Many DataFrames functions (select, filter, transform) return a new DataFrame instead of modifying the original. This is the standard functional style and is generally preferred because it makes it easy to trace where each transformation happened. Use in-place assignment sparingly and deliberately.
9.2 Adding a column derived from existing columns using transform
The transform function is the idiomatic way to derive new columns from existing ones without modifying the original:
transform(specimens, :body_mass_g => (m -> m ./ 1000) => :body_mass_kg)
3×5 DataFrame
 Row │ specimen_id  species                 body_mass_g  collection_year  body_mass_kg
     │ String       String                  Float64      Int64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ BM_001       Peromyscus maniculatus         21.4             2019        0.0214
   2 │ BM_002       Peromyscus leucopus            23.1             2019        0.0231
   3 │ BM_003       Mus musculus                   18.7             2020        0.0187
The column specification follows the pattern source_column => transformation_function => new_column_name.
9.3 Selecting and transforming simultaneously
select and transform have in-place variants, select! and transform!, which modify the DataFrame directly; the non-mutating versions can also be composed as a pipeline with Julia's |> operator:
specimens |>
    df -> filter(:collection_year => y -> y == 2020, df) |>
    df -> select(df, :specimen_id, :species, :body_mass_g)
1×3 DataFrame
 Row │ specimen_id  species       body_mass_g
     │ String       String        Float64
─────┼───────────────────────────────────────
   1 │ BM_003       Mus musculus         18.7
10 Exercises
Exercise 1
Create the following DataFrame manually using the DataFrame constructor: