fst provides a fast, easy and flexible way to serialize data frames. It allows for fast compression and decompression and has the ability to access stored frames randomly. With access speeds above 1 GB/s,
fst is specifically designed to unlock the potential of the high-speed solid state disks found in most modern computers. The table below compares the read and write performance of the
fst package to various alternatives.
| Method | Format | Time (s) | Size (MB) | Speed (MB/s) | N |
|:-------|:-------|---------:|---------:|-------------:|--:|
These benchmarks were performed on a Xeon E5 CPU @ 2.5 GHz (results for more systems will follow). The speed column was calculated by dividing the in-memory size of the data frame by the measured time. The results are also visualized in the figure below.
fst outperforms the
data.table package as well as the base
readRDS and saveRDS functions for uncompressed reads and writes. It also offers additional features such as very fast compression and random access (to both columns and rows) of the stored data.
The easiest way to install the package is from CRAN:
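```r
install.packages("fst")
```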
You can also use the development version from GitHub:
```r
# install.packages("devtools")
devtools::install_github("fstPackage/fst")
```
Using fst is extremely simple. Data can be stored and retrieved using methods `write.fst` and `read.fst`:
```r
# Generate a random data frame with 10 million rows and various column types
nrOfRows <- 1e7

x <- data.frame(
  Integers = 1:nrOfRows,  # integer
  Logicals = sample(c(TRUE, FALSE, NA), nrOfRows, replace = TRUE),  # logical
  Text = factor(sample(state.name, nrOfRows, replace = TRUE)),  # text
  Numericals = runif(nrOfRows, 0.0, 100),  # numericals
  stringsAsFactors = FALSE)

# Store it
write.fst(x, "dataset.fst")

# Retrieve it
y <- read.fst("dataset.fst")
```
With `read.fst` you can access a selection of rows from the stored data frame by specifying a range:
```r
read.fst("dataset.fst", from = 2000, to = 4990)  # subset rows
```
You will notice that the read times for this small subset are very short because
read.fst (almost) only touches the on-disk data from within the selected range. Specific columns can be selected with:
```r
read.fst("dataset.fst", c("Logicals", "Text"), 2000, 4990)  # subset rows and columns
```
Here, only data from the selected rows and columns are deserialized from file.
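Before reading a subset, it can be useful to inspect the stored column names and row count without touching the data itself. A minimal sketch, assuming the `dataset.fst` file written above:

```r
library(fst)

# Retrieve the metadata of the stored data frame: number of rows,
# column names and column types, without deserializing any data
fst.metadata("dataset.fst")
```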
For compression, the excellent and speedy LZ4 and ZSTD compression algorithms are used. These compressors, in combination with type-specific bit and byte filters, enable
fst to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):
```r
write.fst(x, "dataset.fst", 100)  # use maximum compression
```
For this particular data frame, the on-disk size of
`x` is less than 35 percent of the in-memory size (
`object.size(x)`) when full compression is used. The figure below shows the compression ratio depending on the settings used and compares it to the ratios achieved by the
data.table package (which offers no additional compression) and method
`saveRDS` (gzip mode) from base R:
Note that the on-disk size of a csv file is usually larger than the in-memory size (here by a factor of about 2). Obviously, csv files are text-based, and it's no surprise that deserializing from text-based sources takes more time (due to the necessary text parsing). For a non-binary file writer, method
`fwrite` from the
data.table package actually has very impressive performance! There are only 10 compression settings in
`saveRDS` (gzip mode), which have been scaled from 0 (uncompressed) to 100 (setting 9) for easier comparison.
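You can estimate a compression ratio like the one reported above for your own data by comparing the in-memory size to the on-disk size. A minimal sketch, assuming the data frame `x` defined earlier:

```r
library(fst)

# In-memory size of the data frame, in bytes
inMemory <- as.numeric(object.size(x))

# On-disk size at maximum compression
write.fst(x, "dataset.fst", 100)
onDisk <- file.size("dataset.fst")

# On-disk size as a fraction of in-memory size
onDisk / inMemory
```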
The read and write speeds measured in the same benchmark:
The read and write speeds reported in the figure are calculated by dividing the in-memory size of the data frame by the measured elapsed time for a read or write operation (more details will follow). As you can see,
fst achieves very high read and write speeds, even for compressed data. For this benchmark, an Intel Xeon E5 CPU was used, running at 2.5 GHz. To cope with the high IO speeds, an OCZ RevoDrive 350 was used with maximum read and write speeds of around 1750 MB/s.
As you can see from the figure, the benchmark measures higher speeds than the reported maximum SSD speed for low compression settings. This is due to the fact that the on-disk filesize of the data frame used is actually significantly smaller than the in-memory size, even at a compression setting of zero. Package
fst employs several type-specific byte- and bit-shifters that are used even in uncompressed mode. For example, a single logical value in R takes up 32 bits in memory but only 2 bits on disk (there are 3 possible values: `TRUE`, `FALSE` and `NA`).
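You can observe this effect yourself by writing a logical-only data frame at compression setting 0 and comparing sizes. A sketch; the exact on-disk size also includes a small amount of file metadata:

```r
library(fst)

# A data frame with a single logical column of 1 million values
z <- data.frame(Logicals = sample(c(TRUE, FALSE, NA), 1e6, replace = TRUE))

write.fst(z, "logicals.fst", 0)  # compression setting 0

# In memory, each logical value occupies 4 bytes (32 bits)
as.numeric(object.size(z))

# On disk, roughly 2 bits per value plus some file metadata
file.size("logicals.fst")
```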
Note to users: the binary format used for data storage by the package (the 'fst file format') is expected to evolve in the coming months. Therefore,
`fst` should not be used for long-term data storage.