Feather is a fast, lightweight binary format for data frames, with R and Python implementations. The original RStudio announcement is here. The speed improvement really is impressive. Here are my numbers for saving roughly 100 million probabilities:
# haplotype probs: 192 animals x 8 x 64000 markers
> format(object.size(probs), units="Mb")
[1] "750 Mb"
# saveRDS or save needs almost a minute to write probs to disk
> system.time(saveRDS(dprobs, file="DO192_probs.rds"))
user system elapsed
50.701 0.574 51.678
# write_feather needs 6-7 seconds
> system.time(write_feather(dprobs, path="DO192_probs.feather"))
user system elapsed
1.344 1.051 6.272
Feather looks even better when you compare it to traditional text formats like CSV. As David Smith explains in his blog, one reason is that traditional formats are row-oriented, while R's internal storage is column-oriented.
Diagram credit: Hadley Wickham
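To make the row-versus-column point concrete, here is a minimal base R sketch. The data frame `df` and its column names are made up for illustration. It only shows that a data frame is stored as a list of column vectors, whereas a CSV dump of the same data interleaves the columns line by line.

```r
# Hypothetical toy data frame, just for illustration
df <- data.frame(id = 1:3, prob = c(0.1, 0.5, 0.4))

# Internally a data frame is a list with one vector per column
is.list(df)
length(df)        # counts columns, not rows

# Pulling out a column hands back the stored vector directly
col <- df[["prob"]]

# A CSV representation is row-oriented: every line mixes values
# from all columns, so reading one column means parsing every line
csv_lines <- capture.output(write.csv(df, row.names = FALSE))
print(csv_lines)
```

A column-oriented file format can map almost directly onto this list-of-vectors layout, which is one plausible reason it needs so little work on read and write.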
I have one more reason to use Feather. If you have a dataset with many columns (e.g. one per gene in the human or mouse genome) and you need fast access to just one of them (e.g. in a Shiny app), Feather is ideal: because the format is column-oriented, a single column can be read without loading the rest of the file.
> system.time(read_feather("DO192_probs.feather", columns = "19_48310898"))
user system elapsed
0.068 0.000 0.069
Sure, there are other solutions, like rhdf5 or RSQLite, but Feather is the easiest to use, at least for me, at least in R. See David Smith (Microsoft R) for more details: http://blog.revolutionanalytics.com/2016/05/feather-package.html
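If you want to try the single-column read without a 750 Mb file, the following is a small self-contained sketch. It assumes the feather package is installed. The toy data frame, the column names `m1` to `m50`, and the temporary file are all invented for the example and have nothing to do with my haplotype data.

```r
library(feather)  # assumption: the feather package is installed

# Hypothetical toy data: 1000 rows x 50 numeric columns named m1..m50
set.seed(1)
toy <- as.data.frame(matrix(runif(1000 * 50), nrow = 1000))
names(toy) <- paste0("m", 1:50)

# Write the whole data frame to a temporary Feather file
path <- tempfile(fileext = ".feather")
write_feather(toy, path)

# Read back a single column by name via the columns argument
one <- read_feather(path, columns = "m7")

dim(one)
names(one)
```

Wrapping the `read_feather()` call in `system.time()` and comparing it against a full read of the same file is an easy way to see how much the column selection saves on your own data.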