Thursday, July 14, 2016

One more reason to use Feather

It was quiet here for some time. Not because I have nothing to blog but because I have no time to do so properly. One of those almost forgotten posts was about Feather package / module.

Feather is a fast, lightweight binary format for data frames with R and Python implementation. The original RStudio announcement is here. And sure, the speed improvement is impressive. See my numbers for saving ~ 100 million probabilities:

# haplotype probs: 192 animals x 8 x 64000 markers
> format(object.size(probs), units="Mb")
[1] "750 Mb"

# saveRDS or save needs almost a minute to write probs to disk
> system.time(saveRDS(dprobs, file="DO192_probs.rds"))
   user  system elapsed
 50.701   0.574  51.678

# write_feather needs 6-7 seconds
> system.time(write_feather(dprobs, file="DO192_probs.feather"))
   user  system elapsed  
  1.344   1.051   6.272

Feather is even better if you compare it to traditional text formats like CSV. As David Smith explains in his blog, one of the reasons is traditional formats are row-oriented while internal R's storage is column-oriented.

Diagram credit: Hadley Wickham

I have one more reason to use Feather. If you have datasets with many columns (e.g. genes in human/mouse genome) and you need fast access to just one column (e.g. Shiny app), then Feather is ideal because its columns are automatically indexed.

read_feather("DO192_probs.feather", column = "19_48310898")
   user  system elapsed
 0.068   0.000   0.069

Sure, there are other solutions, like rhdf5 or RSQLite, but Feather is the easiest to use, at least for me, at least in R. See David Smith (Microsoft R) for more details:

Wednesday, May 20, 2015

Teaching R course? Use analogsea to run your customized RStudio in Digital Ocean!

Two years ago I taught an introductory R/Shiny course here at The Jackson Lab. We all learnt a lot. Unfortunately not about Shiny itself, but rather about incompatibilities between its versions and trouble with its installation to some machines.

And it is not only my experience. If you look into forums of Rafael Irizarry MOOC courses, so many questions are just about installation / incompatibilities of R packages. The solution exists for a long time: run your R in a cloud. However, customization of virtual machines (like Amazon EC2) used to be a nontrivial task.

In this post I will show how a few lines of R code can start a customized RStudio docklet in a cloud and email login credentials to course participants. So, the participant do not need to install R and the required packages. Moreover, it is guaranteed they all run exactly the same software. All they need is a decent web browser to access RStudio server.

RStudio server login

Running RStudio in Digital Ocean with R/analogsea

So how complicated is it today to start your RStudio on clouds? It is (almost) a one-liner:
  1. If you do not have  Digital Ocean account, get one. You should receive a promotional credit $10 (= 1 regular machine running without interruption for 1 month):
    (full disclosure: if you create your account using the link above I might get an extra credit)
  2. Install analogsea package from Github. Make sure to create Digital Ocean personal access token and in R set DO_PAT environment variable. Also create your personal SSH key and upload it to Digital Ocean.
  3. And now it is really easy:

    # Sys.setenv(DO_PAT = "*****") set access token

    # start your machine in Digital Ocean
    d <- docklet_create(size = getOption("do_size", "512mb"))
    # run RStudio on machine 'd' (rocker/rstudio docker image)
    d %>% docklet_rstudio()
The last line should open your browser with RStudio login page (user "rstudio", password "rstudio"). If not, use summary(d) to get the IP address of your machine and go to http://your_machine:8787

It will cost you ~$0.01 per hour ($5 per month, May 2015). When you are done, do not forget to stop your Digital Ocean machine (droplet_delete(d)). At the end, make sure that you successfully killed all your machines - either log in to Digital Ocean or by calling droplets() in R.

Customized RStudio images

What if the default RStudio image is not good enough for you because you insist that your package needs to be pre-installed. For example, your package has many dependencies, like DOQTL, that needs long time to be downloaded (,, ...).

You can still use analogsea to run your Digital Ocean machines but in advance you need to prepare your own customized docker image. First create an account on and get yourself introduce to Dockerfile syntax. Then link your Docker account to your Github as described here.

I has been afraid of that because my knowledge of docker is somehow limited. It was actually far easier than I expected: See a dockerfile for RStudio with DOQTL pre-installed.

Also, see Dockerfile of  rocker/hadleyverse image with Hadley Wickham's packages preinstalled to get more inspiration.

Start virtual machine, pull and run customized RStudio image, email credintials

Finally, suppose you created your customized docker image (like simecek/doqtl-docker). For each participant of your course, you want to start a virtual machine, pull this image, run it and email IP (and credentials) to the participant.

The code below is doing just that. There are several ways to send emails in R and this program utilizes sendmailR package. I split the code into several for-loops, so if something goes wrong there is a better chance to catch it.

Final thoughts and links

Docker can be installed to many systems (including Windows). So, the course participants should be able to use your customized RStudio image even after the end of the course on their own servers or laptops. Another advantage is a fully reproducible code - R syntax will change, packages will come and go but your docker image will be functional as long as some docker client will exist.

If you feel overwhelmed and all you need is to run RStudio in the cloud (no analogsea, no customization), I would recommend RStudio in the cloud for dummies, 2014/2015 edition as a good start and a tool sufficient for the most of practical applications.

Of course, teaching R course is not the only reason to run R on Digital Ocean. I started my first droplet 3 month ago to host my Shiny apps. Digital Ocean gives you for $5-$10/month similar functionality as for $99/month. If interested, see How to get your very own RStudio Server and Shiny Server with DigitalOcean.

UPDATE 10/14/2015: We used Docker / Digital Ocean for teaching Short Course on Systems Genetics. All data, scripts (incl. data transfer, printing DO machine table and emailing participants) and 3 docker images are available at

UPDATE 9/9/2015: We used Docker / Digital Ocean for a tutorial of kallisto, DOQTL, Deseq2. All materials available at 

Wednesday, February 11, 2015

Make your R plots interactive


As a part of my daily job, I draw scatterplots, lots of them. And because there are thousands of genes expressed in any mouse or human tissue, my typical plot looks something like this (code). (Actually, it is a comparison of variance that can be attributed to "sex" factor in mRNA vs. protein expression.)

The question is - what is a gene in the top right corner? And what is the one next to him? And this one? And that?

What we need is a clickable scatterplot. Something I can point with a mouse and get a corresponding gene name. Then click and my browser would open a window with the gene's details (for example in Ensemble). And thanks to ggvis package, it is now possible. Try the following code:

In the case above there must be R running somewhere behind, translating points to gene symbols and Ensemble links. How to share this figure with somebody who does not have R on his or her machine? One possibility is to run it on or your private shiny server.

The second way to add interactive annotation is iplot function in the new Karl Broman's interactive testjs package. It produces Javascipt widget that can be easily saved and share as html page (code).

And with a little help of shiny, we can do even more: Mike Love has a nice example (mtcars_demo in his Github repository), see the widget below. Click on a dot in the left panel makes the corresponding dot in the right panel highlighted (the trick is clickId option in  shiny::plotOutput).

Mike's approach is similar to identify function in basic graphics. The output is a position on an image and you need to find the nearest point afterwards.

I tried something similar with ggvis's SVG objects, so you can set up "on_click" or "on_hoover" events directly (see mtcars_demo here). Click either on left or right panel and the corresponding dot in the opposite panel gets highlighted.

For more complex visualisations, D3.js is a must. However, if you just need to add a bit of interactivity to your R plots, thanks to ggvis, shiny, htmlwidgets, ... it is now possible.

PS. Matt Sundquist sent me the first graph made by plotly that is very cool. More about that maybe later. The trick to edit "on_hover" text is described here.

Variability Explained by Sex

Wednesday, November 5, 2014

Tweeting at #IMGC14 conference

The 28th annual International Mammalian Genome Conference was held over the last week in Bar Harbor, MA. For the first time, the official conference hashtag #IMGC14 was introduced. Twitter shares plummeted 9% next day. Pure coincide? I do not think so!

Totally, 79 participants contributed 1546 tweets. Guess who was the Twitter evangelist?

The distribution of tweets in time reveals when the lobster was served as a conference dinner.

The most retweeted status was Thomas Keane's announcement about the new strains in Mouse Genomes Project:

The most favorited status was Kärt Tomberg's typo:

And finally, this is the #IMGC14 wordcloud. It was a great time. Thank you all! By the way, my official Twitter account is "simecek42".

Data were downloaded from Twitter as described here. For figures, use the following R code:

Tuesday, May 13, 2014

RStudio: Pushing to Github with ssh-authentication

If RStudio prompts you for a username and password every time you try to push your project to Github, open the shell (Git menu: More/Shel...) and do the following:

1) Set username and email (if you did not do that before)
git config --global "your_username"
git config --global ""

2) Create SSH key
ssh-keygen -t rsa -C "" 

In RStudio, go to menu Tools / Global options / Git SVN / View public key and copy the key to your Github account setting (Edit profile / SSH keys / Add SSH key).

To check that ssh-authentication works, try to run
ssh -T

and you should get something like

Hi your_username! You've successfully authenticated, but GitHub does not provide shell access. 

3) Change remote.origin.url from HTTPS to HTTP 

It might be Windows specific, but after 1)+2) RStudio still asks me for user name and password. After a long Google search, I have found a solution here and that is
git config remote.origin.url

Hip, Hip, Hurrah!

If it was trivial for you, I do apologize. I am still very bad in guessing what could be useful for somebody and what not so much. That is why I have this blog and Github account in the first place.

One example, last year I published a paper in JSPI journal that improves a test for interaction in some very specific 2-way ANOVA situation (just one observation per group). The paper submission was an odyssey, mostly because of me. In one moment I doubted whether to retract the paper or not and I even did not upload the package to CRAN at first, just put it on Github.

Then I discovered that some guys found it and had built their package using it. They presented the results at UseR! 2013 conference. I might have met one of those biologists but I am sure I never mentioned my package to them. Finally, - and this is a bit embarrassing - I received an email from Fernando Tusell that I misspelled his name in one of my functions.

In summary, even if you see your work as non-essential from your perceptive, the others may have different view. Just do your best and share your results. Github is a perfect place for this.

Wednesday, January 1, 2014

Shiny Year 2014 (source on Github)

"Pour Felicitér" is a French phrase used by Czech and Slovak to wish a happy new year, but not in French speaking countries. It is dating back to the beginning of 19. century when French was popular in Czech/Austrian city population similar way as English is today.

I originally wanted to make it a snow flake to share a plenty of snow we have in Maine. But then I discovered this Xmas R post and made it Shiny.

Let your 2014 be shiny as well!

Monday, September 23, 2013

The Kaufman Decimals

In his brilliant post (read it!), Ben Orlin introduced Kaufman decimals as follows, if

0.(4) = 0.444444444444444444444444444444444444444....,

is a number where the fourths go on forever, then

0.(4)1 = 0.44444444444444444444444444444444444444....1,

is a number where the fourths go on forever and, afterwards, there’s a one.

Imagine writing the number 0.(4)1 into a grid. You fill the first line with fourths and then write "1" into the second line like this

Obviously, 0.(4) equals 0.(44) or 0.(444) (=one line of 4s) but not 0.(4)(4) (=two lines of 4s). For complex decimals with repetitions of repetitions, like 0.(1(2(3)))((4))56, you need N-dimensional grid, but the principle is analogous.

Ben asked the question whether Kaufman decimals can be totally ordered. Sure - just look for the first digit that differs. To do this precisely, one needs ordinal indices - Mariano Chouza provides a proof on his blog. It has been a long time since I have seen a set-theory stuff like this. I more or less guess than really understand it but the main idea is easy to comprehend.

So, how to compare Kaufman decimals on a computer? Jeff Kaufman upload some Python code to Github that is not really working (0.(81)>0.89 and other issues). I cloned his project and here is my own attempt:
  1. The first difference matters. So I implemented "split" function that returns the first omega^k digits (k=0 one digit, k=1 a line, k=2 plane, ...) and the rest of the number
  2. I start from the beginning, a comparison of 0.(4715) and 0.471548 goes like this:
    • 0.(4715) = 0.4(7154) has the same first digit as 0.471548, cut it off from both
    • 0.(7154) = 0.7(1547)  has the same first digit as 0.71548, cut it off
    • 0.(1547) = 0.1(5471)  has the same first digit as 0.1548, cut it off
    • 0.(5471) = 0.5(4715)  has the same first digit as 0.548, cut it off
    • 0.(4715) = 0.4(7154)  has the same first digit as 0.48, cut it off
    • 0.(7154) = 0.7(7154)  has the same first digit lower in to 0.8, so 0.(4715) < 0.471548
  3. The situation might be simply more complicated inside the repetition, say 0.47(4747)8 and 0.(474747)8, then it goes like this:
    • 0.(474747)8 = 0.4(747474)8 has the same first digit as 0.47(4747)8, cut it off from both
    • 0.(747474)8 0.7(474747)8  has the same first digit as 0.7(4747)8cut it off
    • 0.(474747) and 0.(4747) are both one line numbers (same order of infinity), compare them
      • the first digits equal, cut it off
      • the second digits equal, cut it off
      • hey, I was in this comparison before, so 0.(474747) = 0.(4747)
    • 0.8 = 0.8, hence 0.47(4747)8 = 0.(474747)8
Most likely, it is not good for anything practical but it brought me back to the good old high school years (and first years at the university), i.e. my algebraic era. And now, back to statistics...

Wednesday, April 24, 2013

Facebook brainstorming


This blog has been sleeping for a long time. I moved from Prague (Czech Rep.) to Bar Harbor (Maine, US) and spent the last month doing paper work and settling down.

I am glad to note that Mining Facebook Data: Most "Liked" Status and Friendship Network was the 56th most read post on R-bloggers last year and its coverage on Revolution Analytics' blog Visualize your Facebook friends network with R made it to Top 10 most popular 2012 posts. Hope, one day my research papers will receive the same level of publicity.

Meanwhile, I think about other low-hanging fruits:
  • Quantify who "likes" your posts most, use it as a distance in the Friendship Network
  • Based on TED Talks that you liked in past, predict whether you will like next one (instead of TED Talks, one can look for movies or anything similar)
  • Automatic birthday wish posting
Any other idea? I will implement one or two of those and post here the codes.

Wednesday, August 8, 2012

Get a path to your Dropbox folder

I am currently designing my RStudio - Dropbox - Mardown/Knitter/Wordpress - Github workflow. One problem is that working on multiple machines with different version of Windows means I somehow need to tell R where my Dropbox folder is located.

I used to set the working directory at the beginning of my R scripts but it became tedious to change the path all the time. The solution might be to add definition of 'dropbox.folder.path' variable to .Rprofile files. Or there is a hard way - to write a script detecting Dropbox location automatically.

I have found this hint and created the script below. It is for Windows only (because I do not have Dropbox on the cluster). However, it should be easy to modify it for linux / MacOS if needed (see this shell script).

Tuesday, May 29, 2012

Mystery of mysteries

And now for something completely different to Facebook mining. Our paper "DISSECTING THE GENETIC ARCHITECTURE OF F1 HYBRID STERILITY IN HOUSE MICE" has just appeared at Evolution Journal. Let me give a brief explanation - and keep in mind that my major is math, not genetics.

From statistical point of view it is just an application of Karl Broman's R/qtl package. From biological point of view it is "mystery of mysteries", the term used by Darwin to refer to the mechanism by which two groups of animals become genetically incompatible.

A house mouse is a nice model of that phenomenon because its subspecies diverged relatively recently and you can still force them to breed if you want to. However, as described in the paper, male offspring of certain combination of parents are then unable to reproduce. Our long-term goal is to describe what is behind this sterility.

Two "ingredients" are already known: Prdm9 gene on Chr17 and "something" hidden in 4.5Mb region on ChrX. If any these two are not present in the required combination of alleles, then the mouse is fertile (or at least has normal testes weight and sperm count). However, as we know, Chr17 and ChrX are not the full story. There is some "secret ingredient(s)" needed to reach the full sterility and we have been unsuccessful in mapping it. Maybe there are just too many genes involved, i.e. our tests are underpowered.

The second possible explanation (submitted to CTC Meeting in Paris) is that the "secret ingredient(s)" might not to be a gene or anything you can assign to specific genomic location. Prdm9 (Chr17) is famous for playing role in a genetic recombination, shuffling of genetic information during sperm/eggs production. And recently, we verified that 4.5Mb ChrX region is also taking part in this process. Specifically, Prdm9 determines the location of recombination events and ChrX region influences the recombination frequency. So, is the "secret ingredient" mobile elements, repetitive sequences, heterochromatin or Bigfoot? One day we hope to know.

Selected papers: