Tuesday, May 13, 2014

RStudio: Pushing to Github with ssh-authentication

If RStudio prompts you for a username and password every time you try to push your project to Github, open the shell (Git menu: More / Shell...) and do the following:

1) Set username and email (if you did not do that before)
git config --global user.name "your_username"
git config --global user.email "your_email@example.com"

2) Create SSH key
ssh-keygen -t rsa -C "your_email@example.com" 

In RStudio, go to the menu Tools / Global Options / Git/SVN / View public key and copy the key into your Github account settings (Edit profile / SSH keys / Add SSH key).

To check that ssh-authentication works, try to run
ssh -T git@github.com

and you should get something like

Hi your_username! You've successfully authenticated, but GitHub does not provide shell access. 

3) Change remote.origin.url from HTTPS to SSH

It might be Windows-specific, but even after steps 1) and 2), RStudio still asked me for a username and password. After a long Google search, I found a solution here, and that is
git config remote.origin.url git@github.com:your_username/your_project.git

Hip, Hip, Hurrah!



If this was trivial for you, I apologize. I am still very bad at guessing what might be useful to somebody and what not so much. That is why I have this blog and a Github account in the first place.

One example: last year I published a paper in the JSPI journal that improves a test for interaction in a very specific two-way ANOVA situation (just one observation per group). The paper's submission was an odyssey, mostly because of me. At one point I doubted whether to withdraw the paper, and at first I did not even upload the package to CRAN, just put it on Github.

Then I discovered that some people had found it and built their own package on top of it. They presented the results at the UseR! 2013 conference. I might have met one of those biologists, but I am sure I never mentioned my package to them. Finally - and this is a bit embarrassing - I received an email from Fernando Tusell pointing out that I had misspelled his name in one of my functions.

In summary, even if you see your work as non-essential from your perspective, others may have a different view. Just do your best and share your results. Github is a perfect place for this.


Wednesday, January 1, 2014

Shiny Year 2014

http://simecek.shinyapps.io/pf2014en/
(backup http://glimmer.rstudio.com/simecek/pf2014en/, source on Github)

"Pour Féliciter" is a French phrase used by Czechs and Slovaks to wish a happy new year, though not in French-speaking countries. It dates back to the beginning of the 19th century, when French was popular among the Czech/Austrian urban population in much the same way English is today.

I originally wanted to make it a snowflake, to share the plenty of snow we have in Maine. But then I discovered this Xmas R post and made it Shiny instead.

Let your 2014 be shiny as well!


Monday, September 23, 2013

The Kaufman Decimals

In his brilliant post (read it!), Ben Orlin introduced Kaufman decimals as follows: if

0.(4) = 0.444444444444444444444444444444444444444....,

is a number where the fours go on forever, then

0.(4)1 = 0.44444444444444444444444444444444444444....1,

is a number where the fours go on forever and, afterwards, there's a one.

Imagine writing the number 0.(4)1 into a grid. You fill the first line with fours and then write "1" on the second line, like this


Obviously, 0.(4) equals 0.(44) or 0.(444) (= one line of 4s) but not 0.(4)(4) (= two lines of 4s). For complex decimals with repetitions of repetitions, like 0.(1(2(3)))((4))56, you need an N-dimensional grid, but the principle is analogous.

Ben asked whether Kaufman decimals can be totally ordered. Sure - just look for the first digit that differs. To do this precisely, one needs ordinal indices - Mariano Chouza provides a proof on his blog. It has been a long time since I last saw set-theory stuff like this. I follow it more by guessing than by real understanding, but the main idea is easy to comprehend.

So, how to compare Kaufman decimals on a computer? Jeff Kaufman uploaded some Python code to Github that does not really work (it claims 0.(81) > 0.89, among other issues). I cloned his project and here is my own attempt:
  1. The first difference matters. So I implemented a "split" function that returns the first omega^k digits (k=0: one digit, k=1: a line, k=2: a plane, ...) and the rest of the number.
  2. I start from the beginning; a comparison of 0.(4715) and 0.471548 goes like this:
    • 0.(4715) = 0.4(7154) has the same first digit as 0.471548, cut it off from both
    • 0.(7154) = 0.7(1547)  has the same first digit as 0.71548, cut it off
    • 0.(1547) = 0.1(5471)  has the same first digit as 0.1548, cut it off
    • 0.(5471) = 0.5(4715)  has the same first digit as 0.548, cut it off
    • 0.(4715) = 0.4(7154)  has the same first digit as 0.48, cut it off
    • 0.(7154) = 0.7(1547) has a first digit (7) lower than the 8 in 0.8, so 0.(4715) < 0.471548
  3. The situation gets more complicated inside a repetition; say 0.47(4747)8 and 0.(474747)8. Then it goes like this:
    • 0.(474747)8 = 0.4(747474)8 has the same first digit as 0.47(4747)8, cut it off from both
    • 0.(747474)8 = 0.7(474747)8 has the same first digit as 0.7(4747)8, cut it off
    • 0.(474747) and 0.(4747) are both one line numbers (same order of infinity), compare them
      • the first digits are equal, cut them off
      • the second digits are equal, cut them off
      • hey, I was in this comparison before, so 0.(474747) = 0.(4747)
    • 0.8 = 0.8, hence 0.47(4747)8 = 0.(474747)8
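The digit-by-digit loop in step 2 can be sketched in R. This is a minimal sketch of my own (not the actual repository code) for the special case of one purely periodic number against one finite number; rotating the block corresponds to the rewriting 0.(4715) = 0.4(7154):

```r
# Compare a purely periodic Kaufman decimal 0.(block) against a finite
# string of digits, exactly as in the trace above.
# Returns -1L if 0.(block) is smaller, 1L if it is larger.
compare_periodic_finite <- function(block, finite) {
  b <- strsplit(block, "")[[1]]
  f <- strsplit(finite, "")[[1]]
  for (d in f) {
    if (b[1] != d) return(if (b[1] < d) -1L else 1L)
    b <- c(b[-1], b[1])   # 0.(4715) = 0.4(7154): rotate the block
  }
  # the finite digits ran out but the periodic number keeps going,
  # so (as a Kaufman string) the periodic number is larger
  1L
}

compare_periodic_finite("4715", "471548")   # -1L, i.e. 0.(4715) < 0.471548
```

The general case (repetitions inside repetitions, as in step 3) additionally needs the memoization of already-visited comparison states shown in the trace.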
Most likely, it is not good for anything practical, but it brought me back to the good old high school years (and the first years at university), i.e. my algebraic era. And now, back to statistics...

Wednesday, April 24, 2013

Facebook brainstorming

THIS POST IS OBSOLETE, SEE RFACEBOOK PACKAGE, MANY THANKS TO ITS DEVELOPERS

This blog has been sleeping for a long time. I moved from Prague (Czech Rep.) to Bar Harbor (Maine, US) and spent the last month doing paperwork and settling down.

I am glad to note that Mining Facebook Data: Most "Liked" Status and Friendship Network was the 56th most read post on R-bloggers last year, and its coverage on Revolution Analytics' blog, Visualize your Facebook friends network with R, made it into the Top 10 most popular 2012 posts. I hope one day my research papers will receive the same level of publicity.

Meanwhile, I am thinking about other low-hanging fruit:
  • Quantify who "likes" your posts most and use it as a distance in the Friendship Network
  • Based on the TED Talks you liked in the past, predict whether you will like the next one (instead of TED Talks, one could use movies or anything similar)
  • Automatic birthday wish posting
Any other ideas? I will implement one or two of those and post the code here.


Wednesday, August 8, 2012

Get a path to your Dropbox folder

I am currently designing my RStudio - Dropbox - Markdown/knitr/Wordpress - Github workflow. One problem is that working on multiple machines with different versions of Windows means I somehow need to tell R where my Dropbox folder is located.

I used to set the working directory at the beginning of my R scripts, but it became tedious to change the path all the time. The solution might be to add a definition of a 'dropbox.folder.path' variable to the .Rprofile files. Or there is the hard way - writing a script that detects the Dropbox location automatically.

I have found this hint and created the script below. It is for Windows only (because I do not have Dropbox on the cluster). However, it should be easy to modify for Linux / MacOS if needed (see this shell script).
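For illustration, here is a sketch of the detection idea in R. It targets Dropbox's info.json metadata file and uses the jsonlite package, so the file locations and the field name are assumptions rather than guarantees:

```r
# Sketch: locate the Dropbox folder on Windows by reading the info.json
# file that the Dropbox client keeps under APPDATA / LOCALAPPDATA.
# The file locations and the "personal$path" field are assumptions.
get_dropbox_path <- function() {
  candidates <- file.path(Sys.getenv(c("APPDATA", "LOCALAPPDATA")),
                          "Dropbox", "info.json")
  info <- candidates[file.exists(candidates)]
  if (length(info) == 0) stop("Dropbox info.json not found")
  jsonlite::fromJSON(info[1])$personal$path
}
```

With something like this in .Rprofile, a single setwd(get_dropbox_path()) at the top of a script replaces the hard-coded paths.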


Tuesday, May 29, 2012

Mystery of mysteries

And now for something completely different from Facebook mining. Our paper "Dissecting the genetic architecture of F1 hybrid sterility in house mice" has just appeared in the journal Evolution. Let me give a brief explanation - and keep in mind that my major is math, not genetics.

From a statistical point of view, it is just an application of Karl Broman's R/qtl package. From a biological point of view, it is the "mystery of mysteries", the term Darwin used for the mechanism by which two groups of animals become genetically incompatible.

The house mouse is a nice model of that phenomenon because its subspecies diverged relatively recently and you can still force them to breed if you want to. However, as described in the paper, male offspring of certain combinations of parents are unable to reproduce. Our long-term goal is to describe what is behind this sterility.

Two "ingredients" are already known: the Prdm9 gene on Chr17 and "something" hidden in a 4.5Mb region on ChrX. If either of these two is not present in the required combination of alleles, then the mouse is fertile (or at least has normal testes weight and sperm count). However, as we know, Chr17 and ChrX are not the full story. Some "secret ingredient(s)" is needed to reach full sterility, and we have been unsuccessful in mapping it. Maybe there are just too many genes involved, i.e. our tests are underpowered.

The second possible explanation (submitted to the CTC Meeting in Paris) is that the "secret ingredient(s)" might not be a gene or anything you can assign to a specific genomic location. Prdm9 (Chr17) is famous for playing a role in genetic recombination, the shuffling of genetic information during sperm/egg production. And recently, we verified that the 4.5Mb ChrX region also takes part in this process. Specifically, Prdm9 determines the location of recombination events and the ChrX region influences the recombination frequency. So, is the "secret ingredient" mobile elements, repetitive sequences, heterochromatin, or Bigfoot? One day we hope to know.


Wednesday, January 18, 2012

Mining Facebook Data: Most "Liked" Status and Friendship Network


UPDATE 05/2014: The text is now obsolete. Use Rfacebook package instead, see examples here and there.

Professional R Enthusiast published a quick manual on how to use the Facebook Graph API. I particularly like the trick of obtaining an access token using the Graph API Explorer.

Now you can easily employ R to get your most "liked" Facebook status ever. For me it was this photo, followed by a lot of posts about my kids. The same code can be applied to a Facebook Group or Page. For example, the most popular videos that appeared on the TED Page last year were the following:
See the code, it is not so long.
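For readers arriving via the 2014 update: with the Rfacebook package the "most liked" query shrinks to a few lines. A sketch, assuming `token` holds your own Graph API access token:

```r
# Sketch using the Rfacebook package mentioned in the update above;
# "token" must be replaced by your own Graph API access token.
library(Rfacebook)
ted <- getPage("TED", token, n = 500)   # recent posts of the TED Page
# the single most "liked" post
ted[which.max(ted$likes_count), c("message", "likes_count")]
```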

Now let us try something more sophisticated. Before Xmas, a lot of my friends tested the MyFnetwork app to visualize their friendship networks (see my network below). Surely, this is not the first app doing this, but it might be the first one that is really useful. I can see groups of my friends separated (colleagues vs. friends of my wife vs. high school classmates vs. university classmates). Highlighting tries to emphasize the key persons in each group, but unfortunately it misses an adjustment for the total number of friends (Facebook enthusiasts like PetrC or LenkaZ seem more special than they really are).

Original myFnetwork graph

So how difficult would it be to produce a similar graph with R? Actually, as you can see, it is just a few lines of code. First I scraped the list of friends, then for each of them I got the list of mutual friends, and finally the Rgraphviz package does the plotting.
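A minimal sketch of that plotting step (graph and Rgraphviz come from Bioconductor; the friend names and mutual-friend pairs below are made-up placeholders for the scraped data):

```r
library(graph)      # Bioconductor package providing graphNEL
library(Rgraphviz)  # Bioconductor interface to Graphviz layouts
# placeholders standing in for the scraped friend list and the
# pairs of mutual friends
friends <- c("AB", "CD", "EF")
mutual  <- list(c("AB", "CD"), c("CD", "EF"))
# build an undirected graph and add one edge per mutual-friend pair
g <- new("graphNEL", nodes = friends, edgemode = "undirected")
for (pair in mutual) g <- addEdge(pair[1], pair[2], g)
plot(g, "neato")   # Graphviz's spring layout, similar to myFnetwork's look
```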

R/Graphviz plot with initials


As you can see, the graphs are pretty similar (most likely, MyFnetwork also uses some port of the Graphviz code). Of course, there is an endless list of possible modifications. For example, you can first download your friends' profile pictures and then use a custom node-plotting function to produce something like the following:
R/Graphviz plot with profile pictures
Now you can guess who my wife is and who the problematic friend from the previous post is :-) Anyway, myFnetwork claims to have gained over 1.3 million users in 6 weeks. How difficult could it be to make an R Facebook app?


Romain Francois: Crawling facebook with R

Update: I am getting comments about installation problems with the RCurl and Rgraphviz packages. Honestly, I am not the administrator of my Ubuntu Linux server and I have only limited knowledge of the possible issues. RCurl seems to be OK even on my Win32 machine - read the FAQ. Rgraphviz is a bit trickier: see How to install it under Windows, but I would recommend a decent Linux distribution for this work.

Wednesday, January 11, 2012

Toying with Google Apps Script

Google offers access to its services with Apps Script (JavaScript). That gives you the possibility to connect your spreadsheet to a fascinating variety of tools like a geocoder, stock info, a language translator, or email.

My JavaScript abilities are rather limited, but just by playing with the tutorial examples I was quickly able to produce a script analyzing the time distribution of received emails. It looks through your Gmail for a given contact and records the times of the emails sent by that contact.

So this is the email behaviour of my wife. You can see a peak at noon, when the babies are having a nap, and another local maximum at the end of the day, when they are finally asleep in bed.

The second example is my friend, a very, very bad email responder (the chance of a reply is ~50%). As you can read from the graph, the most essential emails are answered when the day starts, then a few after lunch... and finally the evening responses (usually short and strict).



To try the script yourselves, open a new spreadsheet in GoogleDocs, select Tools -> Script Editor and copy the code below.


Run it, return to the spreadsheet, and enter the email address of the contact of interest. The process takes a minute or two; be patient. You may want to add the command MailApp.sendEmail("your.name@yourmail.com", "Finished", "Google Apps Script"); before the last "}" to be notified by mail when the job is finished.

After this, the first column is filled with timestamp information and we need a way to visualize it. Google Apps Script has a charting module, but I found it more convenient to use R/ggplot2. A few lines of code are given below (before that, the Google spreadsheet was published to the web and the link was copied in; alternatively, you could use RGoogleDocs).
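The R side might look roughly like this (a sketch; the URL is a placeholder for your own published-to-web CSV link, and the single-column timestamp layout is an assumption):

```r
library(ggplot2)
# placeholder URL: replace with your own spreadsheet's published CSV link
url <- "https://docs.google.com/spreadsheet/pub?key=YOUR_KEY&output=csv"
emails <- read.csv(url, header = FALSE, col.names = "timestamp")
# extract the hour of day (assumes ISO-like "YYYY-MM-DD HH:MM:SS" stamps)
emails$hour <- as.POSIXlt(emails$timestamp)$hour
ggplot(emails, aes(x = hour)) +
  geom_histogram(binwidth = 1) +
  xlab("hour of day") + ylab("number of emails")
```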


Gmail blog: Gmail Snooze with Apps Script

Wednesday, December 7, 2011

UseR! 2011 slides are now available

I have just realized that UseR! 2011 presentation slides are now available from the conference web site.

Unfortunately, no big surprises this year. Or maybe this is good news, as it means that I have all the important stuff in my RSS reader. And by the way, this blog is now listed on www.r-bloggers.com.

To be fair, there were a couple of interesting talks (which reached my attention before the slides were published), namely

Paul Murrell: Vector Image Processing
Paul converts static PDF images into dynamic SVG graphics. This is not so much about R, but it is really cool.

Markus Gesmann: Using the Google Visualisation API with R
An easy way to put dynamic images on your web page. Particularly useful if you work with GPS tracking data. Verified :-)

Susan Ranney: It's a Boy! An Analysis of Tens of Millions of Birth Records Using R
RevoScaleR functionality demonstrated on a 3.1GB dataset of birth records.

Ulrike Grömping: Design of Experiments in R
This is a bit personal. I have very little (read "no") experience with the design of experiments (DoE). Back in 2008 I met my wife's supervisor, who was an author of an old commercial DoE software package. We both realized that R missed a lot of DoE methods. The problem was my lack of theoretical knowledge in this area and his style of work. I gave up, but I am happy that they finally finished the job without me - the book was published and the package is available on CRAN. Looking into the DoE CRAN Task View - we were not alone.


Wednesday, November 23, 2011

Prague Half Marathon Ranking: 2% or 25% missing?


I am a regular participant of the Prague International Half Marathon. In a mass event like this, the horde of runners needs a long time to reach the starting line. To make the times mutually comparable, the "start time" is measured and afterwards subtracted from the "finish time". Also, the crowd is organized into corridors in such a way that faster runners are ahead of the slower ones.

Sometimes everything goes wrong, and that was the case in 2010. Imagine training for months and then doing your best - just to discover that your start time was not recorded. The organizers apologized but claimed that less than 2% of the participants were affected. Really?

Let us use R to scrape and compare the histograms of 2009, 2010 and 2011 start times to see the truth (the red dashed line at 20 is the approximate capacity of the starting line):
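The shape of that comparison might look like the sketch below; `starts` is simulated stand-in data (the real code scrapes the race results), so treat the columns and numbers as placeholders:

```r
library(ggplot2)
# simulated stand-in for the scraped results: one row per runner,
# with the start time in minutes after the gun and the year of the race
set.seed(1)
starts <- data.frame(start = c(rexp(500, 1/5), rexp(500, 1/6), rexp(500, 1/4)),
                     year  = rep(c(2009, 2010, 2011), each = 500))
ggplot(starts, aes(x = start)) +
  geom_histogram(binwidth = 1) +
  # red dashed line at 20: approximate capacity of the starting line
  geom_hline(yintercept = 20, colour = "red", linetype = "dashed") +
  facet_wrap(~ year) +
  xlab("start time (minutes)") + ylab("number of runners")
```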



See? The peaks in the 2010 data are actually a nice attempt by the organizers to do some statistics and correct for the missing measurements. Based on the starting number, which mirrors both the expected time and the position in the corridors, they estimated a starting time for each corridor. These averages were imputed into the ~25% of observations that were actually missing.

Why is this so wrong? Because the ordering of the runners was not under control. In 2009 and 2010, runners went wherever they wanted, as you can see on the following graphs. Actually, in 2010 the slow runners just behind the Kenyans caused the jam.





Good news at the end? Yes! Even though the organizers denied the truth, they learned a lesson from their mistakes. In 2011, extra care was devoted to time measurement and, as you can see, the ordering into corridors got much better.



Finally, the code for all of the above: