Rohit Singh

Rohit Singh: {Software}

Web-services and Community Resources

Schema: a Python library for the synthesis and integration of heterogeneous single-cell modalities. (http://schema.csail.mit.edu)
D-SCRIPT: a Python library for predicting a physical interaction between two proteins given just their sequences. (http://dscript.csail.mit.edu)
IsoBase: a web-database of functional ortholog predictions, using IsoRank/IsoRankN (http://isobase.csail.mit.edu)
Struct2Net: a web-service for predicting interaction between two proteins, given just sequence data (http://struct2net.csail.mit.edu)
RNAiCut: a web-service for choosing the right cut-offs in the results from RNAi gene-perturbation assays (http://rnaicut.csail.mit.edu)
IsoRank and IsoRank-N: binaries for the IsoRank/IsoRankN programs, along with some test data (http://isorank.csail.mit.edu)

Awk/Shell one-liners

Extracting unique PPIs from BIOGRID .tab files
- cat BIOGRID-ORGANISM-Saccharomyces_cerevisiae-2.0.53.tab.txt | perl -ne 'BEGIN { $x=0;} if (/^\s*INTERACTOR_A\s+INTERACTOR_B/) { $x=1;} { $x && print;}' | awk 'NR>1 { if ($2 <= $1) { a[$1 "," $2]=1;} else { a[$2 "," $1]=1;}} END {for (x in a) { print x;}}'
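For readability, the same idea can be sketched as a single awk program wrapped in a shell function. This is only an illustrative variant, not a replacement for the one-liner above: the function name `unique_pairs` and the sample records are hypothetical, and it assumes (as the one-liner does) that the data rows follow a header line beginning with INTERACTOR_A INTERACTOR_B and that the first two whitespace-separated columns hold the interactor IDs.

```shell
# unique_pairs (hypothetical helper): read a BIOGRID-style table on stdin and
# print each unordered interactor pair once, as "X,Y".
unique_pairs() {
    awk '
        # Skip everything up to and including the column-header line.
        /^[ \t]*INTERACTOR_A[ \t]+INTERACTOR_B/ { in_body = 1; next }
        in_body && NF >= 2 {
            # Put the two IDs in a fixed order so (A,B) and (B,A) share a key.
            key = ($2 <= $1) ? ($1 "," $2) : ($2 "," $1)
            # Print a pair only the first time its key is seen.
            if (!(key in seen)) { seen[key] = 1; print key }
        }'
}

# Made-up sample input (not real BIOGRID records).
printf 'some preamble\nINTERACTOR_A\tINTERACTOR_B\nYAL001C\tYBR123C\nYBR123C\tYAL001C\nYAL001C\tYAL001C\n' | unique_pairs
# prints:
#   YBR123C,YAL001C
#   YAL001C,YAL001C
```

Unlike the one-liner, which prints pairs in awk's arbitrary array-traversal order from the END block, this version prints each pair as soon as it is first seen, so the output follows the input order.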
More to come...

R can be slow to read large CSV files, usually because it has to guess the type of each column. If you specify the column types up front, R reads the file much faster. The code below does this automatically for files that contain only "numeric" and "character" columns: it reads the first 5 rows to infer each column's type, then re-reads the whole file with those types.
- d <- read.csv(fname, as.is=TRUE, nrows=5); colC <- rep("numeric", ncol(d)); for (i in seq_len(ncol(d))) { if (is.character(d[[i]])) { colC[i] <- "character" } }; d <- read.csv(fname, colClasses=colC)