Top 5 bash commands for handling data

I have a confession to make. I don’t really like the command line. Or at least I didn’t like it before. Bash commands used to simply terrify me and I would opt for other tools whenever I could.

After moving to bioinformatics and data analysis though, I started to realise that command line tools and tricks are not only error prone and hard to remember, but also true efficiency maximisers if used for the right tasks.

Here are my Top 5 bash commands for handling data, running a bioinformatics pipeline and generally for making the life of a data scientist better.

1. What does my data look like?  cat, zcat, head, tail, grep, wc

cat, head and tail were probably some of the first commands I learnt when starting with the Unix environment. But the real power comes when you combine them with grep to only look at the data you’re interested in and with wc to count how many data points you have.

Say that you want to quickly check how many lines in your file contain a certain flag. Doing this in bash is beautifully simple and fast:

zcat  data_chrX.gz  | grep "DEBUG" | wc -l

Also, zcat is a real saviour if you work with gzipped files a lot. Explore them like this:

zcat  data_chrX.gz | less

or

zcat  data_chrX.gz | head -n 20

to see just the first 20 lines.
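If the same check ever needs to live inside a Python script rather than on the command line, the standard gzip module can do roughly what the zcat, grep and wc pipeline does. This is only a sketch: the function name count_matching_lines is made up for illustration, and it does a plain substring match, like grep with a fixed string.

```python
import gzip


def count_matching_lines(path, needle):
    """Count the lines of a gzipped text file that contain `needle`."""
    count = 0
    # "rt" decompresses on the fly and yields text lines, like zcat
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if needle in line:  # substring match, like grep "DEBUG"
                count += 1
    return count
```

Calling count_matching_lines("data_chrX.gz", "DEBUG") should then give the same number as the pipeline above. The file is read line by line, so it never has to fit into memory.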

2. Shortcuts for moving the cursor

Typing commands on the command line can get pretty tedious, especially if you use full file paths as input or the script you are launching has many options.

I learnt these simple command line shortcuts only recently and don’t know how I did anything without them. Now I use them all the time.

ctrl-E # move cursor to end of line
ctrl-A # move cursor to beginning of line

3. Screen for working on a remote server

Have you ever ended up in a situation where you ssh to your super powerful remote machine, launch your script, go get coffee while waiting for the results, and then come back to “broken pipe”? This is a common beginner mistake. Don’t worry, just add screen to your data analysis toolbox! With screen you can create as many terminal windows on the remote machine as you like, detach from them and your jobs will still be running.

A shortcut to start a program in a new virtual terminal and detach from it is this:

screen -m -d <your command>
screen -ls # will display all current windows you have running
screen -r  <screenID> # switch back to the detached screen

Then, if you created too many windows by mistake or just want to kill them all:

killall -15 screen

4. Check how many results have completed with ls and wc

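Now that you are running your script you might want to check how many results have already completed. I usually output to a dedicated directory called something like results/experimentID/… Use ls -1 rather than ls -l here: the long listing starts with a “total” line that would inflate the count by one.

ls -1 results/experimentID | wc -l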

But sometimes the results directory has subdirectories with, for example, plots or old results.

To count only the actual files I found this great trick:

find targetdir -maxdepth 1 -type f | wc -l
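When the count is needed from inside a Python pipeline script, pathlib offers a close equivalent. A minimal sketch, with count_files as a made-up helper name; note one difference from find -type f, which is that is_file() also counts symlinks pointing at regular files.

```python
from pathlib import Path


def count_files(directory):
    """Count regular files directly inside `directory`, skipping subdirectories."""
    # iterdir() does not recurse, which mirrors -maxdepth 1
    return sum(1 for entry in Path(directory).iterdir() if entry.is_file())
```

For example, count_files("results/experimentID") would ignore a plots/ subdirectory and everything inside it.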

5. Perlk – Perl perk: Renaming multiple files with a one liner

The rename command used here is actually a Perl script. On Debian-based Linux distributions it is often available out of the box (sometimes under the name prename or file-rename), while other distributions ship a different, non-Perl rename with its own syntax, so check man rename first. With Perl rename you can change the extension of your files or in any other way efficiently rename several files at once. You might want to read up on Perl regular expressions first if you have some advanced renaming in mind.

rename 's/\.txt\.txt\z/.txt/' *.txt.txt

Bonus command line trick for bioinformaticians:

Split one FASTA file into several with awk:

awk '/^>/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' dna.fa

This command splits the file at every “>” character and names the output files after their FASTA headers. There are many tools that can achieve the same thing, but I find this the most elegant and handy so far. Awk is in general very powerful and versatile, although it seems to have a steep learning curve 😉
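For comparison, here is roughly what the awk one-liner does, spelled out in Python. It is a sketch under a couple of assumptions: split_fasta is a hypothetical helper name, and every header is assumed to be usable as a file name (no slashes, for instance).

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir="."):
    """Write each record of a multi-FASTA file to <header>.fa inside out_dir."""
    out_dir = Path(out_dir)
    written = []   # output files in the order they were first created
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Same naming rule as the awk version: everything after ">"
                target = out_dir / (line[1:].rstrip("\n") + ".fa")
                # Append if this header was seen before, otherwise start fresh
                mode = "a" if target in written else "w"
                if target not in written:
                    written.append(target)
                out = open(target, mode)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Lines that appear before the first header are skipped, just as the OUT guard does in awk.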


There it is, happy hacking in the command line!



FANTOM 5 satellites: evolution of human cells, transcription factors and expression breadth.

FANTOM is a large scientific consortium with over 500 members all over the world and its data is based on cap analysis of gene expression (CAGE). CAGE is a pretty advanced and unique next-generation sequencing technique that allows analysing transcriptional start sites across the entire genome with amazing resolution. The FANTOM 5 dataset is also the most comprehensive expression atlas that exists today, including 952 human and 396 mouse tissues, primary cells and cancer cell lines.

I started working with the FANTOM 5 data over two years ago, at the beginning of my PhD.  So I’m very happy that this data is now finally released and out there in the open for everyone to explore, including the results of the work we have been doing on it all this time.

A lot of people (including one of my supervisors, L.H.) spent a lot of sleepless nights trying to polish these results, checking them from all possible angles, then waiting for reviews, emails and finally an acceptance letter.

Our focus was on following the evolution of different types of mammalian cells and tissues, the fate of gene duplicates and what all this new data on promoter architecture can tell us about gene expression evolution. If you are interested in these topics as well, I would humbly recommend the following order of approaching the papers:

1. A promoter-level mammalian expression atlas

First, of course, have a look at the main FANTOM5 paper in Nature. It gives a good overview of what this data set offers and contains some truly stunning visualisations, like this one of promoter expression:

Coexpression clustering of human promoters in FANTOM5. Figure 4 from the manuscript.

2. The Evolution of Human Cells in Terms of Protein Innovation (Open Access)

Next in line is a paper by Gough’s group that is in many ways aligned with our work and results, but they were the first to turn it into a neat story by creating a timeline of cell evolution. Unfortunately, the figure demonstrating that timeline is completely unreadable (what were you thinking, MBE?). But the authors offer a somewhat more detailed figure to explore on their website:

FANTOM 5 makes it possible to look in particular detail at different cell types in the human brain, and this paper demonstrates the very curious process of evolutionary accumulation of novel cell functions that have formed our brains ever since the Fungi/Metazoa divergence. The most interesting observation here, in my opinion, is that brain cells evolve under the same selective pressure as the spleen and thymus, which demonstrates how intertwined the nervous and immune systems are in the light of evolution.

The authors also use an interesting way of looking at the evolutionary profile of different cells.  The following figure shows, for example, the evolutionary profile for T-cells; the bars represent the different protein domain architectures appearing at a certain evolutionary time.

FIG. 3 in the paper.

3. A simple metric of promoter architecture robustly predicts expression breadth of human genes suggesting that most transcription factors are positive regulators.

At last comes the paper by L. Hurst and L. Huminiecki with my name somewhere on the list of authors as well.

I think it now clearly demonstrates an interesting observation that became apparent from the FANTOM5 data. The idea is really simple: the number of transcription factors binding to the promoter of a gene predicts the expression breadth of that gene. This can be seen when one looks at the expression breadth of paralogs in the human genome. Another important conclusion is that the number of TFs defines where the gene is expressed, but not at what level. The HTML version of the paper is still in production, but you can view a provisional PDF over here: (Open Access)

Python pitfalls: comparing objects of different types

Static vs dynamic typing is a constant debate in the programming languages community. There is no right or wrong answer to what is best, it strongly depends on your application and goals.

Dynamically typed languages (like Python) are more flexible and make programming easy and fast. That’s why they are incredibly popular for implementing software with changing or unknown requirements, like, you know, the data analysis software I write.

Sometimes though, dynamic typing turns programming in Python into walking through a minefield and brings back memories of John Hughes’s lectures on Haskell at Chalmers.

CCC #17: Typing

So, let’s say we are comparing numbers. Type this in your Python 2.x interpreter. What will it return?

>>> 5 < 8

You guessed right. It will return True . No surprises here. Same if you type

>>> "a"< "b"

This will also be True.

So, what do you think the Python interpreter will give you if you type this:

>>> 5 < "b"

Well, you are unlikely to ever type it on purpose. But your large data table might have one numerical column and one of type string.
If for some reason your code ends up comparing these two objects to each other, you are going to regret it.

Well, won’t Python throw me an error? Or at least a warning? After all, what sensible answer can one get by comparing strings to numbers? Nope: Python 2 quietly answers True. Apparently, someone decided otherwise.

From the Python 2.7 tutorial:

Note that comparing objects of different types is legal. The outcome is deterministic but arbitrary: the types are ordered by their name. Thus, a list is always smaller than a string, a string is always smaller than a tuple, etc. [1] Mixed numeric types are compared according to their numeric value, so 0 equals 0.0, etc.


 [1] The rules for comparing objects of different types should not be relied upon; they may change in a future version of the language.


My first thought when I read this was: WHAT THE… Guido van Rossum.

Luckily, I’m not the first to arrive at this thought. So in Python 3.x this weird behaviour has been changed so that if you attempt to order an integer and a string, an error will be raised:

>>> 5 < '8'
Traceback (most recent call last):
  File "", line 1, in <module>
    5 < '8'
TypeError: unorderable types: int() < str()

But it is still possible to check for equality of objects of different types. Comparing an integer with a string like this will always return False, but it is absolutely legal.

>>> print(8 == '10')
False

Dynamic typing does give you certain flexibility, but it also gives you the responsibility to keep track of what’s going on with your objects. This is incredibly important in data analysis, so make sure to check data integrity in between the steps of your data pipeline.
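One cheap way to do such an integrity check is to verify the type of every value in a column before anything gets compared or sorted. The helper below is only a sketch; check_column and the lengths column are invented names, not part of any library.

```python
def check_column(values, expected_type):
    """Raise TypeError on the first value that is not an instance of expected_type."""
    for row, value in enumerate(values):
        if not isinstance(value, expected_type):
            raise TypeError(
                "row %d: expected %s, got %s (%r)"
                % (row, expected_type.__name__, type(value).__name__, value)
            )
    return True


lengths = [5, 8, "b"]  # hypothetical column with one stray string in it
try:
    check_column(lengths, int)
except TypeError as error:
    print(error)  # row 2: expected int, got str ('b')
```

The check behaves the same in Python 2 and Python 3, so it catches the silent 5 < "b" case in old code as well as the silent 8 == '10' case that is still legal today.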

Software Carpentry at SciLifeLab, Sweden

This has been my project of the past few months and of course I’m dying to share how it went.

This is how it all started. I got this email from Greg Wilson:

I hope you don’t mind mail out of the blue…[…]

I run a project for the Mozilla Foundation called Software Carpentry, the goal of which is to teach basic computing skills to research scientists.  In the past eighteen months, we’ve run over 100 two-day bootcamps for almost 4000 scientists in a dozen countries; our instructors are volunteers, and overlap with groups like PyLadies, Ladies Learning Code, and others.  We’ve been to Norway, but never to Sweden. Would you be interested in helping us put something together in Stockholm?

If this sounds good, let’s get it into the calendar and away we go!

So away we went… 🙂

I already knew about Software Carpentry and its efforts to make science reproducible and more efficient. And I thought it was amazing. So I became a local host. Greg put me in touch with their administrator and I started putting things together on my end.

I have now organised many events in the tech industry. It always goes like this:

  •  find the date that will fit as many people as possible
  •  find a room
  •  find funding

Now in academia “find funding” is always a tiny bit more challenging. Luckily, to bring Software Carpentry to your institution/conference you actually don’t need that much funding. Software Carpentry is backed by Mozilla Science Foundation. Instructor training, curriculum development, website and administrative costs are covered through donations. So first things first, it’s great if you can arrange a donation towards the central costs. $1500 is a good goal. Now here comes the problem that is slowing down Software Carpentry taking over Europe. It is close to impossible to arrange a donation from a university or from any other grant-based money to another non-profit. How would you justify that? It would actually be easier if Software Carpentry required the money as a course fee. But that would kind of defeat the honourable goal of being a volunteer-based organisation, or would it? What do you think?

One cost that is absolutely required is the instructors’ travel. They’re volunteers, so they don’t need to be paid for their time, but it’s only common courtesy to compensate for plane tickets and a place to stay.

Other costs to think about could be coffee/drinks and some food for participants to mingle over.

So, now, where can we get the money?  One option Software Carpentry suggests in its guide for local hosts is to charge people for attendance. This lucrative option was quickly eliminated. Since 95% of our attendees were going to be PhD students, where do you think the money would come from in the end? Right, PI grants. Will that make it more difficult for everyone in the research group to sign up? Will that limit the number of people who can participate? Quite possibly. Besides, charging money for something in academia is even worse than looking for funds. Which account to use? Who will be in charge of receipts? What to do with refunds? Anyway, just forget about it.

Luckily, I got a tremendous level of support from my peers and senior colleagues. My supervisor, professor Arne Elofsson, and professor Lars Arvestad, representing the Swedish e-Science Research Center (SeRC), thought it would fit nicely with our own local efforts of teaching Python and good programming practices to bioinformaticians at SciLifeLab and KTH.

National bioinformatics communities BILS and WABI supported us also.  In fact, I would like to thank both of the organisations for their financial support.

The next step is to find instructors for your particular bootcamp. The Software Carpentry admin will help you with this! We got amazing ones, I must say. Lots of positive feedback from the attendees and lots to learn for the local teachers.

Meet our heroes:

Lex Nederbragt 

Husband, father of two, biologist, bioinformatician, researcher, Dutchman.

Research Fellow at the University of Oslo.


Karin Lagesen

Bioinformatician, python coder, science fiction conrunner, cat owner, synth music fan. Occasional blogger at

Assistant professor at the University of Oslo.


Konrad Hinsen

Research scientist at the Centre de Biophysique Moléculaire in Orléans (France)

Associated scientist at the Synchrotron SOLEIL in Saint Aubin (France)


Nelle Varoquaux

PhD student at the Center for computational biology, which is part of the INSERM U900/Curie/Mines bioinformatics unit, currently working on inferring the 3D structure of the genome from Hi-C data.

Why were we lucky to get all 4 of them at once, you’d ask?

Well, the Software Carpentry bootcamp at SciLifeLab had two tracks: one classical Software Carpentry workshop aimed at researchers with limited computing experience and another one for more advanced programmers.

In the beginners track we covered the following topics:

  • Unix shell
  • Basics for Python programming
  • Git and GitHub
  • Data exploration and  testing with IPython

All materials are available on github:

Software Carpentry at SciLifeLab, beginners track

The Intermediate+ track was focused on bringing the best practices and tricks of scientific computing to the everyday life of bioinformaticians and anyone involved in computational life sciences. The topics included:

  • Scientific computing in Python, Intro to NumPy/matplotlib
  • Collaborating using version control (git & github)
  • Object oriented programming in Python
  • Program design ( packaging and testing )

All materials are available on github:

Software Carpentry at SciLifeLab, intermediate+ track

All in all, two days of fun, learning and discussions of future collaborations!

45 people from SciLifeLab, KI and KTH attended and completed the training and 5 local volunteers were helping out. Our local team deserves a special mention. I’m hoping that the highly skilled scientific computing community at SciLifeLab will only grow and here are the people at the frontier of it: Måns Magnusson, Jose Beltran, Robin Andeer, Olav Vahtras and Radovan Bast. And yours truly, of course 😉

Thank you all for the amazing experience and for all the efforts of promoting reproducible research and open science in Sweden.

P.S. A little birdie told me that there will soon be an on-site instructor training in Europe, probably in the UK. If you got inspired by the event at SciLifeLab and would like to become an instructor yourself, just get in touch with me or Måns Magnusson and we’ll try to arrange it.

First PyCon Sweden

So, now, after finally getting enough hours of sleep and reflecting a little bit on what’s just happened… PyCon Sweden was pretty amazing, but certainly could have been a lot better.

It was not only my first PyCon, it was also the first conference I helped organise.  So I cut myself some slack.

Python is an extremely popular language in Sweden; perhaps the start-up tech scene that has exploded in recent years is to blame. The Python User Group in Stockholm alone has almost 1000 members. Every meet-up gets fully booked just 5 minutes after it is announced.

So all stars were pointing towards organizing our very own Swedish National Python Convention. Sounds pretty cool, eller hur? A few crazy people who loved Python and conferences thought so too. They registered a non-profit, “Python Sverige”, and started begging for money, I mean, asking companies to sponsor the first PyCon Sweden. A few months later they also asked me & PyLadies Stockholm to help out with volunteers and diversity initiatives.

Here we all are

Fast forward 6 months: 250 people gather in the Q-building at KTH. Volunteers from PyLadies and KTH are running around in white t-shirts that say “PyCon Sweden” and the first keynote by Helena Bengtsson starts in the big lecture hall.

Soon we will realise that the schedule is all wrong, some speakers will start panicking and the catering company will forget to deliver our lunch. But despite all the chaos, I got to meet awesome people, listen to some great speakers and get inspired to create. Which is why we were all doing it in the first place, right?

I didn’t see all the talks, due to running around in one of the white volunteer t-shirts, but some of those I did see are truly worth mentioning.

Jack Parmer with “New scientific plotting in Python” convinced me once and for all to really start using Plotly and its new API.

Per Fagrell with “How to write actually Object-oriented Code in Python” somehow made following the SOLID principles in Python easy. I also remember lots of ducks in your slides ;-D 

And Laurens Van Houten made cryptography sound fascinating and simple.

I would like to thank everyone for attending, volunteering, speaking and providing valuable feedback on how to improve the conference.

Now, after reading all the feedback attendees sent to the board, I would like to mention the areas that need the most improvement:

1. Coffee. And water. We need those. All the time.

2. Breaks. They need to be shorter.

3. Schedule. God damn it, it should not change once it’s announced. It should also be clearer.

4. Speakers. Some did not deliver. We should pick better next time.


It has been lots of work. Organising a big event is challenging and messy, but I’m on the board  of Python Sweden again with the hope and determination to do better next year.

If you would like to help us, please don’t hesitate to drop us a line at

So long and see you next year!

Books that helped me figure out what I really want

I have a long way to go in understanding how to put my skills, character and aspirations to better use. My approach so far has been to try and learn from experience. I’ve tried all sorts of different occupations, from web development and SEO consulting to evolutionary biology.

I’m still searching and constantly reevaluating my goals and investigating new challenges to pursue. In the middle of this struggle, which probably everyone faces at a certain time in their lives (or all the time? :), I’d like to share some words of wisdom that helped me to structure my search for purpose and passion. Maybe you’ll find them inspirational too.


“A lot of people want a shortcut. I find the best shortcut is the long way, which is basically two words: work hard.”

The Last Lecture by Randy Pausch

I have yet to read something more inspiring than this book, where Randy Pausch, professor of Computer Science, shares his advice and stories from his exciting life as an engineer, a scientist, a husband and a Dad. It’s a pleasure to read a success story of a technology geek who followed his childhood dreams and managed to build a great career out of it.  Especially recommended to anyone considering an academic career in computational sciences.


“The fastest way to change yourself is to hang out with people who are already the way you want to be.”

The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career by Reid Hoffman, Ben Casnocha

This book is actually mostly about networking and how to find career opportunities. But for me it was more than just a career guide: it helped me look at my career as a path in a mystical jungle, rather than a straight road to the final goal. Some great advice on how to leverage your talents and develop yourself professionally as a business was a nice bonus. Not only an inspirational, but also a practical “how to” book.

“The concept of deferred gratification, or sacrificing now to save for the future, can be helpful in setting aside money in a retirement account for old age. It can also serve as an effective rationalization for life avoidance.”

The Art of Non-Conformity: Set Your Own Rules, Live the Life You Want, and Change the World by Chris Guillebeau

This is a great book if you are tired of the corporate rat race and need that last push to pursue your dream of becoming a writer, blogger, traveller, entrepreneur or whatever else is on your list. It’s not practical at all and mostly irrelevant if you really feel that a more conventional career path can be just as exciting as travelling around the world.

I’m really for the idea of setting my own rules in life though and Chris’s words of encouragement helped me with it.


 “There is no perfect fit when you’re looking for the next big thing to do. You have to take opportunities and make an opportunity fit for you, rather than the other way around. The ability to learn is the most important quality a leader can have.”

Lean In: Women, Work, and the Will to Lead

by Sheryl Sandberg

Strong female role models are still hard to come by, that’s why I could not miss this book, even though what Sheryl is doing (she is COO of Facebook at the moment) is quite far from anything I aspire to do in the near future. It’s an easy and an engaging read that every career woman would relate to. Some really practical advice on putting yourself out there, fighting the impostor syndrome and speaking up, finding a partner that will treat you as an equal and support you every step of the way.

I don’t like how she is shaming the mothers that choose to leave the workforce or lean out from some opportunities for their children. The solution to this problem isn’t women hiring nannies and going back to work 2 weeks after giving birth, in my opinion, but better social policies for maternity and paternity leave and affordable child care solutions.

Sheryl’s definition of success is just one of a kind and I’m truly happy she found her version of “having it all”. For now I’ll just follow her advice on leaning in until I will really have to reconsider 🙂

Find Your Passion: 25 Questions You Must Ask Yourself by Henri Junttila

Last but not least comes this exercise book for finding your passion. Henri has more books and a blog with lots of materials on the topic. This book is not going to be easy, though, but if you have a couple of free evenings, I’d say go for it. It will help you explore what you really like doing and how to turn it into something that will bring you both money and a fulfilling life.

Weighted random choice using NumPy



If you’ve ever implemented a decent generative model, you’ve probably been in a situation where sampling with the standard random.choice isn’t going to cut it. You want to generate numbers from a certain distribution, or you just want a certain letter from your alphabet to appear in the results a bit more often to simulate the real data better.


Then you probably ended up implementing your own weighted sampler, or you just stumbled upon this post from 2010 that has some heated discussions about this problem in Python and lots of different implementations of this function.
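A hand-rolled weighted sampler of the kind mentioned above typically builds cumulative weights and then looks up a random point in them. The sketch below is one possible version using only the standard library; weighted_choice is an illustrative name and not the code from the post in question.

```python
import bisect
import itertools
import random


def weighted_choice(items, weights, rng=random):
    """Pick one item with probability proportional to its weight."""
    if not items or len(items) != len(weights):
        raise ValueError("items and weights must be non-empty and equally long")
    # Running totals: item i owns the interval [cumulative[i-1], cumulative[i])
    cumulative = list(itertools.accumulate(weights))
    point = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, point)
    # Guard against a floating point result that lands exactly on the total
    return items[min(index, len(items) - 1)]
```

It works, but every project ends up carrying its own copy of these ten lines, which is exactly the itch a library function scratches.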

So did I for my machine learning project last week until the glorious  internet helped me find this (relatively) new function in numpy that just effortlessly and elegantly does exactly what I want.

Thank you, NumPy crew 🙂

This generic sampling function will generate samples from a given array. The samples can be with or without replacement, and with uniform or given non-uniform probabilities.

Using this nice wrapper we can generate a custom random protein sequence, for example. In this code snippet I’m using a Dirichlet distribution with a uniform prior to sample from. But a prior can be estimated from a given set of proteins to give the generator a better idea of what a dataset of our interest should look like.
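A sketch of such a generator could look like this, assuming the function in question is numpy.random.choice (the description above matches it). The names AMINO_ACIDS and random_protein are made up for the example, and the flat alpha vector stands in for the uniform prior; a vector estimated from real proteins could be passed instead.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def random_protein(length, alpha=None, seed=None):
    """Sample a protein sequence with residue frequencies drawn from a Dirichlet."""
    rng = np.random.RandomState(seed)
    if alpha is None:
        # All ones gives a flat Dirichlet, i.e. no residue is favoured a priori
        alpha = np.ones(len(AMINO_ACIDS))
    # One Dirichlet draw is a probability vector over the 20 residues
    probabilities = rng.dirichlet(alpha)
    # Weighted sampling with replacement, one residue per position
    residues = rng.choice(AMINO_ACIDS, size=length, p=probabilities)
    return "".join(residues)
```

Passing a seed makes the output reproducible, which helps when the generated sequences feed into tests further down the pipeline.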

My first year as a PhD student: short summary

In November last year I got accepted into a PhD program in Bioinformatics at the Department of Biochemistry and Biophysics, Stockholm University. Becoming the next Sonja Kovalevsky* had been my long-standing goal ever since…

Ok, I don’t actually remember when the idea to become a researcher got into my head. Perhaps when I was sitting at my first math lecture at the Bauman University or when I learned about Rosetta (a computer program that is predicting 3D structures of proteins using machine learning algorithms) or when I realized that I needed a greater challenge in life than just writing PHP code 9-5.

For some reason I thought less about my ability to actually do science, or about exactly how I would achieve a great life of daily contributions to common knowledge, but someone told me that a PhD degree is what you need first…

So long story short, here I am, finishing my first year of a PhD program.

Some numbers:

Times I changed my main supervisor: 1
Times I considered quitting: 100?
Times I thought it’s the most awesome job ever: 100?
Times I rejected a job in the industry with a much higher salary: 5
Papers published: 1
Papers in review: 1
Credits earned: 18.8 (~4 months of studying full time)
Programming languages learned: 1 (R)
Programming languages improved: 1 (Python)
Papers I read about my research topic(s): ~100-200


Lessons learned or  “I wish a year ago I knew that…”

  • Most bioinformaticians can’t program and don’t even know it
  • No one cares about well written code in academia
  • Software is not the center of my life anymore. Results are.
  • I will be working alone 99% of the time
  • I’ll feel stupid and useless 99% of the time
  • A research project is 60% trying to figure out what to do and how, 10% actually doing it, 30% trying to interpret your results
  • Academia is really reluctant to change, even in progressive Sweden
  • Many PhD students are very unhappy human beings (been one of them for a while)
  • After graduation ~80% of PhDs don’t find their experience related to the jobs outside academia
  • The most important choice you make isn’t even your research topic, it’s the supervisor
  • Ok, picking your research topic is important as well. Make sure you believe in the direction you are going and enjoy working with it.
  • It’s important to learn to let go: let go of the useless results you produce, of the code you’ve written for 3 months to obtain them and of countless meaningless things you need to do just to get through the PhD studies.
  • It’s really important to stay healthy: exercise and eat well. A stressed and depressed brain (at least mine) is not productive at all.
  • Effective time management is the key: prioritize and do one thing at a time.
  • Learn to write well, starting now.

The above being said, it has been one of the most interesting and challenging experiences of my life and when I actually manage to keep the big picture in mind, I feel like I’m on the right track. Whatever the next step might be.


*  Did you know that Sonja (or, in Russian, Sofia Kovalevskaya) became a professor at SU because in Russia she wasn’t allowed to work as a lecturer, being a woman and all?

Hierarchical clustering in Python

If, like me, you are looking for a way to produce a heat map similar to the one in R but in Python, then this post is for you.

My particular goal here was to cluster samples and genes in a gene expression data table, but there might be tons of applications where it’s useful.

Turns out, Python offers several ways to approach this problem. I picked out three of them: the quick and easy numpy+scipy combination, the fastcluster module (the name speaks for itself) and BioPython, which provides superior visualization possibilities.

Unfortunately, there does not seem to be a way to easily plot a heat map along with the dendrograms on one plot, hence the issue I created on the matplotlib GitHub page.

Follow the instructions I created about these packages for hierarchical clustering in Python: IPython notebook or just download the source code from the github repo.
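To give a flavour of the numpy+scipy route, the sketch below clusters the rows and the columns of a table separately and reorders the table so that similar rows and columns end up next to each other, which is the core of a clustered heat map. It is not taken from the notebook; the function names are invented, and the choice of average linkage with Euclidean distance is just an assumption for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def cluster_order(matrix, method="average", metric="euclidean"):
    """Return the row indices of `matrix` in dendrogram leaf order."""
    tree = linkage(matrix, method=method, metric=metric)
    return leaves_list(tree)


def reorder_for_heatmap(expression):
    """Reorder a genes x samples table so that similar rows and columns are adjacent."""
    expression = np.asarray(expression, dtype=float)
    row_order = cluster_order(expression)    # cluster genes (rows)
    col_order = cluster_order(expression.T)  # cluster samples (columns)
    return expression[row_order][:, col_order], row_order, col_order
```

The reordered array can then be handed to a plotting call such as matplotlib’s imshow, with the dendrograms drawn separately if needed.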

Try something new today

It’s vacation time and that means lots of reading, reflecting on your life and personality and dreaming big.

For me it also means diving into some new experiences, like biking through Berlin on a single speed bike for hours.

All this, together with a fantastic book I’m reading now, led me to an idea for a new series of posts on my blog.  You all  know about inspirational boards, right?

Lots of pretty motivational pictures glued to a piece of wood with the purpose of making you feel fearless and powerful by looking at it.

Inspiration it is. I’ve just decided to go for a digital version of it. Created with the help of Picasa and Pinterest here is this week’s inspirational board.

Try something new today


I hope you are enjoying your summer and  trying something new too!