Top 5 bash commands for handling data

I have a confession to make. I don’t really like the command line. Or at least I didn’t like it before. Bash commands used to simply terrify me and I would opt for other tools instead whenever I could.

After moving into bioinformatics and data analysis, though, I started to realise that command line tools and tricks, while error prone and hard to remember at first, are true efficiency maximisers when used for the right tasks.

Here are my Top 5 bash commands for handling data, running a bioinformatics pipeline and generally making a data scientist’s life better.

1. What does my data look like?  cat, zcat, head, tail, grep, wc

cat, head and tail were probably some of the first commands I learnt when starting out in the Unix environment. But the real power comes when you combine them with grep, to look only at the data you’re interested in, and with wc, to count how many data points you have.

Say that you want to quickly check how many lines in your file contain a certain flag. Doing this in bash is beautifully simple and fast:

zcat data_chrX.gz | grep "DEBUG" | wc -l

Also, zcat is a real saviour if you work with gzipped files a lot. Explore them like this:

zcat data_chrX.gz | less

or

zcat data_chrX.gz | head -n 20

to see just the first 20 lines.
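tail combines just as nicely with wc. As a small sketch, assuming the first line of data_chrX.gz is a header row, this counts only the actual data rows:

zcat data_chrX.gz | tail -n +2 | wc -l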

2. Shortcuts for moving the cursor

Typing commands in your command line can get pretty tedious, especially if you use full file paths as input or the script you are launching has many options.

I only learnt these simple command line shortcuts recently, and I don’t know how I ever managed without them; now I use them all the time.

ctrl-E # move cursor to end of line
ctrl-A # move cursor to beginning of line
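These are standard readline shortcuts, so they also work in other readline-based tools. Assuming the default emacs-style bindings, two more motions I find handy are:

alt-B # move cursor back one word
alt-F # move cursor forward one word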

3. Screen for working on a remote server

Have you ever ended up in a situation where you ssh to your super powerful remote machine, launch your script, go get coffee while waiting for the results, and then come back to a “broken pipe”? This is a common beginner mistake; don’t worry, just add screen to your data analysis toolbox! With screen you can create as many terminal windows on the remote machine as you like and detach from them, and your jobs will still be running.

A shortcut to start a program in a new virtual terminal and detach from it is this:

screen -m -d <your command> # start your command in a new detached session
screen -ls # list all the screen sessions you currently have running
screen -r <screenID> # switch back to a detached screen

Then, if you created too many sessions by mistake or just want to kill them all:

killall -15 screen
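If you are attached to a session and want to leave it running, you detach with a key binding rather than a command. Naming your sessions also makes them easier to find again; a small sketch, with “chrX_run” as a made-up session name:

screen -S chrX_run -m -d <your command> # start a named, detached session
screen -r chrX_run # reattach by name instead of by screen ID
ctrl-a d # from inside a session: detach and leave it running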

4. Check how many results have completed with ls and wc

Now that you are running your script, you might want to check how many results have already completed. I usually output to a dedicated directory called something like results/experimentID/…

ls -l results/experimentID | wc -l

But sometimes the results directory has subdirectories with, for example, plots or old results. To count only the actual files, I found this great trick:

find targetdir -maxdepth 1 -type f | wc -l
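If your result files share a common extension, you can narrow the count down further with -name. A small sketch, assuming the results are written as .csv files:

find results/experimentID -maxdepth 1 -type f -name '*.csv' | wc -l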

5. Perl perk: Renaming multiple files with a one-liner

The rename command is actually part of Perl, but on many Linux machines you’ll have it out of the box in your command line. With Perl rename you can change the extension of files, or efficiently rename several files at once in any other way you like. You might want to read up on Perl regular expressions first if you have some advanced renaming in mind.

rename 's/\.txt.txt\z/.txt/' *.txt.txt
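Perl rename also takes a -n (“no act”) flag, which is handy for previewing what a regular expression will do before letting it loose on hundreds of files:

rename -n 's/\.txt.txt\z/.txt/' *.txt.txt # only prints what would be renamed, changes nothing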

Bonus command line trick for bioinformaticians:

Split one FASTA file into several with awk:

awk '/^>/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' dna.fa

This command will split the file every time it finds a “>” character and name the resulting files based on their FASTA headers. There are many tools that can achieve the same thing, but I find this the most elegant and handy so far. Awk is in general very powerful and versatile, although it seems to have a steep learning curve 😉
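To make that concrete, here is a tiny sketch with a made-up two-sequence file:

# dna.fa (hypothetical example input):
#   >chr1
#   ACGTACGT
#   >chr2
#   TTGACCAA
awk '/^>/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' dna.fa
# creates chr1.fa and chr2.fa, each with its own header line and sequence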

 

There it is, happy hacking in the command line!

 

 

1 Comment

  1. Nice article. I love the command line 🙂 Didn’t know the rename command, must try that.

    There are several more z* commands, e.g. zless and zgrep. Which means you could rewrite

    zcat data_chrX.gz | less

    as

    zless data_chrX.gz

    and

    zcat data_chrX.gz | grep "DEBUG" | wc -l

    as

    zgrep "DEBUG" data_chrX.gz | wc -l

    But your versions might perform better on large files.

    Also, to continuously check how many results you have, try the watch command:

    watch 'find targetdir -maxdepth 1 -type f | wc -l'

    It will execute the commands every 2 seconds (change the interval with the -n option) and display the result.
