I recently had the pleasure of meeting lots of fellow PhD students and postdocs who had reached the point in their research where they realized they desperately needed some programming skills. Excel just doesn’t cut it when you have huge amounts of data from your experiments or when you are fishing for new patterns and motifs.
Everyone shares a similar story. They started off with a book or a tutorial on R or Matlab, and it was all so confusing that they lost their patience and motivation and got depressed. First of all, of course, don’t give up! We’ve all been there, trying to make this stupid computer read our minds and do what we want it to do. The key to learning to use programming in your research is to start off with baby steps and never lose your curiosity.
Programming is not only a useful tool for your great science ideas, it is also fun!
Here is a strategy I propose to you, my dear fellow biologist, who wants to learn to code:
1. So, let’s quiz your skills a little bit. I’m assuming you do well in math and you sleep with a book on statistics under your pillow. So we’ve got that covered.
Now, do you know how to use the command line to check what files are in your “Documents” directory, for example? Or have you perhaps taken a programming course at school in any programming language? If so, skip to the next step!
Otherwise, don’t worry and enjoy the power of the Khan Academy programming basics tutorial. It will make you more familiar with variables, comments, functions and all the other basic elements you will find in any programming language you use in your work.
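To give you a flavour of those building blocks, here is a tiny sketch in Python (the language I recommend further down; the tutorial itself may use a different one). The sample name, the numbers and the function `mean` are made up purely for illustration.

```python
# A comment: everything after '#' is ignored by the computer and is there for humans.

# Variables: names that hold values (these values are invented for illustration).
sample_name = "control_1"
measurements = [4.0, 5.0, 6.0]


def mean(values):
    """A function: a reusable recipe that takes an input and returns a result."""
    return sum(values) / len(values)


average = mean(measurements)
print(sample_name, "has a mean of", average)
```

The same three ideas (variables, comments, functions) show up in every language you are likely to meet, only the spelling changes.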
Try a fun exercise: how about drawing a tree with code? See how it’s done here.
2. Now it’s time to get down to some serious business and go through this course on Udacity: Intro to Computer Science.
The most important thing at this step is to learn how to learn things on your own. No one remembers all the commands by heart, but a good programmer knows where to find them! That means knowing how to search for the function you need, how to debug your program and find errors, and how to read documentation. Another important thing is to understand how a typical program you are going to write works. Say you have your data in a file in some sort of format. Your program will need to read it, parse it into a suitable data structure (what would that be? a table? a graph? a list of numbers?), then perform some analysis and output the result as a plot or another data structure. Keep this in mind as you go through the course material.
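Here is a minimal sketch of that read, parse, analyse, output flow in Python. Everything specific in it is an assumption made for illustration: the gene and sample names, the numbers, the column layout and the helper functions `read_table`, `mean_expression_by_sample` and `report` are all hypothetical, and the data sits in a string so that the example runs without a real file.

```python
import csv
import io
import statistics

# Hypothetical experiment data. In real life this would live in a file and
# you would pass open("my_data.csv") instead of io.StringIO(RAW_DATA).
RAW_DATA = """gene,sample,expression
geneA,control,2.0
geneA,treated,4.0
geneB,control,3.0
geneB,treated,9.0
"""


def read_table(handle):
    """Read the text and parse it into a table: a list of rows (dictionaries)."""
    rows = []
    for row in csv.DictReader(handle):
        row["expression"] = float(row["expression"])  # turn text into a number
        rows.append(row)
    return rows


def mean_expression_by_sample(rows):
    """Analyse: average the expression values within each sample group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["sample"], []).append(row["expression"])
    return {sample: statistics.mean(values) for sample, values in groups.items()}


def report(means):
    """Output: print the result as plain text (a plot would also do)."""
    for sample, value in sorted(means.items()):
        print(f"{sample}\t{value:.2f}")


table = read_table(io.StringIO(RAW_DATA))
means = mean_expression_by_sample(table)
report(means)
```

Here the data structure is a table (a list of rows), because the input is tabular. For a network of interacting proteins you would probably pick a graph instead, but the four stages stay the same.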
Bonus points: you’ll finally understand what this nerdy guy from the Computational Biology department is talking about 🙂
A quicker alternative is to try Codecademy and start off with Python (in which case you can move straight on to step 3!): http://www.codecademy.com/tracks/python This is a great option if you want faster results and prefer interactive learning.
3. By now you should feel less afraid of code than before. Now it is time to pick your language!
You have probably heard of R before. R is a statistical toolbox and a really impressive tool for anyone interested in genomics. I have yet to find anything as useful. That said, R can be tricky to start off with, as it can be really confusing for a beginner.
Therefore I suggest you start with Python. It’s dirt simple! I promise you will start doing cool stuff from day one. It also has a large online community and lots of ready-to-use modules and libraries. And no weird R confusion to deal with. You will deal with that later if you really need to use Bioconductor or plot lots of heatmaps 🙂
Now it is still time for fun.
- Go to learnpython.org and play with Python in your browser. No need to install anything. Or see the Codecademy track I mentioned before: http://www.codecademy.com/tracks/python
- Another online course uses Python to teach programming and computational thinking in general: https://www.coursera.org/course/programming1
- A great read at this stage of your journey is Think Python: How to Think Like a Computer Scientist. It helped me so much when I was starting out!
- Check out more resources for learning Python here.
4. Now you might take a look around Biopython and decide whether Python is a good choice for the problems you are facing: if you need to do heavy computing, sequence alignments or machine learning, it most certainly is.
However, if you just want to make some plots of your tabular data every once in a while, cluster your experiments, run a PCA, or enjoy all the public genomic data that is generously available on Bioconductor, it is time for you to master R!
Here is your go-to website that will get you started with the basics: http://tryr.codeschool.com/ Interactive learning for the win!
After that, there is only the trial-and-error method 🙂 Don’t be afraid to ask your fellow bioinformaticians for help or search for your problem online!
More resources on learning R when you have basic programming knowledge:
- Data Analysis on Coursera: https://www.coursera.org/course/compdata
- And this book is a charm: http://www.amazon.com/Introductory-Statistics-R-Peter-Dalgaard/dp/0387954759
- If you are in Stockholm, join the local coding community on meetup.com: R User Group in Stockholm, Python User Group in Stockholm.