“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I click through to the Download page. At first sight, it’s reasonably promising. Excel files – not perfect, but beats a PDF – and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas for decimal numbers) of the uncurated database:
# inflating: UMDTP53_all_2012_R1_US.txt
# inflating: ._UMDTP53_all_2012_R1_US.txt
Wait now. What’s that weird hidden file, starting with a dot?
# "._UMDTP53_all_2012_R1_US.txt" may be a binary file. See it anyway?
# no, I don't want to do that
At this point, alarm bells are ringing. I’m thinking “Mac user”. Don’t get me wrong; I own a MacBook Air myself, and I see smart people doing good work with Macs. But in my experience, biologist + Mac + text files = trouble.
I push on and check the number of records in the main text file.
wc -l UMDTP53_all_2012_R1_US.txt
# 0 UMDTP53_all_2012_R1_US.txt
Zero lines. That can’t be right. We look at the file using less:
Mmm – oh, ^M
Aha, the old “line endings” issue. Easy enough to fix with the classic utility dos2unix – in this case, called as mac2unix:
mac2unix UMDTP53_all_2012_R1_US.txt > /tmp/UMDTP53_all_2012_R1_US.txt
# dos2unix: Skipping binary file UMDTP53_all_2012_R1_US.txt
Well, that’s odd. Even odder is that this works:
cat UMDTP53_all_2012_R1_US.txt | mac2unix > /tmp/UMDTP53_all_2012_R1_US.txt
# on we go; overwriting the original since we can always get it from the zip
mv /tmp/UMDTP53_all_2012_R1_US.txt UMDTP53_all_2012_R1_US.txt
wc -l UMDTP53_all_2012_R1_US.txt
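The same diagnosis and fix can be sketched in Ruby, which is handy when mac2unix isn't installed. This is a minimal sketch, not what I ran at the time: the helper names `line_ending_counts` and `normalize_newlines` and the sample data are made up for illustration. For a real file you would read the raw bytes with `File.binread`. A bare carriage return is the classic Mac OS line ending, which is why `wc -l` (which counts line feeds) reported zero.

```ruby
# Count the three common line-ending styles in a string of raw bytes.
def line_ending_counts(bytes)
  crlf = bytes.scan("\r\n").size
  {
    crlf: crlf,                     # Windows
    cr: bytes.count("\r") - crlf,   # bare CR: classic Mac OS
    lf: bytes.count("\n") - crlf    # bare LF: Unix
  }
end

# Convert CRLF and bare CR to LF, leaving every other byte untouched.
def normalize_newlines(bytes)
  bytes.gsub(/\r\n?/, "\n")
end

# Hypothetical sample standing in for File.binread("UMDTP53_all_2012_R1_US.txt")
sample = "Name\tMedline\rrow1\t123\rrow2\t456\r".b

counts = line_ending_counts(sample)
fixed  = normalize_newlines(sample)

puts counts.inspect
puts fixed.count("\n")
```

Working on bytes rather than decoded text means the conversion does not care what encoding the file is in.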
The web page said there should be 36 249 records…but the file looks OK. Right, let’s get on and parse this file. It’s tab-delimited, I like Ruby, so I try to do the right thing and use a library – in this case, the Ruby CSV library. First, just a run-through to see that it all works as expected:
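require "csv"
f = "UMDTP53_all_2012_R1_US.txt"
CSV.foreach(f, :headers => true, :col_sep => "\t") do |row|
  puts row[0] # assumption: the original loop body is not shown; print the first field
end
# prints lots of names and then...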
# ArgumentError: invalid byte sequence in UTF-8
# from /usr/lib/ruby/1.9.1/csv.rb:1855:in `sub!'
# from /usr/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
# from /usr/lib/ruby/1.9.1/csv.rb:1849:in `loop'
# from /usr/lib/ruby/1.9.1/csv.rb:1849:in `shift'
# from /usr/lib/ruby/1.9.1/csv.rb:1791:in `each'
# from /usr/lib/ruby/1.9.1/csv.rb:1208:in `block in foreach'
# from /usr/lib/ruby/1.9.1/csv.rb:1354:in `open'
# from /usr/lib/ruby/1.9.1/csv.rb:1207:in `foreach'
Not all as expected, then. Colour me astonished. Frankly, this is why people resort to splitting on the delimiter instead of using libraries… Anyway, search the web for that error and you’ll find a lot of confusing, conflicting and sometimes just plain wrong explanations and advice. Let me save you the trouble. The Ruby CSV library expects UTF-8 encoding. This file is not encoded in UTF-8. So what is it? A surprisingly tricky question to answer.
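Before guessing at the encoding, it can help to find out exactly which bytes Ruby objects to. The following is a rough sketch under stated assumptions: `invalid_utf8_lines` is a hypothetical helper, and the sample string stands in for the real file contents (which you would read with `File.binread`). It relies on `String#valid_encoding?` and `String#scrub`, both part of core Ruby.

```ruby
# Return [line_number, [hex bytes]] pairs for lines that are not valid UTF-8.
def invalid_utf8_lines(bytes)
  bytes.each_line.with_index(1).filter_map do |line, n|
    text = line.dup.force_encoding(Encoding::UTF_8)
    next if text.valid_encoding?

    # scrub yields each invalid byte sequence; record it as hex.
    bad = []
    text.scrub { |seq| bad << seq.unpack1("H*"); "" }
    [n, bad]
  end
end

# Hypothetical sample: one record contains the byte 0x8F.
sample = "Name\tMedline\nSmith\t111\nMol\x8Fs\t222\n".b

report = invalid_utf8_lines(sample)
report.each { |n, bad| puts "line #{n}: #{bad.join(' ')}" }
```

Knowing the offending byte values gives you something concrete to look up in encoding tables.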
First step: another visual examination using less, which shows some odd characters. The file contains a “Medline” column, so I search PubMed with the UID and see that author Mol<8F>s is supposed to be author Molès. I am not an expert in character encoding, but when I post my frustration to Twitter, I find someone who is:
@neilfws 8F is the hex code for Mac Roman encoding for è: en.wikipedia.org/wiki/Mac_OS_Ro… This might work: stackoverflow.com/questions/8420…
J.J. Emerson (@JJ_Emerson), July 29, 2014
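That claim is easy to check in Ruby by decoding the suspect bytes under a few candidate encodings and seeing which one produces the name PubMed shows. This is only a sketch: the candidate list is my own guess at plausible encodings, and only the Mac Roman result is one I'd vouch for.

```ruby
# The raw bytes as they appear in the file: "Mol", 0x8F, "s"
raw = "Mol\x8Fs".b

# Candidate encodings to try (an illustrative, not exhaustive, list).
candidates = %w[macRoman ISO-8859-1 Windows-1252]

decoded = candidates.to_h do |name|
  text = begin
    raw.dup.force_encoding(name).encode("UTF-8")
  rescue EncodingError
    nil # this byte has no mapping in that encoding
  end
  [name, text]
end

decoded.each { |name, text| puts "#{name}: #{text.inspect}" }
```

The candidate that yields "Molès" is the one to hand to the CSV parser.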
So tell CSV to transcode from macRoman to UTF-8:
CSV.foreach(f, :headers => true, :col_sep => "\t", :encoding => "macRoman:UTF-8") do |row|
All good! I can parse the file and start using the fields to do something useful. Once again, the interesting part (analysis) takes minutes; getting to the analysis takes hours or days.
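For anyone who wants to try the transcoding option without downloading the database, here is a small self-contained sketch. The two-column file and its column names are invented for the demonstration; only the `CSV.foreach` options mirror the call above.

```ruby
require "csv"
require "tempfile"

# A tiny tab-delimited file in Mac Roman; 0x8F is è in that encoding.
# Column names and values are made up for this example.
mac_roman_bytes = "Name\tMedline\nMol\x8Fs\t12345\n".b

names = []
Tempfile.create("macroman_demo") do |tmp|
  tmp.binmode
  tmp.write(mac_roman_bytes)
  tmp.flush

  # "macRoman:UTF-8" means: read as Mac Roman, hand back UTF-8 strings.
  CSV.foreach(tmp.path, headers: true, col_sep: "\t",
              encoding: "macRoman:UTF-8") do |row|
    names << row["Name"]
  end
end

puts names.inspect
```

The `external:internal` form of the encoding string is the same convention Ruby uses when opening files, so the parser only ever sees UTF-8.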
It’s always tempting to say “well, all of these problems could be avoided if people only did [insert better approach here]”. The thing is, they don’t. Dealing with it is just part of the job and having the skills to deal with it, I think, deserves more recognition than it gets.