
Bioinformatics and its problems

Bioinformatics is still in its early stages, with formalism and quality still driven by biologists. For centuries biologists have struggled to formalise biological data, with very limited success. Because of that, there is virtually one truth per biologist in every field. This lack of agreement not only happens every day, it's accepted as normal and sometimes even considered a virtue.

When interacting with computers and information systems, the lack of formalization severely limits what is possible and practically reduces bioinformatics to a mere procedural and organizational task: assembling numerous different databases for the same data. Furthermore, as biologists are still the ones formalising bioinformatics, those databases are redundant, non-standard and inefficient.

Biochemistry is where bioinformatics is most active. It has the formalism of chemistry (definitions of molecules, interactions, etc.) but it becomes fuzzy when that formalism is applied to life forms. It's quite easy to derive the amino acid sequence (primary structure) of a protein from its genetic code, but it can be extremely complex to fold that sequence into a proper three-dimensional structure and define its active sites. Understanding how proteins evolve from one organism to another, how they interact with different environments, and how they can be used for completely different things in different organisms or even within the same organism are all challenges that are still impossible even to formalise.
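To illustrate how mechanical the "easy" half of that problem is, here is a minimal sketch of translating a coding DNA sequence into its primary structure. It assumes the standard genetic code (NCBI translation table 1) and a sequence that is already in the correct reading frame; the function name `translate` is chosen for this example only. Nothing comparably short exists for the folding step.

```python
# Standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate coding DNA into one-letter amino acids.

    Assumes the sequence starts in frame. Translation stops at the
    first stop codon; trailing bases that do not form a full codon
    are ignored. Letters other than A, C, G, T raise KeyError.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

The whole thing is a table lookup, which is exactly why this part is well formalised while everything that follows it is not.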

The very definition of life is utterly philosophical. Biologists still don't agree on whether a virus is a life form or not and, in fact, not many of them are really sure about their own opinions. If biology, that is, “the study of living things”, can't even define life itself, it's no wonder the rest of it is so loosely defined. Nevertheless, everyday biology is quite simple: pine trees are plants, mushrooms are fungi and cats are animals. The complexity is hidden underneath several layers of broad definitions, ill-defined approximations, manipulated statistics and selective curve-fitting, because any deeper understanding proves impossible by today's standards.

Bioinformatics urgently needs stronger formalization, standards and consortia, software quality assurance and much closer interaction with the software development community, especially the open source community. Furthermore, there is growing concern about storage and information retrieval systems, as biological databases passed the petabyte barrier some time ago.

Formalization paths

Science can cope with any level of formalism as long as you clearly state the precision, but because some areas have historically had more formalism than others, the generally accepted precision differs from area to area.

For instance, measuring the energy of a particle and reporting it as 1.5 ± 3.0 eV is totally unacceptable for a physicist (i.e. the error bar is bigger than the value itself), but some biochemists face response levels like that from protein interactions on a daily basis. Unfortunately, the known techniques are so poor that this is the best they can come up with. It's not hard to find a PhD thesis in biology that relies entirely on numbers like this, and sometimes on no numbers at all.
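The arithmetic behind that objection is short enough to spell out. The sketch below uses the 1.5 ± 3.0 figure from the text; the function names are made up for this illustration, and the "consistent with zero" check is a deliberately crude one-error-bar criterion, not a proper statistical test.

```python
def relative_uncertainty(value: float, error: float) -> float:
    """Return the error bar as a fraction of the measured value."""
    return abs(error) / abs(value)


def consistent_with_zero(value: float, error: float) -> bool:
    """True when zero lies within one error bar of the value."""
    return abs(value) <= abs(error)


# The example from the text: 1.5 +/- 3.0
ratio = relative_uncertainty(1.5, 3.0)
print(f"relative uncertainty: {ratio:.0%}")
print("consistent with zero:", consistent_with_zero(1.5, 3.0))
```

A relative uncertainty of 200% means the measurement cannot even distinguish the reported value from no effect at all.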

On the other hand, information science (a.k.a. informatics) is not that docile. Programs must execute, take input and produce output. You can't assume that half of the input will produce half of the output: the program may produce nothing at all, or even fail miserably, delete other files, format the whole filesystem, drop tables and so on. The consequences of small errors in complex information systems can be catastrophic. Information systems not only accept a tighter formalism, they require formalization far beyond that of biology in order to be minimally meaningful.
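As a small illustration of why half an input cannot simply yield half an output, here is a sketch of a strict parser that refuses truncated or malformed data instead of silently returning whatever it managed to read. The FASTA-like format (a `>` header line followed by sequence lines) and the name `parse_fasta_strict` are assumptions made for this example only.

```python
def parse_fasta_strict(text: str) -> dict[str, str]:
    """Parse FASTA-like text into {name: sequence}, failing loudly.

    Raises ValueError on sequence data before any header, an empty
    header, a duplicate name, a record without sequence (typical of
    a truncated file) or input with no records at all.
    """
    records: dict[str, str] = {}
    name = None
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None and not records[name]:
                raise ValueError(f"record {name!r} has no sequence")
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {line_number}: empty header")
            name = fields[0]
            if name in records:
                raise ValueError(f"line {line_number}: duplicate {name!r}")
            records[name] = ""
        else:
            if name is None:
                raise ValueError(f"line {line_number}: data before header")
            records[name] += line
    if name is not None and not records[name]:
        raise ValueError(f"record {name!r} has no sequence")
    if not records:
        raise ValueError("no records found")
    return records


print(parse_fasta_strict(">seq1 example\nACGT\nTT\n"))
```

Failing early is the cheap option here: a file cut off after the second header raises an error instead of feeding an empty sequence to whatever runs next.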

The biggest challenge in joining biology with informatics is to draw the line between what is an acceptable failure for the information systems and what quality is needed for the biological data. To be utterly simplistic, bioinformatics today lives on a line, with software quality at one end and scientific quality at the other. The more formalism you get for the software, the less scientific information you get from the data, and vice versa.

Luckily, reality is not that simplistic. It is possible to write quality software and still have the required scientific quality, and that's the objective of this text. I hope I can show you that software quality is not a barrier to scientific results and that the time spent producing good code is recovered later by not having to fix numerous bugs.

Hammers are for blacksmiths

There is no such thing as magic, especially in the software field. In the beginning, operating systems were created from scratch for every new architecture available, and there were many more architectures in the past than there are today, almost one per machine type. Nowadays, everything is basically x86 compatible, so it's quite simple to get new machines running. It's almost like magic.

The same happened with network protocols converging to TCP/IP and its derivatives, web pages converging to HTML and the like, 3D libraries converging to OpenGL, etc. History repeats (or more accurately, resembles) itself quite often. Bioinformatics is just another case and, despite its age (decades now), it's still in its early stages.

The Semantic Web is the new hype. It's considered one of the most complex problems informatics is facing nowadays because of the number of possibilities and the complex relationships with current technology involving software, hardware, network and especially theoretical problems, but after no more than 10 years it already has well-defined formats, good-quality query languages and reasonable tools available.

Bioinformatics, on the other hand, is a completely different story. It has existed for quite a few decades and nothing is defined: there isn't a single standard for anything or, to put it better, there are thousands of standards and unique identifiers (which is exactly the same thing). Anyone can build complex systems from scratch, completely ignoring everything made so far, and still produce something that works as well as the existing tools and that, in the eyes of a biologist, will be good enough. As long as biologists are in charge of defining the quality of software, bioinformatics will be doomed to fail over and over again, increasing the ball of mud every cycle.

But that's not the only problem, nor even the worst. A new class of professionals, called bioinformaticians, is being created from a mix of biologists, developers and enthusiastic students. Normally, interdisciplinary teams are the best, for you get the best of everyone to form a winning team, but in the case of bioinformaticians, professional software engineers end up maintaining software poorly written by undergraduate students, whose quality was defined and controlled by biologists. It couldn't be worse!

When building a house, the architect draws the concept and you accept it or not, or ask for changes and refinements, but the one who actually draws it and controls the quality of the drawing is the architect. Later on you need to actually build it, but unless you work in the construction business you need to hire an engineering company that will have professional workers, construction engineers, hydraulic and electrical engineers and so on.

Obviously, there are regulations and standards, so you can't just ask them to put an electrical socket right underneath the kitchen tap; they are not allowed to do it. But in the land of bioinformatics you create your own rules and standards, and you can put the socket wherever you please or wherever you think is most appropriate. The workers will eventually install it and the house will catch fire one day or another.

In an ideal world, a professional software engineer should be solely responsible for the quality of the software and the implementation of the algorithms. Students are the fresh minds bringing new ideas and algorithms, and biologists define how good the results are and demand more results or fewer false positives, but they should never, under any circumstances, interfere with the software development process and quality control.

What makes this scenario less attractive is that it's nearly impossible to find one person with all three qualities, and hiring three people for the job of one is seldom a wise choice on a tight budget. Even if you do find someone with all three qualities (or at least the required bit of each), he/she won't be able to play all three roles at the same time.

Nevertheless, experienced software engineers should be the ones in charge of implementing and maintaining the software infrastructure and the algorithms, working together with students to gather new ideas and improve the software. Biologists should define scientific standards and work together with developers to ensure the results are correct, but never dictate the accepted quality (i.e. time vs. features).

Academia vs. Industry

Academia is not streamlined, and should not be. It's the roughness of academia that makes so many discoveries possible, and it's the lack of pressure for commercial return that makes it possible to invest in ideas that won't be used anywhere near customers. However, there are some areas of academia close to the market, mostly technology research, that share a bit of the market pressure.

Market pressure is not all good: it puts the wrong things, like looks and marketing, before important things like quality and security. But some of that pressure can be used for good purposes, such as those that drive organizations like the IETF, W3C and ISO, among many others. They help companies and academia to organise themselves around well-defined standards that, even if not good at one particular time, can be developed by all parties and improved for future use.

In some cases the standards are defined by the market: good examples are VHS, DVD and x86. In other cases they are defined by organizations like the ones described above: good examples are TCP/IP, X.25, HTML, XML and RDF. Both approaches work pretty well, but bioinformatics seems to fall in between.

There are some organizations capable of enforcing standards, but they're not quite as respected as the IETF and W3C. There are also some market-defined standards, but because bioinformatics is not a market they don't need to be respected anyway. In the end, bioinformaticians respect whatever they please, whenever they want… or not. Because this is common practice in the field, other bioinformaticians respect the right not to respect anything and have to take that into account when creating their own internal procedures.

The same information must be created in thousands of different formats, with slightly different contents, available in completely different places and services, embedded in other systems or standing alone as a file for download. For every new piece of information you have to do that all over again. Also, because every system has its own pace, everyone keeps their own copy of everything, and because formats change a lot over time, completely unpredictable inconsistencies arise as often as ticks in an atomic clock.

Lessons to learn

So, what can bioinformatics learn from industry without absorbing too much bureaucracy and weak values? Which steps should be taken to increase the quality of information systems and still be able to produce good scientific results? How can good standards be created, developed and maintained without reinventing the broken wheel again and again? How can institutes be organized to improve compliance with standards without impacting the already slow release cycle?

These are very complex questions and require very complex answers. Not only are they complex, each answer is specific to each case, so instead of giving a set of defined rules I'll state each problem, the ways it is normally solved and why those solutions are often considered bad-quality software. I'll also try to give more general advice to help higher-level management understand why applying policies is necessary when dealing with software quality.



 
softeng/bioinformatics_and_its_problems.txt · Last modified: 23 02 2008 07:59 by rengolin
 