header image
Silly prototypes of the week
August 7th, 2008 under Devel, Algorithms, rengolin. [ Comments: none ]

This time I don’t have a project, nor probably will have time to implement them (I’m on real holidays!), so I’ll just post the ideas and welcome any comments on the feasibility or usefulness of them. Feel free to implement them and let me know how it went.

Independent Bayesian Agents

An Artificial Neural Network is a collection of neurons connected in meaningful ways that, when the information, coming from an input layer, passes through them, a resulting signal is produced on the output layer. The learning is done via feedback (good or bad), by changing the internal functions of each neuron to accommodate them for the next “pulse”.

The problem with this is that, after the learning phase, the resulting network can only classify a strict class of problems. Networks created to recognise alphabetical letter in an image has a different format and components than those to recognise other patterns. There are some implementations where the network itself can change it’s connections in order to learn something new, but that’s not commonly seen, not even on many scientific works on the subject.

You can have the same network structure using Bayesian nodes. The class of problems to be solved is a bit different but the overall logic is similar.

Bayesian Agents are a bit different. They are independent, learning programs that might be able to communicate with other agents upon necessity and also update their internal states from feedback and previous probabilities. It also has the advantage to return not only true/false values but probabilities or even probability distributions to the user.

But this way, your agents will be much more complex than the neurons on ANN, and that’s a path that is rarely worth taking if you want to build emergent behaviour as it becomes more and more difficult to combine them to produce unexpected behaviours. So, the idea is to break all the agents into very small pieces that can interact with each other, with just a few rules.

This way, the agents themselves would be pretty much like objects in an OO design, following the implementation of a class (agent definition) and inheriting or using methods from other Agents (interaction) you can, at the same time, build a network with any format you want (using graphs or even network connections via name service or so for the interactions) and keep it simple, by following the agent definition and not trying to put any complex logic on each agent.

Same as OO design, you don’t want to create one single class that does everything. What you do is a relationship between classes that express the problem set you’re solving and, if well designed, you can reuse some of the components to solve other problems as well. The power of the OO design is the relationship between the classes and not what the classes are actually doing, right?

So, my hunch is that, if we do very simple agents and let the network itself be the relationship between them, we acquire a new power of extensibility.

Also, by changing the connections, the learning happens at the network level instead of the agent level. Further on, by inflicting an “explorer behaviour” on each agent (like extending from a common class Explorer for all agents), they can search for new answers once in a while, in background, to increase their levels of trust.

In the same line, by extending from an agent Feedback you can make all agents be able to process feedback from the users, from its derived classes such as BooleanFeedback returning good or bad, or MultipleChoiceFeedback, ContinuousFeedback and so on.

Genetic programming using template Policies

Learning by structure given above (network structure) is similar to genetic algorithms, but in a different level. Genetic programming allows you to learn by the structure of the program itself. A standard way to solve that is using a rulebase approach. Create a big set of simple rules and write programs to use them in a mixed fashion.

Given the problem you want to solve, run all programs against it and try to define the quality of each solution (if they’re at least able to get somewhere) and give those programs closer to the solution a higher chance of survival, but don’t kill the unfit, and that’s an important step.

If there are good, but dormant, rules (genes) on the unfit and you kill them, you’re removing from genetic pool (rulebase) some solutions that could be the best when joint with others that you didn’t have the chance to join yet. Anyway, crossing those programs (by exchanging rules and creating new combinations) and running the next breed on the same problem you can, iteratively, lead the evolution towards the best solution by chance.

So, what about template policies?

C++ templates are very powerful. One of the great powers (which comes with great responsibility) is to be able to inherit functionality from the template argument.


template <class EatPolicy, MovePolicy, ReproducePolicy>
class LifeForm : public EatPolicy, MovePolicy, ReproducePolicy {
};

With this, you can create a life form following one of the policies provided. They will inherit the methods and properties of the policies as if it was a normal OO inheritance. So, to create a bacterium, just implement some policies inheriting from EatPolicy, MovePolicy and ReproducePolicy and instantiate like this:


LifeForm<Fermentation, Osmosis, Mitosis> bacteria;

You can also create several policies inheriting from Fermentation (from different materials, etc) and create several types of bacteria, but the problem is how to cross them later. Because templates are evaluated during compile time (and types are created), you can’t create new types during run time and the crossing over should be done off-line (and recompile). A few pre-processor commands and a Perl script could do it, though, and isolate that nasty part from the rest of the code.

I’m not sure of the implementation details of the cross-over but the basic skeleton did work: Skeleton of genetic policies. Would give it a try later on.

Popularity: 1% [?]


Silly project of the week: molecule dynamics
July 9th, 2008 under Devel, Algorithms, rengolin, Physics. [ Comments: 1 ]

This week’s project is a molecular dynamics simulation. Don’t get too excited, it’s not using any of the state-of-art algorithms nor is assembling 3-dimensional structures of complex proteins. I began with a simple carbon chain using only coulomb’s law in a spring-mass system.

The molecule I’m using is this:

Molecular Dynamics

The drawing program is quite simple and wont work for most molecules, but for the 2-dimensional simple molecules (max. of 3 connections per atom) it kinda works.

Later on, putting the program to run, each atom “pushes” all others electrically and the spring “pulls” them back. A good way to solve that is to say that q1 . q2 / x² = - k . x = m . d²x/dx² (where x is a vector) and integrate numerically using Runge-Kutta.

But that’s my first openGL program, so I decided to go easy on the model and actually see it pseudo-working with an iterative-based simulation following the same equations above. This picture is a frame after a few iterations.

Quoting its page: “As this simulation is not using any differential solution, the forces grow and grow until the atom becomes unstable and break apart. Some Runge-Kutta is required to push the realism further.

UPDATE:

The webpage of the fully-functional prototype is HERE.

Popularity: 4% [?]


What is bioinformatics anyway?
June 24th, 2008 under Devel, rengolin, Bioinformatics. [ Comments: 3 ]

After almost three years in the field I’m pretty sure I have no idea. A few months ago I though I knew and wrote an essay about software quality on bioinformatics but I now figured out that, even though those things might make sense to the rest of us, for bioinformatics it doesn’t.

Wikipedia (which have a much higher quality than many papers I’ve read) defines software as: “a general term used to describe a collection of computer programs, procedures and documentation that perform some tasks on a computer system”. It also defines programming as: “the process of writing, testing, debugging/troubleshooting, and maintaining the source code of computer programs”.

So, every one that writes programs (let’s forget about documentation, tests, maintenance etc for now) is a programmer. But a computer programmer IS NOT a software engineer. Programmers can write as much code as they want but without formal definitions, metrics, good design decisions and practices, tests, documentation and so on, they are useless as ants without pheromone.

Quick tip: Whenever you see a job for a software engineer in a bioinformatics institute, beware: It generally means a developer to maintain random code and make random changes in random environments.

So what?

I might not have a clue about what bioinformatics is, but now I’m pretty sure what it ISN’T: Software Engineering. You will find a huge amount of code, scripts, programs, databases but rarely find a fair piece of software. Therefore, my previous ideas could be valid for software quality, but not at all to bioinformatics.

Don’t get me wrong, I know some bioinformaticians (and programmers around) that understand the basic ideas about software and quality and why we should have them, but the whole structure, the scientific community, the people that give them money, have no idea whatsoever of what software really is or where it fits in the loop.

Still, bioinformaticians are getting half-programming and half-biology degrees, on two fields that each has more to know than the whole humanity can hold on their brains added up. How is it possible (and fair) to put those poor guys to work on such sub-human conditions, without any guidance or quality control, without any clue, in fact, to what they really should be doing in the first place.

Some of them come out pretty well, so well that they abandon the field and go work on better companies, with much better software strategies, proper engineering, scientific development in the right place (sandboxes) and production code done by real engineers with solid experience in mission-critical environments.

In the end, it leaves bioinformatics (to be fair, the informatics part only) in the hands of inexperienced people in all sorts of fields and levels, students writing production software, people that never saw a mission-critical environment coordinating databases, filesystems and development, with one bad decision after the other.

Is it just a rant, then?

No, not really. It’s a liberation. For a while I struggled to understand the motives behind those weird decisions. I knew that, in every industry, you have a whole set of values and people can, sometimes take completely awkward decisions, which turns out to be the right one. I’ve seen it happening when moving between jobs, especially when I worked at Yahoo! (big company, big culture). But with time, the awkward decisions still sounded awkward, even after considering all the new information I had.

Other people got fed up with all this and left, one after the other. I talked to them, and the answer was always the same: random (generally bad) decisions, ego in astronomic proportions and zero technical knowledge from all parts. Now I’m leaving for good and you won’t need to ask me why, will you?

I generally need a very good reason to leave a work place. I was feeling out-placed but couldn’t leave without a very good reason, but now I got a good bunch of them…

A liberation indeed!

Is there a way out?

Seriously, no. In 10 years definitely no. In 15 quite likely no. In 20, maybe… but things must start changing now!

Being optimistic, assuming they stop running like headless chickens, they would still need a strong guidance, which is virtually impossible to happen because of the strong ego of scientists in general. Bioinformatics exists for decades already, who is the software engineer that will tell them they’re doing all wrong?

Besides, the people that grant them money (governments) have no clue about software engineering (nor they should) and they will keep sending money every year, as long as, in the reports, they pretend to be doing great things. In fact, most could’ve been done in a few weeks with two or three people prepared to compromise.

Who doesn’t want a job where they can do almost nothing at all, get paid every month without even the remote fear of loosing their jobs and still pretend they’re doing great things? Who say no to this and start working for real gets a really bad reputation… While this win-win situation keeps going, there is little or zero chance of doing real stuff in the field and bioinformatics is doomed to constant failure and ineffectiveness.

At last, it’s not a specific problem, where you can just change a couple of people and everything will be all right, as many believe. This is nobody’s fault, it’s just the way the two fields: biology and informatics, joined together some decades ago and was never straightened. If there is a way out, I’d be very glad to see and will congratulate those who managed to do it, but this is much more politics than software development and I am, very luckily, just a programmer…

Popularity: 5% [?]


Markov chain available for NumCalc
June 13th, 2008 under Devel, Algorithms, rengolin. [ Comments: none ]

NumCalc is my personal numerical methods program where I’ve implemented some nice algorithms for numerical computation. The new in the list is Markov Chain.

The Wikipedia article (link above) is far too complex… I’ll try to give a simplified explanation:

A travelling salesman goes back and forth in a set of cities and, given the city he is currently in, you want to know what’s the next city he’ll travel. Of course, he won’t show you his travel itinerary.

The simplest way of doing it is to record all travels he does within time. For each city, you have a counter of how many times he went from each city to all other. If you think these numbers as a portion of all the travels from each city you have a probability of going to any other city in the list.

Example: When he was on Paris, he went 3 times to London, 2 times to Amsterdam and only 1 time to Milan. It means that, 3 out of 6 times (50%) he went to London, so the probability of going again is 50%.

For such small quantities it’s weird to assume that the behaviour will be always the same (he can go to new cities as well) but when the amount of statistics you have is big, the behaviour become very repetitive and thus, predictable.

Real Cases:

  • MegaHAL uses an advanced Markov model to create chat bots by replying what people said before based primarily on the sole probability of one word coming after the other.
  • HMMER is hidden Markov model (a Markov model to predict another Markov model to predict something else) that can do powerful searches within long and scrambled sequences of proteins and genes. The IntrePro group use it to find their protein matches against UniProt.

Of course my super-simplified model is far from being that efficient and useful, but it’s a good start to understand how simple and how powerful they are. You can download it from its webpage.

Popularity: 7% [?]


Numerical methods package in C++
May 9th, 2008 under Devel, Algorithms, rengolin, Software. [ Comments: none ]

I still code in my spare time and for a while I’ve been gathering some numerical methods I did at university in an easy-to-understand generic C++ package. Despite being easy to understand, I also tried to implement the best method I knew for each problem.

The root finding algorithm is based on Brent’s method which is, in turn, based on the secant method. If everything fails, it throws an exception and you can use the safe bisection method.

The integration is using the Romberg’s method, which is a further extrapolations of Simpson’s method which is already much better than the trapezoid basic method.

For Interpolation I’m using the Natural Cubic Spline but would like to implement other types of splines (like complete, periodical, clamped etc). The interpolation is working, but I couldn’t managed to get the coefficients right yet.

Other codes I have are Monte Carlo, Runge-Kutta and Markov Chain (this one using boost graph library for C++) and will be integrated soon. I’ll let you know when it’s done.

Popularity: 13% [?]


Silly projects of the week
April 29th, 2008 under Devel, Algorithms, rengolin, Software. [ Comments: 2 ]

Last week I did two silly but still quite funny projects: word search on protein sequences and chat bot using markov chains.

Word search

Searching for similar sequences among the known proteins to understand evolutionary paths and function similarities is a powerful algorithm called BLAST. Following the same lines I spent a few minutes to develop a similar (but not quite the same) algorithm to search for all dictionary words on all UniProt sequences.

The algorithm will look into a dictionary (say /usr/share/dict/words) and for every sequence it’ll find the words in it, allowing a few (no more than 2) amino-acids to be between the letters. Than it gives a weight to the match based on the size of the word and the “added length” due to additional amino-acids in between. You can also restrict the number of letters on the word list to avoid too short or long words.

The results for 7 to 9 characters words in English searched only on Swiss-Prot entries are:


STRANGE 1.4
CHARADE 1.75
SLEDGED 1.75
MEALIER 1.75
DEMEANS 2.33333
DETAINS 2.33333

Markov bot

I knew a markov chain could have true wisdom but I never tried to make it actually talk to anyone until a friend told me about MegaHAL (much better than my markov model, of course). I decided to give it a try and got the XMPP bot from the Python examples and plugged in my markov program on all phrases spoken to it.

In the beginning it was awful, of course, it knew nothing. But with time (and hard work from some friends to actually teach him about leisure, programming, physics etc) it could babble a few coherent words once in a while.

The results are complete rubbish, below is a sample of the complete non-sense conversation I had with it now:


rengolin: how are you doing? have you managed to find the answer to Life, The Universe and Everything?
oraculoids: with an amazing company makes the basic points of some operator
rengolin: operator or not, you must come to reality and sense the danger there is in talking like that after such a complex and meaningful question
oraculoids: dead as an interface to follow an animal
rengolin: whatever animal you prefer, I would say but it seems that you are now driving the conversation
oraculoids:
rengolin: aha! I know you didn't have answers to everything!
oraculoids: how is in between and pray
rengolin: I don't pray too much, really, I'm not that kind of a person anyway...
oraculoids: but at the conversation it't fall into them!
rengolin: I would never say that, especially not with you my friend
oraculoids: A design pattern is a practical laywer
rengolin: lawyers have no purpose other than screwing up the poor
oraculoids: hi bot
rengolin: I am not a bot
oraculoids: field of some of life, love them!

Quite stupid, I might say, but still I had a lot of fun doing it and that’s what matters… ;) In spite of that, though, megaHAL is quite an impressive program and BLAST is a very powerful search mechanism.

Popularity: 12% [?]


Why not the primary key?
April 23rd, 2008 under Devel, rengolin, Computers. [ Comments: 3 ]

Recently I came to an amusing situation with Oracle (again) where the primary key was not used when explicitly requested…

The query was:

select name from table where table_id = 1;

Of course, table_id was the primary key. Astonishingly Oracle performed a FULL_TABLE_SCAN.

WTF?

When things like that happen you think there’s something quite wrong with the database, so I decided to ask the DBAs what was going on. The answer was something like:

“It may happen if Oracle decides to. Even though you have created an index, there is no guaranteed they will be used for all queries and the optimizer will decide which one is the best path”.

Seriously, if Oracle decides NOT to use the primary key for that query, there is something really wrong with the whole thing. I couldn’t think of a situation where that might be even close to valid! A friend who knows Oracle much better than me pictured two extreme cases why it could happen:

  1. There are just very few records in that table -> table data = 1 data block, reading the index root block (1 data block) and then accessing the one table data block is certainly more expensive than just read that one table data block.
  2. The index is in a Tablespace with different block size which resides on very slow disks. The buffer cache for the non-default block size is hugely under-sized. So the cost to read the index and the table data might be higher than just reading the table data. It’s a bit unrealistic, but I’ve witnessed stupid things like this.

Let’s face it, the first scenario is just too extreme to be true. If you have only one data block on your table you better use email rather than databases. And the second scenario, why would anyone put indexes on a much slower disk? Also, if the index is too big the data will be proportionally big too, so there is no gain in doing a full table scan anyway.

Later on he found out what the problem was by hacking into the configuration parameters:

  • The production database (working fine) had:
  • optimizer_index_caching = 95
    optimizer_index_cost_adj = 10

  • The development database (Oracle default values) had:
  • optimizer_index_caching = 0
    optimizer_index_cost_adj = 100

I don’t quite understand what they really mean to Oracle but index_caching = 0 seems a bit too radical for me to make it default.

At the end (and after more iterations than I’d like) it was fixed but what really pissed me off was to get the pre-formatted answer that “Oracle knows better which path to take” without even look on how stupid was that decision in the first place. This extreme confidence in Oracle drives me nuts…

Popularity: 10% [?]


gzip madness
April 9th, 2008 under Devel, Unix/Linux, rengolin, Computers. [ Comments: 3 ]

Another normal day here at EBI when I change a variable called GZIP from local to global (via export on Bash) and I got a very nice surprise: all my gzipped files have gzip itself as a header!!!

Let me explain… I have a makefile that, among other things, gzip some files. So, I’ve created a variable called GZIP that is the same as “gzip –best –stdout” and on my rules I do:

%.foo : %.bar
$(GZIP) < $< > $@

So far so good, always worked. But I had a few makefiles redefining the same command, so I though: why not make an external include file with all shared variables? I could use the @include for makefiles but I also needed some of those variables for shell scripts as well, so I decided to use “export VARIABLE” for all make variables (otherwise they aren’t caught) and called it a day. That’s when everything started failing…

gzip environment

After a while digging the problem (I was blaming the poor LSF on that) I found that when I hadn’t the GZIP variable defined all went well, but by the moment I defined GZIP=”/bin/gzip –best –stdout” even a plain call to gzip was corrupted (ie. had the binary gzip as a header).

A quick look on gzip’s manual gave me the answer… GZIP is the environment variable that gzip stores all default options. So, if you say that GZIP=”–best –stdout”, every time you call gzip it’ll use those parameters by default.

So, by putting “gzip” on the parameter list, I was always running the following command:

$ /bin/gzip /bin/gzip --best --stdout < a.foo > a.bar

and putting a compressed copy of gzip binary together with a.foo into a.bar.

What a mess can a simple environment variable do…

Popularity: 11% [?]


Serial thinking
March 11th, 2008 under Fun, Devel, Algorithms, rengolin, Computers, Physics. [ Comments: 2 ]

I wonder why the human race is so tied up with serial thinking… We are so limited that even when we think in parallel, each parallel line is serial!

What?

Take the universe. Every single particle in the universe know all the rules (not many) that they need to follow. On themselves, the rules are dumb: you have weight, charge and can move freely round the empty space. But join several particles together and they form a complex atom with much more rules (combined from the first ones) that, if combined again form molecules that form macro-molecules that form cells that form organs that form organisms that form societies etc. Each level makes an exponential leap on the number of rules from the previous one.

Than, the stupid humanoid looks at reality and says: “That’s too complex, I’ll do one thing at a time”. That’s complete rubbish! His zillions of cells are doing zillions of different things each, his brain is interconnecting everything at the same time and that’s the only reason he can breathe wee and whistle at the same time.

Now take machines. The industrialization revolutionized the world by putting one thing after the other, Alan Turing revolutionized the world again by putting one cell after the other in the Turing tape. Today’s processors can only think of one thing after the other because of that.

Today you have multi-core processors doing different things but still each one is doing things in serial (Intel’s HyperThreading is inefficiently working in serial). Vector processors like graphic cards and big machines like the old Crays were doing exactly the same thing over a list of different values and Quantum computers will do the same operation over an entangled bunch of qbits (which is quite impressive) but still, all of it is serial thinking!

Optimization of code is to reduce the number of serial steps, parallelization of code is to put smaller sets of serial instructions to work at the same time, even message passing is serial on each node, the same with functional programming, asynchronous communications, everything is serial at some point.

Trying to map today’s programming languages or machines to work at the holographic level (such as the universe) is not only difficult, it’s impossible. The Turing machine is serial by concept, so everything built on top of it will be serial at one point. There must be a new concept of holographic (or fractal) machine, where each part knows all rules but only with volume you can create meaningful results, where code is not done by organizing the high-level rules but by creating a dynamic for the simple rules that will lead to the expected result.

How then?

Such holographic machine would have a few very simple “machine instruction” like “weight of photon is 0×000″ or “charge of electron is 1.60217646 × 10^-19″ and time will define the dynamics. Functions would be a pre-defined arrangement of basic rules that must be stable, otherwise it’d blow up (like too many protons in the nucleus), but it wouldn’t blow up the universe (as in throw exceptions), it would blow up the group itself and it would become lots of smaller groups, up to the indivisible particle.

The operating system of such machine should take care of the smaller groups and try to keep the groups as big as possible by rearranging them in a stable manner, pretty much as a God would do to it’s universe when it goes crazy. Programs running on this operating system would be able to use God’s power (GodOS libraries) to manipulate the groups at their own discretion, creating higher beings, able to interact, think and create new things… maybe another machine… maybe another machine able to answer the ultimate question of Life, the Universe and Everything.

I know letting the machine live would be the proper way of doing it but that could take a few billion years or I’ll be quite tired of engineering the machine and it’s OS and I’ll just want to the the job done quickly after that…

Why?

There is a big fuzz about Non-Polynomial time problems (NP-complete), those that can’t be solved in a reasonable (polynomial) time. The classic example is the travelling salesman problem where a salesman has to go to each one of a number of cities. Which is the best path to follow to visit all of them in the smallest distance possible? With 3 or 4 it’s quite simple but when you have lots like 300 it becomes impossible for normal (serial) computers to solve.

Another problem quite fancy is the Steiner tree problem, where you have some points and you want to connect them using the least amount of strings. This is as complex as the problem above, can take forever (longer than the age of the universe) for relatively small sets of points, but if you use water and soap the problem is solved almost instantly.

Of course, soap films cannot calculate the last digit of PI but because every part of it know a small list of basic rules (surface tension increased by the soap molecules derived from opposite charges between atoms) every particle of the machine works together at the same time and the result is only achieved because the dynamic of the system has it’s least energy (least amount of strings) in that state.

It’s true that today’s computers are very efficient on working on a wide range of problems (thanks to Turing proving the classes of problems his tape could solve) but there are some that it can’t, given that we only have a few billion years yet of universe to spare. Such problems could be solved if there was a holographic machine.

UPDATE:

More or less what I said was practically applied here. Thanks André for the link, this video is great!

Popularity: 17% [?]


Bioinformatics and its problems
February 21st, 2008 under Devel, rengolin, Computers, Biology, Bioinformatics. [ Comments: none ]

For the last two months I’ve been writing a text about software quality in bioinformatics and the first part is done: I finally finished the basic concepts and tasks on why and how to perform software quality assurance in bioinformatics.

The big reasons why I focused in bioinformatics are:

  1. I’m working in a bioinformatics institute
  2. Bioinformatics has LOTS of problems

If you liked the first part (link just above) or would like to know more about my solutions and ideas keep reading. Use the root link as your entry point and go reading by chapter.

I couldn’t do the next/previous links as the wiki software doesn’t have this automatically and I didn’t want to hard-code it in the text (it’s a software quality text, isn’t it?).

Disclaimer:

  • The text was written very fast, you’ll probably find lots of incoherent phrases and grammar errors, ignore them for now as I’m re-reading and re-writing everything.
  • I’ve put more than I think I should and am now filtering what’s worth staying. I might also add a few more new things.
  • Most code samples won’t work, they’re a simplified language for clarification only.

Do let me know if you think you could add something I forgot or disagree on any concept, the text is in a very immature state yet.

Popularity: 11% [?]


« Previous entries 

Close
E-mail It