Over the past few years, I have directed my research and teaching efforts in the new area called bioinformatics (a term preferred by biologists) or Computational Biology (as computer scientists usually call it). I was initially attracted to this new subject after examining the work of David Searls [Sea 92], wherein he outlined the relationship between formal languages and biology. I became fascinated by the topic of parsing DNA and amino acids sequences for the purpose of determining genes and protein shape.
Most of the work done in parsing and compiling assumes deterministic grammars; this is largely insufficient to describe biological models. Grammars proposed for determining genes and the secondary structure of proteins are highly ambiguous. In addition, one has to handle very large strings and, due to variations in the strings being analyzed, a probabilistic approach is in order.
One of my doctoral students, Olivier Baby, wrote his dissertation on non-deterministic FSGs that define all possible alternations of exons and introns [Bab 98]. This is an important problem in gene finding. It turns out that ambiguous FSGs are very useful in defining those alternations. Unfortunately, the existing work in formal languages neglects this important area. The significant work in FSGs establishes that non-deterministic FSGs — and therefore ambiguous grammars — have equivalent deterministic counterparts. In the case of ambiguous grammars, that counterpart is obtained by a transformation that annihilates all vestiges of multiple parsers.
The important consideration in this case is: given an ambiguous FSA and a string it accepts, it is always possible to generate (in linear time) a DAG whose paths represent all possible parses of that string. The main problem then becomes finding the most likely (optimal) path in the DAG. That is the essence of Hidden Markov Models (HMMs) that have been widely used in speech recognition and now in bioinformatics. It requires the use of probabilistic and dynamic programming methods.
It turns out that a similar kind of technique is useful in parsing ambiguous CFGs. For example, RNA shape can be defined by highly ambiguous probabilistic CFG grammar and the goal of parsing is the transformation of those grammars into DAGs that can later be analyzed for optimal paths using dynamic programming.
In [Coh 99c], I summarize my views about the usefulness of logic programming (LP) and constraint LP (CLP) in solving some computational biology problems. I have also started to investigate the usage of machine learning (ML), in particular the methods used in Inductive LP (ILP). What is needed is the introduction of probabilistic approaches in ML to deal adequately with the interesting problems in bioinformatics.
Presently, there are two major problems that interest me:
- The inverse protein-folding problem: given a sequence of amino acids representing a known three-dimensional protein structure, determine sequences that are likely to have the same structure. The so-called threading approach belongs to this category of algorithms. A paper by Kleinberg [Kle 99] explores this topic from a theoretical operations research perspective.
- The reverse problem of cell regulation, also called modeling: given the micro-array data describing curves specifying how protein production or gene-expression varies with time, determine the regulation graph that governs cell behavior. Ideally, one would want to determine metabolic-pathways, which are essentially generalizations of the regulation graphs that take into account detailed interactions between the environment of a cell and its inner workings.
My paper [Coh 01b] explores how constraints may be used to generate regulation graphs from micro-array data.
In the spring of 2002, I was invited to co-teach a joint course in Bioinformatics offered to Brandeis and Wellesley College students. That course enabled me to establish closer ties with two biologists: Dagmar Ringe from Brandeis and Andrew Webb from Wellesley. I am a believer that work in bioinformatics depends heavily on the collaboration between researchers in both biology and in computer science. Interacting with biologists is among my highest priorities.
The 2011 paper on cell regulation [CRC 11] shows how novel concepts in discrete mathematics (e.g., eigenvalues) are useful in formulating the reverse engineering problem of inferring genetic networks from laboratory data. This work is related to that described in [Coh 01b].
As of April of 2014, the ACM Surveys paper on introducing Bioinformatics to Computer Scientists [Coh 04a] has had about 14,000 downloads from the ACM digital library. This high rate of activity demonstrates the increasing interest computer scientists have in the interdisciplinary area of computational molecular biology. My paper on guidelines for establishing undergraduate courses in bioinformatics [Coh 03b] has also attracted curriculum developers in both biology and computer science.
It should not be surprising that concepts in computer science, such as constraints and parsing, are playing key roles in bioinformatics. My own experience is that having an interdisciplinary approach to problems considerably widens the originality and scope of the research in any field.
References
[Sea 92] D. Searls, The Linguistics of DNA, American Scientist, 80, 579-591, 1992.
[Kle 99] J. Kleinberg, Efficient Algorithms for Protein Sequence Design and the Analysis of Certain Evolutionary Fitness, Proceedings of the 3rd ACM RECOMB International Conference on Computational Molecular Biology, 1999.