Abstract: Today, it is the amount of available data rather than its acquisition that poses a significant challenge to computer science. The difficulty lies in extracting valuable and useful information from ever-increasing data volumes, and it has manifested itself throughout most modern global digital frameworks, such as economic analysis, weather forecasting, and, indeed, national security, as publicized in the 2013 NSA affair.

This problem is especially pronounced in the field of bioinformatics, where it is compounded by the rapid progress in DNA and protein sequencing over the last 20 years.

Methods like next-generation sequencing provide a low-cost experimental approach, and their wide adoption has led to an exponential growth in biological data. The frontier, so to speak, is now the attempt to understand all this data well enough to make it useful in a variety of scientific and industrial contexts, including, but not limited to, evolutionary biology, biochemistry, pharmaceutics and medicine. One possible way to deal with this obstacle is to divide the computation in question into as many independent parts as possible and then compute those parts on many machines. This idea of distributed computing can be seen in many examples, like Rosetta or Folding, both using the BOINC framework (Berkeley Open Infrastructure for Network Computing) to distribute their computations. A second approach, which may also be used in tandem with distributed computing, is to parallelize the computation across multiple cores, be it on the CPU or the GPU. This concept is becoming increasingly important as multi-core processors become more widely available and the number of cores on common chipsets grows.

In the field of bioinformatics, the problem of data volumes exceeding the available computational capacity appears in different forms: in the analysis of genome-wide association studies, in genome-wide searches for ORFs, and in database-wide comparisons of all sorts of biological data, especially when an all-against-all comparison is required. One such instance, which requires comparing all known protein sequences to each other, is the SIMAP project.
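
As a rough illustration of dividing an all-against-all comparison into independent parts, the following minimal Python sketch enumerates every unordered pair of sequences once, splits the pairs into chunks that do not depend on each other, and scores each chunk separately. It is only a sketch under stated assumptions and not the method used by SIMAP or any BOINC project: the `similarity` score is a hypothetical stand-in for a real alignment, and the names `score_chunk`, `all_against_all`, `n_chunks` and `map_fn` are introduced here purely for illustration.

```python
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Toy score (an assumption, not a real alignment):
    count the positions at which both sequences carry the same residue."""
    return sum(x == y for x, y in zip(a, b))


def score_chunk(chunk):
    """Score one independent chunk of ((i, seq_i), (j, seq_j)) pairs."""
    return [(i, j, similarity(a, b)) for (i, a), (j, b) in chunk]


def all_against_all(seqs, n_chunks=4, map_fn=map):
    """Compare every sequence with every other one exactly once.

    The n*(n-1)/2 pairs are dealt round-robin into n_chunks lists.
    No chunk needs results from another, so map_fn may be any
    map-like callable, sequential or parallel.
    """
    pairs = list(combinations(enumerate(seqs), 2))
    chunks = [pairs[k::n_chunks] for k in range(n_chunks)]
    results = map_fn(score_chunk, chunks)
    # Merge the partial results into one list ordered by (i, j).
    return sorted(r for part in results for r in part)
```

With the default `map_fn=map` the chunks are processed one after another. Because the chunks are independent, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` would be one way to spread them over several cores, and the same partitioning could in principle be handed out to separate machines.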

Author/s: Sylvain Henry (1), Alexandre Denis (2), Denis Barthou (3), Marie-Christine Counilh (3), and Raymond Namyst (3)
(1) Exascale Computing Research Laboratory, France
(2) Inria Bordeaux
(3) Univ. of Bordeaux, France
Article: http://techoverflow.net/publications/CLSW-Report.pdf
Source/Type: Gobi Report