
Scientific information technology

Published on 8 September 2016

The Scientific Information Technology Laboratory consists of 3 teams that work across all the facilities of the Institute of Genomics. The Laboratory's missions focus on system and network architecture, the development and operation of production management systems, and the management of massive sequencing data. For more details, see Scientific Information Technology Laboratory, Genoscope.

The high-throughput sequencers are the essential components of the Institute's scientific IT. Genoscope and the CNG use them for their numerous applications. Each of the Institute's 17 instruments (installed at the IG in early 2015) can generate 1.5 TB per month, i.e. more than 25 TB of sequencing data (after compression) per month for the full set of instruments.
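
As a quick check of the figures above, the aggregate volume can be worked out in a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the yearly figure assumes that every instrument runs at full capacity all year, which the text does not state.

```python
# Figures taken from the text; the names are illustrative only.
INSTRUMENTS = 17
TB_PER_INSTRUMENT_PER_MONTH = 1.5

# Aggregate monthly volume for the full set of instruments.
monthly_tb = INSTRUMENTS * TB_PER_INSTRUMENT_PER_MONTH

# Assumption: every instrument runs at full capacity for 12 months.
yearly_tb = monthly_tb * 12

print(f"Monthly volume: {monthly_tb} TB")
print(f"Yearly volume at full capacity: {yearly_tb} TB")
```

The monthly total of 25.5 TB is consistent with the "more than 25 TB" quoted above.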

The IT system therefore has to store the data with the required level of security and make them efficiently available to computations for both reading and writing, while ensuring adequate traceability of the objects, whose total volume is 1,000 TB (1 PB).

The processing mainly belongs to the category of applications known as 'data-intensive'. These applications are characterized by the use of large quantities of data, which are read, written and modified by programs that filter them, evaluate their quality and compare them with already known data (by direct comparison or by statistical classification methods, which may or may not be supervised). The processing may also be considered 'compute-intensive'.

In all cases, run times are long because of the quantity of data to be processed or the complexity of the algorithms. Run times are also frequently difficult to predict, since they depend on the data themselves, which are the unknown in the equation. This adds a further difficulty to the operation of the computation clusters.

The applications mainly fall into 2 categories:

  1. Codes suited to a massively parallel model without a strong requirement for synchronization ('embarrassingly parallel'). The aim is to exploit a property of the data that allows the input to be divided, and then to apply an algorithm to each fraction of the input. A necessary condition is that the result of the calculation be independent of the initial slicing.

  2. Applications akin to graph traversal or to the building of word dictionaries, which require a very large address space (from a few hundred GB to a few TB) and a run time that may exceed 1 month. These applications require specialized machines equipped with considerable random-access memory (RAM).
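
The first category can be illustrated with a minimal Python sketch. It is not the Laboratory's code: the function names, the toy reads and the choice of a per-base count as the algorithm are all hypothetical, and a thread pool stands in for what would more plausibly be separate processes or cluster jobs. The point is only to show the input being sliced, each slice being processed without any synchronization, and the result being independent of the slicing.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, chunk_size):
    """Slice the input into independent chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_bases(chunk):
    """Per-chunk work: count each character; needs no data from other chunks."""
    counts = Counter()
    for read in chunk:
        counts.update(read)
    return counts


def parallel_count(records, chunk_size, workers=4):
    """Process every chunk concurrently, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_bases, split(records, chunk_size)):
            total += partial
    return total


# Hypothetical toy input standing in for sequencing reads.
reads = ["ACGT", "GGCA", "TTAC", "CGCG", "AATT"]

# Necessary condition from the text: the result must not depend on the slicing.
assert parallel_count(reads, chunk_size=1) == parallel_count(reads, chunk_size=3)
```

Because the merge step is a simple sum of counters, the chunks can be processed in any order and on any number of workers, which is what makes this kind of code easy to spread over a cluster.
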

The information technology environment, both computation and storage, is sized to handle the primary processing of the data. The environment also hosts a number of genome exploration tools for exploiting project data. In addition, a strategic partnership has been established with the TGCC (Very Large Computation Center), the CEA's computation center at Bruyères-le-Châtel, where dedicated storage (5 PB) and computation (3,000 cores) capacities have been deployed in the context of France Génomique. The dedicated installation is an extension of the CCRT's Airain computer (20,000 cores, 420 TFLOPS). The TGCC also houses the Curie computer (100,000 cores, 2 PFLOPS), which is used in particular in the context of PRACE calls for projects. The value of this setup lies in the central position of the data, which are transparently accessible to both computers.

The Laboratory pays special attention to the portability of codes and pipelines between the two environments, local and France Génomique.

The 3 teams host trainees and apprentices, and regularly offer temporary or permanent positions.