The technological Singularity is the moment beyond which "technological progress will become incomprehensibly rapid and complicated." Hmmm. That sounds like bioinformatics.
Surviving the Singularity requires reducing complexity. This was the topic of a recent three-day CyVerse Container Camp hosted at the University of Arizona in Tucson, AZ. I attended the camp as part of Digital World Biology's effort to design the advanced bioinformatics portion of Shoreline Community College's Immuno-biotechnology certificate. From what I learned, containers will have a significant role in bioinformatics computing and in teaching data science in biology.
What are containers?
When we think about challenges in bioinformatics, dealing with the large amounts of data is often a first consideration. However, an even larger challenge relates to the difficulties encountered with bioinformatics software. In order to extract information from raw data, like DNA and protein sequences, multistage bioinformatics processes are used. These processes combine many programs, raw data, and information, together with reference data, in complex workflows and pipelines. Challenges in reproducing bioinformatics analyses are compounded by the fact that the programs themselves are difficult to install on computers. They rely on software libraries, compilers, manifest and other files, and environment variables (collectively called dependencies) that are assumed to be available and, thus, are often poorly documented. Even with documentation, gaps persist when it comes to details, and differences between operating systems, and between versions of the same operating system, can thwart attempts to get even one program or process running.
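To make the idea of assumed dependencies concrete, here is a minimal Python sketch that checks whether the programs and environment variables a pipeline expects are actually present on a computer. The function name and the example names (`blastn`, `samtools`, `BLASTDB`) are illustrative assumptions, not the requirements of any particular pipeline.

```python
import os
import shutil


def find_missing_dependencies(programs, env_vars):
    """Return (programs not found on PATH, environment variables that are unset)."""
    missing_programs = [p for p in programs if shutil.which(p) is None]
    missing_env_vars = [v for v in env_vars if v not in os.environ]
    return missing_programs, missing_env_vars


if __name__ == "__main__":
    # Hypothetical requirements for an illustrative pipeline.
    programs, env_vars = find_missing_dependencies(
        ["blastn", "samtools"], ["BLASTDB"]
    )
    print("Missing programs:", programs)
    print("Missing environment variables:", env_vars)
```

A check like this only covers what someone remembered to list. Library versions, compilers, and required files would need their own checks, which is part of why installation is so fragile.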
One solution to the above problem has been to compile programs in ways that include the needed libraries (static compilation). These programs are delivered as binary executables. Compiling a program with its libraries helps, but it must be done for every supported operating system, and any other dependencies, such as environment variables or required files, must still be documented and managed externally.
Another alternative has been to package programs and their dependencies within virtual machines (VMs). VMs are essentially computers (guests) that run within a computer (host). As a result, VM images are large (the size of operating system installers), start up slowly, and are inefficient because they commonly require all resources, such as RAM and storage space, to be contained within the running instance of the image. Directory mounts can reduce some of these requirements, but only by increasing operational complexity. Also, because a VM is a computer, a user must log in to the virtual machine to run any programs within it.
A third alternative, called containerization, allows one to package one or more programs together with everything that is needed to run those programs. These components include a minimal operating system, operational dependencies, and files, if desired. The resulting packages, called images, can then be run as containers via a daemon (an always-running program) that communicates with the host operating system. While first described in the late 1970s and pioneered in the early 2000s, containers did not become popular until recently (ca. 2013), when Docker improved container portability in ways that had not been done before. A strength of containers is that they let an application developer work in a local environment, like a laptop computer, and then later deploy the container in large-scale production environments such as computer clusters or supercomputers.
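As a rough illustration of what running a packaged program looks like in practice, the Python sketch below assembles a `docker run` command that starts a container from an image, mounts a host directory for data, and runs a program inside it. The image name `example/aligner:1.0`, the `aligner` program, and the paths are hypothetical placeholders. The function only builds the command and does not execute it.

```python
import shlex


def build_docker_run_command(image, command, host_dir, container_dir="/data"):
    """Build the argument list for running `command` inside a container of `image`.

    The host directory is bind-mounted into the container so the program can
    read input and write output without the data being copied into the image.
    """
    return [
        "docker", "run",
        "--rm",                               # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}",  # bind-mount the host data directory
        "-w", container_dir,                  # working directory inside the container
        image,
        *command,
    ]


if __name__ == "__main__":
    argv = build_docker_run_command(
        image="example/aligner:1.0",                    # hypothetical image name
        command=["aligner", "--input", "reads.fastq"],  # hypothetical program
        host_dir="/home/user/project",                  # hypothetical host path
    )
    print(shlex.join(argv))
```

On a computer with Docker installed, the resulting list could be passed to `subprocess.run` to actually start the container. The same command would work unchanged on a laptop or a cluster node, which is the portability the paragraph above describes.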
The first applications of containers supported web development, but it did not take long for scientific software developers to recognize their power. In bioinformatics, containers are catching on. BioContainers has 107 bioinformatics Docker images, and Docker Hub lists 135 images with bioinformatics in the name. Docker Hub, while useful for storing and retrieving containers by name, is limited in how it organizes them.
Containers are cool and awesome; they are also complicated. Hence, to better enable scientific computing and the use of high-performance computing (HPC) resources funded by the NSF (National Science Foundation), the team at CyVerse (an NSF Cyberinfrastructure for the Life Sciences) hosted a three-day workshop on using container technology. Through the workshop we learned some of the ins and outs of Docker, Singularity, and the Pegasus and Makeflow high-throughput workflow systems.
A first step was to distinguish between containers and virtual machines. As noted above, the goal of containers is to standardize computing so that a program, or process, can be created in one environment, like a laptop computer, and then be run in another environment, like a large computer cluster, where the operating system and other details will not match the laptop. The term container stems from a cargo shipping analogy. Up until about 1960, goods were shipped in their own packages, such as sacks, boxes, and crates, which meant that cargo had to be loaded and unloaded in individualized ways. Transporting cargo by ship, train, or truck was tedious and expensive. Shipping containers changed the game by standardizing the way that cargo was packaged, using stackable boxes that could be more easily shipped by sea and on land. They could also be converted to trendy Starbucks stores.
Applying the above analogy to bioinformatics software, the programs represent the cargo. Cargo items can be single programs, or a pipeline of interoperating programs and reference data. As advances in instrumentation and lab processes give us new kinds of data and questions, the bioinformatics cargo proliferates in shape and size. Attempting to package these programs as statically compiled binary executables creates a plethora of individualized solutions that make work tedious and inefficient. VMs improve on binary executables in that they package both programs and their dependencies. In our cargo shipping analogy, the VM package is both the cargo and the truck. Their overhead (figure below) and other inefficiencies limit the ability of VMs to run anywhere. For example, a laptop cannot run more than one VM very well. Containers, on the other hand, interact with a common software layer and can be operated in common ways using a standard command line interface. Like shipping containers, software containers can also be stacked in ways that address new questions as they arise.
The real work then lies in developing what goes in the container.
Acknowledgment of Support: This material is based in part on work supported by the National Science Foundation under Grant Number DUE 1700441. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.