Software
HipMer and MetaHipMer
Extreme Scale De Novo Genome and MetaGenome Assembler
MetaHipMer is the first high-quality end-to-end de novo metagenome assembler designed for extreme-scale data sets on distributed-memory HPC systems. It is based on an earlier single-genome version, HipMer, which was itself based on the single-node Meraculous assembler. MetaHipMer is a PGAS application and, following a major rewrite of the code (sometimes called MHM2), its main software dependencies are the UPC++ programming system and the underlying GASNet-EX communication layer. GPU acceleration is also supported, with a dependency on CUDA and HIP. MetaHipMer’s high performance rests on several novel algorithmic advances that leverage the efficiency and programmability of UPC++'s one-sided communication and remote procedure calls (RPCs), including optimized high-frequency k-mer analysis, communication-avoiding de Bruijn graph traversal, advanced I/O optimization, and extensive parallelization across the numerous and complex application phases.
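As a rough illustration of two of the phases named above, k-mer analysis and de Bruijn graph traversal, the following single-process Python sketch counts k-mers, keeps those seen at least twice, and extends a seed k-mer to the right while the next k-mer is unambiguous. It is not MetaHipMer's distributed UPC++ implementation: the function names, the `min_count` threshold, and the toy reads are assumptions made for this example, and reverse complements and predecessor checks are left out for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_contig(seed, kmers):
    """Walk right from `seed` while exactly one successor k-mer exists."""
    contig = seed
    current = seed
    visited = {seed}
    while True:
        suffix = current[1:]
        successors = [suffix + base for base in "ACGT" if suffix + base in kmers]
        # Stop at a dead end, a fork, or a cycle back into the contig.
        if len(successors) != 1 or successors[0] in visited:
            break
        current = successors[0]
        visited.add(current)
        contig += current[-1]
    return contig


# Hypothetical toy reads; real inputs are billions of sequencing reads.
reads = ["ACGTTGCA", "GTTGCATC", "TGCATCGG"]
min_count = 2  # assumed threshold for discarding likely-erroneous k-mers

counts = count_kmers(reads, k=4)
solid = {kmer for kmer, c in counts.items() if c >= min_count}
contig = extend_contig("GTTG", solid)
print(contig)
```

In a distributed assembler the k-mer table is partitioned across nodes, which is where one-sided communication and RPCs come in; this sketch keeps everything in one local dictionary.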
Authors
The primary authors of MHM2 are Steven Hofmeyr, Rob Egan and Muaaz Awan. The authors of the original MetaHipMer and HipMer are Evangelos Georganas, Steven Hofmeyr, Aydin Buluc, Rob Egan and Eugene Goltsman. Leonid Oliker and Kathy Yelick have provided direction and advice throughout the development process. The original Meraculous was developed by Jarrod Chapman, Isaac Ho, Eugene Goltsman, and Daniel Rokhsar.
Software
The latest release of MetaHipMer (version 2.1.0.2, released January 2023) is now available here.
Contact
For more information about MetaHipMer, contact Steven Hofmeyr.
HipMCL
Distributed-Memory Protein Clustering using High-Performance MCL
HipMCL is a high-performance parallel algorithm for large-scale network clustering. HipMCL parallelizes the popular Markov Cluster (MCL) algorithm, which has been shown to be one of the most successful and widely used algorithms for network clustering. MCL is based on random walks and was initially designed to detect families in protein-protein interaction networks. Despite MCL’s efficiency and multi-threading support, scalability remains a bottleneck: it fails to process networks of several hundred million nodes and billions of edges in an affordable running time. HipMCL overcomes these challenges by developing massively parallel algorithms for all components of MCL. HipMCL can be up to 1000 times faster than the original MCL without any information loss. It can cluster a network of ~75 million nodes with ~68 billion edges in ~2.4 hours using ~2000 nodes of the Cori supercomputer at NERSC. HipMCL is written in C++ and uses the standard OpenMP and MPI libraries for shared- and distributed-memory parallelization.
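For readers unfamiliar with MCL, the sketch below shows the serial algorithm that HipMCL parallelizes, in its textbook form: add self-loops, make the matrix column-stochastic, then alternate expansion (matrix squaring, which simulates random walks) and inflation (element-wise powering followed by re-normalization, which strengthens strong flows). It is a dense, pure-Python illustration under assumed parameters (`inflation=2.0`, a fixed iteration count, a `1e-6` cutoff), not HipMCL's sparse distributed code.

```python
def normalize_columns(m):
    """Scale each column in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl(adj, inflation=2.0, iterations=50):
    """Textbook Markov Cluster algorithm on a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops, then convert to a column-stochastic matrix.
    m = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two steps of the random walk).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then re-normalize each column.
        m = [[x ** inflation for x in row] for row in m]
        normalize_columns(m)
    # Each row with surviving entries is an attractor; its nonzero
    # columns are the members of that attractor's cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy network: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

clusters = mcl(adj)
print(clusters)
```

The expansion step is a sparse matrix-matrix product in practice, which is the part that dominates cost at scale and the natural target for distributed-memory parallelization.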
Authors
Primary authors are Ariful Azad and Aydin Buluc, in collaboration with Georgios Pavlopoulos (JGI), Nikos Kyrpides (JGI) and Christos Ouzounis (CERTH).
Software
The first release of HipMCL (1.0.0) is now available. Download from Bitbucket.
Contact
For more information about HipMCL, contact Ariful Azad.
MerBench
Microbenchmarks for Measuring Asynchronous Collective Communication Performance
MerBench is a set of microbenchmarks originally developed for analyzing the performance of the primary communication patterns implemented in HipMer, an extreme-scale de novo genome assembler. One of the keys to HipMer’s high performance is its use of the one-sided communication capabilities of Unified Parallel C (UPC) for asynchronous Alltoall and Alltoallv communication. These benchmarks distill those essential communication patterns and parameters (e.g. message size) for cross-architecture and cross-application network performance analysis.
Authors
The primary authors are Evangelos Georganas, Rob Egan, and Marquita Ellis. Evangelos Georganas developed the original version of the microbenchmarks for analyzing HipMer. Rob Egan contributed a number of extensions for usability and cross-platform portability.
Software
The microbenchmarks are available as part of HipMer on SourceForge, and as a standalone release.
Contact
For more information about MerBench, contact Marquita Ellis.
Metamer
Metagenome Clustering based on K-Mer Signatures
Metamer is a workflow tool that takes in multiple next-generation sequencing metagenome samples, calculates pairwise distances based on their k-mer content, and then clusters the samples based on those distances. The framework will help provide structure to available metagenome samples, which will be essential in generating a metagenome-based database for characterization of metagenomes.
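The workflow described above can be sketched in a few lines of Python. The source does not say which distance measure or clustering method Metamer uses, so this example assumes Jaccard distance between k-mer sets and simple single-linkage grouping under a distance threshold; the function names, sample names, and parameter values are all hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_samples(samples, k=3, threshold=0.5):
    """Pairwise k-mer distances plus single-linkage clusters (assumed method)."""
    names = sorted(samples)
    profiles = {
        name: set().union(*(kmer_set(read, k) for read in samples[name]))
        for name in names
    }
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    distances = {}
    for a, b in combinations(names, 2):
        d = jaccard_distance(profiles[a], profiles[b])
        distances[(a, b)] = d
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return distances, sorted(groups.values())


# Hypothetical samples, each a list of reads.
samples = {
    "soil_a": ["ACGTACGT"],
    "soil_b": ["ACGTACGA"],
    "gut_a": ["TTTTGGGG"],
}
distances, clusters = cluster_samples(samples)
print(clusters)
```

Here the two soil samples share four of five distinct 3-mers and fall into one cluster, while the gut sample shares none and stays on its own.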
Authors
The primary authors are Migun Shakya and Patrick Chain.
Software
Metamer is written primarily in Python.
Contact
For more information about Metamer, contact Migun Shakya.
diBELLA 2D
Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly
diBELLA 2D is a distributed-memory version of BELLA based on MPI and the CombBLAS library. It uses BELLA's overlapping methodology and completes it by adding a transitive reduction step performed through algebraic operations. Future work includes a repeat resolution step and a scaffolding step to obtain an end-to-end distributed-memory long-read assembler.
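To give a sense of what "transitive reduction through algebraic operations" can mean, the sketch below removes redundant edges from a small directed acyclic overlap graph using only boolean matrix products. This is a simplified stand-in, not diBELLA 2D's actual formulation: it assumes an unweighted acyclic graph held as a dense matrix, and ignores overlap lengths, read orientation, and sparse distributed storage. All function names are made up for the example.

```python
def bool_matmul(a, b):
    """Boolean matrix product: result[i][j] is True if some k links i to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reachability by paths of length >= 1, via repeated boolean squaring."""
    reach = [row[:] for row in adj]
    n = len(adj)
    while True:
        step = bool_matmul(reach, reach)
        merged = [[reach[i][j] or step[i][j] for j in range(n)]
                  for i in range(n)]
        if merged == reach:
            return reach
        reach = merged


def transitive_reduction(adj):
    """Drop edge i->j when a longer path i->...->j exists (DAG input assumed)."""
    n = len(adj)
    # An edge followed by any path gives all paths of length >= 2.
    longer = bool_matmul(adj, transitive_closure(adj))
    return [[adj[i][j] and not longer[i][j] for j in range(n)]
            for i in range(n)]


# Toy overlap graph: reads 0..3 laid out in order, with redundant overlaps.
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (0, 3), (1, 3)]
n = 4
adj = [[False] * n for _ in range(n)]
for i, j in edges:
    adj[i][j] = True

reduced = transitive_reduction(adj)
kept = [(i, j) for i in range(n) for j in range(n) if reduced[i][j]]
print(kept)
```

Only the chain of consecutive overlaps survives; the edges 0→2, 0→3, and 1→3 are implied by it and are dropped.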
Authors
The primary authors are Giulia Guidi, Aydin Buluc, Saliya Ekanayake, and Oguz Selvitopi.
Software
The first release is available at https://github.com/PASSIONLab/diBELLA.2D (master branch).
Contact
For more information about diBELLA 2D, contact Giulia Guidi.
BELLA
Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
BELLA is a computationally efficient and highly accurate long-read to long-read aligner and overlapper. It is written in C++ and is currently implemented for shared-memory, single-node execution using OpenMP.
Authors
The primary authors are Giulia Guidi and Aydin Buluc.
Software
The first release is available at https://github.com/PASSIONLab/BELLA (master branch).
Contact
For more information about BELLA, contact Giulia Guidi.
PASTIS
Distributed Many-to-Many Protein Sequence Alignment
PASTIS is a fully distributed pipeline for large-scale protein similarity search. PASTIS constructs similarity graphs from large collections of protein sequences, which in turn can be used by a graph clustering algorithm to accurately discover protein families. A major novelty of PASTIS is its use of distributed sparse matrices as its underlying data structure. Sparse matrices store not only the sequences and their k-mers, but also the substitute k-mers that are critical for controlling sensitivity and specificity during sequence overlapping. PASTIS extensively hides communication and exploits the symmetry of the similarity matrix to achieve load balance. PASTIS has been demonstrated to scale up to 2025 nodes (137,700 cores), and its accuracy is on par with the state of the art.
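The sparse-matrix view described above can be illustrated with a small Python sketch. If A is a sequence-by-k-mer matrix, then entry (i, j) of A·Aᵀ counts the k-mers shared by sequences i and j, and because that product is symmetric only the upper triangle needs computing. The code below is an assumption-laden toy, not PASTIS itself: it uses dictionaries in place of distributed sparse matrices, omits substitute k-mers and the alignment stage, and its function names and example sequences are invented.

```python
from collections import defaultdict


def build_kmer_matrix(sequences, k):
    """Sparse sequence-by-k-mer matrix, stored by column: k-mer -> sequence ids."""
    columns = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            columns[seq[i:i + k]].add(idx)
    return columns


def shared_kmer_counts(columns):
    """Upper triangle of A * A^T: number of distinct k-mers shared by each pair."""
    counts = defaultdict(int)
    for members in columns.values():
        ordered = sorted(members)
        # Every pair of sequences in this column shares this k-mer.
        # Visiting only x < y exploits the symmetry of the product.
        for x in range(len(ordered)):
            for y in range(x + 1, len(ordered)):
                counts[(ordered[x], ordered[y])] += 1
    return dict(counts)


# Hypothetical protein fragments; the first two differ by one residue.
proteins = ["MKVLAAGIV", "MKVLSAGIV", "QWERTYHPN"]
columns = build_kmer_matrix(proteins, k=3)
overlaps = shared_kmer_counts(columns)
print(overlaps)
```

Pairs with a nonzero count are the candidates that a pipeline of this kind would pass on to alignment; pairs that share no k-mers never appear in the output, which is what keeps the similarity matrix sparse.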
Authors
The primary authors are Oguz Selvitopi, Saliya Ekanayake, and Aydin Buluc.
Software
The first release is available at https://github.com/PASSIONLab/PASTIS.
Contact
For more information about PASTIS, contact Oguz Selvitopi.