Software

HipMer and MetaHipMer

Extreme Scale De Novo Genome and MetaGenome Assembler

MetaHipMer is the first high-quality end-to-end de novo metagenome assembler designed for extreme-scale data sets on distributed-memory HPC systems. It is based on an earlier single-genome version, HipMer, which was itself based on the single-node Meraculous assembler. MetaHipMer is a PGAS (partitioned global address space) application; following a major rewrite of the code (sometimes called MHM2), its main software dependencies are the UPC++ programming system and the underlying GASNet-EX communication layer. GPU acceleration is also supported, with a dependency on CUDA and HIP. MetaHipMer’s high performance rests on several novel algorithmic advances that leverage the efficiency and programmability of UPC++ one-sided communication and RPCs, including optimized high-frequency k-mer analysis, communication-avoiding de Bruijn graph traversal, advanced I/O optimization, and extensive parallelization across the numerous and complex application phases.
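The k-mer analysis and de Bruijn graph traversal mentioned above can be illustrated with a minimal single-process sketch. This is not MetaHipMer's distributed implementation: the function names (count_kmers, solid_kmers, walk_contig) are hypothetical, reverse complements are ignored, and the frequency filter is only an assumed stand-in for the assembler's error handling.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (assumed error filter)."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


def neighbors(kmer, kmers, forward=True):
    """K-mers in the set that overlap `kmer` by k-1 bases on one side."""
    found = []
    for base in "ACGT":
        cand = kmer[1:] + base if forward else base + kmer[:-1]
        if cand in kmers:
            found.append(cand)
    return found


def walk_contig(seed, kmers):
    """Extend the seed to the right while the de Bruijn path is unambiguous.

    The walk stops at a dead end, at a fork (more than one successor),
    at a merge (successor with more than one predecessor), or on a cycle.
    """
    contig, current, visited = seed, seed, {seed}
    while True:
        successors = neighbors(current, kmers, forward=True)
        if len(successors) != 1:
            break
        nxt = successors[0]
        if nxt in visited or len(neighbors(nxt, kmers, forward=False)) != 1:
            break
        contig += nxt[-1]
        visited.add(nxt)
        current = nxt
    return contig
```

In a distributed setting the k-mer set would be partitioned across processes, so each lookup in the walk may become a remote access; reducing those accesses is what the communication-avoiding traversal refers to.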

Authors

The primary authors of MHM2 are Steven Hofmeyr, Rob Egan and Muaaz Awan. The authors of the original MetaHipMer and HipMer are Evangelos Georganas, Steven Hofmeyr, Aydin Buluc, Rob Egan and Eugene Goltsman. Leonid Oliker and Kathy Yelick have provided direction and advice throughout the development process. The original Meraculous was developed by Jarrod Chapman, Isaac Ho, Eugene Goltsman, and Daniel Rokhsar.

Software

The latest release of MetaHipMer (version 2.1.0.2 released January 2023) is now available here.

Contact

For more information about MetaHipMer, contact Steven Hofmeyr.

HipMCL

Distributed-Memory Protein Clustering using High-Performance MCL

HipMCL is a high-performance parallel algorithm for large-scale network clustering. It parallelizes the popular Markov Cluster (MCL) algorithm, which has been shown to be one of the most successful and widely used algorithms for network clustering. MCL is based on random walks and was initially designed to detect families in protein-protein interaction networks. Despite MCL’s efficiency and multi-threading support, scalability remains a bottleneck: it fails to process networks of several hundred million nodes and billions of edges in an affordable running time. HipMCL overcomes these challenges with massively parallel algorithms for all components of MCL. HipMCL can be up to 1000 times faster than the original MCL without any information loss. It can cluster a network of ~75 million nodes with ~68 billion edges in ~2.4 hours using ~2000 nodes of the Cori supercomputer at NERSC. HipMCL is written in C++ and uses standard OpenMP and MPI libraries for shared- and distributed-memory parallelization.
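For readers unfamiliar with MCL, the random-walk idea can be sketched as alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalization). The dense, serial code below is only an illustration under that textbook description; it is not HipMCL's sparse distributed implementation, and the function names, the inflation default, and the cluster-extraction threshold are assumptions.

```python
def normalize_columns(m):
    """Scale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def expand(m):
    """Expansion: one more random-walk step, i.e. the matrix product m * m."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster a small graph given as a dense 0/1 adjacency matrix."""
    n = len(adjacency)
    # Self-loops are added, as is common practice for MCL.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(max_iterations):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Each row with non-negligible entries lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)
```

The expansion step is a matrix-matrix product, which dominates the cost on large networks and is why parallel sparse matrix multiplication matters for scaling MCL.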

Authors

Primary authors are Ariful Azad and Aydin Buluc, in collaboration with Georgios Pavlopoulos (JGI), Nikos Kyrpides (JGI) and Christos Ouzounis (CERTH).

Software

The first release of HipMCL (1.0.0) is now available and can be downloaded from Bitbucket.

Contact

For more information about HipMCL, contact Ariful Azad.

MerBench

Microbenchmarks for Measuring Asynchronous Collective Communication Performance

MerBench is a set of microbenchmarks originally developed for analyzing the performance of the primary communication patterns implemented in HipMer, an extreme-scale de novo genome assembler. One key to HipMer’s high performance is its use of the one-sided communication capabilities of Unified Parallel C (UPC) for asynchronous Alltoall and Alltoallv communication. The benchmarks distill these essential communication patterns and parameters (e.g., message size) for cross-architecture and cross-application network performance analysis.
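The Alltoallv pattern can be pictured with a small single-process simulation: every rank packs one variable-length buffer per destination, and the exchange delivers buffer (src, dst) to rank dst. The sketch below only models the data movement; it uses no UPC or MPI, measures nothing, and the names owner, pack_send_buffers, and simulated_alltoallv, as well as the CRC-based owner mapping, are hypothetical.

```python
import zlib


def owner(item, n_ranks):
    """Map an item (e.g. a k-mer) to the rank that owns it (assumed hash)."""
    return zlib.crc32(item.encode("ascii")) % n_ranks


def pack_send_buffers(local_items, n_ranks):
    """Group one rank's local items into one buffer per destination rank."""
    buffers = [[] for _ in range(n_ranks)]
    for item in local_items:
        buffers[owner(item, n_ranks)].append(item)
    return buffers


def simulated_alltoallv(send):
    """Deliver send[src][dst] to rank dst; returns recv[dst][src].

    Buffers may have different lengths, which is what distinguishes
    Alltoallv from the fixed-size Alltoall.
    """
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("each rank needs one send buffer per destination")
    return [[list(send[src][dst]) for src in range(n)] for dst in range(n)]


# Three simulated ranks, each holding a few local k-mers.
local = [["ACGT", "CGTA"], ["GTAC", "ACGT"], ["TTTT"]]
send = [pack_send_buffers(items, 3) for items in local]
recv = simulated_alltoallv(send)
```

After the exchange, every item held by a rank is one that the rank owns, which is the redistribution step that buffer size and asynchrony parameters would be tuned around in a real benchmark.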

Authors

The primary authors are Evangelos Georganas, Rob Egan, and Marquita Ellis. Evangelos Georganas developed the original version of the microbenchmarks for analyzing HipMer. Rob Egan contributed a number of extensions for usability and cross-platform portability.

Software

The microbenchmarks are available as part of HipMer on SourceForge, and as a standalone release.

Contact

For more information about MerBench, contact Marquita Ellis.

Metamer

Metagenome Clustering based on K-Mer Signatures

Metamer is a workflow tool that takes in multiple next-generation sequencing metagenome samples, calculates pairwise distances based on their k-mer content, and then clusters the samples based on those distances. The framework helps provide structure to available metagenome samples, which will be essential in generating a metagenome-based database for characterization of metagenomes.
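The workflow described above (k-mer content, pairwise distance, clustering) can be sketched in a few lines. The source does not say which distance or clustering method Metamer uses, so the Jaccard distance on k-mer sets and the threshold-based single-linkage grouping below are assumptions, as are all function names and default parameters.

```python
from itertools import combinations


def kmer_set(sequences, k):
    """Collect the distinct k-mers found in a sample's sequences."""
    return {seq[i:i + k] for seq in sequences
            for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_samples(samples, k=4, threshold=0.5):
    """Group samples whose k-mer distance is at most `threshold`.

    `samples` maps a sample name to a list of sequences. Grouping is
    single-linkage via union-find (an assumed choice for illustration).
    """
    names = sorted(samples)
    profiles = {name: kmer_set(samples[name], k) for name in names}
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

For example, cluster_samples({"s1": ["ACGTACGT"], "s2": ["ACGTACGA"], "s3": ["TTTTGGGG"]}) groups s1 with s2 and leaves s3 on its own.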

Authors

The primary authors are Migun Shakya and Patrick Chain.

Software

Metamer is written primarily in Python.

Contact

For more information about Metamer, contact Migun Shakya.

diBELLA 2D

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

diBELLA 2D is a distributed-memory version of BELLA based on MPI and the CombBLAS library. It uses BELLA's overlapping methodology and completes it by adding a transitive reduction step performed through algebraic operations. Future work includes a repeat resolution step and a scaffolding step to obtain an end-to-end distributed-memory long-read assembler.
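One way to express transitive reduction algebraically is a sparse matrix product over the (min, +) semiring: the product of the overlap matrix with itself gives the shortest two-hop path between reads, and a direct edge is dropped when a two-hop path explains it. The sketch below illustrates that general idea on a dictionary-based sparse matrix; it is not the diBELLA 2D or CombBLAS code, and the edge weights (assumed to be overhang lengths), the fuzz tolerance, and the function names are all assumptions.

```python
def min_plus_square(a):
    """Sparse (min, +) product of `a` with itself.

    `a[i][j]` is the weight of edge i -> j. The result n[i][j] is the
    minimum of a[i][k] + a[k][j] over all intermediate reads k.
    """
    n = {}
    for i, row in a.items():
        for k, w_ik in row.items():
            for j, w_kj in a.get(k, {}).items():
                cand = w_ik + w_kj
                row_n = n.setdefault(i, {})
                if j not in row_n or cand < row_n[j]:
                    row_n[j] = cand
    return n


def transitive_reduction(a, fuzz=0):
    """Drop edge i -> j when some two-hop path i -> k -> j is no longer
    than the direct edge plus `fuzz` (assumed tolerance)."""
    two_hop = min_plus_square(a)
    reduced = {}
    for i, row in a.items():
        kept = {}
        for j, w in row.items():
            via = two_hop.get(i, {}).get(j)
            if via is None or via > w + fuzz:
                kept[j] = w
        reduced[i] = kept
    return reduced
```

With reads A, B, C overlapping in a chain, transitive_reduction({"A": {"B": 3, "C": 7}, "B": {"C": 4}}) removes the A to C edge because A to B to C has the same total length.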

Authors

The primary authors are Giulia Guidi, Aydin Buluc, Saliya Ekanayake, and Oguz Selvitopi.

Software

The first release is available at https://github.com/PASSIONLab/diBELLA.2D (master branch).

Contact

For more information about diBELLA 2D, contact Giulia Guidi.

BELLA

Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

BELLA is a computationally efficient and highly accurate long-read to long-read aligner and overlapper. It is written in C++ and is currently implemented for a single shared-memory node using OpenMP.

Authors

The primary authors are Giulia Guidi and Aydin Buluc.

Software

The first release is available at https://github.com/PASSIONLab/BELLA (master branch).

Contact

For more information about BELLA, contact Giulia Guidi.

PASTIS

Distributed Many-to-Many Protein Sequence Alignment

PASTIS is a fully distributed pipeline for large-scale protein similarity search. It constructs similarity graphs from large collections of protein sequences, which can in turn be used by a graph clustering algorithm to accurately discover protein families. A major novelty of PASTIS is its use of distributed sparse matrices as its underlying data structure. Sparse matrices store not only the sequences and their k-mers but also the substitute k-mers that are critical for controlling sensitivity and specificity during sequence overlapping. PASTIS extensively hides communication and exploits the symmetry of the similarity matrix to achieve load balance. PASTIS has been demonstrated to scale up to 2025 nodes (137,700 cores), and its accuracy is on par with the state of the art.
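The sparse-matrix view can be illustrated with a small serial sketch: if A is a sequences-by-k-mers matrix, then the product of A with its transpose counts the k-mers shared by each pair of sequences, and because that product is symmetric only the upper triangle needs to be computed. The code below is an assumption-laden illustration, not the PASTIS implementation: it omits substitute k-mers and the alignment step, and the function names and thresholds are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_matrix(sequences, k):
    """Column-oriented sparse matrix: k-mer -> ids of sequences containing it."""
    columns = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            columns[seq[i:i + k]].add(seq_id)
    return columns


def shared_kmer_counts(columns):
    """Upper triangle of A * A^T: (i, j) with i < j -> distinct shared k-mers.

    Only pairs with i < j are stored, since the full product is symmetric.
    """
    counts = defaultdict(int)
    for seq_ids in columns.values():
        for i, j in combinations(sorted(seq_ids), 2):
            counts[(i, j)] += 1
    return dict(counts)


def candidate_pairs(sequences, k=3, min_shared=2):
    """Sequence pairs sharing at least `min_shared` k-mers (assumed cutoff)."""
    counts = shared_kmer_counts(kmer_matrix(sequences, k))
    return sorted(pair for pair, c in counts.items() if c >= min_shared)
```

In a full pipeline the surviving pairs would then be aligned, and the resulting similarity graph passed to a clustering algorithm.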

Authors

The primary authors are Oguz Selvitopi, Saliya Ekanayake, and Aydin Buluc.

Software

The first release is available at https://github.com/PASSIONLab/PASTIS.

Contact

For more information about PASTIS, contact Oguz Selvitopi.