GNNome-Decision: Enhancing GNN Training for de Novo Genome Assembly by Targeting Decision Nodes
Abstract
De novo genome assembly, the reconstruction of complete DNA sequences from reads without the use of a reference genome, remains one of the most challenging and fundamental problems in computational biology. A common method for de novo genome assembly involves creating an assembly graph of reads and defining a graph traversal that represents the genomic sequence. Recently, the first deep learning-based method, GNNome, was proposed to tackle this problem. Starting from an assembly graph, GNNome performs de novo assembly in two steps: binary edge classification and a greedy walk. However, we observe that the decisions of the greedy agent matter at only 0.86% of nodes. In this paper, we develop an objective function based on margin-ranking loss for GNN training that focuses on these decision nodes, aligning the training objective with the performance of the downstream task of greedy pathfinding. Furthermore, we introduce a modification to the dataset creation pipeline that increases the fraction of decision nodes more than tenfold, to 9.35%, substantially raising the information density of the training dataset. Trained only on human data, our model improves the NGA50 score over GNNome on the CHM13 human genome from 111.0 Mb to 115.8 Mb, while achieving similar assembly quality on three non-human real genomes, consistently increasing assembly completeness and decreasing duplicated genes.
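To make the idea of a decision-node objective concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a hypothetical graph layout (each node maps to a list of `(successor, score, is_correct)` tuples) and a simplified definition of a decision node as one with at least one correct and at least one incorrect outgoing edge; the names `is_decision_node`, `decision_margin_loss`, `greedy_walk`, and the `margin` parameter are illustrative assumptions. The loss is a pairwise margin-ranking hinge applied only at such nodes, and the walk follows the highest-scoring edge at each step.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (successor id, predicted edge score, ground-truth label).
Edge = Tuple[str, float, bool]
Graph = Dict[str, List[Edge]]


def is_decision_node(edges: List[Edge]) -> bool:
    """Assumed definition: the node has both correct and incorrect out-edges."""
    return {ok for _, _, ok in edges} == {True, False}


def decision_margin_loss(graph: Graph, margin: float = 1.0) -> float:
    """Mean pairwise hinge loss max(0, margin - (s_pos - s_neg)),
    computed only over decision nodes; all other nodes contribute nothing."""
    terms: List[float] = []
    for edges in graph.values():
        if not is_decision_node(edges):
            continue
        pos = [s for _, s, ok in edges if ok]
        neg = [s for _, s, ok in edges if not ok]
        terms.extend(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return sum(terms) / len(terms) if terms else 0.0


def greedy_walk(graph: Graph, start: str) -> List[str]:
    """Follow the highest-scoring out-edge until a dead end or a revisit."""
    path, visited, node = [start], {start}, start
    while graph.get(node):
        successor = max(graph[node], key=lambda e: e[1])[0]
        if successor in visited:
            break
        path.append(successor)
        visited.add(successor)
        node = successor
    return path


# Toy graph: only "b" offers a real choice, and the scores currently
# rank the incorrect edge (to "d") above the correct one (to "c").
toy: Graph = {
    "a": [("b", 0.9, True)],
    "b": [("c", 0.2, True), ("d", 0.6, False)],
    "c": [("e", 0.8, True)],
    "d": [],
    "e": [],
}

if __name__ == "__main__":
    print("loss on decision nodes:", decision_margin_loss(toy))
    print("greedy path:", greedy_walk(toy, "a"))
```

In this toy graph the single mis-ranked pair at node `b` is the only source of loss, and it is also the only place where the greedy walk can go wrong, which is the alignment between training objective and downstream pathfinding that the abstract describes.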
Type
Publication
In RECOMB International Workshop on Comparative Genomics