A Graph-Augmented Multi-Stage Transformer Model for Document Layout Understanding
A. Arshad, M. Moetesum, A. U. Hasan, and F. Shafait, International Journal on Document Analysis and Recognition (2025).
DOI: 10.1007/s10032-025-00566-2
Visually Rich Document Understanding (VRDU) requires models that effectively capture spatial layouts and semantic relationships among textual and visual entities. This paper presents a Graph-Augmented Multi-Stage Transformer Model that integrates Graph Neural Networks (GNNs) with 2D positional embeddings to enhance spatial reasoning and contextual representations. The proposed model introduces learnable row-column embeddings and a hierarchical multi-stage transformer architecture for efficient and progressive feature refinement. Comprehensive evaluations on the FUNSD and DocVQA datasets demonstrate consistent performance improvements, achieving 91.35% F1 and 79.91% ANLS scores, respectively, establishing new benchmarks in structured document understanding. Furthermore, evaluation on the DUDE dataset, which comprises complex, multi-page, and heterogeneous documents, illustrates the model's scalability and robustness to real-world document variations. Comparative analyses with LayoutLMv3, LiLT, and Qwen2-VL confirm that the proposed approach achieves strong generalization with competitive efficiency, making it a robust and practical solution for document layout understanding tasks.
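The row-column embeddings mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the grid sizes, embedding dimension, and the choice of summing row and column vectors are generic conventions for 2D layout embeddings, and the tables are shown as random NumPy arrays where a real model would learn them by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a 64x64 page grid, 16-dim embeddings.
NUM_ROWS, NUM_COLS, DIM = 64, 64, 16

# Stand-ins for learnable lookup tables; a training framework would update these.
row_table = rng.normal(size=(NUM_ROWS, DIM))
col_table = rng.normal(size=(NUM_COLS, DIM))

def layout_embedding(bbox, page_w, page_h):
    """Map a token's bounding box (x0, y0, x1, y1) to a row+column embedding.

    The box centre is quantized onto the page grid and used to index the
    row and column tables; the two vectors are summed, one common way of
    injecting a 2D positional signal into token features.
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    col = min(int(cx / page_w * NUM_COLS), NUM_COLS - 1)
    row = min(int(cy / page_h * NUM_ROWS), NUM_ROWS - 1)
    return row_table[row] + col_table[col]

# Two tokens on the same text line share the row component of their embeddings.
left = layout_embedding((50, 500, 120, 520), page_w=800, page_h=1000)
right = layout_embedding((600, 500, 700, 520), page_w=800, page_h=1000)
```

Because both boxes quantize to the same grid row, `left` and `right` differ only in their column components; a downstream transformer or GNN can exploit that shared signal when reasoning about tokens aligned on the same row.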