Invoice Haystack

Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity

Invoice Haystack is a benchmark for evaluating retrieval and visual question answering over large collections of visually similar invoice documents. It targets enterprise-style document repositories where invoices share similar templates but differ in fine-grained semantic information.

The benchmark is designed to test whether modern document AI systems can retrieve the correct invoice and answer questions accurately when many candidate documents look almost identical.

Abstract

Document retrieval and visual question answering systems are increasingly used in enterprise settings where large collections of documents share highly similar layouts, templates, and visual structures. However, existing benchmarks often fail to isolate the challenges posed by strong visual homogeneity, where the correct answer depends on retrieving and reasoning over fine-grained differences between near-identical documents. We introduce Invoice Haystack, a benchmark designed to evaluate document retrieval and visual question answering over large collections of visually similar invoice documents. The benchmark contains 1,500 anonymized invoice images and 200 validated question-answer pairs across corpus scales of 500, 1,000, and 1,500 documents.

To support evaluation, we propose VL-RAG, a hybrid retrieval-augmented generation framework that combines OCR-based textual retrieval with visual retrieval using modern vision-language encoders. Candidate documents are ranked using average score fusion and refined using a VLM-based binary verification filter. Invoice Haystack provides a challenging testbed for studying retrieval robustness, multimodal document understanding, and question answering under strong visual similarity.

Benchmark

Method: VL-RAG

VL-RAG combines text and visual retrieval streams. The text stream uses OCR and BGE-Large embeddings, while the vision stream uses SigLIP and OpenCLIP. Candidate documents are ranked using average score fusion and refined using a VLM-based verification filter.

Invoice Haystack

Abstract

Benchmark

Method: VL-RAG

Citation