SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification
Abstract
Large language models with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating the utility of synthetic data in long-context claim verification, a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, by measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.
1 Introduction
Extending the context window of large language models to process thousands or even millions of tokens is a promising step toward building systems capable of comprehending long, complex documents without relying on aggressive chunking or retrieval-based pipelines. However, constructing datasets for fine-tuning and evaluating long-context models remains labor-intensive and costly, which limits scalability. Synthetic datasets have emerged as an alternative to manual annotation, enabling large-scale, low-cost generation of training and evaluation data.
Yet, in the long-context setting, empirical findings remain mixed: some studies report diminished or even negative effects from synthetic long-context training, while others demonstrate substantial gains over weak long-context baselines. These discrepancies highlight the need for a systematic evaluation of synthetic data's utility in improving long-context reasoning. In this work, we focus on evaluating long-context synthesis for the task of long-context claim verification.
We pose the following research questions, addressing both verification performance and explanation quality. RQ1: How does synthetic long-context training data affect downstream claim verification? We study this question along two dimensions: (i) the effect of context length on verification accuracy, and (ii) the impact of the source domain of the synthetic data on out-of-domain verification benchmarks. RQ2: How does synthesis logic affect downstream claim verification? We study this by varying error types in unverifiable claims and claim complexity in verifiable ones. RQ3: Does synthetic training improve the quality of model-generated explanations? We examine whether synthetic tuning encourages rationales that more consistently cite relevant evidence from the input context.
We introduce SynClaimEval, a framework for systematically evaluating the utility of synthetic data in long-context claim verification across the dimensions outlined in our research questions. Figure 1 provides an overview of the framework. For RQ1, we vary training context length by truncating source articles while keeping the evaluation benchmarks untruncated as a reference, and we test both within-domain and out-of-domain settings to assess generalization. For RQ2, we manipulate the synthesis logic along two dimensions: complexity, by conditioning on structured representations that induce multi-hop reasoning, and error type, by contrasting hallucinated unverifiable claims with contradictory ones. For RQ3, we evaluate explanation quality through pairwise ranking, asking whether rationales generated under different synthesis strategies offer more support for the same predicted label.
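To make two of these controls concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes whitespace tokenization for truncation (a real pipeline would more plausibly use the model's own tokenizer) and assumes that pairwise judgments are already available as tuples. The function names `truncate_context` and `pairwise_win_rates`, and the system labels in the usage note, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def truncate_context(text: str, max_tokens: int) -> str:
    """Keep only the first `max_tokens` whitespace-delimited tokens.

    Assumption: whitespace splitting stands in for a model tokenizer.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# One judgment: (system_a, system_b, winner), where winner is the name of
# the preferred system, or None if the judge declared a tie.
Judgment = Tuple[str, str, Optional[str]]


def pairwise_win_rates(judgments: List[Judgment]) -> Dict[str, float]:
    """Return, for each system, the fraction of its comparisons that it won.

    Ties count as a comparison for both systems but as a win for neither.
    """
    wins: Counter = Counter()
    comparisons: Counter = Counter()
    for system_a, system_b, winner in judgments:
        comparisons[system_a] += 1
        comparisons[system_b] += 1
        if winner is not None:
            wins[winner] += 1
    return {name: wins[name] / comparisons[name] for name in comparisons}
```

For instance, `truncate_context(article, 8000)` would cap a training article at 8,000 whitespace tokens, and `pairwise_win_rates([("base", "synth", "synth"), ("base", "synth", None)])` would report how often each hypothetical system's rationale was preferred. Counting ties in the denominator is one design choice among several; excluding them would give rates that sum to one across a pair.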
Our study yields five key insights: (i) long-context synthesis enables base instruction-following models to narrow the gap with stronger models, though the gains are not always consistent; (ii) extending training contexts improves verification performance; (iii) balancing contradictory and unverifiable hallucinated errors yields larger improvements than relying solely on unverifiable errors; (iv) structured synthesis, such as synthesis that induces multi-hop reasoning, improves performance and generalizes more effectively than unstructured approaches; and (v) although verification gains are modest, synthesis consistently improves explanation quality, independent of improvements in verification accuracy.