SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification

Abstract

Large Language Models with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification, a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.

One Introduction

Extending the context window of large language models to process thousands or even millions of tokens is a promising step toward building systems capable of comprehending long, complex documents without relying on aggressive chunking or retrieval-based pipelines. However, constructing datasets for both fine-tuning and evaluating long-context large language models remains labor-intensive and costly, limiting scalability. Synthetic datasets have emerged as a promising alternative to manual annotation, enabling large-scale, low-cost generation of training and evaluation data.

Yet, in the long-context setting, empirical findings remain mixed: some studies report diminished or even negative effects from synthetic long-context training, while others demonstrate substantial gains over weak long-context baselines. These discrepancies highlight the need for a systematic evaluation of synthetic data's utility in improving long-context reasoning. In this work, we focus on evaluating long-context synthesis for the long-context claim verification task.

We pose the following research questions, addressing both verification performance and explanation quality. RQ1: How does synthetic long-context training data affect downstream claim verification? We study this question along two dimensions: (i) the effect of context length on verification accuracy, and (ii) the impact of the source domain of the synthetic data on out-of-domain verification benchmarks. RQ2: How does synthesis logic affect downstream claim verification? We study this by varying error types in unverifiable claims and claim complexity in verifiable ones. RQ3: Does synthetic training improve the quality of model-generated explanations? We examine whether synthetic tuning improves explanation quality by encouraging rationales that more consistently cite relevant evidence from the input context.

We introduce SynClaimEval, a framework for systematically evaluating the utility of synthetic data in long-context claim verification across the dimensions outlined in our research questions. Figure one provides an overview of the framework. For RQ1, we vary training context length by truncating source articles, while keeping evaluation benchmarks untruncated as reference, and test both within-domain and out-of-domain settings to assess generalization. For RQ2, we manipulate the logic of synthesis along two dimensions: complexity, by conditioning on structured representations that induce multi-hop reasoning, and error type, by contrasting hallucinated unverifiable claims with contradictory ones. For RQ3, we evaluate explanation quality through pairwise ranking, asking whether rationales generated under different synthesis strategies offer more support to the same predicted label.
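The RQ1 truncation step and the RQ3 pairwise ranking can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the function names are hypothetical, tokens are approximated by whitespace splitting instead of a model tokenizer, and pairwise judgments are assumed to arrive as the labels "A", "B", or "tie".

```python
from collections import Counter


def truncate_article(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens whitespace-delimited tokens.

    Whitespace splitting is an assumption; a real setup would likely
    count tokens with the model's own tokenizer.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def make_length_variants(text: str, budgets: list[int]) -> dict[int, str]:
    """Build one truncated copy of a source article per token budget."""
    return {budget: truncate_article(text, budget) for budget in budgets}


def pairwise_win_rate(judgments: list[str]) -> float:
    """Fraction of decided comparisons won by strategy A.

    Each judgment is "A", "B", or "tie" (hypothetical labels);
    ties are excluded from the denominator.
    """
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    return counts["A"] / decided if decided else 0.0
```

Under these assumptions, `make_length_variants(article, [2000, 8000, 32000])` would yield training sources at three context lengths, while evaluation documents are left untruncated.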

Our study yields five key insights: (i) long-context synthesis enables base instruction-following models to narrow the gap with stronger models, though gains are not always consistent; (ii) extending training contexts improves verification performance; (iii) balancing contradictory and unverifiable hallucinated errors yields larger improvements than relying solely on unverifiable errors; (iv) structured synthesis, for example, multi-hop reasoning, improves performance and generalizes more effectively than unstructured approaches; and (v) although verification gains are modest, synthesis consistently improves explanation quality, independent of verification accuracy improvements.

Two Related Work

Three SynClaimEval

Three point one Preparing Claim Sources

Three point two Claim Synthesis Strategies

Three point three Evaluating Explanations

Four Datasets

Four point two Evaluation Benchmarks

Five Experimental Setup

Five point two Continual Fine-tuning

Six Results and Analysis

Six point two RQ two: Error types and synthesis logic

Six point three RQ three: Impact on generated explanations

Seven Conclusion and Future Work

Limitations

Ethics Statement

A Summarization Prompts

B. one Unstructured Synthesis

B. three Argument-graph Synthesis Prompts

D GPT-four Evaluation of Claim Synthesis

F Reasoning Output
