ChatGPT
Research directions in Vision-Language Model hallucination
Background
Large vision-language models (VLMs) such as GPT-4V, BLIP-2, LLaVA, VideoBERT and others can generate free-form natural-language descriptions conditioned on images or videos. Like text-only LLMs, VLMs sometimes hallucinate: they generate descriptions that appear plausible but are not grounded in the visual input. Hallucinations arise from issues such as spurious correlations in training data, weak vision-language alignment, over-reliance on language priors and an inability to estimate uncertainty. Research has so far concentrated on four areas: benchmark design (datasets for evaluating object and scene hallucination), hallucination detection, mitigation methods (prompting, decoding strategies, retrieval augmentation) and diagnosis of causes. Most work, however, focuses on object hallucination in image captioning or on high-level reasoning, and it targets improvements on standard benchmarks rather than exploring under-served areas or new modalities.
The user is looking for novel angles that do not require paid API access or massive GPU clusters. This report therefore outlines several research gaps and promising directions that can be explored with modest computational resources (two to four A100 or A6000 GPUs) and openly available models and datasets.