Self-Training

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

We introduce STIC (Self-Training on Image Comprehension), an approach that enhances the understanding and reasoning capabilities of large vision language models (LVLMs) through self-generated data. Our experiments across seven benchmarks (ScienceQA, TextVQA, ChartQA, LLaVA-Bench, MMBench, MM-Vet, and MathVista) demonstrate an average accuracy gain of 4.0% from self-training.

May 30, 2024

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Jan 2, 2024