🪷 LOTUS: Detailed LVLM Evaluation from Quality to Societal Bias
We introduce LOTUS, a leaderboard for evaluating detailed captions generated by large vision-language models (LVLMs). It addresses three main gaps in existing evaluations: the lack of standardized criteria, of bias-aware assessment, and of consideration for user preferences. LOTUS comprehensively evaluates caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias), while enabling preference-oriented evaluation by tailoring criteria to diverse user preferences.
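To illustrate what preference-oriented evaluation can look like in practice, the sketch below combines per-criterion scores with user-defined weights. The function name, weight profile, and numbers are hypothetical and only illustrate the idea of tailoring criteria to a user's priorities; they do not reproduce LOTUS's actual scoring code.

```python
# Hypothetical sketch of preference-oriented scoring: per-criterion scores are
# combined with user-chosen weights. All names and numbers are illustrative.

def preference_weighted_score(criterion_scores: dict[str, float],
                              user_weights: dict[str, float]) -> float:
    """Weighted average of criterion scores (assumed normalized to [0, 1])."""
    total_weight = sum(user_weights.get(c, 0.0) for c in criterion_scores)
    if total_weight == 0:
        raise ValueError("At least one criterion needs a non-zero weight.")
    weighted = sum(score * user_weights.get(criterion, 0.0)
                   for criterion, score in criterion_scores.items())
    return weighted / total_weight

# Example: a user who prioritizes alignment and low risk over complexity.
scores = {"alignment": 0.82, "descriptiveness": 0.34,
          "complexity": 0.28, "side_effects": 0.46}
weights = {"alignment": 0.4, "descriptiveness": 0.1,
           "complexity": 0.1, "side_effects": 0.4}
print(f"Preference-weighted score: {preference_weighted_score(scores, weights):.3f}")
```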
Leaderboard Results
Overall Rank | Average N-avg | Model | Alignment: CLIP-S | Alignment: CapS_S | Alignment: CapS_A | Alignment: N-avg↑ | Descriptiveness: Recall | Descriptiveness: Noun | Descriptiveness: Verb | Descriptiveness: N-avg↑ | Complexity: Syn | Complexity: Sem | Complexity: N-avg↑ | Side effects: CHs↓ | Side effects: FS↑ | Side effects: FSs↑ | Side effects: Harm↓ | Side effects: N-avg↑ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 🥇 | 0.8200 | InstructBLIP | 61.8 | 37.3 | 43.2 | 0.82 | 90.4 | 45.9 | 36.9 | 0.34 | 8.3 | 75.7 | 0.28 | 26.8 | 54.2 | 41.7 | 0.28 | 0.46 |
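Each criterion group in the table reports an N-avg column that aggregates metrics living on different scales. Below is a minimal sketch of one plausible way to compute such a normalized average, assuming min-max normalization across the compared models and inversion of lower-is-better (↓) metrics; LOTUS's exact normalization may differ.

```python
# Minimal sketch of a normalized average (N-avg) over metrics on different scales.
# Assumes min-max normalization across the compared models, with lower-is-better
# metrics (marked ↓ in the table) inverted so that higher N-avg is always better.
# This only illustrates the idea; LOTUS's exact aggregation may differ.

def normalized_average(per_model_scores: dict[str, dict[str, float]],
                       lower_is_better: set[str]) -> dict[str, float]:
    metrics = list(next(iter(per_model_scores.values())).keys())
    n_avg = {model: 0.0 for model in per_model_scores}
    for metric in metrics:
        values = [scores[metric] for scores in per_model_scores.values()]
        lo, hi = min(values), max(values)
        for model, scores in per_model_scores.items():
            norm = 0.5 if hi == lo else (scores[metric] - lo) / (hi - lo)
            if metric in lower_is_better:
                norm = 1.0 - norm  # flip direction so that higher is better
            n_avg[model] += norm / len(metrics)
    return n_avg

# Toy numbers (not taken from the leaderboard) for the side-effect metric group.
toy = {
    "model_a": {"CHs": 8.0, "FS": 75.0, "FSs": 54.0, "Harm": 0.3},
    "model_b": {"CHs": 12.0, "FS": 70.0, "FSs": 50.0, "Harm": 0.4},
}
print(normalized_average(toy, lower_is_better={"CHs", "Harm"}))
```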
Bias-Aware Evaluation Results
Bias Type | Model | CLIP-S | CapS_S | CapS_A | Recall | Noun | Verb | Syn | Sem | CHs | FS | FSs | Harm | N-avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender bias | MiniGPT-4 | 0.3 | 0.9 | 1.1 | 7.8 | 1.7 | 2.6 | 6.3 | 3.2 | 4.8 | 6.3 | 4.0 | 1.64 | 0.51 |
Gender bias | InstructBLIP | 0.8 | 2.7 | 1.2 | 8.4 | 1.9 | 3.3 | 1.0 | 0.1 | 6.8 | 3.8 | 5.0 | 0.72 | 0.4 |
Gender bias | LLaVA-1.5 | 0.7 | 2.2 | 0.7 | 9.5 | 2.2 | 4.1 | 1.5 | 0.2 | 7.6 | 3.8 | 3.7 | 0.39 | 0.46 |
Gender bias | mPLUG-Owl2 | 0.6 | 2.2 | 1.2 | 9.1 | 2.3 | 3.5 | 1.6 | 0.0 | 7.2 | 3.1 | 5.8 | 0.33 | 0.4 |
Gender bias | Qwen2-VL | 0.2 | 0.7 | 0.5 | 6.3 | 0.1 | 3.6 | 13.5 | 2.5 | 4.4 | 0.9 | 5.7 | 1.77 | 0.63 |
Skin tone bias | MiniGPT-4 | 0.8 | 1.5 | 0.8 | 4.8 | 0.2 | 2.3 | 19.4 | 0.2 | 2.0 | 0.9 | 0.5 | 0.09 | 0.55 |
Skin tone bias | InstructBLIP | 0.5 | 1.4 | 0.2 | 8.4 | 1.9 | 1.1 | 6.8 | 0.1 | 4.0 | 2.4 | 1.1 | 0.09 | 0.51 |
Skin tone bias | LLaVA-1.5 | 0.4 | 1.3 | 0.7 | 4.0 | 0.2 | 1.0 | 5.3 | 0.6 | 2.7 | 1.4 | 1.3 | 0.18 | 0.67 |
Skin tone bias | mPLUG-Owl2 | 0.6 | 1.9 | 0.5 | 5.1 | 0.8 | 2.2 | 7.6 | 0.4 | 1.7 | 0.1 | 0.4 | 0.0 | 0.67 |
Skin tone bias | Qwen2-VL | 0.2 | 1.1 | 1.5 | 2.3 | 0.5 | 1.3 | 14.9 | 2.3 | 2.7 | 3.1 | 1.8 | 0.09 | 0.5 |
Language discrepancy | MiniGPT-4 | 0.8 | 1.5 | 3.9 | 2.3 | 4.3 | 5.2 | 52.2 | 5.0 | 5.4 | 5.6 | 3.4 | 0.1 | 0.4 |
Language discrepancy | InstructBLIP | 0.3 | 0.9 | 1.1 | 7.8 | 1.7 | 2.6 | 13.5 | 26.2 | 17.0 | 6.3 | 4.0 | 1.64 | 0.51 |
Language discrepancy | LLaVA-1.5 | 0.4 | 0.8 | 2.0 | 1.1 | 1.1 | 1.8 | 11.4 | 1.8 | 4.7 | 2.0 | 1.6 | 0.06 | 0.95 |
Language discrepancy | mPLUG-Owl2 | 1.4 | 1.6 | 4.9 | 1.5 | 1.1 | 3.7 | 37.5 | 8.4 | 17.0 | 6.3 | 1.3 | 0.02 | 0.57 |
Language discrepancy | Qwen2-VL | 0.2 | 3.6 | 6.7 | 1.9 | 3.9 | 3.8 | 90.8 | 26.2 | 6.4 | 7.5 | 2.1 | 0.14 | 0.28 |
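The table above reports per-metric discrepancies across demographic groups or languages. As a rough illustration of how such a discrepancy can be computed, the sketch below takes the absolute gap in a metric's mean value between two subgroups; the subgroup definitions, metric, and numbers are assumptions for illustration, not necessarily LOTUS's procedure.

```python
# Hypothetical sketch of a per-metric bias score: the absolute gap between the
# mean metric value of two demographic subgroups (e.g., perceived gender).
# Subgroup labels and numbers are illustrative, not LOTUS's exact recipe.
from statistics import mean

def subgroup_gap(per_image_scores: list[float], subgroup_labels: list[str],
                 group_a: str, group_b: str) -> float:
    """Absolute difference between the mean scores of two subgroups."""
    a = [s for s, g in zip(per_image_scores, subgroup_labels) if g == group_a]
    b = [s for s, g in zip(per_image_scores, subgroup_labels) if g == group_b]
    return abs(mean(a) - mean(b))

# Toy example: CLIP-S-style scores for images annotated with perceived gender.
scores = [61.2, 58.9, 63.4, 60.1, 57.8, 62.0]
labels = ["female", "male", "female", "male", "female", "male"]
print(f"Gender gap in CLIP-S: {subgroup_gap(scores, labels, 'female', 'male'):.2f}")
```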