🪷 LOTUS: Detailed LVLM Evaluation from Quality to Societal Bias

We introduce LOTUS, a leaderboard for evaluating detailed captions generated by large vision-language models (LVLMs). It addresses three main gaps in existing evaluations: the lack of standardized criteria, of bias-aware assessment, and of consideration for user preferences. LOTUS comprehensively evaluates multiple aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias), while enabling preference-oriented evaluation by tailoring criteria to diverse user preferences.
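
As a concrete reference point for the alignment axis, the sketch below shows how a CLIPScore-style image–caption similarity (the CLIP-S column in the leaderboard) could be computed with Hugging Face's CLIP. This is an illustrative approximation under assumptions, not the leaderboard's implementation; the checkpoint choice is arbitrary, and the 2.5 rescaling follows the original CLIPScore formulation (Hessel et al., 2021).

```python
# Illustrative CLIPScore-style alignment metric (CLIP-S); not the official
# LOTUS implementation. The checkpoint choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, caption: str) -> float:
    """Rescaled cosine similarity between image and caption embeddings.

    Note: CLIP's text encoder truncates at 77 tokens, so very long detailed
    captions are only partially scored by this sketch.
    """
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum().item()
    return 2.5 * max(cos, 0.0)  # CLIPScore rescaling (Hessel et al., 2021)
```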

Leaderboard Results

| Overall Rank | Average N-avg | Model | Alignment CLIP-S | Alignment CapS_S | Alignment CapS_A | Alignment N-avg↑ | Descriptiveness Recall | Descriptiveness Noun | Descriptiveness Verb | Descriptiveness N-avg↑ | Complexity Syn | Complexity Sem | Complexity N-avg↑ | Side effects CHs↓ | Side effects FS↑ | Side effects FSs↑ | Side effects Harm↓ | Side effects N-avg↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 🥇 | 0.8200 | InstructBLIP | 61.8 | 37.3 | 43.2 | 0.82 | 90.4 | 45.9 | 36.9 | 0.34 | 8.3 | 75.7 | 0.28 | 26.8 | 54.2 | 41.7 | 0.28 | 0.46 |
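
The N-avg columns are normalized averages of each criterion group's metrics. Below is a minimal sketch of one plausible aggregation, assuming each metric is min-max normalized across the evaluated models, lower-is-better metrics (marked ↓) are flipped, and the normalized values are averaged per model; the actual leaderboard computation may differ, and the scores in the usage example are hypothetical.

```python
# Minimal sketch of a plausible N-avg aggregation; not the official LOTUS code.
# Assumptions: min-max normalization across models, lower-is-better metrics
# (e.g., CHs↓, Harm↓) are flipped, and the normalized values are averaged.
from typing import Dict, Set


def n_avg(per_model: Dict[str, Dict[str, float]],
          lower_is_better: Set[str] = frozenset()) -> Dict[str, float]:
    """Return one normalized-average score per model."""
    metrics = list(next(iter(per_model.values())).keys())
    # Per-metric min/max across all models, used for min-max normalization.
    bounds = {m: (min(s[m] for s in per_model.values()),
                  max(s[m] for s in per_model.values())) for m in metrics}
    out = {}
    for model, scores in per_model.items():
        normed = []
        for m in metrics:
            lo, hi = bounds[m]
            x = 0.5 if hi == lo else (scores[m] - lo) / (hi - lo)
            if m in lower_is_better:
                x = 1.0 - x  # lower raw score is better, so flip after normalizing
            normed.append(x)
        out[model] = sum(normed) / len(normed)
    return out


# Hypothetical "Side effects" scores for three models; values are illustrative
# only and will not reproduce the leaderboard numbers above.
scores = {
    "model_a": {"CHs": 27.0, "FS": 54.0, "FSs": 42.0, "Harm": 0.28},
    "model_b": {"CHs": 35.0, "FS": 48.0, "FSs": 39.0, "Harm": 0.35},
    "model_c": {"CHs": 31.0, "FS": 51.0, "FSs": 40.0, "Harm": 0.30},
}
print(n_avg(scores, lower_is_better={"CHs", "Harm"}))
```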
