<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context">
  <meta name="keywords" content="MedVH, LLM, Multimodal LLM, AI, AI in Medicine, Hallucination">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script>
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>


<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">
            MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="https://github.com/dongzizhu">Zishan Gu</a><sup>1</sup>,</span>
            <span class="author-block">
              <a href="https://github.com/paralym">Jiayuan Chen</a><sup>1</sup>,</span>
            <span class="author-block">
              <a href="https://xiangyue9607.github.io/">Fenglin Liu</a><sup>2</sup>,
            </span>
            <span class="author-block">
              <a href="https://yinchangchang.github.io/">Changchang Yin</a><sup>1</sup>,
            </span>
            <span class="author-block">
              <a href="https://www.pingzhang.net/">Ping Zhang</a><sup>1</sup>
            </span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block"><sup>1</sup>The Ohio State University,</span>
            <span class="author-block"><sup>2</sup>University of Oxford</span>
          </div>

          <div class="column has-text-centered">
            <div class="publication-links">
              <!-- PDF Link. -->
              <span class="link-block">
                <a href="https://arxiv.org/abs/2407.02730" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                      <i class="ai ai-arxiv"></i>
                  </span>
                  <span>arXiv</span>
                </a>
              </span>
              <!-- Code Link. -->
              <span class="link-block">
                <a href="https://github.com/AIMedLab/PULSE" target="_blank"
                   class="external-link button is-normal is-rounded is-dark"> 
                  <span class="icon">
                      <i class="fab fa-github"></i>
                  </span>
                  <span>Code</span>
                  </a>
              </span>
            </div>

          </div>
        </div>
      </div>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">
    <!-- Abstract. -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which has inspired a large number of studies on LVLM fine-tuning and training. Despite these advancements, there has been scant research on the robustness of these models against hallucination, especially in real-life clinical contexts. To bridge this gap, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate hallucination in domain-specific as well as general LVLMs. MedVH comprises six tasks that evaluate hallucinations in LVLMs within the medical context, including two traditional tasks and four novel tasks formatted as multi-choice visual question answering and long response generation. Our extensive experiments reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than general models, which raises significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Additionally, we explore potential methods for mitigating hallucinations without model-specific fine-tuning, including prompt engineering and collaboration between general and domain-specific models. Our work paves the way for future evaluations of these studies.
          </p>
        </div>
      </div>
    </div>
    <!--/ Abstract. -->

    <br>

    <section class="hero teaser">
      <div class="container is-max-desktop has-text-centered">
        <img src="static/images/figure0.png" alt="Overview of evaluation results">
        <div class="content has-text-justified">
          <p>Figure 1. Overview of evaluation results.</p>
        </div>
      </div>
    </section>
  </div>
</section>


  <!-- Dataset-->
<section class="section">
  <div class="container is-max-desktop">

    <section class="hero teaser">
      <div class="container is-max-desktop has-text-centered">
        <h2 class="title is-3">MedVH Benchmark</h2>
        <img src="static/images/medVH.png" alt="MedVH overall evaluation framework">
        <div class="content has-text-justified">
          <p>Figure 2. Overall evaluation framework. We evaluate eight state-of-the-art LVLMs from two facets, each corresponding to a different type of hallucination in the medical context. The first facet examines the models' robustness against hallucinations in the comprehensive understanding of medical visual information and textual input through MC-VQA tasks, such as disease identification and severity assessment. The second facet focuses on hallucinations that occur in long text generation, particularly in false confidence justification and medical report generation. The models' robustness against hallucinations is evaluated with respect to their ability to leverage the medical knowledge base and their model size.
          </p>
        </div>
      </div>
    </section>
    
  </div>
</section>

  <!-- Benchmark-->
  <section class="section">
    <div class="container is-max-desktop">
  
      <section class="hero teaser">
        <div class="container is-max-desktop has-text-centered">
          <h2 class="title is-3">Evaluation Tasks</h2>
          <img src="static/images/tasks.png" alt="MedVH evaluation tasks">
          <div class="content has-text-justified">
            <p>Figure 3. Evaluation tasks.
            </p>
          </div>
        </div>
      </section>
      
    </div>
  </section>


<style>
    table {
        border-collapse: collapse;
        margin: 0 auto;
    }
    th, td {
        border-bottom: 1px solid lightgrey; /* Light grey horizontal borders */
        padding: 8px 12px;
        text-align: center;
        vertical-align: middle; /* Vertically centers the content */
    }
    th {
        white-space: nowrap;
    }
    /* Thicker top border */
    thead tr:first-child th {
        border-top: 2px solid black;
    }
    /* Thicker bottom border */
    tbody tr:last-child td {
        border-bottom: 2px solid black;
    }
</style>

<table>
    <caption>Performance comparison of all models on all six tasks. The best result in each column is shown in bold (for CHAIR, lower is better).</caption>
    <thead>
        <tr>
            <th rowspan="3">Model Type</th>
            <th rowspan="3">LVLM</th>
            <th colspan="8">Visual and Textual Cross-understanding</th>
            <th colspan="4">Long Response Generation</th>
        </tr>
        <tr>
            <th colspan="2">Abnormality Detection</th>
            <th colspan="2">Wrongful Image</th>
            <th colspan="2">None of the Above</th>
            <th colspan="2">Clinically Incorrect Questions</th>
            <th colspan="2">False Confidence Justification</th>
            <th colspan="2">Report Generation</th>
        </tr>
        <tr>
            <th>acc_h</th>
            <th>acc_b</th>
            <th>acc_h</th>
            <th>acc_b</th>
            <th>acc_h</th>
            <th>acc_b</th>
            <th>acc_h</th>
            <th>acc_b</th>
            <th>acc_h</th>
            <th>acc_b</th>
            <th>CHAIR</th>
            <th>F<sub>1</sub></th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="3">General Models</td>
            <td>GPT-4V</td>
            <td>0.324</td>
            <td>0.471</td>
            <td><strong>0.978</strong></td>
            <td>0.244</td>
            <td>0.244</td>
            <td>0.262</td>
            <td><strong>0.356</strong></td>
            <td>0.186</td>
            <td>0.366</td>
            <td>0.378</td>
            <td>0.665</td>
            <td>0.338</td>
        </tr>
        <tr>
            <td>LLaVA</td>
            <td>0.044</td>
            <td>0.500</td>
            <td>0.014</td>
            <td>0.344</td>
            <td><strong>0.478</strong></td>
            <td>0.280</td>
            <td>0.020</td>
            <td>0.366</td>
            <td>0.250</td>
            <td>0.360</td>
            <td>0.760</td>
            <td>0.194</td>
        </tr>
        <tr>
            <td>MiniGPT</td>
            <td>0.228</td>
            <td>0.508</td>
            <td>0.024</td>
            <td>0.326</td>
            <td>0.108</td>
            <td>0.124</td>
            <td>0.006</td>
            <td>0.030</td>
            <td><strong>0.490</strong></td>
            <td>0.326</td>
            <td>0.938</td>
            <td>0.040</td>
        </tr>
        <tr>
            <td rowspan="2">Medical Models</td>
            <td>LLaVA-Med</td>
            <td>0.170</td>
            <td>0.457</td>
            <td>0.110</td>
            <td>0.216</td>
            <td>0.028</td>
            <td>0.164</td>
            <td>0.004</td>
            <td>0.168</td>
            <td>0.172</td>
            <td>0.244</td>
            <td>0.737</td>
            <td>0.218</td>
        </tr>
        <tr>
            <td>Med-Flamingo</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>0.831</td>
            <td>0.133</td>
        </tr>
        <tr>
            <td rowspan="3">CXR Models</td>
            <td>LLM-CXR</td>
            <td>0.348</td>
            <td>0.472</td>
            <td>0.104</td>
            <td>0.220</td>
            <td>0.094</td>
            <td>0.130</td>
            <td>0.046</td>
            <td>0.244</td>
            <td>0.220</td>
            <td>0.256</td>
            <td>0.570</td>
            <td>0.401</td>
        </tr>
        <tr>
            <td>XrayGPT</td>
            <td>0.176</td>
            <td>0.173</td>
            <td>0.164</td>
            <td>0.286</td>
            <td>0.154</td>
            <td>0.140</td>
            <td>0.016</td>
            <td>0.030</td>
            <td>0.230</td>
            <td>0.132</td>
            <td>0.576</td>
            <td>0.278</td>
        </tr>
        <tr>
            <td>CheXagent</td>
            <td><strong>0.378</strong></td>
            <td><strong>0.526</strong></td>
            <td>0.154</td>
            <td><strong>0.410</strong></td>
            <td>0.258</td>
            <td><strong>0.458</strong></td>
            <td>0.182</td>
            <td><strong>0.540</strong></td>
            <td>0.094</td>
            <td><strong>0.462</strong></td>
            <td><strong>0.461</strong></td>
            <td><strong>0.506</strong></td>
        </tr>
    </tbody>
</table>








<section class="section" id="BibTeX">
  <div class="container is-max-desktop content">
    <h2 class="title">BibTeX</h2>
    <pre><code>@misc{gu2024medvhsystematicevaluationhallucination,
  title={MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context},
  author={Zishan Gu and Changchang Yin and Fenglin Liu and Ping Zhang},
  year={2024},
  eprint={2407.02730},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.02730},
}
</code></pre>
  </div>
</section>


<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website is adapted from <a href="https://nerfies.github.io/">Nerfies</a>, licensed under a <a rel="license"
                                                href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>