Code for Paper: Harnessing Webpage UIs for Text-Rich Visual Understanding

About MultiUI

MultiUI is a dataset of 7.3 million samples spanning various UI types and tasks, structured using enhanced accessibility trees and task taxonomies.

Repository Structure

This repository is divided into two parts:

  • Train: contains training code for LLaVA-OneVision, the base model we used.

  • Evaluation: contains evaluation code for all benchmarks we tested in the paper.

Dataset Download

  • MultiUI: Download our 7.3-million-sample training dataset from Hugging Face (see the download sketch below).
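A minimal download sketch using the huggingface_hub CLI; the dataset id neulab/MultiUI is an assumption based on this repository's namespace, so verify the exact id on the Hugging Face page:

# Dataset id assumed from the repo namespace; verify on Hugging Face before use.
pip install -U huggingface_hub
huggingface-cli download neulab/MultiUI --repo-type dataset --local-dir data/MultiUI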

Model Checkpoints

Model Name    | LLM               | Vision Tower              | Checkpoint
UIX-Qwen2     | Qwen2-7B-Instruct | siglip-so400m-patch14-384 | neulab/UIX-Qwen2
UIX-Qwen2-M2W | Qwen2-7B-Instruct | siglip-so400m-patch14-384 | neulab/UIX-Qwen2-Mind2Web

Run Evaluation

VisualWebBench

To evaluate VisualWebBench-related tasks:

cd eval/VisualWebBench
bash run.sh

lmms-eval-MultiUI

We evaluate GUI understanding and grounding benchmarks (WebSRC, ScreenQA-short, WidgetCap, ScreenSpot, RefExp), OCR/document/chart QA benchmarks (DocVQA, ChartQA, TextVQA, InfoVQA, VisualMRC, OCRBench), and the general grounding benchmark RefCOCO+ with the lmms-eval framework.

To evaluate these datasets:

cd eval/lmms-eval-MultiUI
model=MODEL_NAME
model_type=MODEL_TYPE
task=TASK_NAME
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model $model_type \
    --model_args pretrained=$model,conv_template=qwen_2 \
    --tasks ${task} \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix ${task} \
    --output_path eval_logs
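For example, a hypothetical filling of the three placeholders to evaluate the UIX-Qwen2 checkpoint on WebSRC; the model_type and task values here are assumptions and must match the identifiers registered in this lmms-eval fork:

model=neulab/UIX-Qwen2        # checkpoint from the table above
model_type=llava_onevision    # assumed model type; check the fork's model registry
task=websrc                   # assumed lmms-eval task name for WebSRC

Then run the accelerate launch command above unchanged.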

Mind2Web Evaluation

Download our processed Mind2Web evaluation dataset from Hugging Face and place it under eval/Mind2Web-SeeAct/src/offline_experiments/screenshot_generation/data (a download sketch follows).
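A minimal download sketch using the huggingface_hub CLI; <M2W_EVAL_DATASET> is a placeholder for the actual dataset id, which this README does not spell out:

# <M2W_EVAL_DATASET> is a placeholder; substitute the real Hugging Face dataset id.
huggingface-cli download <M2W_EVAL_DATASET> --repo-type dataset \
    --local-dir eval/Mind2Web-SeeAct/src/offline_experiments/screenshot_generation/data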

Run inference

cd eval/Mind2Web-SeeAct/src/offline_experiments/

python eval_m2w.py \
--model_name MODEL_NAME \
--model_path MODEL_PATH \
--task_types test_{task/website/domain}
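To run all three test splits in one pass, a simple loop over the split names in the test_{task/website/domain} template could look like this:

# Run inference on each Mind2Web test split in turn.
for split in task website domain; do
    python eval_m2w.py \
        --model_name MODEL_NAME \
        --model_path MODEL_PATH \
        --task_types test_${split}
done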

Calculate metrics

python ./action_generation/metric.py

Dataset Disclaimer

The MultiUI dataset is released for open-source use by the research and developer community. The data is largely sourced from publicly available web content or generated by large language models (LLMs). We constructed this dataset using links from Hugging Face’s FineWeb dataset, which is based on a Common Crawl dump, representing publicly accessible data from the web.

This dataset is intended primarily for research purposes; it may contain material with inaccuracies, biases, or other unintended issues. We do not intentionally include any copyrighted material, and any resemblance to such content is unintentional.

If you have any concerns regarding specific data or believe that any content should be removed, please contact us, and we will review the request and take appropriate action.
