GQA Evaluation in test-dev_balanced dataset #31

Zeqing-Wang · 2023-06-30T15:03:23Z

Thanks for the great work! I am trying to reproduce the evaluation on GQA but have run into some problems. I checked the existing issues, but the problem is still not resolved, so I am opening a new one.

From the existing issues, I understand that the results in the paper were obtained on test-dev_balance of GQA. However, using the code and config.yaml from GitHub, I get only about 0.25 acc on test-dev_balance. Meanwhile, on the first 5000 questions of test-dev_all, we get an acc close to 0.5 (similar to the result in the paper). I don't understand this large difference in results under the same settings. The two runs differ only in the dataset (test-dev_balance vs. test-dev_all).

We also used stratified sampling for validation on the test-dev_balance dataset: we randomly selected 200 questions from each of the question ranges 0 to 2000, 2001 to 4000, 4001 to 6000, and 6001 to 8000. Our test results are listed below (we computed the acc over all sampled questions, and separately the acc after removing the questions whose generated code failed to compile).
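For readers who want to repeat this kind of check, here is a minimal sketch of the sampling and the two accuracy figures described above. It is not the code we actually ran: the function names, the fixed seed, and the `(correct, compiled)` result format are assumptions made for illustration, and a question whose program failed to compile is assumed to count as wrong in the overall acc.

```python
import random


def stratified_sample(num_questions, stratum_size, per_stratum, seed=0):
    """Pick `per_stratum` random question indices from each consecutive
    block of `stratum_size` questions (hypothetical helper)."""
    rng = random.Random(seed)
    picked = []
    for start in range(0, num_questions, stratum_size):
        stop = min(start + stratum_size, num_questions)
        k = min(per_stratum, stop - start)
        picked.extend(sorted(rng.sample(range(start, stop), k)))
    return picked


def accuracies(results):
    """`results` is a list of (correct, compiled) booleans, one per question.

    Returns (acc over all questions, acc over compiled questions only).
    A question that failed to compile is treated as incorrect.
    """
    n_total = len(results)
    n_compiled = sum(1 for _, compiled in results if compiled)
    n_correct = sum(1 for correct, compiled in results if correct and compiled)
    acc_all = n_correct / n_total if n_total else float("nan")
    acc_compiled = n_correct / n_compiled if n_compiled else float("nan")
    return acc_all, acc_compiled


# 200 questions from each block of 2000, over the first 8000 questions.
indices = stratified_sample(num_questions=8000, stratum_size=2000, per_stratum=200)
```

The indices would then be used to select questions from the test-dev_balance file, and the per-question outcomes passed to `accuracies`.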

Therefore, I would like to ask whether any special config settings are needed when evaluating on GQA, such as the BLIP model settings (blip2-flan-t5-xxl or blip2-flan-t5-xl), the load_models settings in base_config.yaml, or some other settings.

If possible, could you provide some details on how you evaluated on the GQA dataset? We wonder whether we did something wrong somewhere.

Thanks in advance!

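| Question range in test-dev_balance | acc (all sampled questions) | acc (excluding questions that failed to compile) |
| --- | --- | --- |
| 0-2000 | 0.24742268041237114 | 0.3582089552238806 |
| 2001-4000 | 0.24861878453038674 | 0.3284671532846715 |
| 4001-6000 | 0.2346368715083799 | 0.35 |
| 6001-8000 | 0.23711340206185566 | 0.31724137931034485 |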
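As a rough sanity check on these numbers, the two acc columns can be combined to estimate how many programs compiled at all. This is only a sketch under two assumptions: both columns share the same count of correct answers, and a question that failed to compile counts as wrong in the first column. Under those assumptions the compiled fraction is acc (all) divided by acc (compiled only), which comes out between roughly 0.67 and 0.76 for the four ranges, i.e. about a quarter to a third of the generated programs did not compile.

```python
# (acc over all questions, acc excluding failed-to-compile) per question range,
# copied from the results above.
rows = {
    "0-2000": (0.24742268041237114, 0.3582089552238806),
    "2001-4000": (0.24861878453038674, 0.3284671532846715),
    "4001-6000": (0.2346368715083799, 0.35),
    "6001-8000": (0.23711340206185566, 0.31724137931034485),
}


def implied_compile_rate(acc_all, acc_compiled):
    """Fraction of questions whose program compiled, assuming
    acc_all = correct / total and acc_compiled = correct / compiled."""
    return acc_all / acc_compiled


rates = {name: implied_compile_rate(a, b) for name, (a, b) in rows.items()}
for name, rate in rates.items():
    print(f"{name}: compiled fraction ~ {rate:.3f}, failed ~ {1 - rate:.3f}")
```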

Alchemistyui · 2023-07-16T11:50:18Z

Same problem here

xyliugo · 2023-07-27T09:58:54Z

Same problem, not only with this paper but also with VisProg. So I'm wondering whether the improvement brought by "task decomposition" REALLY exists?

k1tano · 2023-08-06T07:18:17Z

Same problem, still waiting.

astanic · 2023-08-30T23:26:06Z

Same problem, though on RefCOCO; we haven't tried GQA yet.

surisdi · 2024-01-26T18:07:07Z

Hi, a few weeks ago we added more details about evaluation. Unfortunately, our experiments were run using Codex, which is not available anymore. But the benchmark-specific code should be helpful to mimic our experiments.
