CollagePrompt: A Benchmark for Budget-Friendly Visual Recognition with GPT-4V
Siyu Xu
Yunke Wang
Daochang Liu
Bo Du
Chang Xu
Code [GitHub]
arXiv [Paper]
Dataset [Google Drive] [Kaggle]

Abstract

Recent advancements in generative AI suggest that, given visual prompts, GPT-4V can demonstrate significant proficiency in visual recognition tasks. Despite its impressive capabilities, the financial cost of GPT-4V's inference presents a substantial barrier to its wide use. To address this challenge, we propose a budget-friendly collage prompting task that collages multiple images into a single visual prompt and asks GPT-4V to perform visual recognition on several images simultaneously, thereby reducing the cost. We collect a dataset of various collage prompts to assess GPT-4V's visual recognition performance on them. Our evaluations reveal several key findings: 1) recognition accuracy varies with position in the collage; 2) grouping images of the same category together leads to better visual recognition results; 3) incorrect labels often come from adjacent images. These findings highlight the importance of image arrangement within a collage prompt. To this end, we construct a benchmark called CollagePrompt, which offers a platform for designing collage prompts to achieve more cost-effective visual recognition with GPT-4V. We propose a baseline method based on genetic algorithms to optimize collage layouts, and we introduce two metrics to measure the efficiency of optimized collage prompts. Our benchmark enables researchers to better optimize collage prompts, making GPT-4V more cost-effective for visual recognition.

Overview



CollagePrompt is a benchmark platform designed to address the financial challenges of using GPT-4V for visual recognition. By leveraging grid collages of various sizes, the benchmark provides an efficient and cost-effective approach to visual recognition without significantly compromising accuracy. The dataset includes a variety of visual prompts carefully curated to facilitate robust testing and optimization of AI models.


Using the CollagePrompt benchmark, researchers can optimize image arrangements to minimize accuracy loss and reduce costs associated with GPT-4V’s visual recognition tasks.

Arrangement Matters

The arrangement of images within a collage prompt significantly impacts GPT-4V's overall recognition accuracy. For any given set of images forming a collage, there exist one or more optimal arrangements that maximize overall recognition accuracy. Our findings indicate that different arrangements yield varying levels of accuracy: a quadrant-grid (2×2) collage has 24 (4!) possible image arrangements, while a nine-grid (3×3) format has 362,880 (9!). The statistics demonstrate the effect of these arrangements, emphasizing the need for effective optimization to minimize accuracy loss.
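For concreteness, the minimal sketch below shows how a grid collage prompt can be assembled with Pillow. The 3×3 default, 224-pixel cell size, and file-path interface are illustrative assumptions, not choices fixed by the benchmark.

```python
from PIL import Image

def make_collage(image_paths, grid=3, cell=224):
    """Stitch grid*grid images into a single collage, row-major:
    index 0 fills the top-left cell, the last index the bottom-right.
    The 224-px cell size is an illustrative choice, not a value
    fixed by the benchmark."""
    assert len(image_paths) == grid * grid
    canvas = Image.new("RGB", (grid * cell, grid * cell))
    for idx, path in enumerate(image_paths):
        row, col = divmod(idx, grid)
        tile = Image.open(path).convert("RGB").resize((cell, cell))
        canvas.paste(tile, (col * cell, row * cell))
    return canvas
```

Every permutation of image_paths produces a different collage from the same images; those permutations are exactly the search space discussed above.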



Observations

Different positions within the collage grid show varying accuracy rates in GPT-4V's visual recognition. As shown in the figure, the top-left position in both 2×2 and 3×3 grids tends to have the highest accuracy, with accuracy decreasing towards the center and bottom-left positions, which have the lowest accuracy; accuracy then improves slightly in the last row. This pattern suggests potential model fatigue when processing central images in the collage, with recovery as the model approaches the final row. Based on this observation, a natural way to optimize the arrangement is to place 'hard' images in the positions with higher accuracy and 'easy' images in the remaining positions, as sketched below.

Observation 1
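A minimal sketch of this placement heuristic, assuming per-position accuracy estimates (e.g., measured on held-out collages) and per-image difficulty scores (e.g., one minus the confidence of a cheap local classifier); both inputs are assumptions for illustration, not artifacts shipped with the benchmark.

```python
def greedy_arrangement(images, difficulty, position_acc):
    """Assign the hardest images to the most accurate grid positions.

    difficulty: image -> score, higher = harder (e.g., 1 - confidence
      of a cheap proxy classifier; an assumption for illustration).
    position_acc: row-major cell index -> empirical accuracy of that
      cell (assumed to be estimated from held-out collages).
    Returns the images in row-major collage order."""
    assert len(images) == len(position_acc)
    hardest_first = sorted(images, key=lambda im: -difficulty[im])
    best_positions = sorted(position_acc, key=lambda p: -position_acc[p])
    layout = [None] * len(images)
    for img, pos in zip(hardest_first, best_positions):
        layout[pos] = img
    return layout
```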


We observed that in both 2×2 and 3×3 collages, placing images of the same class together significantly improves GPT-4V's overall recognition accuracy. Conversely, when the order is shuffled so that images of the same class are no longer adjacent, accuracy decreases. This improvement is likely because clustering images of the same class reduces the complexity of batch recognition for GPT-4V; a minimal sketch of how to exploit this follows below.

Observation 2
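Ground-truth classes are unavailable at inference time, so exploiting this finding needs a stand-in; the sketch below orders images by the prediction of a hypothetical cheap local model so that likely-same-class images land in adjacent cells.

```python
def cluster_by_proxy_label(images, proxy_label):
    """Order images so those sharing a (proxy) class occupy adjacent
    cells of the row-major collage.

    proxy_label: image -> class predicted by a cheap local model
    (an assumption; true labels are unknown at inference time)."""
    return sorted(images, key=lambda im: str(proxy_label[im]))
```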


We analyzed the prediction errors in 2×2 and 3×3 collages and found that incorrectly predicted labels often correspond to images in adjacent positions. This indicates that the model identifies the images correctly but assigns the predictions to the wrong cells due to localization inaccuracies. When the arrangement order is changed, the model assigns the correctly predicted labels to the correct positions. A sketch of how this effect can be quantified follows below.

Observation 3
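One way to quantify this effect is to check, for each misprediction, whether the predicted label equals the ground truth of a 4-connected neighboring cell. The row-major label lists below are an assumed data format for illustration.

```python
def neighbors(pos, grid=3):
    """Row-major indices of the 4-connected neighbors of a cell."""
    row, col = divmod(pos, grid)
    return [
        r * grid + c
        for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))
        if 0 <= r < grid and 0 <= c < grid
    ]

def adjacent_error_rate(preds, truths, grid=3):
    """Fraction of wrong predictions that match a neighbor's true label."""
    wrong = [i for i, (p, t) in enumerate(zip(preds, truths)) if p != t]
    if not wrong:
        return 0.0
    hits = sum(any(preds[i] == truths[j] for j in neighbors(i, grid)) for i in wrong)
    return hits / len(wrong)
```

A high value would support the localization explanation: the labels themselves are right, only their positions are off.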


Try the baseline

Our CollagePrompt benchmark provides a foundational method for optimizing collage layouts for cost-effective visual recognition with GPT-4V. The baseline leverages a genetic algorithm to search for efficient image arrangements within collages, keeping recognition accuracy high while reducing the cost of recognizing each image individually.
Baseline
Our baseline employs a genetic algorithm to optimize collage prompts: it iteratively evolves the arrangement of images within the collage to improve overall recognition accuracy while keeping inference cost low. Its key steps are the standard ones of a genetic algorithm, adapted to permutations of collage cells: initializing a population of candidate arrangements, evaluating each arrangement's fitness (its recognition accuracy under GPT-4V), selecting the fittest candidates, and producing the next generation via crossover and mutation (see the sketch below).
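The sketch below shows one plausible instantiation of such a genetic algorithm over row-major arrangements. The order-crossover and swap-mutation operators, the population sizes, and the toy fitness stand-in (used so the example runs without paid GPT-4V calls) are all illustrative assumptions, not the benchmark's exact implementation.

```python
import random

def order_crossover(p1, p2):
    """Order crossover (OX): copy a slice from one parent and fill the
    remaining cells in the other parent's order, so the child is still
    a valid permutation of the same images."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    fill = iter(g for g in p2 if g not in child[a:b])
    return [g if g is not None else next(fill) for g in child]

def swap_mutation(perm, rate=0.2):
    """With probability `rate`, swap two random positions."""
    perm = perm[:]
    if random.random() < rate:
        i, j = random.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def optimize_layout(images, fitness, generations=20, pop_size=16, elite=4):
    """Evolve row-major arrangements of `images`, keeping the `elite`
    fittest each generation and breeding the rest via OX + mutation.
    `fitness` maps an arrangement to a score; in the real benchmark this
    would be GPT-4V's recognition accuracy on the rendered collage."""
    pop = [random.sample(images, len(images)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:elite]
        children = [
            swap_mutation(order_crossover(*random.sample(parents, 2)))
            for _ in range(pop_size - elite)
        ]
        pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    random.seed(0)
    # Toy fitness so the sketch runs end to end: reward arrangements whose
    # ids are close to sorted order (no real meaning; a stand-in for the
    # costly GPT-4V accuracy evaluation).
    toy_fitness = lambda layout: -sum(abs(g - i) for i, g in enumerate(layout))
    print(optimize_layout(list(range(9)), toy_fitness))
```

With a real fitness function, each evaluation costs a GPT-4V API call, so in practice small populations, few generations, and caching of already-scored arrangements matter for keeping the optimization itself budget-friendly.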

Experiment Results

This baseline serves as a starting point for further research and optimization, providing a clear pathway to enhancing the cost-efficiency of visual recognition tasks using GPT-4V. The experimental results demonstrate significant improvements in recognition accuracy when using optimized collage prompts compared to random arrangements.



Related and Concurrent Work


Cheng, Z., Kasai, J., & Yu, T. Batch Prompting: Efficient Inference with Large Language Model APIs. In EMNLP, 2023. [PDF] [Website]

Lin, J., Diesendruck, M., Du, L., & Abraham, R. BatchPrompt: Accomplish more with less. In ICLR, 2024. [PDF]

Anonymous. Tune-n-Batch: Fine-Tuning LLMs for Batch Prompting. In ACL submission, 2024. [PDF]

Yue, M., Zhao, J., Zhang, M., Du, L., & Yao, Z. Large language model cascades with mixture of thoughts representations for cost-efficient reasoning. In ICLR, 2024. [PDF] [Website]

Wu, W., Yao, H., Zhang, M., Song, Y., Ouyang, W., & Wang, J. GPT4Vis: What Can GPT-4 Do for Zero-Shot Visual Recognition? arXiv preprint, 2023. [PDF] [Website]

Jiang, Y., Irvin, J., Wang, J. H., Chaudhry, M. A., Chen, J. H., & Ng, A. Y. Many-Shot In-Context Learning in Multimodal Foundation Models. arXiv preprint, 2024. [PDF] [Website]

[Bibtex]


Acknowledgements

This template was originally made by Richard Zhang for the Colorization project.