Image UI Grounding for Creative Workflows: Model and Benchmark
Abstract
This report presents a model for image user interface (UI) grounding tailored to creative software, and introduces a new benchmark for evaluating AI understanding of creative workflows. Existing benchmarks do not adequately capture the complexity of creative software interfaces. We describe our model architecture, the construction of our benchmark, and report initial results.
1. Introduction
Creative software, such as graphic design and video editing tools, presents unique challenges for computer vision systems. User interfaces in these domains are complex, with diverse elements and dynamic layouts. Accurate UI grounding — the ability to identify and localize interface components in images — is essential for intelligent assistance and automation. However, current benchmarks are not representative of these creative environments. This work addresses this gap by proposing a new model and benchmark for image UI grounding in creative workflows.
2. Related Work
Prior research on UI grounding has focused on web and mobile interfaces, with datasets such as RICO for mobile app screens and document-layout corpora such as PubLayNet. These datasets do not capture the diverse elements and dynamic layouts typical of creative software. Recent advances in vision-language models have improved general UI understanding, but performance on creative tools remains underexplored.
3. Method
Our model is based on a vision transformer backbone, adapted for the unique characteristics of creative software UIs. The architecture incorporates multi-scale feature extraction and a component relationship module to capture spatial and semantic dependencies between UI elements.
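To make the pipeline concrete, the following is a minimal PyTorch sketch of the described components: a ViT-style backbone over patch tokens, a multi-scale feature branch, and a self-attention relation module. All module names, layer counts, and head designs here are illustrative assumptions for exposition, not the exact implementation evaluated in this report.

```python
# Sketch only: hyperparameters and module names are assumptions, not the
# report's released implementation.
import torch
import torch.nn as nn

class MultiScaleNeck(nn.Module):
    """Pools the backbone feature map at several scales (assumed design)."""
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(s, s) if s > 1 else nn.Identity(),
                          nn.Conv2d(dim, dim, 1))
            for s in scales
        )

    def forward(self, x):                       # x: (B, dim, H, W)
        return [b(x) for b in self.branches]

class RelationModule(nn.Module):
    """Self-attention over element tokens to model spatial/semantic
    dependencies between UI components (assumed design)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                  # tokens: (B, N, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)

class CreativeUIGrounder(nn.Module):
    """ViT backbone + multi-scale neck + relation module + per-token heads."""
    def __init__(self, dim=768, num_classes=50, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.neck = MultiScaleNeck(dim)
        self.relation = RelationModule(dim)
        self.cls_head = nn.Linear(dim, num_classes)  # component classification
        self.box_head = nn.Linear(dim, 4)            # normalized box regression

    def forward(self, images):                       # images: (B, 3, H, W)
        x = self.patch_embed(images)                 # (B, dim, H/p, W/p)
        B, D, H, W = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))   # (B, N, dim)
        fmap = tokens.transpose(1, 2).reshape(B, D, H, W)
        context = sum(f.mean(dim=(2, 3)) for f in self.neck(fmap))  # (B, dim)
        tokens = self.relation(tokens)
        return self.cls_head(tokens), self.box_head(tokens).sigmoid(), context

if __name__ == "__main__":
    model = CreativeUIGrounder()
    logits, boxes, ctx = model(torch.randn(2, 3, 224, 224))
    print(logits.shape, boxes.shape, ctx.shape)  # (2, 196, 50) (2, 196, 4) (2, 768)
```

In this sketch the relation module operates on patch tokens for simplicity; in a detection-style system it would more naturally operate on per-element proposals.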
4. Creative UI Benchmark
We introduce a benchmark dataset consisting of annotated screenshots from a variety of creative software applications. The benchmark includes diverse tasks such as object localization, component classification, and relationship identification. Annotation guidelines were developed to ensure consistency and coverage of creative workflows.
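For illustration, a hypothetical annotation record for one screenshot is sketched below. The field names and label vocabulary are assumptions made for this sketch rather than the benchmark's published schema, but the record covers the three task types named above: localization (boxes), component classification (labels), and relationship identification (relations).

```python
# Hypothetical annotation schema; names and labels are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Element:
    element_id: int
    label: str                               # e.g. "tool_button", "layer_panel"
    bbox: Tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)

@dataclass
class Relation:
    source_id: int
    target_id: int
    relation: str                            # e.g. "contains", "controls"

@dataclass
class Screenshot:
    image_path: str
    application: str                         # e.g. "vector_editor", "video_editor"
    elements: List[Element] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Example record spanning all three benchmark tasks.
example = Screenshot(
    image_path="screens/edit_session_0001.png",
    application="vector_editor",
    elements=[
        Element(0, "toolbar", (0.00, 0.00, 1.00, 0.06)),
        Element(1, "tool_button", (0.01, 0.01, 0.04, 0.05)),
    ],
    relations=[Relation(0, 1, "contains")],
)
```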
5. Results
We evaluate our model on the proposed benchmark and compare it to baseline methods. Metrics include mean average precision (mAP) for localization and F1 score for component classification. Our model demonstrates improved performance over baselines, particularly in complex, multi-element scenes.
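A sketch of how these metrics can be computed follows. The IoU threshold, greedy matching rule, and 11-point interpolation are simplifying assumptions for illustration; this section does not fix those details.

```python
# Evaluation sketch: AP at a single IoU threshold plus macro-F1 for labels.
import numpy as np
from sklearn.metrics import f1_score

def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def ap_at_iou(preds, gts, thr=0.5):
    """Average precision at one IoU threshold with greedy matching.
    preds: list of (score, box); gts: list of ground-truth boxes."""
    preds = sorted(preds, key=lambda p: -p[0])
    matched, tp, fp = set(), [], []
    for score, box in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            o = iou(box, g)
            if o > best:
                best, best_j = o, j
        if best >= thr:
            matched.add(best_j); tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 11-point interpolation for simplicity
    return float(np.mean([precision[recall >= r].max() if (recall >= r).any() else 0.0
                          for r in np.linspace(0, 1, 11)]))

# Component classification: macro-F1 over predicted vs. gold labels.
y_true = ["toolbar", "tool_button", "layer_panel"]
y_pred = ["toolbar", "tool_button", "canvas"]
print("F1:", f1_score(y_true, y_pred, average="macro"))
print("AP@0.5:", ap_at_iou([(0.9, (0, 0, 10, 10))], [(1, 1, 9, 9)]))
```

In the reported setup, mAP would average such per-class AP values across component categories (and typically across IoU thresholds); this sketch shows a single class and threshold.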
6. Discussion
The results indicate that specialized architectures and benchmarks are necessary for progress in creative UI understanding. The main limitation is the benchmark's current coverage: results should be validated on a broader range of creative tools. Future work will address this and explore integration with interactive AI assistants.
7. Conclusion
We present a model and benchmark for image UI grounding in creative software. Our approach improves performance on complex creative interfaces and provides a foundation for future research in this area.
References
- Li, Y. et al. (2019). PubLayNet: Largest Dataset Ever for Document Layout Analysis. ICDAR.
- Deka, B. et al. (2017). RICO: A Mobile App Dataset for Building Data-Driven Design Applications. UIST.
- Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.