# ICCVW-2023-Papers

## What is Next in Multimodal Foundation Models?

| Title | Repo | Paper | Video |
|-------|------|-------|-------|
| Coarse to Fine Frame Selection for Online Open-Ended Video Question Answering | | thecvf | YouTube |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | thecvf, arXiv | |
| Video-and-Language (VidL) Models and their Cognitive Relevance | | thecvf | |
| Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification | GitHub | thecvf | |
| Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection | GitHub | thecvf, arXiv | |
| ClipCrop: Conditioned Cropping Driven by Vision-Language Model | | thecvf | |
| Towards an Exhaustive Evaluation of Vision-Language Foundation Models | | thecvf | |
| Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts | GitHub | thecvf, arXiv | |
| Painter: Teaching Auto-Regressive Language Models to Draw Sketches | | thecvf, arXiv | |