# ICCVW-2023-Papers

## What is Next in Multimodal Foundation Models?

| Title | Repo | Paper | Video |
|-------|------|-------|-------|
| Coarse to Fine Frame Selection for Online Open-Ended Video Question Answering | | thecvf | YouTube |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | thecvf, arXiv | |
| Video-and-Language (VidL) Models and their Cognitive Relevance | | thecvf | |
| Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification | GitHub | thecvf | |
| Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection | GitHub | thecvf, arXiv | |
| ClipCrop: Conditioned Cropping Driven by Vision-Language Model | | thecvf | |
| Towards an Exhaustive Evaluation of Vision-Language Foundation Models | | thecvf | |
| Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts | GitHub | thecvf, arXiv | |
| Painter: Teaching Auto-Regressive Language Models to Draw Sketches | | thecvf, arXiv | |