Stars
A paper list of recent works on token compression for ViT and VLM
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Textureless Underwater Real Time Localization and Mapping
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
We propose CRKD to bridge the performance gap between LC and CR detectors with a novel cross-modality knowledge distillation (KD) framework.
Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Official repository of the CVPR 2024 paper "RMem: Restricted Memory Banks Improve Video Object Segmentation"
Taming Transformers for High-Resolution Image Synthesis
Official repository for "AM-RADIO: Reduce All Domains Into One"
[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects
Accelerating the development of large multimodal models (LMMs) with lmms-eval
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Reaching LLaMA2 Performance with 0.1M Dollars
✨✨ Latest Advances on Multimodal Large Language Models
[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly …
EfficientViT is a new family of vision models for efficient high-resolution vision.
SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It solves 12.47% of bugs in the SWE-bench evaluation set and takes just 1 minute to run.
Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024)
[ECCV 2024] Official PyTorch implementation code for realizing the technical part of Mixture of All Intelligence (MoAI) to improve performance of numerous zero-shot vision language tasks.
[ECCV'24] GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image
When do we not need larger vision models?