Stars
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
High-resolution models for human tasks.
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
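Li Bai 👤, an outstanding poet of the Tang dynasty, holds an important place in the history of Chinese literature. In recent years, with the rapid development of digital technology and artificial intelligence, the ways in which traditional culture is popularized and promoted have also faced innovation and change. Although research on Li Bai's poetry in China and abroad is already quite thorough, shortcomings remain in digital and AI-assisted popularization. This project therefore aims to build a Li Bai knowledge graph and combine it with large-model training to produce a specialized AI agent that, as a generative dialogue application, promotes the popularization of Li Bai's culture.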
Controllable video and image generation: SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA
The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
[CVPR 2024] This is the official source for our paper "SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis"
Use Microsoft Edge's online text-to-speech service from Python WITHOUT needing Microsoft Edge or Windows or an API key
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
V-Express aims to generate a talking head video under the control of a reference image, an audio clip, and a sequence of V-Kps images.
Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting yo…
A generative speech model for daily dialogue.
GLM-4 series: Open Multilingual Multimodal Chat LMs
Official PyTorch implementation of ECCV 2024 Paper: ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback.
YOLOv10: Real-Time End-to-End Object Detection [NeurIPS 2024]
[CVPR2024] StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
GPT4V-level open-source multi-modal model based on Llama3-8B
[CVPR 2024] Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Mixture-of-Experts for Large Vision-Language Models
[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
Official implementation of FaceXFormer: A Unified Transformer for Facial Analysis
CAMixerSR: Only Details Need More “Attention” (CVPR 2024)