Mastering Multimodal AI: CLIP, LLaVA, GPT-4V, and Gemini Vision
Table of Contents
- Multimodal AI Overview
- CLIP in Depth
- The BLIP Family
- LLaVA: Large Language and Vision Assistant
- InstructBLIP
- GPT-4 Vision
- Gemini Vision
- Claude Vision
- Multimodal RAG
- Open-Source Multimodal Models
- Video Understanding AI
1. Multimodal AI Overview
The Limits of a Single Modality
Traditional AI systems could process only one data modality: text, images, or audio. This single-modality approach has fundamental limitations when it comes to solving complex real-world problems.
Limitations of text-only models:
- Cannot analyze an image itself when asked to describe one
- Cannot understand the contents of diagrams, charts, or screenshots
- Must make decisions without visual context
Limitations of image-only models:
- Cannot jointly understand the text and visual elements within an image
- Cannot answer language-based questions
- Cannot be searched with natural-language descriptions
What Multimodal AI Makes Possible
Multimodal AI systems process and understand multiple forms of data at once, giving AI the naturally human ability to see, hear, and read while understanding all of it simultaneously.
Key application areas:
- Medical diagnosis: joint analysis of medical images and patient-record text
- Autonomous driving: fusing camera, LiDAR, and map data
- Education: automatically generating explanatory text for textbook images
- E-commerce: processing product photos, descriptions, and reviews together
- Document understanding: OCR on scanned documents plus content analysis
- Creative work: generating images from text descriptions (DALL-E, Stable Diffusion)
A Brief History of Vision-Language Models
2021: CLIP (OpenAI) - connects images and text via contrastive learning
2022: BLIP - unifies image captioning and VQA
2023: BLIP-2 - efficient multimodal learning via the Q-Former
2023: LLaVA - open-source vision-language assistant
2023: GPT-4V - commercial multimodal LLM
2023: Gemini - Google's multimodal foundation model
2024: Claude 3 Vision - Anthropic's multimodal model
2024: LLaVA-1.6, InternVL2, Qwen2-VL - open-source improvements
2025: expansion into video and 3D understanding
2. CLIP in Depth
The Core Idea Behind CLIP
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, was trained with contrastive learning on 400 million image-text pairs, mapping images and text into a shared embedding space.
Key innovation: without any manual labels, CLIP learned from image-caption pairs collected from the web and acquired strong zero-shot classification ability.
CLIP Architecture
Image → [Image encoder (ViT/ResNet)] → image embedding (512-dim)
                                           ↕ similarity
Text  → [Text encoder (Transformer)]  → text embedding (512-dim)
Contrastive learning mechanism:
For N image-text pairs in a batch:
- Maximize the similarity of correct pairs (the diagonal)
- Minimize the similarity of incorrect pairs (off-diagonal)
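The symmetric objective above can be sketched numerically. The snippet below builds a toy similarity matrix from random L2-normalized embeddings and computes the image→text and text→image cross-entropy losses whose average CLIP minimizes. This is a minimal NumPy sketch; the real model learns both encoders end to end and also learns the temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8  # batch of 4 pairs, 8-dim toy embeddings

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = l2_normalize(rng.normal(size=(N, d)))
txt = l2_normalize(rng.normal(size=(N, d)))

temperature = 0.07
logits = img @ txt.T / temperature  # (N, N) similarity matrix

def cross_entropy_diag(logits):
    # The correct pair for row i is column i (the diagonal).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Average of the image->text and text->image losses
loss = (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
print(f"symmetric contrastive loss: {loss:.4f}")
```

With perfectly aligned pairs (an identity similarity matrix scaled by the temperature), the loss collapses toward zero, which is exactly what training pushes the encoders toward.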
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_zero_shot_classification(
    image: Image.Image,
    candidate_labels: list[str]
) -> dict[str, float]:
    """Zero-shot image classification with CLIP."""
    # Build text prompts (CLIP's recommended template)
    text_prompts = [f"a photo of a {label}" for label in candidate_labels]
    # Preprocess
    inputs = processor(
        text=text_prompts,
        images=image,
        return_tensors="pt",
        padding=True
    )
    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits_per_image = outputs.logits_per_image
        probs = logits_per_image.softmax(dim=1)
    # Return label -> probability
    return {
        label: prob.item()
        for label, prob in zip(candidate_labels, probs[0])
    }

# Usage example
image_url = "https://example.com/sample_image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
labels = ["cat", "dog", "bird", "fish", "rabbit"]
results = clip_zero_shot_classification(image, labels)

# Sort by probability
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for label, prob in sorted_results:
    print(f"{label}: {prob:.4f} ({prob*100:.1f}%)")
Image-Text Retrieval
import numpy as np

class CLIPSearchEngine:
    """CLIP-based image-text search engine."""

    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.image_embeddings = []
        self.text_embeddings = []
        self.image_metadata = []
        self.text_metadata = []

    def encode_images(self, images: list[Image.Image]) -> torch.Tensor:
        """Encode a batch of images into embeddings."""
        inputs = self.processor(
            images=images,
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
            # L2-normalize so dot products are cosine similarities
            image_features = F.normalize(image_features, p=2, dim=-1)
        return image_features

    def encode_texts(self, texts: list[str]) -> torch.Tensor:
        """Encode a batch of texts into embeddings."""
        inputs = self.processor(
            text=texts,
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
            text_features = F.normalize(text_features, p=2, dim=-1)
        return text_features

    def index_images(
        self,
        images: list[Image.Image],
        metadata: list[dict] = None
    ):
        """Add images to the index."""
        embeddings = self.encode_images(images)
        self.image_embeddings.append(embeddings)
        if metadata:
            self.image_metadata.extend(metadata)

    def text_to_image_search(
        self,
        query: str,
        top_k: int = 5
    ) -> list[dict]:
        """Search indexed images with a text query."""
        if not self.image_embeddings:
            return []
        # Combine all image embeddings
        all_embeddings = torch.cat(self.image_embeddings, dim=0)
        # Encode the query
        query_embedding = self.encode_texts([query])
        # Cosine similarity (embeddings are already L2-normalized)
        similarities = (all_embeddings @ query_embedding.T).squeeze(-1)
        # Select the top-K results
        top_indices = similarities.argsort(descending=True)[:top_k]
        results = []
        for idx in top_indices:
            idx = idx.item()
            result = {
                "index": idx,
                "similarity": similarities[idx].item()
            }
            if self.image_metadata:
                result.update(self.image_metadata[idx])
            results.append(result)
        return results
OpenCLIP (Open-Source CLIP)
# OpenCLIP: supports many architectures and training datasets
# pip install open_clip_torch
import open_clip
import torch
from PIL import Image

# List available pretrained models
available_models = open_clip.list_pretrained()
print("Available models:", available_models[:5])

# Load a large model trained on LAION-2B
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-H-14',
    pretrained='laion2b_s32b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-H-14')

def compute_clip_similarity(
    image: Image.Image,
    texts: list[str],
    model=model,
    preprocess=preprocess_val,
    tokenizer=tokenizer
) -> list[float]:
    """Compute CLIP similarities between an image and a list of texts."""
    model.eval()
    # Preprocess the image
    image_input = preprocess(image).unsqueeze(0)
    # Tokenize the texts
    text_input = tokenizer(texts)
    # autocast enables mixed precision when running on GPU
    with torch.no_grad(), torch.cuda.amp.autocast():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)
        # Normalize
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Similarity as a softmax over scaled cosine scores
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return similarity[0].tolist()
Fine-Tuning CLIP
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

class ImageTextDataset(Dataset):
    """Dataset of image-text pairs."""

    def __init__(
        self,
        image_paths: list[str],
        texts: list[str],
        processor: CLIPProcessor
    ):
        self.image_paths = image_paths
        self.texts = texts
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        text = self.texts[idx]
        inputs = self.processor(
            images=image,
            text=text,
            return_tensors="pt",
            padding="max_length",
            max_length=77,
            truncation=True
        )
        return {
            "pixel_values": inputs["pixel_values"].squeeze(0),
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

class CLIPFineTuner:
    """Fine-tunes a CLIP model."""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def contrastive_loss(
        self,
        image_features: torch.Tensor,
        text_features: torch.Tensor,
        temperature: float = 0.07
    ) -> torch.Tensor:
        """Symmetric contrastive loss."""
        # Normalize
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # Similarity matrix
        logits = torch.matmul(image_features, text_features.T) / temperature
        # The diagonal is the ground truth (image i pairs with text i)
        labels = torch.arange(len(logits)).to(self.device)
        # Cross-entropy in both directions
        loss_i = F.cross_entropy(logits, labels)
        loss_t = F.cross_entropy(logits.T, labels)
        return (loss_i + loss_t) / 2

    def train(
        self,
        train_dataset: ImageTextDataset,
        num_epochs: int = 10,
        batch_size: int = 32,
        learning_rate: float = 1e-5
    ):
        """CLIP fine-tuning loop."""
        dataloader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=4
        )
        optimizer = optim.AdamW(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=0.01
        )
        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=num_epochs
        )
        for epoch in range(num_epochs):
            total_loss = 0
            self.model.train()
            for batch in dataloader:
                pixel_values = batch["pixel_values"].to(self.device)
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                # Forward pass
                outputs = self.model(
                    pixel_values=pixel_values,
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )
                image_features = outputs.image_embeds
                text_features = outputs.text_embeds
                # Compute the loss
                loss = self.contrastive_loss(image_features, text_features)
                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            scheduler.step()
            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
3. The BLIP Family
BLIP (Bootstrapping Language-Image Pre-training)
BLIP, released by Salesforce Research in 2022, performs strongly across a range of vision-language tasks, including image captioning, image-text retrieval, and visual question answering (VQA).
Key innovation: data bootstrapping with a Captioner and a Filter, which cleans the noisy image-text pairs collected from the web.
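That bootstrapping loop can be stated abstractly: the Captioner proposes a synthetic caption for each image, the Filter scores how well each candidate caption matches the image, and only high-scoring captions survive. Below is a toy stand-in, where `match_score` plays the role of BLIP's image-text matching head; the function names, scores, and threshold here are illustrative, not BLIP's actual API.

```python
def bootstrap_captions(pairs, captioner, match_score, threshold=0.5):
    """Keep each image's best caption (web or synthetic) that clears the filter."""
    cleaned = []
    for image, web_caption in pairs:
        candidates = [web_caption, captioner(image)]  # original + synthetic
        scored = [(match_score(image, c), c) for c in candidates]
        best_score, best_caption = max(scored)
        if best_score >= threshold:  # the Filter discards noisy pairs
            cleaned.append((image, best_caption))
    return cleaned

# Toy data: image IDs with precomputed match scores standing in for the ITM head
toy_scores = {
    ("img1", "buy cheap shoes now"): 0.1,   # noisy web caption
    ("img1", "a dog on a beach"): 0.9,      # synthetic caption fits better
    ("img2", "a red car"): 0.8,
    ("img2", "a photo"): 0.4,
}
pairs = [("img1", "buy cheap shoes now"), ("img2", "a red car")]
synthetic = {"img1": "a dog on a beach", "img2": "a photo"}

result = bootstrap_captions(
    pairs,
    captioner=lambda img: synthetic[img],
    match_score=lambda img, cap: toy_scores[(img, cap)],
)
print(result)  # [('img1', 'a dog on a beach'), ('img2', 'a red car')]
```

The point is that neither the web captions nor the synthetic ones are trusted blindly: the Filter arbitrates, which is what lets BLIP train on web-scale data without web-scale noise.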
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import BlipForQuestionAnswering
from PIL import Image
import torch

# BLIP image captioning
class BLIPCaptioner:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-image-captioning-large"
        )
        self.model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-large",
            torch_dtype=torch.float16
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

    def caption(
        self,
        image: Image.Image,
        conditional_text: str = None,
        max_new_tokens: int = 50
    ) -> str:
        """Generate a caption for an image."""
        if conditional_text:
            # Conditional captioning: continue from the given prefix
            inputs = self.processor(
                image,
                conditional_text,
                return_tensors="pt"
            ).to(self.device, torch.float16)
        else:
            # Unconditional captioning
            inputs = self.processor(
                image,
                return_tensors="pt"
            ).to(self.device, torch.float16)
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True
            )
        return self.processor.decode(output[0], skip_special_tokens=True)

# BLIP VQA (visual question answering)
class BLIPVisualQA:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-vqa-base"
        )
        self.model = BlipForQuestionAnswering.from_pretrained(
            "Salesforce/blip-vqa-base"
        )

    def answer(self, image: Image.Image, question: str) -> str:
        """Answer a question about an image."""
        inputs = self.processor(image, question, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.decode(output[0], skip_special_tokens=True)

# Usage example
captioner = BLIPCaptioner()
vqa = BLIPVisualQA()
image = Image.open("sample.jpg")

# Generate a caption
caption = captioner.caption(image)
print(f"Caption: {caption}")

# Conditional caption
cond_caption = captioner.caption(image, "a photo of")
print(f"Conditional caption: {cond_caption}")

# VQA
answer = vqa.answer(image, "What color is the sky?")
print(f"Answer: {answer}")
BLIP-2: Querying Transformer
BLIP-2, released in 2023 as BLIP's successor, introduces the Q-Former (Querying Transformer), which efficiently bridges a frozen image encoder and a frozen LLM.
The Q-Former's role:
- Extract the most relevant visual features from the image encoder's output
- 32 learnable query tokens interact with the image features to produce a compressed representation
- This compressed representation is fed to the LLM as input
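The compression step above can be sketched as a single cross-attention pass: a small set of learnable queries attends over many patch features and emits a fixed number of tokens regardless of input length. A minimal NumPy sketch follows; the real Q-Former stacks transformer blocks with self-attention and learned projections, and the dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 257, 64       # e.g. ViT patch features (+ CLS token)
num_queries = 32               # the Q-Former's learnable query tokens

image_features = rng.normal(size=(num_patches, d))   # frozen encoder output
queries = rng.normal(size=(num_queries, d))          # learned parameters

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query pools a weighted mix of patch features
attn = softmax(queries @ image_features.T / np.sqrt(d))  # (32, 257)
compressed = attn @ image_features                       # (32, 64)

print(compressed.shape)  # (32, 64): a fixed-size input for the LLM
```

However many patches the encoder produces, the LLM always receives exactly 32 visual tokens, which is what keeps BLIP-2 training cheap with both big models frozen.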
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

class BLIP2Assistant:
    """BLIP-2-based visual question-answering assistant."""

    def __init__(
        self,
        model_name: str = "Salesforce/blip2-opt-2.7b"
    ):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate_response(
        self,
        image: Image.Image,
        prompt: str = None,
        max_new_tokens: int = 200,
        temperature: float = 1.0
    ) -> str:
        """Generate a response to an image and an optional prompt."""
        if prompt:
            inputs = self.processor(
                images=image,
                text=prompt,
                return_tensors="pt"
            ).to("cuda", torch.float16)
        else:
            inputs = self.processor(
                images=image,
                return_tensors="pt"
            ).to("cuda", torch.float16)
        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )
        generated_text = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0].strip()
        return generated_text

    def batch_caption(
        self,
        images: list[Image.Image],
        batch_size: int = 8
    ) -> list[str]:
        """Generate captions for a batch of images."""
        all_captions = []
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            inputs = self.processor(
                images=batch,
                return_tensors="pt",
                padding=True
            ).to("cuda", torch.float16)
            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=50
                )
            captions = self.processor.batch_decode(
                generated_ids,
                skip_special_tokens=True
            )
            all_captions.extend([c.strip() for c in captions])
        return all_captions

# Usage example
assistant = BLIP2Assistant("Salesforce/blip2-flan-t5-xxl")
image = Image.open("document.png")

# Free-form question
response = assistant.generate_response(
    image,
    "Question: What is the main topic of this document? Answer:"
)
print(response)

# Conversational session
conversation_history = []
questions = [
    "What do you see in the image?",
    "What are the colors?",
    "What does the background look like?"
]
for q in questions:
    history_text = "\n".join(conversation_history)
    prompt = f"{history_text}\nQuestion: {q} Answer:"
    answer = assistant.generate_response(image, prompt)
    print(f"Q: {q}")
    print(f"A: {answer}")
    conversation_history.append(f"Q: {q} A: {answer}")
4. LLaVA: Large Language and Vision Assistant
LLaVA Architecture
LLaVA (Large Language and Vision Assistant), released in 2023, is an open-source vision-language model that connects a strong LLM (LLaMA, Vicuna) to a CLIP vision encoder, producing a multimodal chatbot with instruction-following ability.
Architecture:
Image → [CLIP ViT-L/14] → image features (1024-dim)
              ↓
  [Linear projection layer]
              ↓
       [Visual tokens]
              ↓
[LLM (LLaMA/Vicuna)] ← [text tokens]
              ↓
       Final response
LLaVA-1.5 improvements:
- MLP projection layer (linear → 2-layer MLP)
- Higher-resolution image support
- More training data
LLaVA-1.6 (LLaVA-NeXT) improvements:
- Dynamic High Resolution: up to 672x672 → 4x more visual tokens
- Better reasoning and OCR
- Support for varied aspect ratios
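The projector that LLaVA-1.5 upgraded is a small component: it maps each CLIP patch feature into the LLM's embedding space so image patches can be consumed like ordinary tokens. Below is a minimal NumPy sketch of the 2-layer MLP with GELU; the dimensions follow the common ViT-L/14 → 7B-LLM pairing, and the random weights stand in for the learned projector.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_vision, d_llm = 576, 1024, 4096  # ViT-L/14 @ 336px -> 7B LLM

patch_features = rng.normal(size=(num_patches, d_vision))
W1 = rng.normal(scale=0.02, size=(d_vision, d_llm))  # stand-in learned weights
W2 = rng.normal(scale=0.02, size=(d_llm, d_llm))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# 2-layer MLP projector: vision features -> LLM token embeddings
visual_tokens = gelu(patch_features @ W1) @ W2

print(visual_tokens.shape)  # (576, 4096): one "token" per image patch
```

Because the projector is the only new component between two pretrained models, LLaVA's visual instruction tuning is comparatively cheap: the bridge is tiny even though the models on either side are large.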
Using LLaVA with Hugging Face
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

class LLaVAAssistant:
    """Visual assistant built on LLaVA-1.6."""

    def __init__(
        self,
        model_name: str = "llava-hf/llava-v1.6-mistral-7b-hf"
    ):
        self.processor = LlavaNextProcessor.from_pretrained(model_name)
        self.model = LlavaNextForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto"
        )

    def chat(
        self,
        image: Image.Image,
        message: str,
        max_new_tokens: int = 500,
        temperature: float = 0.7
    ) -> str:
        """Chat about an image."""
        # LLaVA-1.6 conversation format
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": message}
                ]
            }
        ]
        prompt = self.processor.apply_chat_template(
            conversation,
            add_generation_prompt=True
        )
        inputs = self.processor(
            prompt,
            image,
            return_tensors="pt"
        ).to("cuda")
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0,
                pad_token_id=self.processor.tokenizer.eos_token_id
            )
        # Keep only the newly generated tokens, excluding the input
        generated = output[0][inputs["input_ids"].shape[1]:]
        return self.processor.decode(generated, skip_special_tokens=True)

    def analyze_chart(self, chart_image: Image.Image) -> dict:
        """Analyze a chart image."""
        analysis_prompts = [
            "What is the title of this chart?",
            "What do the x-axis and y-axis represent?",
            "What are the highest and lowest values?",
            "Describe the overall trend.",
            "What is the most important insight in this data?"
        ]
        results = {}
        for prompt in analysis_prompts:
            response = self.chat(chart_image, prompt)
            results[prompt] = response
        return results

    def extract_text_from_image(self, image: Image.Image) -> str:
        """Extract text from an image (OCR)."""
        return self.chat(
            image,
            "Extract all of the text in this image exactly as written. "
            "Return only the text, with no additional commentary."
        )

# Practical use: a document analysis pipeline
class DocumentAnalysisPipeline:
    """Document analysis pipeline built on LLaVA."""

    def __init__(self):
        self.llava = LLaVAAssistant()

    def analyze_document(self, document_image: Image.Image) -> dict:
        """Run a full analysis of a document image."""
        # 1. Identify the document type
        doc_type = self.llava.chat(
            document_image,
            "What type of document is this? (invoice, contract, report, form, etc.)"
        )
        # 2. Extract the text
        extracted_text = self.llava.extract_text_from_image(document_image)
        # 3. Extract key information
        key_info = self.llava.chat(
            document_image,
            f"From this {doc_type}, extract the following as JSON: "
            "date, sender, recipient, amount (if any), and a short summary."
        )
        # 4. Identify action items
        action_items = self.llava.chat(
            document_image,
            "List any action items required by this document."
        )
        return {
            "document_type": doc_type,
            "extracted_text": extracted_text,
            "key_information": key_info,
            "action_items": action_items
        }
5. InstructBLIP
The Core of InstructBLIP
InstructBLIP builds on BLIP-2 but applies instruction tuning so the model can follow a wide range of instructions. The key idea is that the Q-Former is conditioned on the instruction, so it extracts the visual features relevant to that instruction.
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image

class InstructBLIPAssistant:
    """Instruction-following assistant built on InstructBLIP."""

    def __init__(self, model_name: str = "Salesforce/instructblip-vicuna-7b"):
        self.processor = InstructBlipProcessor.from_pretrained(model_name)
        self.model = InstructBlipForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def instruct(
        self,
        image: Image.Image,
        instruction: str,
        max_new_tokens: int = 300
    ) -> str:
        """Carry out a specific instruction about an image."""
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to("cuda", torch.float16)
        with torch.no_grad():
            # Deterministic beam search (sampling parameters are ignored
            # when do_sample=False)
            outputs = self.model.generate(
                **inputs,
                do_sample=False,
                num_beams=5,
                max_new_tokens=max_new_tokens,
                min_length=1,
                repetition_penalty=1.5,
                length_penalty=1.0
            )
        generated_text = self.processor.batch_decode(
            outputs,
            skip_special_tokens=True
        )[0].strip()
        return generated_text

# Usage examples
assistant = InstructBLIPAssistant()
image = Image.open("complex_diagram.png")

# Describe a complex diagram
description = assistant.instruct(
    image,
    "Describe this diagram in detail, including the role of each component and how they are connected."
)

# Object detection
objects = assistant.instruct(
    image,
    "List every object visible in the image and describe where each one is located."
)

# Emotion analysis
emotion = assistant.instruct(
    image,
    "Analyze the emotional state of the people in the image and explain your reasoning."
)

# Comparative analysis (InstructBLIP takes one image per call, so describe
# each image separately and compare the descriptions afterwards)
comparison = assistant.instruct(
    image,
    "Describe this image's distinguishing features so it can be compared with similar images."
)
6. GPT-4 Vision
Using the GPT-4V API
GPT-4 Vision adds visual capability to OpenAI's GPT-4 models and remains one of the most capable commercial multimodal LLMs.
import openai
import base64
from pathlib import Path
import httpx

client = openai.OpenAI()

def encode_image_to_base64(image_path: str) -> str:
    """Encode an image file as Base64."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def image_url_to_base64(url: str) -> str:
    """Download an image from a URL and encode it as Base64."""
    response = httpx.get(url)
    return base64.b64encode(response.content).decode('utf-8')
class GPT4VisionAnalyzer:
    """Image analyzer built on GPT-4 Vision."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = openai.OpenAI()
        self.model = model

    def analyze_image(
        self,
        image_source: str,  # file path or URL
        prompt: str,
        is_url: bool = True,
        detail: str = "high",  # "low", "high", "auto"
        max_tokens: int = 1000
    ) -> str:
        """Analyze a single image."""
        if is_url:
            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": image_source,
                    "detail": detail
                }
            }
        else:
            # Local file: embed as a data URL
            base64_image = encode_image_to_base64(image_source)
            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")
            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{media_type};base64,{base64_image}",
                    "detail": detail
                }
            }
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_content,
                        {"type": "text", "text": prompt}
                    ]
                }
            ],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content
    def analyze_multiple_images(
        self,
        image_sources: list[dict],  # [{"source": "...", "is_url": True}]
        prompt: str,
        max_tokens: int = 2000
    ) -> str:
        """Analyze several images in a single request."""
        content = []
        for img_info in image_sources:
            source = img_info["source"]
            is_url = img_info.get("is_url", True)
            if is_url:
                content.append({
                    "type": "image_url",
                    "image_url": {"url": source, "detail": "high"}
                })
            else:
                base64_image = encode_image_to_base64(source)
                content.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                })
        content.append({"type": "text", "text": prompt})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": content}],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content
    def analyze_chart_or_graph(self, image_source: str) -> dict:
        """Analyze a chart or graph into a structured format."""
        prompt = """Analyze this chart/graph and return the following JSON:
{
    "chart_type": "bar/line/pie/scatter/etc.",
    "title": "chart title",
    "x_axis": {"label": "x-axis label", "unit": "unit"},
    "y_axis": {"label": "y-axis label", "unit": "unit"},
    "data_series": [{"name": "series name", "trend": "rising/falling/flat"}],
    "key_findings": ["finding 1", "finding 2"],
    "data_range": {"min": 0, "max": 0},
    "anomalies": ["description of outliers"]
}"""
        response = self.analyze_image(
            image_source,
            prompt,
            detail="high",
            max_tokens=1500
        )
        import json
        try:
            # Try to parse the JSON portion of the response
            start = response.find('{')
            end = response.rfind('}') + 1
            if start >= 0 and end > start:
                return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass
        return {"raw_response": response}
    def extract_structured_data_from_document(
        self,
        document_image_path: str
    ) -> dict:
        """Extract structured data from a document image."""
        prompt = """Extract the following information from this document as JSON:
1. Document type
2. Date (if present)
3. Sender/author
4. Recipient (if present)
5. Summary of the main content (3-5 sentences)
6. Key numerical data (tables, amounts, etc.)
7. Whether a signature/approval is present
JSON format:
{
    "document_type": "",
    "date": "",
    "author": "",
    "recipient": "",
    "summary": "",
    "numerical_data": [],
    "signature_present": false
}"""
        response = self.analyze_image(
            document_image_path,
            prompt,
            is_url=False,
            max_tokens=1500
        )
        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            if start >= 0 and end > start:
                return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass
        return {"raw_response": response}
# Practical use: e-commerce product analysis
def analyze_product_images(image_urls: list[str]) -> dict:
    """Analyze several product images."""
    analyzer = GPT4VisionAnalyzer()
    image_sources = [{"source": url, "is_url": True} for url in image_urls]
    result = analyzer.analyze_multiple_images(
        image_sources,
        prompt="""Analyze these product images and return the following JSON:
{
    "product_name": "estimated product name",
    "category": "product category",
    "color_options": ["available colors"],
    "key_features": ["key features"],
    "condition": "new/used/etc.",
    "quality_score": 0-10,
    "marketing_description": "marketing copy (100 characters)",
    "seo_keywords": ["SEO keywords"]
}"""
    )
    return result
7. Gemini Vision
Gemini's Multimodal Capabilities
Google's Gemini is a foundation model designed from the start to be multimodal. Gemini 1.5 Pro in particular offers a 1-million-token context window, letting it process long videos, lengthy documents, and large numbers of images.
import google.generativeai as genai
import PIL.Image
from pathlib import Path
import base64

# Configure the API key
genai.configure(api_key="YOUR_GEMINI_API_KEY")

class GeminiVisionAnalyzer:
    """Analyzer built on Gemini Vision."""

    def __init__(self, model_name: str = "gemini-1.5-pro"):
        self.model = genai.GenerativeModel(model_name)
        self.vision_model = genai.GenerativeModel("gemini-1.5-flash")

    def analyze_image(
        self,
        image_path: str,
        prompt: str
    ) -> str:
        """Analyze a local image."""
        image = PIL.Image.open(image_path)
        response = self.model.generate_content([prompt, image])
        return response.text

    def analyze_with_url(self, image_url: str, prompt: str) -> str:
        """Analyze an image fetched from a URL."""
        import httpx
        image_data = httpx.get(image_url).content
        image_part = {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image_data).decode('utf-8')
        }
        response = self.model.generate_content([
            {"text": prompt},
            image_part
        ])
        return response.text
    def analyze_video(
        self,
        video_path: str,
        questions: list[str]
    ) -> dict:
        """Analyze a video (a strength of Gemini 1.5 Pro)."""
        # Upload the video file
        print(f"Uploading video: {video_path}")
        video_file = genai.upload_file(
            path=video_path,
            display_name="analysis_video"
        )
        # Wait for processing to finish
        import time
        while video_file.state.name == "PROCESSING":
            print("Processing...")
            time.sleep(10)
            video_file = genai.get_file(video_file.name)
        if video_file.state.name == "FAILED":
            raise ValueError("Video processing failed")
        print(f"Video upload complete: {video_file.uri}")
        # Ask each question
        results = {}
        for question in questions:
            response = self.model.generate_content(
                [video_file, question],
                request_options={"timeout": 600}
            )
            results[question] = response.text
        # Delete the uploaded file (optional)
        genai.delete_file(video_file.name)
        return results
    def analyze_multiple_images_interleaved(
        self,
        image_text_pairs: list[dict]  # [{"image": PIL.Image, "text": str}]
    ) -> str:
        """Handle a composite query with interleaved images and text."""
        content = []
        for pair in image_text_pairs:
            if "text" in pair:
                content.append(pair["text"])
            if "image" in pair:
                content.append(pair["image"])
        response = self.model.generate_content(content)
        return response.text

    def process_document_batch(
        self,
        document_images: list[PIL.Image.Image],
        extraction_schema: str
    ) -> list[dict]:
        """Process many documents in one request (uses Gemini's long context)."""
        import json
        # Send all images in a single request
        content = [f"Analyze the following {len(document_images)} documents:\n"]
        for i, img in enumerate(document_images, 1):
            content.append(f"\n--- Document {i} ---")
            content.append(img)
        content.append(f"\nFor each document, extract data using this JSON schema:\n{extraction_schema}")
        response = self.model.generate_content(content)
        # Parse the JSON array from the response
        try:
            text = response.text
            start = text.find('[')
            end = text.rfind(']') + 1
            if start >= 0 and end > start:
                return json.loads(text[start:end])
        except json.JSONDecodeError:
            pass
        return [{"raw_response": response.text}]
# Usage example: video analysis
analyzer = GeminiVisionAnalyzer()
video_questions = [
    "Summarize the entire video.",
    "List the key scenes with timestamps.",
    "What are the main keywords or concepts mentioned in the video?",
    "What are the video's topic and purpose?"
]
results = analyzer.analyze_video("lecture_video.mp4", video_questions)
for question, answer in results.items():
    print(f"\nQuestion: {question}")
    print(f"Answer: {answer}")
8. Claude Vision
The Claude Vision API
Anthropic's Claude 3.5 Sonnet offers strong visual capabilities, excelling in particular at document understanding, code-screenshot analysis, and fine-grained image interpretation.
import anthropic
import base64
import httpx
from pathlib import Path

client = anthropic.Anthropic()

class ClaudeVisionAnalyzer:
    """Image analyzer built on Claude Vision."""

    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic()
        self.model = model

    def _prepare_image_content(
        self,
        image_source: str,
        is_url: bool = True
    ) -> dict:
        """Package an image in the format the Claude API expects."""
        if is_url:
            return {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": image_source
                }
            }
        else:
            # Encode a local file as Base64
            with open(image_source, "rb") as f:
                image_data = base64.standard_b64encode(f.read()).decode("utf-8")
            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")
            return {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }
            }
    def analyze(
        self,
        image_source: str,
        prompt: str,
        is_url: bool = True,
        system_prompt: str = None,
        max_tokens: int = 1000
    ) -> str:
        """Analyze an image."""
        image_content = self._prepare_image_content(image_source, is_url)
        messages = [
            {
                "role": "user",
                "content": [
                    image_content,
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        kwargs = {
            "model": self.model,
            "max_tokens": max_tokens,
            "messages": messages
        }
        if system_prompt:
            kwargs["system"] = system_prompt
        response = self.client.messages.create(**kwargs)
        return response.content[0].text
    def analyze_code_screenshot(
        self,
        screenshot_path: str
    ) -> dict:
        """Analyze a code screenshot and extract the code."""
        system_prompt = """You are a code analysis expert.
Extract and analyze the code in the screenshot precisely."""
        extraction_prompt = """From this code screenshot:
1. Extract the code exactly (including indentation)
2. Identify the programming language
3. Explain the code's main functionality
4. Point out potential bugs or improvements
Respond in this JSON format:
{
    "language": "programming language",
    "code": "extracted code",
    "description": "code description",
    "potential_issues": ["issue 1", "issue 2"],
    "improvements": ["improvement 1", "improvement 2"]
}"""
        response = self.analyze(
            screenshot_path,
            extraction_prompt,
            is_url=False,
            system_prompt=system_prompt,
            max_tokens=2000
        )
        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            return {"raw_response": response}
    def compare_images(
        self,
        image_sources: list[tuple[str, bool]],  # (source, is_url) pairs
        comparison_prompt: str
    ) -> str:
        """Compare several images."""
        content = []
        for source, is_url in image_sources:
            content.append(self._prepare_image_content(source, is_url))
        content.append({"type": "text", "text": comparison_prompt})
        response = self.client.messages.create(
            model=self.model,
            max_tokens=2000,
            messages=[{"role": "user", "content": content}]
        )
        return response.content[0].text
    def analyze_ui_design(self, ui_screenshot_path: str) -> dict:
        """Analyze a UI design screenshot."""
        prompt = """Analyze this UI screenshot from a UX/UI expert's perspective:
Items to analyze:
1. Layout structure
2. Color palette
3. Typography
4. Usability evaluation
5. Accessibility issues
6. Suggested improvements
Return JSON in this format:
{
    "layout": "layout description",
    "color_palette": ["main colors"],
    "typography": "typography assessment",
    "usability_score": 0-10,
    "usability_issues": ["issues"],
    "accessibility_issues": ["accessibility problems"],
    "improvements": ["suggested improvements"]
}"""
        response = self.analyze(
            ui_screenshot_path,
            prompt,
            is_url=False,
            max_tokens=1500
        )
        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            return {"raw_response": response}
# Usage example
analyzer = ClaudeVisionAnalyzer()

# Analyze an image
result = analyzer.analyze(
    "https://example.com/product.jpg",
    "Describe this product's features in detail and suggest its likely target customers.",
    is_url=True
)
print(result)
9. Multimodal RAG
Multimodal RAG Overview
Multimodal RAG systems index and retrieve not only text but also images, tables, charts, and other content types.
Image Indexing Strategy
import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import chromadb
import base64
import io

class MultimodalRAGSystem:
    """A multimodal RAG system."""

    def __init__(self):
        # Initialize the CLIP model
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        # Initialize ChromaDB
        self.chroma_client = chromadb.Client()
        self.image_collection = self.chroma_client.get_or_create_collection(
            name="images",
            metadata={"hnsw:space": "cosine"}
        )
        self.text_collection = self.chroma_client.get_or_create_collection(
            name="texts"
        )
    def get_image_embedding(self, image: Image.Image) -> np.ndarray:
        """Convert an image to a CLIP embedding."""
        inputs = self.clip_processor(
            images=image,
            return_tensors="pt"
        )
        with torch.no_grad():
            features = self.clip_model.get_image_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def get_text_embedding(self, text: str) -> np.ndarray:
        """Convert text to a CLIP embedding."""
        inputs = self.clip_processor(
            text=[text],
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            features = self.clip_model.get_text_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def index_image(
        self,
        image: Image.Image,
        image_id: str,
        metadata: dict = None
    ):
        """Index a single image."""
        embedding = self.get_image_embedding(image)
        # Store the image itself as Base64 in the metadata
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        image_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
        doc_metadata = {"image_b64": image_b64}
        if metadata:
            doc_metadata.update(metadata)
        self.image_collection.add(
            embeddings=[embedding.tolist()],
            ids=[image_id],
            metadatas=[doc_metadata]
        )
    def search_images_by_text(
        self,
        query: str,
        n_results: int = 5
    ) -> list[dict]:
        """Search images with a text query."""
        query_embedding = self.get_text_embedding(query)
        # IDs are always returned, so they need not be listed in `include`
        results = self.image_collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["metadatas", "distances"]
        )
        retrieved = []
        for i in range(len(results['ids'][0])):
            metadata = results['metadatas'][0][i]
            image_b64 = metadata.pop('image_b64', None)
            image = None
            if image_b64:
                image_bytes = base64.b64decode(image_b64)
                image = Image.open(io.BytesIO(image_bytes))
            retrieved.append({
                "id": results['ids'][0][i],
                "distance": results['distances'][0][i],
                "metadata": metadata,
                "image": image
            })
        return retrieved
def multimodal_rag_query(
self,
question: str,
vision_model_fn, # GPT-4V, Claude Vision 등
n_image_results: int = 3
) -> str:
"""멀티모달 RAG 쿼리를 수행합니다."""
# 관련 이미지 검색
relevant_images = self.search_images_by_text(question, n_image_results)
if not relevant_images:
return vision_model_fn(question=question, images=[])
# 검색된 이미지로 응답 생성
retrieved_images = [r["image"] for r in relevant_images if r["image"]]
metadata_info = [
f"이미지 {i+1}: {r['metadata']}"
for i, r in enumerate(relevant_images)
]
enhanced_prompt = f"""
질문: {question}
관련 이미지 정보:
{chr(10).join(metadata_info)}
위의 이미지들을 참고하여 질문에 답변해주세요.
각 이미지의 관련 내용을 구체적으로 인용하세요.
"""
return vision_model_fn(question=enhanced_prompt, images=retrieved_images)
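위 search_images_by_text가 돌려주는 distance 값을 해석할 때 알아두면 좋은 성질이 있습니다. CLIP 임베딩을 위 코드처럼 L2 정규화했다면, 제곱 유클리드 거리와 코사인 유사도는 ‖a−b‖² = 2(1−cosθ) 관계로 정확히 맞물립니다. 순수 파이썬으로 확인해 보면 다음과 같습니다 (실제 반환값의 스케일은 ChromaDB 컬렉션의 거리 설정에 따라 달라질 수 있습니다):

```python
import math

def l2_distance(a, b):
    """두 벡터 사이의 L2(유클리드) 거리."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """단위 벡터를 가정한 코사인 유사도(= 내적)."""
    return sum(x * y for x, y in zip(a, b))

# 단위 벡터 두 개 (CLIP 임베딩은 위 코드에서 L2 정규화됨)
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]  # 60도 회전한 단위 벡터

cos = cosine_sim(a, b)            # cos(60°) = 0.5
dist_sq = l2_distance(a, b) ** 2  # 2 * (1 - 0.5) = 1.0

# 단위 벡터에서 제곱 거리와 코사인 유사도의 관계 확인
assert abs(dist_sq - 2 * (1 - cos)) < 1e-9
```

즉 정규화된 임베딩에서는 거리가 작을수록 코사인 유사도가 높으므로, 거리값을 그대로 랭킹에 사용해도 됩니다.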
ColPali: PDF 페이지 검색
# ColPali: 비전 언어 모델로 PDF 페이지 직접 검색
# pip install colpali-engine
from colpali_engine.models import ColPali, ColPaliProcessor
import torch
class ColPaliPDFSearch:
"""ColPali를 사용한 PDF 페이지 검색."""
def __init__(self, model_name: str = "vidore/colpali-v1.2"):
self.model = ColPali.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="cuda"
)
self.processor = ColPaliProcessor.from_pretrained(model_name)
def index_pdf_pages(
self,
page_images: list[Image.Image]
) -> torch.Tensor:
"""PDF 페이지 이미지들을 인덱싱합니다."""
all_embeddings = []
batch_size = 4
for i in range(0, len(page_images), batch_size):
batch = page_images[i:i + batch_size]
inputs = self.processor.process_images(batch)
inputs = {k: v.to("cuda") for k, v in inputs.items()}
with torch.no_grad():
embeddings = self.model(**inputs)
all_embeddings.append(embeddings)
return torch.cat(all_embeddings, dim=0)
def search(
self,
query: str,
page_embeddings: torch.Tensor,
top_k: int = 3
) -> list[int]:
"""쿼리로 관련 PDF 페이지를 검색합니다."""
# 쿼리 임베딩
query_inputs = self.processor.process_queries([query])
query_inputs = {k: v.to("cuda") for k, v in query_inputs.items()}
with torch.no_grad():
query_embedding = self.model(**query_inputs)
# MaxSim 스코어 계산 (ColPali의 핵심)
scores = self.processor.score_multi_vector(
query_embedding,
page_embeddings
)
# Top-K 페이지 인덱스 반환
top_indices = scores[0].argsort(descending=True)[:top_k]
return top_indices.tolist()
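score_multi_vector가 계산하는 MaxSim 점수의 동작 원리는 작은 순수 파이썬 예시로 이해할 수 있습니다. 쿼리의 각 토큰 임베딩마다 페이지 패치 임베딩들과의 내적 중 최댓값을 취해 모두 합산하는 방식입니다 (아래 수치는 설명을 위한 장난감 값입니다):

```python
def maxsim_score(query_tokens, page_patches):
    """MaxSim: 각 쿼리 토큰마다 가장 유사한 패치의 내적을 취해 합산."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(
        max(dot(q, p) for p in page_patches)
        for q in query_tokens
    )

# 장난감 예시: 쿼리 토큰 2개, 페이지 패치 3개 (2차원 임베딩 가정)
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.0, 0.8], [0.2, 0.2]]  # 두 토큰 모두와 잘 맞는 페이지
page_b = [[0.1, 0.1], [0.2, 0.0], [0.0, 0.3]]  # 전반적으로 약하게 맞는 페이지

score_a = maxsim_score(query, page_a)  # 0.9 + 0.8 = 1.7
score_b = maxsim_score(query, page_b)  # 0.2 + 0.3 = 0.5
assert score_a > score_b
```

이처럼 토큰 단위의 늦은 상호작용(late interaction) 덕분에, 쿼리의 각 단어가 페이지의 서로 다른 영역과 독립적으로 매칭될 수 있습니다.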
10. 오픈소스 멀티모달 모델
Phi-3 Vision (Microsoft)
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image
class Phi3VisionModel:
"""Microsoft Phi-3 Vision 모델."""
def __init__(self):
model_id = "microsoft/Phi-3-vision-128k-instruct"
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
_attn_implementation='flash_attention_2' # CUDA 필요
)
self.processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True
)
def analyze(self, image: Image.Image, prompt: str) -> str:
"""이미지를 분석합니다."""
messages = [
{"role": "user", "content": f"<|image_1|>\n{prompt}"}
]
prompt_text = self.processor.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = self.processor(
prompt_text,
[image],
return_tensors="pt"
).to("cuda")
with torch.no_grad():
output = self.model.generate(
**inputs,
max_new_tokens=500,
eos_token_id=self.processor.tokenizer.eos_token_id
)
generated = output[0][inputs['input_ids'].shape[1]:]
return self.processor.decode(generated, skip_special_tokens=True)
Qwen-VL (Alibaba)
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
class QwenVLModel:
"""Qwen2-VL 멀티모달 모델."""
def __init__(self, model_name: str = "Qwen/Qwen2-VL-7B-Instruct"):
self.model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto"
)
self.processor = AutoProcessor.from_pretrained(
model_name,
min_pixels=256*28*28,
max_pixels=1280*28*28
)
def analyze_image(
self,
image_path: str,
question: str
) -> str:
"""이미지를 분석합니다."""
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": image_path
},
{
"type": "text",
"text": question
}
]
}
]
text = self.processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt"
).to("cuda")
with torch.no_grad():
output_ids = self.model.generate(**inputs, max_new_tokens=512)
        # 입력 프롬프트 토큰을 제외하고 새로 생성된 토큰만 남김
        generated_ids = [
            out_ids[len(in_ids):]
            for in_ids, out_ids in zip(inputs.input_ids, output_ids)
        ]
return self.processor.batch_decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)[0]
로컬 실행 가이드 (Ollama)
# Ollama로 멀티모달 모델 로컬 실행
# ollama.ai에서 Ollama 설치
# LLaVA 모델 다운로드 및 실행
ollama pull llava:13b
# 이미지와 함께 모델 실행
ollama run llava:13b
import ollama
from pathlib import Path
class OllamaVisionModel:
"""Ollama를 사용한 로컬 비전 모델."""
def __init__(self, model: str = "llava:13b"):
self.model = model
def analyze(
self,
image_path: str,
prompt: str
) -> str:
"""로컬 모델로 이미지를 분석합니다."""
response = ollama.chat(
model=self.model,
messages=[
{
"role": "user",
"content": prompt,
"images": [image_path]
}
]
)
return response["message"]["content"]
def batch_analyze(
self,
image_paths: list[str],
prompt: str
) -> list[str]:
"""여러 이미지를 순차 분석합니다."""
results = []
for path in image_paths:
result = self.analyze(path, prompt)
results.append(result)
return results
# 사용 예시
model = OllamaVisionModel("llava:13b")
result = model.analyze(
"/path/to/image.jpg",
"이 이미지에 무엇이 있나요? 자세히 설명해주세요."
)
print(result)
11. 비디오 이해 AI
비디오 이해의 과제
비디오 이해는 시간적 정보를 포함하는 멀티모달 태스크로, 정적 이미지 이해보다 훨씬 복잡합니다.
주요 과제:
- 시간적 의존성: 프레임 간의 시간적 관계 이해
- 대용량 데이터: 1분 비디오 = 약 1800 프레임 (30fps)
- 동작 인식: 움직임 패턴 파악
- 다중 스케일: 짧은 동작과 긴 이벤트 동시 이해
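대용량 문제를 수치로 체감해 보면, 30fps 기준 1분 비디오는 1,800프레임이므로 모델에는 보통 16프레임 정도만 균일하게 샘플링해 입력합니다. np.linspace 기반 균일 샘플링과 같은 로직을 순수 파이썬으로 옮기면 다음과 같습니다:

```python
# 30fps 기준 1분 비디오의 프레임 수와 균일 샘플링 인덱스 계산
fps = 30
duration_sec = 60
total_frames = fps * duration_sec  # 1800 프레임

def uniform_sample_indices(total, num_samples):
    """np.linspace(0, total-1, num_samples)와 같은 균일 샘플링 (순수 파이썬)."""
    if num_samples == 1:
        return [0]
    step = (total - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

indices = uniform_sample_indices(total_frames, 16)
assert len(indices) == 16
assert indices[0] == 0 and indices[-1] == total_frames - 1
```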
VideoMAE를 활용한 비디오 특징 추출
from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np
class VideoFeatureExtractor:
"""VideoMAE를 사용한 비디오 특징 추출기."""
def __init__(self, model_name: str = "MCG-NJU/videomae-base"):
self.processor = VideoMAEImageProcessor.from_pretrained(model_name)
self.model = VideoMAEModel.from_pretrained(model_name)
def extract_video_features(
self,
video_frames: list, # PIL 이미지 또는 numpy 배열 목록
num_frames: int = 16 # VideoMAE는 보통 16프레임 사용
) -> torch.Tensor:
"""비디오 프레임에서 특징을 추출합니다."""
# 균일하게 프레임 샘플링
total_frames = len(video_frames)
indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
sampled_frames = [video_frames[i] for i in indices]
# 전처리
inputs = self.processor(sampled_frames, return_tensors="pt")
with torch.no_grad():
outputs = self.model(**inputs)
# [batch, num_patches, hidden_size] 형태
return outputs.last_hidden_state
# OpenCV로 비디오 프레임 추출
import cv2
from PIL import Image

def extract_frames_from_video(
    video_path: str,
    target_fps: int = 1
) -> list:
    """비디오에서 프레임을 추출합니다."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # fps를 읽지 못했거나(0) target_fps보다 낮은 경우 0으로 나누기 방지
    frame_interval = max(1, int(fps / target_fps)) if fps > 0 else 1
    frames = []
    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            # BGR을 RGB로 변환
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))
        frame_count += 1
    cap.release()
    return frames
Gemini로 긴 비디오 이해
import google.generativeai as genai
import time
class LongVideoUnderstanding:
"""Gemini 1.5 Pro를 사용한 긴 비디오 이해 시스템."""
def __init__(self):
self.model = genai.GenerativeModel("gemini-1.5-pro")
def analyze_long_video(
self,
video_path: str,
analysis_tasks: list[str]
) -> dict:
"""최대 1시간 길이의 비디오를 분석합니다."""
print("비디오 업로드 중...")
video_file = genai.upload_file(
path=video_path,
display_name="long_video_analysis"
)
# 처리 완료 대기
while video_file.state.name == "PROCESSING":
print(f"처리 중... (상태: {video_file.state.name})")
time.sleep(15)
video_file = genai.get_file(video_file.name)
if video_file.state.name != "ACTIVE":
raise RuntimeError(f"비디오 처리 실패: {video_file.state.name}")
print(f"업로드 완료 (URI: {video_file.uri})")
results = {}
for task in analysis_tasks:
print(f"분석 중: {task}")
response = self.model.generate_content(
[video_file, task],
request_options={"timeout": 900}
)
results[task] = response.text
# 업로드된 파일 정리
genai.delete_file(video_file.name)
print("파일 정리 완료")
return results
def create_video_summary(self, video_path: str) -> dict:
"""비디오의 종합 요약을 생성합니다."""
tasks = [
"비디오의 전체 내용을 3-5문장으로 요약해주세요.",
"주요 장면을 타임스탬프와 함께 나열해주세요. 형식: MM:SS - 설명",
"비디오에 등장하는 주요 인물, 사물, 장소를 목록으로 나열해주세요.",
"비디오에서 강조된 핵심 메시지나 결론은 무엇인가요?",
"이 비디오의 대상 시청자와 목적은 무엇인가요?"
]
return self.analyze_long_video(video_path, tasks)
# 비디오 이해 시스템 활용
video_analyzer = LongVideoUnderstanding()
summary = video_analyzer.create_video_summary("lecture.mp4")
for task, result in summary.items():
print(f"\n{'='*50}")
print(f"질문: {task}")
print(f"답변: {result}")
마치며
멀티모달 AI는 빠르게 발전하고 있으며, 텍스트, 이미지, 비디오를 통합적으로 이해하는 능력이 점점 더 강력해지고 있습니다.
이 가이드에서 다룬 핵심 내용:
- CLIP: 대조 학습으로 이미지-텍스트를 동일 공간에 매핑, 제로샷 분류의 기반
- BLIP/BLIP-2: 부트스트래핑과 Q-Former로 효율적인 멀티모달 학습
- LLaVA: 오픈소스 비전-언어 어시스턴트의 표준
- GPT-4V / Claude Vision: 상업용 최고 성능 멀티모달 LLM
- Gemini 1.5: 100만 토큰 컨텍스트로 긴 비디오와 문서 처리
- 멀티모달 RAG: CLIP 임베딩으로 이미지를 검색 가능한 지식베이스로 구축
- 오픈소스 생태계: Phi-3 Vision, Qwen-VL 등 로컬 실행 가능한 강력한 모델들
앞으로의 방향은 더 긴 비디오 이해, 3D 공간 이해, 실시간 멀티모달 처리로 나아가고 있습니다. 이 분야는 매우 빠르게 발전하므로 지속적인 학습이 필요합니다.
참고 자료
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- Li, J. et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597
- Liu, H. et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485
- OpenAI GPT-4 Technical Report: openai.com/research/gpt-4
- Google Gemini API: ai.google.dev/gemini-api
- Anthropic Claude API: docs.anthropic.com
- HuggingFace LLaVA: huggingface.co/llava-hf
- Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449
Multimodal AI Complete Guide: Master CLIP, LLaVA, GPT-4V, and Gemini Vision
Table of Contents
- Multimodal AI Overview
- CLIP In-Depth
- The BLIP Family
- LLaVA: Large Language and Vision Assistant
- InstructBLIP
- GPT-4 Vision
- Gemini Vision
- Claude Vision
- Multimodal RAG
- Open-Source Multimodal Models
- Video Understanding AI
1. Multimodal AI Overview
Limitations of Single-Modality Systems
Traditional AI systems could only process one form of data (modality) at a time — text, images, or audio. This single-modality approach has fundamental limitations when solving complex real-world problems.
Limitations of text-only models:
- Cannot analyze images when asked for image descriptions
- Cannot understand charts, graphs, and screenshots
- Cannot make decisions that require visual context
Limitations of image-only models:
- Cannot holistically understand text and visual elements together within images
- Cannot perform language-based question answering
- Cannot search using natural language descriptions
The Potential of Multimodal AI
Multimodal AI systems can process and understand multiple forms of data simultaneously — mimicking the natural human cognitive ability to "see, hear, and read while understanding all at once."
Key application areas:
- Medical Diagnosis: Integrated analysis of medical imaging and patient record text
- Autonomous Driving: Integration of cameras, LiDAR, and map data
- Education: Automatic generation of explanations from textbook images
- E-Commerce: Integrated processing of product photos, descriptions, and reviews
- Document Understanding: Scanned document OCR and content analysis
- Creative Applications: Image generation from text descriptions (DALL-E, Stable Diffusion)
History of Vision-Language Models
2021: CLIP (OpenAI) - linking images and text via contrastive learning
2022: BLIP - unified image captioning and VQA
2023: BLIP-2 - efficient multimodal learning via Q-Former
2023: LLaVA - open-source vision-language assistant
2023: GPT-4V - commercial multimodal LLM
2023: Gemini - Google's multimodal foundation model
2024: Claude 3 Vision - Anthropic's multimodal model
2024: LLaVA-1.6, InternVL2, Qwen-VL2 - open-source improvements
2025: Expansion to video understanding and 3D understanding
2. CLIP In-Depth
The Core Idea Behind CLIP
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, was trained on 400 million image-text pairs using contrastive learning, mapping images and text into a shared embedding space.
Key innovation: By training only on image-caption pairs collected from the internet — without any manual labels — CLIP acquired powerful zero-shot classification capabilities.
CLIP Architecture
Image → [Image Encoder (ViT/ResNet)] → Image Embedding (512-dim)
↕ Similarity
Text → [Text Encoder (Transformer)] → Text Embedding (512-dim)
Contrastive Learning Mechanism:
For N image-text pairs in a batch:
- Maximize the similarity of correct pairs (diagonal)
- Minimize the similarity of incorrect pairs (off-diagonal)
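The symmetric contrastive objective described above can be sketched in plain Python for a toy batch of N=2 pairs (the similarity values below are illustrative, not taken from a real model):

```python
import math

def softmax_cross_entropy(logits, target_idx):
    """Cross-entropy of a softmax over logits against the target index."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target_idx] / sum(exps))

# Toy 2x2 similarity matrix: rows = images, cols = texts.
# Diagonal entries (correct pairs) are higher than off-diagonal ones.
sim = [[0.9, 0.1],
       [0.2, 0.8]]

# Image-to-text loss: each image row should predict its own text (diagonal).
loss_i = sum(softmax_cross_entropy(sim[i], i) for i in range(2)) / 2
# Text-to-image loss: each text column should predict its own image.
loss_t = sum(
    softmax_cross_entropy([sim[i][j] for i in range(2)], j) for j in range(2)
) / 2
loss = (loss_i + loss_t) / 2
assert loss < math.log(2)  # better than a uniform (random) prediction
```

Training pushes the diagonal up and the off-diagonal down, which drives this symmetric loss toward zero.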
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
def clip_zero_shot_classification(
image: Image.Image,
candidate_labels: list[str]
) -> dict[str, float]:
"""Zero-shot image classification using CLIP."""
# Generate text prompts (CLIP's recommended format)
text_prompts = [f"a photo of a {label}" for label in candidate_labels]
# Preprocessing
inputs = processor(
text=text_prompts,
images=image,
return_tensors="pt",
padding=True
)
# Inference
with torch.no_grad():
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
return {
label: prob.item()
for label, prob in zip(candidate_labels, probs[0])
}
# Usage example
image_url = "https://example.com/sample_image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
labels = ["cat", "dog", "bird", "fish", "rabbit"]
results = clip_zero_shot_classification(image, labels)
# Sort by probability
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for label, prob in sorted_results:
print(f"{label}: {prob:.4f} ({prob*100:.1f}%)")
Image-Text Retrieval
import numpy as np
from typing import Union
class CLIPSearchEngine:
"""CLIP-based image-text search engine."""
def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
self.model = CLIPModel.from_pretrained(model_name)
self.processor = CLIPProcessor.from_pretrained(model_name)
self.image_embeddings = []
self.image_metadata = []
def encode_images(self, images: list[Image.Image]) -> torch.Tensor:
"""Convert a batch of images to embeddings."""
inputs = self.processor(
images=images,
return_tensors="pt",
padding=True
)
with torch.no_grad():
image_features = self.model.get_image_features(**inputs)
image_features = F.normalize(image_features, p=2, dim=-1)
return image_features
def encode_texts(self, texts: list[str]) -> torch.Tensor:
"""Convert a batch of texts to embeddings."""
inputs = self.processor(
text=texts,
return_tensors="pt",
padding=True,
truncation=True
)
with torch.no_grad():
text_features = self.model.get_text_features(**inputs)
text_features = F.normalize(text_features, p=2, dim=-1)
return text_features
def index_images(
self,
images: list[Image.Image],
metadata: list[dict] = None
):
"""Index images for retrieval."""
embeddings = self.encode_images(images)
self.image_embeddings.append(embeddings)
if metadata:
self.image_metadata.extend(metadata)
def text_to_image_search(
self,
query: str,
top_k: int = 5
) -> list[dict]:
"""Search for images using a text query."""
if not self.image_embeddings:
return []
all_embeddings = torch.cat(self.image_embeddings, dim=0)
query_embedding = self.encode_texts([query])
# Cosine similarity (already L2-normalized)
similarities = (all_embeddings @ query_embedding.T).squeeze(-1)
top_indices = similarities.argsort(descending=True)[:top_k]
results = []
for idx in top_indices:
idx = idx.item()
result = {
"index": idx,
"similarity": similarities[idx].item()
}
if self.image_metadata:
result.update(self.image_metadata[idx])
results.append(result)
return results
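The matrix product `all_embeddings @ query_embedding.T` above acts as cosine similarity only because both sides were L2-normalized beforehand. A tiny plain-Python check of that identity:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors, the dot product equals the cosine of the angle.
cos_from_dot = dot(a, b)
cos_direct = (3 * 4 + 4 * 3) / (5.0 * 5.0)  # <a,b> / (|a||b|) = 24/25
assert abs(cos_from_dot - cos_direct) < 1e-9
```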
OpenCLIP (Open-Source CLIP)
# OpenCLIP: supports various architectures and training datasets
# pip install open_clip_torch
import open_clip
import torch
from PIL import Image
# List available models
available_models = open_clip.list_pretrained()
print("Available models:", available_models[:5])
# Load a large model trained on LAION-2B
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
'ViT-H-14',
pretrained='laion2b_s32b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-H-14')
def compute_clip_similarity(
image: Image.Image,
texts: list[str]
) -> list[float]:
"""Compute CLIP similarity between an image and a list of texts."""
model.eval()
image_input = preprocess_val(image).unsqueeze(0)
text_input = tokenizer(texts)
with torch.no_grad(), torch.cuda.amp.autocast():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_input)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
return similarity[0].tolist()
Fine-Tuning CLIP
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import CLIPModel, CLIPProcessor
import torch.optim as optim
class ImageTextDataset(Dataset):
"""Image-text pair dataset."""
def __init__(
self,
image_paths: list[str],
texts: list[str],
processor: CLIPProcessor
):
self.image_paths = image_paths
self.texts = texts
self.processor = processor
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
image = Image.open(self.image_paths[idx]).convert("RGB")
text = self.texts[idx]
inputs = self.processor(
images=image,
text=text,
return_tensors="pt",
padding="max_length",
max_length=77,
truncation=True
)
return {
"pixel_values": inputs["pixel_values"].squeeze(0),
"input_ids": inputs["input_ids"].squeeze(0),
"attention_mask": inputs["attention_mask"].squeeze(0)
}
class CLIPFineTuner:
"""CLIP model fine-tuning class."""
def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
self.model = CLIPModel.from_pretrained(model_name)
self.processor = CLIPProcessor.from_pretrained(model_name)
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model.to(self.device)
def contrastive_loss(
self,
image_features: torch.Tensor,
text_features: torch.Tensor,
temperature: float = 0.07
) -> torch.Tensor:
"""Contrastive learning loss function."""
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits = torch.matmul(image_features, text_features.T) / temperature
labels = torch.arange(len(logits)).to(self.device)
loss_i = F.cross_entropy(logits, labels)
loss_t = F.cross_entropy(logits.T, labels)
return (loss_i + loss_t) / 2
def train(
self,
train_dataset: ImageTextDataset,
num_epochs: int = 10,
batch_size: int = 32,
learning_rate: float = 1e-5
):
"""Fine-tune CLIP."""
dataloader = DataLoader(
train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=4
)
optimizer = optim.AdamW(
self.model.parameters(),
lr=learning_rate,
weight_decay=0.01
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=num_epochs
)
for epoch in range(num_epochs):
total_loss = 0
self.model.train()
for batch in dataloader:
pixel_values = batch["pixel_values"].to(self.device)
input_ids = batch["input_ids"].to(self.device)
attention_mask = batch["attention_mask"].to(self.device)
outputs = self.model(
pixel_values=pixel_values,
input_ids=input_ids,
attention_mask=attention_mask
)
image_features = outputs.image_embeds
text_features = outputs.text_embeds
loss = self.contrastive_loss(image_features, text_features)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
scheduler.step()
avg_loss = total_loss / len(dataloader)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
3. The BLIP Family
BLIP: Bootstrapping Language-Image Pre-training
BLIP, published by Salesforce Research in 2022, excels at diverse vision-language tasks including image captioning, image-text retrieval, and visual question answering (VQA).
Key innovation: Data bootstrapping using a Captioner and a Filter to clean noisy image-text pairs collected from the web.
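The CapFilt bootstrapping loop can be sketched with stub functions; both `captioner` and `filter_keep` below are hypothetical stand-ins for the trained captioner and filter modules:

```python
def captioner(image_id: str) -> str:
    # Hypothetical stand-in for the trained captioner: makes a synthetic caption.
    return f"a synthetic caption for {image_id}"

def filter_keep(image_id: str, caption: str) -> bool:
    # Hypothetical stand-in for the trained filter: keeps image-matched captions.
    return "noisy" not in caption

def capfilt(web_pairs):
    """Bootstrap a cleaner dataset from noisy web image-text pairs."""
    cleaned = []
    for image_id, web_caption in web_pairs:
        # Keep the original web caption only if the filter accepts it...
        if filter_keep(image_id, web_caption):
            cleaned.append((image_id, web_caption))
        # ...and also add a synthetic caption, again subject to filtering.
        synthetic = captioner(image_id)
        if filter_keep(image_id, synthetic):
            cleaned.append((image_id, synthetic))
    return cleaned

pairs = [("img1", "a dog in a park"), ("img2", "noisy unrelated alt-text")]
dataset = capfilt(pairs)
assert ("img1", "a dog in a park") in dataset
assert ("img2", "noisy unrelated alt-text") not in dataset
```

The real CapFilt modules are neural models fine-tuned on human-annotated captions; the loop structure, not the stub logic, is the point here.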
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import BlipForQuestionAnswering
from PIL import Image
import torch
# BLIP Image Captioning
class BLIPCaptioner:
def __init__(self):
self.processor = BlipProcessor.from_pretrained(
"Salesforce/blip-image-captioning-large"
)
self.model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-large",
torch_dtype=torch.float16
)
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
def caption(
self,
image: Image.Image,
conditional_text: str = None,
max_new_tokens: int = 50
) -> str:
"""Generate a caption for an image."""
if conditional_text:
inputs = self.processor(
image,
conditional_text,
return_tensors="pt"
).to(self.device, torch.float16)
else:
inputs = self.processor(
image,
return_tensors="pt"
).to(self.device, torch.float16)
with torch.no_grad():
output = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
num_beams=4,
early_stopping=True
)
return self.processor.decode(output[0], skip_special_tokens=True)
# BLIP VQA (Visual Question Answering)
class BLIPVisualQA:
def __init__(self):
self.processor = BlipProcessor.from_pretrained(
"Salesforce/blip-vqa-base"
)
self.model = BlipForQuestionAnswering.from_pretrained(
"Salesforce/blip-vqa-base"
)
def answer(self, image: Image.Image, question: str) -> str:
"""Answer a question about an image."""
inputs = self.processor(image, question, return_tensors="pt")
with torch.no_grad():
output = self.model.generate(**inputs, max_new_tokens=50)
return self.processor.decode(output[0], skip_special_tokens=True)
# Usage
captioner = BLIPCaptioner()
vqa = BLIPVisualQA()
image = Image.open("sample.jpg")
caption = captioner.caption(image)
print(f"Caption: {caption}")
cond_caption = captioner.caption(image, "a photo of")
print(f"Conditional caption: {cond_caption}")
answer = vqa.answer(image, "What color is the sky?")
print(f"Answer: {answer}")
BLIP-2: Querying Transformer
BLIP-2, published in 2023, introduces the Q-Former (Querying Transformer) to efficiently bridge a frozen image encoder with a frozen LLM.
Q-Former's role:
- Extracts the most important visual features from the image encoder's output
- 32 learnable query tokens interact with image features to create a compressed representation
- This compressed representation is passed as input to the LLM
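To see why this compression matters for efficiency, compare the token counts fed to the LLM. The patch-grid numbers below assume a ViT-style encoder at 224x224 with 14x14 patches and are illustrative (the actual encoder also emits a [CLS] token):

```python
# A ViT-style encoder at 224x224 with 14x14 patches emits 256 patch tokens;
# the Q-Former compresses all of them into just 32 learned query outputs.
image_size = 224
patch_size = 14
patch_tokens = (image_size // patch_size) ** 2  # 16 * 16 = 256
num_query_tokens = 32

compression = patch_tokens / num_query_tokens  # 8x fewer visual tokens
assert patch_tokens == 256
assert compression == 8.0
```

Because the LLM only ever sees 32 visual tokens regardless of image detail, both the image encoder and the LLM can stay frozen while only the Q-Former trains.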
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
class BLIP2Assistant:
"""BLIP-2 based visual question answering assistant."""
def __init__(
self,
model_name: str = "Salesforce/blip2-opt-2.7b"
):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def generate_response(
self,
image: Image.Image,
prompt: str = None,
max_new_tokens: int = 200,
temperature: float = 1.0
) -> str:
"""Generate a response for an image and optional prompt."""
if prompt:
inputs = self.processor(
images=image,
text=prompt,
return_tensors="pt"
).to("cuda", torch.float16)
else:
inputs = self.processor(
images=image,
return_tensors="pt"
).to("cuda", torch.float16)
with torch.no_grad():
generated_ids = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=temperature > 0
)
generated_text = self.processor.batch_decode(
generated_ids,
skip_special_tokens=True
)[0].strip()
return generated_text
def batch_caption(
self,
images: list[Image.Image],
batch_size: int = 8
) -> list[str]:
"""Generate captions for a batch of images."""
all_captions = []
for i in range(0, len(images), batch_size):
batch = images[i:i + batch_size]
inputs = self.processor(
images=batch,
return_tensors="pt",
padding=True
).to("cuda", torch.float16)
with torch.no_grad():
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50
)
captions = self.processor.batch_decode(
generated_ids,
skip_special_tokens=True
)
all_captions.extend([c.strip() for c in captions])
return all_captions
# Usage
assistant = BLIP2Assistant("Salesforce/blip2-flan-t5-xxl")
image = Image.open("document.png")
response = assistant.generate_response(
image,
"Question: What is the main topic of this document? Answer:"
)
print(response)
4. LLaVA: Large Language and Vision Assistant
LLaVA Architecture
LLaVA, published in 2023, is an open-source vision-language model that connects a powerful LLM (LLaMA, Vicuna) with a CLIP vision encoder to build a multimodal chatbot with instruction-following capabilities.
Architecture components:
Image → [CLIP ViT-L/14] → Image features (1024-dim)
↓
[Linear Projection Layer]
↓
[Visual Tokens]
↓
[LLM (LLaMA/Vicuna)] ← [Text Tokens]
↓
Final Response
LLaVA-1.5 improvements:
- MLP projection layer (linear → 2-layer MLP)
- High-resolution image support
- More training data
LLaVA-1.6 (LLaVA-NeXT) improvements:
- Dynamic High Resolution: up to 672x672 → 4x more visual tokens
- Improved reasoning and OCR capabilities
- Support for various aspect ratios
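A rough sense of the token cost of dynamic high resolution, assuming a CLIP ViT tower with 14x14 patches and a 336x336 base tile as in LLaVA-1.5/1.6 (exact counts vary by configuration):

```python
def visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image with a given patch size."""
    return (image_size // patch_size) ** 2

base = visual_tokens(336)      # one 336x336 tile -> 24*24 = 576 tokens
high_res = visual_tokens(672)  # 672x672 -> 48*48 = 2304 tokens
assert base == 576
assert high_res == 4 * base    # the "4x more visual tokens" mentioned above
```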
Using LLaVA with HuggingFace
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
class LLaVAAssistant:
"""LLaVA-1.6 based visual assistant."""
def __init__(
self,
model_name: str = "llava-hf/llava-v1.6-mistral-7b-hf"
):
self.processor = LlavaNextProcessor.from_pretrained(model_name)
self.model = LlavaNextForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto"
)
def chat(
self,
image: Image.Image,
message: str,
max_new_tokens: int = 500,
temperature: float = 0.7
) -> str:
"""Chat with the model about an image."""
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": message}
]
}
]
prompt = self.processor.apply_chat_template(
conversation,
add_generation_prompt=True
)
inputs = self.processor(
prompt,
image,
return_tensors="pt"
).to("cuda")
with torch.no_grad():
output = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=temperature > 0,
pad_token_id=self.processor.tokenizer.eos_token_id
)
generated = output[0][inputs["input_ids"].shape[1]:]
return self.processor.decode(generated, skip_special_tokens=True)
def analyze_chart(self, chart_image: Image.Image) -> dict:
"""Analyze a chart image."""
analysis_prompts = [
"What is the title of this chart?",
"What do the x-axis and y-axis represent?",
"What are the highest and lowest values?",
"Describe the overall trend.",
"What is the most important insight from this data?"
]
results = {}
for prompt in analysis_prompts:
response = self.chat(chart_image, prompt)
results[prompt] = response
return results
def extract_text_from_image(self, image: Image.Image) -> str:
"""Extract text from an image (OCR)."""
return self.chat(
image,
"Please accurately extract all the text in this image. "
"Return only the text without any additional explanations."
)
# Production use case: Document analysis pipeline
class DocumentAnalysisPipeline:
"""Document analysis pipeline using LLaVA."""
def __init__(self):
self.llava = LLaVAAssistant()
def analyze_document(self, document_image: Image.Image) -> dict:
"""Comprehensively analyze a document image."""
doc_type = self.llava.chat(
document_image,
"What type of document is this? (invoice, contract, report, form, etc.)"
)
extracted_text = self.llava.extract_text_from_image(document_image)
key_info = self.llava.chat(
document_image,
f"Extract the following information from this {doc_type} in JSON format: "
"date, sender, recipient, amount (if applicable), main content summary"
)
action_items = self.llava.chat(
document_image,
"If there are any required actions in this document, please list them."
)
return {
"document_type": doc_type,
"extracted_text": extracted_text,
"key_information": key_info,
"action_items": action_items
}
5. InstructBLIP
The Core of InstructBLIP
InstructBLIP builds on BLIP-2 by applying instruction tuning to follow diverse instructions. The key is that the Q-Former is made instruction-aware, extracting visual features relevant to the specific instruction given.
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
class InstructBLIPAssistant:
"""Instruction-following assistant based on InstructBLIP."""
def __init__(self, model_name: str = "Salesforce/instructblip-vicuna-7b"):
self.processor = InstructBlipProcessor.from_pretrained(model_name)
self.model = InstructBlipForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def instruct(
self,
image: Image.Image,
instruction: str,
max_new_tokens: int = 300
) -> str:
"""Follow a specific instruction about an image."""
inputs = self.processor(
images=image,
text=instruction,
return_tensors="pt"
).to("cuda", torch.float16)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
do_sample=False,
num_beams=5,
max_new_tokens=max_new_tokens,
min_length=1,
top_p=0.9,
repetition_penalty=1.5,
length_penalty=1.0,
temperature=1.0
)
generated_text = self.processor.batch_decode(
outputs,
skip_special_tokens=True
)[0].strip()
return generated_text
# Diverse usage examples
assistant = InstructBLIPAssistant()
image = Image.open("complex_diagram.png")
description = assistant.instruct(
image,
"Please explain this diagram in detail, including the role of each component and its connections."
)
objects = assistant.instruct(
image,
"List all the objects found in this image and describe the location of each."
)
emotion = assistant.instruct(
image,
"Analyze the emotional state of the people in this image and explain your reasoning."
)
6. GPT-4 Vision
GPT-4V API Usage
GPT-4 Vision adds visual capabilities to OpenAI's GPT-4, making it one of the most powerful commercial multimodal LLMs available.
import openai
import base64
from pathlib import Path
import httpx
client = openai.OpenAI()
def encode_image_to_base64(image_path: str) -> str:
"""Encode an image file to Base64."""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
class GPT4VisionAnalyzer:
"""Image analyzer based on GPT-4 Vision."""
def __init__(self, model: str = "gpt-4o"):
self.client = openai.OpenAI()
self.model = model
def analyze_image(
self,
image_source: str,
prompt: str,
is_url: bool = True,
detail: str = "high",
max_tokens: int = 1000
) -> str:
"""Analyze a single image."""
if is_url:
image_content = {
"type": "image_url",
"image_url": {
"url": image_source,
"detail": detail
}
}
else:
base64_image = encode_image_to_base64(image_source)
ext = Path(image_source).suffix.lower()
media_type_map = {
".jpg": "image/jpeg",
".jpeg": "image/jpeg",
".png": "image/png",
".gif": "image/gif",
".webp": "image/webp"
}
media_type = media_type_map.get(ext, "image/jpeg")
image_content = {
"type": "image_url",
"image_url": {
"url": f"data:{media_type};base64,{base64_image}",
"detail": detail
}
}
response = self.client.chat.completions.create(
model=self.model,
messages=[
{
"role": "user",
"content": [
image_content,
{"type": "text", "text": prompt}
]
}
],
max_tokens=max_tokens
)
return response.choices[0].message.content
def analyze_multiple_images(
self,
image_sources: list[dict],
prompt: str,
max_tokens: int = 2000
) -> str:
"""Analyze multiple images simultaneously."""
content = []
for img_info in image_sources:
source = img_info["source"]
is_url = img_info.get("is_url", True)
if is_url:
content.append({
"type": "image_url",
"image_url": {"url": source, "detail": "high"}
})
else:
base64_image = encode_image_to_base64(source)
content.append({
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}"
}
})
content.append({"type": "text", "text": prompt})
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": content}],
max_tokens=max_tokens
)
return response.choices[0].message.content
def analyze_chart_or_graph(self, image_source: str, is_url: bool = True) -> dict:
"""Analyze a chart or graph and return structured data."""
prompt = """Analyze this chart/graph and return the following JSON:
{
"chart_type": "bar/line/pie/scatter/etc.",
"title": "chart title",
"x_axis": {"label": "X-axis label", "unit": "unit"},
"y_axis": {"label": "Y-axis label", "unit": "unit"},
"data_series": [{"name": "series name", "trend": "rising/falling/stable"}],
"key_findings": ["finding1", "finding2"],
"data_range": {"min": 0, "max": 0},
"anomalies": ["anomaly description"]
}"""
response = self.analyze_image(
image_source,
prompt,
is_url=is_url,
detail="high",
max_tokens=1500
)
import json
try:
start = response.find('{')
end = response.rfind('}') + 1
if start >= 0 and end > start:
return json.loads(response[start:end])
except json.JSONDecodeError:
pass
return {"raw_response": response}
# Production use case: e-commerce product analysis
def analyze_product_images(image_urls: list[str]) -> str:
"""Analyze multiple product images."""
analyzer = GPT4VisionAnalyzer()
image_sources = [{"source": url, "is_url": True} for url in image_urls]
return analyzer.analyze_multiple_images(
image_sources,
prompt="""Analyze these product images and return the following JSON:
{
"product_name": "estimated product name",
"category": "product category",
"color_options": ["color list"],
"key_features": ["key features"],
"condition": "new/used/etc.",
"quality_score": 0-10,
"marketing_description": "marketing copy (100 chars)",
"seo_keywords": ["SEO keywords"]
}"""
)
7. Gemini Vision
Gemini's Multimodal Capabilities
Google's Gemini was designed from the ground up with multimodality in mind. Gemini 1.5 Pro in particular, with its 1M token context window, can process long-duration video, lengthy documents, and large numbers of images.
import google.generativeai as genai
import PIL.Image
from pathlib import Path
import base64
genai.configure(api_key="YOUR_GEMINI_API_KEY")
class GeminiVisionAnalyzer:
"""Gemini Vision-based analyzer."""
def __init__(self, model_name: str = "gemini-1.5-pro"):
self.model = genai.GenerativeModel(model_name)
self.flash_model = genai.GenerativeModel("gemini-1.5-flash")
def analyze_image(self, image_path: str, prompt: str) -> str:
"""Analyze an image."""
image = PIL.Image.open(image_path)
response = self.model.generate_content([prompt, image])
return response.text
def analyze_with_url(self, image_url: str, prompt: str) -> str:
"""Analyze an image from a URL."""
import httpx
image_data = httpx.get(image_url).content
        image_part = {
            "mime_type": "image/jpeg",
            "data": image_data  # the SDK accepts raw image bytes here
        }
response = self.model.generate_content([
{"text": prompt},
image_part
])
return response.text
def analyze_video(
self,
video_path: str,
questions: list[str]
) -> dict:
"""Analyze a video (Gemini 1.5 Pro's strength)."""
print(f"Uploading video: {video_path}")
video_file = genai.upload_file(
path=video_path,
display_name="analysis_video"
)
import time
while video_file.state.name == "PROCESSING":
print("Processing...")
time.sleep(10)
video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
raise ValueError("Video processing failed")
print(f"Upload complete: {video_file.uri}")
results = {}
for question in questions:
response = self.model.generate_content(
[video_file, question],
request_options={"timeout": 600}
)
results[question] = response.text
genai.delete_file(video_file.name)
return results
def analyze_multiple_images_interleaved(
self,
image_text_pairs: list[dict]
) -> str:
"""Handle compound queries with interleaved images and text."""
content = []
for pair in image_text_pairs:
if "text" in pair:
content.append(pair["text"])
if "image" in pair:
content.append(pair["image"])
response = self.model.generate_content(content)
return response.text
def process_document_batch(
self,
document_images: list[PIL.Image.Image],
extraction_schema: str
) -> list[dict]:
"""Process multiple documents in a single batch (leveraging Gemini's long context)."""
import json
content = [f"Please analyze the following {len(document_images)} documents:\n"]
for i, img in enumerate(document_images, 1):
content.append(f"\n--- Document {i} ---")
content.append(img)
content.append(f"\nExtract data from each document using this JSON schema:\n{extraction_schema}")
response = self.model.generate_content(content)
        try:
            text = response.text
            start = text.find('[')
            end = text.rfind(']') + 1
            if start >= 0 and end > start:
                return json.loads(text[start:end])
        except json.JSONDecodeError:
            pass
        return [{"raw_response": response.text}]
# Usage: video analysis
analyzer = GeminiVisionAnalyzer()
video_questions = [
"Please summarize the overall content of this video.",
"List the major scenes with their timestamps.",
"What are the main keywords or concepts mentioned in this video?",
"What is the topic and purpose of this video?"
]
results = analyzer.analyze_video("lecture_video.mp4", video_questions)
for question, answer in results.items():
print(f"\nQuestion: {question}")
print(f"Answer: {answer}")
8. Claude Vision
Claude Vision API
Anthropic's Claude 3.5 Sonnet offers powerful vision capabilities, excelling especially at document understanding, code screenshot analysis, and detailed image interpretation.
import anthropic
import base64
import httpx
from pathlib import Path
client = anthropic.Anthropic()
class ClaudeVisionAnalyzer:
"""Image analyzer based on Claude Vision."""
def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
self.client = anthropic.Anthropic()
self.model = model
def _prepare_image_content(
self,
image_source: str,
is_url: bool = True
) -> dict:
"""Prepare image content in Claude API format."""
if is_url:
return {
"type": "image",
"source": {
"type": "url",
"url": image_source
}
}
else:
with open(image_source, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
ext = Path(image_source).suffix.lower()
media_type_map = {
".jpg": "image/jpeg",
".jpeg": "image/jpeg",
".png": "image/png",
".gif": "image/gif",
".webp": "image/webp"
}
media_type = media_type_map.get(ext, "image/jpeg")
return {
"type": "image",
"source": {
"type": "base64",
"media_type": media_type,
"data": image_data
}
}
def analyze(
self,
image_source: str,
prompt: str,
is_url: bool = True,
system_prompt: str = None,
max_tokens: int = 1000
) -> str:
"""Analyze an image."""
image_content = self._prepare_image_content(image_source, is_url)
messages = [
{
"role": "user",
"content": [
image_content,
{"type": "text", "text": prompt}
]
}
]
kwargs = {
"model": self.model,
"max_tokens": max_tokens,
"messages": messages
}
if system_prompt:
kwargs["system"] = system_prompt
response = self.client.messages.create(**kwargs)
return response.content[0].text
def analyze_code_screenshot(
self,
screenshot_path: str
) -> dict:
"""Analyze a code screenshot and extract the code."""
system_prompt = """You are an expert code analyst.
Accurately extract and analyze code from screenshots."""
extraction_prompt = """From this code screenshot:
1. Extract the code accurately (including indentation)
2. Identify the programming language
3. Explain the main functionality
4. Suggest potential bugs or improvements
Respond in the following JSON format:
{
"language": "programming language",
"code": "extracted code",
"description": "code description",
"potential_issues": ["issue1", "issue2"],
"improvements": ["improvement1", "improvement2"]
}"""
response = self.analyze(
screenshot_path,
extraction_prompt,
is_url=False,
system_prompt=system_prompt,
max_tokens=2000
)
import json
try:
start = response.find('{')
end = response.rfind('}') + 1
return json.loads(response[start:end])
except json.JSONDecodeError:
return {"raw_response": response}
def compare_images(
self,
image_sources: list[tuple[str, bool]],
comparison_prompt: str
) -> str:
"""Comparatively analyze multiple images."""
content = []
for source, is_url in image_sources:
content.append(self._prepare_image_content(source, is_url))
content.append({"type": "text", "text": comparison_prompt})
response = self.client.messages.create(
model=self.model,
max_tokens=2000,
messages=[{"role": "user", "content": content}]
)
return response.content[0].text
def analyze_ui_design(self, ui_screenshot_path: str) -> str:
"""Analyze a UI design screenshot."""
prompt = """Please analyze this UI screenshot from a UX/UI expert perspective:
Analysis areas:
1. Layout structure
2. Color palette
3. Typography
4. Usability assessment
5. Accessibility issues
6. Improvement suggestions
Return in JSON format:
{
"layout": "layout description",
"color_palette": ["main colors"],
"typography": "typography assessment",
"usability_score": 0-10,
"usability_issues": ["issues"],
"accessibility_issues": ["accessibility problems"],
"improvements": ["improvement suggestions"]
}"""
return self.analyze(
ui_screenshot_path,
prompt,
is_url=False,
max_tokens=1500
)
# Usage
analyzer = ClaudeVisionAnalyzer()
result = analyzer.analyze(
"https://example.com/product.jpg",
"Describe the features of this product in detail and suggest a potential target audience.",
is_url=True
)
print(result)
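The `GPT4VisionAnalyzer` and `ClaudeVisionAnalyzer` classes above duplicate the same extension-to-MIME-type table. A small shared helper avoids drift between the two (a sketch; `guess_image_media_type` is a name introduced here):

```python
from pathlib import Path

# Extension → MIME type map, mirroring the tables in the wrappers above;
# unknown extensions fall back to JPEG, matching their behavior.
_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def guess_image_media_type(path: str, default: str = "image/jpeg") -> str:
    """Return the MIME type for an image file based on its extension."""
    return _MEDIA_TYPES.get(Path(path).suffix.lower(), default)

print(guess_image_media_type("photo.PNG"))  # → image/png
print(guess_image_media_type("scan.tiff"))  # → image/jpeg (fallback)
```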
9. Multimodal RAG
Multimodal RAG Overview
Multimodal RAG indexes and retrieves not just text, but also images, tables, charts, and other content types.
Image Indexing Strategy
import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import chromadb
import base64
import io
class MultimodalRAGSystem:
"""Multimodal RAG system."""
def __init__(self):
self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
self.chroma_client = chromadb.Client()
self.image_collection = self.chroma_client.get_or_create_collection(
name="images",
metadata={"hnsw:space": "cosine"}
)
def get_image_embedding(self, image: Image.Image) -> np.ndarray:
"""Convert an image to a CLIP embedding."""
inputs = self.clip_processor(images=image, return_tensors="pt")
with torch.no_grad():
features = self.clip_model.get_image_features(**inputs)
features = torch.nn.functional.normalize(features, p=2, dim=-1)
return features.numpy()[0]
def get_text_embedding(self, text: str) -> np.ndarray:
"""Convert text to a CLIP embedding."""
inputs = self.clip_processor(
text=[text],
return_tensors="pt",
padding=True,
truncation=True
)
with torch.no_grad():
features = self.clip_model.get_text_features(**inputs)
features = torch.nn.functional.normalize(features, p=2, dim=-1)
return features.numpy()[0]
def index_image(
self,
image: Image.Image,
image_id: str,
metadata: dict = None
):
"""Index an image."""
embedding = self.get_image_embedding(image)
buffer = io.BytesIO()
image.save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
doc_metadata = {"image_b64": image_b64}
if metadata:
doc_metadata.update(metadata)
self.image_collection.add(
embeddings=[embedding.tolist()],
ids=[image_id],
metadatas=[doc_metadata]
)
def search_images_by_text(
self,
query: str,
n_results: int = 5
) -> list[dict]:
"""Search for images using text."""
query_embedding = self.get_text_embedding(query)
        results = self.image_collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["metadatas", "distances"]  # ids are always returned
        )
retrieved = []
for i in range(len(results['ids'][0])):
metadata = results['metadatas'][0][i]
image_b64 = metadata.pop('image_b64', None)
image = None
if image_b64:
image_bytes = base64.b64decode(image_b64)
image = Image.open(io.BytesIO(image_bytes))
retrieved.append({
"id": results['ids'][0][i],
"distance": results['distances'][0][i],
"metadata": metadata,
"image": image
})
return retrieved
def multimodal_rag_query(
self,
question: str,
vision_model_fn,
n_image_results: int = 3
) -> str:
"""Perform a multimodal RAG query."""
relevant_images = self.search_images_by_text(question, n_image_results)
if not relevant_images:
return vision_model_fn(question=question, images=[])
retrieved_images = [r["image"] for r in relevant_images if r["image"]]
metadata_info = [
f"Image {i+1}: {r['metadata']}"
for i, r in enumerate(relevant_images)
]
enhanced_prompt = f"""
Question: {question}
Retrieved image information:
{chr(10).join(metadata_info)}
Please answer the question based on these images.
Specifically cite relevant content from each image.
"""
return vision_model_fn(question=enhanced_prompt, images=retrieved_images)
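The collection above is created with `hnsw:space` set to `cosine`, and both `get_image_embedding` and `get_text_embedding` L2-normalize their outputs. For unit vectors, cosine similarity is simply a dot product, and the distance Chroma reports is `1 - similarity`. A minimal numpy sketch with toy vectors (illustrative values only):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize, mirroring the F.normalize(..., p=2) calls above."""
    return v / np.linalg.norm(v)

query_vec = normalize(np.array([1.0, 2.0, 3.0]))
image_vec = normalize(np.array([1.0, 2.0, 2.9]))

cosine_sim = float(query_vec @ image_vec)  # dot product of unit vectors
cosine_dist = 1.0 - cosine_sim             # what Chroma reports for "cosine" space
print(f"similarity={cosine_sim:.4f} distance={cosine_dist:.4f}")
```

This is why normalization before `collection.add` matters: skip it, and the reported distances no longer correspond to cosine distance in the intended way.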
ColPali: PDF Page Retrieval
# ColPali: direct PDF page retrieval using vision-language models
# pip install colpali-engine
from colpali_engine.models import ColPali, ColPaliProcessor
from PIL import Image
import torch
class ColPaliPDFSearch:
"""PDF page search using ColPali."""
def __init__(self, model_name: str = "vidore/colpali-v1.2"):
self.model = ColPali.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="cuda"
)
self.processor = ColPaliProcessor.from_pretrained(model_name)
def index_pdf_pages(
self,
page_images: list[Image.Image]
) -> torch.Tensor:
"""Index PDF page images."""
all_embeddings = []
batch_size = 4
for i in range(0, len(page_images), batch_size):
batch = page_images[i:i + batch_size]
inputs = self.processor.process_images(batch)
inputs = {k: v.to("cuda") for k, v in inputs.items()}
with torch.no_grad():
embeddings = self.model(**inputs)
all_embeddings.append(embeddings)
return torch.cat(all_embeddings, dim=0)
def search(
self,
query: str,
page_embeddings: torch.Tensor,
top_k: int = 3
) -> list[int]:
"""Search for relevant PDF pages using a query."""
query_inputs = self.processor.process_queries([query])
query_inputs = {k: v.to("cuda") for k, v in query_inputs.items()}
with torch.no_grad():
query_embedding = self.model(**query_inputs)
# MaxSim score computation (ColPali's key mechanism)
scores = self.processor.score_multi_vector(
query_embedding,
page_embeddings
)
top_indices = scores[0].argsort(descending=True)[:top_k]
return top_indices.tolist()
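`score_multi_vector` in the snippet above computes ColPali's late-interaction MaxSim score: for each query token embedding, take the maximum similarity over all page patch embeddings, then sum across query tokens. A pure-Python sketch with toy 2-D vectors (illustrative values only, not the library implementation):

```python
def maxsim_score(query_vecs: list[list[float]],
                 page_vecs: list[list[float]]) -> float:
    """Late-interaction MaxSim: for each query token, find the best-matching
    page patch (dot-product similarity), then sum over query tokens."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query-token embeddings
page_a = [[0.9, 0.1], [0.2, 0.8]]  # patches that match both tokens well
page_b = [[0.5, 0.5], [0.5, 0.5]]  # less specific patches

print(f"{maxsim_score(query, page_a):.2f}")  # → 1.70 (0.9 + 0.8)
print(f"{maxsim_score(query, page_b):.2f}")  # → 1.00 (0.5 + 0.5)
```

Because each query token independently picks its best patch, MaxSim rewards pages that cover all parts of the query, which is what makes it effective for dense document pages.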
10. Open-Source Multimodal Models
Phi-3 Vision (Microsoft)
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image
class Phi3VisionModel:
"""Microsoft Phi-3 Vision model."""
def __init__(self):
model_id = "microsoft/Phi-3-vision-128k-instruct"
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
_attn_implementation='flash_attention_2'
)
self.processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True
)
def analyze(self, image: Image.Image, prompt: str) -> str:
"""Analyze an image."""
messages = [
{"role": "user", "content": f"<|image_1|>\n{prompt}"}
]
prompt_text = self.processor.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = self.processor(
prompt_text,
[image],
return_tensors="pt"
).to("cuda")
with torch.no_grad():
output = self.model.generate(
**inputs,
max_new_tokens=500,
eos_token_id=self.processor.tokenizer.eos_token_id
)
generated = output[0][inputs['input_ids'].shape[1]:]
return self.processor.decode(generated, skip_special_tokens=True)
Qwen-VL (Alibaba)
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
class QwenVLModel:
"""Qwen2-VL multimodal model."""
def __init__(self, model_name: str = "Qwen/Qwen2-VL-7B-Instruct"):
self.model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto"
)
self.processor = AutoProcessor.from_pretrained(
model_name,
min_pixels=256*28*28,
max_pixels=1280*28*28
)
def analyze_image(self, image_path: str, question: str) -> str:
"""Analyze an image."""
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image_path},
{"type": "text", "text": question}
]
}
]
text = self.processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt"
).to("cuda")
with torch.no_grad():
output_ids = self.model.generate(**inputs, max_new_tokens=512)
        generated_ids = [
            out_ids[len(in_ids):]
            for in_ids, out_ids in zip(inputs.input_ids, output_ids)
        ]
return self.processor.batch_decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)[0]
Local Execution Guide (Ollama)
# Run multimodal models locally with Ollama
# Install Ollama from ollama.ai
# Download and run the LLaVA model
ollama pull llava:13b
# (Optional) test interactively — the Python client talks to the local Ollama server
ollama run llava:13b
import ollama
from pathlib import Path
class OllamaVisionModel:
"""Local vision model using Ollama."""
def __init__(self, model: str = "llava:13b"):
self.model = model
def analyze(self, image_path: str, prompt: str) -> str:
"""Analyze an image using a local model."""
response = ollama.chat(
model=self.model,
messages=[
{
"role": "user",
"content": prompt,
"images": [image_path]
}
]
)
return response["message"]["content"]
def batch_analyze(
self,
image_paths: list[str],
prompt: str
) -> list[str]:
"""Analyze multiple images sequentially."""
results = []
for path in image_paths:
result = self.analyze(path, prompt)
results.append(result)
return results
# Usage
model = OllamaVisionModel("llava:13b")
result = model.analyze(
"/path/to/image.jpg",
"What is in this image? Please describe it in detail."
)
print(result)
11. Video Understanding AI
The Challenges of Video Understanding
Video understanding is a multimodal task that includes temporal information, making it far more complex than static image understanding.
Key challenges:
- Temporal dependencies: Understanding temporal relationships between frames
- Large data volume: 1 minute of video at 30fps = ~1,800 frames
- Action recognition: Capturing motion patterns
- Multi-scale: Understanding both short actions and long events simultaneously
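The data-volume challenge is easy to quantify, and a quick back-of-the-envelope calculation shows why nearly every video pipeline starts with temporal subsampling (`frame_budget` is an illustrative name, not a library function):

```python
def frame_budget(duration_s: float, fps: float, sampled_fps: float) -> tuple[int, int]:
    """Raw vs. sampled frame counts for a clip of the given duration."""
    raw = int(duration_s * fps)
    sampled = int(duration_s * sampled_fps)
    return raw, sampled

# 1 minute at 30 fps, sampled down to 1 fps
raw, sampled = frame_budget(60, 30, 1)
print(raw, sampled)  # → 1800 60

# A 1-hour lecture is 108,000 raw frames, but only 3,600 at 1 fps
print(frame_budget(3600, 30, 1))  # → (108000, 3600)
```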
Video Feature Extraction with VideoMAE
from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np
class VideoFeatureExtractor:
"""Video feature extractor using VideoMAE."""
def __init__(self, model_name: str = "MCG-NJU/videomae-base"):
self.processor = VideoMAEImageProcessor.from_pretrained(model_name)
self.model = VideoMAEModel.from_pretrained(model_name)
def extract_video_features(
self,
video_frames: list,
num_frames: int = 16
) -> torch.Tensor:
"""Extract features from video frames."""
total_frames = len(video_frames)
indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
sampled_frames = [video_frames[i] for i in indices]
inputs = self.processor(sampled_frames, return_tensors="pt")
with torch.no_grad():
outputs = self.model(**inputs)
return outputs.last_hidden_state
# Extract frames from video with OpenCV
import cv2
from PIL import Image

def extract_frames_from_video(
    video_path: str,
    target_fps: int = 1
) -> list:
    """Extract frames from a video at roughly target_fps frames per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = max(1, int(fps / target_fps))  # guard: fps may be below target_fps
    frames = []
    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))
        frame_count += 1
    cap.release()
    return frames
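As a sanity check on this modulo-based sampling, a tiny helper can show exactly which frame indices survive for a given source and target rate, without needing a video file (`kept_frame_indices` is a name introduced here for illustration):

```python
def kept_frame_indices(total_frames: int, fps: float, target_fps: float) -> list[int]:
    """Indices that modulo sampling (as in the extraction loop above) keeps."""
    interval = max(1, int(fps / target_fps))  # guard: fps may be below target_fps
    return [i for i in range(total_frames) if i % interval == 0]

print(kept_frame_indices(10, 30, 10))  # interval 3 → [0, 3, 6, 9]
print(kept_frame_indices(5, 1, 30))    # interval clamps to 1 → every frame kept
```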
Long Video Understanding with Gemini
import google.generativeai as genai
import time
class LongVideoUnderstanding:
"""Long video understanding system using Gemini 1.5 Pro."""
def __init__(self):
self.model = genai.GenerativeModel("gemini-1.5-pro")
def analyze_long_video(
self,
video_path: str,
analysis_tasks: list[str]
) -> dict:
"""Analyze videos up to 1 hour in length."""
print("Uploading video...")
video_file = genai.upload_file(
path=video_path,
display_name="long_video_analysis"
)
while video_file.state.name == "PROCESSING":
print(f"Processing... (state: {video_file.state.name})")
time.sleep(15)
video_file = genai.get_file(video_file.name)
if video_file.state.name != "ACTIVE":
raise RuntimeError(f"Video processing failed: {video_file.state.name}")
print(f"Upload complete (URI: {video_file.uri})")
results = {}
for task in analysis_tasks:
print(f"Analyzing: {task}")
response = self.model.generate_content(
[video_file, task],
request_options={"timeout": 900}
)
results[task] = response.text
genai.delete_file(video_file.name)
print("File cleanup complete")
return results
def create_video_summary(self, video_path: str) -> dict:
"""Generate a comprehensive summary of a video."""
tasks = [
"Summarize the overall content of this video in 3-5 sentences.",
"List the major scenes with timestamps. Format: MM:SS - description",
"List the main people, objects, and locations that appear in the video.",
"What are the key messages or conclusions emphasized in this video?",
"Who is the target audience and what is the purpose of this video?"
]
return self.analyze_long_video(video_path, tasks)
# Usage
video_analyzer = LongVideoUnderstanding()
summary = video_analyzer.create_video_summary("lecture.mp4")
for task, result in summary.items():
print(f"\n{'='*50}")
print(f"Question: {task}")
print(f"Answer: {result}")
Conclusion
Multimodal AI is advancing rapidly, with the ability to holistically understand text, images, and video becoming increasingly powerful.
Key takeaways from this guide:
- CLIP: Maps images and text into a shared space via contrastive learning, forming the foundation for zero-shot classification
- BLIP/BLIP-2: Efficient multimodal learning through bootstrapping and Q-Former
- LLaVA: The standard for open-source vision-language assistants
- GPT-4V / Claude Vision: The highest-performing commercial multimodal LLMs
- Gemini 1.5: 1M token context for processing long videos and documents
- Multimodal RAG: Building searchable knowledge bases from images using CLIP embeddings
- Open-source ecosystem: Powerful locally-runnable models including Phi-3 Vision and Qwen-VL
The field is heading toward longer video understanding, 3D spatial understanding, and real-time multimodal processing. This area evolves very quickly, making continuous learning essential.
References
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- Li, J. et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597
- Liu, H. et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485
- OpenAI GPT-4 Technical Report: openai.com/research/gpt-4
- Google Gemini API: ai.google.dev/gemini-api
- Anthropic Claude API: docs.anthropic.com
- HuggingFace LLaVA: huggingface.co/llava-hf
- ColPali: Efficient Document Retrieval with Vision Language Models. arxiv.org/abs/2407.01449