
The Complete Guide to Multimodal AI: Mastering CLIP, LLaVA, GPT-4V, and Gemini Vision


Table of Contents

  1. Multimodal AI Overview
  2. CLIP in Detail
  3. The BLIP Family
  4. LLaVA: Large Language and Vision Assistant
  5. InstructBLIP
  6. GPT-4 Vision
  7. Gemini Vision
  8. Claude Vision
  9. Multimodal RAG
  10. Open-Source Multimodal Models
  11. Video Understanding AI

1. Multimodal AI Overview

Limitations of Single-Modality Systems

Traditional AI systems could process only one kind of data (modality): text, images, or audio. This single-modality approach runs into fundamental limits when tackling complex real-world problems.

Limitations of text-only models:

  • They cannot analyze an image when asked to describe one
  • They cannot understand graphs, charts, or screenshots
  • They cannot make decisions that require visual context

Limitations of image-only models:

  • They cannot jointly interpret the text and visual elements within an image
  • They cannot handle language-based question answering
  • They cannot be searched with natural-language descriptions

The Promise of Multimodal AI

Multimodal AI systems can process and understand multiple data formats simultaneously, mimicking the natural human ability to see, hear, and read all at once.

Key application areas:

  • Medical diagnosis: joint analysis of medical images and patient-record text
  • Autonomous driving: fusing camera, LiDAR, and map data
  • Education: automatically generating explanations from textbook images
  • E-commerce: combined processing of product photos, descriptions, and reviews
  • Document understanding: OCR and content analysis of scanned documents
  • Creative applications: image generation from text descriptions (DALL-E, Stable Diffusion)

A Brief History of Vision-Language Models

2021: CLIP (OpenAI) - connects images and text via contrastive learning
2022: BLIP - unified image captioning and VQA
2023: BLIP-2 - efficient multimodal learning via the Q-Former
2023: LLaVA - open-source vision-language assistant
2023: GPT-4V - commercial multimodal LLM
2023: Gemini - Google's multimodal foundation model
2024: Claude 3 Vision - Anthropic's multimodal models
2024: LLaVA-1.6, InternVL2, Qwen-VL2 - open-source improvements
2025: expansion into video and 3D understanding

2. CLIP in Detail

The Core Idea of CLIP

CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, uses contrastive learning on 400 million image-text pairs to map images and text into a shared embedding space.

Key innovation: by training solely on image-caption pairs collected from the internet, without any manual labels, CLIP acquired strong zero-shot classification capabilities.

CLIP Architecture

Image → [Image encoder (ViT/ResNet)]  → image embedding (512-d)
                                         ↕ similarity
Text  → [Text encoder (Transformer)]  → text embedding (512-d)

Contrastive learning mechanism:

For the N image-text pairs in a batch:

  • Maximize the similarity of correct pairs (the diagonal)
  • Minimize the similarity of incorrect pairs (the off-diagonal)

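This symmetric objective can be sketched in isolation before touching any pretrained weights. Below is a minimal, self-contained version of the loss, with random tensors standing in for the encoder outputs (the 0.07 temperature follows the paper's initialization value):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matching image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (N, N) similarity matrix
    labels = torch.arange(len(logits))           # correct pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, labels)     # image → text direction
    loss_t = F.cross_entropy(logits.T, labels)   # text → image direction
    return (loss_i + loss_t) / 2

torch.manual_seed(0)
img = torch.randn(8, 512)  # stand-ins for image embeddings
txt = torch.randn(8, 512)  # stand-ins for text embeddings
print(clip_contrastive_loss(img, txt))           # unrelated pairs: high loss (~ln 8)
print(clip_contrastive_loss(img, img.clone()))   # perfectly aligned pairs: near-zero loss
```

Driving the loss down forces each image embedding toward its own caption and away from the other N-1 captions in the batch, which is why large batch sizes help CLIP-style training.
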
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_zero_shot_classification(
    image: Image.Image,
    candidate_labels: list[str]
) -> dict[str, float]:
    """CLIPを使用したゼロショット画像分類。"""

    # Build text prompts (CLIP's recommended format)
    text_prompts = [f"a photo of a {label}" for label in candidate_labels]

    # Preprocess
    inputs = processor(
        text=text_prompts,
        images=image,
        return_tensors="pt",
        padding=True
    )

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits_per_image = outputs.logits_per_image
        probs = logits_per_image.softmax(dim=1)

    return {
        label: prob.item()
        for label, prob in zip(candidate_labels, probs[0])
    }

# Usage example
image_url = "https://example.com/sample_image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

labels = ["cat", "dog", "bird", "fish", "rabbit"]
results = clip_zero_shot_classification(image, labels)

# Sort by probability
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for label, prob in sorted_results:
    print(f"{label}: {prob:.4f} ({prob*100:.1f}%)")

Image-Text Retrieval

import numpy as np
from typing import Union

class CLIPSearchEngine:
    """CLIPベースの画像テキスト検索エンジン。"""

    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.image_embeddings = []
        self.image_metadata = []

    def encode_images(self, images: list[Image.Image]) -> torch.Tensor:
        """画像バッチを埋め込みに変換。"""
        inputs = self.processor(
            images=images,
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
            image_features = F.normalize(image_features, p=2, dim=-1)
        return image_features

    def encode_texts(self, texts: list[str]) -> torch.Tensor:
        """テキストバッチを埋め込みに変換。"""
        inputs = self.processor(
            text=texts,
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
            text_features = F.normalize(text_features, p=2, dim=-1)
        return text_features

    def index_images(
        self,
        images: list[Image.Image],
        metadata: list[dict] = None
    ):
        """検索用に画像をインデックス化。"""
        embeddings = self.encode_images(images)
        self.image_embeddings.append(embeddings)
        if metadata:
            self.image_metadata.extend(metadata)

    def text_to_image_search(
        self,
        query: str,
        top_k: int = 5
    ) -> list[dict]:
        """テキストクエリで画像を検索。"""
        if not self.image_embeddings:
            return []

        all_embeddings = torch.cat(self.image_embeddings, dim=0)
        query_embedding = self.encode_texts([query])

        # Cosine similarity (embeddings are already L2-normalized)
        similarities = (all_embeddings @ query_embedding.T).squeeze(-1)
        top_indices = similarities.argsort(descending=True)[:top_k]

        results = []
        for idx in top_indices:
            idx = idx.item()
            result = {
                "index": idx,
                "similarity": similarities[idx].item()
            }
            if self.image_metadata:
                result.update(self.image_metadata[idx])
            results.append(result)

        return results
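The retrieval step in text_to_image_search reduces to a normalized dot product followed by top-k. A quick sanity check of that math, with random unit vectors standing in for real CLIP embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
index = F.normalize(torch.randn(100, 512), dim=-1)  # 100 "image" embeddings
query = index[42:43].clone()                        # a query identical to item 42

scores = (index @ query.T).squeeze(-1)              # cosine similarity to every item
top5 = scores.argsort(descending=True)[:5]
print(top5[0].item())  # 42 — the identical item ranks first with similarity 1.0
```

Because both sides are L2-normalized, the dot product is exactly cosine similarity, so an exact match always scores 1.0 and sorts to the top.
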

OpenCLIP (Open-Source CLIP)

# OpenCLIP: supports a variety of architectures and training datasets
# pip install open_clip_torch

import open_clip
import torch
from PIL import Image

# List the available pretrained models
available_models = open_clip.list_pretrained()
print("Available models:", available_models[:5])

# Load a large model trained on LAION-2B
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-H-14',
    pretrained='laion2b_s32b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-H-14')

def compute_clip_similarity(
    image: Image.Image,
    texts: list[str]
) -> list[float]:
    """画像とテキストリスト間のCLIP類似度を計算。"""
    model.eval()

    image_input = preprocess_val(image).unsqueeze(0)
    text_input = tokenizer(texts)

    with torch.no_grad(), torch.cuda.amp.autocast():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)

        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return similarity[0].tolist()

Fine-Tuning CLIP

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import CLIPModel, CLIPProcessor

class ImageTextDataset(Dataset):
    """画像テキストペアデータセット。"""

    def __init__(
        self,
        image_paths: list[str],
        texts: list[str],
        processor: CLIPProcessor
    ):
        self.image_paths = image_paths
        self.texts = texts
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        text = self.texts[idx]

        inputs = self.processor(
            images=image,
            text=text,
            return_tensors="pt",
            padding="max_length",
            max_length=77,
            truncation=True
        )

        return {
            "pixel_values": inputs["pixel_values"].squeeze(0),
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

class CLIPFineTuner:
    """CLIPモデルのファインチューニングクラス。"""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def contrastive_loss(
        self,
        image_features: torch.Tensor,
        text_features: torch.Tensor,
        temperature: float = 0.07
    ) -> torch.Tensor:
        """対照学習の損失関数。"""
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        logits = torch.matmul(image_features, text_features.T) / temperature
        labels = torch.arange(len(logits)).to(self.device)

        loss_i = F.cross_entropy(logits, labels)
        loss_t = F.cross_entropy(logits.T, labels)

        return (loss_i + loss_t) / 2

    def train(
        self,
        train_dataset: ImageTextDataset,
        num_epochs: int = 10,
        batch_size: int = 32,
        learning_rate: float = 1e-5
    ):
        """CLIPのファインチューニング。"""
        dataloader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=4
        )

        optimizer = optim.AdamW(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=0.01
        )

        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=num_epochs
        )

        for epoch in range(num_epochs):
            total_loss = 0
            self.model.train()

            for batch in dataloader:
                pixel_values = batch["pixel_values"].to(self.device)
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)

                outputs = self.model(
                    pixel_values=pixel_values,
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

                loss = self.contrastive_loss(image_features, text_features)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            scheduler.step()
            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

3. The BLIP Family

BLIP: Bootstrapping Language-Image Pre-training

BLIP, released by Salesforce Research in 2022, excels at a wide range of vision-language tasks, including image captioning, image-text retrieval, and visual question answering (VQA).

Key innovation: data bootstrapping with a captioner and a filter, which cleans the noisy image-text pairs collected from the web.

from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import BlipForQuestionAnswering
from PIL import Image
import torch

# BLIP image captioning
class BLIPCaptioner:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-image-captioning-large"
        )
        self.model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-large",
            torch_dtype=torch.float16
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

    def caption(
        self,
        image: Image.Image,
        conditional_text: str = None,
        max_new_tokens: int = 50
    ) -> str:
        """画像のキャプションを生成。"""
        if conditional_text:
            inputs = self.processor(
                image,
                conditional_text,
                return_tensors="pt"
            ).to(self.device, torch.float16)
        else:
            inputs = self.processor(
                image,
                return_tensors="pt"
            ).to(self.device, torch.float16)

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True
            )

        return self.processor.decode(output[0], skip_special_tokens=True)

# BLIP VQA (visual question answering)
class BLIPVisualQA:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-vqa-base"
        )
        self.model = BlipForQuestionAnswering.from_pretrained(
            "Salesforce/blip-vqa-base"
        )

    def answer(self, image: Image.Image, question: str) -> str:
        """画像に関する質問に回答。"""
        inputs = self.processor(image, question, return_tensors="pt")

        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=50)

        return self.processor.decode(output[0], skip_special_tokens=True)

# Usage
captioner = BLIPCaptioner()
vqa = BLIPVisualQA()

image = Image.open("sample.jpg")

caption = captioner.caption(image)
print(f"Caption: {caption}")

cond_caption = captioner.caption(image, "a photo of")
print(f"Conditional caption: {cond_caption}")

answer = vqa.answer(image, "What color is the sky?")
print(f"Answer: {answer}")

BLIP-2: The Querying Transformer

BLIP-2, released in 2023, introduces the Q-Former (querying transformer), which efficiently bridges a frozen image encoder and a frozen LLM.

The role of the Q-Former:

  • Extracts the most important visual features from the image encoder's output
  • 32 learnable query tokens interact with the image features to produce a compressed representation
  • This compressed representation is passed as input to the LLM

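The query-token mechanism can be illustrated with a minimal cross-attention sketch. The dimensions below are illustrative, not BLIP-2's actual ones, and the real Q-Former is a BERT-style stack with both self- and cross-attention layers, so treat this as shape-level intuition only:

```python
import torch
import torch.nn as nn

num_queries, dim = 32, 768
queries = nn.Parameter(torch.randn(1, num_queries, dim))  # 32 learnable query tokens
cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

# Frozen ViT output: e.g. 257 patch tokens (16x16 grid + CLS) for one image
image_features = torch.randn(1, 257, dim)

# Queries attend to the image features and come out as a fixed-size summary
compressed, _ = cross_attn(queries, image_features, image_features)
print(compressed.shape)  # torch.Size([1, 32, 768])
```

The key property: no matter how many patch tokens the vision encoder produces, the LLM always receives a fixed 32-token visual summary, which keeps the language model's input short and cheap.
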
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

class BLIP2Assistant:
    """BLIP-2ベースの視覚的質疑応答アシスタント。"""

    def __init__(
        self,
        model_name: str = "Salesforce/blip2-opt-2.7b"
    ):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate_response(
        self,
        image: Image.Image,
        prompt: str = None,
        max_new_tokens: int = 200,
        temperature: float = 1.0
    ) -> str:
        """画像と任意のプロンプトへの応答を生成。"""
        if prompt:
            inputs = self.processor(
                images=image,
                text=prompt,
                return_tensors="pt"
            ).to("cuda", torch.float16)
        else:
            inputs = self.processor(
                images=image,
                return_tensors="pt"
            ).to("cuda", torch.float16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )

        generated_text = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0].strip()

        return generated_text

    def batch_caption(
        self,
        images: list[Image.Image],
        batch_size: int = 8
    ) -> list[str]:
        """画像バッチのキャプションを生成。"""
        all_captions = []

        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            inputs = self.processor(
                images=batch,
                return_tensors="pt",
                padding=True
            ).to("cuda", torch.float16)

            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=50
                )

            captions = self.processor.batch_decode(
                generated_ids,
                skip_special_tokens=True
            )
            all_captions.extend([c.strip() for c in captions])

        return all_captions

# Usage
assistant = BLIP2Assistant("Salesforce/blip2-flan-t5-xxl")
image = Image.open("document.png")

response = assistant.generate_response(
    image,
    "Question: What is the main topic of this document? Answer:"
)
print(response)

4. LLaVA: Large Language and Vision Assistant

LLaVA Architecture

LLaVA, an open-source vision-language model released in 2023, connects a strong LLM (LLaMA, Vicuna) with a CLIP vision encoder to build an instruction-following multimodal chatbot.

Architecture components:

Image → [CLIP ViT-L/14] → image features (1024-d)
                        → [Linear projection layer] → visual tokens
Visual tokens + text tokens → [LLM (LLaMA/Vicuna)] → final response

Improvements in LLaVA-1.5:

  • MLP projection (linear layer → 2-layer MLP)
  • Support for higher-resolution images
  • More training data

Improvements in LLaVA-1.6 (LLaVA-NeXT):

  • Dynamic high resolution: up to 672x672 → 4x more visual tokens
  • Better reasoning and OCR capabilities
  • Support for a range of aspect ratios
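The "4x more visual tokens" figure follows from simple patch arithmetic. A back-of-the-envelope sketch, assuming the CLIP ViT-L/14 encoder at 336px per tile (the actual LLaVA-NeXT grid selection and token layout are more involved than this):

```python
patch = 14  # ViT-L/14 patch size
tile = 336  # resolution of one encoder tile

tokens_per_tile = (tile // patch) ** 2
print(tokens_per_tile)  # 576 tokens for a single 336x336 image (LLaVA-1.5)

# LLaVA-1.6 covers a 672x672 input with a 2x2 grid of 336px tiles
grid_tokens = 4 * tokens_per_tile
print(grid_tokens)  # 2304 tokens — 4x more than LLaVA-1.5
```

More visual tokens mean finer spatial detail reaches the LLM, which is why OCR and small-text reading improve, at the cost of a longer (and more expensive) input sequence.
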

Using LLaVA with Hugging Face

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

class LLaVAAssistant:
    """LLaVA-1.6ベースのビジュアルアシスタント。"""

    def __init__(
        self,
        model_name: str = "llava-hf/llava-v1.6-mistral-7b-hf"
    ):
        self.processor = LlavaNextProcessor.from_pretrained(model_name)
        self.model = LlavaNextForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto"
        )

    def chat(
        self,
        image: Image.Image,
        message: str,
        max_new_tokens: int = 500,
        temperature: float = 0.7
    ) -> str:
        """画像についてモデルとチャット。"""
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": message}
                ]
            }
        ]

        prompt = self.processor.apply_chat_template(
            conversation,
            add_generation_prompt=True
        )

        inputs = self.processor(
            prompt,
            image,
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0,
                pad_token_id=self.processor.tokenizer.eos_token_id
            )

        generated = output[0][inputs["input_ids"].shape[1]:]
        return self.processor.decode(generated, skip_special_tokens=True)

    def analyze_chart(self, chart_image: Image.Image) -> dict:
        """チャート画像を分析。"""
        analysis_prompts = [
            "What is the title of this chart?",
            "What do the x-axis and y-axis represent?",
            "What are the highest and lowest values?",
            "Describe the overall trend.",
            "What is the most important insight from this data?"
        ]

        results = {}
        for prompt in analysis_prompts:
            response = self.chat(chart_image, prompt)
            results[prompt] = response

        return results

    def extract_text_from_image(self, image: Image.Image) -> str:
        """画像からテキストを抽出(OCR)。"""
        return self.chat(
            image,
            "Please accurately extract all the text in this image. "
            "Return only the text without any additional explanations."
        )


# Production use: a document analysis pipeline
class DocumentAnalysisPipeline:
    """Document analysis pipeline built on LLaVA."""

    def __init__(self):
        self.llava = LLaVAAssistant()

    def analyze_document(self, document_image: Image.Image) -> dict:
        """文書画像を包括的に分析。"""

        doc_type = self.llava.chat(
            document_image,
            "What type of document is this? (invoice, contract, report, form, etc.)"
        )

        extracted_text = self.llava.extract_text_from_image(document_image)

        key_info = self.llava.chat(
            document_image,
            f"Extract the following information from this {doc_type} in JSON format: "
            "date, sender, recipient, amount (if applicable), main content summary"
        )

        action_items = self.llava.chat(
            document_image,
            "If there are any required actions in this document, please list them."
        )

        return {
            "document_type": doc_type,
            "extracted_text": extracted_text,
            "key_information": key_info,
            "action_items": action_items
        }

5. InstructBLIP

The Core of InstructBLIP

InstructBLIP applies instruction tuning on top of BLIP-2 so that the model can follow a wide variety of instructions. The key point is that the Q-Former becomes instruction-aware: it extracts the visual features relevant to the given instruction.

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image

class InstructBLIPAssistant:
    """InstructBLIPベースの命令追従アシスタント。"""

    def __init__(self, model_name: str = "Salesforce/instructblip-vicuna-7b"):
        self.processor = InstructBlipProcessor.from_pretrained(model_name)
        self.model = InstructBlipForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def instruct(
        self,
        image: Image.Image,
        instruction: str,
        max_new_tokens: int = 300
    ) -> str:
        """画像についての特定の命令を実行。"""
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to("cuda", torch.float16)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                do_sample=False,
                num_beams=5,
                max_new_tokens=max_new_tokens,
                min_length=1,
                top_p=0.9,
                repetition_penalty=1.5,
                length_penalty=1.0,
                temperature=1.0
            )

        generated_text = self.processor.batch_decode(
            outputs,
            skip_special_tokens=True
        )[0].strip()

        return generated_text

# A variety of usage examples
assistant = InstructBLIPAssistant()
image = Image.open("complex_diagram.png")

description = assistant.instruct(
    image,
    "Please explain this diagram in detail, including the role of each component and its connections."
)

objects = assistant.instruct(
    image,
    "List all the objects found in this image and describe the location of each."
)

emotion = assistant.instruct(
    image,
    "Analyze the emotional state of the people in this image and explain your reasoning."
)

6. GPT-4 Vision

Using the GPT-4V API

GPT-4 Vision adds visual capabilities to OpenAI's GPT-4, making it one of the strongest commercial multimodal LLMs available.

import openai
import base64
from pathlib import Path

def encode_image_to_base64(image_path: str) -> str:
    """画像ファイルをBase64にエンコード。"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

class GPT4VisionAnalyzer:
    """GPT-4 Visionベースの画像アナライザー。"""

    def __init__(self, model: str = "gpt-4o"):
        self.client = openai.OpenAI()
        self.model = model

    def analyze_image(
        self,
        image_source: str,
        prompt: str,
        is_url: bool = True,
        detail: str = "high",
        max_tokens: int = 1000
    ) -> str:
        """単一画像を分析。"""
        if is_url:
            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": image_source,
                    "detail": detail
                }
            }
        else:
            base64_image = encode_image_to_base64(image_source)
            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")

            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{media_type};base64,{base64_image}",
                    "detail": detail
                }
            }

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_content,
                        {"type": "text", "text": prompt}
                    ]
                }
            ],
            max_tokens=max_tokens
        )

        return response.choices[0].message.content

    def analyze_multiple_images(
        self,
        image_sources: list[dict],
        prompt: str,
        max_tokens: int = 2000
    ) -> str:
        """複数画像を同時に分析。"""
        content = []

        for img_info in image_sources:
            source = img_info["source"]
            is_url = img_info.get("is_url", True)

            if is_url:
                content.append({
                    "type": "image_url",
                    "image_url": {"url": source, "detail": "high"}
                })
            else:
                base64_image = encode_image_to_base64(source)
                content.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                })

        content.append({"type": "text", "text": prompt})

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": content}],
            max_tokens=max_tokens
        )

        return response.choices[0].message.content

    def analyze_chart_or_graph(self, image_source: str, is_url: bool = True) -> dict:
        """チャートやグラフを分析して構造化データを返す。"""
        prompt = """Analyze this chart/graph and return the following JSON:
{
  "chart_type": "bar/line/pie/scatter/etc.",
  "title": "chart title",
  "x_axis": {"label": "X-axis label", "unit": "unit"},
  "y_axis": {"label": "Y-axis label", "unit": "unit"},
  "data_series": [{"name": "series name", "trend": "rising/falling/stable"}],
  "key_findings": ["finding1", "finding2"],
  "data_range": {"min": 0, "max": 0},
  "anomalies": ["anomaly description"]
}"""

        response = self.analyze_image(
            image_source,
            prompt,
            is_url=is_url,
            detail="high",
            max_tokens=1500
        )

        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            if start >= 0 and end > start:
                return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

        return {"raw_response": response}


# Production use: e-commerce product analysis
def analyze_product_images(image_urls: list[str]) -> str:
    """Analyze a set of product images."""
    analyzer = GPT4VisionAnalyzer()
    image_sources = [{"source": url, "is_url": True} for url in image_urls]

    return analyzer.analyze_multiple_images(
        image_sources,
        prompt="""Analyze these product images and return the following JSON:
{
  "product_name": "estimated product name",
  "category": "product category",
  "color_options": ["color list"],
  "key_features": ["key features"],
  "condition": "new/used/etc.",
  "quality_score": 0-10,
  "marketing_description": "marketing copy (100 chars)",
  "seo_keywords": ["SEO keywords"]
}"""
    )

7. Gemini Vision

Gemini's Multimodal Capabilities

Google's Gemini was designed from the ground up with multimodality in mind. Gemini 1.5 Pro in particular can process long videos, long documents, and large numbers of images within its 1-million-token context window.

import google.generativeai as genai
import PIL.Image
from pathlib import Path
import base64

genai.configure(api_key="YOUR_GEMINI_API_KEY")

class GeminiVisionAnalyzer:
    """Gemini Visionベースのアナライザー。"""

    def __init__(self, model_name: str = "gemini-1.5-pro"):
        self.model = genai.GenerativeModel(model_name)
        self.flash_model = genai.GenerativeModel("gemini-1.5-flash")

    def analyze_image(self, image_path: str, prompt: str) -> str:
        """画像を分析。"""
        image = PIL.Image.open(image_path)
        response = self.model.generate_content([prompt, image])
        return response.text

    def analyze_with_url(self, image_url: str, prompt: str) -> str:
        """URLから画像を分析。"""
        import httpx
        image_data = httpx.get(image_url).content

        image_part = {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image_data).decode('utf-8')
        }

        response = self.model.generate_content([
            {"text": prompt},
            image_part
        ])
        return response.text

    def analyze_video(
        self,
        video_path: str,
        questions: list[str]
    ) -> dict:
        """動画を分析(Gemini 1.5 Proの強み)。"""
        print(f"Uploading video: {video_path}")
        video_file = genai.upload_file(
            path=video_path,
            display_name="analysis_video"
        )

        import time
        while video_file.state.name == "PROCESSING":
            print("Processing...")
            time.sleep(10)
            video_file = genai.get_file(video_file.name)

        if video_file.state.name == "FAILED":
            raise ValueError("Video processing failed")

        print(f"Upload complete: {video_file.uri}")

        results = {}
        for question in questions:
            response = self.model.generate_content(
                [video_file, question],
                request_options={"timeout": 600}
            )
            results[question] = response.text

        genai.delete_file(video_file.name)
        return results

    def analyze_multiple_images_interleaved(
        self,
        image_text_pairs: list[dict]
    ) -> str:
        """画像とテキストが混在する複合クエリを処理。"""
        content = []
        for pair in image_text_pairs:
            if "text" in pair:
                content.append(pair["text"])
            if "image" in pair:
                content.append(pair["image"])

        response = self.model.generate_content(content)
        return response.text

    def process_document_batch(
        self,
        document_images: list[PIL.Image.Image],
        extraction_schema: str
    ) -> list[dict]:
        """複数の文書を一括処理(Geminiの長いコンテキストを活用)。"""
        import json

        content = [f"Please analyze the following {len(document_images)} documents:\n"]

        for i, img in enumerate(document_images, 1):
            content.append(f"\n--- Document {i} ---")
            content.append(img)

        content.append(f"\nExtract data from each document using this JSON schema:\n{extraction_schema}")

        response = self.model.generate_content(content)

        try:
            text = response.text
            start = text.find('[')
            end = text.rfind(']') + 1
            if start >= 0 and end > start:
                return json.loads(text[start:end])
        except json.JSONDecodeError:
            pass

        # Fall back to the raw text if no valid JSON array was found
        return [{"raw_response": response.text}]


# Usage: video analysis
analyzer = GeminiVisionAnalyzer()

video_questions = [
    "Please summarize the overall content of this video.",
    "List the major scenes with their timestamps.",
    "What are the main keywords or concepts mentioned in this video?",
    "What is the topic and purpose of this video?"
]

results = analyzer.analyze_video("lecture_video.mp4", video_questions)
for question, answer in results.items():
    print(f"\nQuestion: {question}")
    print(f"Answer: {answer}")

8. Claude Vision

The Claude Vision API

Anthropic's Claude 3.5 Sonnet has strong vision capabilities and is particularly good at document understanding, code-screenshot analysis, and detailed image interpretation.

import anthropic
import base64
from pathlib import Path

class ClaudeVisionAnalyzer:
    """Claude Visionベースの画像アナライザー。"""

    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic()
        self.model = model

    def _prepare_image_content(
        self,
        image_source: str,
        is_url: bool = True
    ) -> dict:
        """Claude APIフォーマットで画像コンテンツを準備。"""
        if is_url:
            return {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": image_source
                }
            }
        else:
            with open(image_source, "rb") as f:
                image_data = base64.standard_b64encode(f.read()).decode("utf-8")

            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")

            return {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }
            }

    def analyze(
        self,
        image_source: str,
        prompt: str,
        is_url: bool = True,
        system_prompt: str = None,
        max_tokens: int = 1000
    ) -> str:
        """画像を分析。"""
        image_content = self._prepare_image_content(image_source, is_url)

        messages = [
            {
                "role": "user",
                "content": [
                    image_content,
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        kwargs = {
            "model": self.model,
            "max_tokens": max_tokens,
            "messages": messages
        }

        if system_prompt:
            kwargs["system"] = system_prompt

        response = self.client.messages.create(**kwargs)
        return response.content[0].text

    def analyze_code_screenshot(
        self,
        screenshot_path: str
    ) -> dict:
        """コードスクリーンショットを分析してコードを抽出。"""
        system_prompt = """You are an expert code analyst.
Accurately extract and analyze code from screenshots."""

        extraction_prompt = """From this code screenshot:
1. Extract the code accurately (including indentation)
2. Identify the programming language
3. Explain the main functionality
4. Suggest potential bugs or improvements

Respond in the following JSON format:
{
  "language": "programming language",
  "code": "extracted code",
  "description": "code description",
  "potential_issues": ["issue1", "issue2"],
  "improvements": ["improvement1", "improvement2"]
}"""

        response = self.analyze(
            screenshot_path,
            extraction_prompt,
            is_url=False,
            system_prompt=system_prompt,
            max_tokens=2000
        )

        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            if start < 0 or end <= start:
                return {"raw_response": response}
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            return {"raw_response": response}

    def compare_images(
        self,
        image_sources: list[tuple[str, bool]],
        comparison_prompt: str
    ) -> str:
        """複数画像を比較分析。"""
        content = []

        for source, is_url in image_sources:
            content.append(self._prepare_image_content(source, is_url))

        content.append({"type": "text", "text": comparison_prompt})

        response = self.client.messages.create(
            model=self.model,
            max_tokens=2000,
            messages=[{"role": "user", "content": content}]
        )

        return response.content[0].text

    def analyze_ui_design(self, ui_screenshot_path: str) -> str:
        """UIデザインスクリーンショットを分析。"""
        prompt = """Please analyze this UI screenshot from a UX/UI expert perspective:

Analysis areas:
1. Layout structure
2. Color palette
3. Typography
4. Usability assessment
5. Accessibility issues
6. Improvement suggestions

Return in JSON format:
{
  "layout": "layout description",
  "color_palette": ["main colors"],
  "typography": "typography assessment",
  "usability_score": 0-10,
  "usability_issues": ["issues"],
  "accessibility_issues": ["accessibility problems"],
  "improvements": ["improvement suggestions"]
}"""

        return self.analyze(
            ui_screenshot_path,
            prompt,
            is_url=False,
            max_tokens=1500
        )

# 使用方法
analyzer = ClaudeVisionAnalyzer()

result = analyzer.analyze(
    "https://example.com/product.jpg",
    "Describe the features of this product in detail and suggest a potential target audience.",
    is_url=True
)
print(result)
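上記のようにモデル応答からJSON部分を切り出してパースする処理は、GeminiでもClaudeでも頻出のパターンです。以下は、この処理を補助関数にまとめた最小スケッチです(関数名 `extract_json_block` はこの記事独自の仮定です)。

```python
import json

def extract_json_block(text: str):
    """モデル応答テキストから最初の '{' と最後の '}' で囲まれた
    JSONオブジェクトを抽出する。失敗時はNoneを返す。"""
    start = text.find('{')
    end = text.rfind('}') + 1
    if start < 0 or end <= start:
        return None
    try:
        return json.loads(text[start:end])
    except json.JSONDecodeError:
        return None

# 使用例: 前置きの文章が付いた応答からJSONだけを取り出す
raw = 'Here is the analysis:\n{"language": "python", "code": "print(1)"}'
parsed = extract_json_block(raw)
```

呼び出し側では `None` チェックだけでフォールバック(`{"raw_response": ...}` を返すなど)に切り替えられます。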

9. マルチモーダルRAG

マルチモーダルRAG概要

マルチモーダルRAGはテキストだけでなく、画像、テーブル、チャート、その他のコンテンツタイプもインデックス化して検索します。

画像インデックス戦略

import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import chromadb
import base64
import io

class MultimodalRAGSystem:
    """マルチモーダルRAGシステム。"""

    def __init__(self):
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        self.chroma_client = chromadb.Client()
        self.image_collection = self.chroma_client.get_or_create_collection(
            name="images",
            metadata={"hnsw:space": "cosine"}
        )

    def get_image_embedding(self, image: Image.Image) -> np.ndarray:
        """画像をCLIP埋め込みに変換。"""
        inputs = self.clip_processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = self.clip_model.get_image_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def get_text_embedding(self, text: str) -> np.ndarray:
        """テキストをCLIP埋め込みに変換。"""
        inputs = self.clip_processor(
            text=[text],
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            features = self.clip_model.get_text_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def index_image(
        self,
        image: Image.Image,
        image_id: str,
        metadata: dict = None
    ):
        """画像をインデックス化。"""
        embedding = self.get_image_embedding(image)

        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        image_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

        doc_metadata = {"image_b64": image_b64}
        if metadata:
            doc_metadata.update(metadata)

        self.image_collection.add(
            embeddings=[embedding.tolist()],
            ids=[image_id],
            metadatas=[doc_metadata]
        )

    def search_images_by_text(
        self,
        query: str,
        n_results: int = 5
    ) -> list[dict]:
        """テキストで画像を検索。"""
        query_embedding = self.get_text_embedding(query)

        results = self.image_collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["metadatas", "distances"]  # idsは常に返される
        )

        retrieved = []
        for i in range(len(results['ids'][0])):
            metadata = results['metadatas'][0][i]
            image_b64 = metadata.pop('image_b64', None)

            image = None
            if image_b64:
                image_bytes = base64.b64decode(image_b64)
                image = Image.open(io.BytesIO(image_bytes))

            retrieved.append({
                "id": results['ids'][0][i],
                "distance": results['distances'][0][i],
                "metadata": metadata,
                "image": image
            })

        return retrieved

    def multimodal_rag_query(
        self,
        question: str,
        vision_model_fn,
        n_image_results: int = 3
    ) -> str:
        """マルチモーダルRAGクエリを実行。"""
        relevant_images = self.search_images_by_text(question, n_image_results)

        if not relevant_images:
            return vision_model_fn(question=question, images=[])

        retrieved_images = [r["image"] for r in relevant_images if r["image"]]
        metadata_info = [
            f"Image {i+1}: {r['metadata']}"
            for i, r in enumerate(relevant_images)
        ]

        enhanced_prompt = f"""
Question: {question}

Retrieved image information:
{chr(10).join(metadata_info)}

Please answer the question based on these images.
Specifically cite relevant content from each image.
"""

        return vision_model_fn(question=enhanced_prompt, images=retrieved_images)
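上のコードでは画像・テキスト双方の埋め込みをL2正規化しているため、コサイン類似度は単純な内積に一致します。以下はこの検索の数値的な仕組みを確認する最小のnumpyスケッチです(埋め込み値はダミーで、実際のCLIPは512次元以上です)。

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2正規化(上のCLIP埋め込みの後処理と同じ役割)。"""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# ダミーの画像埋め込み3件(実際は get_image_embedding の出力)
image_embs = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]))
# ダミーのテキストクエリ埋め込み(1件目の画像に近い方向)
query_emb = normalize(np.array([0.9, 0.1, 0.0]))

# 正規化済みなので内積=コサイン類似度
scores = image_embs @ query_emb
ranked = np.argsort(-scores)  # 類似度の高い順のインデックス
```

ChromaDBの `hnsw:space: "cosine"` 設定は、この計算を近似最近傍探索として大規模に行うものです。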

ColPali: PDFページ検索

# ColPali: ビジョン言語モデルを使用した直接PDFページ検索
# pip install colpali-engine

from colpali_engine.models import ColPali, ColPaliProcessor
from PIL import Image
import torch

class ColPaliPDFSearch:
    """ColPaliを使用したPDFページ検索。"""

    def __init__(self, model_name: str = "vidore/colpali-v1.2"):
        self.model = ColPali.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda"
        )
        self.processor = ColPaliProcessor.from_pretrained(model_name)

    def index_pdf_pages(
        self,
        page_images: list[Image.Image]
    ) -> torch.Tensor:
        """PDFページ画像をインデックス化。"""
        all_embeddings = []
        batch_size = 4

        for i in range(0, len(page_images), batch_size):
            batch = page_images[i:i + batch_size]
            inputs = self.processor.process_images(batch)
            inputs = {k: v.to("cuda") for k, v in inputs.items()}

            with torch.no_grad():
                embeddings = self.model(**inputs)

            all_embeddings.append(embeddings)

        return torch.cat(all_embeddings, dim=0)

    def search(
        self,
        query: str,
        page_embeddings: torch.Tensor,
        top_k: int = 3
    ) -> list[int]:
        """クエリで関連PDFページを検索。"""
        query_inputs = self.processor.process_queries([query])
        query_inputs = {k: v.to("cuda") for k, v in query_inputs.items()}

        with torch.no_grad():
            query_embedding = self.model(**query_inputs)

        # MaxSimスコア計算(ColPaliの重要メカニズム)
        scores = self.processor.score_multi_vector(
            query_embedding,
            page_embeddings
        )

        top_indices = scores[0].argsort(descending=True)[:top_k]
        return top_indices.tolist()
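ColPaliの核心であるMaxSimスコアは、クエリの各トークン埋め込みについてページ側の全パッチ埋め込みとの最大類似度を取り、それをトークン全体で合計するものです。以下は仕組みを示すnumpyによる最小スケッチで、`score_multi_vector` の内部実装そのものではありません(値はすべてダミーです)。

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """MaxSim(後期インタラクション)スコア。
    query_vecs: (クエリトークン数, 次元), page_vecs: (パッチ数, 次元)"""
    sim = query_vecs @ page_vecs.T       # 各トークン×各パッチの類似度行列
    return float(sim.max(axis=1).sum())  # トークンごとの最大値を合計

# ダミー例: クエリ2トークン、ページ3パッチ(次元4)
query = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
page_a = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
page_b = np.array([[0.2, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])

scores = [maxsim_score(query, p) for p in (page_a, page_b)]
best_page = int(np.argmax(scores))  # page_aの方がクエリに合致
```

単一ベクトルのコサイン類似度と違い、トークン単位で最も近いパッチを選ぶため、ページ内の局所的な一致(表の一部、図のラベルなど)を拾えるのがMaxSimの利点です。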

10. オープンソースマルチモーダルモデル

Phi-3 Vision(Microsoft)

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

class Phi3VisionModel:
    """Microsoft Phi-3 Visionモデル。"""

    def __init__(self):
        model_id = "microsoft/Phi-3-vision-128k-instruct"

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="cuda",
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            _attn_implementation='flash_attention_2'
        )

        self.processor = AutoProcessor.from_pretrained(
            model_id,
            trust_remote_code=True
        )

    def analyze(self, image: Image.Image, prompt: str) -> str:
        """画像を分析。"""
        messages = [
            {"role": "user", "content": f"<|image_1|>\n{prompt}"}
        ]

        prompt_text = self.processor.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.processor(
            prompt_text,
            [image],
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=500,
                eos_token_id=self.processor.tokenizer.eos_token_id
            )

        generated = output[0][inputs['input_ids'].shape[1]:]
        return self.processor.decode(generated, skip_special_tokens=True)

Qwen-VL(Alibaba)

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

class QwenVLModel:
    """Qwen2-VLマルチモーダルモデル。"""

    def __init__(self, model_name: str = "Qwen/Qwen2-VL-7B-Instruct"):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(
            model_name,
            min_pixels=256*28*28,
            max_pixels=1280*28*28
        )

    def analyze_image(self, image_path: str, question: str) -> str:
        """画像を分析。"""
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": question}
                ]
            }
        ]

        text = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        image_inputs, video_inputs = process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output_ids = self.model.generate(**inputs, max_new_tokens=512)

        # 入力プロンプト部分を除き、新しく生成されたトークンだけを残す
        generated_ids = [
            out_ids[len(in_ids):]
            for in_ids, out_ids in zip(inputs.input_ids, output_ids)
        ]

        return self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
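上のコードで生成結果から入力プロンプト部分を取り除いているスライス処理は、トークンID列を純粋なPythonのリストに置き換えると次のように単純化できます(IDはダミー値です)。

```python
# 入力トークン列と、モデルが返す「入力+生成」の連結列(ダミーID)
input_ids_batch = [[101, 2009, 2003], [101, 7592]]
output_ids_batch = [[101, 2009, 2003, 5000, 5001], [101, 7592, 6000]]

# 各サンプルについて入力長より後ろ=新しく生成された部分だけを残す
generated_only = [
    out[len(inp):]
    for inp, out in zip(input_ids_batch, output_ids_batch)
]
```

デコーダ型モデルの `generate` は入力を含めた系列を返すため、この切り出しを忘れるとプロンプトが応答に混入します。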

ローカル実行ガイド(Ollama)

# Ollamaでマルチモーダルモデルをローカル実行
# ollama.aiからOllamaをインストール

# LLaVAモデルをダウンロードして実行
ollama pull llava:13b

# モデルを実行
ollama run llava:13b

import ollama
from pathlib import Path

class OllamaVisionModel:
    """Ollamaを使用したローカルビジョンモデル。"""

    def __init__(self, model: str = "llava:13b"):
        self.model = model

    def analyze(self, image_path: str, prompt: str) -> str:
        """ローカルモデルで画像を分析。"""
        response = ollama.chat(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                    "images": [image_path]
                }
            ]
        )
        return response["message"]["content"]

    def batch_analyze(
        self,
        image_paths: list[str],
        prompt: str
    ) -> list[str]:
        """複数の画像を順次分析。"""
        results = []
        for path in image_paths:
            result = self.analyze(path, prompt)
            results.append(result)
        return results

# 使用方法
model = OllamaVisionModel("llava:13b")
result = model.analyze(
    "/path/to/image.jpg",
    "What is in this image? Please describe it in detail."
)
print(result)

11. 動画理解AI

動画理解の課題

動画理解は時間的情報を含むマルチモーダルタスクで、静止画像の理解よりはるかに複雑です。

主要な課題:

  • 時間的依存性: フレーム間の時間的関係を理解する
  • 大量のデータ: 30fpsで1分の動画 = 約1,800フレーム
  • 行動認識: 動きのパターンを捉える
  • マルチスケール: 短い動作と長いイベントを同時に理解する
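30fpsで1分の動画が約1,800フレームになる計算と、そこから固定枚数を等間隔に取り出す一様サンプリング(後述のVideoMAEの前処理と同じ方式)は、次のスケッチで確認できます。

```python
import numpy as np

fps = 30
duration_sec = 60
total_frames = fps * duration_sec  # 30fps × 60秒 = 1800フレーム

# 1800フレームから16フレームを等間隔にサンプリング
num_samples = 16
indices = np.linspace(0, total_frames - 1, num_samples, dtype=int)
```

全フレームを処理する代わりに、この程度まで間引いても多くの動画理解タスクでは十分な時間的情報が残ります。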

VideoMAEによる動画特徴抽出

from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np

class VideoFeatureExtractor:
    """VideoMAEを使用した動画特徴抽出器。"""

    def __init__(self, model_name: str = "MCG-NJU/videomae-base"):
        self.processor = VideoMAEImageProcessor.from_pretrained(model_name)
        self.model = VideoMAEModel.from_pretrained(model_name)

    def extract_video_features(
        self,
        video_frames: list,
        num_frames: int = 16
    ) -> torch.Tensor:
        """動画フレームから特徴を抽出。"""
        total_frames = len(video_frames)
        indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        sampled_frames = [video_frames[i] for i in indices]

        inputs = self.processor(sampled_frames, return_tensors="pt")

        with torch.no_grad():
            outputs = self.model(**inputs)

        return outputs.last_hidden_state

# OpenCVで動画からフレームを抽出
import cv2
from PIL import Image

def extract_frames_from_video(
    video_path: str,
    target_fps: int = 1
) -> list:
    """動画からフレームを抽出。"""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # 元動画のfpsがtarget_fps未満でも0除算にならないよう最低1にする
    frame_interval = max(1, int(fps / target_fps))

    frames = []
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))

        frame_count += 1

    cap.release()
    return frames

Geminiによる長動画理解

import google.generativeai as genai
import time

class LongVideoUnderstanding:
    """Gemini 1.5 Proを使用した長動画理解システム。"""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-1.5-pro")

    def analyze_long_video(
        self,
        video_path: str,
        analysis_tasks: list[str]
    ) -> dict:
        """最大1時間の動画を分析。"""
        print("Uploading video...")
        video_file = genai.upload_file(
            path=video_path,
            display_name="long_video_analysis"
        )

        while video_file.state.name == "PROCESSING":
            print(f"Processing... (state: {video_file.state.name})")
            time.sleep(15)
            video_file = genai.get_file(video_file.name)

        if video_file.state.name != "ACTIVE":
            raise RuntimeError(f"Video processing failed: {video_file.state.name}")

        print(f"Upload complete (URI: {video_file.uri})")

        results = {}
        for task in analysis_tasks:
            print(f"Analyzing: {task}")
            response = self.model.generate_content(
                [video_file, task],
                request_options={"timeout": 900}
            )
            results[task] = response.text

        genai.delete_file(video_file.name)
        print("File cleanup complete")

        return results

    def create_video_summary(self, video_path: str) -> dict:
        """動画の包括的なサマリーを生成。"""
        tasks = [
            "Summarize the overall content of this video in 3-5 sentences.",
            "List the major scenes with timestamps. Format: MM:SS - description",
            "List the main people, objects, and locations that appear in the video.",
            "What are the key messages or conclusions emphasized in this video?",
            "Who is the target audience and what is the purpose of this video?"
        ]

        return self.analyze_long_video(video_path, tasks)

# 使用方法
video_analyzer = LongVideoUnderstanding()
summary = video_analyzer.create_video_summary("lecture.mp4")

for task, result in summary.items():
    print(f"\n{'='*50}")
    print(f"Question: {task}")
    print(f"Answer: {result}")
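上記のアップロード完了待ちのようなポーリング処理は、タイムアウト付きの汎用関数に切り出しておくと再利用しやすくなります。以下はGemini APIに依存しない最小スケッチです(関数名 `poll_until` はこの記事独自の仮定です)。

```python
import time

def poll_until(get_state, done_states, interval_sec=1.0, timeout_sec=30.0):
    """get_state()がdone_statesのいずれかを返すまで待機する。
    期限内に完了しなければTimeoutErrorを送出。"""
    deadline = time.monotonic() + timeout_sec
    while True:
        state = get_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"state is still {state!r}")
        time.sleep(interval_sec)

# 使用例: 3回目の呼び出しで"ACTIVE"になるダミーの状態関数
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
result = poll_until(lambda: next(states), {"ACTIVE", "FAILED"}, interval_sec=0.01)
```

実際のGemini利用時は `get_state` に `lambda: genai.get_file(video_file.name).state.name` のような関数を渡し、戻り値が `"FAILED"` なら例外にする、といった形で組み込めます。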

まとめ

マルチモーダルAIは急速に進化しており、テキスト、画像、動画を統合的に理解する能力がますます強力になっています。

このガイドの重要なポイント:

  • CLIP: 対照学習で画像とテキストを共通空間にマッピングし、ゼロショット分類の基盤を形成
  • BLIP/BLIP-2: ブートストラッピングとQ-Formerによる効率的なマルチモーダル学習
  • LLaVA: オープンソースビジョン言語アシスタントの標準
  • GPT-4V / Claude Vision: 最高性能の商用マルチモーダルLLM
  • Gemini 1.5: 100万トークンコンテキストで長動画や文書を処理
  • マルチモーダルRAG: CLIP埋め込みを使って画像の検索可能な知識ベースを構築
  • オープンソースエコシステム: Phi-3 Vision、Qwen-VLなどローカル実行可能な強力なモデル

この分野は長動画理解、3D空間理解、リアルタイムマルチモーダル処理に向かって進んでいます。非常に速く進化する分野であり、継続的な学習が不可欠です。
