Multimodal AI Complete Guide: Master CLIP, LLaVA, GPT-4V, and Gemini Vision


Table of Contents

  1. Multimodal AI Overview
  2. CLIP In-Depth
  3. The BLIP Family
  4. LLaVA: Large Language and Vision Assistant
  5. InstructBLIP
  6. GPT-4 Vision
  7. Gemini Vision
  8. Claude Vision
  9. Multimodal RAG
  10. Open-Source Multimodal Models
  11. Video Understanding AI

1. Multimodal AI Overview

Limitations of Single-Modality Systems

Traditional AI systems could only process one form of data (modality) at a time — text, images, or audio. This single-modality approach has fundamental limitations when solving complex real-world problems.

Limitations of text-only models:

  • Cannot describe or analyze images, since they have no visual input at all
  • Cannot understand charts, graphs, and screenshots
  • Cannot make decisions that require visual context

Limitations of image-only models:

  • Cannot holistically understand text and visual elements together within images
  • Cannot perform language-based question answering
  • Cannot search using natural language descriptions

The Potential of Multimodal AI

Multimodal AI systems can process and understand multiple forms of data simultaneously — mimicking the natural human cognitive ability to "see, hear, and read while understanding all at once."

Key application areas:

  • Medical Diagnosis: Integrated analysis of medical imaging and patient record text
  • Autonomous Driving: Integration of cameras, LiDAR, and map data
  • Education: Automatic generation of explanations from textbook images
  • E-Commerce: Integrated processing of product photos, descriptions, and reviews
  • Document Understanding: Scanned document OCR and content analysis
  • Creative Applications: Image generation from text descriptions (DALL-E, Stable Diffusion)

History of Vision-Language Models

2021: CLIP (OpenAI) - linking images and text via contrastive learning
2022: BLIP - unified image captioning and VQA
2023: BLIP-2 - efficient multimodal learning via Q-Former
2023: LLaVA - open-source vision-language assistant
2023: GPT-4V - commercial multimodal LLM
2023: Gemini - Google's multimodal foundation model
2024: Claude 3 Vision - Anthropic's multimodal model
2024: LLaVA-1.6, InternVL2, Qwen2-VL - open-source improvements
2025: Expansion to video understanding and 3D understanding

2. CLIP In-Depth

The Core Idea Behind CLIP

CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, was trained on 400 million image-text pairs using contrastive learning, mapping images and text into a shared embedding space.

Key innovation: By training only on image-caption pairs collected from the internet — without any manual labels — CLIP acquired powerful zero-shot classification capabilities.

CLIP Architecture

Image → [Image Encoder (ViT/ResNet)] → Image Embedding (512-dim)
                                              ↕ Similarity
Text  → [Text Encoder (Transformer)] → Text Embedding (512-dim)

Contrastive Learning Mechanism:

For N image-text pairs in a batch:

  • Maximize the similarity of correct pairs (diagonal)
  • Minimize the similarity of incorrect pairs (off-diagonal)
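The two bullet points above amount to a symmetric cross-entropy over the batch's N×N similarity matrix. A minimal torch-only sketch with random embeddings (512-dim to mirror CLIP's projection space; real training uses the encoders' outputs, not random tensors):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(
    image_emb: torch.Tensor,
    text_emb: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """InfoNCE in both directions: matching pairs sit on the diagonal."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))         # diagonal = correct pair
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_i + loss_t) / 2

# Toy batch: 4 random image/text embedding pairs
torch.manual_seed(0)
loss = symmetric_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(f"loss = {loss.item():.4f}")
```

Perfectly aligned embeddings drive the loss toward zero; random ones hover near ln(N).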
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_zero_shot_classification(
    image: Image.Image,
    candidate_labels: list[str]
) -> dict[str, float]:
    """Zero-shot image classification using CLIP."""

    # Generate text prompts (CLIP's recommended format)
    text_prompts = [f"a photo of a {label}" for label in candidate_labels]

    # Preprocessing
    inputs = processor(
        text=text_prompts,
        images=image,
        return_tensors="pt",
        padding=True
    )

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits_per_image = outputs.logits_per_image
        probs = logits_per_image.softmax(dim=1)

    return {
        label: prob.item()
        for label, prob in zip(candidate_labels, probs[0])
    }

# Usage example
image_url = "https://example.com/sample_image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

labels = ["cat", "dog", "bird", "fish", "rabbit"]
results = clip_zero_shot_classification(image, labels)

# Sort by probability
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for label, prob in sorted_results:
    print(f"{label}: {prob:.4f} ({prob*100:.1f}%)")

Image-Text Retrieval

import numpy as np
from typing import Union

class CLIPSearchEngine:
    """CLIP-based image-text search engine."""

    def __init__(self, model_name: str = "openai/clip-vit-large-patch14"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.image_embeddings = []
        self.image_metadata = []

    def encode_images(self, images: list[Image.Image]) -> torch.Tensor:
        """Convert a batch of images to embeddings."""
        inputs = self.processor(
            images=images,
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            image_features = self.model.get_image_features(**inputs)
            image_features = F.normalize(image_features, p=2, dim=-1)
        return image_features

    def encode_texts(self, texts: list[str]) -> torch.Tensor:
        """Convert a batch of texts to embeddings."""
        inputs = self.processor(
            text=texts,
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
            text_features = F.normalize(text_features, p=2, dim=-1)
        return text_features

    def index_images(
        self,
        images: list[Image.Image],
        metadata: list[dict] = None
    ):
        """Index images for retrieval."""
        embeddings = self.encode_images(images)
        self.image_embeddings.append(embeddings)
        if metadata:
            self.image_metadata.extend(metadata)

    def text_to_image_search(
        self,
        query: str,
        top_k: int = 5
    ) -> list[dict]:
        """Search for images using a text query."""
        if not self.image_embeddings:
            return []

        all_embeddings = torch.cat(self.image_embeddings, dim=0)
        query_embedding = self.encode_texts([query])

        # Cosine similarity (already L2-normalized)
        similarities = (all_embeddings @ query_embedding.T).squeeze(-1)
        top_indices = similarities.argsort(descending=True)[:top_k]

        results = []
        for idx in top_indices:
            idx = idx.item()
            result = {
                "index": idx,
                "similarity": similarities[idx].item()
            }
            if self.image_metadata:
                result.update(self.image_metadata[idx])
            results.append(result)

        return results
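The search method relies on the identity that a dot product of L2-normalized vectors equals their cosine similarity, which is why no explicit cosine computation appears. A quick torch-only check of that identity:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(5, 512)
b = torch.randn(5, 512)

# Dot product of L2-normalized vectors ...
a_n = F.normalize(a, p=2, dim=-1)
b_n = F.normalize(b, p=2, dim=-1)
dot = (a_n * b_n).sum(dim=-1)

# ... matches torch's cosine similarity on the raw vectors
cos = F.cosine_similarity(a, b, dim=-1)
print(torch.allclose(dot, cos, atol=1e-5))  # True
```

Normalizing once at indexing time therefore turns every later similarity query into a cheap matrix multiplication.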

OpenCLIP (Open-Source CLIP)

# OpenCLIP: supports various architectures and training datasets
# pip install open_clip_torch

import open_clip
import torch
from PIL import Image

# List available models
available_models = open_clip.list_pretrained()
print("Available models:", available_models[:5])

# Load a large model trained on LAION-2B
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-H-14',
    pretrained='laion2b_s32b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-H-14')

def compute_clip_similarity(
    image: Image.Image,
    texts: list[str]
) -> list[float]:
    """Compute CLIP similarity between an image and a list of texts."""
    model.eval()

    image_input = preprocess_val(image).unsqueeze(0)
    text_input = tokenizer(texts)

    # autocast is unnecessary here since nothing is moved to the GPU
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)

        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return similarity[0].tolist()

Fine-Tuning CLIP

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import CLIPModel, CLIPProcessor
import torch.optim as optim

class ImageTextDataset(Dataset):
    """Image-text pair dataset."""

    def __init__(
        self,
        image_paths: list[str],
        texts: list[str],
        processor: CLIPProcessor
    ):
        self.image_paths = image_paths
        self.texts = texts
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        text = self.texts[idx]

        inputs = self.processor(
            images=image,
            text=text,
            return_tensors="pt",
            padding="max_length",
            max_length=77,
            truncation=True
        )

        return {
            "pixel_values": inputs["pixel_values"].squeeze(0),
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

class CLIPFineTuner:
    """CLIP model fine-tuning class."""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def contrastive_loss(
        self,
        image_features: torch.Tensor,
        text_features: torch.Tensor,
        temperature: float = 0.07
    ) -> torch.Tensor:
        """Contrastive learning loss function."""
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        logits = torch.matmul(image_features, text_features.T) / temperature
        labels = torch.arange(len(logits)).to(self.device)

        loss_i = F.cross_entropy(logits, labels)
        loss_t = F.cross_entropy(logits.T, labels)

        return (loss_i + loss_t) / 2

    def train(
        self,
        train_dataset: ImageTextDataset,
        num_epochs: int = 10,
        batch_size: int = 32,
        learning_rate: float = 1e-5
    ):
        """Fine-tune CLIP."""
        dataloader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=4
        )

        optimizer = optim.AdamW(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=0.01
        )

        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=num_epochs
        )

        for epoch in range(num_epochs):
            total_loss = 0
            self.model.train()

            for batch in dataloader:
                pixel_values = batch["pixel_values"].to(self.device)
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)

                outputs = self.model(
                    pixel_values=pixel_values,
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

                loss = self.contrastive_loss(image_features, text_features)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            scheduler.step()
            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

3. The BLIP Family

BLIP: Bootstrapping Language-Image Pre-training

BLIP, published by Salesforce Research in 2022, excels at diverse vision-language tasks including image captioning, image-text retrieval, and visual question answering (VQA).

Key innovation: Data bootstrapping using a Captioner and a Filter to clean noisy image-text pairs collected from the web.
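The bootstrapping loop itself is simple in structure. A schematic sketch, with the Captioner and Filter stubbed out as hypothetical callables (in the real pipeline both are BLIP models fine-tuned on clean human-annotated data):

```python
from typing import Callable

def bootstrap_dataset(
    web_pairs: list[tuple[str, str]],       # (image_id, noisy web caption)
    captioner: Callable[[str], str],        # hypothetical: image -> synthetic caption
    filter_fn: Callable[[str, str], bool],  # hypothetical: (image, caption) -> keep?
) -> list[tuple[str, str]]:
    """Keep filtered web captions plus filtered synthetic captions."""
    cleaned = []
    for image_id, web_caption in web_pairs:
        if filter_fn(image_id, web_caption):    # Filter: drop noisy web text
            cleaned.append((image_id, web_caption))
        synthetic = captioner(image_id)         # Captioner: generate a fresh caption
        if filter_fn(image_id, synthetic):
            cleaned.append((image_id, synthetic))
    return cleaned

# Toy demo with trivial stubs
pairs = [("img1", "buy now!!!"), ("img2", "a dog on a beach")]
captioner = lambda img: f"a photo related to {img}"
filter_fn = lambda img, cap: "!" not in cap     # stub: reject spammy captions
print(bootstrap_dataset(pairs, captioner, filter_fn))
```

The cleaned set (original captions that pass the filter, plus synthetic captions that pass it) is then used to re-train the model, which is where the "bootstrapping" name comes from.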

from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import BlipForQuestionAnswering
from PIL import Image
import torch

# BLIP Image Captioning
class BLIPCaptioner:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-image-captioning-large"
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # float16 inference is only safe on GPU; fall back to float32 on CPU
        self.dtype = torch.float16 if self.device == "cuda" else torch.float32
        self.model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-large",
            torch_dtype=self.dtype
        )
        self.model.to(self.device)

    def caption(
        self,
        image: Image.Image,
        conditional_text: str = None,
        max_new_tokens: int = 50
    ) -> str:
        """Generate a caption for an image."""
        if conditional_text:
            inputs = self.processor(
                image,
                conditional_text,
                return_tensors="pt"
            ).to(self.device, self.dtype)
        else:
            inputs = self.processor(
                image,
                return_tensors="pt"
            ).to(self.device, self.dtype)

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True
            )

        return self.processor.decode(output[0], skip_special_tokens=True)

# BLIP VQA (Visual Question Answering)
class BLIPVisualQA:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-vqa-base"
        )
        self.model = BlipForQuestionAnswering.from_pretrained(
            "Salesforce/blip-vqa-base"
        )

    def answer(self, image: Image.Image, question: str) -> str:
        """Answer a question about an image."""
        inputs = self.processor(image, question, return_tensors="pt")

        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=50)

        return self.processor.decode(output[0], skip_special_tokens=True)

# Usage
captioner = BLIPCaptioner()
vqa = BLIPVisualQA()

image = Image.open("sample.jpg")

caption = captioner.caption(image)
print(f"Caption: {caption}")

cond_caption = captioner.caption(image, "a photo of")
print(f"Conditional caption: {cond_caption}")

answer = vqa.answer(image, "What color is the sky?")
print(f"Answer: {answer}")

BLIP-2: Querying Transformer

BLIP-2, published in 2023, introduces the Q-Former (Querying Transformer) to efficiently bridge a frozen image encoder with a frozen LLM.

Q-Former's role:

  • Extracts the most important visual features from the image encoder's output
  • 32 learnable query tokens interact with image features to create a compressed representation
  • This compressed representation is passed as input to the LLM
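The compression step can be sketched with tensor shapes. A minimal torch-only illustration (sizes loosely follow BLIP-2's defaults: 32 queries, a 768-dim Q-Former, 1408-dim ViT-g features; the real Q-Former is a full BERT-style stack with self- and cross-attention, not the single attention layer shown here):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 257 ViT patch tokens (dim 1408) compressed by 32 query tokens
num_queries, qformer_dim, vit_dim = 32, 768, 1408

image_features = torch.randn(1, 257, vit_dim)             # frozen ViT output
queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim))

# One cross-attention step: learnable queries attend to projected image features
proj = nn.Linear(vit_dim, qformer_dim)                    # align feature dimensions
cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True)
kv = proj(image_features)
compressed, _ = cross_attn(queries, kv, kv)

print(compressed.shape)  # torch.Size([1, 32, 768]) -> fixed-size input for the LLM
```

Whatever the image resolution, the LLM always receives exactly 32 visual tokens, which is what keeps BLIP-2 training efficient with both encoder and LLM frozen.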
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

class BLIP2Assistant:
    """BLIP-2 based visual question answering assistant."""

    def __init__(
        self,
        model_name: str = "Salesforce/blip2-opt-2.7b"
    ):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate_response(
        self,
        image: Image.Image,
        prompt: str = None,
        max_new_tokens: int = 200,
        temperature: float = 1.0
    ) -> str:
        """Generate a response for an image and optional prompt."""
        if prompt:
            inputs = self.processor(
                images=image,
                text=prompt,
                return_tensors="pt"
            ).to("cuda", torch.float16)
        else:
            inputs = self.processor(
                images=image,
                return_tensors="pt"
            ).to("cuda", torch.float16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )

        generated_text = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0].strip()

        return generated_text

    def batch_caption(
        self,
        images: list[Image.Image],
        batch_size: int = 8
    ) -> list[str]:
        """Generate captions for a batch of images."""
        all_captions = []

        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            inputs = self.processor(
                images=batch,
                return_tensors="pt",
                padding=True
            ).to("cuda", torch.float16)

            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=50
                )

            captions = self.processor.batch_decode(
                generated_ids,
                skip_special_tokens=True
            )
            all_captions.extend([c.strip() for c in captions])

        return all_captions

# Usage
assistant = BLIP2Assistant("Salesforce/blip2-flan-t5-xxl")
image = Image.open("document.png")

response = assistant.generate_response(
    image,
    "Question: What is the main topic of this document? Answer:"
)
print(response)

4. LLaVA: Large Language and Vision Assistant

LLaVA Architecture

LLaVA, published in 2023, is an open-source vision-language model that connects a powerful LLM (LLaMA, Vicuna) with a CLIP vision encoder to build a multimodal chatbot with instruction-following capabilities.

Architecture components:

Image → [CLIP ViT-L/14] → Image features (1024-dim)
      → [Linear Projection Layer] → Visual Tokens
Visual Tokens + Text Tokens → [LLM (LLaMA/Vicuna)] → Final Response

LLaVA-1.5 improvements:

  • MLP projection layer (linear → 2-layer MLP)
  • High-resolution image support
  • More training data

LLaVA-1.6 (LLaVA-NeXT) improvements:

  • Dynamic High Resolution: up to 672x672 → 4x more visual tokens
  • Improved reasoning and OCR capabilities
  • Support for various aspect ratios
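The token arithmetic behind "Dynamic High Resolution" is straightforward. A back-of-the-envelope sketch, assuming the CLIP ViT-L/14 encoder at a 336×336 tile size (the actual model additionally prepends a downsampled global view of the whole image, omitted here for simplicity):

```python
def visual_token_count(image_side: int, tile_side: int = 336, patch: int = 14) -> int:
    """Visual tokens for a square image tiled into tile_side x tile_side crops."""
    tokens_per_tile = (tile_side // patch) ** 2   # 24 x 24 = 576 for ViT-L/14 @ 336
    tiles = (image_side // tile_side) ** 2        # e.g. 672 // 336 = 2 tiles per side
    return tiles * tokens_per_tile

base = visual_token_count(336)   # single-tile baseline
high = visual_token_count(672)   # 2 x 2 tile grid
print(base, high, high // base)  # 576 2304 4
```

This is where the "4x more visual tokens" figure in the list above comes from: doubling the resolution on each side quadruples the tile count.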

Using LLaVA with HuggingFace

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

class LLaVAAssistant:
    """LLaVA-1.6 based visual assistant."""

    def __init__(
        self,
        model_name: str = "llava-hf/llava-v1.6-mistral-7b-hf"
    ):
        self.processor = LlavaNextProcessor.from_pretrained(model_name)
        self.model = LlavaNextForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto"
        )

    def chat(
        self,
        image: Image.Image,
        message: str,
        max_new_tokens: int = 500,
        temperature: float = 0.7
    ) -> str:
        """Chat with the model about an image."""
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": message}
                ]
            }
        ]

        prompt = self.processor.apply_chat_template(
            conversation,
            add_generation_prompt=True
        )

        inputs = self.processor(
            images=image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=temperature > 0,
                pad_token_id=self.processor.tokenizer.eos_token_id
            )

        generated = output[0][inputs["input_ids"].shape[1]:]
        return self.processor.decode(generated, skip_special_tokens=True)

    def analyze_chart(self, chart_image: Image.Image) -> dict:
        """Analyze a chart image."""
        analysis_prompts = [
            "What is the title of this chart?",
            "What do the x-axis and y-axis represent?",
            "What are the highest and lowest values?",
            "Describe the overall trend.",
            "What is the most important insight from this data?"
        ]

        results = {}
        for prompt in analysis_prompts:
            response = self.chat(chart_image, prompt)
            results[prompt] = response

        return results

    def extract_text_from_image(self, image: Image.Image) -> str:
        """Extract text from an image (OCR)."""
        return self.chat(
            image,
            "Please accurately extract all the text in this image. "
            "Return only the text without any additional explanations."
        )


# Production use case: Document analysis pipeline
class DocumentAnalysisPipeline:
    """Document analysis pipeline using LLaVA."""

    def __init__(self):
        self.llava = LLaVAAssistant()

    def analyze_document(self, document_image: Image.Image) -> dict:
        """Comprehensively analyze a document image."""

        doc_type = self.llava.chat(
            document_image,
            "What type of document is this? (invoice, contract, report, form, etc.)"
        )

        extracted_text = self.llava.extract_text_from_image(document_image)

        key_info = self.llava.chat(
            document_image,
            f"Extract the following information from this {doc_type} in JSON format: "
            "date, sender, recipient, amount (if applicable), main content summary"
        )

        action_items = self.llava.chat(
            document_image,
            "If there are any required actions in this document, please list them."
        )

        return {
            "document_type": doc_type,
            "extracted_text": extracted_text,
            "key_information": key_info,
            "action_items": action_items
        }

5. InstructBLIP

The Core of InstructBLIP

InstructBLIP builds on BLIP-2 by applying instruction tuning to follow diverse instructions. The key is that the Q-Former is made instruction-aware, extracting visual features relevant to the specific instruction given.

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image

class InstructBLIPAssistant:
    """Instruction-following assistant based on InstructBLIP."""

    def __init__(self, model_name: str = "Salesforce/instructblip-vicuna-7b"):
        self.processor = InstructBlipProcessor.from_pretrained(model_name)
        self.model = InstructBlipForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def instruct(
        self,
        image: Image.Image,
        instruction: str,
        max_new_tokens: int = 300
    ) -> str:
        """Follow a specific instruction about an image."""
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to("cuda", torch.float16)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                do_sample=False,
                num_beams=5,
                max_new_tokens=max_new_tokens,
                min_length=1,
                top_p=0.9,
                repetition_penalty=1.5,
                length_penalty=1.0,
                temperature=1.0
            )

        generated_text = self.processor.batch_decode(
            outputs,
            skip_special_tokens=True
        )[0].strip()

        return generated_text

# Diverse usage examples
assistant = InstructBLIPAssistant()
image = Image.open("complex_diagram.png")

description = assistant.instruct(
    image,
    "Please explain this diagram in detail, including the role of each component and its connections."
)

objects = assistant.instruct(
    image,
    "List all the objects found in this image and describe the location of each."
)

emotion = assistant.instruct(
    image,
    "Analyze the emotional state of the people in this image and explain your reasoning."
)

6. GPT-4 Vision

GPT-4V API Usage

GPT-4 Vision adds visual capabilities to OpenAI's GPT-4, making it one of the most powerful commercial multimodal LLMs available.

import openai
import base64
from pathlib import Path
import httpx

client = openai.OpenAI()

def encode_image_to_base64(image_path: str) -> str:
    """Encode an image file to Base64."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

class GPT4VisionAnalyzer:
    """Image analyzer based on GPT-4 Vision."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = openai.OpenAI()
        self.model = model

    def analyze_image(
        self,
        image_source: str,
        prompt: str,
        is_url: bool = True,
        detail: str = "high",
        max_tokens: int = 1000
    ) -> str:
        """Analyze a single image."""
        if is_url:
            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": image_source,
                    "detail": detail
                }
            }
        else:
            base64_image = encode_image_to_base64(image_source)
            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")

            image_content = {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{media_type};base64,{base64_image}",
                    "detail": detail
                }
            }

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_content,
                        {"type": "text", "text": prompt}
                    ]
                }
            ],
            max_tokens=max_tokens
        )

        return response.choices[0].message.content

    def analyze_multiple_images(
        self,
        image_sources: list[dict],
        prompt: str,
        max_tokens: int = 2000
    ) -> str:
        """Analyze multiple images simultaneously."""
        content = []

        for img_info in image_sources:
            source = img_info["source"]
            is_url = img_info.get("is_url", True)

            if is_url:
                content.append({
                    "type": "image_url",
                    "image_url": {"url": source, "detail": "high"}
                })
            else:
                base64_image = encode_image_to_base64(source)
                content.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                })

        content.append({"type": "text", "text": prompt})

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": content}],
            max_tokens=max_tokens
        )

        return response.choices[0].message.content

    def analyze_chart_or_graph(self, image_source: str, is_url: bool = True) -> dict:
        """Analyze a chart or graph and return structured data."""
        prompt = """Analyze this chart/graph and return the following JSON:
{
  "chart_type": "bar/line/pie/scatter/etc.",
  "title": "chart title",
  "x_axis": {"label": "X-axis label", "unit": "unit"},
  "y_axis": {"label": "Y-axis label", "unit": "unit"},
  "data_series": [{"name": "series name", "trend": "rising/falling/stable"}],
  "key_findings": ["finding1", "finding2"],
  "data_range": {"min": 0, "max": 0},
  "anomalies": ["anomaly description"]
}"""

        response = self.analyze_image(
            image_source,
            prompt,
            is_url=is_url,
            detail="high",
            max_tokens=1500
        )

        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            if start >= 0 and end > start:
                return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

        return {"raw_response": response}


# Production use case: e-commerce product analysis
def analyze_product_images(image_urls: list[str]) -> str:
    """Analyze multiple product images."""
    analyzer = GPT4VisionAnalyzer()
    image_sources = [{"source": url, "is_url": True} for url in image_urls]

    return analyzer.analyze_multiple_images(
        image_sources,
        prompt="""Analyze these product images and return the following JSON:
{
  "product_name": "estimated product name",
  "category": "product category",
  "color_options": ["color list"],
  "key_features": ["key features"],
  "condition": "new/used/etc.",
  "quality_score": 0-10,
  "marketing_description": "marketing copy (100 chars)",
  "seo_keywords": ["SEO keywords"]
}"""
    )

7. Gemini Vision

Gemini's Multimodal Capabilities

Google's Gemini was designed from the ground up with multimodality in mind. Gemini 1.5 Pro in particular, with its 1M token context window, can process long-duration video, lengthy documents, and large numbers of images.

import google.generativeai as genai
import PIL.Image
from pathlib import Path
import base64

genai.configure(api_key="YOUR_GEMINI_API_KEY")

class GeminiVisionAnalyzer:
    """Gemini Vision-based analyzer."""

    def __init__(self, model_name: str = "gemini-1.5-pro"):
        self.model = genai.GenerativeModel(model_name)
        self.flash_model = genai.GenerativeModel("gemini-1.5-flash")

    def analyze_image(self, image_path: str, prompt: str) -> str:
        """Analyze an image."""
        image = PIL.Image.open(image_path)
        response = self.model.generate_content([prompt, image])
        return response.text

    def analyze_with_url(self, image_url: str, prompt: str) -> str:
        """Analyze an image from a URL."""
        import httpx
        image_data = httpx.get(image_url).content

        image_part = {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image_data).decode('utf-8')
        }

        response = self.model.generate_content([
            {"text": prompt},
            image_part
        ])
        return response.text

    def analyze_video(
        self,
        video_path: str,
        questions: list[str]
    ) -> dict:
        """Analyze a video (Gemini 1.5 Pro's strength)."""
        print(f"Uploading video: {video_path}")
        video_file = genai.upload_file(
            path=video_path,
            display_name="analysis_video"
        )

        import time
        while video_file.state.name == "PROCESSING":
            print("Processing...")
            time.sleep(10)
            video_file = genai.get_file(video_file.name)

        if video_file.state.name == "FAILED":
            raise ValueError("Video processing failed")

        print(f"Upload complete: {video_file.uri}")

        results = {}
        for question in questions:
            response = self.model.generate_content(
                [video_file, question],
                request_options={"timeout": 600}
            )
            results[question] = response.text

        genai.delete_file(video_file.name)
        return results

    def analyze_multiple_images_interleaved(
        self,
        image_text_pairs: list[dict]
    ) -> str:
        """Handle compound queries with interleaved images and text."""
        content = []
        for pair in image_text_pairs:
            if "text" in pair:
                content.append(pair["text"])
            if "image" in pair:
                content.append(pair["image"])

        response = self.model.generate_content(content)
        return response.text

    def process_document_batch(
        self,
        document_images: list[PIL.Image.Image],
        extraction_schema: str
    ) -> list[dict]:
        """Process multiple documents in a single batch (leveraging Gemini's long context)."""
        import json

        content = [f"Please analyze the following {len(document_images)} documents:\n"]

        for i, img in enumerate(document_images, 1):
            content.append(f"\n--- Document {i} ---")
            content.append(img)

        content.append(f"\nExtract data from each document using this JSON schema:\n{extraction_schema}")

        response = self.model.generate_content(content)

        try:
            text = response.text
            start = text.find('[')
            end = text.rfind(']') + 1
            if start >= 0 and end > start:
                return json.loads(text[start:end])
        except json.JSONDecodeError:
            pass
        # Fall back to the raw text if no parseable JSON array was found
        return [{"raw_response": response.text}]


# Usage: video analysis
analyzer = GeminiVisionAnalyzer()

video_questions = [
    "Please summarize the overall content of this video.",
    "List the major scenes with their timestamps.",
    "What are the main keywords or concepts mentioned in this video?",
    "What is the topic and purpose of this video?"
]

results = analyzer.analyze_video("lecture_video.mp4", video_questions)
for question, answer in results.items():
    print(f"\nQuestion: {question}")
    print(f"Answer: {answer}")
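The `analyze_multiple_images_interleaved` method above is defined but never demonstrated. A sketch of how a compound query might be assembled — the product names and solid-color placeholder images are illustrative only, standing in for real photos:

```python
import PIL.Image

# Hypothetical product-comparison query; placeholders stand in for real photos.
pairs = [
    {"text": "Compare the two product photos below."},
    {"text": "Product A:"},
    {"image": PIL.Image.new("RGB", (64, 64), "red")},
    {"text": "Product B:"},
    {"image": PIL.Image.new("RGB", (64, 64), "blue")},
    {"text": "Which product looks more premium, and why?"},
]

# The method flattens the pairs into one alternating content list:
content = []
for pair in pairs:
    if "text" in pair:
        content.append(pair["text"])
    if "image" in pair:
        content.append(pair["image"])

# analyzer = GeminiVisionAnalyzer()
# print(analyzer.analyze_multiple_images_interleaved(pairs))  # requires an API key
```

Interleaving text labels with each image, rather than appending all images first, gives the model an unambiguous mapping between each photo and its role in the question.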

8. Claude Vision

Claude Vision API

Anthropic's Claude 3.5 Sonnet offers powerful vision capabilities, excelling especially at document understanding, code screenshot analysis, and detailed image interpretation.

import anthropic
import base64
import httpx
from pathlib import Path

client = anthropic.Anthropic()

class ClaudeVisionAnalyzer:
    """Image analyzer based on Claude Vision."""

    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic()
        self.model = model

    def _prepare_image_content(
        self,
        image_source: str,
        is_url: bool = True
    ) -> dict:
        """Prepare image content in Claude API format."""
        if is_url:
            return {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": image_source
                }
            }
        else:
            with open(image_source, "rb") as f:
                image_data = base64.standard_b64encode(f.read()).decode("utf-8")

            ext = Path(image_source).suffix.lower()
            media_type_map = {
                ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg",
                ".png": "image/png",
                ".gif": "image/gif",
                ".webp": "image/webp"
            }
            media_type = media_type_map.get(ext, "image/jpeg")

            return {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }
            }

    def analyze(
        self,
        image_source: str,
        prompt: str,
        is_url: bool = True,
        system_prompt: str = None,
        max_tokens: int = 1000
    ) -> str:
        """Analyze an image."""
        image_content = self._prepare_image_content(image_source, is_url)

        messages = [
            {
                "role": "user",
                "content": [
                    image_content,
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        kwargs = {
            "model": self.model,
            "max_tokens": max_tokens,
            "messages": messages
        }

        if system_prompt:
            kwargs["system"] = system_prompt

        response = self.client.messages.create(**kwargs)
        return response.content[0].text

    def analyze_code_screenshot(
        self,
        screenshot_path: str
    ) -> dict:
        """Analyze a code screenshot and extract the code."""
        system_prompt = """You are an expert code analyst.
Accurately extract and analyze code from screenshots."""

        extraction_prompt = """From this code screenshot:
1. Extract the code accurately (including indentation)
2. Identify the programming language
3. Explain the main functionality
4. Suggest potential bugs or improvements

Respond in the following JSON format:
{
  "language": "programming language",
  "code": "extracted code",
  "description": "code description",
  "potential_issues": ["issue1", "issue2"],
  "improvements": ["improvement1", "improvement2"]
}"""

        response = self.analyze(
            screenshot_path,
            extraction_prompt,
            is_url=False,
            system_prompt=system_prompt,
            max_tokens=2000
        )

        import json
        try:
            start = response.find('{')
            end = response.rfind('}') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            return {"raw_response": response}

    def compare_images(
        self,
        image_sources: list[tuple[str, bool]],
        comparison_prompt: str
    ) -> str:
        """Comparatively analyze multiple images."""
        content = []

        for source, is_url in image_sources:
            content.append(self._prepare_image_content(source, is_url))

        content.append({"type": "text", "text": comparison_prompt})

        response = self.client.messages.create(
            model=self.model,
            max_tokens=2000,
            messages=[{"role": "user", "content": content}]
        )

        return response.content[0].text

    def analyze_ui_design(self, ui_screenshot_path: str) -> str:
        """Analyze a UI design screenshot."""
        prompt = """Please analyze this UI screenshot from a UX/UI expert perspective:

Analysis areas:
1. Layout structure
2. Color palette
3. Typography
4. Usability assessment
5. Accessibility issues
6. Improvement suggestions

Return in JSON format:
{
  "layout": "layout description",
  "color_palette": ["main colors"],
  "typography": "typography assessment",
  "usability_score": 0-10,
  "usability_issues": ["issues"],
  "accessibility_issues": ["accessibility problems"],
  "improvements": ["improvement suggestions"]
}"""

        return self.analyze(
            ui_screenshot_path,
            prompt,
            is_url=False,
            max_tokens=1500
        )

# Usage
analyzer = ClaudeVisionAnalyzer()

result = analyzer.analyze(
    "https://example.com/product.jpg",
    "Describe the features of this product in detail and suggest a potential target audience.",
    is_url=True
)
print(result)
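Several of the methods above (such as `analyze_code_screenshot`) extract JSON by scanning for the outermost braces, since vision models often wrap structured output in prose. That pattern can be factored into a small standalone helper — a minimal sketch:

```python
import json

def extract_json_object(text: str) -> dict:
    """Parse the first top-level {...} object found in a model response."""
    start = text.find('{')
    end = text.rfind('}') + 1
    if start < 0 or end <= start:
        return {"raw_response": text}
    try:
        return json.loads(text[start:end])
    except json.JSONDecodeError:
        # Keep the raw text so the caller can inspect or retry
        return {"raw_response": text}

reply = 'Sure! Here is the analysis:\n{"language": "python", "code": "print(1)"}'
print(extract_json_object(reply)["language"])  # → python
```

Returning the raw text on failure (instead of raising) keeps batch pipelines running when one response is malformed; a stricter alternative is to re-prompt the model for valid JSON.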

9. Multimodal RAG

Multimodal RAG Overview

Multimodal RAG indexes and retrieves not just text, but also images, tables, charts, and other content types.
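Before the full implementation, the core retrieval idea is worth seeing in isolation: CLIP places text and images in one shared vector space, so text-to-image search reduces to a cosine-similarity ranking. The 4-dimensional vectors below are toy stand-ins for real 512-dimensional CLIP embeddings:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Mock unit embeddings for 5 indexed images (toy stand-ins for CLIP outputs)
image_embeddings = normalize(np.array([
    [1.0, 0.0, 0.0, 0.0],   # "a red car"
    [0.0, 1.0, 0.0, 0.0],   # "a cat on a sofa"
    [0.7, 0.7, 0.0, 0.1],   # "a cat next to a red car"
    [0.0, 0.0, 1.0, 0.0],   # "a mountain lake"
    [0.0, 0.0, 0.0, 1.0],   # "a plate of pasta"
]))

# Mock text-query embedding for "a cat and a car"
query_embedding = normalize(np.array([0.6, 0.75, 0.0, 0.0]))

# Cosine similarity of unit vectors is just a dot product
scores = image_embeddings @ query_embedding
ranking = np.argsort(-scores)
print(ranking[0])  # → 2  (the "cat next to a red car" image ranks first)
```

A vector database like ChromaDB performs exactly this ranking, only with approximate nearest-neighbor indexes (HNSW) instead of a brute-force dot product.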

Image Indexing Strategy

import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import chromadb
import base64
import io

class MultimodalRAGSystem:
    """Multimodal RAG system."""

    def __init__(self):
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        self.chroma_client = chromadb.Client()
        self.image_collection = self.chroma_client.get_or_create_collection(
            name="images",
            metadata={"hnsw:space": "cosine"}
        )

    def get_image_embedding(self, image: Image.Image) -> np.ndarray:
        """Convert an image to a CLIP embedding."""
        inputs = self.clip_processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = self.clip_model.get_image_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def get_text_embedding(self, text: str) -> np.ndarray:
        """Convert text to a CLIP embedding."""
        inputs = self.clip_processor(
            text=[text],
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        with torch.no_grad():
            features = self.clip_model.get_text_features(**inputs)
            features = torch.nn.functional.normalize(features, p=2, dim=-1)
        return features.numpy()[0]

    def index_image(
        self,
        image: Image.Image,
        image_id: str,
        metadata: dict = None
    ):
        """Index an image."""
        embedding = self.get_image_embedding(image)

        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        image_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

        doc_metadata = {"image_b64": image_b64}
        if metadata:
            doc_metadata.update(metadata)

        self.image_collection.add(
            embeddings=[embedding.tolist()],
            ids=[image_id],
            metadatas=[doc_metadata]
        )

    def search_images_by_text(
        self,
        query: str,
        n_results: int = 5
    ) -> list[dict]:
        """Search for images using text."""
        query_embedding = self.get_text_embedding(query)

        results = self.image_collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["metadatas", "distances"]  # ids are always returned
        )

        retrieved = []
        for i in range(len(results['ids'][0])):
            metadata = results['metadatas'][0][i]
            image_b64 = metadata.pop('image_b64', None)

            image = None
            if image_b64:
                image_bytes = base64.b64decode(image_b64)
                image = Image.open(io.BytesIO(image_bytes))

            retrieved.append({
                "id": results['ids'][0][i],
                "distance": results['distances'][0][i],
                "metadata": metadata,
                "image": image
            })

        return retrieved

    def multimodal_rag_query(
        self,
        question: str,
        vision_model_fn,
        n_image_results: int = 3
    ) -> str:
        """Perform a multimodal RAG query."""
        relevant_images = self.search_images_by_text(question, n_image_results)

        if not relevant_images:
            return vision_model_fn(question=question, images=[])

        retrieved_images = [r["image"] for r in relevant_images if r["image"]]
        metadata_info = [
            f"Image {i+1}: {r['metadata']}"
            for i, r in enumerate(relevant_images)
        ]

        enhanced_prompt = f"""
Question: {question}

Retrieved image information:
{chr(10).join(metadata_info)}

Please answer the question based on these images.
Specifically cite relevant content from each image.
"""

        return vision_model_fn(question=enhanced_prompt, images=retrieved_images)

ColPali: PDF Page Retrieval

# ColPali: direct PDF page retrieval using vision-language models
# pip install colpali-engine

from colpali_engine.models import ColPali, ColPaliProcessor
import torch

class ColPaliPDFSearch:
    """PDF page search using ColPali."""

    def __init__(self, model_name: str = "vidore/colpali-v1.2"):
        self.model = ColPali.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda"
        )
        self.processor = ColPaliProcessor.from_pretrained(model_name)

    def index_pdf_pages(
        self,
        page_images: list[Image.Image]
    ) -> torch.Tensor:
        """Index PDF page images."""
        all_embeddings = []
        batch_size = 4

        for i in range(0, len(page_images), batch_size):
            batch = page_images[i:i + batch_size]
            inputs = self.processor.process_images(batch)
            inputs = {k: v.to("cuda") for k, v in inputs.items()}

            with torch.no_grad():
                embeddings = self.model(**inputs)

            all_embeddings.append(embeddings)

        return torch.cat(all_embeddings, dim=0)

    def search(
        self,
        query: str,
        page_embeddings: torch.Tensor,
        top_k: int = 3
    ) -> list[int]:
        """Search for relevant PDF pages using a query."""
        query_inputs = self.processor.process_queries([query])
        query_inputs = {k: v.to("cuda") for k, v in query_inputs.items()}

        with torch.no_grad():
            query_embedding = self.model(**query_inputs)

        # MaxSim score computation (ColPali's key mechanism)
        scores = self.processor.score_multi_vector(
            query_embedding,
            page_embeddings
        )

        top_indices = scores[0].argsort(descending=True)[:top_k]
        return top_indices.tolist()
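The `score_multi_vector` call above implements ColPali's MaxSim (late-interaction) scoring: each query token embedding is matched against its single best patch embedding on the page, and the per-token maxima are summed. A minimal numpy sketch of the mechanism, using tiny 2-dimensional toy embeddings:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """MaxSim: sum over query tokens of the best-matching patch similarity.

    query_tokens: (n_query_tokens, dim); page_patches: (n_patches, dim).
    """
    sim = query_tokens @ page_patches.T        # (n_query_tokens, n_patches)
    return float(sim.max(axis=1).sum())        # best patch per token, summed

# Two query tokens; two candidate pages with three patch embeddings each
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[0.9, 0.1], [0.0, 0.8], [0.2, 0.2]])   # covers both tokens
page_b = np.array([[0.1, 0.1], [0.2, 0.0], [0.0, 0.3]])   # weak match
scores = [maxsim_score(query, p) for p in (page_a, page_b)]
print(scores)  # page_a outscores page_b
```

Because each query token is scored independently, MaxSim rewards pages that cover *all* parts of the query, which is why ColPali handles multi-concept questions better than a single pooled embedding would.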

10. Open-Source Multimodal Models

Phi-3 Vision (Microsoft)

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

class Phi3VisionModel:
    """Microsoft Phi-3 Vision model."""

    def __init__(self):
        model_id = "microsoft/Phi-3-vision-128k-instruct"

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="cuda",
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            _attn_implementation='flash_attention_2'
        )

        self.processor = AutoProcessor.from_pretrained(
            model_id,
            trust_remote_code=True
        )

    def analyze(self, image: Image.Image, prompt: str) -> str:
        """Analyze an image."""
        messages = [
            {"role": "user", "content": f"<|image_1|>\n{prompt}"}
        ]

        prompt_text = self.processor.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.processor(
            prompt_text,
            [image],
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=500,
                eos_token_id=self.processor.tokenizer.eos_token_id
            )

        generated = output[0][inputs['input_ids'].shape[1]:]
        return self.processor.decode(generated, skip_special_tokens=True)

Qwen-VL (Alibaba)

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

class QwenVLModel:
    """Qwen2-VL multimodal model."""

    def __init__(self, model_name: str = "Qwen/Qwen2-VL-7B-Instruct"):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(
            model_name,
            min_pixels=256*28*28,
            max_pixels=1280*28*28
        )

    def analyze_image(self, image_path: str, question: str) -> str:
        """Analyze an image."""
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": question}
                ]
            }
        ]

        text = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        image_inputs, video_inputs = process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        ).to("cuda")

        with torch.no_grad():
            output_ids = self.model.generate(**inputs, max_new_tokens=512)

        # Trim the prompt tokens, keeping only the newly generated ones
        generated_ids = [
            out_ids[len(in_ids):]
            for in_ids, out_ids in zip(inputs.input_ids, output_ids)
        ]

        return self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

Local Execution Guide (Ollama)

# Run multimodal models locally with Ollama
# Install Ollama from ollama.ai

# Download the LLaVA model
ollama pull llava:13b

# Optional: chat with the model interactively in the terminal
ollama run llava:13b

import ollama
from pathlib import Path

class OllamaVisionModel:
    """Local vision model using Ollama."""

    def __init__(self, model: str = "llava:13b"):
        self.model = model

    def analyze(self, image_path: str, prompt: str) -> str:
        """Analyze an image using a local model."""
        response = ollama.chat(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                    "images": [image_path]
                }
            ]
        )
        return response["message"]["content"]

    def batch_analyze(
        self,
        image_paths: list[str],
        prompt: str
    ) -> list[str]:
        """Analyze multiple images sequentially."""
        results = []
        for path in image_paths:
            result = self.analyze(path, prompt)
            results.append(result)
        return results

# Usage
model = OllamaVisionModel("llava:13b")
result = model.analyze(
    "/path/to/image.jpg",
    "What is in this image? Please describe it in detail."
)
print(result)

11. Video Understanding AI

The Challenges of Video Understanding

Video understanding is a multimodal task that includes temporal information, making it far more complex than static image understanding.

Key challenges:

  • Temporal dependencies: Understanding temporal relationships between frames
  • Large data volume: 1 minute of video at 30fps = ~1,800 frames
  • Action recognition: Capturing motion patterns
  • Multi-scale: Understanding both short actions and long events simultaneously
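The data-volume problem is typically tackled with uniform frame sampling — for example, reducing a 1-minute, 30 fps clip (~1,800 frames) to the 16 evenly spaced frames a model such as VideoMAE consumes:

```python
import numpy as np

# Uniformly sample 16 frame indices from a 1,800-frame (1 min @ 30 fps) clip
total_frames = 1800
num_frames = 16
indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
print(indices[0], indices[-1], len(indices))  # → 0 1799 16
```

Uniform sampling preserves coverage of the whole clip but can miss fast actions between samples; denser sampling or shot-boundary-aware sampling trades compute for temporal resolution.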

Video Feature Extraction with VideoMAE

from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np

class VideoFeatureExtractor:
    """Video feature extractor using VideoMAE."""

    def __init__(self, model_name: str = "MCG-NJU/videomae-base"):
        self.processor = VideoMAEImageProcessor.from_pretrained(model_name)
        self.model = VideoMAEModel.from_pretrained(model_name)

    def extract_video_features(
        self,
        video_frames: list,
        num_frames: int = 16
    ) -> torch.Tensor:
        """Extract features from video frames."""
        total_frames = len(video_frames)
        indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
        sampled_frames = [video_frames[i] for i in indices]

        inputs = self.processor(sampled_frames, return_tensors="pt")

        with torch.no_grad():
            outputs = self.model(**inputs)

        return outputs.last_hidden_state

# Extract frames from video with OpenCV
import cv2

def extract_frames_from_video(
    video_path: str,
    target_fps: int = 1
) -> list:
    """Extract frames from a video."""
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Guard against fps <= target_fps, which would make the interval zero
    frame_interval = max(int(fps / target_fps), 1)

    frames = []
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            # OpenCV returns BGR; convert to RGB for PIL
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))

        frame_count += 1

    cap.release()
    return frames
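As a quick sanity check on the sampling rate: at 30 fps with `target_fps=1`, the interval is 30, so a 60-second clip keeps 60 frames:

```python
# Worked example of the frame-interval arithmetic for a 60 s clip at 30 fps
fps, target_fps, duration_s = 30.0, 1, 60
frame_interval = max(int(fps / target_fps), 1)          # keep every 30th frame
total_frames = int(fps * duration_s)                    # 1800 frames in total
kept = sum(1 for f in range(total_frames) if f % frame_interval == 0)
print(frame_interval, kept)  # → 30 60
```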

Long Video Understanding with Gemini

import google.generativeai as genai
import time

class LongVideoUnderstanding:
    """Long video understanding system using Gemini 1.5 Pro."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-1.5-pro")

    def analyze_long_video(
        self,
        video_path: str,
        analysis_tasks: list[str]
    ) -> dict:
        """Analyze videos up to 1 hour in length."""
        print("Uploading video...")
        video_file = genai.upload_file(
            path=video_path,
            display_name="long_video_analysis"
        )

        while video_file.state.name == "PROCESSING":
            print(f"Processing... (state: {video_file.state.name})")
            time.sleep(15)
            video_file = genai.get_file(video_file.name)

        if video_file.state.name != "ACTIVE":
            raise RuntimeError(f"Video processing failed: {video_file.state.name}")

        print(f"Upload complete (URI: {video_file.uri})")

        results = {}
        for task in analysis_tasks:
            print(f"Analyzing: {task}")
            response = self.model.generate_content(
                [video_file, task],
                request_options={"timeout": 900}
            )
            results[task] = response.text

        genai.delete_file(video_file.name)
        print("File cleanup complete")

        return results

    def create_video_summary(self, video_path: str) -> dict:
        """Generate a comprehensive summary of a video."""
        tasks = [
            "Summarize the overall content of this video in 3-5 sentences.",
            "List the major scenes with timestamps. Format: MM:SS - description",
            "List the main people, objects, and locations that appear in the video.",
            "What are the key messages or conclusions emphasized in this video?",
            "Who is the target audience and what is the purpose of this video?"
        ]

        return self.analyze_long_video(video_path, tasks)

# Usage
video_analyzer = LongVideoUnderstanding()
summary = video_analyzer.create_video_summary("lecture.mp4")

for task, result in summary.items():
    print(f"\n{'='*50}")
    print(f"Question: {task}")
    print(f"Answer: {result}")

Conclusion

Multimodal AI is advancing rapidly, with the ability to holistically understand text, images, and video becoming increasingly powerful.

Key takeaways from this guide:

  • CLIP: Maps images and text into a shared space via contrastive learning, forming the foundation for zero-shot classification
  • BLIP/BLIP-2: Efficient multimodal learning through bootstrapping and Q-Former
  • LLaVA: The standard for open-source vision-language assistants
  • GPT-4V / Claude Vision: The highest-performing commercial multimodal LLMs
  • Gemini 1.5: 1M token context for processing long videos and documents
  • Multimodal RAG: Building searchable knowledge bases from images using CLIP embeddings
  • Open-source ecosystem: Powerful locally-runnable models including Phi-3 Vision and Qwen-VL

The field is heading toward longer video understanding, 3D spatial understanding, and real-time multimodal processing. This area evolves very quickly, making continuous learning essential.

References