Chaos and Order

Prologue — The year video became searchable data

Between late 2025 and early 2026, the way enterprises handle video changed fundamentally. We moved from a world where meeting recordings sat on disk to one where engineers type "the segment of last quarter where we discussed a price increase" and get the right minute back. CCTV stopped being something humans scrub through in real time and became something you query — "show me anyone in a red shirt entering the front gate yesterday" — and get matched frames in under a second. Content libraries shifted from human-tagged metadata to multimodal embeddings that turn every scene into a searchable semantic unit.

Three concurrent advances made this possible.

- **Multimodal embedding accuracy** — From OpenCLIP through Google SigLIP2 (February 2025, ImageNet zero-shot near 84 percent), Cohere Embed v3 Multimodal, Voyage Multimodal, Nomic Embed Multimodal and Jina CLIP v2, the cost of putting one line of text and one image or clip in the same vector space dropped to roughly one-hundredth of what it cost in 2022.

- **Video-native foundation models** — Twelve Labs released Pegasus 1.2 (Nov 2024) and Marengo 2.7, the first commercial models that treat video as a first-class citizen. Google's Gemini 1.5 and 2.0 Pro accept up to an hour of video in a single context window. OpenAI exposed GPT-4o's video API in December 2024.

- **Multimodal modes in vector databases** — Every major vector store now treats text, image and video embeddings as first-class entries inside the same index.

This piece maps the 2026 landscape end to end: video-native APIs like Twelve Labs, hyperscaler video AI, multimodal vector databases, object detection, foundation models, asset metadata, real-world use cases, captioning and licensing, Korean and Japanese local vendors, and storage and egress economics.

1 — Why video search matters in 2026

The variety of video an organisation owns has exploded.

- **Meeting recordings** — Zoom, Google Meet and Microsoft Teams now auto-record by default. A mid-sized company accumulates thousands to tens of thousands of hours per year. Otter, Granola, Fathom and Read.ai build search and summarisation on top.

- **CCTV and security cameras** — Cloud NVRs from Verkada, Rhombus and Eagle Eye Networks stream petabytes per customer into the cloud.

- **Content libraries** — Media companies hold petabyte-scale footage and VOD archives.

- **User-generated content** — TikTok, YouTube and Instagram Reels ingest hundreds of hours per minute.

- **E-commerce video** — 360-degree product video, unboxings and reviews are increasingly searchable assets, not just landing-page decoration.

- **Autonomous-vehicle and robotics data** — Fleets generate petabytes per week, and that footage powers model training and debugging.

What ties all of this together is one shared problem: "I've seen it but I can't find it." Plain text has grep. Video does not — until now. The 2026 video-search stack is the infrastructure that closes the gap.

The canonical scenarios look like this.

- Meetings: "Find the segments of last quarter's deals over 100K ACV where pricing was negotiated."

- Security: "Show me when a white SUV passed the front gate yesterday between 22:00 and 23:00."

- Content: "The scene where the two leads talk in the rain."

- E-commerce: "Find videos featuring a hoodie similar to this one."

- Live broadcast: "Flag profanity or sensitive speech the moment it airs."

All five run on the same underlying stack: embeddings plus a vector database plus targeted detectors.
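
The shared recipe can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 3-dimensional vectors stand in for real model embeddings, the `Clip` record and the `search` helper are hypothetical names, and an in-memory list plays the role of the vector database.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Clip:
    """Hypothetical index entry: one clip, its embedding and detector labels."""
    video_id: str
    start_s: float
    end_s: float
    embedding: list[float]
    labels: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, required_labels=frozenset(), top_k=3):
    """Keep clips carrying every required detector label, then rank by similarity."""
    candidates = [c for c in index if set(required_labels) <= c.labels]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:top_k]


# Toy data: the vectors stand in for real text and video embeddings.
index = [
    Clip("cam1", 0, 10, [0.9, 0.1, 0.0], {"car"}),
    Clip("cam1", 10, 20, [0.1, 0.9, 0.0], {"person"}),
    Clip("cam2", 0, 10, [0.8, 0.2, 0.1], {"car", "person"}),
]
hits = search(index, [1.0, 0.0, 0.0], required_labels={"car"})
```

The detector labels act as a hard filter and the embedding similarity as a soft ranking, which is the division of labour the five scenarios share.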

2 — Twelve Labs — the leader in video-native foundation models

Twelve Labs, founded in 2021 by Korean-American co-founders, built the first commercial line of models that treat video as a first-class modality.

- **Marengo 2.7** — Embedding model that places video, image, text and audio into a shared 1024-dimensional space. Announced September 2024.

- **Pegasus 1.2** — Generative model that takes video as input and produces summaries, Q&A and captions. Announced November 2024.

- **Marengo Search API** — Natural-language query into matched clip timestamps. Returns start and end times plus confidence scores.

- **Embed API** — Converts video into multimodal embeddings combining visual, auditory and textual signals.

- **Generate API** — Free-form prompting over a video, summarisation and chapter segmentation.

Pricing combines per-minute indexing and per-token generation. Indexing sits near 0.05 USD per minute as of 2026; generation runs about 1.5 USD per million tokens. The free tier is ten hours per month.
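
To see how the two pricing components add up, here is a small estimator. It uses only the unit prices quoted above and ignores the free tier; the function name and constants are illustrative, not part of any Twelve Labs SDK.

```python
INDEX_USD_PER_MINUTE = 0.05        # indexing price quoted above
GENERATE_USD_PER_MTOK = 1.5        # generation price per million tokens quoted above


def estimate_cost_usd(video_hours: float, generated_tokens: int = 0) -> float:
    """Indexing is billed per minute of video, generation per token produced."""
    indexing = video_hours * 60 * INDEX_USD_PER_MINUTE
    generation = generated_tokens / 1_000_000 * GENERATE_USD_PER_MTOK
    return indexing + generation


hundred_hours = estimate_cost_usd(100)                      # indexing only
with_summaries = estimate_cost_usd(100, generated_tokens=2_000_000)
```

At these rates indexing dominates: 100 hours of video costs about 300 USD to index, while two million generated tokens add about 3 USD.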

The strength versus other vendors is robustness across video length. The same API handles a one-minute clip and a one-hour meeting recording with one- to two-second timestamp precision. The weakness is that Korean and Japanese subtitle data is thinner than English, so a fallback ASR pass is sometimes needed.

The challengers in the same seat:

- **Cloudglue** — 2025 entrant. Focus on content moderation and ad matching.

- **VideoDB** — Managed video infrastructure that bundles indexing, streaming and generation behind a single SDK.

- **Mixpeek** — Multimodal RAG platform that indexes images, video and documents in one space.

3 — Multimodal embedding models — from CLIP to SigLIP2

The heart of video search is the embedding model. A text query and a video frame have to land in the same vector space.

- **OpenAI CLIP (2021)** — ViT-B/32 and ViT-L/14 were the de facto standard. Trained on 400 million English image-text pairs. Weak Korean and Japanese support.

- **OpenCLIP (LAION)** — Open-weights re-training of CLIP on LAION-5B. ViT-bigG/14 reaches roughly 80 percent zero-shot ImageNet accuracy.

- **Google SigLIP (2023)** — Trained with a sigmoid loss in place of softmax. Better precision-recall stability on the same data.

- **Google SigLIP2 (February 2025)** — Multilingual training. Substantial Korean and Japanese gains, ImageNet zero-shot near 84 percent.

- **Jina CLIP v2 (2024)** — Multilingual plus long text (8K tokens). Matryoshka-style training lets you truncate the embedding to 64-1024 dimensions.

- **BGE Multimodal (BAAI)** — Chinese-English open model from BAAI.

- **Cohere Embed v3 Multimodal (October 2024)** — Image and text in the same space via API. 1024-dimensional.

- **Voyage Multimodal (voyage-multimodal-3, November 2024)** — Text, image, table and chart in the same space. Tuned for RAG accuracy.

- **Nomic Embed Multimodal (December 2024)** — Open weights plus hosted API. Image and text with some Korean support.

- **VideoCLIP, X-CLIP, VideoLLM** — Video-specific variants that embed sequences of frames with a time axis.

Selection rules are simple. Need Korean or Japanese? SigLIP2 or Jina CLIP v2. Tables and charts in meeting recordings? Voyage Multimodal. Need fully open weights? Nomic Embed Multimodal. Plain English use cases? OpenCLIP ViT-L/14 remains the best price-performance trade-off.
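
The Matryoshka-style truncation mentioned for Jina CLIP v2 amounts to keeping a prefix of the vector and re-normalising it. The sketch below assumes the model was trained so that prefixes remain meaningful; for models without that training objective, truncation usually costs more accuracy. The helper name is hypothetical.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


full = [0.5, 0.5, 0.5, 0.5]          # toy unit vector standing in for a 1024-d embedding
short = truncate_embedding(full, 2)  # half the storage, still unit length
```

Re-normalising matters because most vector stores compare unit vectors with cosine or dot-product distance.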

4 — Hyperscaler video AI APIs

Outside the specialised vendors, all three major clouds offer video AI.

- **Google Cloud Video Intelligence API** — Label detection, shot change, object tracking, OCR, explicit content and person detection. Pricing around 0.10 USD per minute.

- **AWS Rekognition Video** — Face recognition, object detection, text, content moderation and celebrity recognition. Supports both batch and live streams.

- **Azure Video Indexer** (formerly Video Analyzer for Media) — Faces, sentiment, OCR, keyframes, speech recognition, translation and topic extraction in one product. Auto-captions in 30-plus languages.

- **AWS Bedrock + Anthropic Claude 3.5 Sonnet** — Extract frames and query them with a vision model for free-form analysis.

How to choose.

- Already on GCP? Video Intelligence is the natural pick. Its label-detection accuracy is the most consistent across domains.

- Need live-stream moderation? Rekognition Video.

- Want auto-captioning plus multilingual translation plus an Insights UI in one place? Azure Video Indexer is the most polished end-to-end.

- Need open-ended querying? Bedrock plus Claude or Nova.

5 — Multimodal modes in vector databases

Once embeddings exist they need to be stored and searched. By 2026 every major vector database treats multimodal indexing as first-class.

- **Pinecone** (Multimodal mode, GA September 2025) — Stores text, image and video embeddings in the same index. Managed with auto-embedding.

- **Weaviate** (multi2vec-clip module) — Plug CLIP or SigLIP in as a module, with auto-embedding on ingest.

- **Qdrant** — Collections of payloads plus vectors; combines freely with external CLIP or SigLIP embeddings.

- **Milvus / Zilliz Cloud** — Billions of vectors. Multi-vector fields per document for text, image and audio.

- **Chroma** — Local development and small scale. Multimodal collections supported.

- **pgvector + HNSW** — Postgres extension. Cost-effective at modest scale.

- **Turbopuffer** — 2024 entrant. Object-storage-based pricing, roughly one-tenth the cost of incumbents.

Pick by scale. Under a million vectors? Chroma or pgvector. Up to a hundred million? Pinecone or Weaviate. Beyond that? Milvus or Turbopuffer.
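
A quick way to apply the rule is to estimate the vector count first. The sketch below assumes one embedding per fixed-length clip (10 seconds by default, an arbitrary illustrative value) and encodes the thresholds from the paragraph above; both function names are hypothetical.

```python
def estimate_vectors(video_hours: float, seconds_per_vector: float = 10.0) -> int:
    """One embedding per clip of `seconds_per_vector` seconds (assumed granularity)."""
    return int(video_hours * 3600 / seconds_per_vector)


def suggest_store(n_vectors: int) -> str:
    """Thresholds follow the scale rule stated in the text."""
    if n_vectors < 1_000_000:
        return "Chroma or pgvector"
    if n_vectors <= 100_000_000:
        return "Pinecone or Weaviate"
    return "Milvus or Turbopuffer"


n = estimate_vectors(10_000)   # ten thousand hours at one vector per 10 seconds
choice = suggest_store(n)
```

Adding keyframe or caption embeddings on top of clip embeddings multiplies the count, so it is worth estimating each signal separately.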

6 — Object detection and activity recognition

Part of video search relies on per-frame classification rather than embeddings.

- **Roboflow Video Inference + Workflows** — Frame-by-frame detection, post-processing and alerting, with a no-code workflow builder.

- **Ultralytics YOLO** (YOLOv8 and YOLO11) — The real-time-detection standard. The pretrained COCO checkpoints cover 80 classes at over 30 FPS on a modern GPU.

- **Detectron2 / MMDetection** — Academic-grade from Meta and OpenMMLab. Accuracy-first.

- **OpenCV + MediaPipe** — Client-side standard for face, pose and hand detection.

- **NVIDIA DeepStream + Metropolis** — GPU-accelerated pipelines. Handles hundreds of CCTV channels per box.

- **Hailo / Coral Edge TPU** — Edge-device detection for CCTV and robotics.

Activity recognition (motion-based labels) needs separate models. SlowFast, VideoMAE and TimeSformer are the academic baselines, but practitioners often shortcut with keyframe extraction plus CLIP embeddings.
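
The keyframe shortcut can be illustrated with a simple change detector. This is a sketch, assuming frames arrive as flat lists of grayscale pixel values and that a fixed mean-absolute-difference threshold is good enough; real pipelines decode frames with FFmpeg or OpenCV and pass the chosen keyframes to a CLIP-style encoder, both of which are outside this example.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: list[list[int]], threshold: float = 30.0) -> list[int]:
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    keep = [0]  # the first frame is always a keyframe
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keep[-1]]) > threshold:
            keep.append(i)
    return keep


# Toy 4-pixel frames: two near-duplicates, a scene change, a near-duplicate, another change.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 200, 200, 200],
    [198, 201, 200, 199],
    [50, 60, 40, 50],
]
keyframes = select_keyframes(frames)  # -> [0, 2, 4]
```

Only the three selected frames would be embedded, which is where the cost saving over full-frame processing comes from.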

7 — Foundation video models in 2026 — Sora, Veo, Runway, Gemini, GPT-4o, Claude

Video generation and video understanding now live inside the same model line.

- **Sora** (OpenAI, ChatGPT Plus and Pro from December 2024) — Generation plus understanding. Up to one minute at 1080p. Limited API in early 2026.

- **Veo 2** (Google DeepMind, December 2024) — Cinematic camera work and precise physics. Integrated through Google Cloud Vertex AI.

- **Runway Gen-3 Alpha + Aleph** (2024-2025) — Aleph is the editing mode for generation plus masking.

- **Gemini 1.5 and 2.0 Pro video** — Up to one hour of video in a single context. Natural-language Q&A and summarisation.

- **GPT-4o video API** (December 2024) — Frames plus audio processed jointly. Real-time voice plus video.

- **Claude 3.5 and 4 Sonnet + vision frames** — Frame extraction followed by single-pass analysis. Strong tool-use integration.

- **InternVL 2/3 and MiniCPM-V** (open) — Self-hostable, with strong Korean and Japanese OCR.

- **Pika Labs, Luma Dream Machine, Kling** (Kuaishou) and **Hailuo MiniMax** — Generation-focused.

For understanding (search and summarisation), Twelve Labs Pegasus plus Gemini 2.0 Pro is the dominant stack. For generation, Sora, Veo, Runway, Kling and Hailuo each hold parts of the market.

8 — Video asset metadata — Mux, Cloudflare Stream, JW Player

Separate from generation and understanding lies the streaming and management plane.

- **Mux** (since 2017) — Analytics, encoding and live plus Asset Metadata. Auto-detection plus custom key-value tags. Mux Data covers viewing-quality analytics.

- **Cloudflare Stream** — Video encoding plus a global CDN plus AI captions. On the same network as R2 object storage, so egress is zero between them.

- **JW Player + AI Discovery** — Indexing plus auto topic classification, strong with CMS-driven publishers.

- **Bitmovin** — Encoders and analytics for media. 4K HDR optimisation.

- **api.video** — French-origin clean API for encoding, streaming and captions in one call.

- **Vimeo OTT / Brightcove** — Enterprise OTT.

- **AWS MediaConvert / Elemental** — AWS-native encoding.

Two keywords matter. (1) Asset Metadata: free-form key-value tagging that makes videos searchable. (2) AI captions: automatic English and multilingual subtitles, chapters and keywords generated at upload time. Both Cloudflare Stream and Mux now follow this pattern.

9 — Captioning infrastructure — Rev, 3Play Media, Whisper

The first searchable signal in any video is its captions. Audio to text to embedding is the most cost-effective route.

- **OpenAI Whisper** (large-v3 and large-v3-turbo) — Open weights, roughly 100 languages.

- **AssemblyAI** — Speaker diarisation plus sentiment plus auto-keywords.

- **Deepgram** — Live and batch. Fast-improving Korean accuracy.

- **Rev.com** — Human-checked plus AI. Suitable for medical and legal.

- **3Play Media** — The standard for US media companies. Captions plus audio description.

- **Verbit** — Education and legal markets.

- **Otter, Granola, Fathom, Read.ai** — Auto-captioning plus summarisation for meeting recordings. Otter and Granola use proprietary models; Fathom and Read sit workflow-side on top of third-party models.

For large volumes where cost is the priority, self-hosted Whisper is the standard. Where accuracy is the priority, Rev or 3Play with human review remains the bar.

10 — Meeting search — the largest market

More than half of enterprise video is meetings. Meeting search is the biggest single demand driver.

- **Otter** — Auto-recording plus search plus action-item extraction. Effectively the standard in 2026.

- **Granola** (2024-) — Mac-native, with notes auto-written in a sidebar.

- **Fathom** — Zoom and Meet auto-recording plus clip sharing. CRM integrations.

- **Read.ai** — Meeting-efficiency scoring plus auto-summary.

- **Microsoft Teams Premium + Copilot** — Native Teams integration; search tied to the Teams search index.

- **Zoom AI Companion** — Native inside Zoom.

- **Google Meet + Gemini** — Auto meeting notes.

- **Tactiq / Sembly** — Cross-platform meeting notes.

- **Avoma** — Sales-call focus.

Sample query: "Segments of last quarter where pricing was negotiated." Match via captions plus speaker embeddings. The result is a video timestamp, a speaker label and a caption excerpt.
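
The shape of that result can be sketched as follows. To stay self-contained, the example scores caption segments with a bag-of-words cosine instead of a real embedding model, and it leaves out speaker embeddings entirely; the `Segment` record and function names are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float
    end_s: float
    speaker: str
    text: str


def bow(text: str) -> Counter:
    """Lower-cased word counts; a stand-in for a text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_segments(segments, query, top_k=1):
    """Return (timestamp, speaker, caption excerpt) for the best-matching segments."""
    q = bow(query)
    ranked = sorted(segments, key=lambda s: similarity(q, bow(s.text)), reverse=True)
    return [(s.start_s, s.speaker, s.text) for s in ranked[:top_k]]


segments = [
    Segment(0, 42, "Alice", "Welcome everyone, let's review the roadmap."),
    Segment(42, 95, "Bob", "On pricing, we negotiated a ten percent increase."),
    Segment(95, 130, "Alice", "Next item is hiring."),
]
best = search_segments(segments, "where pricing was negotiated")
```

Swapping `bow` and `similarity` for a real embedding model and a vector index keeps the return shape unchanged.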

11 — Security-camera search — finding people and vehicles

CCTV by definition produces "video no human can watch in real time." Natural-language search cuts the labour of reviewing footage roughly a hundredfold.

- **Verkada** — Cloud NVR plus AI search. Queries like "white shirt + front gate" are first-class.

- **Rhombus** — US mid-market building standard.

- **Eagle Eye Networks** — Global cloud NVR.

- **Avigilon Unity** — Motorola Solutions; government and enterprise.

- **Genetec** — Canadian; security plus access control.

- **Spot AI** — AI-first NVR built around natural-language query.

- **Hanwha Vision** (Korea) — Domestic and global; AI Box for on-device analysis.

- **Axis Communications** — Camera hardware plus analytics modules.

Three core features. (1) Person, vehicle and licence-plate detection. (2) Natural-language query like "red shirt." (3) Anomaly alerts such as falls, running or weapons.
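
Once detectors have written structured metadata, a query such as the white-SUV example reduces to a filter over detections. The sketch below assumes a hypothetical `Detection` record with free-form attributes; how a vendor parses the natural-language sentence into these filters is not shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Detection:
    camera: str
    timestamp: datetime
    label: str
    attributes: dict


def find(detections, camera, label, start, end, **attrs):
    """Match on camera, label, a half-open time window and exact attribute values."""
    return [
        d for d in detections
        if d.camera == camera
        and d.label == label
        and start <= d.timestamp < end
        and all(d.attributes.get(k) == v for k, v in attrs.items())
    ]


detections = [
    Detection("front_gate", datetime(2026, 1, 14, 22, 17), "vehicle", {"type": "suv", "color": "white"}),
    Detection("front_gate", datetime(2026, 1, 14, 22, 40), "vehicle", {"type": "sedan", "color": "white"}),
    Detection("front_gate", datetime(2026, 1, 14, 23, 5), "vehicle", {"type": "suv", "color": "white"}),
    Detection("lobby", datetime(2026, 1, 14, 22, 30), "person", {"shirt": "red"}),
]
matches = find(
    detections, "front_gate", "vehicle",
    datetime(2026, 1, 14, 22, 0), datetime(2026, 1, 14, 23, 0),
    type="suv", color="white",
)
```

Attributes that detectors cannot label reliably, such as "carrying a box", are where embedding search takes over from exact filters.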

12 — Content-library search — media archives

Broadcasters, OTTs and studios hold petabyte-scale archives. Human-tagged metadata had always been the search ceiling.

- **GrayMeta** — AI metadata for media archives.

- **Veritone** — Speech, face, logo and OCR indexing in one place. Strong with broadcast and advertising.

- **AWS Elemental MediaTailor** — Ad insertion plus AI indexing.

- **Anvato** (Google Cloud) — Broadcast encoding plus metadata.

- **Iconik** — Media asset management plus AI tagging.

- **Frame.io + Adobe AI** — Video collaboration plus auto-tagging.

- **Twelve Labs Enterprise** — Natural-language search integration for media companies.

Sample query: "The scene where the two leads talk in the rain." Combines captions, visual embeddings and object detection.
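
One common way to combine ranked lists from several signals is reciprocal rank fusion. The text does not say which fusion method any vendor uses, so treat this as one reasonable option; the scene IDs and the constant `k=60` (a conventional default) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each item by the sum of 1 / (k + rank) over every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


caption_hits = ["scene_12", "scene_07", "scene_31"]   # from caption search
visual_hits = ["scene_07", "scene_12", "scene_44"]    # from visual embeddings
object_hits = ["scene_07", "scene_44", "scene_12"]    # from object detection ("rain", "person")
fused = reciprocal_rank_fusion([caption_hits, visual_hits, object_hits])
```

Rank-based fusion avoids having to calibrate raw scores across signals that live on different scales.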

13 — E-commerce video — the next step of product search

E-commerce has validated that video outperforms static images on conversion, so the demand to make video a searchable asset is large.

- **Syte** — Visual search for image and video. Fashion and lifestyle.

- **Vue.ai** — Catalogue plus AI tagging plus virtual models.

- **YouCam / Perfect Corp.** — Cosmetics virtual try-on plus search.

- **Pixyle.ai** — Automatic fashion tagging.

- **Coveo + video** — Enterprise search.

- **Algolia + image** — Visual embeddings layered on classical search.

Sample query: "Videos featuring a hoodie similar to this one." Combines CLIP or SigLIP embeddings with fashion classifiers.

14 — Live-broadcast moderation

Live streams have no post-processing window. Detection and blocking have to happen the moment something airs.

- **Hive Moderation** — Live vision plus audio moderation. Used by Twitch and Reddit.

- **AWS Rekognition Streaming** — Kinesis Video Streams plus real-time analytics.

- **Sensity AI** — Deepfake detection.

- **Spectrum Labs** — Voice plus chat integrated.

- **Two Hat / Microsoft Community Sift** — Games and UGC platforms.

- **OpenAI Moderation API + vision** — Frames plus text in one call.

For live the metric is latency. Results need to land within 200 ms for pre-broadcast blocking.
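
A latency budget is easiest to reason about stage by stage. The stage names and timings below are hypothetical, and the sketch assumes the stages run sequentially; stages that run in parallel should contribute only their slowest member.

```python
BUDGET_MS = 200.0  # end-to-end target quoted above


def within_budget(stage_ms: dict[str, float], budget_ms: float = BUDGET_MS) -> tuple[bool, float]:
    """Return whether the summed stage latencies fit, plus the remaining headroom."""
    total = sum(stage_ms.values())
    return total <= budget_ms, budget_ms - total


# Hypothetical per-stage measurements in milliseconds.
stages = {"decode": 15.0, "vision_model": 90.0, "audio_model": 60.0, "policy": 10.0}
ok, headroom_ms = within_budget(stages)
```

In production the check would run against a high percentile of measured latency rather than a single sample.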

15 — Inside YouTube and TikTok

Platform-native search is a separate stack.

- **YouTube Chapter Search** — Surfaces chapters as search results; auto-generated plus creator-edited.

- **YouTube Search by Voice and Hum** — Voice-based song search.

- **TikTok For You + video understanding** — Watch patterns plus content embeddings; recommendation is the core surface.

- **Meta CLIP + Reels recommendations** — Meta's CLIP variant powers Reels recommendation.

- **Instagram Reels search** — Captions plus visual embeddings plus audio.

The platforms do not publish their models, but research papers from Meta and Google reveal the structure: captions plus visual embeddings plus watch-time signals.

16 — Korean video AI

The Korean market has serious local players.

- **NAVER Clova Vision API / Video OCR** — Character detection plus indexing in video. Strong on news and entertainment subtitles.

- **Kakao Enterprise Kakao i Video AI** — Video analytics API. Works for both content libraries and CCTV.

- **VESPER** — Korean video-AI startup. Live and batch.

- **Hyperconnect / Azar** — Live-video moderation technology.

- **Maum AI (MindsLab)** — Integrated voice plus video AI platform.

- **DeepBrain AI** — AI human avatars plus video generation.

- **Lunit** — Medical imaging. Not video search per se, but a major axis of Korean visual AI.

- **Hanwha Vision** — CCTV cameras plus AI Box; native search analytics.

- **Wisenet Wave** — Hanwha Vision's NVR software.

- **Synamedia / Verimatrix Korea** — Broadcast plus DRM plus indexing.

KBS, SBS and JTBC each run internal archive search systems on NAVER Cloud or proprietary models. Korean OCR and ASR accuracy is ahead of the global median.

17 — Japanese video AI

Japan combines a large broadcasting and licensing market with active local solutions.

- **DeepMind Tokyo video research** — Part of Veo development is anchored in Tokyo.

- **TBS NDL + AI video search** — TBS's news digital library, with AI subtitles and topic search.

- **NHK STRL (Science and Technology Research Laboratories)** — Archive search, auto-subtitles and AI anchors.

- **Sony video-understanding AI** — Cameras plus cloud plus AI; film and sports.

- **Fuji Soft + AI video search** — Enterprise video search.

- **Preferred Networks (PFN)** — Training infrastructure for autonomous driving and robot video.

- **rinna / NTT video models** — Japanese-language video-understanding research.

- **NEC / Fujitsu video search** — Government and transport.

Sports broadcasting (NPB and J.League) with auto-highlights is a strong vertical. NTT provides live analytics layered on its telecom infrastructure.

18 — Storage cost — the real bill of video RAG

Video search costs more in storage and egress than in embeddings.

- **Object-storage unit prices** — S3 Standard at 0.023 USD per GB-month, GCS Standard similar, Azure Blob Hot similar, Cloudflare R2 at 0.015 USD per GB-month. One PB is 15-23K USD per month.

- **Infrequent access** — S3 IA at 0.0125, Glacier Flexible at 0.0036, Deep Archive at 0.00099. One PB in Deep Archive is about 1000 USD per month.

- **Egress** — S3 at 0.09 USD per GB is the baseline. One TB downloaded costs 90 USD. R2 and Cloudflare Stream offer zero egress.

- **Video-analysis unit cost** — 0.05 to 0.15 USD per minute. Ten thousand hours (600K minutes) of analysis runs 30-90K USD.

- **Vector DB** — Pinecone managed standard is around 70 USD per month per million vectors. Turbopuffer is roughly one-tenth.

Three cost-reduction strategies. (1) Move cold data to Glacier. (2) Use Cloudflare R2 and Stream for zero egress. (3) Embed only keyframes; skip full-frame decoding.
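
The figures above are straightforward to reproduce. The sketch below uses only the unit prices listed in this section and treats one petabyte as one million decimal gigabytes; the dictionary keys and function names are illustrative.

```python
# USD per GB-month, as quoted in this section.
USD_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "r2": 0.015,
    "s3_ia": 0.0125,
    "glacier_flexible": 0.0036,
    "deep_archive": 0.00099,
}
S3_EGRESS_USD_PER_GB = 0.09
PB_IN_GB = 1_000_000  # decimal units


def monthly_storage_usd(gb: float, tier: str) -> float:
    """Storage bill for one month at the given tier."""
    return gb * USD_PER_GB_MONTH[tier]


def egress_usd(gb: float, zero_egress: bool = False) -> float:
    """Download cost; zero for providers that do not charge egress."""
    return 0.0 if zero_egress else gb * S3_EGRESS_USD_PER_GB


hot_pb = monthly_storage_usd(PB_IN_GB, "s3_standard")    # about 23,000 USD
cold_pb = monthly_storage_usd(PB_IN_GB, "deep_archive")  # about 990 USD
one_tb_out = egress_usd(1_000)                           # about 90 USD
```

The gap between the hot and cold tiers is why the first strategy, moving cold data to archive storage, usually has the largest effect.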

19 — Reference architecture — Twelve Labs + Pinecone + Cloudflare R2

The most common 2026 video-search stack looks like this.

[Video upload (Mux or Cloudflare Stream)]

-> [Cloudflare R2 (original storage, zero egress)]

+--> [Whisper / Deepgram (caption generation)]

+--> [Twelve Labs Marengo (per-clip video embedding)]

+--> [SigLIP2 / Voyage Multimodal (keyframe embedding, extra signal)]

+--> [Roboflow / YOLO (object detection, metadata)]

-> [Pinecone Multimodal Index]

[Natural-language query] -> [Twelve Labs Search or Pinecone Hybrid]

-> [Result: video ID + start/end timestamps + caption + object labels]

-> [Mux Player + jump-to-time + caption highlights]

The cost shape for a 100-hour corpus: R2 at 5 USD per month, Twelve Labs indexing at 300 USD one-time, Pinecone at 70 USD per month and captioning at 50 USD one-time. Initial indexing is a one-time 350 USD and steady-state operation runs about 75 USD per month.
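
The same arithmetic, written out so the line items can be swapped for your own numbers. The values are the ones quoted above; the function and key names are illustrative.

```python
def cost_shape(one_time: dict[str, float], monthly: dict[str, float], months: int):
    """Return (setup cost, monthly run rate, total over the given number of months)."""
    setup = sum(one_time.values())
    run_rate = sum(monthly.values())
    return setup, run_rate, setup + run_rate * months


one_time = {"twelve_labs_indexing": 300, "captioning": 50}   # USD, 100-hour corpus
monthly = {"r2_storage": 5, "pinecone": 70}                  # USD per month
setup, run_rate, first_year = cost_shape(one_time, monthly, months=12)
```

Over the first year the recurring 75 USD per month outweighs the one-time 350 USD, so the vector-database line is the one to watch as the corpus grows.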

20 — Privacy and compliance

Video is the most personally identifying data class.

- **Facial recognition** — The EU AI Act prohibits most real-time remote biometric identification in publicly accessible spaces for law enforcement, with narrow exceptions; the prohibitions have applied since 2 February 2025. The US has state-level rules (Illinois BIPA requires consent).

- **Meeting recording** — Some US states, California among them, require consent from every party to the recording (often called two-party consent).

- **CCTV** — GDPR requires proportionality and a legitimate-interest assessment.

- **Deepfakes** — Korea, Japan and the EU all strengthened synthetic-content labelling rules in 2025-2026.

- **Automated moderation** — Live moderation false positives still need a human appeal path.

Before introducing video search, an organisation should settle three things. (1) Store face embeddings separately. (2) Define retention and auto-deletion. (3) Define consent flows.
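
The retention item is the easiest of the three to automate. The sketch below assumes hypothetical retention periods per asset kind and a simple list of asset records; actual periods have to come from legal review, and the deletion itself (storage, embeddings, captions) is not shown.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods; real values depend on jurisdiction and policy.
RETENTION = {
    "cctv": timedelta(days=30),
    "meeting": timedelta(days=365),
}


def is_expired(kind: str, recorded_at: datetime, now: datetime) -> bool:
    """An asset expires once it is older than the retention period for its kind."""
    return now - recorded_at > RETENTION[kind]


def to_delete(assets: list[dict], now: datetime) -> list[str]:
    """IDs of assets whose originals and derived data should be purged."""
    return [a["id"] for a in assets if is_expired(a["kind"], a["recorded_at"], now)]


assets = [
    {"id": "cam-001", "kind": "cctv", "recorded_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": "cam-002", "kind": "cctv", "recorded_at": datetime(2026, 2, 20, tzinfo=timezone.utc)},
    {"id": "mtg-001", "kind": "meeting", "recorded_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]
now = datetime(2026, 3, 1, tzinfo=timezone.utc)
expired = to_delete(assets, now)
```

Deleting an asset should also remove its vectors and face embeddings, which is one reason to store face embeddings separately and keyed by asset ID.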

21 — Open-source video-search stack

If self-hosting is preferred, this combination is the 2026 standard.

- **Embedding**: SigLIP2 (Hugging Face) plus Whisper large-v3.

- **Vector DB**: Self-hosted Qdrant, or Milvus at scale.

- **Object detection**: Ultralytics YOLO11.

- **Video decoding**: FFmpeg plus GPU acceleration.

- **Workflow**: Apache Airflow or Prefect.

- **Storage**: MinIO or SeaweedFS.

- **Player**: Video.js or hls.js.

Cost is dominated by one or two GPUs plus storage. Running 10K hours of indexed video at 2-3K USD per month is reachable.

22 — Trends and what comes next — H2 2026 outlook

- **One-hour context becomes standard** — Gemini and GPT both ingest hours of video in one call.

- **Agents plus video** — Agents that take video as input (browser-use, robots, AV debug) become standard.

- **On-device video AI** — iPhone Neural Engine and Snapdragon 8 Gen 4 run CLIP variants in real time.

- **Synthetic data** — Sora and Veo generate training data when real footage is scarce.

- **Finer temporal resolution** — Precision moves from 1-2 seconds toward 100 ms.

- **Joint audio plus visual** — GPT-4o-class models are the new norm.

- **Regulation tightens** — EU AI Act in force, Korean and Japanese synthetic-content labelling mandatory.

Conclusion — Video is finally searchable data

In 2026, video is no longer a thing you watch; it is data you search, summarise, cite and train on. The toolkit in this guide starts with Twelve Labs and adds a Pinecone Multimodal index, Roboflow object detection, Cloudflare R2 plus Mux Asset Metadata and Whisper captions. Around that core sit hyperscaler tools from Google Video Intelligence to AWS Rekognition to Azure Video Indexer, foundation models like Sora, Veo, Gemini, GPT-4o and Claude, and local players from NAVER, Kakao, VESPER and Hanwha Vision in Korea to NHK STRL, Sony and NTT in Japan. A pragmatic combination of these tools makes it possible to search petabytes of video with a single natural-language sentence in under a second.

The pivot is one decision: treat video as data. Once that's settled, the answer for almost every use case becomes the same recipe — embeddings plus a vector database plus captions plus object detection. The same infrastructure powers meetings, CCTV, content, e-commerce and live simultaneously.

References — Twelve Labs, SigLIP, Pinecone, Mux and more

- [Twelve Labs Documentation — Marengo and Pegasus](https://docs.twelvelabs.io/)

- [Twelve Labs API Reference](https://docs.twelvelabs.io/reference/api-reference)

- [Google SigLIP2 Paper (arXiv, 2025)](https://arxiv.org/abs/2502.14786)

- [OpenCLIP GitHub (LAION)](https://github.com/mlfoundations/open_clip)

- [Jina CLIP v2 Announcement](https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/)

- [Cohere Embed v3 Multimodal](https://cohere.com/blog/multimodal-embed-3)

- [Voyage Multimodal 3](https://blog.voyageai.com/2024/11/12/voyage-multimodal-3/)

- [Nomic Embed Multimodal](https://blog.nomic.ai/posts/nomic-embed-multimodal)

- [Google Cloud Video Intelligence API](https://cloud.google.com/video-intelligence/docs)

- [AWS Rekognition Video Developer Guide](https://docs.aws.amazon.com/rekognition/latest/dg/video.html)

- [Azure Video Indexer Documentation](https://learn.microsoft.com/en-us/azure/azure-video-indexer/)

- [Pinecone Multimodal Search Guide](https://docs.pinecone.io/guides/data/multimodal-search)

- [Weaviate multi2vec-clip Module](https://weaviate.io/developers/weaviate/modules/multi2vec-clip)

- [Qdrant Multimodal Search Tutorial](https://qdrant.tech/articles/multimodal-search/)

- [Milvus Multimodal Search](https://milvus.io/docs/multimodal_rag_with_milvus.md)

- [Roboflow Video Inference Docs](https://docs.roboflow.com/deploy/video-inference)

- [Ultralytics YOLOv11 Release](https://docs.ultralytics.com/models/yolo11/)

- [NVIDIA DeepStream SDK](https://developer.nvidia.com/deepstream-sdk)

- [Sora System Card (OpenAI, 2024)](https://openai.com/index/sora-system-card/)

- [Google Veo 2 Announcement](https://deepmind.google/technologies/veo/veo-2/)

- [Runway Gen-3 Alpha Documentation](https://help.runwayml.com/hc/en-us/articles/30586818553107)

- [Gemini 1.5 Pro Long Context Paper](https://arxiv.org/abs/2403.05530)

- [Mux Asset Metadata API](https://docs.mux.com/guides/video/add-custom-metadata-to-an-asset)

- [Cloudflare Stream + AI Captions](https://developers.cloudflare.com/stream/edit-videos/captions/)

- [JW Player AI Discovery](https://www.jwplayer.com/products/discovery-engagement/)

- [OpenAI Whisper Paper](https://arxiv.org/abs/2212.04356)

- [NHK STRL Research Annual Report](https://www.nhk.or.jp/strl/publica/annual/index.html)

- [NAVER Clova Video OCR API](https://www.ncloud.com/product/aiService/ocr)

- [EU AI Act Final Text (2024)](https://eur-lex.europa.eu/eli/reg/2024/1689/oj)