Every day, millions of travelers post short-form videos to TikTok, Instagram Reels, and YouTube Shorts, capturing real restaurants, hidden gems, and walking routes that no travel guide replicates. People bookmark them, but turning them into an actual itinerary means manually tracking down every location, cross-referencing maps, and stitching together a plan by hand.
We built a pipeline that does this automatically. It watches the video, identifies locations, and generates a structured multi-day travel plan. This post walks through the technical architecture: an AI video analysis pipeline built on NVIDIA’s multimodal model stack, provisioned via Nebius, and orchestrated through Nexla’s data platform. We cover how the pipeline works, why we made the model choices we did, and the cost benchmarks that make running it at production scale practical.
The architecture builds on NVIDIA’s Production Scale Solution (PSS) blueprint for video search and summarization. Originally designed for use cases like car crash detection and emergency response timing, the blueprint provides a high-throughput foundation of frame ingestion, entity tracking, and visual model integration that can be extended to new use cases.
For travel, we replaced the Computer Vision (CV) based tracking pipeline with landmark detection and location intelligence. Instead of monitoring for incidents, the system now identifies restaurants, streets, neighborhoods, and visual cues like signage or storefronts.
Nexla’s AI Integration Platform handles the movement, transformation, and routing of video frames and model outputs across the pipeline. This modular design makes the pipeline extensible: what works for TikTok travel clips can be adapted for hotel reviews, food blogs, or any short-form video format.

The pipeline operates in four stages, moving from raw video input to a structured travel itinerary.
Videos are ingested and sampled at configurable frame rates: 3, 5, 10, or 15 frames per second, depending on the quality setting required. Higher frame rates capture more visual detail but increase processing cost. For most travel content, the Standard setting at 5 fps delivers a strong balance of accuracy and efficiency.
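The sampling step boils down to choosing which source frames to keep for a given target rate. Here is a minimal sketch of that index selection (the function name is illustrative, and actual frame decoding, via a library such as OpenCV, is omitted):

```python
def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Return the indices of source frames to keep when downsampling to target_fps."""
    if target_fps >= source_fps:
        # Nothing to drop: keep every frame.
        return list(range(total_frames))
    step = source_fps / target_fps  # e.g. 30 fps source at 5 fps target -> keep every 6th frame
    return [int(round(k * step)) for k in range(int(total_frames / step))]
```

For a 2-second clip shot at 30 fps, the Standard setting (5 fps) keeps 10 of the 60 frames, which is where the roughly 6x cost difference between 3 fps and 15 fps settings comes from.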
Each sampled frame is passed to Nemotron VL (Vision-Language) Nano 12B v2, NVIDIA’s frame-by-frame visual analysis model. For every frame, the model generates a textual description that identifies:
- Landmarks and recognizable locations
- Restaurant names, signage, and storefronts
- Streets and neighborhood cues
This stage handles the core challenge of translating visual content into structured language that downstream systems can use. Landmark detection accuracy at this stage directly determines the quality of the itinerary produced downstream.
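In practice, each frame can be sent to the vision-language model as an image-plus-prompt chat request. The sketch below builds such a request in the OpenAI-compatible format that hosted inference endpoints commonly accept; the exact model identifier, prompt wording, and function name here are assumptions for illustration, not the production values:

```python
import base64

# Illustrative prompt; the production system's actual prompt is not shown here.
FRAME_PROMPT = (
    "Describe this video frame for travel planning. Identify any landmarks, "
    "restaurant or shop names, street signs, and neighborhood cues you can see."
)

def build_frame_request(jpeg_bytes: bytes, model: str = "nvidia/nemotron-vl-nano-12b-v2") -> dict:
    """Build an OpenAI-style chat request carrying one frame as a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,  # assumed identifier; check your provider's model catalog
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": FRAME_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

The returned dict can be posted to a chat-completions endpoint; the text the model returns for each frame becomes the input to the synchronization and summarization stages.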
NVIDIA’s model stack handles visual analysis, but a video-only pipeline leaves other signals on the table. To close that gap, Nexla adds an audio transcription layer running OpenAI’s Whisper in parallel with frame analysis, capturing restaurant names dropped in voiceover, neighborhood call-outs, and travel tips that never appear on screen. Wiring this into the pipeline requires no custom integration work: Nexla routes audio streams alongside video frames, synchronizes outputs from both modalities, and passes the combined signal downstream to the summarization stage. In dense urban environments, where visual landmark detection is hardest, this extra layer meaningfully improves itinerary accuracy. It’s a practical example of how the right orchestration layer closes gaps between model capabilities without rebuilding the pipeline from scratch.
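The synchronization step can be as simple as joining the two streams on timestamps: Whisper-style transcription yields segments with start and end times, and each sampled frame has a known timestamp. A minimal sketch of that join (the function name and record shapes are illustrative, not Nexla’s actual API):

```python
def merge_modalities(frame_notes, transcript_segments):
    """Attach overlapping speech to each frame description by timestamp.

    frame_notes: list of (t_seconds, description) for sampled frames.
    transcript_segments: list of (start_seconds, end_seconds, text) from transcription.
    """
    merged = []
    for t, desc in frame_notes:
        # Collect every transcript segment whose time window covers this frame.
        speech = " ".join(text for start, end, text in transcript_segments if start <= t < end)
        merged.append({"t": t, "frame": desc, "speech": speech})
    return merged
```

This is how a restaurant name spoken in voiceover ends up attached to the frame showing an otherwise unlabeled storefront.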
A single video can generate hundreds or thousands of frame descriptions. Nemotron Super 3 compresses this volume into a cohesive output, synthesizing location data, activity flow, and thematic content from the full video into a structured travel itinerary. Typical output covers several days of planned activities.
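Before the summarization model runs, the per-frame observations have to be collapsed into one prompt. A hedged sketch of that assembly step (prompt wording, function name, and record shape are assumptions for illustration):

```python
def build_summary_prompt(merged_notes, num_days: int = 3) -> str:
    """Collapse timestamped per-frame observations into one summarization prompt.

    merged_notes: list of dicts with 't' (seconds), 'frame' (description),
    and an optional 'speech' (transcribed narration) key.
    """
    lines = []
    for note in merged_notes:
        line = f"[{note['t']:.1f}s] {note['frame']}"
        if note.get("speech"):
            line += f" | narration: {note['speech']}"
        lines.append(line)
    observations = "\n".join(lines)
    return (
        f"From the timestamped observations below, produce a {num_days}-day travel "
        "itinerary. Group nearby locations, keep named places exact, and note "
        "activities mentioned in narration.\n\n" + observations
    )
```

The resulting prompt is what a summarization model such as Nemotron Super 3 would consume; for videos producing thousands of frame descriptions, this step would typically also deduplicate near-identical consecutive descriptions to stay within context limits.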
All NVIDIA models are provisioned via Nebius, enabling API-based access without managing model infrastructure directly.
One of the strongest arguments for this architecture is cost efficiency at scale, which becomes critical when a travel company deploys the use case in production and millions of videos need to be processed. Nexla ran preliminary benchmarks comparing cloud-based API costs against running NVIDIA Inference Microservices (NIMS) on dedicated hardware. Both approaches were tuned to deliver comparable output quality for accurate detection from social media travel videos and subsequent itinerary generation.
Results were gathered over a variety of sample videos of varying length and quality, then projected to a representative real-world workload of one million videos, reflecting a production scenario where the capability is used continuously by end users on dedicated 24/7 hardware.
| Quality Setting | Frame Rate | Cloud API Cost (1M videos) | NIMS Cost (1M videos) | Savings vs. Cloud API |
|---|---|---|---|---|
| Basic | 3 fps | $190K | $45K-$70K | 63-76% |
| Standard | 5 fps | $290K | $60K-$90K | 69-79% |
| Production | 10 fps | $632K | $100K-$150K | 75-85% |
| Premium | 15 fps | $960K | $140K-$210K | 78-85% |
At every quality setting, dedicated NIMS infrastructure delivers substantial savings over cloud APIs. At the Production setting (10 fps), appropriate for high-detail content, cloud API costs reach $632K per million videos while NIMS costs run between $100K and $150K. At the Premium setting (15 fps), savings reach up to 85%.
The scalability advantage becomes most pronounced at higher frame rates, where cost per frame compounds quickly under cloud-based pricing but stays relatively stable with dedicated hardware running at full utilization.
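The savings ranges in the table follow directly from the dollar figures, and it is easy to sanity-check them (the published percentages are rounded, so some rows round outward by a point):

```python
# (cloud_per_1M_videos, nims_low, nims_high) in USD, from the benchmark table above.
COSTS = {
    "Basic":      (190_000,  45_000,  70_000),
    "Standard":   (290_000,  60_000,  90_000),
    "Production": (632_000, 100_000, 150_000),
    "Premium":    (960_000, 140_000, 210_000),
}

def savings_range(setting: str) -> tuple[float, float]:
    """Return (min, max) fractional savings of NIMS vs. cloud API for a quality setting."""
    cloud, nims_low, nims_high = COSTS[setting]
    # Worst-case savings uses the high NIMS estimate, best-case the low one.
    return (1 - nims_high / cloud, 1 - nims_low / cloud)
```

For example, the Standard setting works out to roughly 69-79% savings ($60K-$90K against $290K), matching the table. Per video, Production-quality processing is about $0.63 on cloud APIs versus $0.10-$0.15 on NIMS.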
The pipeline delivers strong results for landmark detection and structured itinerary generation, and several clear paths forward would make it significantly more powerful.
Using the Nexla platform to add user preferences plus review and rating data from travel platforms as context input to Nemotron Super 3 would allow the model to filter and rank recommendations by cuisine type, location, budget, travel style, and reviews.
Current testing shows meaningful differences between models in their ability to identify specific landmarks and locations from social media content, particularly in dense urban environments. Improving detection accuracy at the frame level has a direct downstream effect on itinerary quality. We are evaluating model options and tuning approaches for this stage.
Augmenting planning with additional input types beyond short-form video, including travel blog posts, review text, and conversational prompts from users. This would allow users to combine video and text signals in a single pipeline and would extend the platform to other content-rich travel use cases.
The convergence of multimodal AI, scalable inference infrastructure, and data integration creates a new class of products that could not exist a few years ago. What began as a blueprint for life-saving use cases like car crash detection and emergency response timing can now generate personalized travel plans from the content real travelers create every day.
Building a pipeline like this means stitching together frame samplers, vision models, transcription, and summarization, and making sure data moves reliably between all of them at scale. That’s exactly the problem Nexla is built to solve, enabled by Nebius and NVIDIA technologies.
If you’re working on multimodal AI pipelines and spending more time on data plumbing than on model quality, email us at info@nexla.com.
AI samples frames from a video, analyzes landmarks and on-screen text with vision language models, transcribes narration, and summarizes the results into a structured travel plan.
Travel videos contain visual signals, spoken narration, and contextual clues. Multimodal AI combines vision and audio signals to identify locations and activities more accurately.
Running inference on dedicated hardware dramatically reduces cost when processing large video volumes because cloud APIs charge per request and per frame.
NVIDIA provides the multimodal AI models used for video analysis and summarization. Nebius supplies the GPU infrastructure and inference services to run those models at scale. Nexla orchestrates the pipeline, routing video frames, synchronizing audio transcription, and delivering structured travel itineraries from social media content.