Every day, millions of travelers post short-form videos to TikTok, Instagram Reels, and YouTube Shorts, capturing real restaurants, hidden gems, and walking routes that no travel guide replicates. People bookmark them, but turning them into an actual itinerary means manually tracking down every location, cross-referencing maps, and stitching together a plan by hand.
We built a pipeline that does this automatically. It watches the video, identifies locations, and generates a structured multi-day travel plan. This post walks through the technical architecture: an AI video analysis pipeline built on NVIDIA’s multimodal model stack, provisioned via Nebius, and orchestrated through Nexla’s data platform. We cover how the pipeline works, why we made the model choices we did, and the cost benchmarks that make running it at production scale practical.
The architecture builds on NVIDIA’s Production Scale Solution (PSS) blueprint for video search and summarization. Originally designed for use cases like car crash detection and emergency response timing, the blueprint provides a high-throughput foundation of frame ingestion, entity tracking, and visual model integration that can be extended to new use cases.
For travel, we replaced the Computer Vision (CV) based tracking pipeline with landmark detection and location intelligence. Instead of monitoring for incidents, the system now identifies restaurants, streets, neighborhoods, and visual cues like signage or storefronts.
Nexla’s AI Integration Platform handles the movement, transformation, and routing of video frames and model outputs across the pipeline. This modular design makes the pipeline extensible: what works for TikTok travel clips can be adapted for hotel reviews, food blogs, or any short-form video format.

The pipeline operates in four stages, moving from raw video input to a structured travel itinerary.
Videos are ingested and sampled at configurable frame rates: 3, 5, 10, or 15 frames per second, depending on the quality setting required. Higher frame rates capture more visual detail but increase processing cost. For most travel content, the Standard setting at 5 fps delivers a strong balance of accuracy and efficiency.
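The sampling step boils down to choosing which source frames to keep for a given target rate. Here is a minimal sketch of that index selection (the function name is illustrative, and actual frame decoding, via a library such as OpenCV, is omitted):

```python
def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Return the indices of source frames to keep when downsampling to target_fps."""
    if target_fps >= source_fps:
        # Nothing to drop: keep every frame.
        return list(range(total_frames))
    step = source_fps / target_fps  # e.g. 30 fps source at 5 fps target -> keep every 6th frame
    return [int(round(k * step)) for k in range(int(total_frames / step))]
```

For a 2-second clip shot at 30 fps, the Standard setting (5 fps) keeps 10 of the 60 frames, which is where the roughly 6x cost difference between 3 fps and 15 fps settings comes from.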
Each sampled frame is passed to Nemotron VL (Vision-Language) Nano 12B v2, NVIDIA’s frame-by-frame visual analysis model. For every frame, the model generates a textual description that identifies:
- Landmarks and recognizable locations
- Restaurant names, signage, and storefronts
- Streets and neighborhood cues
This stage handles the core challenge of translating visual content into structured language that downstream systems can use. Landmark detection accuracy at this stage directly determines the quality of the itinerary produced downstream.
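In practice, each frame can be sent to the vision-language model as an image-plus-prompt chat request. The sketch below builds such a request in the OpenAI-compatible format that hosted inference endpoints commonly accept; the exact model identifier, prompt wording, and function name here are assumptions for illustration, not the production values:

```python
import base64

# Illustrative prompt; the production system's actual prompt is not shown here.
FRAME_PROMPT = (
    "Describe this video frame for travel planning. Identify any landmarks, "
    "restaurant or shop names, street signs, and neighborhood cues you can see."
)

def build_frame_request(jpeg_bytes: bytes, model: str = "nvidia/nemotron-vl-nano-12b-v2") -> dict:
    """Build an OpenAI-style chat request carrying one frame as a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,  # assumed identifier; check your provider's model catalog
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": FRAME_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

The returned dict can be posted to a chat-completions endpoint; the text the model returns for each frame becomes the input to the synchronization and summarization stages.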
NVIDIA’s model stack handles visual analysis, but a video-only pipeline leaves other signals on the table. To close that gap, Nexla adds an audio transcription layer running OpenAI’s Whisper in parallel with frame analysis, capturing restaurant names dropped in voiceover, neighborhood call-outs, and travel tips that never appear on screen. Wiring this into the pipeline requires no custom integration work: Nexla routes audio streams alongside video frames, synchronizes outputs from both modalities, and passes the combined signal downstream to the summarization stage. In dense urban environments, where visual landmark detection is hardest, this extra layer meaningfully improves itinerary accuracy. It’s a practical example of how the right orchestration layer closes gaps between model capabilities without rebuilding the pipeline from scratch.
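The synchronization step can be as simple as joining the two streams on timestamps: Whisper-style transcription yields segments with start and end times, and each sampled frame has a known timestamp. A minimal sketch of that join (the function name and record shapes are illustrative, not Nexla’s actual API):

```python
def merge_modalities(frame_notes, transcript_segments):
    """Attach overlapping speech to each frame description by timestamp.

    frame_notes: list of (t_seconds, description) for sampled frames.
    transcript_segments: list of (start_seconds, end_seconds, text) from transcription.
    """
    merged = []
    for t, desc in frame_notes:
        # Collect every transcript segment whose time window covers this frame.
        speech = " ".join(text for start, end, text in transcript_segments if start <= t < end)
        merged.append({"t": t, "frame": desc, "speech": speech})
    return merged
```

This is how a restaurant name spoken in voiceover ends up attached to the frame showing an otherwise unlabeled storefront.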
A single video can generate hundreds or thousands of frame descriptions. Nemotron Super 3 compresses this volume into a cohesive output, synthesizing location data, activity flow, and thematic content from the full video into a structured travel itinerary. Typical output covers several days of planned activities.
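Before the summarization model runs, the per-frame observations have to be collapsed into one prompt. A hedged sketch of that assembly step (prompt wording, function name, and record shape are assumptions for illustration):

```python
def build_summary_prompt(merged_notes, num_days: int = 3) -> str:
    """Collapse timestamped per-frame observations into one summarization prompt.

    merged_notes: list of dicts with 't' (seconds), 'frame' (description),
    and an optional 'speech' (transcribed narration) key.
    """
    lines = []
    for note in merged_notes:
        line = f"[{note['t']:.1f}s] {note['frame']}"
        if note.get("speech"):
            line += f" | narration: {note['speech']}"
        lines.append(line)
    observations = "\n".join(lines)
    return (
        f"From the timestamped observations below, produce a {num_days}-day travel "
        "itinerary. Group nearby locations, keep named places exact, and note "
        "activities mentioned in narration.\n\n" + observations
    )
```

The resulting prompt is what a summarization model such as Nemotron Super 3 would consume; for videos producing thousands of frame descriptions, this step would typically also deduplicate near-identical consecutive descriptions to stay within context limits.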
All NVIDIA models are provisioned via Nebius, enabling API-based access without managing model infrastructure directly.
One of the strongest arguments for this architecture is cost efficiency at scale, which becomes critical when a travel company deploys the use case in production and millions of videos need to be processed. Nexla ran preliminary benchmarks comparing cloud-based API costs against running NVIDIA Inference Microservices (NIMS) on dedicated hardware. Both approaches were tuned to deliver comparable output quality for accurate detection from social media travel videos and subsequent itinerary generation.
Results were gathered over a variety of sample videos of varying length and quality, then projected to a representative real-world workload of one million videos, reflecting a production scenario where the capability is used continuously by end users on dedicated 24/7 hardware.
| Quality Setting | Frame Rate | Cloud API Cost (1M videos) | NIMS Cost (1M videos) | Savings vs. Cloud API |
|---|---|---|---|---|
| Basic | 3 fps | $190K | $45K-$70K | 63-76% |
| Standard | 5 fps | $290K | $60K-$90K | 69-79% |
| Production | 10 fps | $632K | $100K-$150K | 75-85% |
| Premium | 15 fps | $960K | $140K-$210K | 78-85% |
At every quality setting, dedicated NIMS infrastructure delivers substantial savings over cloud APIs. At the Production setting (10 fps), appropriate for high-detail content, cloud API costs reach $632K per million videos while NIMS costs run between $100K and $150K. At the Premium setting (15 fps), savings reach up to 85%.
The scalability advantage becomes most pronounced at higher frame rates, where cost per frame compounds quickly under cloud-based pricing but stays relatively stable with dedicated hardware running at full utilization.
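The savings ranges in the table follow directly from the dollar figures, and it is easy to sanity-check them (the published percentages are rounded, so some rows round outward by a point):

```python
# (cloud_per_1M_videos, nims_low, nims_high) in USD, from the benchmark table above.
COSTS = {
    "Basic":      (190_000,  45_000,  70_000),
    "Standard":   (290_000,  60_000,  90_000),
    "Production": (632_000, 100_000, 150_000),
    "Premium":    (960_000, 140_000, 210_000),
}

def savings_range(setting: str) -> tuple[float, float]:
    """Return (min, max) fractional savings of NIMS vs. cloud API for a quality setting."""
    cloud, nims_low, nims_high = COSTS[setting]
    # Worst-case savings uses the high NIMS estimate, best-case the low one.
    return (1 - nims_high / cloud, 1 - nims_low / cloud)
```

For example, the Standard setting works out to roughly 69-79% savings ($60K-$90K against $290K), matching the table. Per video, Production-quality processing is about $0.63 on cloud APIs versus $0.10-$0.15 on NIMS.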
The pipeline delivers strong results for landmark detection and structured itinerary generation, and several clear paths forward would make it significantly more powerful.
Using the Nexla platform to add user preferences plus review and rating data from travel platforms as context input to Nemotron Super 3 would allow the model to filter and rank recommendations by cuisine type, location, budget, travel style, and reviews.
Current testing shows meaningful differences between models in their ability to identify specific landmarks and locations from social media content, particularly in dense urban environments. Improving detection accuracy at the frame level has a direct downstream effect on itinerary quality. We are evaluating model options and tuning approaches for this stage.
Augmenting planning with additional input types beyond short-form video, including travel blog posts, review text, and conversational prompts from users. This would allow users to combine video and text signals in a single pipeline and would extend the platform to other content-rich travel use cases.
The convergence of multimodal AI, scalable inference infrastructure, and data integration creates a new class of products that could not exist a few years ago. What began as a blueprint for life-saving use cases like car crash detection and emergency response timing can now generate personalized travel plans from the content real travelers create every day.
Building a pipeline like this means stitching together frame samplers, vision models, transcription, and summarization, and making sure data moves reliably between all of them at scale. That’s exactly the problem Nexla is built to solve, enabled by Nebius and NVIDIA technologies.
If you’re working on multimodal AI pipelines and spending more time on data plumbing than on model quality, email us at info@nexla.com.
AI samples frames from a video, analyzes landmarks and on-screen text with vision language models, transcribes narration, and summarizes the results into a structured travel plan.
Travel videos contain visual signals, spoken narration, and contextual clues. Multimodal AI combines vision and audio signals to identify locations and activities more accurately.
Running inference on dedicated hardware dramatically reduces cost when processing large video volumes because cloud APIs charge per request and per frame.
NVIDIA provides the multimodal AI models used for video analysis and summarization. Nebius supplies the GPU infrastructure and inference services to run those models at scale. Nexla orchestrates the pipeline, routing video frames, synchronizing audio transcription, and delivering structured travel itineraries from social media content.