Voice AILogisticsHindi NLPFastAPI

Hindi Voice Agent for Logistics

Turning raw call audio into structured truck, route, and pricing intelligence

WeBuildTech·December 5, 2025

At a Glance

SectorLogistics / Freight

Solution typeHindi voice agent

Primary usersHindi-speaking truck owners and field users

Data capturedTruck type, dimensions, capacity, base city, favourite route, bidirectional route pricing, and additional destination-price pairs

Core stackFastAPI, Faster-Whisper / Deepgram, Mistral / Gemini, Coqui TTS / Sarvam AI, WebSocket-enabled interaction

The Client Challenge

Critical truck and route information often sits inside conversations, not structured systems.
End users are more likely to answer naturally in Hindi over a call than complete long digital forms.
The workflow needed to capture both static fleet attributes and variable commercial inputs such as route pricing.
The system had to feel conversational, not robotic, while still keeping the questioning disciplined and sequential.

What the business needed

A low-friction channel to gather operational details from field users quickly and consistently.

What WeBuildTech designed

A guided call flow that asked one question at a time, preserved context, and converted every turn into machine-usable information.

Why a Voice-First Workflow Made Sense

Natural for the user

Truck owners can answer in spoken Hindi without learning a new interface or adapting to form-heavy workflows.

Useful for the business

Each conversation becomes a source of structured fleet and route intelligence instead of staying trapped in informal verbal exchanges.

Ready for scale

The same flow can later connect to telephony, automated extraction, QA scoring, and operational dashboards.

The Solution WeBuildTech Built

At its core, the solution is a turn-based conversational backend. It greets the user, captures speech, transcribes the response, appends it to conversation history, generates the next best prompt using domain-specific instructions, and returns audio back to the user. The architecture is intentionally modular so speech, reasoning, and voice layers can be swapped as the product matures.

Hindi-first interaction design for real operator comfort.
One-question-at-a-time prompt strategy to keep the flow controlled.
Conversation history tracking to avoid losing context between turns.
Pluggable STT / LLM / TTS layers for faster experimentation and future production hardening.

What the architecture captures

User-facing interaction through recorded audio, streamed audio, or browser-led capture.
Speech-to-text through either local or API-based engines.
Reasoning grounded in prompt rules and accumulated conversation history.
Hindi audio response generation for the next question in the flow.
Persistent recordings and transcripts that make auditing and future analytics possible.

Data Model Hidden Inside the Conversation

Although the code does not yet show a final structured extraction layer, the prompt design makes the target schema clear. The conversation is meant to gather:

Truck identity and physical profile: type, length, width, and weight or load capacity.
Operating anchor: base city and favourite route.
Commercial intelligence: route-wise pricing in both directions.
Market expansion inputs: three additional destinations with their corresponding price points.

Conversation Design and Product Logic

The most important product decision was not the choice of model. It was the design of the questioning sequence. The prompt explicitly tells the agent to stay in Hindi, ask one question at a time, move forward when an answer is weak, and repeat numeric values carefully. That is exactly the kind of control logic that makes a voice agent usable in operational settings.

Conversation history is stored centrally so the next question can build on the previous answer.
The prompt does not try to do everything at once; it sequences the call around a very specific business objective.
The system has explicit tolerance for incomplete answers, which is important in noisy real-world voice interactions.
The backend supports both file-oriented and near-real-time interaction patterns, which is a strong foundation for product iteration.

From Proof of Concept to Production

One of the strongest aspects of this project is the visible maturation path. The codebase moved from a local experiment into a cleaner services-based structure with faster API integrations and WebSocket-enabled interaction — demonstrating product management discipline, not just model experimentation.

STT

Faster-Whisper (local) → Deepgram Nova-2 — more flexible performance and API-friendly deployment.

LLM

Mistral via llama.cpp → Gemini 1.5 Flash — faster orchestration with lower operational complexity.

TTS

Coqui TTS → Sarvam AI Hindi TTS — better fit for a Hindi-first product experience.

Interaction

File upload + simple HTML → WebSocket transcript loop — closer to real-time conversational behaviour.

Business Value Delivered

Immediate value

A repeatable voice workflow for collecting truck and route information. Lower friction for end users who prefer phone-style interactions. A modular backend that supports ongoing iteration.

Next-phase opportunities

Telephony integration for actual outbound and inbound calling. Entity extraction into structured JSON or database records. Confidence scoring, agent QA, review dashboards, and human fallback.

Want something similar built?

Let's talk about your problem and how we can design a solution around it.

Book Discussion