Build a Cost-Aware AI Support Triage API with Dynamic Model Routing

Many AI applications begin with a single model handling all tasks. While suitable for prototypes, this approach quickly becomes inefficient and costly when an endpoint needs to manage diverse, complex tasks like classification, urgency scoring, customer replies, and summarization. Each of these tasks benefits from different model characteristics regarding cost, latency, and quality. This tutorial demonstrates how to build a FastAPI support triage API that intelligently routes requests to the most appropriate AI model for each specific job, ensuring optimal performance and cost efficiency without hardcoding model choices into your application.

Step 1: Understanding Dynamic AI Model Routing

At the heart of an efficient AI system for varied tasks is the concept of dynamic model routing.

The Challenge with Single Models

Using one model for every task forces a bad trade-off. Classification, summarization, and reply drafting have different cost, latency, and quality requirements, so a single powerful model overpays for simple tasks, while a single cheap model underperforms on hard ones. Working around this inside the application, for example with per-task model names scattered through the code, complicates the logic and turns every model change into a code change.

The Solution: Serverless Inference and Inference Routers

An Inference Router acts as an intelligent intermediary. It allows you to define tasks (named jobs), model pools (sets of candidate models), and selection policies (rules like lowest cost or latency). Your application requests a task, and the router dynamically selects the optimal model. Serverless inference complements this by providing API access to models without infrastructure management, simplifying deployment and scaling.
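
The selection step can be pictured with a small sketch. This is not how any particular provider implements routing; the `Model` fields, the model names, and the prices and latencies are made-up placeholders chosen only to show how a pool and a policy fit together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # placeholder price, not a real quote
    p50_latency_ms: int        # placeholder latency, not a measurement


# Selection policies: each maps a candidate model to a sortable key.
POLICIES = {
    "lowest_cost": lambda m: m.cost_per_1k_tokens,
    "lowest_latency": lambda m: m.p50_latency_ms,
}


def select_model(pool: list[Model], policy: str) -> Model:
    """Return the model in the pool that minimizes the policy's key."""
    if not pool:
        raise ValueError("model pool is empty")
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    return min(pool, key=POLICIES[policy])


pool = [
    Model("small-model", cost_per_1k_tokens=0.10, p50_latency_ms=400),
    Model("large-model", cost_per_1k_tokens=1.50, p50_latency_ms=250),
]

print(select_model(pool, "lowest_cost").name)     # small-model
print(select_model(pool, "lowest_latency").name)  # large-model
```

The application never sees this logic. It only names the task, which is what keeps model choices out of your code.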

Step 2: Setting Up Your Development Environment

Before building the API, ensure your environment is ready. You'll need Python, a few libraries, and access to a serverless inference service.

Install Python: Ensure you have Python 3.10 or newer installed on your system.
Create a Project Directory: Set up a new directory for your project.
```
mkdir support-triage
cd support-triage
```

Set up a Virtual Environment: It's good practice to isolate your project dependencies.

```
python3 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install Dependencies: Create a requirements.txt file with the following content. FastAPI is a high-performance web framework for building APIs in Python, based on standard type hints. Uvicorn serves the application, httpx makes the asynchronous calls to the router, and python-dotenv loads the .env file.
```
fastapi
uvicorn[standard]
python-dotenv
httpx
pydantic
```
Then install them:
```
pip install -r requirements.txt
```
Configure Environment Variables: Create a .env file in your project root. You will need an API endpoint URL for your inference router and an access key. For example, if using DigitalOcean's Inference Router, your .env might look like this:
```
INFERENCE_ROUTER_URL="https://inference.do-ai.run/v1/"
INFERENCE_ACCESS_KEY="your_inference_access_key_here"
```
Replace "your_inference_access_key_here" with your actual access key. The key authenticates your requests to the inference router, so keep the .env file out of version control.
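
If you want the application to fail fast on a misconfigured environment, a small validation helper is one option. The sketch below is an assumption about how you might structure this, not part of any provider SDK; `RouterSettings` and `load_settings` are hypothetical names. It takes the environment as a mapping so it can be exercised without touching real variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlparse


@dataclass(frozen=True)
class RouterSettings:
    base_url: str    # always stored with exactly one trailing slash
    access_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> RouterSettings:
    """Read and validate the two variables used in this tutorial."""
    url = env.get("INFERENCE_ROUTER_URL", "").strip()
    key = env.get("INFERENCE_ACCESS_KEY", "").strip()
    if not url or not key:
        raise RuntimeError("INFERENCE_ROUTER_URL and INFERENCE_ACCESS_KEY must be set")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise RuntimeError(f"INFERENCE_ROUTER_URL is not a valid HTTP(S) URL: {url!r}")
    # Normalize so later code can append "chat/completions" safely.
    return RouterSettings(base_url=url.rstrip("/") + "/", access_key=key)


settings = load_settings({
    "INFERENCE_ROUTER_URL": "https://inference.do-ai.run/v1",
    "INFERENCE_ACCESS_KEY": "example-key",
})
print(settings.base_url)  # https://inference.do-ai.run/v1/
```

Normalizing the trailing slash here means a base URL written with or without it produces the same request path.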

Step 3: Defining Triage Tasks and Router Policies

The exact configuration of an inference router varies by provider, but the core principles are consistent. You'll define the specific tasks your triage system needs and assign models to them. For this API, we'll use four tasks:

classify_ticket: To categorize the support issue (e.g., 'billing', 'bug', 'how-to', 'account').
score_urgency: To assess the severity and sentiment of the customer's message.
draft_reply: To generate a concise, customer-facing response.
summarize_escalation: To create a structured brief for a human agent, especially for complex tickets.

When configuring your router, you typically:

Create Tasks: Define each task with a clear description.
Assign Model Pools: For each task, specify a pool of available AI models (e.g., smaller models for classification, larger for summarization).
Set Selection Policies: Apply policies to each task's model pool (e.g., lowest cost for classification, high quality with fallback for summarization).

Your FastAPI application will send the task name and prompt, and the router handles the rest.
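
As a rough mental model of what a pool with fallback means, the sketch below declares a pool for each of the four tasks and walks it in order until one model answers. Everything here is illustrative: the pool entries (`small-model`, `medium-model`, `large-model`) and the `call_model` callable are hypothetical stand-ins, and a real router performs this on the provider's side rather than in your code.

```python
from typing import Callable

# Hypothetical pools; list order doubles as fallback order in this sketch.
TASK_POOLS: dict[str, list[str]] = {
    "classify_ticket": ["small-model", "medium-model"],
    "score_urgency": ["small-model", "medium-model"],
    "draft_reply": ["medium-model", "large-model"],
    "summarize_escalation": ["large-model", "medium-model"],
}


def run_task(task: str, prompt: str, call_model: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each model in the task's pool in order; return (model, output)."""
    if task not in TASK_POOLS:
        raise KeyError(f"unknown task: {task}")
    errors = []
    for model in TASK_POOLS[task]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError(f"all models failed for {task}: {errors}")


def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a model call; the large model is 'down' here."""
    if model == "large-model":
        raise TimeoutError("simulated outage")
    return f"[{model}] summary of: {prompt}"


print(run_task("summarize_escalation", "Login fails after reset", fake_call))
# ('medium-model', '[medium-model] summary of: Login fails after reset')
```

Because the preferred model for `summarize_escalation` fails in this simulation, the call falls through to the next entry in the pool, and the caller still receives an answer.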

Step 4: Building the FastAPI Triage Endpoint

Now, let's create the FastAPI application that will interact with your configured inference router. Create a file named main.py in your project directory.

```
import os
from typing import Any, Dict, Optional

import httpx
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Load environment variables from .env file
load_dotenv()

app = FastAPI(
    title="AI Support Triage API",
    description="API for dynamically routing support tickets to specialized AI models.",
)

# Retrieve router URL and access key from environment variables
INFERENCE_ROUTER_URL = os.getenv("INFERENCE_ROUTER_URL")
INFERENCE_ACCESS_KEY = os.getenv("INFERENCE_ACCESS_KEY")

if not INFERENCE_ROUTER_URL or not INFERENCE_ACCESS_KEY:
    raise ValueError("INFERENCE_ROUTER_URL and INFERENCE_ACCESS_KEY must be set in .env")

# Build the endpoint once; works whether or not the base URL ends with a slash
CHAT_COMPLETIONS_URL = f"{INFERENCE_ROUTER_URL.rstrip('/')}/chat/completions"


# Pydantic model for the incoming support ticket payload
class SupportTicket(BaseModel):
    ticket_id: str
    customer_message: str
    context: Optional[str] = None  # Additional context like previous interactions


# Pydantic model for the response from the triage API
class TriageResponse(BaseModel):
    ticket_id: str
    classification: str
    urgency_score: float
    drafted_reply: str
    escalation_summary: Optional[str] = None


async def call_inference_router(task_name: str, prompt: str) -> Dict[str, Any]:
    """Helper function to call the inference router for a specific task."""
    headers = {
        "Authorization": f"Bearer {INFERENCE_ACCESS_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": task_name,  # The router uses the 'model' field to identify the task
        "messages": [
            {"role": "user", "content": prompt}
        ],
    }
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(
                CHAT_COMPLETIONS_URL,
                headers=headers,
                json=payload,
                timeout=30.0,  # Longer than httpx's default to allow for slow inference
            )
            response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
            return response.json()
        except httpx.HTTPStatusError as e:
            print(f"HTTP error for task {task_name}: {e.response.status_code} - {e.response.text}")
            raise HTTPException(status_code=e.response.status_code, detail=f"Inference router error for {task_name}: {e.response.text}")
        except httpx.RequestError as e:
            print(f"Request error for task {task_name}: {e}")
            # 502 Bad Gateway: the failure is upstream of this API
            raise HTTPException(status_code=502, detail=f"Network error communicating with inference router for {task_name}")


def extract_content(response: Dict[str, Any], task_name: str) -> str:
    """Pull the assistant message text out of a chat-completions style response."""
    try:
        return response["choices"][0]["message"]["content"].strip()
    except (KeyError, IndexError, TypeError, AttributeError):
        raise HTTPException(status_code=502, detail=f"Unexpected response format from inference router for {task_name}")


@app.post("/triage", response_model=TriageResponse)
async def triage_ticket(ticket: SupportTicket):
    """
    Receives a support ticket and processes it through various AI models
    via the inference router for classification, urgency, reply drafting, and summarization.
    """
    full_message = f"Customer Message: {ticket.customer_message}"
    if ticket.context:
        full_message += f"\nContext: {ticket.context}"

    # 1. Classify the ticket
    classification_prompt = f"Classify the following customer support message into one category: billing, bug, how-to, account, feature request, other. Only return the category name.\n\n{full_message}"
    classification_response = await call_inference_router("classify_ticket", classification_prompt)
    classification = extract_content(classification_response, "classify_ticket").lower()

    # 2. Score urgency (a single scale, with sentiment as one input to it)
    urgency_prompt = f"Rate the urgency of the following customer support message on a scale of 0.0 (not urgent) to 1.0 (extremely urgent), taking the customer's sentiment into account. Provide only the numerical score.\n\n{full_message}"
    urgency_response = await call_inference_router("score_urgency", urgency_prompt)
    try:
        urgency_score = float(extract_content(urgency_response, "score_urgency"))
        urgency_score = min(1.0, max(0.0, urgency_score))  # Keep the score within the stated range
    except ValueError:
        urgency_score = 0.5  # Default if the model did not return a bare number

    # 3. Draft a customer reply
    reply_prompt = f"Draft a concise, polite, and helpful customer-facing reply to the following support message. Keep it under 50 words.\n\n{full_message}"
    reply_response = await call_inference_router("draft_reply", reply_prompt)
    drafted_reply = extract_content(reply_response, "draft_reply")

    # 4. Conditionally summarize for escalation (e.g., if urgency is high or classification suggests complexity)
    escalation_summary: Optional[str] = None
    if urgency_score > 0.7 or classification in ["bug", "feature request"]:
        summary_prompt = f"Summarize the following customer support message and context into a brief for a human agent, highlighting key issues and required actions. Keep it under 150 words.\n\n{full_message}"
        summary_response = await call_inference_router("summarize_escalation", summary_prompt)
        escalation_summary = extract_content(summary_response, "summarize_escalation")

    return TriageResponse(
        ticket_id=ticket.ticket_id,
        classification=classification,
        urgency_score=urgency_score,
        drafted_reply=drafted_reply,
        escalation_summary=escalation_summary,
    )
```
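
Model output is free text, so the classification and urgency values may not arrive in exactly the requested form (for example `Category: Billing.` or `Urgency: 0.9`). One way to harden the endpoint is to normalize both values before using them. The helper names below (`parse_category`, `parse_urgency`) are hypothetical, and the fallback values are assumptions you should adapt to your own triage rules.

```python
import re

# Categories from the classification prompt in this tutorial.
ALLOWED_CATEGORIES = ("feature request", "billing", "bug", "how-to", "account", "other")


def parse_category(raw: str, default: str = "other") -> str:
    """Return the first entry of ALLOWED_CATEGORIES found as a whole word in the output."""
    text = raw.strip().lower()
    for category in ALLOWED_CATEGORIES:
        if re.search(rf"\b{re.escape(category)}\b", text):
            return category
    return default


def parse_urgency(raw: str, default: float = 0.5) -> float:
    """Extract the first number in the output and clamp it to [0.0, 1.0]."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))


print(parse_category("Category: Billing."))  # billing
print(parse_urgency("Urgency: 0.9 (high)"))  # 0.9
print(parse_urgency("very urgent"))          # 0.5
print(parse_urgency("7"))                    # 1.0
```

With helpers like these, an unexpected phrasing degrades to a known default instead of producing an unrecognized category or an out-of-range score in the escalation check.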

Step 5: Testing Your Dynamic Triage API

With your FastAPI application coded, it's time to test the dynamic routing in action.

Run the FastAPI Application:
```
uvicorn main:app --reload
```
This will start your API server, typically on http://127.0.0.1:8000.

Send a Test Request: You can use a tool like curl or a Python script to send a POST request to your /triage endpoint.

Example curl request:

```
curl -X POST "http://127.0.0.1:8000/triage" \
     -H "Content-Type: application/json" \
     -d '{ "ticket_id": "TICKET-001", "customer_message": "My account is locked and I cannot access my billing information. This is urgent!", "context": "User tried resetting password twice without success." }'
```

Example Python script (test_api.py):

```
import asyncio

import httpx


async def test_triage_api():
    url = "http://127.0.0.1:8000/triage"
    headers = {"Content-Type": "application/json"}
    payload = {
        "ticket_id": "TICKET-002",
        "customer_message": "I need to know how to change my profile picture on your platform. Can you help me find the settings?",
        "context": "New user, first interaction.",
    }

    # The endpoint makes several model calls in sequence, so allow more
    # time than httpx's 5-second default before giving up
    async with httpx.AsyncClient(timeout=120.0) as client:
        try:
            response = await client.post(url, headers=headers, json=payload)
            response.raise_for_status()
            print("API Response:")
            print(response.json())
        except httpx.HTTPStatusError as e:
            print(f"Error response {e.response.status_code}: {e.response.text}")
        except httpx.RequestError as e:
            print(f"An error occurred while requesting {e.request.url!r}: {e}")


if __name__ == "__main__":
    asyncio.run(test_triage_api())
```

Observe the Output: The API will return a JSON object with processed information. Behind the scenes, the inference router dynamically selected the optimal AI model for each task based on your defined policies. This pattern offers significant benefits: clean application code, externalized model selection, and cost savings with improved performance through intelligent model use and built-in fallbacks.
