Overview
This page documents the full embedding lifecycle in MCPHub, from the moment indexing is triggered to the moment vectors are persisted in PostgreSQL. It also covers provider request execution, adaptive pacing, retry behavior, queue-based serialization, and the follow-up resynchronization logic used when vector dimensions change. The lifecycle applies to two major use cases:
- Tool indexing, where MCPHub creates embeddings for MCP tools and stores them in the vector database.
- Query embedding, where MCPHub creates an embedding for a user search query in order to run semantic similarity search.
When Embedding Creation Starts
Embedding creation can begin from several entry points:
- When an MCP server connects successfully and MCPHub loads its tool list.
- When an MCP server reconnects and its tools are reloaded.
- When an OpenAPI-backed server is initialized successfully.
- When a single tool is re-synced explicitly.
- When Smart Routing configuration changes and MCPHub triggers a full sync of all connected servers.
- When a full resync is scheduled because the configured embedding dimensions no longer match the database schema.
- When Smart Routing handles a user query and needs an embedding for semantic search.
End-to-End Flow
Tool Indexing Flow
Tool indexing starts in the server lifecycle layer after tool discovery succeeds. MCPHub builds a searchable text payload per tool by combining:
- Tool name
- Tool description
- Top-level schema property names
- Nested input schema property names
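As a sketch, the payload assembly described above might look like the following; the type and function names are illustrative assumptions, not MCPHub's actual API:

```typescript
// Hypothetical sketch: combine tool name, description, and schema property
// names (top-level and nested) into one searchable text payload.
interface ToolSchema {
  properties?: Record<string, { properties?: Record<string, unknown> }>;
}

interface Tool {
  name: string;
  description?: string;
  inputSchema?: ToolSchema;
}

function buildSearchableText(tool: Tool): string {
  const parts: string[] = [tool.name, tool.description ?? ""];
  const props = tool.inputSchema?.properties ?? {};
  for (const [propName, prop] of Object.entries(props)) {
    parts.push(propName); // top-level schema property names
    for (const nested of Object.keys(prop.properties ?? {})) {
      parts.push(nested); // nested input schema property names
    }
  }
  return parts.filter(Boolean).join(" ");
}
```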
Important Property
Every tool is processed individually, but provider calls are not executed independently. All embedding provider requests share one queue, which means parallel syncs do not bypass rate limiting.
Query Embedding Flow
Search queries use the same embedding generator as tool indexing. This is important because:
- Query-time requests use the same provider selection rules.
- Query-time requests use the same queue.
- Query-time requests use the same pacing and retry logic.
Provider Selection and Request Construction
MCPHub currently supports three main execution paths:
1. Azure OpenAI Path
- Validates Azure endpoint, API key, API version, and deployment name.
- Truncates text using the configured underlying Azure embedding model.
- Sends a direct HTTP request with axios.
- Wraps the provider request with the shared queue and retry logic.
2. OpenAI-Compatible Path
- Loads Smart Routing provider settings.
- Normalizes whitespace before tokenization.
- Applies token truncation based on the configured embedding model.
- Applies an extra token safety factor for providers such as SiliconFlow, where local and server-side token counts can differ slightly.
- Determines whether to request embeddings in float or base64 format.
- Uses the OpenAI SDK client for the provider call.
- Wraps the provider request with the shared queue and retry logic.
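The normalization and truncation steps above can be sketched as follows. The 4-characters-per-token estimate and the 0.9 safety factor are illustrative assumptions, not MCPHub's actual tokenizer or configured factor:

```typescript
// Crude token estimate (≈4 chars per token); a stand-in for a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Normalize whitespace, then truncate to a token budget reduced by a safety
// factor for providers (e.g. SiliconFlow) whose server-side counts may differ.
function truncateForProvider(
  text: string,
  maxTokens: number,
  safetyFactor = 0.9, // assumed value for illustration
): string {
  const normalized = text.replace(/\s+/g, " ").trim();
  const budget = Math.floor(maxTokens * safetyFactor);
  if (estimateTokens(normalized) <= budget) return normalized;
  return normalized.slice(0, budget * 4); // back to an approximate char budget
}
```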
3. Fallback Embedding Path
If a required provider key is missing, MCPHub generates a deterministic low-dimensional fallback vector locally. This fallback path:
- Does not call any external embedding provider.
- Does not use the shared provider queue.
- Still produces a vector that can be stored in PostgreSQL.
- May trigger a dimension mismatch if the database currently expects a different vector size.
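A deterministic local fallback of this kind can be sketched as a hashing scheme. The dimension, hash function, and normalization below are illustrative assumptions, not MCPHub's actual implementation:

```typescript
// Hash each token into a small fixed-size bucket vector, then L2-normalize.
// The same text always yields the same vector, with no provider call.
function fallbackEmbedding(text: string, dim = 64): number[] {
  const vec = new Array<number>(dim).fill(0);
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    let h = 0;
    for (let i = 0; i < token.length; i++) {
      h = (h * 31 + token.charCodeAt(i)) >>> 0; // simple rolling hash
    }
    vec[h % dim] += 1;
  }
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0)) || 1;
  return vec.map((x) => x / norm);
}
```

Because the dimension here differs from typical provider models, storing such a vector is exactly the situation that can trigger the dimension mismatch mentioned above.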
Shared Queue, Pacing, and Retries
The most important protection mechanism in the embedding pipeline is the shared queue. Every provider-backed embedding request passes through one promise chain. This guarantees that parallel sync jobs cannot send requests concurrently and accidentally bypass pacing rules.
Queue and Retry Subprocess
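The single promise chain can be sketched minimally as follows; this illustrates only the serialization idea, while MCPHub's actual queue also applies the pacing delays described in this section:

```typescript
// One shared promise tail: every task chains onto it, so at most one
// provider-backed request is in flight at a time.
let queueTail: Promise<unknown> = Promise.resolve();

function enqueue<T>(task: () => Promise<T>): Promise<T> {
  const run = queueTail.then(() => task()); // run after the previous task
  queueTail = run.catch(() => undefined);   // keep the chain alive on errors
  return run;
}
```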
Pacing Rules
The pacing system is adaptive and precisely aligned with provider rate-limit windows (typically 1-minute RPM):
- MCPHub starts with a 0ms base delay between provider calls by default. No pre-emptive wait is introduced — the system reacts to rate-limit responses rather than guessing in advance.
- When a 403 or 429 error is detected, the pacing delay jumps immediately to 63000ms.
- That delay remains in effect for subsequent queued provider calls while rate-limit responses continue.
- After 63 seconds without new rate-limit errors, the delay immediately resets to the configured base level.
- 63s step / 63s maximum: A single 403/429 is treated as a signal that retrying earlier is not useful, so MCPHub enters full protection immediately.
- 63s cooldown: This tracks the 1-minute RPM reset window with a small safety margin before returning to normal throughput.
- Retry-After precedence: If the provider returns Retry-After, MCPHub honors that value instead of applying the local 63-second fallback.
- Normal state: 0ms pacing delay
- After first 403/429: 63000ms pacing delay
- After 63s without new 403/429: immediate return to base level (0ms by default)
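The pacing state machine above can be sketched as follows; the constants come from the text, while the function names are illustrative:

```typescript
const BASE_DELAY_MS = 0;           // default base delay between provider calls
const RATE_LIMIT_DELAY_MS = 63_000; // protection delay after a 403/429
const COOLDOWN_MS = 63_000;         // window without new 403/429 before reset

let lastRateLimitAt: number | null = null;

// A single 403/429 puts the system into full protection immediately.
function onRateLimit(now: number): void {
  lastRateLimitAt = now;
}

// Delay applied before the next queued provider call.
function currentDelay(now: number): number {
  if (lastRateLimitAt !== null && now - lastRateLimitAt < COOLDOWN_MS) {
    return RATE_LIMIT_DELAY_MS;
  }
  return BASE_DELAY_MS; // immediate reset after 63s without new 403/429
}
```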
Retry Strategy
MCPHub uses different retry strategies depending on the error type, reflecting that rate-limit errors and infrastructure errors have fundamentally different characteristics.
For rate-limit errors (403/429):
- If the server provides a Retry-After header, MCPHub honors it unconditionally — no attempt budget or time budget overrides the server’s instruction.
- If no Retry-After is present, MCPHub applies a 63-second cooldown per retry, bounded by a 5-minute total budget. This prevents indefinite retries when the provider gives no guidance.
For server errors (503/504):
- MCPHub uses a fixed exponential sequence of exactly 7 attempts: 4s, 8s, 16s, 30s, 60s, 120s, 240s.
- No time budget is imposed — all 7 attempts are always made regardless of elapsed time.
- Random jitter is added to each delay to prevent thundering herd effects.
In practice, a rate-limited request proceeds as follows:
- The first 403/429 is examined immediately for Retry-After.
- If missing, the system waits 63 seconds (guaranteeing the 60-second hard window has passed).
- If the next attempt is rate-limited again, MCPHub repeats the same rule: Retry-After if present, otherwise another 63-second wait.
- Without Retry-After, the operation fails definitively once the 5-minute budget is exhausted.
Retry Rules
Retries apply only to specific status codes:
- 403 (Forbidden / Rate Limit)
- 429 (Too Many Requests)
- 503 (Service Unavailable)
- 504 (Gateway Timeout)
1. Rate Limit Errors (403/429) - Priority to Server Information
When a 403/429 occurs, MCPHub:
- On every 403/429 failure: checks for a Retry-After response header.
- If present, honors the server’s wait time unconditionally — no time budget applies.
- If absent, applies a 63-second hardcoded cooldown as the local fallback.
- Retries the same provider request after the wait.
- Repeats the same rule for the next 403/429 failure.
- When Retry-After is absent: stops when the next 63-second wait would exceed the 5-minute retry budget.
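The wait rule above can be condensed into a single decision function; this is a hedged illustration of the described behavior, not MCPHub's actual code:

```typescript
const FALLBACK_COOLDOWN_MS = 63_000;  // local fallback when no Retry-After
const TOTAL_BUDGET_MS = 5 * 60_000;   // 5-minute budget for the fallback path

// Returns the wait in ms before the next attempt, or null when the retry
// budget is exhausted. Retry-After always wins; the budget only constrains
// the local 63-second fallback.
function nextRateLimitWait(
  retryAfterSeconds: number | undefined,
  elapsedMs: number,
): number | null {
  if (retryAfterSeconds !== undefined) {
    return retryAfterSeconds * 1000; // server instruction: no budget applies
  }
  if (elapsedMs + FALLBACK_COOLDOWN_MS > TOTAL_BUDGET_MS) {
    return null; // next 63s wait would exceed the budget: fail definitively
  }
  return FALLBACK_COOLDOWN_MS;
}
```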
2. Server Errors (503/504) - Exponential Backoff
For transient server errors:
- Uses a fixed exponential sequence: 4s, 8s, 16s, 30s, 60s, 120s, 240s — exactly 7 retry attempts.
- No time budget is imposed. All 7 attempts are always made, regardless of total elapsed time (~8 minutes maximum when the full sequence is exhausted).
- Does not apply the 63-second rate-limit cooldown.
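The 503/504 schedule can be sketched as follows; the delay values come from the text, while the jitter range is an assumption:

```typescript
// Fixed backoff schedule for 503/504: exactly 7 attempts, no time budget.
const BACKOFF_MS = [4_000, 8_000, 16_000, 30_000, 60_000, 120_000, 240_000];

// Returns the delay for a given retry attempt (0-based), or null once the
// whole sequence is exhausted.
function serverErrorDelay(attempt: number): number | null {
  if (attempt >= BACKOFF_MS.length) return null;
  const jitter = Math.random() * 1000; // assumed range; spreads retries out
  return BACKOFF_MS[attempt] + jitter;
}
```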
Example Retry Sequence for SiliconFlow (10 RPM, no Retry-After header)
This example reflects the observed SiliconFlow behavior described above, including the undocumented 11th-request-within-a-minute 403 response seen in practice. It should not be read as an official provider guarantee.
Error Logging During Embedding Requests
When a provider request fails after retries, MCPHub emits structured diagnostic logs rather than dumping the raw error object. The logged fields include:
- Error name
- Message
- Extracted status
- Provider code when available
- Response status when available
- Nested provider error message when available
- Request ID when present in response headers
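A hedged sketch of this field extraction, assuming an axios/OpenAI-SDK-like error shape (the exact shape and header name are assumptions):

```typescript
// Structured diagnostic record mirroring the fields listed above.
interface ProviderErrorLog {
  name: string;
  message: string;
  status?: number;          // extracted status
  code?: string;            // provider code when available
  responseStatus?: number;  // response status when available
  providerMessage?: string; // nested provider error message
  requestId?: string;       // request ID from response headers
}

function toErrorLog(err: any): ProviderErrorLog {
  return {
    name: err?.name ?? "Error",
    message: err?.message ?? String(err),
    status: err?.status,
    code: err?.code,
    responseStatus: err?.response?.status,
    providerMessage: err?.response?.data?.error?.message,
    requestId: err?.response?.headers?.["x-request-id"],
  };
}
```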
Database Persistence Flow
For tool indexing, each successful embedding is persisted together with:
- Entity type
- Composite tool identifier
- Searchable text
- Embedding vector
- Tool metadata, including server name, tool name, description, and input schema
- Embedding model identifier
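The persisted record might be modeled roughly as follows; the field names, types, and composite identifier format are illustrative assumptions, not MCPHub's actual schema:

```typescript
// Illustrative model of a persisted tool embedding record.
interface ToolEmbeddingRecord {
  entityType: "tool";
  entityId: string; // composite tool identifier
  searchableText: string;
  embedding: number[];
  metadata: {
    serverName: string;
    toolName: string;
    description?: string;
    inputSchema?: unknown;
  };
  model: string; // embedding model identifier
}

function buildToolRecord(
  serverName: string,
  toolName: string,
  searchableText: string,
  embedding: number[],
  model: string,
  description?: string,
  inputSchema?: unknown,
): ToolEmbeddingRecord {
  return {
    entityType: "tool",
    entityId: `${serverName}:${toolName}`, // hypothetical composite id format
    searchableText,
    embedding,
    metadata: { serverName, toolName, description, inputSchema },
    model,
  };
}
```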
Dimension Validation and Schema Management
Before persisting the first tool embedding of a server sync, MCPHub validates that the database vector column can store the generated vector length.
Index Strategy
MCPHub chooses an index strategy based on vector size:
- Up to 2000 dimensions: standard vector with HNSW, with IVFFlat as a fallback.
- 2001 to 4000 dimensions: halfvec with HNSW when supported by pgvector.
- More than 4000 dimensions: no optimized vector index is created.
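The dimension thresholds above can be sketched as a simple selection function; the strategy names are illustrative labels for the three cases:

```typescript
// Index choice by embedding dimension, per the thresholds in the text.
type IndexStrategy = "hnsw" | "halfvec-hnsw" | "none";

function chooseIndexStrategy(dimensions: number): IndexStrategy {
  if (dimensions <= 2000) return "hnsw";         // standard vector + HNSW (IVFFlat fallback)
  if (dimensions <= 4000) return "halfvec-hnsw"; // halfvec + HNSW when pgvector supports it
  return "none";                                 // >4000: no optimized vector index
}
```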
Full Resynchronization Behavior
Full resync is intentionally guarded so that MCPHub does not schedule multiple identical jobs at once. The scheduler uses two flags:
- One flag to indicate that a resync has already been scheduled.
- One flag to indicate that a resync is currently running.
When the resync job runs, it:
- Enumerates all known servers.
- Selects only connected servers that currently expose tools.
- Re-runs tool indexing for each eligible server.
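The two-flag guard can be sketched minimally as follows; `runResync` stands in for the actual sync work, and the flag handling is an illustration of the guarding idea rather than MCPHub's exact code:

```typescript
let resyncScheduled = false; // a resync has already been scheduled
let resyncRunning = false;   // a resync is currently running

async function scheduleFullResync(runResync: () => Promise<void>): Promise<void> {
  if (resyncScheduled || resyncRunning) return; // never queue identical jobs
  resyncScheduled = true;
  try {
    resyncScheduled = false;
    resyncRunning = true;
    await runResync();
  } finally {
    resyncRunning = false;
  }
}
```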
Practical Notes
Why the Queue Exists
Without the shared queue, two parallel sync operations could both wait for the same pacing interval and then send requests at the same time. That would defeat the purpose of adaptive pacing. The queue prevents this by ensuring only one provider-backed embedding request is in flight at a time.
Why Retries and Pacing Are Separate
Pacing controls the baseline rate of new calls. Retries handle transient failures after a call has already been attempted. MCPHub keeps both mechanisms because provider rate limits and upstream instability are related but not identical problems. Rate-limit-aware retry strategy is implemented on top of pacing: when a rate limit (403/429) is detected, the retry logic checks for Retry-After headers and applies them immediately, or uses a 63-second fallback on that retry. If a later retry is rate-limited again, MCPHub applies the same rule again. This allows retries to respect the exact window timing that the server demands, while pacing ensures that the baseline request rate stays sustainable across all operations.
Why Different Retry Strategies per Error Type
Rate-limit errors (403/429 without Retry-After): time-based budget
A fixed attempt count is unreliable here because the number of retries needed depends on the provider’s quota window. A 5-minute budget instead:
- Enforces a hard deadline without requiring a separately tuned maximum count.
- Scales naturally: a provider with a 2-minute window fits ~2 retries; one with a strict 60-second window fits ~4.
- Retry-After headers bypass the budget entirely, so the server always has the final word.
Server errors (503/504): fixed attempt count
- The exponential sequence (4s→8s→16s→30s→60s→120s→240s) already has a well-defined total cost (~8 minutes for all 7 attempts).
- Fixed counts are more predictable than “retry until N minutes” when the wait intervals are already known.
- A time budget could cut the sequence short unpredictably depending on accumulated jitter.
Why Query Embeddings Use the Same Queue
Search traffic and indexing traffic both consume provider quota. Putting them behind the same queue avoids a situation where search requests starve indexing, or indexing floods the provider while user queries are also running.
Why Fallback Embeddings Still Matter
Fallback vectors are lower quality than provider-generated embeddings, but they keep Smart Routing operational when provider credentials are missing or unavailable. They also make it possible to keep a consistent database pipeline even when external embedding services are temporarily not in use.
Summary
The embedding lifecycle in MCPHub is built around four core ideas:
- Multiple triggers can start indexing or query embedding generation.
- All provider-backed embedding requests flow through one serialized queue.
- Adaptive pacing and bounded retries protect the system from provider throttling and transient failures.
- Database persistence includes dimension validation and index management so that stored vectors remain queryable and structurally consistent.