A production tool that runs 9 LLMs across 5 providers in parallel to evaluate a stock, then makes them debate when they disagree — every answer grounded in SEC filings.
Understanding whether a company is actually financially healthy is gated by time and expertise.
A single 10-Q filing is dozens of pages of dense numbers and legalese; reading enough of them to form an opinion takes hours and assumes you already know what to look for. The easy shortcut — asking one LLM — trades one slow process for one biased opinion. Models hallucinate numbers, cherry-pick data points, and bias their own conclusions. You can't tell which it's doing.
Agora Financials makes nine LLMs across five providers analyze the same filings in parallel, then forces them to debate the metrics they disagree on. Consensus from disagreement instead of a single opinion you can't verify.
One FastAPI service, four phases: ingest → analyze → harmonize → debate. Each phase is its own endpoint and runs to completion before the next starts — the frontend orchestrates the chain, holds intermediate state, and renders progress per phase so the user sees the pipeline live.
Nine models run in parallel via asyncio.gather, all routed through OpenRouter for unified billing, with direct-provider fallback when OpenRouter degrades. pgvector holds the SEC filing embeddings in the same Postgres that holds app data — one database, not two.
OpenRouter as the unified gateway, with direct-provider fallback. Five providers (xAI, OpenAI, Anthropic, Google, Mistral) behind one API key, one billing dashboard, one cost-reporting format. Each model call routes through OpenRouter first; if it fails or stalls, the same call retries directly against the provider's native endpoint. The cost is one extra network hop on the happy path; the gain is being able to add or swap a model via env var instead of an integration.
Two-tier model split — fast vs. deep, not cheap vs. expensive. The fast tier (four models, 3–8s, ~$0.05 per metric) gets the pre-formatted financial tables only. The deep tier (five models, 21–52s, ~$0.12–0.25 per metric) gets the same tables plus a retrieval tool over pgvector for qualitative context from MD&A and segment notes. The discovery worth stating: the quality jump is fast→deep, not deep→premium. Cheap models often skip the retrieval tool entirely no matter how the prompt asks them not to, so paying for "premium fast" loses to plain "deep."
Multi-round debate with sliding-window history compression. When models disagree on a metric, the debate orchestrator runs N rounds in parallel: each model defends, reviews others' arguments, and can ACCEPT / CHANGE / REJECT. Position changes are tracked structurally ("Claude: Good → Neutral"). The non-obvious part: debate history grows fast. A naive implementation pastes every prior response into every next prompt and token costs balloon — R5 was ~5× R2 in early tests. The fix is a sliding window: keep the last two rounds verbatim, summarize everything older into a compact table via one cheap mistral_fast call. R5 flattens to roughly R2 cost.
Three-layer caching, because every step in the pipeline costs money. Each expensive operation has a cache key in front of it.
(ticker, year, quarter) — re-ingesting a filing that's already in the chunks table skips the SEC download, the boilerplate filtering, the chunking, and the embedding API call.
(ticker, latest_filing_date) in a separate financial_statements table — the analysis route reads from here only, never re-hits SEC at request time.
(ticker, llm_model, latest_filing_date) in llm_financial_analysis — the full structured response stored as JSONB. At request time, the code splits the requested models into "already cached" and "needs to run" and only fires LLM calls for the latter. The same ticker re-evaluated against the same filing costs nothing in API spend.
Four groups, each choice with a short note on the reasoning behind it.
create_agent + LangGraph for tool-calling orchestration — each model becomes an agent with optional retrieval-tool access (deep tier only).
LangGraph's tool-call budget is set at invoke time via {"recursion_limit": 12} on .ainvoke(), not when the agent is created. Each tool call costs ~2 steps, so 12 steps caps deep models at ~5 retrievals per metric. Without this, some models skip the retrieval tool entirely — the prompt alone doesn't tell them they're budget-constrained.
RecursiveCharacterTextSplitter at 2500 chars / 300 overlap, respecting markdown headers. A 53-pattern boilerplate blocklist strips SEC legalese, exhibit references, signatures, form headers, and forward-looking-statement disclaimers before chunking; anything under 100 characters is dropped. Embedded with OpenAI's text-embedding-3-small (1536 dims) via OpenRouter — same gateway as the LLM calls, so embedding spend rolls into the same cost surface. Retrieval at query time is cosine similarity with the <-> operator, top-8 chunks filtered by ticker.
mistral_fast call. Compression overhead is <0.5%; savings at five rounds are ~30%.
Gemini 2.5 Flash was pulled from the fast tier after a single evaluation produced 308 API calls and 794K input tokens because the model wouldn't stop calling the retrieval tool — $3.25 on one analysis versus ~$0.08 for the other fast models. Lesson: cheap stops being cheap when instruction-following breaks down.
agora_session cookie. Zero DB rows until they sign up.
checkout.session.completed sets the tier, invoice.paid refills monthly, customer.subscription.deleted resets to free. Token costs are 2 (fast model) or 10 (deep model) per analysis.
Observability is wired today on the heaviest route — the multi-LLM analysis — and the deeper evaluation layers (LLM-as-judge for structured output, tool use, debate consistency) are the next thing on the roadmap. The plan is to build them on the same Langfuse foundation that's already in place.
Every analysis request opens a Langfuse trace named financial-analysis. LangChain's CallbackHandler auto-captures each model call as a child span, with token counts and cost lifted from OpenRouter's response (the extra_body={"usage": {"include": True}} flag is on, so per-call cost is visible per provider). The trace_id is returned in the API response, ready for user-feedback scores to be attached later.
What that buys today: latency, token usage, cost, and errors per request, per model, queryable in one place.
Live Langfuse view — one trace per analysis, four recent requests across tickers ZETA, NOW, NBIS, PLTR.
Each one builds on traces Langfuse is already collecting, so the work is judge prompts and Langfuse UI configuration, not new instrumentation. The three planned judges:
asyncio.gather, all behind OpenRouter with direct-provider fallback.year, quarter, and filing_date per chunk, but retrieval today only filters by ticker. Adding them to the WHERE clause unlocks "the latest quarter only" or "the last fiscal year" as scoping options for the deep-tier prompts — one query change away.