Live Product

Agora Financials:
Multi-LLM Stock Evaluator

A production tool that runs 9 LLMs across 5 providers in parallel to evaluate a stock, then makes them debate when they disagree — every answer grounded in SEC filings.

Agora Financials Demo
LLMs
9

Core

Python · FastAPI · Next.js 16 · PostgreSQL + pgvector

AI Layer

9 LLMs · 5 Providers · OpenRouter · LangChain

Observability

Langfuse (analysis route)

Infra

Docker · Railway · Stripe

The Problem

Understanding whether a company is actually financially healthy is gated by time and expertise.

A single 10-Q filing is dozens of pages of dense numbers and legalese; reading enough of them to form an opinion takes hours and assumes you already know what to look for. The easy shortcut — asking one LLM — trades one slow process for one biased opinion. Models hallucinate numbers, cherry-pick data points, and bias their own conclusions. You can't tell which it's doing.

Agora Financials makes nine LLMs across five providers analyze the same filings in parallel, then forces them to debate the metrics they disagree on. Consensus from disagreement instead of a single opinion you can't verify.

One-Minute Walkthrough

System Design

One FastAPI service, four phases: ingest → analyze → harmonize → debate. Each phase is its own endpoint and runs to completion before the next starts — the frontend orchestrates the chain, holds intermediate state, and renders progress per phase so the user sees the pipeline live.

Nine models run in parallel via asyncio.gather, all routed through OpenRouter for unified billing, with direct-provider fallback when OpenRouter degrades. pgvector holds the SEC filing embeddings in the same Postgres that holds app data — one database, not two.

Four design choices worth discussing

OpenRouter as the unified gateway, with direct-provider fallback. Five providers (xAI, OpenAI, Anthropic, Google, Mistral) behind one API key, one billing dashboard, one cost-reporting format. Each model call routes through OpenRouter first; if it fails or stalls, the same call retries directly against the provider's native endpoint. The cost is one extra network hop on the happy path; the gain is being able to add or swap a model via env var instead of an integration.

Two-tier model split — fast vs. deep, not cheap vs. expensive. The fast tier (four models, 3–8s, ~$0.05 per metric) gets the pre-formatted financial tables only. The deep tier (five models, 21–52s, ~$0.12–0.25 per metric) gets the same tables plus a retrieval tool over pgvector for qualitative context from MD&A and segment notes. The discovery worth stating: the quality jump is fast→deep, not deep→premium. Cheap models often skip the retrieval tool entirely no matter how the prompt asks them not to, so paying for "premium fast" loses to plain "deep."

Multi-round debate with sliding-window history compression. When models disagree on a metric, the debate orchestrator runs N rounds in parallel: each model defends, reviews others' arguments, and can ACCEPT / CHANGE / REJECT. Position changes are tracked structurally ("Claude: Good → Neutral"). The non-obvious part: debate history grows fast. A naive implementation pastes every prior response into every next prompt and token costs balloon — R5 was ~5× R2 in early tests. The fix is a sliding window: keep the last two rounds verbatim, summarize everything older into a compact table via one cheap mistral_fast call. R5 flattens to roughly R2 cost.

Three-layer caching, because every step in the pipeline costs money. Each expensive operation has a cache key in front of it.

Agora Financials system architecture diagram. Four columns: INGEST (SEC EDGAR via edgartools, chunk and embed, pgvector, yfinance fallback for non-US tickers), ANALYZE (9 LLMs across 5 providers via OpenRouter, fast tier of 4 models doing numbers-only in 3-8 seconds at ~$0.05, deep tier of 5 models with RAG over pgvector in 21-52 seconds at ~$0.12), HARMONIZE (aligned by majority wins, or flagged for debate), and DEBATE (parallel rounds with ACCEPT/CHANGE/REJECT, sliding-window compression, consensus or COMPLEX, final PDF report). A Langfuse observability lane at the bottom traces latency, tokens, cost, and errors — wired today on the ANALYZE step only.

Stack & Why

Four groups, each choice with a short note on the reasoning behind it.

Backend

Frontend

AI Layer

Infrastructure & Auth

Evaluation

Observability is wired today on the heaviest route — the multi-LLM analysis — and the deeper evaluation layers (LLM-as-judge for structured output, tool use, debate consistency) are the next thing on the roadmap. The plan is to build them on the same Langfuse foundation that's already in place.

What's wired today — Langfuse on the analysis route

Every analysis request opens a Langfuse trace named financial-analysis. LangChain's CallbackHandler auto-captures each model call as a child span, with token counts and cost lifted from OpenRouter's response (the extra_body={"usage": {"include": True}} flag is on, so per-call cost is visible per provider). The trace_id is returned in the API response, ready for user-feedback scores to be attached later.

What that buys today: latency, token usage, cost, and errors per request, per model, queryable in one place.

Langfuse traces for Agora Financials showing the financial-analysis trace name, per-request latency (28-48s), token counts (21K-83K combined), and total cost ($0.001-$0.011 per analysis).

Live Langfuse view — one trace per analysis, four recent requests across tickers ZETA, NOW, NBIS, PLTR.

Where it's going — three LLM-as-judge layers, on the same trace data

Each one builds on traces Langfuse is already collecting, so the work is judge prompts and Langfuse UI configuration, not new instrumentation. The three planned judges:

Results

Roadmap

Agora Financials