Live Product

TubeText: AI Transcript Engine

A production tool that transcribes, translates, and summarizes video content, with AI quality evaluation built in — currently with organic paying clients.

TubeText Demo
Languages
5

Core

Python · FastAPI · Next.js 16 · PostgreSQL

AI Layer

Deepgram · OpenRouter · Cerebras

Observability

Langfuse · LLM-as-judge · Sentry

Infra

Docker · Railway · Stripe

The Problem

YouTube is one of the best places to learn — but learning from it is slow.

A 40-minute talk hides 5 minutes of substance, and the language barrier makes the cost higher: if the talk you need is in a language you don't speak, you either skip it or sit through machine-translated captions that miss the nuance.

TubeText turns any YouTube video into a structured transcript, a focused summary, and a translation in 20+ languages. Watch what matters, in the language you read fastest.

One-Minute Walkthrough

Video coming soon: a one-minute screen recording covering the four flows (free transcript, premium transcript, summary, translation).

System Design

One FastAPI service. Four pipelines. Every call traced end-to-end through Langfuse.

No event queue, no worker pool — the async runtime handles current scale on its own, with asyncio.to_thread bridging the sync SDKs (yt-dlp, Deepgram) into the async stack.
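A minimal sketch of the bridging pattern, assuming a blocking SDK call. `blocking_download` is a hypothetical stand-in for the yt-dlp or Deepgram call, not the real SDK API; only `asyncio.to_thread` is taken from the text above.

```python
import asyncio
import time


def blocking_download(video_id: str) -> str:
    """Hypothetical stand-in for a blocking SDK call (e.g. an audio download)."""
    time.sleep(0.05)  # simulate blocking I/O
    return f"audio-for-{video_id}"


async def download_audio(video_id: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    # to serve other requests in the meantime.
    return await asyncio.to_thread(blocking_download, video_id)


async def main() -> list[str]:
    # The two downloads overlap: each one blocks its own worker thread,
    # not the event loop.
    return await asyncio.gather(download_audio("a"), download_audio("b"))
```

Calling `asyncio.run(main())` returns the two results in the order the coroutines were passed to `gather`.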

Three design choices worth discussing

Residential proxy + error classification, not naive retry. YouTube blocks cloud IPs aggressively, so every request routes through Webshare's rotating residential pool. Exceptions from both youtube-transcript-api and yt-dlp are classified into five buckets: transient, no_captions, unavailable, bad_input, unknown. Only transient errors are retried; the rest map to user-facing messages ("try Premium", "video is private") instead of a generic failure.
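The classification step can be sketched as follows. The five bucket names come from the text; the substring rules and the message wording are hypothetical placeholders, since the real rules would match the concrete exception types and messages raised by youtube-transcript-api and yt-dlp.

```python
from enum import Enum


class Bucket(str, Enum):
    TRANSIENT = "transient"
    NO_CAPTIONS = "no_captions"
    UNAVAILABLE = "unavailable"
    BAD_INPUT = "bad_input"
    UNKNOWN = "unknown"


# Hypothetical matching rules: (bucket, substrings looked up in the error text).
RULES = [
    (Bucket.TRANSIENT, ("timed out", "429", "connection reset")),
    (Bucket.NO_CAPTIONS, ("subtitles are disabled", "no transcript")),
    (Bucket.UNAVAILABLE, ("private video", "video unavailable")),
    (Bucket.BAD_INPUT, ("invalid url", "invalid video id")),
]

# Hypothetical user-facing copy for the non-retryable buckets.
USER_MESSAGES = {
    Bucket.NO_CAPTIONS: "This video has no captions. Try Premium.",
    Bucket.UNAVAILABLE: "This video is private or unavailable.",
    Bucket.BAD_INPUT: "That does not look like a valid YouTube link.",
    Bucket.UNKNOWN: "Something went wrong. Please try again later.",
}


def classify(exc: Exception) -> Bucket:
    """Map an exception to one of the five buckets; default to UNKNOWN."""
    text = str(exc).lower()
    for bucket, needles in RULES:
        if any(needle in text for needle in needles):
            return bucket
    return Bucket.UNKNOWN


def should_retry(exc: Exception) -> bool:
    # Only transient failures are worth another attempt.
    return classify(exc) is Bucket.TRANSIENT
```

Keeping classification separate from the retry loop means both libraries share one retry policy and one set of user messages.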

Async orchestration with bounded retry budgets. The free path retries 3 times with (0.5s, 1s, 2s) backoff because each attempt is cheap. The premium path retries only 2 times with 3s backoff because each attempt is 30–90s of yt-dlp + Deepgram work — same pattern, different budget.
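A sketch of "same pattern, different budget", assuming that "retries N times" means N retries after the first attempt. `RetryBudget` and `run_with_budget` are illustrative names, and the `sleep` parameter exists only so the delays can be swapped out in tests.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryBudget:
    delays: tuple[float, ...]  # seconds to wait before each retry


FREE = RetryBudget(delays=(0.5, 1.0, 2.0))  # cheap attempts, 3 retries
PREMIUM = RetryBudget(delays=(3.0, 3.0))    # 30-90s attempts, 2 retries


async def run_with_budget(
    op: Callable[[], Awaitable[T]],
    budget: RetryBudget,
    is_transient: Callable[[Exception], bool],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    # One initial attempt plus one attempt per delay; None marks the last try.
    for delay in (*budget.delays, None):
        try:
            return await op()
        except Exception as exc:
            # Give up when the budget is spent or the error is not transient.
            if delay is None or not is_transient(exc):
                raise
            await sleep(delay)
    raise AssertionError("unreachable")
```

The only difference between the two paths is the `RetryBudget` value passed in, so the cost of a retry is a one-line decision per pipeline.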

Three-tier freemium without DB pollution. Anonymous users (5 videos/month) are tracked by a signed HTTP-only cookie, with zero database writes until they create an account. That saves storage and sidesteps the "what do we do with abandoned anonymous data?" question entirely.
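One way to track the anonymous quota without a database is an HMAC-signed cookie value, sketched below with the standard library. The payload shape (`month`, `used`), the `SECRET` value, and the function names are assumptions for illustration; only the 5 videos/month limit comes from the text.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"replace-me"  # hypothetical; load from the environment in practice
FREE_LIMIT = 5          # anonymous videos per month


def sign(payload: dict) -> str:
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    mac = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{mac}"


def verify(cookie: str) -> Optional[dict]:
    body, _, mac = cookie.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; a tampered or malformed cookie is rejected.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(body))


def record_use(cookie: Optional[str], month: str) -> tuple[bool, str]:
    """Return (allowed, new_cookie_value) for one anonymous request."""
    state = verify(cookie) if cookie else None
    if state is None or state.get("month") != month:
        state = {"month": month, "used": 0}  # new visitor or new month
    if state["used"] >= FREE_LIMIT:
        return False, sign(state)
    state["used"] += 1
    return True, sign(state)
```

In a FastAPI handler the returned value would be written back with `response.set_cookie(..., httponly=True)`, so the counter lives entirely in the browser and the server only needs the signing secret.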

TubeText system architecture diagram showing four pipelines (free transcript, premium audio, AI summary, AI translation), the OpenRouter and Cerebras provider stack, and the Sentry and Langfuse observability layers.

Stack & Why

Four groups, each choice with a short note on the reasoning behind it.

Backend

Frontend

AI Layer

Infrastructure & Auth

Evaluation

Three layers of observability run in production. This is the part most AI portfolio projects skip — and the part interviewers actually want to ask about.

1. Langfuse traces — every call, every cost

A two-level trace pattern wraps every user action: an outer span for the action itself, and an inner generation for each model call.

Why two levels: LLM judges need plain text I/O at the top, not nested chunk-level logs. The outer span is the user's action; the inner generation has the cost details. Costs roll up from inner to outer automatically — margin per request is visible in Langfuse, not a spreadsheet.
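The shape of the two levels can be modelled in plain Python. This is a data-structure sketch only, not the Langfuse SDK: the class names, field names, model name, and cost figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """Inner level: one model call with its cost details."""
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class ActionSpan:
    """Outer level: one user action with plain-text input and output."""
    name: str
    input_text: str
    output_text: str = ""
    generations: list[Generation] = field(default_factory=list)

    @property
    def total_cost_usd(self) -> float:
        # Costs roll up from the inner generations to the outer span.
        return sum(g.cost_usd for g in self.generations)


# One translation action that needed two chunk-level model calls.
span = ActionSpan(name="translate", input_text="Hello world")
span.generations.append(Generation("example-model", 120, 80, 0.0004))
span.generations.append(Generation("example-model", 110, 75, 0.0003))
span.output_text = "Hola mundo"
```

A judge reads only `input_text` and `output_text` on the outer span, while `total_cost_usd` gives the per-request cost without opening the chunk-level calls.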

2. LLM-as-judge — adequacy on translations

Configured in the Langfuse UI, an adequacy judge runs on every translation trace. The judge is openai/gpt-4o-mini via OpenRouter, and it walks an idea-by-idea checklist over source vs. translation — each idea labelled PRESERVED, MISSING, DISTORTED, or ADDED — producing a 1–5 score plus written reasoning.

What this catches that BLEU and ROUGE don't: hallucinated additions, dropped sentences, register shifts. A summary-faithfulness judge is wired on the same pattern, pending UI activation.
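Since the judge itself lives in the Langfuse UI, the only code-side concern is handling its output. A minimal sketch of a validated verdict container follows, assuming the judge's per-idea labels and score have already been parsed; `Verdict` and its fields are hypothetical names, and the four labels and the 1-5 range are the ones described above.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = {"PRESERVED", "MISSING", "DISTORTED", "ADDED"}


@dataclass(frozen=True)
class Verdict:
    labels: tuple[str, ...]  # one label per idea in the checklist
    score: int               # 1-5, as returned by the judge
    reasoning: str

    def __post_init__(self) -> None:
        # Reject anything outside the judge's documented vocabulary and range.
        unknown = set(self.labels) - LABELS
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")

    def tally(self) -> Counter:
        """Count how many ideas fell under each label."""
        return Counter(self.labels)
```

The tally makes the failure mode explicit: a low score driven by ADDED labels points at hallucination, while MISSING labels point at dropped content.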

3. Human feedback — sampled, idempotent, attached to traces

The frontend shows a thumbs prompt on 10% of outputs (configurable, to avoid feedback fatigue). Submissions write a Langfuse score with an idempotent ID of the form {trace_id}:{name}, so resubmissions upsert instead of creating duplicates. Thumbs-down comments (up to 500 chars) flow into the trace as searchable feedback, and a per-IP rate limit (30 requests per 5 minutes) blocks spam without forcing an auth wall.
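The idempotent write and the rate limit can be sketched with in-memory stand-ins. `ScoreStore` is a hypothetical substitute for the score backend, truncating comments at 500 characters is an assumption (the real endpoint may reject longer ones), and the sliding-window algorithm is one possible way to enforce 30 requests per 5 minutes.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class ScoreStore:
    """In-memory stand-in for a score backend that upserts by id."""

    def __init__(self) -> None:
        self.scores: dict[str, dict] = {}

    def submit(self, trace_id: str, name: str, value: int, comment: str = "") -> str:
        score_id = f"{trace_id}:{name}"  # same trace + name -> same id
        # Writing by id makes a resubmit overwrite instead of duplicate.
        self.scores[score_id] = {"value": value, "comment": comment[:500]}
        return score_id


class RateLimiter:
    """Sliding window: at most `limit` hits per `window` seconds per key."""

    def __init__(self, limit: int = 30, window: float = 300.0) -> None:
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop hits that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

A user who changes their mind from thumbs-up to thumbs-down therefore leaves exactly one score on the trace, and the limiter is keyed by client IP rather than by account.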

Sessions — per-video cohort analysis

On top of the three layers above, sessions group everything by source video. The frontend generates a crypto.randomUUID() per YouTube URL and sends it in an X-Session-Id header; the backend attaches it to every trace. All three premium outputs (transcript / summary / translation) for the same video roll up into one Langfuse Session, which lets me answer "did this video produce a bad transcript and a bad summary, or just one of them?" instead of guessing.
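A sketch of the backend half, assuming the header arrives in a mapping with lowercased keys. The trace dictionaries, the scores, and the `threshold` are hypothetical; the point is only that a shared session id turns "which outputs were bad for this video?" into a lookup.

```python
import uuid
from collections import defaultdict
from typing import Mapping, Optional


def read_session_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return a normalised session id, or None if missing or malformed."""
    raw = headers.get("x-session-id")
    if raw is None:
        return None
    try:
        return str(uuid.UUID(raw))
    except ValueError:
        return None


def group_by_session(traces: list[dict]) -> dict[str, dict[str, int]]:
    """Map session id -> {output kind: score} across all traces."""
    sessions: dict[str, dict[str, int]] = defaultdict(dict)
    for trace in traces:
        sessions[trace["session_id"]][trace["kind"]] = trace["score"]
    return dict(sessions)


def bad_outputs(session: dict[str, int], threshold: int = 3) -> list[str]:
    # Output kinds whose score fell below the (hypothetical) threshold.
    return sorted(kind for kind, score in session.items() if score < threshold)
```

For a session whose transcript scored 5, summary 2, and translation 4, `bad_outputs` returns only the summary, which localises the failure to one pipeline.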

Results

Roadmap
