YouTube Transcripts for AI & RAG Pipelines

The problem

AI knowledge bases go stale the moment you stop updating them

YouTube is one of the richest sources of expert knowledge on the internet — technical tutorials, conference talks, product demos, research walkthroughs, earnings calls. But getting that content into a vector store requires someone to find the video, pull the transcript, clean it, chunk it, embed it, and upsert it. Every. Single. Time.

Most teams skip it or batch it weekly. The result is a knowledge base that's always three steps behind the conversation your users want to have.

VidProxy automates the entire pipeline from "new video posted" to "transcript in your vector store." One webhook subscription replaces an ongoing manual workflow.

What the pipeline looks like without VidProxy

Someone on your team monitors channels manually. They find new videos, download transcripts (if they can), clean the text, paste it into a script, run the embedding, and update the store. Or you write a cron job that scrapes YouTube, handles errors when it's blocked, parses the page HTML, extracts captions from a JavaScript object buried in the page source, and tries to keep up with YouTube's changing markup. This breaks constantly.

What the pipeline looks like with VidProxy

You subscribe to a channel. VidProxy polls it on your behalf — every 2 minutes on Pro. When a new video is detected, it fetches the transcript, formats it as structured JSON, and POSTs it to your webhook URL. Your webhook receiver gets the full text plus timestamped segments. It chunks the transcript, calls your embedding API, and upserts to your vector store. Done. No human involvement.
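Because the payload carries timestamped segments as well as the full text, chunks can keep a pointer back to the moment in the video they came from. The following is a minimal sketch, not VidProxy's documented schema: the segment shape (`start` in seconds plus `text`) and the helper name `chunk_segments` are assumptions for illustration, so check a real payload for the actual field names.

```python
from typing import TypedDict


class Segment(TypedDict):
    # Hypothetical segment shape; confirm the field names against a real payload.
    start: float  # seconds from the beginning of the video
    text: str


def chunk_segments(segments: list[Segment], max_words: int = 500) -> list[dict]:
    """Group consecutive segments into chunks of at most roughly max_words words.

    Each chunk records the start time of its first segment.
    """
    chunks: list[dict] = []
    current: list[str] = []
    current_start = 0.0
    word_count = 0

    for seg in segments:
        words = len(seg["text"].split())
        # Close the current chunk if adding this segment would exceed the limit.
        if current and word_count + words > max_words:
            chunks.append({"start": current_start, "text": " ".join(current)})
            current, word_count = [], 0
        if not current:
            current_start = seg["start"]
        current.append(seg["text"])
        word_count += words

    # Flush whatever is left after the last segment.
    if current:
        chunks.append({"start": current_start, "text": " ".join(current)})
    return chunks


segments = [
    {"start": 0.0, "text": "welcome to the talk"},
    {"start": 4.2, "text": "today we cover retrieval"},
    {"start": 9.8, "text": "and embeddings"},
]
chunks = chunk_segments(segments, max_words=8)
```

Storing `start` in the vector metadata lets a retriever link an answer to the relevant point in the video rather than to the video as a whole.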

Code example

Webhook receiver → OpenAI embeddings → Pinecone

Two minimal handlers, one in Node.js and one in Python, that receive a VidProxy webhook and upsert the transcript to Pinecone.

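// Node.js (Express): POST /webhooks/vidproxy
import express from 'express';
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const app = express();
app.use(express.json({ limit: '5mb' })); // assumed limit; transcripts can exceed the 100kb default
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const index = new Pinecone().index('transcripts'); // index name is a placeholder

// Split text into windows of at most maxWords words
function chunkText(text, maxWords) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(' '));
  }
  return chunks;
}

app.post('/webhooks/vidproxy', async (req, res) => {
  // Acknowledge immediately: VidProxy retries on non-2xx
  res.sendStatus(200);

  const { video, transcript, subscription } = req.body ?? {};
  if (!transcript?.available) return;

  try {
    // Chunk the transcript into ~500-word windows
    const chunks = chunkText(transcript.text, 500);
    if (chunks.length === 0) return;

    // Embed all chunks in one batched request
    const { data } = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks
    });

    // Upsert to Pinecone
    await index.upsert(
      chunks.map((chunk, i) => ({
        id: `${video.url}-chunk-${i}`,
        values: data[i].embedding,
        metadata: {
          videoTitle: video.title,
          videoUrl: video.url,
          publishedAt: video.published,
          channel: subscription.channel_name,
          tag: subscription.label,
          chunk: chunk
        }
      }))
    );

    console.log(`Indexed ${chunks.length} chunks from "${video.title}"`);
  } catch (err) {
    // The 200 has already been sent, so a failure here will not trigger a retry
    console.error('Failed to index transcript:', err);
  }
});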

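# Python (Flask): POST /webhooks/vidproxy
import threading

from flask import Flask, request
from openai import OpenAI
from pinecone import Pinecone

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
pinecone_index = Pinecone().Index("transcripts")  # index name is a placeholder


def chunk_text(text, max_words=500):
    # Split on whitespace into windows of at most max_words words
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def index_transcript(payload):
    video = payload['video']
    subscription = payload['subscription']
    chunks = chunk_text(payload['transcript']['text'], max_words=500)
    if not chunks:
        return

    # Embed all chunks in one batched OpenAI request
    embeddings = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    ).data

    # Upsert to Pinecone
    vectors = [
        {
            "id": f"{video['url']}-chunk-{i}",
            "values": emb.embedding,
            "metadata": {
                "title": video["title"],
                "url": video["url"],
                "channel": subscription["channel_name"],
                "text": chunk
            }
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
    pinecone_index.upsert(vectors=vectors)


@app.route('/webhooks/vidproxy', methods=['POST'])
def handle_vidproxy():
    payload = request.get_json(silent=True) or {}
    if (payload.get('transcript') or {}).get('available'):
        # Flask only responds when this function returns, so index in a
        # background thread; a task queue is more robust in production
        threading.Thread(target=index_transcript, args=(payload,), daemon=True).start()
    # Acknowledge immediately: VidProxy retries on non-2xx
    return '', 200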
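Both handlers depend on a chunking helper (`chunkText` / `chunk_text`). One possible version is sketched below: a word-window splitter with optional overlap, so a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks. The `overlap` parameter and its default are assumptions for illustration, not something VidProxy prescribes.

```python
def chunk_text(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
windows = chunk_text(sample, max_words=4, overlap=1)
```

Setting `overlap=0` gives plain back-to-back windows; a larger overlap costs more embedding calls and storage in exchange for fewer context breaks at chunk edges.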

Works with

Drop into your existing AI stack

VidProxy outputs standard JSON. Everything downstream is your choice.

Vector stores

Pinecone, Supabase pgvector, Weaviate, Chroma, Qdrant, Milvus — any store that accepts an embedding vector and a metadata object.

Embedding models

OpenAI text-embedding-3-small, Cohere embed-v3, Voyage AI, or any local model via Ollama or Hugging Face. VidProxy is model-agnostic — it hands you the text.

RAG frameworks

LangChain, LlamaIndex, Haystack, or a hand-rolled retriever. VidProxy is upstream of your framework — it feeds the ingestion pipeline, not the query layer.

AI models

GPT-4o, Claude, Gemini: retrieved chunks are plain text, so any model can consume them. Answers improve at query time because your knowledge base holds current, expert-level YouTube content.

Automation platforms

n8n, Make, and Zapier all support custom webhook triggers. Wire VidProxy as the trigger and your embedding step as the action — no custom code required.

Pull API for batch ingestion

Use GET /api/videos?since=7d to bootstrap a new knowledge base or backfill a gap. One call returns everything from the window you specify.
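A backfill call could be sketched as follows. Only the `/api/videos?since=7d` path comes from the text above. The host in `BASE_URL`, the bearer-token header, and the helper names `videos_url` and `fetch_videos` are placeholders assumed for illustration, so substitute the values from your own account and the real API reference.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder host; replace with the real VidProxy API host.
BASE_URL = "https://vidproxy.example.com"


def videos_url(since: str = "7d") -> str:
    """Build the pull-API URL for a lookback window such as '7d'."""
    query = urllib.parse.urlencode({"since": since})
    return f"{BASE_URL}/api/videos?{query}"


def fetch_videos(since: str = "7d", api_key: str = "") -> object:
    """Fetch and decode the JSON response for the given window."""
    request = urllib.request.Request(videos_url(since))
    if api_key:
        # Assumption: bearer-token auth; the real scheme may differ.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Each item in the decoded response can then go through the same chunk, embed, and upsert steps as a webhook delivery, which keeps backfilled and live content in one consistent index.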

Keep Your AI Knowledge Base
Fresh from YouTube — Automatically

AI knowledge bases go stale the moment you stop updating them

What the pipeline looks like without VidProxy

What the pipeline looks like with VidProxy

Three steps to a self-updating knowledge base

Subscribe to channels

Receive structured transcripts

Embed and upsert

Webhook receiver → OpenAI embeddings → Pinecone

Use AI Enrichment as pre-built metadata

On-demand lookup: fill gaps in your knowledge base

Keyword Alerts: only ingest what's relevant

Drop into your existing AI stack

Vector stores

Embedding models

RAG frameworks

AI models

Automation platforms

Pull API for batch ingestion

Common questions

Start feeding your pipeline fresh YouTube transcripts
