Keep Your AI Knowledge Base Fresh from YouTube, Automatically

Your RAG pipeline is only as good as the data feeding it. VidProxy delivers full YouTube transcripts to your ingestion webhook within minutes of a new video posting, so your vector store stays current with no scraping, no polling, and no manual work on your side.

AI knowledge bases go stale the moment you stop updating them

YouTube is one of the richest sources of expert knowledge on the internet — technical tutorials, conference talks, product demos, research walkthroughs, earnings calls. But getting that content into a vector store requires someone to find the video, pull the transcript, clean it, chunk it, embed it, and upsert it. Every. Single. Time.

Most teams skip it or batch it weekly. The result is a knowledge base that's always three steps behind the conversation your users want to have.

VidProxy automates the path from "new video posted" to "transcript delivered to your endpoint," and a short webhook receiver takes it the rest of the way into your vector store. One webhook subscription replaces an ongoing manual workflow.

What the pipeline looks like without VidProxy

Someone on your team monitors channels manually. They find new videos, download transcripts (if they can), clean the text, paste it into a script, run the embedding, and update the store. Or you write a cron job that scrapes YouTube, handles errors when it's blocked, parses the page HTML, extracts captions from a JavaScript object buried in the page source, and tries to keep up with YouTube's changing markup. This breaks constantly.

What the pipeline looks like with VidProxy

You subscribe to a channel. VidProxy polls it on your behalf — every 2 minutes on Pro. When a new video is detected, it fetches the transcript, formats it as structured JSON, and POSTs it to your webhook URL. Your webhook receiver gets the full text plus timestamped segments. It chunks the transcript, calls your embedding API, and upserts to your vector store. Done. No human involvement.

Three steps to a self-updating knowledge base

From zero to automated transcript ingestion in under ten minutes.

1. Subscribe to channels

Paste a YouTube channel URL or handle into the VidProxy dashboard. Set a webhook URL pointing to your ingestion endpoint. Label the subscription so your receiver knows which topic cluster to target.

2. Receive structured transcripts

The moment a new video is detected, VidProxy POSTs a JSON payload to your webhook. The payload includes the full plain-text transcript, per-sentence timestamped segments, and (on Pro) an AI-generated summary and topic tags you can use as metadata.

3. Embed and upsert

Your endpoint chunks the transcript, calls your embedding API (OpenAI, Cohere, or a local model), and upserts the vectors to Pinecone, Supabase pgvector, Chroma, or Weaviate. Your knowledge base is current within minutes of each publish.
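
For orientation, the payload fields mentioned on this page can be gathered into one illustrative example. The field names come from the handler code and FAQ here; every value is invented, the timestamp format and the top-level placement of `enrichment` are assumptions, and the real payload may carry more fields. `has_usable_transcript` is a hypothetical helper, not part of VidProxy.

```python
# Illustrative payload only: field names are taken from this page,
# all values are invented, and the real payload may contain more fields.
sample_payload = {
    "video": {
        "title": "Example talk title",
        "url": "https://www.youtube.com/watch?v=VIDEO_ID",
        "published": "2024-01-01T00:00:00Z",  # timestamp format is an assumption
    },
    "subscription": {
        "channel_name": "Example Channel",
        "label": "example-topic",
    },
    "transcript": {
        "available": True,
        "text": "Welcome to the talk. Today we cover indexing.",
        "segments": [
            {"start": 0.0, "duration": 2.5, "text": "Welcome to the talk."},
            {"start": 2.5, "duration": 3.0, "text": "Today we cover indexing."},
        ],
    },
    # Pro and Agency plans only; placement and value types are assumptions.
    "enrichment": {
        "summary": "A short example summary.",
        "key_takeaways": ["An example takeaway."],
        "topics": ["example-topic"],
    },
}


def has_usable_transcript(payload: dict) -> bool:
    """True when the payload carries transcript text worth embedding."""
    transcript = payload.get("transcript") or {}
    return bool(transcript.get("available") and transcript.get("text"))
```

A receiver can call a check like this before doing any chunking or embedding work.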

Webhook receiver → OpenAI embeddings → Pinecone

Minimal handlers, first in Node.js (Express) and then in Python (Flask), that receive a VidProxy webhook and upsert the transcript to Pinecone.

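// Assumes Express, the official `openai` and `@pinecone-database/pinecone`
// clients, and a chunkText(text, maxWords) helper that you supply.
const express = require('express');
const OpenAI = require('openai');
const { Pinecone } = require('@pinecone-database/pinecone');

const app = express();
app.use(express.json({ limit: '5mb' })); // transcripts can exceed the 100kb default
const openai = new OpenAI(); // reads OPENAI_API_KEY
// PINECONE_INDEX is a placeholder env var for your index name
const index = new Pinecone().index(process.env.PINECONE_INDEX); // reads PINECONE_API_KEY

// POST /webhooks/vidproxy
app.post('/webhooks/vidproxy', async (req, res) => {
  // Acknowledge immediately; VidProxy retries on non-2xx
  res.sendStatus(200);

  try {
    const { video, transcript, subscription } = req.body;
    if (!transcript?.available) return;

    // Chunk the transcript into ~500-word windows
    const chunks = chunkText(transcript.text, 500);
    if (chunks.length === 0) return;

    // Embed all chunks in one batched request
    const { data } = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks
    });

    // Upsert to Pinecone; IDs are deterministic, so a redelivery overwrites
    await index.upsert(
      chunks.map((chunk, i) => ({
        id: `${video.url}-chunk-${i}`,
        values: data[i].embedding,
        metadata: {
          videoTitle: video.title,
          videoUrl: video.url,
          publishedAt: video.published,
          channel: subscription.channel_name,
          tag: subscription.label,
          chunk: chunk
        }
      }))
    );

    console.log(`Indexed ${chunks.length} chunks from "${video.title}"`);
  } catch (err) {
    // The 200 has already been sent, so log failures instead of throwing
    console.error(`Failed to index "${req.body?.video?.title}":`, err);
  }
});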
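
# Python (Flask) version of the same handler. Assumes the official `openai` and
# `pinecone` clients and a chunk_text(text, max_words) helper that you supply.
import os
import threading

from flask import Flask, request
from openai import OpenAI
from pinecone import Pinecone

app = Flask(__name__)
openai_client = OpenAI()  # reads OPENAI_API_KEY
# PINECONE_INDEX is a placeholder env var for your index name
pinecone_index = Pinecone(api_key=os.environ['PINECONE_API_KEY']).Index(os.environ['PINECONE_INDEX'])


def index_transcript(payload):
    video = payload['video']
    subscription = payload['subscription']
    chunks = chunk_text(payload['transcript']['text'], max_words=500)
    if not chunks:
        return

    # Embed via OpenAI in one batched request
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    ).data

    # Upsert to Pinecone
    vectors = [
        {
            "id": f"{video['url']}-chunk-{i}",
            "values": emb.embedding,
            "metadata": {
                "title": video["title"],
                "url": video["url"],
                "channel": subscription["channel_name"],
                "text": chunk
            }
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
    pinecone_index.upsert(vectors=vectors)


# POST /webhooks/vidproxy
@app.route('/webhooks/vidproxy', methods=['POST'])
def handle_vidproxy():
    payload = request.get_json()
    transcript = payload.get('transcript', {})
    if transcript.get('available'):
        # Index on a background thread so the 200 is returned immediately
        threading.Thread(target=index_transcript, args=(payload,), daemon=True).start()
    return '', 200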
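
Both handlers call a chunking helper (`chunkText` / `chunk_text`) that is not defined on this page. One possible version is sketched below, assuming plain whitespace word-splitting is good enough; the `overlap` parameter is an addition for illustration, and token-aware or sentence-aware chunkers are reasonable alternatives.

```python
def chunk_text(text: str, max_words: int = 500, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_words whitespace-separated words.

    overlap is the number of words repeated at the start of the next window.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

A small overlap (say 50 words) can help when an answer straddles a chunk boundary, at the cost of a few extra vectors per video.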

Use AI Enrichment as pre-built metadata

On Pro and Agency plans, every transcript is automatically processed by Claude to produce a summary, key takeaways, and topic tags. These land in the webhook payload under the enrichment key — no extra API calls on your side.

Store the summary and topics as vector metadata. Your retrieval layer can then filter by topic tag before running the semantic search, so queries against a multi-topic knowledge base only score chunks from the relevant topic.

enrichment.topics → filter field in your vector store metadata

enrichment.summary → human-readable context injected into your system prompt alongside retrieved chunks

enrichment.key_takeaways → high-signal text that improves retrieval precision when stored as an additional vector
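
The mapping above could look roughly like this on the receiver side. This is a sketch, assuming `enrichment` sits at the top level of the payload and that `topics` is a list of strings; `build_metadata` and `topic_filter` are hypothetical helper names. The filter uses the `$in` operator from Pinecone's metadata filter syntax, and other stores have their own equivalents.

```python
def build_metadata(payload: dict, chunk: str) -> dict:
    """Combine video fields with enrichment fields for one chunk's metadata."""
    video = payload["video"]
    # Enrichment is only present on plans that include it
    enrichment = payload.get("enrichment") or {}
    metadata = {
        "title": video["title"],
        "url": video["url"],
        "text": chunk,
    }
    if enrichment.get("topics"):
        metadata["topics"] = list(enrichment["topics"])
    if enrichment.get("summary"):
        metadata["summary"] = enrichment["summary"]
    return metadata


def topic_filter(topics: list[str]) -> dict:
    """Metadata filter matching vectors tagged with any of the given topics."""
    return {"topics": {"$in": list(topics)}}
```

With Pinecone, the filter would be passed at query time, for example `pinecone_index.query(vector=query_embedding, top_k=5, filter=topic_filter(["kubernetes"]), include_metadata=True)`.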

On-demand lookup: fill gaps in your knowledge base

Not every video you want to index comes from a channel you're monitoring. Use GET /api/transcript?url= to fetch the transcript for any YouTube video by URL or ID — no subscription required. One API call, instant result. This is useful for backfilling a specific video someone shared, indexing a one-off talk, or enriching your store with content from channels you don't need to watch continuously. Available on all paid plans (500–10,000 lookups/month).
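
A lookup call might be wrapped like this. Only the `/api/transcript?url=` path comes from this page; the base URL below is a placeholder, and the Bearer-token header is an assumed auth scheme, so check your dashboard for the real values.

```python
import json
import urllib.parse
import urllib.request

# Placeholder base URL: substitute the API host from your account.
BASE_URL = "https://vidproxy.example"


def transcript_lookup_url(video: str, base_url: str = BASE_URL) -> str:
    """Build the GET /api/transcript URL for a YouTube URL or video ID."""
    query = urllib.parse.urlencode({"url": video})
    return f"{base_url}/api/transcript?{query}"


def fetch_transcript(video: str, api_key: str, base_url: str = BASE_URL) -> dict:
    """Fetch one transcript as parsed JSON.

    The Authorization header is an assumption about how the API authenticates.
    """
    request = urllib.request.Request(
        transcript_lookup_url(video, base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

URL-encoding the `url` parameter matters here, because a raw YouTube watch URL contains its own `?` and `=` characters.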

Keyword Alerts: only ingest what's relevant

On a broad channel, not every video is relevant to your domain. Use Keyword Alerts (Pro) to specify a comma-separated list of topics per subscription. VidProxy will only fire your webhook if the transcript contains at least one match. Your ingestion pipeline stays focused — and your vector store stays signal-dense.
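
If you want a second, stricter gate inside your own receiver, the same idea is easy to apply locally. VidProxy's actual matching rules (case sensitivity, word boundaries) are not described on this page; the sketch below assumes case-insensitive whole-word matching, and both function names are hypothetical.

```python
import re


def parse_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword string, dropping blanks and extra spaces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def matches_any_keyword(text: str, keywords: list[str]) -> bool:
    """True if any keyword or phrase occurs in text as whole words, ignoring case."""
    for keyword in keywords:
        # Lookarounds keep "rag" from matching inside "storage"
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False
```

Whole-word matching is the conservative choice for short keywords, since plain substring checks tend to fire on unrelated words.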

Drop into your existing AI stack

VidProxy outputs standard JSON. Everything downstream is your choice.

Vector stores

Pinecone, Supabase pgvector, Weaviate, Chroma, Qdrant, Milvus — any store that accepts an embedding vector and a metadata object.

Embedding models

OpenAI text-embedding-3-small, Cohere embed-v3, Voyage AI, or any local model via Ollama or Hugging Face. VidProxy is model-agnostic — it hands you the text.

RAG frameworks

LangChain, LlamaIndex, Haystack, or a hand-rolled retriever. VidProxy is upstream of your framework — it feeds the ingestion pipeline, not the query layer.

AI models

GPT-4o, Claude, Gemini — your vector store is model-agnostic. You get better answers at query time because your knowledge base has current, expert-level YouTube content.

Automation platforms

n8n, Make, and Zapier all support custom webhook triggers. Wire VidProxy as the trigger and your embedding step as the action — no custom code required.

Pull API for batch ingestion

Use GET /api/videos?since=7d to bootstrap a new knowledge base or backfill a gap. One call returns everything from the window you specify.
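
A backfill loop can reuse the indexing function your webhook already calls. The sketch assumes the Pull API returns a list of objects shaped like webhook payloads, which this page does not confirm; `backfill` and `index_fn` are hypothetical names.

```python
from typing import Callable, Iterable


def backfill(payloads: Iterable[dict], index_fn: Callable[[dict], None]) -> dict:
    """Feed pulled videos through the same indexing function as the webhook.

    Assumes each item looks like a webhook payload (an unconfirmed shape).
    Skips items without a transcript and repeated video URLs.
    """
    stats = {"indexed": 0, "skipped": 0}
    seen_urls = set()
    for payload in payloads:
        url = (payload.get("video") or {}).get("url")
        transcript = payload.get("transcript") or {}
        if not url or url in seen_urls or not transcript.get("available"):
            stats["skipped"] += 1
            continue
        seen_urls.add(url)
        index_fn(payload)
        stats["indexed"] += 1
    return stats
```

Running the pulled items through the webhook's own indexing path keeps vector IDs and metadata consistent between backfilled and live content.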

Common questions

What format does the transcript come in?
The transcript.text field is plain text, ready to chunk and embed. The transcript.segments array provides per-sentence entries with start (seconds), duration, and text, which is useful for building time-indexed retrieval or for storing segment-level vectors with precise video timestamps.

How do I handle videos without transcripts?
When a video has no available transcript, transcript.available is false. Your webhook receiver should check this field and skip embedding. VidProxy still delivers the payload so you can log the video's metadata for record-keeping.

Can I ingest historical videos too?
VidProxy is designed to detect new videos going forward. For backfill, the Pull API lets you query a channel's stored videos by time window, and GET /api/transcript?url= fetches any individual older video by URL or ID. You can also add the channel subscription now so collection starts from the current point, then supplement with historical data from other sources.

How do I avoid re-embedding the same video twice?
Each video has a unique URL. Use the video URL (or the video_id from the URL) as the prefix of your vector IDs, as shown in the code example above. Upserting the same IDs again overwrites the existing vectors instead of creating duplicates. If a redelivered transcript produces fewer chunks than before, delete the leftover higher-numbered IDs so stale chunks do not linger.
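
Two of the answers above, segment timestamps and stable vector IDs, can be combined in one sketch. It assumes segments carry `start` and `text` as described in the first answer, and that video URLs are standard `watch?v=` or `youtu.be` links; all three function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> str:
    """Extract the `v` parameter from a watch URL, or the path of a youtu.be link."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"no video id in {url!r}")
    return ids[0]


def chunk_segments(segments: list[dict], max_words: int = 500) -> list[dict]:
    """Group consecutive segments into chunks of roughly max_words words.

    Each chunk keeps the start time (seconds) of its first segment.
    """
    chunks, texts, count, start = [], [], 0, None
    for segment in segments:
        words = len(segment["text"].split())
        # Close the current chunk before it would exceed the word budget
        if texts and count + words > max_words:
            chunks.append({"start": start, "text": " ".join(texts)})
            texts, count, start = [], 0, None
        if start is None:
            start = segment["start"]
        texts.append(segment["text"])
        count += words
    if texts:
        chunks.append({"start": start, "text": " ".join(texts)})
    return chunks


def to_records(video_url: str, chunks: list[dict]) -> list[dict]:
    """Attach deterministic IDs and a timestamped link to each chunk."""
    video_id = video_id_from_url(video_url)
    return [
        {
            "id": f"{video_id}-chunk-{i}",
            "start": chunk["start"],
            "link": f"https://www.youtube.com/watch?v={video_id}&t={int(chunk['start'])}s",
            "text": chunk["text"],
        }
        for i, chunk in enumerate(chunks)
    ]
```

Storing the timestamped link as metadata lets your application cite the exact moment in the video that a retrieved chunk came from.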

Start feeding your pipeline fresh YouTube transcripts

Free tier includes 3 channels and webhook delivery, and stays free forever. No credit card required.