ClipClap - Transcribe your media and clip it using the words

Live Demo: travis-seng.fr/clipclap

A tiny proof of concept: drop a video, get a transcript generated locally with Whisper (via transformers.js), click words to define a sub‑segment, then export the trimmed clip. No server: everything happens in your browser.

Core Flow

  1. Load a video (drag & drop / file picker)
  2. Run the model: Whisper transcribes locally and returns word‑level timestamps
  3. Click words to mark the start and end (the UI uses the earliest and latest selected words as boundaries)
  4. Preview the subclip
  5. Export: download trimmed video + transcript
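
Step 3 can be sketched as a small pure function. This is a minimal illustration rather than ClipClap's actual code: the `SelectableWord` and `TrimRange` shapes and the `selectionToRange` name are hypothetical, and the words are assumed to be in chronological order with times in seconds.

```typescript
// Hypothetical shape for one transcribed word; times are in seconds.
interface SelectableWord {
  text: string;
  start: number;
  end: number;
}

// Hypothetical shape for the resulting clip boundaries.
interface TrimRange {
  start: number;
  end: number;
}

// Turn the clicked word indices into one contiguous trim range.
// The earliest selected word supplies the start, the latest supplies the end.
// Click order does not matter; out-of-range indices are ignored.
function selectionToRange(
  words: SelectableWord[],
  selected: number[],
): TrimRange | null {
  const valid = selected.filter(
    (i) => Number.isInteger(i) && i >= 0 && i < words.length,
  );
  if (valid.length === 0) return null;
  const first = Math.min(...valid);
  const last = Math.max(...valid);
  return { start: words[first].start, end: words[last].end };
}
```

Clicking a single word yields a range covering just that word, and clicking further words only ever widens the range.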

Features

  • In‑browser Whisper (transformers.js) — on‑device inference, privacy friendly
  • Word‑level timestamps rendered as selectable chips
  • Instant boundary selection: the earliest and latest selected words define the trim range
  • Live preview of the clipped segment before export
  • Transcript download (raw text)
  • Video export of just the selected span
  • Generation time indicator for performance feedback
  • Language selector (multi‑language capable; defaults to English)
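
One way to implement the live preview is to seek to the start of the range and pause once playback passes the end. The sketch below assumes that approach; it is not necessarily how ClipClap does it. `PlayerLike`, `startPreview` and `enforcePreviewEnd` are hypothetical names, and `PlayerLike` is just the subset of `HTMLVideoElement` the sketch needs.

```typescript
// Minimal subset of HTMLVideoElement that this sketch relies on.
interface PlayerLike {
  currentTime: number;
  pause(): void;
}

// Call once when the preview begins: seek to the start of the selected range.
function startPreview(player: PlayerLike, start: number): void {
  player.currentTime = start;
}

// Call from a `timeupdate` handler: once playback reaches the range end,
// pause and snap back to the end. Returns true when the preview has finished.
function enforcePreviewEnd(player: PlayerLike, end: number): boolean {
  if (player.currentTime >= end) {
    player.pause();
    player.currentTime = end;
    return true;
  }
  return false;
}
```

In a page this would be wired up with something like `video.addEventListener('timeupdate', () => enforcePreviewEnd(video, range.end))`. Because `timeupdate` fires only a few times per second, the preview can overshoot the end slightly before it pauses.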

Tech Stack

  • transformers.js + Whisper (small / tiny model for faster load)
  • WebAssembly backends for on‑device inference
  • HTML5 Video + Canvas time mapping for precise trimming
  • Client‑side media slicing (no upload) using MediaSource / offscreen processing
  • Vanilla TS/JS UI (lightweight, experimental)
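
As a rough sketch of the transcription step, the function below wraps a speech‑recognition callable and normalizes its output into timed words, recording elapsed time for the generation time indicator. The transcriber is injected so the sketch stays self‑contained. The `AsrOutput` shape mirrors what a transformers.js speech‑recognition pipeline is assumed to return with `return_timestamps: 'word'`; `transcribeWords` and the other type names are hypothetical, and the chunk length of 30 seconds is an illustrative choice.

```typescript
// Assumed output shape of a speech recognition pipeline with word timestamps.
interface AsrChunk {
  text: string;
  timestamp: [number, number | null];
}
interface AsrOutput {
  text: string;
  chunks?: AsrChunk[];
}

// The transcriber is passed in, so this sketch has no library dependency.
type Transcriber = (
  audio: Float32Array,
  options: { return_timestamps: 'word'; language: string; chunk_length_s: number },
) => Promise<AsrOutput>;

interface TimedWord {
  text: string;
  start: number;
  end: number;
}
interface TranscriptResult {
  words: TimedWord[];
  elapsedMs: number;
}

// Run the transcriber and flatten its chunks into { text, start, end } words.
async function transcribeWords(
  transcriber: Transcriber,
  audio: Float32Array,
  language = 'english',
): Promise<TranscriptResult> {
  const t0 = Date.now();
  const output = await transcriber(audio, {
    return_timestamps: 'word',
    language,
    chunk_length_s: 30, // illustrative value, not taken from the project
  });
  const words = (output.chunks ?? []).map((chunk) => {
    const [start, end] = chunk.timestamp;
    // A missing end time falls back to the start time.
    return { text: chunk.text.trim(), start, end: end ?? start };
  });
  return { words, elapsedMs: Date.now() - t0 };
}
```

In the browser, the injected transcriber could be one created with transformers.js's `pipeline('automatic-speech-recognition', modelId)`. Which model id and options ClipClap actually uses is not stated here. Whisper expects 16 kHz mono audio, so the video's audio track would need to be decoded and resampled before this call.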

Why I Built It

Wanted to explore:

  • How to use transformers.js
  • Running speech‑to‑text fully client side (no API keys / latency)
  • Mapping word timestamps to frame‑accurate trim points
  • Minimal UX for creating short quote clips out of a longer source
  • Performance tradeoffs of Whisper in the browser vs native / server

Notes / Limitations

  • First load requires a model download (the cache persists across subsequent runs)
  • Uses a smaller Whisper model for speed; accuracy is acceptable, not perfect
  • Long videos: memory + processing time scale with duration
  • Simple trimming (one contiguous segment) — not an editor

Future Ideas

  • Multi‑segment selection → concatenated highlight reel
  • SRT / WebVTT export
  • Per‑word confidence shading
  • Option to choose a larger model when bandwidth / patience allow
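
For the WebVTT idea, the export could be as small as the sketch below. It is not implemented in the project; `Cue`, `vttTime` and `toWebVtt` are hypothetical names, and the caller is assumed to have already grouped words into cues with times in seconds.

```typescript
// Hypothetical cue shape: a phrase (or single word) with times in seconds.
interface Cue {
  text: string;
  start: number;
  end: number;
}

// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build a WebVTT document: a "WEBVTT" header, then cues separated by blank lines.
function toWebVtt(cues: Cue[]): string {
  const blocks = cues.map(
    (c) => `${vttTime(c.start)} --> ${vttTime(c.end)}\n${c.text}`,
  );
  return ['WEBVTT', ...blocks].join('\n\n') + '\n';
}
```

SRT would need only small changes: no header, a running cue number before each timing line, and a comma instead of a dot before the milliseconds.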

(POC stage. Feedback welcome.)