ClipClap - Transcribe your media and clip it by selecting words
Live Demo: travis-seng.fr/clipclap
A tiny proof of concept: drop a video, get a transcript generated locally with Whisper (via transformers.js), click words to define a sub‑segment, export the trimmed clip. No server. Everything happens in your browser.
Core Flow
- Load a video (drag & drop / file picker)
- Run the model → Whisper transcribes locally with word-level timestamps (see the sketch after this list)
- Click words to mark start and end (UI turns the earliest + latest selection into boundaries)
- Preview the subclip
- Export: download trimmed video + transcript
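Roughly, the transcription step could look like the sketch below. It is a minimal version built on transformers.js; the exact checkpoint id and chunking options are assumptions, not necessarily what the demo ships.

```ts
import { pipeline } from '@xenova/transformers';

// Decode the video's audio track to 16 kHz mono PCM, the input Whisper expects.
async function decodeAudio(file: File): Promise<Float32Array> {
  const ctx = new AudioContext({ sampleRate: 16000 }); // resamples during decode
  const audio = await ctx.decodeAudioData(await file.arrayBuffer());
  return audio.getChannelData(0); // first channel only (mono)
}

// `file` is the File from the drag & drop / picker step.
async function transcribe(file: File) {
  // Downloaded once, then served from the browser cache on later runs.
  const transcriber = await pipeline(
    'automatic-speech-recognition',
    'Xenova/whisper-tiny.en', // placeholder checkpoint; the demo may load a different size
  );
  const output: any = await transcriber(await decodeAudio(file), {
    return_timestamps: 'word', // per-word [start, end] timestamps in seconds
    chunk_length_s: 30,        // process long audio in 30 s windows
    stride_length_s: 5,        // overlap windows so words at boundaries are not lost
  });
  // output.chunks ≈ [{ text: ' hello', timestamp: [0.12, 0.48] }, ...]
  return output.chunks as { text: string; timestamp: [number, number] }[];
}
```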
Features
- In‑browser Whisper (transformers.js) — on‑device inference, privacy friendly
- Word level timestamps rendered as selectable chips
- Instant boundary selection: the first and last clicked words define the trim range (sketched after this list)
- Live preview of the clipped segment before export
- Transcript download (raw text)
- Video export of just the selected span
- Generation time indicator for performance feedback
- Language selector (multi‑language capable; defaults to English)
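As a rough sketch, the boundary selection and live preview could work as below, assuming each word chip knows its index and each word keeps the Whisper timestamp pair from the transcription step.

```ts
type Word = { text: string; timestamp: [number, number] }; // from the transcription step

// Track which word chips have been clicked; the earliest start and latest end
// among the selected words become the trim boundaries.
const selected = new Set<number>();

function toggleWord(index: number): void {
  selected.has(index) ? selected.delete(index) : selected.add(index);
}

function trimRange(words: Word[]): { start: number; end: number } | null {
  if (selected.size === 0) return null;
  const picked = [...selected].map((i) => words[i]);
  return {
    start: Math.min(...picked.map((w) => w.timestamp[0])),
    end: Math.max(...picked.map((w) => w.timestamp[1])),
  };
}

// Preview: seek the <video> to the start and pause once playback passes the end.
function preview(video: HTMLVideoElement, range: { start: number; end: number }): void {
  video.currentTime = range.start;
  const stop = () => {
    if (video.currentTime >= range.end) {
      video.pause();
      video.removeEventListener('timeupdate', stop);
    }
  };
  video.addEventListener('timeupdate', stop);
  void video.play();
}
```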
Tech Stack
- transformers.js + Whisper (small / tiny model for faster load)
- WebAssembly backends for on‑device inference
- HTML5 Video + Canvas time mapping for precise trimming
- Client-side media slicing (no upload) via MediaSource / offscreen processing (one possible approach is sketched after this list)
- Vanilla TS/JS UI (lightweight, experimental)
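For the slicing step, one self-contained browser-only approach is to replay the selected span through the video element and record it with captureStream + MediaRecorder. The project mentions MediaSource / offscreen processing, so treat this as an illustrative alternative (it re-encodes to WebM), not the actual implementation.

```ts
// Record the selected span by playing it back and capturing the playback stream.
async function exportClip(
  video: HTMLVideoElement,
  start: number,
  end: number,
): Promise<Blob> {
  const stream = (video as HTMLVideoElement & { captureStream(): MediaStream }).captureStream();
  const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
  const parts: Blob[] = [];
  recorder.ondataavailable = (e) => parts.push(e.data);

  const done = new Promise<Blob>((resolve) => {
    recorder.onstop = () => resolve(new Blob(parts, { type: 'video/webm' }));
  });

  // Seek to the start boundary, then record while playing.
  video.currentTime = start;
  await new Promise((r) => video.addEventListener('seeked', r, { once: true }));
  recorder.start();
  await video.play();

  // Stop recording when playback reaches the end boundary.
  const tick = () => {
    if (video.currentTime >= end) {
      video.pause();
      recorder.stop();
    } else {
      requestAnimationFrame(tick);
    }
  };
  requestAnimationFrame(tick);

  return done;
}

// Offer the trimmed clip as a download.
function download(blob: Blob, name = 'clip.webm'): void {
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = name;
  a.click();
  URL.revokeObjectURL(a.href);
}
```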
Why I Built It
Wanted to explore:
- Getting hands-on with transformers.js
- Running speech‑to‑text fully client side (no API keys, no network round trips)
- Mapping word timestamps to frame‑accurate trim points
- Minimal UX for creating short quote clips out of a longer source
- Performance tradeoffs of Whisper in the browser vs native / server
Notes / Limitations
- First load requires a model download (cached for subsequent runs)
- Uses a smaller Whisper model for speed — accuracy is acceptable, not perfect
- Long videos: memory + processing time scale with duration
- Simple trimming (one contiguous segment) — not an editor
Future Ideas
- Multi‑segment selection → concatenated highlight reel
- SRT / WebVTT export (a rough SRT sketch follows this list)
- Per‑word confidence shading
- Option to choose a larger model when bandwidth / patience allows
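For the SRT idea, a naive conversion from word timestamps to cues could look like this. It is purely illustrative; grouping cues by a fixed word count is an assumption, not part of the project.

```ts
type Word = { text: string; timestamp: [number, number] }; // word-level chunks from Whisper

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, len = 2) => String(n).padStart(len, '0');
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group a handful of words into each cue and emit numbered SRT blocks.
function toSrt(words: Word[], wordsPerCue = 8): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const start = group[0].timestamp[0];
    const end = group[group.length - 1].timestamp[1];
    const text = group.map((w) => w.text.trim()).join(' ');
    cues.push(`${cues.length + 1}\n${toSrtTime(start)} --> ${toSrtTime(end)}\n${text}\n`);
  }
  return cues.join('\n');
}
```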
(POC stage. Feedback welcome.)