Video & Audio Transcriber — Word-Level + SRT/VTT

Transcribe any video or audio URL into text with word-level timestamps and ready SRT, VTT, and TXT files. Auto language detection, batch mode.


How it works

1. Open it on Apify. Hit Run on Apify to open the tool in the cloud, with nothing to install.
2. Set the inputs. Adjust mediaUrl, mediaUrls, and language (sensible defaults are pre-filled).
3. Click Run. The tool runs on Apify’s cloud and transcribes your media.
4. Export the results. Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.

Inputs

Field	What it does	Type
`mediaUrl`	Public URL to a video or audio file (mp4, mov, mp3, wav, m4a, webm). Use this for a single file, or mediaUrls for a batch.	string
`mediaUrls`	Transcribe several files in one run — one dataset row per URL. Each item is a public video/audio URL.	array
`language`	Spoken language ISO code, or 'auto' to detect.	string
`wordTimestamps`	Return per-word start/end times (great for karaoke captions).	boolean
`outputFormats`	Which subtitle/text files to also produce: srt, vtt, txt.	array
`openaiApiKey`	Your OpenAI (Whisper) key. Kept private.	string
`model`	Transcription model. Default whisper-1.	string
`baseUrl`	OpenAI-compatible base URL. Default https://api.openai.com/v1.	string

What you get

A structured dataset — each result includes fields like:

`_demo_notice`, `durationSeconds`, `language`, `segmentCount`, `segments`, `sourceUrl`, `srtKey`, `text`, `vttKey`, `wordCount`

Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.

3 ready-to-run use cases

MP4 to SRT: Generate Timed Subtitles From a Video URL

Turn an MP4 video URL into a timed SRT subtitle file with auto-detected language, ready to upload as captions to YouTube or Vimeo.

Podcast MP3 to Text: Transcribe Episodes to Transcript

Podcasters get a clean text transcript from any MP3 episode URL, ready to paste into show notes, a blog post, or searchable archives.

Word-Level Timestamps for Karaoke & TikTok Captions

Need word-by-word captions that pop in sync? This returns per-word start and end times for animated TikTok and Reels karaoke subtitles.

Video & Audio Transcriber

Give it a public video or audio URL and it returns accurate text with segment and word-level timestamps, plus ready-to-use SRT, VTT, and TXT files. It detects the spoken language automatically. Built for people who need captions, searchable transcripts, or source text to repurpose into clips, articles, or show notes.

How it works

The actor downloads your media, extracts the audio track with ffmpeg, and sends it to OpenAI's Whisper using your own API key. The timestamps and subtitle files come straight from the model's segment and word data, so timing lines up with the actual speech.
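To make the audio-extraction step concrete, here is a minimal sketch of the kind of ffmpeg call involved. The actor's real flags are not documented here, so the mono downmix, 16 kHz sample rate, and 64k bitrate are assumptions chosen for speech, and `build_ffmpeg_args` is a hypothetical helper name.

```python
def build_ffmpeg_args(source: str, target: str) -> list[str]:
    """Return an ffmpeg argument list that keeps only the audio track."""
    return [
        "ffmpeg",
        "-y",            # overwrite the target if it already exists
        "-i", source,    # input video or audio file
        "-vn",           # drop any video stream
        "-ac", "1",      # downmix to mono (assumed setting)
        "-ar", "16000",  # resample to 16 kHz (assumed setting)
        "-b:a", "64k",   # modest bitrate to keep the upload small (assumed setting)
        target,          # output path; the extension selects the container
    ]


args = build_ffmpeg_args("episode.mp4", "episode.mp3")
print(" ".join(args))
```

With ffmpeg installed, the list can be passed to `subprocess.run(args, check=True)`. The resulting audio file is what would then be uploaded to the transcription endpoint.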

Input

Field	Required	Notes
`mediaUrl`	yes, unless `mediaUrls` is set	Public URL to a video or audio file (mp4, mov, mp3, wav, m4a, webm, and similar). For a batch, pass an array of such URLs in `mediaUrls` instead.
`language`	no	ISO code of the spoken language, or `auto` to detect it. Defaults to `auto`.
`wordTimestamps`	no	Return per-word start/end times. Useful for karaoke-style captions. On by default.
`outputFormats`	no	Which files to generate: any of `srt`, `vtt`, `txt`. Defaults to `srt` and `vtt`.
`openaiApiKey`	yes	Your OpenAI (Whisper) key. Kept private and used only for this run.

There are two advanced fields if you need them: `model` (defaults to `whisper-1`) and `baseUrl` for an OpenAI-compatible endpoint (defaults to https://api.openai.com/v1).

Output

One dataset record per media URL, so a single-file run produces one record and a batch run produces one per item in `mediaUrls`. Each record includes the detected language, the full text, segments with start/end times, and words when word timestamps are enabled, along with wordCount, segmentCount, and durationSeconds. Each requested subtitle file is saved to the key-value store and referenced by srtKey/srtUrl, vttKey/vttUrl, and txtKey/txtUrl.
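If you want to rebuild a subtitle file from the `segments` array yourself, the conversion to SRT is short. This is a sketch, not the actor's own code, and it assumes each segment is an object with `start` and `end` in seconds plus a `text` string (the shape Whisper's verbose output uses); check a real record before relying on those names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(cues) + "\n"


# Assumed segment shape: start/end in seconds, text as a string.
example = [
    {"start": 0.0, "end": 2.4, "text": " Hello there."},
    {"start": 2.4, "end": 5.0, "text": " Welcome to the show."},
]
print(segments_to_srt(example))
```

WebVTT differs mainly in starting with a `WEBVTT` header line and using a period instead of a comma before the milliseconds.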

Example

{
  "mediaUrl": "https://example.com/podcast.mp3",
  "language": "auto",
  "wordTimestamps": true,
  "outputFormats": ["srt", "vtt", "txt"],
  "openaiApiKey": "sk-..."
}
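If you build the input programmatically, a small helper can choose between `mediaUrl` and `mediaUrls` and catch a mistyped format early. This is only a sketch: `build_run_input` is a hypothetical helper, the field names and defaults come from the input tables on this page, and the validation rules are assumptions rather than the actor's own checks.

```python
import json

ALLOWED_FORMATS = {"srt", "vtt", "txt"}


def build_run_input(urls, api_key, language="auto",
                    word_timestamps=True, output_formats=("srt", "vtt")):
    """Build an input object: mediaUrl for one file, mediaUrls for a batch."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one media URL is required")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    run_input = {
        "language": language,
        "wordTimestamps": word_timestamps,
        "outputFormats": list(output_formats),
        "openaiApiKey": api_key,
    }
    if len(urls) == 1:
        run_input["mediaUrl"] = urls[0]   # single file
    else:
        run_input["mediaUrls"] = urls     # batch: one dataset record per URL
    return run_input


single = build_run_input(["https://example.com/podcast.mp3"], "sk-...",
                         output_formats=("srt", "vtt", "txt"))
print(json.dumps(single, indent=2))
```

The printed JSON can be pasted into the actor's input editor or sent as the run input through the Apify API.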

Pricing

$0.04 per minute of audio, pay per result, no subscription. You bring your own OpenAI key, so Whisper usage is billed by OpenAI separately.
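As a rough worked example of the per-minute rate, the sketch below estimates the actor charge from a file's duration. It assumes the charge scales linearly with duration and is not rounded up to whole minutes, which this page does not confirm, and it leaves out the separate OpenAI bill.

```python
PRICE_PER_MINUTE_USD = 0.04  # actor rate quoted above; OpenAI usage is extra


def estimate_actor_cost(duration_seconds: float) -> float:
    """Estimate the actor charge in USD, assuming linear per-minute billing."""
    minutes = duration_seconds / 60
    return round(minutes * PRICE_PER_MINUTE_USD, 4)


# A 45-minute podcast episode (2700 seconds):
print(estimate_actor_cost(2700))  # 1.8
```

After a run, the `durationSeconds` field of each record can be fed into the same function to reconcile the estimate against what was billed.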

Notes

The mediaUrl has to be directly downloadable. Pages that require login or stream behind a player won't work, so point it at the raw file. Long files take longer and cost more since billing is per minute of audio.