No install is needed. It is a ready-to-run cloud tool: set the inputs and click Run. Built on Video & Audio Transcriber — Word-Level + SRT/VTT.

Word-Level Timestamps for Karaoke & TikTok Captions

Need word-by-word captions that pop in sync? This tool returns per-word start and end times that you can use to animate karaoke-style subtitles for TikTok and Reels.

How it works

1. Open it on Apify. Hit Run on Apify to open the tool in the cloud, with no install.
2. Set the inputs. Adjust mediaUrl, mediaUrls, and language (sensible defaults are pre-filled).
3. Click Run. The tool runs on Apify’s cloud and transcribes your media for you.
4. Export the results. Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.
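
If you would rather pipe results into your own app than click through the console, the same four steps can be sketched as a single HTTP request against Apify's public API. This is a minimal sketch, not the tool's official client: the actor ID `your-username~your-actor` and the token value are placeholders you would replace, and the function only builds the request so you can inspect it before sending.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_sync_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an actor and returns its dataset items.

    actor_id is a placeholder such as "your-username~your-actor";
    look up the real ID on the actor's Apify page.
    """
    quoted_id = urllib.parse.quote(actor_id, safe="~")
    url = f"{API_BASE}/acts/{quoted_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Example (placeholders only, nothing is sent here):
request = build_sync_run_request(
    "your-username~your-actor",
    "YOUR_APIFY_TOKEN",
    {"mediaUrl": "https://example.com/clip.mp4", "wordTimestamps": True},
)
```

To actually send it, pass the request to `urllib.request.urlopen(request)` and parse the response body with `json.load`. Synchronous runs are time-limited, so long media files may be better served by starting a run and fetching the dataset afterwards.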

Field	What it does	Type
`mediaUrl`	Public URL to a video or audio file (mp4, mov, mp3, wav, m4a, webm). Use this for a single file, or mediaUrls for a batch.	string
`mediaUrls`	Transcribe several files in one run — one dataset row per URL. Each item is a public video/audio URL.	array
`language`	Spoken language ISO code, or 'auto' to detect.	string
`wordTimestamps`	Return per-word start/end times (great for karaoke captions).	boolean
`outputFormats`	Which subtitle/text files to also produce: srt, vtt, txt.	array
`openaiApiKey`	Your OpenAI (Whisper) key. Kept private.	string
`model`	Transcription model. Default whisper-1.	string
`baseUrl`	OpenAI-compatible base URL. Default https://api.openai.com/v1.	string
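
As an illustration of how the fields above fit together, here is a small sketch that assembles a run input for karaoke captions. The field names come from the table; the helper name `build_run_input`, its defaults, and the choice to leave out `openaiApiKey`, `model` and `baseUrl` are assumptions made for the example, so check the tool's input form for what is actually required.

```python
from __future__ import annotations

from typing import Any

# Subtitle/text formats listed in the input table.
ALLOWED_FORMATS = {"srt", "vtt", "txt"}


def build_run_input(
    media_urls: list[str],
    language: str = "auto",
    word_timestamps: bool = True,
    output_formats: tuple[str, ...] | list[str] = ("srt", "vtt"),
) -> dict[str, Any]:
    """Assemble an input object using the field names from the table."""
    urls = [u.strip() for u in media_urls if u.strip()]
    if not urls:
        raise ValueError("at least one media URL is required")

    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    run_input: dict[str, Any] = {
        "language": language,
        "wordTimestamps": word_timestamps,
        "outputFormats": list(output_formats),
    }
    # mediaUrl is for a single file, mediaUrls for a batch.
    if len(urls) == 1:
        run_input["mediaUrl"] = urls[0]
    else:
        run_input["mediaUrls"] = urls
    return run_input


karaoke_input = build_run_input(["https://example.com/clip.mp4"])
```

The resulting dictionary can be pasted into the tool's JSON input editor or sent through the API.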

A structured dataset — each result includes fields like:

_demo_notice, durationSeconds, language, segmentCount, segments, sourceUrl, srtKey, text, vttKey, wordCount
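
To show how per-word timings can drive karaoke-style captions, the sketch below turns a list of words into WebVTT cues with inline timestamp tags, which players that support them can use to highlight each word as it is spoken. The shape of a word entry (`word`, `start`, `end`, with times in seconds) is an assumption for illustration; inspect a real result to see where the tool puts word timings and what the keys are called.

```python
def format_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_karaoke_vtt(words: list[dict], max_words: int = 4) -> str:
    """Group words into short cues; tag every word after the first with its start time.

    Each word is assumed to look like {"word": "Hello", "start": 0.0, "end": 0.4}.
    """
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cue_start = chunk[0]["start"]
        cue_end = chunk[-1]["end"]
        # The first word appears at cue start; later words carry inline timestamps.
        parts = [chunk[0]["word"]]
        for w in chunk[1:]:
            parts.append(f"<{format_ts(w['start'])}>{w['word']}")
        lines.append(f"{format_ts(cue_start)} --> {format_ts(cue_end)}")
        lines.append(" ".join(parts))
        lines.append("")
    return "\n".join(lines)


sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 1.0},
    {"word": "again", "start": 1.25, "end": 2.0},
]
print(words_to_karaoke_vtt(sample_words, max_words=2))
```

Short cues of two to four words tend to suit vertical video; adjust `max_words` to taste. Video editors that do not read WebVTT timestamp tags would need the same timings exported in their own format.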

Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.