Internet Archive Scraper
Search archive.org by keyword and export clean items (title, creator, year, downloads, item URL). Filter by media type, sort by popularity or date.
How it works
1. Open it on Apify
   Hit Run on Apify. It opens the tool in the cloud, no install.
2. Set the inputs
   Adjust `query`, `mediaType`, `sort` (sensible defaults are pre-filled).
3. Click Run
   The tool runs on Apify's cloud and collects the data for you.
4. Export the results
   Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.
Inputs
| Field | What it does | Type |
|---|---|---|
| query | Keywords to search the Internet Archive for (e.g. "nasa apollo", "jazz"). Supports Lucene operators used by archive.org, e.g. "title:(grateful dead) AND year:[1 | string |
| mediaType | Restrict results to one media type, or leave empty for any. texts = books/documents, audio = music/recordings, movies = video/film, software, image, web (archiv | string |
| sort | Order of results. downloads = most-downloaded first, date = newest item date first, publicdate = most recently added to archive.org first, relevance = the archi | string |
| maxItems | Maximum number of unique items to return. The actor paginates 100 per request until this many items are collected or the result set is exhausted. | integer |
| notionConnector | Optional. Write each item as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors, | string |
| notionParentId | Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your wo | string |
What you get
A structured dataset. Each result includes fields like:
`creator`, `date`, `description`, `downloads`, `identifier`, `mediaType`, `publicdate`, `subjects`, `title`, `url`, `year`.
Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.
2 ready-to-run use cases
Archive.org Book Search by Keyword to JSON
Free public-domain books from archive.org's text collection by keyword, with author, publication year and item link for every title. Ideal for researchers.
Newest Archive.org Uploads for Any Search Term
Track recently added archive.org items for any topic, sorted newest first by upload date, each with its title, date and direct link. Great for monitoring.
Internet Archive Scraper
Search the Internet Archive (archive.org) by keyword and get back clean, structured items — title, creator, year, downloads, subjects, description and the item URL. No API key, no login.
Built on the public advancedsearch.php JSON API. Filter by media type (texts, audio, movies, software, image, …), sort by downloads, date, or relevance, and paginate transparently up to your item limit.
What you get per item
identifier, title, creator, year, date, mediaType, downloads, subjects (array), description (first ~500 chars), publicdate, and url (https://archive.org/details/{identifier}).
Fields that can be null
- `title`, `creator`, `year`, `date`, `description`, `publicdate`: null when archive.org's metadata doesn't include that field for an item.
- `subjects`: empty array when the item has no subject tags.
- `downloads`: `0` when not reported.
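The null and default rules above can be sketched as a small mapping function. This is an illustration only, not the actor's actual code: `to_row`, `_first_text`, `_as_list` and the exact 500-character cut-off are hypothetical, and it assumes archive.org may return a field either as a single value or as a list.

```python
from typing import Any, Optional

# Assumption: the README says "first ~500 chars"; the exact limit is a guess.
DESCRIPTION_LIMIT = 500


def _first_text(value: Any) -> Optional[str]:
    """Return the value (or the first list element) as a string, else None."""
    if isinstance(value, list):
        value = value[0] if value else None
    if value is None or value == "":
        return None
    return str(value)


def _as_list(value: Any) -> list:
    """Normalize a missing, scalar, or list value to a list of strings."""
    if value is None:
        return []
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


def to_row(doc: dict) -> dict:
    """Map one raw search doc to the row shape described in this README."""
    identifier = doc["identifier"]
    description = _first_text(doc.get("description"))
    if description is not None:
        description = description[:DESCRIPTION_LIMIT]
    return {
        "identifier": identifier,
        "title": _first_text(doc.get("title")),
        "creator": _first_text(doc.get("creator")),
        "year": doc.get("year"),  # passed through unchanged; None when absent
        "date": _first_text(doc.get("date")),
        "mediaType": _first_text(doc.get("mediatype")),
        "downloads": int(doc.get("downloads") or 0),  # 0 when not reported
        "subjects": _as_list(doc.get("subject")),  # empty list when untagged
        "description": description,
        "publicdate": _first_text(doc.get("publicdate")),
        "url": f"https://archive.org/details/{identifier}",
    }
```

Downstream code can then rely on `subjects` always being a list and `downloads` always being an integer, and only needs null checks for the six optional text fields.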
Input
| Field | Notes |
|---|---|
| query | Required. Keywords, e.g. nasa apollo, jazz. Supports archive.org Lucene operators, e.g. title:(grateful dead) AND year:[1977 TO 1980]. |
| mediaType | Restrict to one type: texts, audio, movies, software, image, web, data, collection. Empty = any. |
| sort | downloads (default), date, publicdate, or relevance. |
| maxItems | Max unique items to return (default 100). Paginates 100 per request until reached or exhausted. |
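To check an input object against the table above before starting a run, a client-side validation might look like the following sketch. The function name `normalize_input`, the `message` field, and the fallback to `downloads` for an unrecognized `sort` are assumptions made for illustration; only the allowed values, the defaults, and the `BAD_INPUT` code come from this README.

```python
ALLOWED_MEDIA_TYPES = {
    "texts", "audio", "movies", "software", "image", "web", "data", "collection",
}
ALLOWED_SORTS = {"downloads", "date", "publicdate", "relevance"}


def normalize_input(raw: dict) -> dict:
    """Return a cleaned input dict, or a BAD_INPUT diagnostic dict."""
    query = str(raw.get("query") or "").strip()
    media_type = str(raw.get("mediaType") or "").strip()

    if not query:
        # "message" is a hypothetical field, not documented by the actor.
        return {"ok": False, "errorCode": "BAD_INPUT", "message": "query is empty"}
    if media_type and media_type not in ALLOWED_MEDIA_TYPES:
        return {
            "ok": False,
            "errorCode": "BAD_INPUT",
            "message": f"unknown mediaType: {media_type}",
        }

    sort = raw.get("sort") or "downloads"
    if sort not in ALLOWED_SORTS:
        sort = "downloads"  # assumption: fall back to the documented default

    return {
        "ok": True,
        "query": query,
        "mediaType": media_type or None,  # None means "any media type"
        "sort": sort,
        "maxItems": int(raw.get("maxItems") or 100),
    }
```

For example, `normalize_input({"query": "jazz", "mediaType": "audio"})` fills in `sort` and `maxItems` with the documented defaults.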
Output
One dataset row per item. Pricing is pay-per-result: you are only charged for genuine item rows (ok: true). Diagnostic rows are never charged — this includes:
- empty/invalid input (`errorCode: "BAD_INPUT"`: empty query or an unknown `mediaType`),
- no results for the query (`NO_RESULTS`),
- rate limits or network errors (`RATE_LIMITED` / `NETWORK` / `SERVER_ERROR`).
Results are de-duplicated by identifier.
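When consuming an exported dataset, item rows can be separated from diagnostic rows with a few lines of code. The sketch below assumes every item row carries `ok: true` and an `identifier`, as described above; the function name `split_rows` is hypothetical. The de-duplication step only mirrors the rule stated above and is useful mainly when merging several runs.

```python
from typing import Iterable, List, Tuple


def split_rows(rows: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split dataset rows into unique item rows and diagnostic rows."""
    seen = set()
    items: List[dict] = []
    diagnostics: List[dict] = []
    for row in rows:
        if not row.get("ok"):
            # Diagnostic rows such as BAD_INPUT, NO_RESULTS, RATE_LIMITED.
            diagnostics.append(row)
            continue
        identifier = row["identifier"]
        if identifier in seen:
            continue  # keep only the first row seen for each identifier
        seen.add(identifier)
        items.append(row)
    return items, diagnostics
```

The length of `items` then corresponds to the number of genuine item rows, which is the quantity the pay-per-result pricing refers to.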
Proxy
The archive.org advancedsearch API is a public, no-auth JSON endpoint with no anti-bot protection, so no proxy is required and the default configuration runs without one (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.
Troubleshooting
- Getting a `BAD_INPUT` row? Provide a non-empty `query`, and if you set `mediaType` make sure it's one of the allowed values.
- `NO_RESULTS`? The query matched nothing on archive.org. Broaden the keywords or remove the media-type filter.
- Want fewer/more results? Adjust `maxItems`. The archive can return very large result sets for broad queries.
Example
{ "query": "jazz", "mediaType": "audio", "sort": "downloads", "maxItems": 50 }
Notes
The actor calls advancedsearch.php with output=json, requesting identifier, title, creator, year, date, mediatype, downloads, description, subject, and publicdate, then maps each doc to a clean row. Pagination uses page with 100 rows per request until your maxItems is reached or the numFound total is exhausted.
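The request and pagination flow described above can be approximated with the Python standard library. This is a minimal sketch, not the actor's source: the way the media-type filter is appended to the query, the `<field> desc` sort strings, and the helper names `build_url`, `fetch_json` and `search` are assumptions. The `fetch` parameter is injectable so the loop can be exercised without network access.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://archive.org/advancedsearch.php"
FIELDS = [
    "identifier", "title", "creator", "year", "date",
    "mediatype", "downloads", "description", "subject", "publicdate",
]
PAGE_SIZE = 100  # rows per request, as stated above


def build_url(query: str, page: int, media_type: Optional[str] = None,
              sort: str = "downloads") -> str:
    """Build one advancedsearch.php request URL for the given page."""
    # Assumption: the media-type filter is expressed as a Lucene clause.
    q = f"({query}) AND mediatype:({media_type})" if media_type else query
    params = [("q", q), ("rows", PAGE_SIZE), ("page", page), ("output", "json")]
    params += [("fl[]", field) for field in FIELDS]
    if sort != "relevance":
        # Assumption: every non-relevance sort is descending.
        params.append(("sort[]", f"{sort} desc"))
    return SEARCH_URL + "?" + urlencode(params)


def fetch_json(url: str) -> dict:
    """Download and decode one JSON response."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def search(query: str, max_items: int = 100, media_type: Optional[str] = None,
           sort: str = "downloads",
           fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield unique docs until max_items is reached or results run out."""
    seen = set()
    page = 1
    while len(seen) < max_items:
        body = fetch(build_url(query, page, media_type, sort))["response"]
        docs = body.get("docs", [])
        if not docs:
            break
        for doc in docs:
            identifier = doc.get("identifier")
            if identifier is None or identifier in seen:
                continue
            seen.add(identifier)
            yield doc
            if len(seen) >= max_items:
                return
        if page * PAGE_SIZE >= body.get("numFound", 0):
            break  # the numFound total is exhausted
        page += 1
```

A call such as `list(search("jazz", max_items=50, media_type="audio"))` would correspond to the example input shown earlier, subject to the assumptions listed.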