Internet Archive Scraper

Search archive.org by keyword and export clean items (title, creator, year, downloads, item URL). Filter by media type, sort by popularity or date.

How it works

1. Open it on Apify. Hit Run on Apify to open the tool in the cloud; no install is needed.
2. Set the inputs. Adjust query, mediaType, sort (sensible defaults are pre-filled).
3. Click Run. The tool runs on Apify's cloud and collects the data for you.
4. Export the results. Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.

Inputs

Field	What it does	Type
`query`	Keywords to search the Internet Archive for (e.g. "nasa apollo", "jazz"). Supports Lucene operators used by archive.org, e.g. "title:(grateful dead) AND year:[1	string
`mediaType`	Restrict results to one media type, or leave empty for any. texts = books/documents, audio = music/recordings, movies = video/film, software, image, web (archiv	string
`sort`	Order of results. downloads = most-downloaded first, date = newest item date first, publicdate = most recently added to archive.org first, relevance = the archi	string
`maxItems`	Maximum number of unique items to return. The actor paginates 100 per request until this many items are collected or the result set is exhausted.	integer
`notionConnector`	Optional. Write each item as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors,	string
`notionParentId`	Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your wo	string

What you get

A structured dataset — each result includes fields like:

creator, date, description, downloads, identifier, mediaType, publicdate, subjects, title, url, year

Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.

2 ready-to-run use cases

Archive.org Book Search by Keyword to JSON

Find free public-domain books in archive.org's texts collection by keyword, with author, publication year and item link for every title. Ideal for researchers.

Newest Archive.org Uploads for Any Search Term

Track recently added archive.org items for any topic, sorted newest first by upload date, each with its title, date and direct link. Great for monitoring.

Internet Archive Scraper

Search the Internet Archive (archive.org) by keyword and get back clean, structured items — title, creator, year, downloads, subjects, description and the item URL. No API key, no login.

Built on the public advancedsearch.php JSON API. Filter by media type (texts, audio, movies, software, image, …), sort by downloads, date, publicdate, or relevance, and paginate transparently up to your item limit.

What you get per item

identifier, title, creator, year, date, mediaType, downloads, subjects (array), description (first ~500 chars), publicdate, and url (https://archive.org/details/{identifier}).

Fields that can be null or empty

title, creator, year, date, description, publicdate — null when archive.org's metadata doesn't include that field for an item.
subjects — empty array when the item has no subject tags.
downloads — 0 when not reported.
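
The null, empty-array and zero rules above can be sketched as a small mapping function. This is an illustration only, not the actor's actual code: the helper names (`first_or_none`, `to_row`) are hypothetical, and taking the first value when archive.org returns a list for a single-valued field is an assumption.

```python
from typing import Any, Dict, List


def first_or_none(value: Any) -> Any:
    """Return a single value or None (assumption: keep the first entry of a list)."""
    if value is None:
        return None
    if isinstance(value, list):
        return value[0] if value else None
    return value


def to_row(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw search doc to a row shaped like the fields listed above."""
    identifier = doc["identifier"]

    # subjects: always a list, empty when the item has no subject tags
    subject = doc.get("subject")
    if subject is None:
        subjects: List[str] = []
    elif isinstance(subject, list):
        subjects = [str(s) for s in subject]
    else:
        subjects = [str(subject)]

    # description: null when missing, otherwise cut to roughly 500 characters
    description = first_or_none(doc.get("description"))
    if description is not None:
        description = str(description)[:500]

    return {
        "identifier": identifier,
        "title": first_or_none(doc.get("title")),
        "creator": first_or_none(doc.get("creator")),
        "year": first_or_none(doc.get("year")),
        "date": first_or_none(doc.get("date")),
        "mediaType": first_or_none(doc.get("mediatype")),
        "downloads": int(doc.get("downloads") or 0),  # 0 when not reported
        "subjects": subjects,
        "description": description,
        "publicdate": first_or_none(doc.get("publicdate")),
        "url": f"https://archive.org/details/{identifier}",
    }
```

Downstream code can then rely on `subjects` always being a list and `downloads` always being an integer, and only needs null checks for the six optional text fields.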

Input

Field	Notes
`query`	Required. Keywords, e.g. `nasa apollo`, `jazz`. Supports archive.org Lucene operators, e.g. `title:(grateful dead) AND year:[1977 TO 1980]`.
`mediaType`	Restrict to one type: `texts`, `audio`, `movies`, `software`, `image`, `web`, `data`, `collection`. Empty = any.
`sort`	`downloads` (default), `date`, `publicdate`, or `relevance`.
`maxItems`	Max unique items to return (default 100). Paginates 100 per request until reached or exhausted.
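
As a rough sketch of how these inputs could translate into an advancedsearch.php request, the snippet below validates the input and builds the query string. Several details are assumptions rather than documented actor behavior: the function names are hypothetical, the media type is combined with the query as `(query) AND mediatype:<type>`, each sort option maps to a descending `sort[]` value, and `relevance` sends no sort parameter at all.

```python
from typing import List, Tuple
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"
ALLOWED_MEDIA_TYPES = {
    "texts", "audio", "movies", "software", "image", "web", "data", "collection",
}
# Assumption: descending order for every explicit sort; relevance = no sort param.
SORT_PARAMS = {
    "downloads": "downloads desc",
    "date": "date desc",
    "publicdate": "publicdate desc",
    "relevance": None,
}
FIELDS = [
    "identifier", "title", "creator", "year", "date",
    "mediatype", "downloads", "description", "subject", "publicdate",
]


def build_params(query: str, media_type: str = "", sort: str = "downloads",
                 page: int = 1) -> List[Tuple[str, str]]:
    """Validate the inputs and return request parameters as ordered pairs."""
    query = query.strip()
    if not query:
        raise ValueError("query must not be empty")
    if media_type and media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unknown mediaType: {media_type!r}")
    if sort not in SORT_PARAMS:
        raise ValueError(f"unknown sort: {sort!r}")

    # Assumption: the media-type filter is ANDed onto the user's query.
    q = f"({query}) AND mediatype:{media_type}" if media_type else query
    params = [("q", q), ("output", "json"), ("rows", "100"), ("page", str(page))]
    params += [("fl[]", field) for field in FIELDS]
    if SORT_PARAMS[sort]:
        params.append(("sort[]", SORT_PARAMS[sort]))
    return params


def build_url(query: str, media_type: str = "", sort: str = "downloads",
              page: int = 1) -> str:
    """Return the full request URL for one page of results."""
    return f"{SEARCH_URL}?{urlencode(build_params(query, media_type, sort, page))}"
```

For example, `build_url("jazz", "audio")` yields a URL you can open in a browser to inspect the raw JSON that a row would be built from.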

Output

One dataset row per item. Pricing is pay-per-result: you are only charged for genuine item rows (ok: true). Diagnostic rows are never charged — this includes:

empty/invalid input (errorCode: "BAD_INPUT" — empty query or an unknown mediaType),
no results for the query (NO_RESULTS),
rate limits or network errors (RATE_LIMITED / NETWORK / SERVER_ERROR).

Results are de-duplicated by identifier.
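
When post-processing an exported dataset, it can help to separate genuine item rows from diagnostic rows before doing anything else. The sketch below assumes the export has been loaded as a list of dictionaries and that diagnostic rows are exactly those where `ok` is not `true`; the function name `split_rows` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_rows(rows: List[Row]) -> Tuple[List[Row], Counter]:
    """Return (item rows, count of diagnostic rows per errorCode)."""
    items: List[Row] = []
    diagnostics: Counter = Counter()
    for row in rows:
        if row.get("ok") is True:
            items.append(row)
        else:
            # Assumption: diagnostic rows carry an errorCode such as NO_RESULTS.
            diagnostics[row.get("errorCode", "UNKNOWN")] += 1
    return items, diagnostics


rows = [
    {"ok": True, "identifier": "item-one", "title": "First"},
    {"ok": True, "identifier": "item-two", "title": "Second"},
    {"ok": False, "errorCode": "RATE_LIMITED"},
]
items, diagnostics = split_rows(rows)
print(len(items), dict(diagnostics))
```

The number of item rows returned here should match the number of results a run is charged for under the pricing rule above.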

Proxy

The archive.org advancedsearch API is a public, no-auth JSON endpoint with no anti-bot protection, so no proxy is required and the default configuration runs without one (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.

Troubleshooting

Getting a BAD_INPUT row? Provide a non-empty query, and if you set mediaType make sure it's one of the allowed values.
NO_RESULTS? The query matched nothing on archive.org — broaden the keywords or remove the media-type filter.
Want fewer/more results? Adjust maxItems. The archive can return very large result sets for broad queries.

Example

{ "query": "jazz", "mediaType": "audio", "sort": "downloads", "maxItems": 50 }

Notes

The actor calls advancedsearch.php with output=json, requesting identifier, title, creator, year, date, mediatype, downloads, description, subject, and publicdate, then maps each doc to a clean row. Pagination uses page with 100 rows per request until your maxItems is reached or the numFound total is exhausted.
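
The pagination loop described above might look roughly like the following. It is a minimal sketch under stated assumptions, not the actor's source: `collect_items` and `fetch_page_http` are hypothetical names, the page fetcher is injected so the loop can be exercised without network access, and the response is assumed to have the usual `response.numFound` / `response.docs` shape.

```python
import json
from typing import Any, Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://archive.org/advancedsearch.php"


def collect_items(fetch_page: Callable[[int], Dict[str, Any]],
                  max_items: int = 100,
                  rows_per_page: int = 100) -> List[Dict[str, Any]]:
    """Fetch pages until max_items unique docs are collected or results run out."""
    seen = set()
    items: List[Dict[str, Any]] = []
    page = 1
    while len(items) < max_items:
        response = fetch_page(page)["response"]
        docs = response.get("docs", [])
        if not docs:
            break  # nothing more to read
        for doc in docs:
            identifier = doc["identifier"]
            if identifier in seen:
                continue  # de-duplicate by identifier
            seen.add(identifier)
            items.append(doc)
            if len(items) >= max_items:
                break
        if page * rows_per_page >= response.get("numFound", 0):
            break  # the numFound total is exhausted
        page += 1
    return items


def fetch_page_http(query: str, page: int) -> Dict[str, Any]:
    """One real request to advancedsearch.php (defined here, not called)."""
    params = [("q", query), ("output", "json"), ("rows", "100"),
              ("page", str(page)), ("fl[]", "identifier"), ("fl[]", "title")]
    with urlopen(f"{SEARCH_URL}?{urlencode(params)}", timeout=30) as resp:
        return json.load(resp)
```

A call such as `collect_items(lambda page: fetch_page_http("jazz", page), max_items=50)` would then return at most 50 unique docs; in practice you would also want retries and back-off for rate limits, which this sketch leaves out.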