Crossref Scholarly Works Scraper
Search 150M+ scholarly works on Crossref and export DOI, authors, journal, citation count, and abstracts as JSON or CSV. Filter by type and date.
How it works
1. Open it on Apify
   Hit Run on Apify: it opens the tool in the cloud, no install.
2. Set the inputs
   Adjust `query`, `filterType`, `fromDate` (sensible defaults are pre-filled).
3. Click Run
   The tool runs on Apify’s cloud and collects the data for you.
4. Export the results
   Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.
Inputs
| Field | What it does | Type |
|---|---|---|
| `query` | Keywords to search Crossref for across titles, authors, abstracts, and metadata (e.g. "deep learning", "CRISPR gene editing", "climate change adaptation"). | string |
| `filterType` | Only return works of this Crossref type. Leave empty for all types. "journal-article" is the most common for research papers. | string |
| `fromDate` | Only return works published on or after this date, in YYYY-MM-DD format (e.g. 2020-01-01). Leave empty for no date floor. | string |
| `sort` | How to order results. "Relevance" matches the query best; "Most cited" surfaces influential papers; "Newest first" sorts by publication date descending. | string |
| `maxItems` | Maximum number of scholarly works to return. Uses deep cursor pagination to fetch beyond 100 reliably. | integer |
| `notionConnector` | Optional. Write each result as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connector | string |
| `notionParentId` | Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your wo | string |
What you get
A structured dataset — each result includes fields like:
`abstract`, `authors`, `citations`, `doi`, `issn`, `journal`, `publishedDate`, `publisher`, `subjects`, `title`, `type`, `url`
Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.
2 ready-to-run use cases
Most-Cited CRISPR Papers Ranked by Citations | Crossref
Rank CRISPR gene-editing papers by citation count from Crossref's 150M-work index. DOIs, titles, authors, and journals for literature reviews.
Microplastics Literature Search: All Crossref Works
Every microplastics publication on Crossref in one dataset: journal articles, books, datasets, and preprints with DOIs for systematic reviews.
Crossref Scholarly Works Scraper
Search the Crossref catalog of 150M+ scholarly works (journal articles, preprints, books, datasets, and more) via its public REST API — no API key, no login, no anti-bot.
The actor is a polite Crossref client: it identifies itself with a contact User-Agent and a `mailto` query parameter so Crossref routes it to the faster "polite pool", and it uses deep cursor pagination (`cursor=*` → `next-cursor`), which is the reliable way to page through large result sets, since a single request returns at most 1,000 rows and offset-based paging is capped at 10,000.
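The cursor loop described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: `iter_works` and `fetch_page` are hypothetical names, and the User-Agent string and `you@example.org` address are placeholders. The page fetcher is passed in as a parameter so the loop can be exercised without touching the network.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WORKS_URL = "https://api.crossref.org/works"


def fetch_page(params: dict) -> dict:
    """Fetch one page from Crossref (placeholder contact details)."""
    request = Request(
        WORKS_URL + "?" + urlencode(params),
        headers={"User-Agent": "example-client/0.1 (mailto:you@example.org)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_works(
    fetch: Callable[[dict], dict], params: dict, max_items: int
) -> Iterator[dict]:
    """Yield up to max_items works, following next-cursor between pages."""
    cursor = "*"  # Crossref's starting value for deep paging
    yielded = 0
    while yielded < max_items:
        page = fetch({**params, "cursor": cursor})
        message = page["message"]
        items = message.get("items", [])
        if not items:
            return  # an empty page means the result set is exhausted
        for item in items:
            yield item
            yielded += 1
            if yielded >= max_items:
                return
        cursor = message.get("next-cursor")
        if not cursor:
            return
```

A call such as `iter_works(fetch_page, {"query": "deep learning", "rows": 100, "mailto": "you@example.org"}, 250)` would request pages of 100 until 250 works have been yielded or Crossref returns an empty page.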
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string (required) | deep learning | Keywords searched across titles, authors, abstracts and metadata. |
| `filterType` | string | _all_ | Restrict to a Crossref work type, e.g. journal-article. |
| `fromDate` | string YYYY-MM-DD | _none_ | Only works published on/after this date. |
| `sort` | enum | relevance | relevance, is-referenced-by-count (most cited), or published (newest). |
| `maxItems` | integer | 100 | Max works to return (cursor pagination handles >100). |
| `proxyConfiguration` | object | _none_ | Optional and off by default; Crossref is a public, no-key API with no anti-bot, so a proxy adds no benefit. Only enable it if you hit IP-level rate limits. |
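To make the table concrete, here is one plausible way these inputs could map onto Crossref query parameters. The parameter names (`query`, `filter`, `sort`, `order`, `rows`, `cursor`, `mailto`) and the `type:` and `from-pub-date:` filters are part of the Crossref REST API; the function name `build_params`, the page size, and the choice of descending order are assumptions for illustration, not the actor's confirmed behaviour.

```python
from urllib.parse import urlencode

WORKS_URL = "https://api.crossref.org/works"


def build_params(run_input: dict, mailto: str, page_size: int = 100) -> dict:
    """Translate actor-style input into Crossref /works query parameters."""
    filters = []
    if run_input.get("filterType"):
        filters.append(f"type:{run_input['filterType']}")
    if run_input.get("fromDate"):
        filters.append(f"from-pub-date:{run_input['fromDate']}")

    max_items = run_input.get("maxItems", 100)
    params = {
        "query": run_input["query"],
        # Crossref caps rows at 1000 per request.
        "rows": min(page_size, 1000, max_items),
        "cursor": "*",
        "sort": run_input.get("sort", "relevance"),
        "order": "desc",  # assumption: most cited / newest first
        "mailto": mailto,  # placeholder contact for the polite pool
    }
    if filters:
        params["filter"] = ",".join(filters)
    return params


example_input = {
    "query": "deep learning",
    "filterType": "journal-article",
    "fromDate": "2020-01-01",
    "sort": "is-referenced-by-count",
    "maxItems": 250,
}
example_url = WORKS_URL + "?" + urlencode(build_params(example_input, "you@example.org"))
```

Multiple filters are joined with commas in a single `filter` parameter, which is how Crossref expects them.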
Output
Each successful row:

    {
      "ok": true,
      "doi": "10.1038/nature14539",
      "title": "Deep learning",
      "authors": ["Yann LeCun", "Yoshua Bengio", "Geoffrey Hinton"],
      "journal": "Nature",
      "publisher": "Springer Science and Business Media LLC",
      "type": "journal-article",
      "publishedDate": "2015-05-28",
      "citations": 70000,
      "subjects": ["Multidisciplinary"],
      "issn": ["0028-0836", "1476-4687"],
      "abstract": null,
      "url": "https://doi.org/10.1038/nature14539"
    }
- `authors` are formatted "Given Family" (organizational authors fall back to their name).
- `publishedDate` is assembled from Crossref's `date-parts` (may be year-only or year-month for older records).
- `citations` is Crossref's `is-referenced-by-count`.
- `abstract` is the JATS-XML abstract stripped to plain text, or `null` when Crossref has none.
- Nullable fields: `title`, `journal`, `publisher`, `type`, `publishedDate`, `abstract`, and `url` may be `null`, and `authors`, `subjects`, and `issn` may be empty arrays, depending on what the publisher deposited with Crossref. `doi` is always present (rows without a DOI are dropped). `citations` defaults to `0` when absent.
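The field rules above could be implemented along these lines. This is a sketch under assumptions: the helper names are hypothetical, the date is read from Crossref's `issued` field (the actor may use a different date field), and the abstract is stripped with a simple regular expression rather than a full XML parser.

```python
import html
import re


def format_author(author: dict):
    """'Given Family', falling back to the organizational 'name'."""
    parts = [author.get("given"), author.get("family")]
    full = " ".join(p for p in parts if p)
    return full or author.get("name")


def assemble_date(date_parts):
    """Turn [[2015, 5, 28]] into '2015-05-28'; shorter dates stay partial."""
    if not date_parts or not date_parts[0] or date_parts[0][0] is None:
        return None
    year, *rest = date_parts[0]
    return "-".join([f"{year:04d}"] + [f"{p:02d}" for p in rest[:2]])


def strip_jats(abstract):
    """Remove JATS/XML tags and collapse whitespace; None when empty."""
    if not abstract:
        return None
    text = html.unescape(re.sub(r"<[^>]+>", " ", abstract))
    return re.sub(r"\s+", " ", text).strip() or None


def first(values):
    return values[0] if values else None


def to_row(item: dict):
    """Map one Crossref work to an output row; None when it has no DOI."""
    doi = item.get("DOI")
    if not doi:
        return None
    issued = item.get("issued") or {}  # assumption: 'issued' is the date used
    return {
        "ok": True,
        "doi": doi,
        "title": first(item.get("title")),
        "authors": [a for a in map(format_author, item.get("author", [])) if a],
        "journal": first(item.get("container-title")),
        "publisher": item.get("publisher"),
        "type": item.get("type"),
        "publishedDate": assemble_date(issued.get("date-parts")),
        "citations": item.get("is-referenced-by-count") or 0,
        "subjects": item.get("subject") or [],
        "issn": item.get("ISSN") or [],
        "abstract": strip_jats(item.get("abstract")),
        "url": item.get("URL"),
    }
```

Crossref returns `title` and `container-title` as arrays, which is why the sketch takes the first element of each.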
Results are deduplicated by DOI. Charging is per successful work (work event). Diagnostic / empty / blocked rows (ok: false with an errorCode) are never charged — this includes BAD_INPUT (empty query or malformed fromDate), NO_RESULTS, and any network/block error.
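As a rough illustration of the deduplication and charging rules, the sketch below keeps the first row per DOI and counts only `ok: true` rows as chargeable. The function names are hypothetical, and comparing DOIs case-insensitively is an assumption (DOIs are case-insensitive by specification, but the actor's exact comparison is not documented here).

```python
def dedupe_by_doi(rows: list) -> list:
    """Keep the first successful row per DOI; pass diagnostic rows through."""
    seen = set()
    unique = []
    for row in rows:
        if not row.get("ok"):
            unique.append(row)  # diagnostic rows carry no DOI to compare
            continue
        key = row["doi"].lower()  # assumption: case-insensitive match
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


def count_chargeable(rows: list) -> int:
    """Only successful works count; ok: false rows are never charged."""
    return sum(1 for row in rows if row.get("ok"))
```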
Troubleshooting
- `BAD_INPUT` row, no results: you left `query` empty or `fromDate` isn't `YYYY-MM-DD`. Fix the input and re-run; you were not charged.
- `NO_RESULTS` row: your query/filter combination matched nothing in Crossref. Try broader keywords or drop the type/date filters.
- `RATE_LIMITED` / `BLOCKED` row: rare for Crossref. The actor already retries with backoff; if it persists, enable a proxy to use a different IP.
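To avoid a `BAD_INPUT` row, the input can be checked locally before starting a run. The sketch below mirrors the two conditions named above (empty `query`, malformed `fromDate`). The function name and the `message` field are hypothetical; only `ok` and `errorCode` are documented for diagnostic rows.

```python
import re
from datetime import date


def validate_input(run_input: dict):
    """Return a BAD_INPUT diagnostic row, or None when the input looks valid."""

    def bad(message: str) -> dict:
        # 'message' is an illustrative extra field, not a documented one.
        return {"ok": False, "errorCode": "BAD_INPUT", "message": message}

    query = (run_input.get("query") or "").strip()
    if not query:
        return bad("query must not be empty")

    from_date = run_input.get("fromDate")
    if from_date:
        # Require the exact YYYY-MM-DD shape, then check it is a real date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", from_date):
            return bad("fromDate must be YYYY-MM-DD")
        try:
            date.fromisoformat(from_date)
        except ValueError:
            return bad("fromDate is not a valid calendar date")
    return None
```

The shape check runs first because a date parser alone may accept other ISO 8601 spellings that the `YYYY-MM-DD` rule would not.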
Notes
- Powered entirely by the public Crossref REST API (https://api.crossref.org/works). Please be considerate of the shared, free service.
- Citation counts and abstracts depend on what publishers deposit with Crossref; coverage varies by record.