ClinicalTrials.gov – Discovery Tool and Research Data Source

Posted on April 10, 2025 by Dina

As ClinicalTrials.gov celebrates its 25th anniversary, reaches its half-million registered studies milestone, and completes its modernization, it’s a good time to appreciate this invaluable research tool that has been around since 2000. In 2008, NLM launched the ClinicalTrials.gov results database, which now (as of 12/2024) has >70K registered studies posted with results.

Openly available to all with “about 90 thousand visitors per day and 2 million unique visitors every month”, ClinicalTrials.gov is a registry where individuals can identify both ongoing and completed registered trials from “50 States and in 229 countries and territories”.

Functionality added over the last few years has made this database increasingly attractive as a data source for answering research questions. These additions cover both how you can search the database (using Complex Search Queries) and how you can download and use the search results/records from ClinicalTrials.gov.

From: https://clinicaltrials.gov/find-studies

In addition to having search functionality that allows for very precise searching, it is now possible to download search results from ClinicalTrials.gov in the RIS file format, which can be imported into tools such as EndNote (used for citation management) and Covidence (used for managing systematic review projects).

It is important to note that each download format includes a different set of data fields. The RIS download has a fixed set of fields that is not customizable, the CSV download lets a user select fields from a menu of options, and the JSON download can include every available data field for each study being downloaded. The ClinicalTrials.gov API option allows researchers and developers to access the ClinicalTrials.gov database at scale and in an automated way.

From: https://clinicaltrials.gov/find-studies/how-to-use-search-results
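For readers curious about what automated access might look like, here is a minimal Python sketch. It assumes the version 2 API endpoint at https://clinicaltrials.gov/api/v2/studies, the query parameters query.cond, pageSize, pageToken, and format, and a JSON layout with a "studies" list containing "protocolSection" records; please confirm all of these against the current API documentation before relying on them. The function names and the sample record (including its NCT number and title) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for version 2 of the ClinicalTrials.gov API.
API_BASE = "https://clinicaltrials.gov/api/v2/studies"


def build_query_url(condition, page_size=10, page_token=None):
    """Build a search URL for studies matching a condition (assumed parameter names)."""
    params = {"query.cond": condition, "pageSize": page_size, "format": "json"}
    if page_token:
        # Token from a previous page of results, used to request the next page.
        params["pageToken"] = page_token
    return f"{API_BASE}?{urlencode(params)}"


def fetch_studies(condition, page_size=10):
    """Download one page of results. Needs network access; not called in this sketch."""
    with urlopen(build_query_url(condition, page_size), timeout=30) as response:
        return json.load(response)


def summarize_studies(payload):
    """Reduce each study record to its ID, brief title, and overall status."""
    rows = []
    for study in payload.get("studies", []):
        protocol = study.get("protocolSection", {})
        ident = protocol.get("identificationModule", {})
        status = protocol.get("statusModule", {})
        rows.append({
            "nct_id": ident.get("nctId"),
            "title": ident.get("briefTitle"),
            "status": status.get("overallStatus"),
        })
    return rows


# Made-up record that mimics the assumed JSON layout; it is not a real study.
SAMPLE_PAYLOAD = {
    "studies": [
        {
            "protocolSection": {
                "identificationModule": {
                    "nctId": "NCT00000000",
                    "briefTitle": "Illustrative placeholder study",
                },
                "statusModule": {"overallStatus": "COMPLETED"},
            }
        }
    ]
}

print(build_query_url("leukemia"))
print(summarize_studies(SAMPLE_PAYLOAD))
```

In practice you would pass the result of fetch_studies("leukemia") to summarize_studies, and keep requesting pages for as long as the response supplies a token for the next page.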

Examples of research projects that have leveraged ClinicalTrials.gov data:

Alhajahjeh A, Rotter LK, Stempel JM, Grimshaw AA, Bewersdorf JP, Blaha O, Kewan T, Podoltsev NA, Shallis RM, Mendez L, Stahl M, Zeidan AM. Global Disparities in the Characteristics and Outcomes of Leukemia Clinical Trials: A Cross-Sectional Study of the ClinicalTrials.gov Database. JCO Glob Oncol. 2024 Nov;10:e2400316. doi: 10.1200/GO-24-00316. Epub 2024 Dec 2. PMID: 39621951.
Chen D, Parsa R, Chauhan K, Lukovic J, Han K, Taggar A, Raman S. Review of brachytherapy clinical trials: a cross-sectional analysis of ClinicalTrials.gov. Radiat Oncol. 2024 Feb 13;19(1):22. doi: 10.1186/s13014-024-02415-8. PMID: 38351013; PMCID: PMC10863227.
Falade AS, Adeoye O, Van Loon K, Buckle GC. Clinical Trials in Gastroesophageal Cancers: An Analysis of the Global Landscape of Interventional Trials From ClinicalTrials.gov. JCO Glob Oncol. 2024 Aug;10:e2400169. doi: 10.1200/GO.24.00169. PMID: 39173083.
Pearce FJ, Cruz Rivera S, Liu X, Manna E, Denniston AK, Calvert MJ. The role of patient-reported outcome measures in trials of artificial intelligence health technologies: a systematic evaluation of ClinicalTrials.gov records (1997-2022). Lancet Digit Health. 2023 Mar;5(3):e160-e167. doi: 10.1016/S2589-7500(22)00249-7. PMID: 36828608.
Yang A, Baxi S, Korenstein D. ClinicalTrials.gov for Facilitating Rapid Understanding of Potential Harms of New Drugs: The Case of Checkpoint Inhibitors. J Oncol Pract. 2018 Feb;14(2):72-76. doi: 10.1200/JOP.2017.025114. Epub 2018 Jan 3. PMID: 29298113; PMCID: PMC5812307.

Questions? Ask Us at the MSK Library!

Google Dataset Search, a dataset-discovery tool

Posted on March 12, 2025 by Dina

With data sharing increasingly encouraged in academic research, and with more and more datasets being added to data repositories and published on the Web, it makes sense that a Web search company like Google would dedicate resources to developing a Web discovery tool that is optimized for finding datasets.

How does it work?

Google Dataset Search, a dataset-discovery tool, uses Google’s web crawl technology to search for datasets that have been made available on the Web, identifying them based on their metadata (standardized descriptions of the datasets added to them by their owners/publishers). “Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable.”
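To make the idea of metadata-based discovery more concrete, here is a minimal Python sketch of how schema.org Dataset metadata embedded in a Web page as JSON-LD could be read. This is an illustration only and not Google's actual crawler; the class and function names, the sample page, and the dataset it describes are all made up, and it handles only the simple case where "@type" is exactly "Dataset".

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip blocks that are not valid JSON.


def find_datasets(html):
    """Return the JSON-LD items on a page whose @type is 'Dataset'."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


# Made-up page; the dataset name and description are placeholders.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example imaging dataset",
 "description": "Placeholder description written by the dataset owner."}
</script>
</head><body><p>Landing page for the dataset.</p></body></html>
"""

for dataset in find_datasets(SAMPLE_PAGE):
    print(dataset["name"], "-", dataset["description"])
```

The sketch shows why owner-supplied metadata matters: a page with no Dataset description would return an empty list here, and the dataset on it would be much harder for a metadata-based tool to find.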

For an in-depth overview of how Google Dataset Search has been developed – please see:

Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. “Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4 (2024).

How can you search it?

To get started with using Google Dataset Search, go to: Dataset Search at https://datasetsearch.research.google.com/

If you are looking for something specific, you can refine your search results by limiting your search to a particular website domain (for example, site:nih.gov) or by adding terms to your search. You can also filter your results by when the dataset was last updated, by format, by usage rights, by topic/discipline, and by whether the dataset is freely available. Furthermore, you can save your search results, link out to the external source website where you can download the datasets, and easily cite a dataset by copying the citation information that is generated when you click on the citation button (i.e. the quotation mark button).

To learn more – see:

Dataset Search Quick Start Guide –
https://newsinitiative.withgoogle.com/resources/trainings/dataset-search-quickstart-guide/

User Support Center – https://datasetsearch.research.google.com/help

Dataset Developer Page –
https://developers.google.com/search/docs/appearance/structured-data/dataset

How is it being used?

It appears that biomedical researchers have already started using Google Dataset Search in their scholarly projects. Some examples focusing on finding image datasets include:

Abbad Andaloussi M, Maser R, Hertel F, Lamoline F, Husch AD. Exploring adult glioma through MRI: A review of publicly available datasets to guide efficient image analysis. Neurooncol Adv. 2025;7(1):vdae197. Epub 20250128. doi: 10.1093/noajnl/vdae197. PubMed PMID: 39877749; PMCID: PMC11773385.
Rozhyna A, Somfai GM, Atzori M, DeBuc DC, Saad A, Zoellin J, Müller H. Exploring Publicly Accessible Optical Coherence Tomography Datasets: A Comprehensive Overview. Diagnostics (Basel). 2024;14(15). Epub 20240801. doi: 10.3390/diagnostics14151668. PubMed PMID: 39125544; PMCID: PMC11312046.
Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, Zepeda L, de Blas Perez C, Denniston AK, Liu X, Matin RN. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. Epub 20211109. doi: 10.1016/s2589-7500(21)00252-1. PubMed PMID: 34772649.
Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, Keane PA, Sebire NJ, Burton MJ, Denniston AK. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51-e66. Epub 20201001. doi: 10.1016/s2589-7500(20)30240-5. PubMed PMID: 33735069.

Questions? Ask Us at the MSK Library!

Scientific Writing Resources

Posted on February 11, 2025 by Dina

As generative AI tools have become increasingly available to academic researchers, so too have reports of GPT-fabricated scientific papers creeping into the public scholarly record. One example is this 2024 report from the Harvard Kennedy School:

GPT-fabricated scientific papers on Google Scholar: Key features, spread, and implications for preempting evidence manipulation | HKS Misinformation Review

Developing strong scientific writing skills has always been an important component of graduate training in the basic sciences; however, not all scientific authors have the same degree of exposure to writing classes and authorship opportunities. As the burden of recognizing fake papers falls more and more on the readers of scientific works, there couldn’t be a better way to protect yourself against fraudulent articles than by becoming an expert at scientific writing yourself.

Here are some resources to explore if you wish to develop your scientific writing skills:

1) E-books from the MSK Library’s collection and full-text book chapters available online

2) Duke Graduate School Scientific Writing Resource
https://sites.duke.edu/scientificwriting/
“The Scientific Writing Resource is online course material that teaches how to write effectively. The material is not about correctness (grammar, punctuation, etc.), but about communicating what you intend to the reader. It can be used either in a science class or by individuals. It is intended for science students at the graduate level.”

“This guide to scientific writing was originally created in 2010-2011 by Nathan Sheffield for the Duke University Graduate School and funded by a Duke University Graduate School Teaching mini-grant. This current site is maintained by the Duke Graduate School. If you have questions about this site, please contact gradschool@duke.edu.”

The MSK Library also provides access to writing support tools, including:

1) Citation Management tools – https://libguides.mskcc.org/citationmanagement

Find out about a variety of citation management software tools that can save you time when you are formatting your manuscript’s references and bibliography.

2) Trinka AI – https://libguides.mskcc.org/trinka

“Trinka is an AI-powered writing assistant designed for academic and technical writing. Trinka corrects advanced grammar errors and contextual spelling mistakes by providing writing suggestions in real-time. It helps academicians write in a formal, concise, and engaging manner. In addition to correcting grammatical errors, Trinka allows you to paraphrase the text and improve consistency, enabling you to enhance the quality of your writing based on your requirements.”

3) iThenticate – https://libguides.mskcc.org/ithenticate

“iThenticate is a tool for researchers and writers to check their original works for potential plagiarism. This resource will check against 93% of Top Cited Journal content and 70+ billion current and archived web pages.”

Questions? Ask Us at the MSK Library!

MSK Library Blog

Sharing Research, Resources & News

Category Archives: Resource Highlights
