Google Dataset Search, a dataset-discovery tool 

With academic research increasingly encouraging data sharing, and with more datasets being deposited in data repositories and published on the Web, it makes sense that a Web search company like Google would dedicate resources to developing a discovery tool optimized for finding datasets.

How does it work?

Google Dataset Search uses Google’s web crawl technology to find datasets that have been made available on the Web, identifying them by their metadata (standardized descriptions that owners or publishers add to the datasets). “Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable.”
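To make the metadata idea concrete, here is a minimal Python sketch of how schema.org Dataset descriptions embedded in a Web page as JSON-LD could be read. It is an illustration only, not Google's actual crawler: the sample page, the dataset it describes, and the names JsonLdCollector and extract_datasets are all hypothetical, and real pages may express the same metadata in other formats that this sketch ignores.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the schema.org Dataset descriptions found in an HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for block in collector.blocks:
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than fail
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page, invented for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example imaging dataset",
 "description": "A hypothetical dataset used only for illustration.",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body></body></html>
"""

for dataset in extract_datasets(SAMPLE_PAGE):
    print(dataset["name"], "-", dataset["license"])
```

The point for dataset owners is that a tool like this can only find what the page declares: a dataset without this kind of structured description is much harder for a crawler to recognize as a dataset.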

For an in-depth overview of how Google Dataset Search was developed, please see:

Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. “Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4 (2024).

How can you search it?

To get started with Google Dataset Search, go to https://datasetsearch.research.google.com/

If you are looking for something specific, you can refine your results by limiting your search to a particular website domain (for example, site:nih.gov) or by adding terms to your search. You can also filter results by when the dataset was last updated, format, usage rights, topic/discipline, and whether the dataset is freely available. In addition, you can save your search results, link out to the external source website where the datasets can be downloaded, and cite a dataset by copying the citation information generated when you click the citation button (i.e., the quotation mark button).
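If you run the same kind of search often, the domain restriction described above can be assembled programmatically. The short Python sketch below builds a search link from a list of terms and an optional site: restriction. Note the assumptions: the /search path and the query parameter name are based on how result links appear in the browser and are not a documented API, and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed results URL; not a documented or guaranteed interface.
BASE_URL = "https://datasetsearch.research.google.com/search"


def build_search_url(terms, site=None):
    """Combine search terms and an optional site: restriction into one link."""
    query = " ".join(terms)
    if site:
        query += f" site:{site}"
    return f"{BASE_URL}?{urlencode({'query': query})}"


print(build_search_url(["glioma", "MRI"], site="nih.gov"))
```

The filters for update date, format, usage rights, and topic are left out here; it is safer to apply them in the search interface itself.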

To learn more – see:

Dataset Search Quick Start Guide –
https://newsinitiative.withgoogle.com/resources/trainings/dataset-search-quickstart-guide/

User Support Center – https://datasetsearch.research.google.com/help

Dataset Developer Page –
https://developers.google.com/search/docs/appearance/structured-data/dataset

How is it being used?

It appears that biomedical researchers have already started using Google Dataset Search in their scholarly projects. Some examples focusing on finding image datasets include:

  1. Abbad Andaloussi M, Maser R, Hertel F, Lamoline F, Husch AD. Exploring adult glioma through MRI: A review of publicly available datasets to guide efficient image analysis. Neurooncol Adv. 2025;7(1):vdae197. Epub 20250128. doi: 10.1093/noajnl/vdae197. PubMed PMID: 39877749; PMCID: PMC11773385.

  2. Rozhyna A, Somfai GM, Atzori M, DeBuc DC, Saad A, Zoellin J, Müller H. Exploring Publicly Accessible Optical Coherence Tomography Datasets: A Comprehensive Overview. Diagnostics (Basel). 2024;14(15). Epub 20240801. doi: 10.3390/diagnostics14151668. PubMed PMID: 39125544; PMCID: PMC11312046.

  3. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, Zepeda L, de Blas Perez C, Denniston AK, Liu X, Matin RN. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. Epub 20211109. doi: 10.1016/s2589-7500(21)00252-1. PubMed PMID: 34772649.

  4. Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, Keane PA, Sebire NJ, Burton MJ, Denniston AK. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51-e66. Epub 20201001. doi: 10.1016/s2589-7500(20)30240-5. PubMed PMID: 33735069.

Questions? Ask Us at the MSK Library!

Availability of Removed Federal Data

As you may have read in the news or experienced while looking for government data and websites, recent federal government mandates have led to the removal of information from federal websites.

A 2/5/25 screenshot of the Youth Risk Behavior Surveillance System website that includes the message, "CDC’s website is being modified to comply with President Trump’s Executive Orders."

Several sources have worked to preserve deleted information:

General
End-of-Term Project
This project has been in existence since the 2008 administration change.
GovDiff.com
A tool to compare government websites before and after January 20, 2025.

Climate
Climate and Economic Justice Screening Tool (Council on Environmental Quality, Executive Office of the President, copy)
Environmental Justice Index (CDC, 2022 and 2024 data) – Does not work on the VPN
Environmental Justice Scorecard (EPA, copy)
Sea level data (NOAA)

Health
Social Vulnerability Index (CDC, 2022 data) – Does not work on the VPN
Youth Risk Behavior Surveillance System (YRBSS) National Datasets (CDC, 1991-2021 data)
Office of Research on Women’s Health website (NIH, copy)
Additional CDC and NIH data
CDC data is also available for a fee through PolicyMap

The Healthy Regions & Policies (HeRoP) Lab at the University of Illinois is saving datasets from the CDC, EPA, Health Resources and Services Administration (HRSA), and more relating to social and structural determinants of health.

Harvard Law School Library’s Innovation Lab is working on a vault for government data, which should be made available soon.

Read more from 404 Media, The Journalist’s Resource, and Stat News. And follow the International Association for Social Science Information Service and Technology’s (IASSIST) Google Doc for a constantly updated list of resources.

New MSK Resource – Data Policy Finder: Search. Verify. Plan.

Getting ready to publish your research? Still in the early stages of research but already thinking about where you will publish? About to undertake research and thinking ahead?

Navigating publication requirements for data, code, and other supporting research output can be tricky, but the Library is here to help!

The MSK Library has launched a new homegrown resource, the Data Policy Finder.

The Data Policy Finder is a searchable database containing information and links for data, code, and materials policies. Search for data sharing and management policies by publisher or publication. Library staff review the policies and curate the database with links, direct quotes from the relevant sections of the policies, and notes to help you search, verify, and plan for your publication data requirements.

For any questions regarding this new resource or to schedule a demo, please reach out to Anthony Dellureficio, Associate Librarian, Research Data Management.