Google Dataset Search, a dataset-discovery tool 

With data sharing increasingly encouraged in academic research, and with more datasets being deposited in data repositories and published on the Web, it makes sense that a Web search company like Google would dedicate resources to developing a discovery tool optimized for finding datasets.

How does it work?

Google Dataset Search uses Google’s web crawl technology to search for datasets that have been made available on the Web, identifying them based on their metadata (standardized descriptions of the datasets added by their owners/publishers). “Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable.”
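
To make the idea of metadata extraction more concrete, here is a minimal Python sketch of how a schema.org Dataset description embedded in a Web page could be read. It is an illustration only and not Google’s actual crawler: it assumes the publisher has embedded the description as JSON-LD in a script tag, and the sample page, the dataset name, and the helper names (JsonLdCollector, extract_dataset_metadata) are all made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_dataset_metadata(html):
    """Return the schema.org Dataset descriptions found in an HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for block in collector.blocks:
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than failing
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


# Hypothetical landing page for a dataset (all values are invented).
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example MRI image collection",
  "description": "A made-up dataset description used only for illustration.",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
</script>
</head><body>Landing page text</body></html>
"""

for record in extract_dataset_metadata(SAMPLE_PAGE):
    print(record["name"], "|", record["license"])
```

The practical takeaway for dataset owners is that a dataset is only discoverable this way if its landing page carries such a description; pages without structured metadata yield nothing in this sketch.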

For an in-depth overview of how Google Dataset Search was developed, please see:

Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. “Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4 (2024).

How can you search it?

To get started with Google Dataset Search, go to https://datasetsearch.research.google.com/

If you are looking for something specific, you can refine your search results by limiting your search to a particular website domain (for example, site:nih.gov) or by adding terms to your search. You can also filter your results by when the dataset was last updated, by format, by usage rights, by topic/discipline, and by whether the dataset is freely available. Furthermore, you can save your search results, link out to the external source website where the datasets can be downloaded, and easily cite a dataset by copying the citation information that is generated when you click on the citation button (i.e., the quotation mark button).
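
If you want to share or bookmark a refined search, the keywords and the site: restriction can be combined into a single link. The short Python sketch below shows one way to do that. It assumes that the results page accepts the search text in a "query" URL parameter, which is not an officially documented interface, and the function name build_search_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: the results page reads the search text from a "query" parameter.
BASE_URL = "https://datasetsearch.research.google.com/search"


def build_search_url(terms, site=None):
    """Combine keywords and an optional site: restriction into one search link."""
    query = " ".join(terms)
    if site:
        query = f"{query} site:{site}"
    # urlencode percent-encodes spaces and the colon so the link is safe to share.
    return f"{BASE_URL}?{urlencode({'query': query})}"


print(build_search_url(["glioma", "MRI"], site="nih.gov"))
```

The other filters mentioned above (last updated, format, usage rights, topic) are easiest to set from the search page itself.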

To learn more – see:

Dataset Search Quick Start Guide –
https://newsinitiative.withgoogle.com/resources/trainings/dataset-search-quickstart-guide/

User Support Center – https://datasetsearch.research.google.com/help

Dataset Developer Page –
https://developers.google.com/search/docs/appearance/structured-data/dataset

How is it being used?

It appears that biomedical researchers have already started using Google Dataset Search in their scholarly projects. Some examples focusing on finding image datasets include:

  1. Abbad Andaloussi M, Maser R, Hertel F, Lamoline F, Husch AD. Exploring adult glioma through MRI: A review of publicly available datasets to guide efficient image analysis. Neurooncol Adv. 2025;7(1):vdae197. Epub 20250128. doi: 10.1093/noajnl/vdae197. PubMed PMID: 39877749; PMCID: PMC11773385.

  2. Rozhyna A, Somfai GM, Atzori M, DeBuc DC, Saad A, Zoellin J, Müller H. Exploring Publicly Accessible Optical Coherence Tomography Datasets: A Comprehensive Overview. Diagnostics (Basel). 2024;14(15). Epub 20240801. doi: 10.3390/diagnostics14151668. PubMed PMID: 39125544; PMCID: PMC11312046.

  3. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, Zepeda L, de Blas Perez C, Denniston AK, Liu X, Matin RN. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. Epub 20211109. doi: 10.1016/s2589-7500(21)00252-1. PubMed PMID: 34772649.

  4. Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, Keane PA, Sebire NJ, Burton MJ, Denniston AK. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51-e66. Epub 20201001. doi: 10.1016/s2589-7500(20)30240-5. PubMed PMID: 33735069.

Questions? Ask Us at the MSK Library!

Availability of Removed Federal Data

As you may have read in the news or experienced while looking for government data and websites, recent federal government mandates have led to the removal of information from government websites.

A 2/5/25 screenshot of the Youth Risk Behavior Surveillance System website that includes the message, "CDC’s website is being modified to comply with President Trump’s Executive Orders."

Several sources have worked to preserve deleted information:

General
End-of-Term Project
This project has been in existence since the 2008 administration change.
GovDiff.com
A tool to compare government websites before and after January 20, 2025.

Climate
Climate and Economic Justice Screening Tool (Council on Environmental Quality, Executive Office of the President, copy)
Environmental Justice Index (CDC, 2022 and 2024 data) – Does not work on the VPN
Environmental Justice Scorecard (EPA, copy)
Sea level data (NOAA)

Health
Social Vulnerability Index (CDC, 2022 data) – Does not work on the VPN
Youth Risk Behavior Surveillance System (YRBSS) National Datasets (CDC, 1991-2021 data)
Office of Research on Women’s Health website (NIH, copy)  
Additional CDC and NIH data
CDC data is also available for a fee through PolicyMap

The Healthy Regions & Policies (HeRoP) Lab at the University of Illinois is saving datasets from the CDC, EPA, Health Resources and Services Administration (HRSA), and more relating to social and structural determinants of health.

Harvard Law School Library’s Innovation Lab is working on a vault for government data, which should be made available soon.

Read more from 404 Media, The Journalist’s Resource, and Stat News. And follow the International Association for Social Science Information Service and Technology’s (IASSIST) Google Doc for a constantly updated list of resources.

NIH All of Us Researcher Workbench – Data Browser

The NIH All of Us Research Program is “part of an effort to advance individualized health care by enrolling one million or more participants to contribute their health data over many years”.

All of Us data is derived from various sources, including surveys, shared electronic health records, and much more. This collected data is housed in the All of Us Research Hub, which uses a tiered-data access model that includes a Public Tier dataset that “displays high-level summaries of the data available for research. Through the Data Browser, one can explore anonymized, aggregated participant data and summary statistics.”

As Memorial Sloan Kettering Cancer Center is listed as a registered institution with a Data Use and Registration Agreement (DURA) in place, MSK researchers can register for an account if they wish to gain access to the curated datasets beyond the Public Tier dataset.
Note: Authorized users of the All of Us data are expected to conduct research that follows and conforms to the All of Us Research Program data use policies.

The interactive, public Data Browser is a great place to learn about the type and quantity of data that All of Us collects so that interested researchers can start thinking about potential research questions that this data could help answer. Here’s a glimpse at what it looks like – from https://databrowser.researchallofus.org:

The Data Browser can be searched using keywords across all data types, or users can choose to click on the browsable tiles to explore a particular data type or source. 

From: https://databrowser.researchallofus.org/survey/social-determinants-of-health

For example, the Social Determinants of Health tile will lead users to more specific information, including a view of the survey questions themselves, each presented with a link to “See Answers” that leads to a breakdown of the aggregated participant answers.

To learn more about the NIH All of Us Researcher Workbench and to get an idea of how other researchers are already using this data, please check out the following resources:

…or Ask Us at the MSK Library!