Google Dataset Search, a dataset-discovery tool 

As academic research increasingly encourages data sharing, and as more datasets are deposited in data repositories and published on the Web, it makes sense that a Web search company like Google would dedicate resources to developing a discovery tool that is optimized for finding datasets.

How does it work?

Google Dataset Search uses Google’s web crawl technology to find datasets that have been made available on the Web, identifying them by their metadata (standardized descriptions that owners or publishers add to their datasets). “Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable.”
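To make the idea of page-embedded metadata concrete, here is a minimal Python sketch of how schema.org Dataset descriptions can be read from a Web page. It assumes the metadata is embedded as JSON-LD inside a script tag, which is one common way to publish schema.org markup. It is not Google's actual crawler. The sample page, the dataset it describes, and the names JsonLdExtractor and find_dataset_metadata are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_dataset_metadata(html):
    """Return the JSON-LD objects on the page whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata rather than failing
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page, used only to illustrate the markup.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example imaging dataset",
  "description": "A hypothetical dataset used only to illustrate the markup.",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
</script>
</head><body></body></html>
"""

for dataset in find_dataset_metadata(SAMPLE_PAGE):
    print(dataset["name"], "-", dataset["license"])
```

Fields such as name, description, and license are the kind of information a dataset owner supplies in the metadata so that a discovery tool can index and display the dataset.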

For an in-depth overview of how Google Dataset Search was developed, see:

Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. “Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4 (2024).

How can you search it?

To get started with Google Dataset Search, go to https://datasetsearch.research.google.com/

If you are looking for something specific, you can refine your results by limiting the search to a particular website domain (for example, site:nih.gov) or by adding terms to your query. You can also filter results by when the dataset was last updated, by format, by usage rights, by topic or discipline, and by whether the dataset is freely available. In addition, you can save your search results, link out to the external source website where the datasets can be downloaded, and cite a dataset by copying the citation information that is generated when you click the citation button (the quotation mark button).
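For those who like to bookmark or share a search, the domain restriction described above can also be written into a link. The short Python sketch below builds one. It assumes that the results page lives at /search and accepts the search text in a URL parameter named query. That matches how the site's result links appeared at the time of writing, but it is not a documented API and may change. The function name build_search_url and the search terms are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the results page takes the search text in a "query" parameter.
BASE_URL = "https://datasetsearch.research.google.com/search"


def build_search_url(terms, site=None):
    """Build a Dataset Search link, optionally limited to one website domain."""
    query = terms.strip()
    if site:
        # The site: operator limits results to the given domain.
        query = f"{query} site:{site}"
    return f"{BASE_URL}?{urlencode({'query': query})}"


print(build_search_url("glioma MRI", site="nih.gov"))
# prints https://datasetsearch.research.google.com/search?query=glioma+MRI+site%3Anih.gov
```

The filters for update date, format, usage rights, and topic are best applied from the results page itself, since their URL parameters are not covered here.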

To learn more – see:

Dataset Search Quick Start Guide –
https://newsinitiative.withgoogle.com/resources/trainings/dataset-search-quickstart-guide/

User Support Center – https://datasetsearch.research.google.com/help

Dataset Developer Page –
https://developers.google.com/search/docs/appearance/structured-data/dataset

How is it being used?

It appears that biomedical researchers have already started using Google Dataset Search in their scholarly projects. Some examples focusing on finding image datasets include:

  1. Abbad Andaloussi M, Maser R, Hertel F, Lamoline F, Husch AD. Exploring adult glioma through MRI: A review of publicly available datasets to guide efficient image analysis. Neurooncol Adv. 2025;7(1):vdae197. Epub 20250128. doi: 10.1093/noajnl/vdae197. PubMed PMID: 39877749; PMCID: PMC11773385.

  2. Rozhyna A, Somfai GM, Atzori M, DeBuc DC, Saad A, Zoellin J, Müller H. Exploring Publicly Accessible Optical Coherence Tomography Datasets: A Comprehensive Overview. Diagnostics (Basel). 2024;14(15). Epub 20240801. doi: 10.3390/diagnostics14151668. PubMed PMID: 39125544; PMCID: PMC11312046.

  3. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, Zepeda L, de Blas Perez C, Denniston AK, Liu X, Matin RN. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. Epub 20211109. doi: 10.1016/s2589-7500(21)00252-1. PubMed PMID: 34772649.

  4. Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, Keane PA, Sebire NJ, Burton MJ, Denniston AK. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51-e66. Epub 20201001. doi: 10.1016/s2589-7500(20)30240-5. PubMed PMID: 33735069.

Questions? Ask Us at the MSK Library!

Available Upon Request: Towards Meaningful Data Discoverability Webinar

How do we enable data discoverability, linkage, and re-use? Join us for presentations that will answer this question as we explore the importance of FAIR research data to foster transparency, reproducibility, and research integrity.

Date: Wednesday, March 31
Time: 12:00 PM – 1:30 PM
Location: Zoom Webinar – REGISTER NOW

Elsevier, an early adopter and co-author of the FAIR Data and Data Citation principles, has rolled out data sharing policies across its journals. They will provide insights on implementing the infrastructure that helps authors comply with journal and funder data sharing mandates, and will discuss other resources for authors to manage their FAIR data and code.

The MSK Library will highlight their efforts in launching a new service focused on Research Data Management, including collaboration with internal and external stakeholders. They will demonstrate how this initiative facilitates best practices and enhances data discoverability at MSK. Data Stewardship & Integration will cover how this service fits into the overall management of data across the institution.

Speakers/Panelists:

Sarah Callaghan, PhD, Editor-in-Chief, Patterns, Cell Press, Elsevier
Sarah comes to Patterns from a 20-year career in creating, managing, and analyzing scientific data. Her research started as a combination of radio propagation engineering and meteorological modeling, then moved into data citation and publication, visualization, metadata, and data management for the environmental sciences. She was editor-in-chief of Data Science Journal for 4 years and has more than 100 publications. Her personal experience means she understands the frustrations that researchers can have with data. She believes that Patterns will bring together multidisciplinary groups to share knowledge and solutions to data-related problems, regardless of the original domain, for the benefit of humanity and the world.

Marina Soares E. Silva, PhD, Product Manager, RDM/Mendeley Data, Elsevier
Marina is the product manager responsible for Data and Code Linking in the context of article submission at Elsevier and contributes to internal and external initiatives on article-data linking. Marina has managed several partner relationships with universities testing the Mendeley Data repository with a focus on user research. Additionally, Marina was the Product Manager responsible for delivering a Research Object Composer that enables the publication of complex FAIR data objects in the cloud. This was work in the context of the NHLBI Data Stage project and a joint partnership with researchers at the University of Manchester and with Seven Bridges Genomics. Marina started as a Biology undergrad in Portugal and moved to the Netherlands to complete a PhD in experimental biophysics at the AMOLF Institute. In 2013, after one year as a postdoctoral researcher in Developmental Biology at the MSK/Sloan Kettering Institute, Marina joined Elsevier as Publisher. In this role she focused on improvements to the Peer Review process of the Biomaterials and Nanomaterials portfolio and launched the journal Materials Today Nano.

Anthony Dellureficio, MLS, MSc, Associate Librarian, Research Data Management, MSK
Anthony joined the organization in 2019 to help develop and launch this new service in support of our researchers’ workflow by introducing them to resources that focus on data management plan creation, data discovery, and data reuse. As the service continues to expand, collaboration with researchers, data science industries, and library colleagues will be key to ensure that the Library offers the right data management services. Anthony previously worked as the digital archivist at Cold Spring Harbor Laboratory, rare medical text cataloger at the Johns Hopkins Institute of the History of Medicine, and archivist at the Johns Hopkins Medical archives. Most recently he led the Library and Archives systems team at The New School for about ten years. His academic area of interest is in the history of genetics, and he is a regular reviewer for the Quarterly Review of Biology.

Theodora Bakker, MS, Director, Data Stewardship & Integration, MSK
Theodora has been at MSK for two and a half years as the head of Data Stewardship and Integration, and in 2020 also became the Product Manager for the Unified Data Fabric, an initiative to bring together high quality, standardized data across the MSK missions. Theodora has spent almost 20 years in academic medicine, including time as a researcher and as an information specialist, and has spent the last 9 years leading data stewardship. She has worked on standardizing and integrating research and clinical data to foster better data sharing for clinical, education, and research purposes, and partners with the MSK Library to achieve that goal throughout cancer care.

@Covidence Added to Library Resources

The Library has recently added Covidence to the collection. Covidence, a Cochrane technology platform, is a web-based systematic review tool designed to facilitate the process of screening, data extraction, and analysis of published literature. This resource has been integrated into the MSK Library’s Systematic Review service.

Users interested in this resource can create their personal sign-in information with Covidence before or after joining the institutional subscription. To request access to the institutional account in Covidence, use your current MSK email address. If you have already joined the MSK Library’s Covidence account, you can log in to Covidence with your email and password and start using this reference review management tool immediately.

Covidence may be found in our Database A-Z or by searching OneSearch.