Keywords: LLM, PDFs, scientific publications, air pollution, health
Project partners: Meltem Kutlar Joss, Simona Dobre, Aurelio di Pascuale
CeDA collaborators: Rodrigo C. G. Pena, Konstantinos Ntemos
Repository: ludok-tools

Context

The LUDOK project, hosted at the Swiss Tropical and Public Health Institute (Swiss TPH), catalogues the literature published worldwide on air pollution and health on behalf of the Federal Office for the Environment. Its database contains over 10'000 peer-reviewed works, and the team produces newsletters on current topics, conducts literature searches, and writes reports.

The LUDOK workflow involves four main stages:

  1. Literature search and selection
  2. Study data extraction
  3. Broad overview
  4. Synthesis

We partnered with the LUDOK team to make stage 2 (study data extraction) easier using Large Language Models (LLMs). The plan is to adapt and fine-tune LLMs on PDFs and metadata from the LUDOK database to produce structured summaries of the studies, with constraints on certain keywords and other critical elements.
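
As a rough illustration of what a structured summary with keyword constraints might look like, the sketch below defines a record type and a validation check. Everything in it is an assumption made for illustration: the field names, the `StudySummary` and `validate_summary` names, and the small pollutant vocabulary are hypothetical and do not reflect the actual LUDOK schema or the contents of the ludok-tools repository.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; the real LUDOK keyword list is not shown here.
ALLOWED_POLLUTANTS = {"PM2.5", "PM10", "NO2", "O3"}


@dataclass
class StudySummary:
    """Illustrative structured summary of one study (field names are assumptions)."""

    title: str
    study_design: str
    pollutants: list[str] = field(default_factory=list)
    summary: str = ""


def validate_summary(record: StudySummary) -> list[str]:
    """Return constraint violations; an empty list means the record passes."""
    problems = []
    if not record.title.strip():
        problems.append("missing title")
    # Every keyword must come from the controlled vocabulary.
    for pollutant in record.pollutants:
        if pollutant not in ALLOWED_POLLUTANTS:
            problems.append(f"keyword not in controlled vocabulary: {pollutant}")
    return problems


if __name__ == "__main__":
    record = StudySummary(
        title="Example study",
        study_design="cohort",
        pollutants=["PM2.5", "soot"],
    )
    print(validate_summary(record))
```

A check of this kind could run on each LLM output before it is stored, so that records with free-form keywords are flagged for review rather than silently accepted.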

Project objectives

Set up a pipeline for automatic data extraction and summarization of scientific studies in the LUDOK database.