Dataset

StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos

StandUp4AI is a new multimodal dataset for humor detection, featuring over 330 hours of stand-up comedy in seven languages. This dataset surpasses existing ones in size and diversity, and experiments show it is a valuable resource for training models, with potential for improved performance when combined with enhanced audio speech recognition methods.

Aug 5, 2025

(Preprint) Scrapping The Web For Early Wildfire Detection: A New Annotated Dataset of Images and Videos of Smoke Plumes In-the-wild

PyroNear-2024 is a new dataset for smoke plume detection, featuring 150,000 annotations on 50,000 images and videos of 400 wildfires from France, Spain, and the US. This dataset surpasses existing ones in size and diversity, and experiments show it's a challenging but valuable resource for training models, with potential for improved performance when combined with other datasets.

Oct 1, 2024

The Touché23-ValueEval Dataset for Identifying Human Values behind Arguments

We've created the Touché23-ValueEval dataset, a large collection of over 9,300 arguments annotated with 54 human values, to help develop methods for analyzing the values that make arguments persuasive. Our dataset, which more than doubles the size of its predecessor, has already been used to achieve state-of-the-art results in identifying human values behind arguments, and has shown promising performance with large language models like Llama-2-7B.

May 1, 2024

CoFE: A New Dataset of Intra-Multilingual Multi-target Stance Classification from an Online European Participatory Democracy Platform

CoFE is a new dataset for stance recognition built from the participatory democracy platform of the Conference on the Future of Europe. Because the platform used machine translation, users interact in their own (different) native languages within the same thread, which makes the interactions highly multilingual.

Nov 1, 2022

Opinions in Interactions : New Annotations of the SEMAINE Database

We've added new opinion annotations to the SEMAINE dataset, which captures dyadic interactions between humans and virtual agents, resulting in a rich dataset with over 73,000 words and 6 hours of conversation. Our proposed baseline model using RoBERTa embeddings achieves promising results on these annotations, with an F1-score of 0.72, making the dataset a valuable resource for opinion detection in human-computer interactions.

Jun 1, 2022

Debating Europe: A Multilingual Multi-Target Stance Classification Dataset of Online Debates

We've created a new dataset of 2,600 online debate comments to improve stance classification models. Fine-tuning and semi-supervised learning can boost accuracy by 3.4% over a baseline model.

Jun 1, 2022