Data on NLP Research

The following list shows corpora and datasets close to the core research topics of the NLP group. More data can be found on the web page of our associated group, Webis.

Applications and services

2020

Argument Snippets Dataset
A dataset of 100 arguments retrieved from args.me for the most frequently queried topics, with snippets extracted by two human experts (Alshomary et al., 2020b).  [163kb]

2019

Args.me corpus
The args.me corpus comprises 387,606 arguments crawled from four debate portals in mid-2019: Debatewise, IDebate.org, Debatepedia, and Debate.org. The arguments were extracted using heuristics designed for each debate portal.  [849mb]
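The corpus is distributed as structured data; a minimal sketch of working with it might look as follows, assuming a top-level "arguments" list in which each entry carries an id, a conclusion, and stance-labeled premises. All field names here are illustrative assumptions, not the corpus's documented schema.

```python
import json

# Illustrative sample mimicking an assumed args.me-style JSON layout.
sample = json.loads("""
{
  "arguments": [
    {
      "id": "arg-1",
      "conclusion": "We should subsidize public transport.",
      "premises": [
        {"text": "It reduces emissions.", "stance": "PRO"}
      ]
    },
    {
      "id": "arg-2",
      "conclusion": "We should subsidize public transport.",
      "premises": [
        {"text": "It burdens taxpayers.", "stance": "CON"}
      ]
    }
  ]
}
""")

def premises_by_stance(corpus, stance):
    """Collect premise texts carrying the given stance label."""
    return [p["text"]
            for arg in corpus["arguments"]
            for p in arg["premises"]
            if p["stance"] == stance]

print(premises_by_stance(sample, "PRO"))  # ['It reduces emissions.']
```

For the real corpus, one would replace the inline sample with `json.load` over the downloaded file and adjust the field names to the actual schema.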


2018

Arg-Microtexts Synthesis Benchmark
130 logos-oriented and 130 pathos-oriented benchmark arguments for 10 topics, manually synthesized by 26 experts based on a pool of argumentative discourse units from the Arg-Microtexts corpus. [4mb]

In case you publish any results related to this benchmark data, please cite our upcoming COLING 2018 paper on argumentation synthesis. [bib]

The ArguAna Counterargs Corpus
An English corpus for studying the retrieval of the best counterargument to an argument. It contains 6,753 pairs of an argument and its best counterargument from the online debate portal idebate.org, along with experiment files containing up to millions of candidate pairs. [106mb]

In case you publish any results related to the ArguAna Counterargs corpus, please cite our upcoming ACL 2018 paper on counterarguments. [bib]


2017

The Dagstuhl-15512 ArgQuality Corpus
An English corpus for studying the assessment of argumentation quality. It contains 320 online debate portal arguments, annotated for 15 different quality dimensions by three annotators. [zip v1 1mb] [zip v2 1mb]

In version 2, the annotated XMI files have been changed according to a new underlying type system in which each quality dimension is represented by its own annotation. This annotation contains not only the majority score of the respective dimension (as in version 1), but also the mean score and the scores of all annotators. We recommend using version 2.
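Since XMI is XML-based, per-dimension annotations of this kind can be read with a standard XML parser. The sketch below parses an illustrative fragment in the spirit of the version-2 type system; the tag and attribute names are assumptions for demonstration, not the corpus's actual type system.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: one element per quality dimension, carrying the
# majority score, the mean score, and all annotator scores. Names are
# assumptions, not the real Dagstuhl-15512 type system.
xmi = """
<annotations>
  <QualityDimension name="cogency" majorityScore="2" meanScore="2.33"
                    annotatorScores="2 3 2"/>
  <QualityDimension name="effectiveness" majorityScore="1" meanScore="1.67"
                    annotatorScores="1 2 2"/>
</annotations>
"""

root = ET.fromstring(xmi)
# Build a dict mapping each dimension to its three kinds of scores.
scores = {
    dim.get("name"): {
        "majority": int(dim.get("majorityScore")),
        "mean": float(dim.get("meanScore")),
        "annotators": [int(s) for s in dim.get("annotatorScores").split()],
    }
    for dim in root.iter("QualityDimension")
}
print(scores["cogency"]["annotators"])  # [2, 3, 2]
```

In practice one would use a UIMA-aware reader (e.g. a CAS library) against the corpus's real type system rather than raw XML parsing.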

In case you publish any results related to the Dagstuhl-15512 ArgQuality corpus, please cite our EACL 2017 paper on argumentation quality. [pdf] [bib].

The Webis-ArgRank-17 Dataset
An English benchmark dataset for studying argument relevance. It contains 32 rankings as well as a ground-truth argument graph with more than 30,000 argument units. In addition, we provide the source code to reproduce our ranking experiments based on the dataset. [zip 13mb]

In case you publish any results related to the Webis-ArgRank-17 dataset, please cite our EACL 2017 paper on argument relevance. [pdf] [bib]


2016

The Webis-Editorials-16 Corpus
An English corpus with 300 news editorials from three online news portals, annotated for the types of all argumentative discourse units. [zip 5mb]

In case you publish any results related to the Webis-Editorials-16 corpus, please cite our COLING 2016 paper on argumentation strategies. [pdf] [bib]


2014

The ArguAna TripAdvisor Corpus
An English corpus for studying local sentiment flows and aspect-based sentiment analysis. It contains 2,100 hotel reviews balanced with respect to the reviews’ sentiment scores. All reviews are segmented into subsentence-level statements, each manually classified as a fact, a positive opinion, or a negative opinion. Also, all hotel aspects mentioned in the reviews have been annotated as such. [zip v1 with software 10mb] [zip v2 8mb]

In addition, we provide nearly 200k further hotel reviews without manual annotations. [v1 upon request] [zip v2 265mb]

The corpus is free to use for scientific purposes, but not for commercial applications. In version 2, the annotated XMI files have been changed according to a new underlying type system that is more easily extensible. Note that some adaptations of the version 1 software are necessary to make it work with version 2.
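The statement-level labels lend themselves to a simple numeric reading of a review's local sentiment flow. The sketch below maps an illustrative labeled review to such a flow; the example statements and label strings are invented for demonstration, not taken from the corpus files.

```python
# A review segmented into subsentence-level statements, each labeled as a
# fact, a positive opinion, or a negative opinion (illustrative data).
statements = [
    ("The hotel is in the city center.", "fact"),
    ("The staff was wonderful.", "positive"),
    ("The room was tiny.", "negative"),
    ("Breakfast was great.", "positive"),
]

# Map labels to a numeric local sentiment flow: opinions become +1/-1,
# facts contribute a neutral 0.
polarity = {"fact": 0, "positive": 1, "negative": -1}
flow = [polarity[label] for _, label in statements]
print(flow)  # [0, 1, -1, 1]
```

Flows like this can then be compared across reviews, e.g. to study how sentiment is typically ordered within a review.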

In case you publish any results related to the ArguAna TripAdvisor corpus, please cite our CICLing 2014 paper. [pdf] [bib]

Source code and feature files

COLING 2022. A Dialogue Corpus for Learning to Construct Explanations: Data and code for our explanation dialogues corpus (github link)

ACL 2020b. Extractive Snippet Generation for Arguments: Python implementation for our extractive snippet generation for arguments (github link).

ACL 2020a. Target Inference in Conclusion Generation: Python code to reproduce the results of our paper (github link).

ACL 2018a. Counterargument retrieval: Java source code of the experiments on retrieving the best counterargument to an argument without prior topic knowledge. [zip 302 mb]

EMNLP 2017b. Modeling overall argumentation with tree kernels: Java source code of the experiments on the impact of modeling overall argumentation with tree kernels on argumentation-related analysis tasks. [zip 98 mb]

ArgMining 2017b. Argument search framework: The Java source code of an argument search framework built upon Apache UIMA and Apache Lucene. The argument search engine args relies on this framework. [to come mid 2018]

ACL 2017. Argumentation quality assessment: Java source code of the experiments and results of the annotation study on argumentation quality assessment in theory and practice. [zip 2mb]

COLING 2016b. Argumentation-related essay scoring: Java source code and Weka ARFF feature files of the experiments on argumentation-related essay scoring. [zip 244mb]

EMNLP 2015. Sentiment flows: Java source code and Weka ARFF feature files of the experiments on the generality and domain robustness of sentiment flow as a model of web review argumentation. [zip 18mb]

COLING 2014. Sentiment flow patterns: Java source code and Weka ARFF feature files of the experiments on our approach to domain-robust sentiment analysis, which is based on a shallow model of review argumentation. [zip 47mb]