Resources

One of my research interests is to create new resources that can be used for research in linguistics and NLP. Here you can find some of them.

If you use any of these resources in your research, please cite the corresponding description paper, available as a pdf next to each resource.

Corpora and Datasets

  • Colonia: Corpus of Historical Portuguese Colonia Website Colonia at Linguateca Colonia at CorpusEye Colonia at Kaggle pdf pdf
    A Portuguese historical corpus containing texts from the 16th to the early 20th century, lemmatized and annotated with POS tags. The corpus is available for download and through a graphical CQPWeb-based interface. Since May 2014, thanks to Diana Santos (University of Oslo), Colonia has also been available at Linguateca. Since October 2014, thanks to Eckhard Bick (University of Southern Denmark), a version of Colonia tagged using the PALAVRAS parsing system has been available through CorpusEye. Since August 2017, thanks to Rachael Tatman, Colonia has been available at Kaggle.

  • CompLex: A Corpus for Lexical Complexity Prediction from Likert Scale Data CompLex pdf
    CompLex is an English multi-domain corpus compiled for lexical complexity prediction and annotated using a five-point Likert scale. It was the official dataset of the Lexical Complexity Prediction (LCP) shared task at SemEval 2021.

  • DSL Corpus Collection (DSLCC) DSLCC pdf
    A collection of journalistic corpora written in closely related languages and language varieties. The dataset was used in the DSL Shared Tasks in 2014, 2015, 2016, and 2017.

  • LIdioms: A Multilingual Linked Idioms Data Set in Five Different Languages LIdioms pdf
    A multilingual linked idioms data set in five different languages (English, Portuguese, Italian, German, Russian). It is currently being expanded to other languages.

  • NLI-PT: A Portuguese Native Language Identification Dataset NLI-PT pdf
    A collection of 1,868 student essays written by learners of European Portuguese who are native speakers of the following languages (L1s): Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish.

  • Offensive Greek Tweet Dataset (OGTD) OGTD pdf
    OGTD is an offensive language dataset for Greek annotated following the OLID guidelines. OGTD was used in the OffensEval 2020: Multilingual Offensive Language Identification in Social Media (SemEval 2020 - Task 12) shared task.

  • Offensive Language Identification Dataset (OLID) OLID pdf
    OLID contains a collection of tweets annotated using a hierarchical annotation model that encompasses the following three levels: A: Offensive Language Detection; B: Categorization of Offensive Language; C: Offensive Language Target Identification. OLID was used in the OffensEval 2019: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task.

  • Semi-Supervised Offensive Language Identification Dataset (SOLID) SOLID pdf
    SOLID contains over 9 million tweets annotated following OLID's three-level taxonomy. SOLID was used in the OffensEval 2020: Multilingual Offensive Language Identification in Social Media (SemEval 2020 - Task 12) shared task.

Other Resources

  • Frequency lists from comparable Spanish corpora Word Unigrams POS and Morphology pdf
    These two frequency lists were produced to compare linguistic features of four Spanish varieties (Argentina, Mexico, Peru, and Spain) as described in this 2013 paper.

  • P-AWL: Portuguese Academic Word List P-AWL pdf
    The P-AWL was developed for Portuguese using the English Academic Word List (AWL) created by Coxhead (2000). It contains 1,812 entries.


© 2020 - 2021 Marcos Zampieri | Last Update: July 2021 | Template: Plain Academic