OOIR: Observatory of International Research

Papers

(The TQCC of Language Resources and Evaluation is 3. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
Strategies for managing time and costs in speech corpus creation: insights from the Slovenian ARTUR corpus	41
From LIMA to DeepLIMA: following a new path of interoperability	26
Hope speech detection in Spanish	25
Speech acts in the Dutch COVID-19 Press Conferences	25
Spelling errors made by people with dyslexia	25
A survey on geocoding: algorithms and datasets for toponym resolution	24
Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks	24
IIT Delhi Dialogue Corpus: a quantitative analysis of a spoken corpus of Hindi	21
Brazilian Portuguese corpora for teaching and translation: the CoMET project	21
Prompting encoder models for zero-shot classification: a cross-domain study in Italian	18
The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war	16
AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata	15
The Visual Language Research Corpus (VLRC): an annotated corpus of comics from Asia, Europe, and the United States	14
A new evaluation method: evaluation data and metrics for Chinese grammatical error correction	14
Understanding conversational interaction in multiparty conversations: the EVA Corpus	14
Quality assessment of Tibetan–Chinese poetry translation: integrating automated metrics and qualitative insights through a cross-system comparison of dedicated NMT engines and a prompted LLM	13
A study on methods for revising dependency treebanks: in search of gold	13
Speech recognition in edge environments: an exploration of support and impact of model compression	12
Spontaneous, controlled acts of reference between friends and strangers	12
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning	11
Construction of Amharic information retrieval resources and corpora	11
Human–machine interaction in building an English reference dataset for natural language processing tasks	11
TLEX: an efficient method for extracting exact timelines from TimeML temporal graphs	11
The properties of panels in global comics: frequency and size of 76 K panels in 1,030 comics from 144 countries	11
Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese	10
Sentiment analysis in Portuguese tweets: an evaluation of diverse word representation models	10
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI	10
adaptNMT: an open-source, language-agnostic development environment for neural machine translation	10
Perspectivist approaches to natural language processing: a survey	9
Ma’aks: manually-curated parallel dataset for Arabic text sentiment swap	9
Conversion of the Spanish WordNet databases into a Prolog-readable format	8
Automatic readability assessment for sentences: neural, hybrid and large language models	8
UHated: hate speech detection in Urdu language using transfer learning	8
Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype	8
A comparative analysis of encoder only and decoder only models in intent classification and sentiment analysis: navigating the trade-offs in model size and performance	8
CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese	8
Chinese-DiMLex: a lexicon of Chinese discourse connectives	7
Uzbek news corpus for named entity recognition	7
DoSLex: automatic generation of all domain semantically rich sentiment lexicon	7
Slovenian parliamentary corpus siParl	7
The Sanskrit Sembank	7
TCMeta: a multilingual dataset of COVID tweets for relation-level metaphor analysis	7
VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability	6
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain	6
Developing and mining an underage modern Greek chat corpus: Do students show signs of bullying behavior while working on a project?	6
An integrated framework for emotion and sentiment analysis in Tamil and Malayalam visual content	6
Correction: The corpus of aggressive language in Polish parliamentary debates	6
Book Review: The Routledge handbook of discourse and disinformation	6
Detecting racism in the digital age: a survey of datasets and algorithms	6
Sense through time: diachronic word sense annotations for word sense induction and Lexical Semantic Change Detection	5
Studying word meaning evolution through incremental semantic shift detection	5
Benchmarking Hindi-to-English direct speech-to-speech translation with synthetic data	5
A performance analysis of a large language model for Marathi language NLP tasks	5
PolitePEER: does peer review hurt? A dataset to gauge politeness intensity in the peer reviews	5
The WASABI song corpus and knowledge graph for music lyrics analysis	5
KurdiSent: a corpus for kurdish sentiment analysis	5
Language resources for clinical linguistics: introduction to the special issue	5
Open source platform for Estonian speech transcription	5
Multi-task learning for multi-dialect Arabic sentiment classification and sarcasm detection	5
Correction to: Two sepedi‑english code‑switched speech corpora	4
kidsNARRATE: a versatile corpus for studying Chinese-english bilingual L2 narrative skills in preschoolers	4
Constructing a cross-document event coreference corpus for Dutch	4
Using BERT models for breast cancer diagnosis from Turkish radiology reports	4
Developing and testing syllabification systems for South African Sesotho	4
Part of speech (POS) tagging in Roman Urdu: datasets and models	4
JurisTCU: a Brazilian Portuguese information retrieval dataset with query relevance judgments	4
Design and construction of Guayaquil radio speech corpus (CHARG)	4
A corpus of English learners with Arabic and Hebrew backgrounds	4
Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech	4
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect	4
Text-Muddler: an advanced adversarial paradigm for disrupting NLP-based neural architectures in sentiment analysis frameworks	4
The limitations of irony detection in Dutch social media	3
Creation of a gold standard Dutch corpus of clinical notes for adverse drug event detection: the Dutch ADE corpus	3
OMCD: Offensive Moroccan Comments Dataset	3
Correction to: Semi-automation of gesture annotation by machine learning and human collaboration	3
Examining inferred author and textual correlates of harmful language annotation	3
FullStop: punctuation and segmentation prediction for Dutch with transformers	3
Multilingual speech representation for the Manipuri automatic speech recognition system	3
The taggedPBC: annotating a massive parallel corpus for crosslinguistic investigations	3
Finnish parliament ASR corpus	3
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines	3
Correction to: Resources for Turkish natural language processing: A critical survey	3
Sentiment analysis in low-resource contexts: BERT’s impact on Central Kurdish	3
DILLo: an Italian lexical database for speech-language pathologists	3
ThaiCoref: Thai coreference resolution dataset	3
Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation	3
The Hmong Medical Corpus: a biomedical corpus for a minority language	3
Assessment of pragmatic abilities and cognitive substrates (APACS) brief remote: a novel tool for the rapid and tele-evaluation of pragmatic skills in Italian	3
Correction: COLLIE: a broad-coverage ontology and lexicon of verbs in English	3
HASTIKA: hate speech and target identification in Kannada-English code-mixed text	3
Aratox: a multi-dialect, multi-label arabic dataset and model benchmark for toxicity detection	3