OOIR: Observatory of International Research

Papers

(The median citation count of Language Resources and Evaluation is 1. The table below lists the papers that meet or exceed that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
Strategies for managing time and costs in speech corpus creation: insights from the Slovenian ARTUR corpus	41
From LIMA to DeepLIMA: following a new path of interoperability	26
Speech acts in the Dutch COVID-19 Press Conferences	25
Spelling errors made by people with dyslexia	25
Hope speech detection in Spanish	25
Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks	24
A survey on geocoding: algorithms and datasets for toponym resolution	24
IIT Delhi Dialogue Corpus: a quantitative analysis of a spoken corpus of Hindi	21
Brazilian Portuguese corpora for teaching and translation: the CoMET project	21
Prompting encoder models for zero-shot classification: a cross-domain study in Italian	18
The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war	16
AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata	15
A new evaluation method: evaluation data and metrics for Chinese grammatical error correction	14
Understanding conversational interaction in multiparty conversations: the EVA Corpus	14
The Visual Language Research Corpus (VLRC): an annotated corpus of comics from Asia, Europe, and the United States	14
A study on methods for revising dependency treebanks: in search of gold	13
Quality assessment of Tibetan–Chinese poetry translation: integrating automated metrics and qualitative insights through a cross-system comparison of dedicated NMT engines and a prompted LLM	13
Spontaneous, controlled acts of reference between friends and strangers	12
Speech recognition in edge environments: an exploration of support and impact of model compression	12
Construction of Amharic information retrieval resources and corpora	11
Human–machine interaction in building an English reference dataset for natural language processing tasks	11
TLEX: an efficient method for extracting exact timelines from TimeML temporal graphs	11
The properties of panels in global comics: frequency and size of 76 K panels in 1,030 comics from 144 countries	11
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning	11
Sentiment analysis in Portuguese tweets: an evaluation of diverse word representation models	10
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI	10
adaptNMT: an open-source, language-agnostic development environment for neural machine translation	10
Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese	10
Perspectivist approaches to natural language processing: a survey	9
Ma’aks: manually-curated parallel dataset for Arabic text sentiment swap	9
Automatic readability assessment for sentences: neural, hybrid and large language models	8
UHated: hate speech detection in Urdu language using transfer learning	8
Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype	8
A comparative analysis of encoder only and decoder only models in intent classification and sentiment analysis: navigating the trade-offs in model size and performance	8
CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese	8
Conversion of the Spanish WordNet databases into a Prolog-readable format	8
DoSLex: automatic generation of all domain semantically rich sentiment lexicon	7
Slovenian parliamentary corpus siParl	7
The Sanskrit Sembank	7
TCMeta: a multilingual dataset of COVID tweets for relation-level metaphor analysis	7
Chinese-DiMLex: a lexicon of Chinese discourse connectives	7
Uzbek news corpus for named entity recognition	7
Developing and mining an underage modern Greek chat corpus: Do students show signs of bullying behavior while working on a project?	6
An integrated framework for emotion and sentiment analysis in Tamil and Malayalam visual content	6
Correction: The corpus of aggressive language in Polish parliamentary debates	6
Book Review: The Routledge handbook of discourse and disinformation	6
Detecting racism in the digital age: a survey of datasets and algorithms	6
VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability	6
Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain	6
A performance analysis of a large language model for Marathi language NLP tasks	5
PolitePEER: does peer review hurt? A dataset to gauge politeness intensity in the peer reviews	5
The WASABI song corpus and knowledge graph for music lyrics analysis	5
KurdiSent: a corpus for kurdish sentiment analysis	5
Language resources for clinical linguistics: introduction to the special issue	5
Open source platform for Estonian speech transcription	5
Multi-task learning for multi-dialect Arabic sentiment classification and sarcasm detection	5
Sense through time: diachronic word sense annotations for word sense induction and Lexical Semantic Change Detection	5
Studying word meaning evolution through incremental semantic shift detection	5
Benchmarking Hindi-to-English direct speech-to-speech translation with synthetic data	5
Constructing a cross-document event coreference corpus for Dutch	4
Using BERT models for breast cancer diagnosis from Turkish radiology reports	4
Developing and testing syllabification systems for South African Sesotho	4
Part of speech (POS) tagging in Roman Urdu: datasets and models	4
JurisTCU: a Brazilian Portuguese information retrieval dataset with query relevance judgments	4
Design and construction of Guayaquil radio speech corpus (CHARG)	4
A corpus of English learners with Arabic and Hebrew backgrounds	4
Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech	4
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect	4
Text-Muddler: an advanced adversarial paradigm for disrupting NLP-based neural architectures in sentiment analysis frameworks	4
Correction to: Two sepedi‑english code‑switched speech corpora	4
kidsNARRATE: a versatile corpus for studying Chinese-english bilingual L2 narrative skills in preschoolers	4
Multilingual speech representation for the Manipuri automatic speech recognition system	3
The taggedPBC: annotating a massive parallel corpus for crosslinguistic investigations	3
Finnish parliament ASR corpus	3
FullStop: punctuation and segmentation prediction for Dutch with transformers	3
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines	3
Correction to: Resources for Turkish natural language processing: A critical survey	3
Sentiment analysis in low-resource contexts: BERT’s impact on Central Kurdish	3
ThaiCoref: Thai coreference resolution dataset	3
Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation	3
The Hmong Medical Corpus: a biomedical corpus for a minority language	3
DILLo: an Italian lexical database for speech-language pathologists	3
Assessment of pragmatic abilities and cognitive substrates (APACS) brief remote: a novel tool for the rapid and tele-evaluation of pragmatic skills in Italian	3
Correction: COLLIE: a broad-coverage ontology and lexicon of verbs in English	3
HASTIKA: hate speech and target identification in Kannada-English code-mixed text	3
The limitations of irony detection in Dutch social media	3
Creation of a gold standard Dutch corpus of clinical notes for adverse drug event detection: the Dutch ADE corpus	3
OMCD: Offensive Moroccan Comments Dataset	3
Aratox: a multi-dialect, multi-label arabic dataset and model benchmark for toxicity detection	3
Correction to: Semi-automation of gesture annotation by machine learning and human collaboration	3
Examining inferred author and textual correlates of harmful language annotation	3
Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus	2
SOLD: Sinhala offensive language dataset	2
Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements	2
Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text	2
A Chinese natural speech complex emotion dataset based on emotion vector annotation method	2
A rich task-oriented dialogue corpus in Vietnamese	2
Automatic genre identification: a survey	2
“You’ll be a nurse, my son!” Automatically assessing gender biases in autoregressive language models in French and Italian	2
A comparative study of sentence alignment methods for Spanish text simplification	2
Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations	2
MulCogBench: a multi-modal cognitive benchmark dataset for evaluating Chinese and English computational language models	2
Do you understand Italian? Evaluating LVLMs on Italian visual question-answering	2
Automating translation checks of financial documents using large language models	2
COLLIE: a broad-coverage ontology and lexicon of verbs in English	2
Disfluency annotated corpora for Indian English in technical domains	2
An aligned corpus of Spanish bibles	2
The Mandarin Chinese speech database: a corpus of 18,820 auditory neutral nonsense sentences	2
RUN-AS: a novel approach to annotate news reliability for disinformation detection	2
Rei Miyata: controlled document authoring in a machine translation age	2
Detection of political hate speech in Korean language	2
Spoken Spanish PoS tagging: gold standard dataset	2
Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection	2
Speech emotion recognition for the Urdu language	2
Investigating interoperable event corpora: limitations of reusability of resources and portability of models	2
Parlamint-it: an 18-karat UD treebank of Italian parliamentary speeches	2
Building an emotion lexicon for Serbian using curated language resources	2
MarIA and BETO are sexist: evaluating gender bias in large language models for Spanish	2
A comprehensive evaluation of semantic relation knowledge of pretrained language models and humans	2
Incremental imbalance-aware deep learning framework for multilingual spoken language identification	2
A survey and study impact of tweet sentiment analysis via transfer learning in low resource scenarios	2
NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese	2
Building a specialised Hebrew textual corpus on construction, planning and architecture	2
The Najdi Arabic Corpus: a new corpus for an underrepresented Arabic dialect	2
Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies	2
Evaluation of a rule-based approach to automatic factual question generation using syntactic and semantic analysis	2
DiscoNaija: a discourse-annotated parallel Nigerian Pidgin-English corpus	2
Aspect-based multimodal sentiment analysis via employing visual-to-emotional-caption translation network using visual-caption pairs	2
Data-driven weakly supervised emotion classification with consistency regularization: Mandarin Chinese as a case	2
FinnSentiment: a Finnish social media corpus for sentiment polarity annotation	2
Human-inspired computational models for European Portuguese: a review	2
Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus	2
OLID-BR: offensive language identification dataset for Brazilian Portuguese	2
Normalized dataset for Sanskrit word segmentation and morphological parsing	2
A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus	2
Regionalized models for Spanish language variations based on Twitter	1
Umplc: the first longitudinal learner corpus of Portuguese	1
Maithilimt: Developing Multi-Domain Parallel Corpus for Hindi-Maithili Machine Translation	1
Neural text sanitization with privacy risk indicators: an empirical analysis	1
Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015)	1
Dataset on sentiment-based cryptocurrency-related news and tweets in English and Malay language	1
DepreSym: A Depression Symptom Annotated Corpus and the Role of Large Language Models as Assessors of Psychological Markers	1
Czech news dataset for semantic textual similarity	1
The corpus of aggressive language in Polish parliamentary debates	1
A flexible tool for a qualia-enriched FrameNet: the FrameNet Brasil WebTool	1
“But why??” Evaluation of user-suggested synonyms in the Thesaurus of Modern Slovene	1
The link between translation difficulty and the quality of machine translation: a literature review and empirical investigation	1
Linguistic knowledge injected into large language model for Urdu-English neural machine translation	1
A new methodology for automatic creation of concept maps of Turkish texts	1
Parallel Trees: a novel resource with aligned dependency and constituency syntactic representations	1
Disfluency processing for cascaded speech translation involving English and Indian languages	1
Two sepedi-english code-switched speech corpora	1
Historical Portuguese corpora: a survey	1
Multilingual prediction of semantic norms with language models: a study on English and Chinese	1
Editorial: LRE updates	1
Parafrasário: a variety-based paraphrasary for Portuguese	1
Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing	1
ChavacanoMT: a corpus and evaluation of neural machine translation for Philippine Creole Spanish	1
Corpus-based computational frame and construction analysis of motion metaphors	1
Human–robot dialogue annotation for multi-modal common ground	1
RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese	1
A new corpus of geolocated ASR transcripts from Germany	1
TANDO+: corpus and baselines for document-level machine translation in Basque–Spanish and Basque–French	1
POMET: a corpus for poetic meter classification	1
Correction: Aratox: a multi-dialect, multi-label arabic dataset and model benchmark for toxicity detection	1
Special issue on language technology platforms	1
Semantic evaluation metric conforming to AMR theory (SEMCAT): a new similarity metric for abstract meaning representation	1
Bridging the linguistic divide: a survey on leveraging large language models for machine translation	1
Detoxifying language model outputs: combining multi-agent debates and reinforcement learning for improved summarization	1
Exa-PSD: a new Persian sentiment analysis dataset on Twitter	1
Error annotation: a review and faceted taxonomy	1
A corpus of Persian literary text	1
Attention and LoRA-based multimodal emotion detection system	1
Arab music improvisation corpus for research (AMICOR): development and machine translation experiments	1
The robotic-surgery propositional bank	1
From extended chunking to dependency parsing using traditional Arabic grammar	1
Fake news article detection datasets for Hindi language	1
Building the VisSE Corpus of Spanish SignWriting	1
Content-free speech activity records: interviews with people with schizophrenia	1
POS tagging of low-resource Pashto language: annotated corpus and BERT-based model	1
CsFEVER and CTKFacts: acquiring Czech data for fact verification	1
Semantic search as extractive paraphrase span detection	1
Register identification from the unrestricted open Web using the Corpus of Online Registers of English	1
Label modification and bootstrapping for zero-shot cross-lingual hate speech detection	1
CINWA (database of terminology for cultivated plants in indigenous languages of northwestern South America): introducing a resource for research in ethnobiology, anthropology, historical linguistics,	1
Evaluation of end-to-end continuous spanish lipreading in different data conditions	1
Mining culture from professional discourse: a lexicon-based hybrid method	1
Aspect sentiment triplet extraction via integrating contextual semantic relevance and syntactic relevance	1
VeLeRo: an inflected verbal lexicon of standard Romanian and a quantitative analysis of morphological predictability	1
UstanceBR: a social media language resource for stance prediction	1