IEEE-ACM Transactions on Audio Speech and Language Processing

Papers
(The median citation count of IEEE-ACM Transactions on Audio Speech and Language Processing is 5. The table below lists the papers whose CrossRef citation counts are above that threshold [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-09-01 to 2025-09-01.)
Article | Citations
Representation Learning With Hidden Unit Clustering for Low Resource Speech Applications | 332
Decorrelation in Feedback Delay Networks | 249
CET2: Modelling Topic Transitions for Coherent and Engaging Knowledge-Grounded Conversations | 219
WDEA: The Structure and Semantic Fusion With Wasserstein Distance for Low-Resource Language Entity Alignment | 136
Review of Methods for Automatic Speaker Verification | 129
Envelope-Based Multichannel Noise Reduction for Cochlear Implant Applications | 109
$\mathcal{P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis | 106
Towards Generating Diverse Audio Captions via Adversarial Training | 100
Multi-Channel to Multi-Channel Noise Reduction and Reverberant Speech Preservation in Time-Varying Acoustic Scenes for Binaural Reproduction | 83
The Harmonic Shift Algorithm for Efficient Multi-Pitch Detection | 73
Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data | 73
MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation | 62
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features | 60
DropAttack: A Random Dropped Weight Attack Adversarial Training for Natural Language Understanding | 55
Improvement of Accent Classification Models Through Grad-Transfer From Spectrograms and Gradient-Weighted Class Activation Mapping | 54
Attention-Based Speech Enhancement Using Human Quality Perception Modeling | 54
Interpretable Multimodal Capsule Fusion | 54
A User-Centric Approach for Deep Residual-Echo Suppression in Double-Talk | 54
Multi-Level Time-Frequency Bins Selection for Direction of Arrival Estimation Using a Single Acoustic Vector Sensor | 53
Generalizing Speaker Verification for Spoof Awareness in the Embedding Space | 53
Inference Skipping for More Efficient Real-Time Speech Enhancement With Parallel RNNs | 50
SBSim: A Sentence-BERT Similarity-Based Evaluation Metric for Indian Language Neural Machine Translation Systems | 48
Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition | 46
Learning Discriminative Representations and Decision Boundaries for Open Intent Detection | 46
Similarity Measurement of Segment-Level Speaker Embeddings in Speaker Diarization | 45
Reverberant Source Separation Using NTF With Delayed Subsources and Spatial Priors | 45
Learning Phone Recognition From Unpaired Audio and Phone Sequences Based on Generative Adversarial Network | 44
AudioLM: A Language Modeling Approach to Audio Generation | 43
Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps | 43
Comparison of Feature Extraction Methods for Sound-Based Classification of Honey Bee Activity | 42
The VoxCeleb Speaker Recognition Challenge: A Retrospective | 41
Automatic Math Word Problem Generation With Topic-Expression Co-Attention Mechanism and Reinforcement Learning | 40
Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing | 39
Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations | 39
COVID-19 Detection via Fusion of Modulation Spectrum and Linear Prediction Speech Features | 39
SPEC: Summary Preference Decomposition for Low-Resource Abstractive Summarization | 38
Hate Speech Detection via Dual Contrastive Learning | 36
Emotion Prediction Oriented Method With Multiple Supervisions for Emotion-Cause Pair Extraction | 35
End-to-End Speaker Verification via Curriculum Bipartite Ranking Weighted Binary Cross-Entropy | 35
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix | 35
ReZero: Region-Customizable Sound Extraction | 34
Enhanced Multi-Domain Dialogue State Tracker With Second-Order Slot Interactions | 33
Predicting Level-Dependent Changes in Concurrent Vowel Scores Using the 2D-CNN Models | 33
IEEE Signal Processing Society Information | 33
Adaptive Multi-Domain Dialogue State Tracking on Spoken Conversations | 33
Label-Correction Capsule Network for Hierarchical Text Classification | 31
Disentangled Text Representation Learning With Information-Theoretic Perspective for Adversarial Robustness | 31
Phrase-Aware Financial Sentiment Analysis Based on Constituent Syntax | 31
Complex-Domain Pitch Estimation Algorithm for Narrowband Speech Signals | 31
Spherically Steerable Vector Differential Microphone Arrays | 31
Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry | 30
USEV: Universal Speaker Extraction With Visual Cue | 29
Neural Coupled Sequence Labeling for Heterogeneous Annotation Conversion | 29
Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment | 29
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach | 29
Multi-Source Domain Adaptation and Fusion for Speaker Verification | 29
Integrated Syntactic and Semantic Tree for Targeted Sentiment Classification Using Dual-Channel Graph Convolutional Network | 28
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking | 28
Implicit Self-Supervised Language Representation for Spoken Language Diarization | 28
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation | 28
Source Separation of Piano Concertos Using Musically Motivated Augmentation Techniques | 28
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations | 28
Blind Identification of Ambisonic Reduced Room Impulse Response | 27
Multi-Grained Evidence Inference for Multi-Choice Reading Comprehension | 27
Multi-Tone Active Noise Equalizer With Spatially Distributed User-Selected Profiles | 27
Learning to Perturb for Contrastive Learning of Unsupervised Sentence Representations | 26
Grouped Feedback Delay Networks With Frequency-Dependent Coupling | 26
Increasing Context for Estimating Confidence Scores in Automatic Speech Recognition | 26
FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow | 25
Weighted Frequency Smoothing for Enhanced Speaker Localization | 25
Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music | 25
Howling Detection and Gain Control for Speech Reinforcement in a Noisy Car Cabin Environment | 25
TOE: A Grid-Tagging Discontinuous NER Model Enhanced by Embedding Tag/Word Relations and More Fine-Grained Tags | 25
DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech — A Study Between English and Mandarin | 25
Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation | 25
Artificial Vocal Learning Guided by Phoneme Recognition and Visual Information | 23
Dynamic Multi-Branch Layers for On-Device Neural Machine Translation | 23
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning | 23
Magnitude-Corrected and Time-Aligned Interpolation of Head-Related Transfer Functions | 23
SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection | 23
$F0$ Estimation and Voicing Detection With Cascade Architecture in Noisy Speech | 23
Unsupervised Disentanglement Learning Model for Exemplar-Guided Paraphrase Generation | 22
Neural Multi-Channel and Multi-Microphone Acoustic Echo Cancellation | 22
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting | 22
Unsupervised Music Source Separation Using Differentiable Parametric Source Models | 22
Attention Analysis and Calibration for Transformer in Natural Language Generation | 22
An Analysis of Traditional Noise Power Spectral Density Estimators Based on the Gaussian Stochastic Volatility Model | 22
Training a Singing Transcription Model Using Connectionist Temporal Classification Loss and Cross-Entropy Loss | 22
Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval | 22
MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription | 22
Tackling Interpretability in Audio Classification Networks With Non-negative Matrix Factorization | 22
Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation | 22
Generalized Hyperbolic Tangent Based Random Fourier Conjugate Gradient Filter for Nonlinear Active Noise Control | 22
Anti-Aliasing Speech DOA Estimation Under Spatial Aliasing Conditions | 22
Towards Recognition for Radio-Echo Speech in Air Traffic Control: Dataset and a Contrastive Learning Approach | 22
A New Diffusion Filtered-X Affine Projection Algorithm: Performance Analysis and Application in Windy Environment | 21
An Optimized Zero-Attracting LMS Algorithm for the Identification of Sparse System | 21
Refining History for Future-Aware Neural Machine Translation | 21
Visually Grounded Few-Shot Word Learning in Low-Resource Settings | 21
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing | 21
HPSG-Inspired Joint Neural Constituent and Dependency Parsing in O($n^3$) Time Complexity | 21
Decomposition-Based Wiener Filter Using the Kronecker Product and Conjugate Gradient Method | 21
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement | 20
CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding | 20
Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation | 20
Acoustic Modelling From Raw Source and Filter Components for Dysarthric Speech Recognition | 20
ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification | 20
Unified Instance and Knowledge Alignment Pretraining for Aspect-Based Sentiment Analysis | 20
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization | 20
Parametric Ambisonic Encoding of Arbitrary Microphone Arrays | 20
Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters | 20
Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding | 20
Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information | 19
Configurable EBEN: Extreme Bandwidth Extension Network to Enhance Body-Conducted Speech Capture | 19
Microphone Array Beamforming With High Flexible Interference Attenuation and Noise Reduction | 19
TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition | 19
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition | 19
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS | 19
DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors | 18
Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings | 18
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance | 18
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer | 18
Optimal Modal Decomposition for Directionally Biased Sound Field Recording | 18
Coefficients-Switched Normalized Least-Mean-Squares Adaption in Echo Canceler of Sparse-Echo-Path | 18
Query-Efficient Black-Box Adversarial Attacks on Automatic Speech Recognition | 18
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation | 18
Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs | 18
Hybrid-Frequency-Resolution Adaptive Kalman Filter for Online Identification of Long Acoustic Responses With Low Input-Output Latency | 18
Distance Metric-Based Open-Set Domain Adaptation for Speaker Verification | 18
Enhancing Paraphrase Question Generation With Prior Knowledge | 18
A Digital Twin Architecture for Wireless Networked Adaptive Active Noise Control | 18
Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training | 18
BEHM-GAN: Bandwidth Extension of Historical Music Using Generative Adversarial Networks | 18
Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model | 18
Minimum Processing Near-End Listening Enhancement | 18
Higher-Order Stereophony | 18
S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder | 17
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation | 17
Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models | 17
Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion | 17
Emulating Perceptual Evaluation of Voice Using Scattering Transform Based Features | 17
Syntax-Aware Multi-Spans Generation for Reading Comprehension | 17
Active Discovering New Slots for Task-Oriented Conversation | 16
EchoScan: Scanning Complex Room Geometries via Acoustic Echoes | 16
A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition | 16
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings | 16
FSD50K: An Open Dataset of Human-Labeled Sound Events | 16
Decoding Knowledge Transfer for Neural Text-to-Speech Training | 16
Multi-Teacher Distillation With Single Model for Neural Machine Translation | 16
TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis | 16
SoundStream: An End-to-End Neural Audio Codec | 16
Heterogeneous-Graph Reasoning With Context Paraphrase for Commonsense Question Answering | 16
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC | 16
A Flexible Architecture Using Temporal, Spatial and Semantic Correlation-Based Algorithms for Story Segmentation of Broadcast News | 16
Distributed Combined Acoustic Echo Cancellation and Noise Reduction in Wireless Acoustic Sensor and Actuator Networks | 16
A Perceptually Evaluated Signal Model: Collisions Between a Vibrating Object and an Obstacle | 16
A New Virtual Tracking Sub-Algorithm Based Hybrid Active Control System for Narrowband Noise With Impulsive Interference | 16
MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning | 16
Consonant-Vowel Transition Models Based on Deep Learning for Objective Evaluation of Articulation | 16
MOSA: Music Motion With Semantic Annotation Dataset for Cross-Modal Music Processing | 15
Latent-Domain Predictive Neural Speech Coding | 15
Multi-Source Discriminant Subspace Alignment for Cross-Domain Speech Emotion Recognition | 15
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data | 15
A Composite T60 Regression and Classification Approach for Speech Dereverberation | 15
Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck | 15
AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text | 15
On the Predictive Power of Objective Intelligibility Metrics for the Subjective Performance of Deep Complex Convolutional Recurrent Speech Enhancement Networks | 15
Multi-Source Localization Using Optimized Time-Frequency Representation and Sparsity Component Analysis | 15
Meta-AF: Meta-Learning for Adaptive Filters | 15
Rethinking Textual Adversarial Defense for Pre-Trained Language Models | 14
JMS-QA: A Joint Hierarchical Architecture for Mental Health Question Answering | 14
Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions | 14
Empathetic Response Generation Based on Plug-and-Play Mechanism With Empathy Perturbation | 14
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations | 14
Adaptive Pre-Training and Collaborative Fine-Tuning: A Win-Win Strategy to Improve Review Analysis Tasks | 14
Interrelate Training and Clustering for Online Speaker Diarization | 14
List of Reviewers | 14
Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments | 14
LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification | 14
Task-Adaptive Feature Fusion for Generalized Few-Shot Relation Classification in an Open World Environment | 14
The Weighted Cross-Modal Attention Mechanism With Sentiment Prediction Auxiliary Task for Multimodal Sentiment Analysis | 14
Retrieve-and-Edit Domain Adaptation for End2End Aspect Based Sentiment Analysis | 14
Joint Maximum Likelihood Estimation of Microphone Array Parameters for a Reverberant Single Source Scenario | 14
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models | 14
Decomposed Meta-Learning for Few-Shot Sequence Labeling | 14
Cross-Domain Aspect-Based Sentiment Classification With Tripartite Graph Modeling | 14
Operation-Augmented Numerical Reasoning for Question Answering | 13
Block-Based Perceptually Adaptive Sound Zones With Reproduction Error Constraints | 13
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection | 13
FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System | 13
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria | 13
Controllable Dialogue Generation With Disentangled Multi-Grained Style Specification and Attribute Consistency Reward | 13
Uncertainty-Driven Knowledge Distillation for Language Model Compression | 13
Selective Listening by Synchronizing Speech With Lips | 13
Gradformer: A Framework for Multi-Aspect Multi-Granularity Pronunciation Assessment | 13
EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations | 13
En-HACN: Enhancing Hybrid Architecture With Fast Attention and Capsule Network for End-to-end Speech Recognition | 13
Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses | 13
JoinER-BART: Joint Entity and Relation Extraction With Constrained Decoding, Representation Reuse and Fusion | 13
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis | 13
Dynamic Prompt-Driven Zero-Shot Relation Extraction | 13
RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition | 13
Improving Chinese Named Entity Recognition by Large-Scale Syntactic Dependency Graph | 13
Dynamic Convolutional Neural Networks as Efficient Pre-Trained Audio Models | 13
Towards Unified Multi-Domain Machine Translation With Mixture of Domain Experts | 13
Improved Transformer With Multi-Head Dense Collaboration | 13
Data-Centric Methods for Environmental Sound Classification With Limited Labels | 12
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | 12
Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech | 12
Artist Similarity Based on Heterogeneous Graph Neural Networks | 12
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition | 12
A Class of Pareto Optimal Binaural Beamformers | 12
Adaptive Ensemble Self-Distillation With Consistent Gradients for Fast Inference of Pretrained Language Models | 12
Multi-Task Attentive Residual Networks for Argument Mining | 12
Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition | 12
NoiseBandNet: Controllable Time-Varying Neural Synthesis of Sound Effects Using Filterbanks | 12
Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children | 12
Joint Dual Learning With Mutual Information Maximization for Natural Language Understanding and Generation in Dialogues | 12
Constant-Beamwidth Beamforming With Nonuniform Concentric Ring Arrays | 12
Enhancing Low-Resource NLP by Consistency Training With Data and Model Perturbations | 12
Statistically Guided Near-End Speech Intelligibility Improvement Through Voice Transformation and Transfer Learning | 12
Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments | 12
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR | 12
Sparse DNN Model for Frequency Expanding of Higher Order Ambisonics Encoding Process | 12
Integrating Prior Translation Knowledge Into Neural Machine Translation | 12
ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks | 12
Interpretable Spectrum Transformation Attacks to Speaker Recognition Systems | 12
SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System | 12
Distributed Microphone Array Localization Problem via SDP-SOCP Method | 12
Music Source Separation With Band-Split RNN | 12
Training-Based Multiple Source Tracking Using Manifold-Learning and Recursive Expectation-Maximization | 11
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | 11
AmbiSep: Joint Ambisonic-to-Ambisonic Speech Separation and Noise Reduction | 11
A Compact Noise Covariance Matrix Model for MVDR Beamforming | 11
Recent Trends in Deep Learning Based Textual Emotion Cause Extraction | 11
A Multi-Level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference | 11
Can Pretrained English Language Models Benefit Non-English NLP Systems in Low-Resource Scenarios? | 11
Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement | 11
Low Latency Speech Enhancement for Hearing Aids Using Deep Filtering | 11
Task-Specific Optimization of Virtual Channel Linear Prediction-Based Speech Dereverberation Front-End for Far-Field Speaker Verification | 11
WDSRL: Multi-Domain Neural Machine Translation With Word-Level Domain-Sensitive Representation Learning | 11
The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation | 11
Overview of the Tenth Dialog System Technology Challenge: DSTC10 | 11
PQG-A2SA: Performance Quantification Guided Audio-to-Score Alignment for Orchestral Music | 11
Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios | 11
On the Quantization of Neural Models for Speaker Verification | 11
Learning Multi-Dimensional Speaker Localization: Axis Partitioning, Unbiased Label Distribution, and Data Augmentation | 11
Transforming Wikipedia Into Augmented Data for Query-Focused Summarization | 11
Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning | 11
Lost in Context? On the Sense-Wise Variance of Contextualized Word Embeddings | 11