IEEE Transactions on Multimedia

Papers
(The median citation count of IEEE Transactions on Multimedia is 6. The table below lists those papers that are above that threshold based on CrossRef citation counts [max. 250 papers]. The publications cover those that have been published in the past four years, i.e., from 2021-11-01 to 2025-11-01.)
Article | Citations
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 706
Disaggregation Distillation for Person Search | 390
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 335
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 255
Improving Vision Anomaly Detection With the Guidance of Language Modality | 249
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 243
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 241
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 205
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 202
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | 194
Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning | 161
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 161
Dual-Task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 160
Exploring Kernel Transformations for Implicit Neural Representations | 158
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 149
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 140
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 138
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 135
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval | 134
Dynamic Contrastive Distillation for Image-Text Retrieval | 134
SLCGC: A lightweight Self-supervised Low-pass Contrastive Graph Clustering Network for Hyperspectral Images | 132
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 131
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 130
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 129
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 128
BASNet: Boundary Assisted Network for Image Splicing Forgery Detection | 122
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency | 119
Neighborhood Contrastive Transformer for Change Captioning | 117
Anomaly-Led Prompting Learning Caption Generating Model and Benchmark | 113
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 113
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 113
Progressive Local Filter Pruning for Image Retrieval Acceleration | 111
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 110
Watch Where You Move: Region-Aware Dynamic Aggregation and Excitation for Gait Recognition | 109
Semantic Dual-Adversarial Network for Blended-Target Domain Adaptation | 108
Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes | 105
Structured Attention Network for Referring Image Segmentation | 104
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 101
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 100
Semantic-Aware Triplet Loss for Image Classification | 100
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 98
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 98
Pixel Bleach Network for Detecting Face Forgery Under Compression | 98
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 97
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 95
Asymptotics-Aware Multi-View Subspace Clustering | 94
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 93
Towards Fast and Robust Real Image Denoising With Attentive Neural Network and PID Controller | 92
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 91
Guided Image-to-Image Translation by Discriminator-Generator Communication | 91
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 90
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 90
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 88
Unsupervised Learning-Based Framework for Deepfake Video Detection | 88
Mix-Based Training Strategies for Learning Implicit Neural Representations | 87
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 86
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 85
Optimal Transport-Based Patch Matching for Image Style Transfer | 84
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 84
FoodSAM: Any Food Segmentation | 83
Weakly-Supervised 3D Visual Grounding Based on Visual Language Alignment | 81
Scale Up Composed Image Retrieval Learning via Modification Text Generation | 81
A Total Variation With Joint Norms For Infrared and Visible Image Fusion | 80
Self-Guided Discriminative Locality Preserving Projections | 79
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 78
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 78
A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition | 78
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 76
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 76
Annealing Genetic GAN for Imbalanced Web Data Learning | 75
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 73
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 73
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network | 72
JPEG AI Compressed Domain Face Detection: a Multi-scale Bridging Perspective | 72
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 72
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 72
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 72
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach | 71
Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models | 70
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 70
Depth Map Super-Resolution via Deep Cross-modality and Cross-scale Guidance | 70
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 70
Rate-Adaptive Neural Network for Image Compressive Sensing | 69
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 69
ALCER3D: Adaptive Learning Constraints for Enhanced Retrieval of Complex Indoor 3D Scenarios | 68
Supervised Contrastive Learning for Indoor Point Cloud Oversegmentation | 68
RUL: Region Uncertainty Learning for Robust Face Recognition | 67
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 67
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 67
Human-Centric Behavior Description in Videos: New Benchmark and Model | 67
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 67
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 67
Unsupervised Image and Text Fusion for Travel Information Enhancement | 66
CariMe: Unpaired Caricature Generation With Multiple Exaggerations | 65
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 65
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 64
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 64
One-Shot Image-to-Image Translation via Part-Global Learning With a Multi-Adversarial Framework | 63
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 63
DEHand: Deformable Encoding for Photo-Realistic Free-View and Free-Pose Hand Rendering | 62
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 62
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 62
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 62
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 61
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 61
CRSOT: Cross-Resolution Object Tracking Using Unaligned Frame and Event Cameras | 61
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 61
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 60
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 60
High Specificity Guided Cross-Domain Few-Shot Segmentation | 60
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 60
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 59
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 59
Video-to-Music Recommendation Using Temporal Alignment of Segments | 59
Video Instance Segmentation by Instance Flow Assembly | 59
Motion Deblur by Learning Residual From Events | 58
Enhanced Context Mining and Filtering for Learned Video Compression | 57
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 57
Action-Responsive Contrastive Network for Fine-Grained Skeleton-Based Action Recognition | 57
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 56
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 56
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 56
Modeling Sequential Listening Behaviors With Attentive Temporal Point Process for Next and Next New Music Recommendation | 55
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 55
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 55
Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection | 55
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 55
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 55
Probabilistic Temporal Masked Attention for Cross-view Online Action Detection | 54
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 54
Multi-Localized Sensitive Autoencoder-Attention-LSTM For Skeleton-based Action Recognition | 54
CMANet: Context-aware Mutual Attention Network for Referring Image Segmentation | 53
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 53
A Multidimensional Media Adaptation Framework for Live Holographic Communication | 53
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 53
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 53
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 53
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 52
Multimodal Sentiment Analysis With Image-Text Interaction Network | 52
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 52
Underwater Image Enhancement With Cascaded Contrastive Learning | 51
Improving Fine-Grained Image Classification With Multimodal Information | 51
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 51
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 51
BVI-DVC: A Training Database for Deep Video Compression | 51
Reordered k-Means: A New Baseline for View-Unaligned Multi-View Clustering | 50
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 50
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 50
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 50
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 50
View-Invariant Human Action Recognition Via View Transformation Network (VTN) | 49
TrackletGait: A Robust Framework for Gait Recognition in the Wild | 48
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 48
Blind Video Quality Assessment at the Edge | 48
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection | 48
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 48
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 48
Ensemble Learning With Manifold-Based Data Splitting for Noisy Label Correction | 47
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 47
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images | 47
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 47
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 47
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment | 47
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 47
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 46
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance | 46
OpenSlot: Mixed Open-Set Recognition With Object-Centric Learning | 46
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data | 46
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 45
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 45
SSPNet: Predicting Visual Saliency Shifts | 45
Visibility-Based Geometry Pruning of Neural Plenoptic Scene Representations | 45
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 45
Tuning-free High-Resolution Video Diffusion with Spatial-Temporal Latent Grouping | 45
Geometric Continuity and Consistency Learning for Self-Supervised Point Cloud Completion | 44
Compact-Yet-Separate: Proto-Centric Multi-Modal Hashing With Pronounced Category Differences for Multi-Modal Retrieval | 44
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 44
Improving Visual Object Tracking Through Visual Prompting | 44
Inexactly Matched Referring Expression Comprehension With Rationale | 44
Point Cloud Soft Multicast for Untethered XR Users | 44
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 44
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 44
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 44
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 44
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model | 44
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction | 44
MPPM: A Mobile-Efficient Part Model for Object re-ID | 43
CMI-Net: Cross-view Message Token Interaction Network for 3D Shape Recognition | 43
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 43
Self-Supervised Face Image Manipulation by Conditioning GAN on Face Decomposition | 43
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 43
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 43
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 43
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 43
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 43
Collaborative Learning With a Multi-Branch Framework for Feature Enhancement | 43
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 43
Dense Video Captioning With Early Linguistic Information Fusion | 43
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 42
SwimVG: Step-Wise Multimodal Fusion and Adaption for Visual Grounding | 42
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved | 42
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 42
EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images | 42
Neural-Enhanced Rate Adaptation and Computation Distribution for Emerging mmWave Multi-User 3D Video Streaming Systems | 42
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 42
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 42
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 41
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning | 41
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 41
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 41
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 41
Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval | 41
Uncertainty-Aware Unsupervised Domain Adaptation in Object Detection | 41
Face De-Occlusion With Deep Cascade Guidance Learning | 41
Self-Supervised Graph Convolutional Network for Multi-View Clustering | 41
Toward Intelligent Design: An AI-Based Fashion Designer Using Generative Adversarial Networks Aided by Sketch and Rendering Generators | 41
A Real-Time Semi-Supervised Deep Tone Mapping Network | 41
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 40
Interpretable Multi-view Representation Learning Towards Complex Scenes: From Homogeneity to Heterogeneity | 40
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection | 40
Heterogeneous Hierarchical Feature Aggregation Network for Personalized Micro-Video Recommendation | 40
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 40
Graph Convolutional Network With Unknown Class Number | 40
Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis | 39
Instruction-Driven 3D Facial Expression Generation and Transition | 39
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 39
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 39
Differentiable Spatial Regression: A Novel Method for 3D Hand Pose Estimation | 39
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 39
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement | 39
Retinex-Based Variational Framework for Low-Light Image Enhancement and Denoising | 39
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 39
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 39
Design of a 5G Multimedia Broadcast Application Function Supporting Adaptive Error Recovery | 38
Learning Trimaps via Clicks for Image Matting | 38
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback | 38
A Novel Human Image Sequence Synthesis Method by Pose-Shape-Content Inference | 38
TaoHighlight: Commodity-Aware Multi-Modal Video Highlight Detection in E-Commerce | 38
Privacy-Preserving Image Acquisition for Neural Vision Systems | 38
DetailRecon: Focusing on Detailed Regions for Online Monocular 3D Reconstruction | 38
Learning Sparse and Discriminative Multimodal Feature Codes for Finger Recognition | 38
Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization | 38
MorphNeRF: Text-Guided 3D-Aware Editing via Morphing Generative Neural Radiance Fields | 38
IEEE Transactions on Multimedia Publication Information | 38
Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching | 38
Learning to Learn With Variational Inference for Cross-Domain Image Classification | 38
Towards Semi-supervised Dual-modal Semantic Segmentation | 38
Editorial | 38
Weakly Supervised Distribution Discrepancy Minimization Learning With State Information for Person Re-Identification | 38
Adapting Multimodal Large Language Models for Video Question Answering by Capturing Question-critical and Coherent Moments | 38