IEEE Transactions on Multimedia

Papers
(The TQCC of IEEE Transactions on Multimedia is 17. The table below lists the papers whose CrossRef citation counts are above that threshold [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-09-01 to 2025-09-01.)
Article | Citations
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 639
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 361
Optimal Transport-Based Patch Matching for Image Style Transfer | 310
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 234
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 229
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 223
Disaggregation Distillation for Person Search | 204
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 201
Semantic-Aware Triplet Loss for Image Classification | 193
Robust Multi-stage Tracking via Multi-scale and Multi-level Representation Learning | 177
Improving Vision Anomaly Detection With the Guidance of Language Modality | 155
Towards Fast and Robust Real Image Denoising With Attentive Neural Network and PID Controller | 150
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 147
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 140
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 135
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 131
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 129
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 124
Pixel Bleach Network for Detecting Face Forgery Under Compression | 123
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 122
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 118
Asymptotics-Aware Multi-View Subspace Clustering | 117
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency | 115
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 114
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 111
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 111
Annealing Genetic GAN for Imbalanced Web Data Learning | 110
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 110
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 110
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 108
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval | 107
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | 107
Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 103
Exploring Kernel Transformations for Implicit Neural Representations | 100
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 98
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 96
Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes | 96
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 94
Unsupervised Learning-Based Framework for Deepfake Video Detection | 92
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 91
Scale Up Composed Image Retrieval Learning via Modification Text Generation | 89
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment | 88
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 87
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 87
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 85
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 84
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 84
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 84
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 83
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 83
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 83
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 82
Neighborhood Contrastive Transformer for Change Captioning | 81
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 81
Structured Attention Network for Referring Image Segmentation | 80
A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition | 80
Progressive Local Filter Pruning for Image Retrieval Acceleration | 78
FoodSAM: Any Food Segmentation | 76
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 75
Guided Image-to-Image Translation by Discriminator-Generator Communication | 75
A Total Variation With Joint Norms For Infrared and Visible Image Fusion | 75
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 75
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 74
Dynamic Contrastive Distillation for Image-Text Retrieval | 74
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 73
SLCGC: A lightweight Self-supervised Low-pass Contrastive Graph Clustering Network for Hyperspectral Images | 73
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 73
Unsupervised Image and Text Fusion for Travel Information Enhancement | 70
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 68
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 68
CariMe: Unpaired Caricature Generation With Multiple Exaggerations | 67
Human-Centric Behavior Description in Videos: New Benchmark and Model | 66
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 65
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 65
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 65
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 64
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 64
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 63
Rate-Adaptive Neural Network for Image Compressive Sensing | 63
Multimodal Sentiment Analysis With Image-Text Interaction Network | 62
Underwater Image Enhancement With Cascaded Contrastive Learning | 62
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 62
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 61
Video-to-Music Recommendation Using Temporal Alignment of Segments | 61
Towards Neural Codec-Empowered 360$^\circ$ Video Streaming: A Saliency-Aided Synergistic Approach | 60
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 60
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 60
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 60
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 60
One-Shot Image-to-Image Translation via Part-Global Learning With a Multi-Adversarial Framework | 59
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 59
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 58
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 58
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 58
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 58
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 58
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 57
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 57
CRSOT: Cross-Resolution Object Tracking Using Unaligned Frame and Event Cameras | 56
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 56
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 56
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 56
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 56
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 56
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 55
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 55
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network | 55
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 54
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 54
Video Instance Segmentation by Instance Flow Assembly | 54
Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection | 54
DEHand: Deformable Encoding for Photo-realistic Free-view and Free-pose Hand Rendering | 54
Modeling Sequential Listening Behaviors With Attentive Temporal Point Process for Next and Next New Music Recommendation | 54
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 54
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 53
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 53
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 52
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 52
Enhanced Context Mining and Filtering for Learned Video Compression | 52
BVI-DVC: A Training Database for Deep Video Compression | 52
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 51
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 51
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 51
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 50
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 50
Improving Fine-Grained Image Classification With Multimodal Information | 49
Motion Deblur by Learning Residual From Events | 49
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 49
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 48
Multi-Localized Sensitive Autoencoder-Attention-LSTM For Skeleton-based Action Recognition | 48
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 48
Reordered $k$-Means: A New Baseline for View-Unaligned Multi-View Clustering | 47
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 47
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 47
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 47
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 46
Face De-Occlusion With Deep Cascade Guidance Learning | 46
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 46
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 46
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 46
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning | 46
Blind Video Quality Assessment at the Edge | 46
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 45
EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images | 45
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 45
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction | 45
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 45
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 45
Graph Convolutional Network With Unknown Class Number | 44
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 44
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 44
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 44
SSPNet: Predicting Visual Saliency Shifts | 43
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 43
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 43
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 43
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 43
Collaborative Learning With a Multi-Branch Framework for Feature Enhancement | 43
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 43
MPPM: A Mobile-Efficient Part Model for Object re-ID | 43
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 43
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 42
Instruction-Driven 3D Facial Expression Generation and Transition | 42
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 42
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 42
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 42
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 42
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 42
Ensemble Learning With Manifold-Based Data Splitting for Noisy Label Correction | 42
Inexactly Matched Referring Expression Comprehension With Rationale | 42
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 42
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 41
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images | 41
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 41
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 41
Self-Supervised Face Image Manipulation by Conditioning GAN on Face Decomposition | 41
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection | 41
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data | 40
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection | 40
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance | 40
Neural-Enhanced Rate Adaptation and Computation Distribution for Emerging Mmwave Multi-User 3D Video Streaming Systems | 40
OpenSlot: Mixed Open-Set Recognition with Object-Centric Learning | 40
Differentiable Spatial Regression: A Novel Method for 3D Hand Pose Estimation | 40
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment | 40
Improving Visual Object Tracking Through Visual Prompting | 40
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 40
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 39
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model | 39
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 39
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 39
Dense Video Captioning With Early Linguistic Information Fusion | 39
Retinex-Based Variational Framework for Low-Light Image Enhancement and Denoising | 38
Point Cloud Soft Multicast for Untethered XR Users | 38
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 38
Heterogeneous Hierarchical Feature Aggregation Network for Personalized Micro-Video Recommendation | 38
View-Invariant Human Action Recognition Via View Transformation Network (VTN) | 38
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 38
Uncertainty-Aware Unsupervised Domain Adaptation in Object Detection | 38
Compact-Yet-Separate: Proto-Centric Multi-Modal Hashing With Pronounced Category Differences for Multi-Modal Retrieval | 38
Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis | 38
Toward Intelligent Design: An AI-Based Fashion Designer Using Generative Adversarial Networks Aided by Sketch and Rendering Generators | 38
A Real-Time Semi-Supervised Deep Tone Mapping Network | 38
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 38
Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval | 37
Weakly Supervised Distribution Discrepancy Minimization Learning With State Information for Person Re-Identification | 37
Soft Warping Based Unsupervised Domain Adaptation for Stereo Matching | 37
A Novel Human Image Sequence Synthesis Method by Pose-Shape-Content Inference | 37
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 37
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved | 37
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 37
Editorial | 37
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement | 37
Self-Supervised Graph Convolutional Network for Multi-View Clustering | 37
TaoHighlight: Commodity-Aware Multi-Modal Video Highlight Detection in E-Commerce | 37
Weakly Supervised Regional and Temporal Learning for Facial Action Unit Recognition | 37
Dual-Guided Frequency Prototype Network for Few-Shot Semantic Segmentation | 36
DSLL-Face: Distributed Supervision-Integrated Framework for Low-Light Face Detection | 36
Video Compressed Sensing Via Wavelet Residual Sampling and Dual-Domain Fusion | 36
An Information Compensation Framework for Zero-Shot Skeleton-Based Action Recognition | 36
IEEE Transactions on Multimedia Publication Information | 36
Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching | 36
Spatial-Temporal Aware-Based Unsupervised Network for Infrared Small Target Detection | 36
Dual-Level Masked Semantic Inference for Semi-Supervised Semantic Segmentation | 36
Weakly Supervised Few-Shot Semantic Segmentation via Pseudo Mask Enhancement and Meta Learning | 36
Multi-modal Depression Detection in Interview via Exploring Emotional Distribution Information | 36
Exploit the Best of Both End-to-End and Map-Based Methods for Multi-Focus Image Fusion | 36
GITANet: Group Interactive Threshold-Based Attention Network for Hyperspectral Image Classification | 36
MorphNeRF: Text-Guided 3D-Aware Editing via Morphing Generative Neural Radiance Fields | 36
Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization | 35
Semantic Distance Adversarial Learning for Text-to-Image Synthesis | 35
Learning Sparse and Discriminative Multimodal Feature Codes for Finger Recognition | 35
Free$\rm ^{3}$Net: Gliding Free, Orientation Free, and Anchor Free Network for Oriented Object Detection | 35
Design of a 5G Multimedia Broadcast Application Function Supporting Adaptive Error Recovery | 35
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback | 35
Dual Stream Relation Learning Network for Image-Text Retrieval | 35
HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | 35
DetailRecon: Focusing on Detailed Regions for Online Monocular 3D Reconstruction | 35
Privacy-Preserving Image Acquisition for Neural Vision Systems | 35
DiffUIE: Learning Latent Global Priors in Diffusion Models for Underwater Image Enhancement | 35
Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers | 35
Uncertainty Modeling for Robust Domain Adaptation Under Noisy Environments | 34
Detection and Localization of Video Transcoding From AVC to HEVC Based on Deep Representations of Decoded Frames and PU Maps | 34
Hash Bit Selection With Reinforcement Learning for Image Retrieval | 34
Learning to Learn With Variational Inference for Cross-Domain Image Classification | 34
OAFTracker: One-stage Associative Multiple Object Tracking with Fine-grained Orthogonal Representation | 34
Multi-Scale Grid Network for Image Deblurring With High-Frequency Guidance | 34
Multi-Channel Weight-Sharing Autoencoder Based on Cascade Multi-Head Attention for Multimodal Emotion Recognition | 34
An Active Multi-Target Domain Adaptation Strategy: Progressive Class Prototype Rectification | 34
Lifelong Fine-Grained Image Retrieval | 34
Deep Object Co-Segmentation and Co-Saliency Detection via High-Order Spatial-Semantic Network Modulation | 34