IEEE Transactions on Multimedia

Papers
(The TQCC of IEEE Transactions on Multimedia is 14. The table below lists the papers whose CrossRef citation counts are above that threshold, capped at 250 papers. It covers papers published in the past four years, i.e., from 2021-06-01 to 2025-06-01.)
Article | Citations
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 545
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 305
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 282
Semantic-Aware Triplet Loss for Image Classification | 213
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 206
Optimal Transport-Based Patch Matching for Image Style Transfer | 198
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 183
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 177
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 158
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 149
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 149
Disaggregation Distillation for Person Search | 139
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 133
Guided Image-to-Image Translation by Discriminator-Generator Communication | 123
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 121
One-shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 118
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 109
Improving Vision Anomaly Detection With the Guidance of Language Modality | 105
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 105
Ensemble Prototype Networks for Unsupervised Cross-modal Hashing with Cross-Task Consistency | 103
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 103
Asymptotics-Aware Multi-View Subspace Clustering | 102
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 101
Neighborhood Contrastive Transformer for Change Captioning | 101
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 99
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 99
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 98
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 98
Towards Fast and Robust Real Image Denoising With Attentive Neural Network and PID Controller | 97
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 95
Pixel Bleach Network for Detecting Face Forgery Under Compression | 94
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 90
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 88
Dynamic Contrastive Distillation for Image-Text Retrieval | 87
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 83
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 83
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 81
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 81
Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 80
Exploring Kernel Transformations for Implicit Neural Representations | 79
Deep Semantic-consistent Penalizing Hashing for Cross-modal Retrieval | 78
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 76
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 74
Unsupervised Learning-Based Framework for Deepfake Video Detection | 74
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 73
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 73
Progressive Local Filter Pruning for Image Retrieval Acceleration | 73
Annealing Genetic GAN for Imbalanced Web Data Learning | 73
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 72
FoodSAM: Any Food Segmentation | 70
A Total Variation With Joint Norms For Infrared and Visible Image Fusion | 69
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 69
Structured Attention Network for Referring Image Segmentation | 68
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 68
A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition | 68
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 67
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 67
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 66
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 65
Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes | 65
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 64
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 64
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 63
Unsupervised Image and Text Fusion for Travel Information Enhancement | 63
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 62
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 62
CariMe: Unpaired Caricature Generation With Multiple Exaggerations | 61
Human-Centric Behavior Description in Videos: New Benchmark and Model | 60
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 60
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 59
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 59
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 58
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 57
Multi-Localized Sensitive Autoencoder-Attention-LSTM For Skeleton-based Action Recognition | 57
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 57
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 56
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 56
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 55
Underwater Image Enhancement With Cascaded Contrastive Learning | 55
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 54
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 54
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 54
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach | 54
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 53
Video Instance Segmentation by Instance Flow Assembly | 53
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 53
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 52
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 52
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 52
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 52
Rate-Adaptive Neural Network for Image Compressive Sensing | 51
One-Shot Image-to-Image Translation via Part-Global Learning With a Multi-Adversarial Framework | 51
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 51
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 51
Universal Infrared Image Nonuniformity Correction via Stripe-aware Attention Network | 51
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 50
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 50
Modeling Sequential Listening Behaviors With Attentive Temporal Point Process for Next and Next New Music Recommendation | 50
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 50
Enhanced Context Mining and Filtering for Learned Video Compression | 49
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 49
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 48
Improving Fine-Grained Image Classification With Multimodal Information | 48
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 48
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 48
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 48
Motion Deblur by Learning Residual From Events | 48
Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection | 47
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 47
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 47
Video-to-Music Recommendation Using Temporal Alignment of Segments | 47
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 47
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 47
BVI-DVC: A Training Database for Deep Video Compression | 47
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 46
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 46
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 45
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 45
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 45
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 45
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 45
Reordered k-Means: A New Baseline for View-Unaligned Multi-View Clustering | 44
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 44
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction | 44
Multimodal Sentiment Analysis With Image-Text Interaction Network | 44
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance | 44
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 44
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection | 43
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 43
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 43
View-Invariant Human Action Recognition Via View Transformation Network (VTN) | 43
Differentiable Spatial Regression: A Novel Method for 3D Hand Pose Estimation | 42
Blind Video Quality Assessment at the Edge | 42
Face De-Occlusion With Deep Cascade Guidance Learning | 42
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 42
EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images | 41
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 41
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 41
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 41
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 41
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 41
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 40
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 40
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 39
SSPNet: Predicting Visual Saliency Shifts | 39
Prune and Merge: Efficient Token Compression for Vision Transformer with Spatial Information Preserved | 39
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 39
A Real-Time Semi-Supervised Deep Tone Mapping Network | 39
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 39
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 39
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 39
Graph Convolutional Network With Unknown Class Number | 39
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 39
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 39
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 38
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 38
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 38
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection | 37
Collaborative Learning With a Multi-Branch Framework for Feature Enhancement | 37
Self-Supervised Face Image Manipulation by Conditioning GAN on Face Decomposition | 37
Point Cloud Soft Multicast for Untethered XR Users | 37
MPPM: A Mobile-Efficient Part Model for Object re-ID | 37
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 37
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 36
Ensemble Learning With Manifold-Based Data Splitting for Noisy Label Correction | 36
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 36
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 36
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 36
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 36
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 36
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data | 36
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 36
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement | 35
Instruction-Driven 3D Facial Expression Generation and Transition | 35
OpenSlot: Mixed Open-Set Recognition with Object-Centric Learning | 35
Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval | 35
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 35
Improving Visual Object Tracking Through Visual Prompting | 35
Compact-yet-Separate: Proto-centric Multi-modal Hashing with Pronounced Category Differences for Multi-modal Retrieval | 35
Inexactly Matched Referring Expression Comprehension With Rationale | 35
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model | 35
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 35
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 35
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 35
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 35
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 35
Retinex-Based Variational Framework for Low-Light Image Enhancement and Denoising | 34
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 34
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning | 34
Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis | 34
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 34
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 34
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 34
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 34
Dense Video Captioning With Early Linguistic Information Fusion | 34
Heterogeneous Hierarchical Feature Aggregation Network for Personalized Micro-Video Recommendation | 34
TaoHighlight: Commodity-Aware Multi-Modal Video Highlight Detection in E-Commerce | 33
Uncertainty-Aware Unsupervised Domain Adaptation in Object Detection | 33
Toward Intelligent Design: An AI-Based Fashion Designer Using Generative Adversarial Networks Aided by Sketch and Rendering Generators | 33
Soft Warping Based Unsupervised Domain Adaptation for Stereo Matching | 33
Weakly Supervised Regional and Temporal Learning for Facial Action Unit Recognition | 33
Self-Supervised Graph Convolutional Network for Multi-View Clustering | 33
Downstream-Pretext Domain Knowledge Traceback for Active Learning | 33
Adaptive Recurrent Forward Network for Dense Point Cloud Completion | 33
Detection and Localization of Video Transcoding From AVC to HEVC Based on Deep Representations of Decoded Frames and PU Maps | 33
Weakly Supervised Distribution Discrepancy Minimization Learning With State Information for Person Re-Identification | 33
Dual Stream Relation Learning Network for Image-Text Retrieval | 33
Exploiting Informative Video Segments for Temporal Action Localization | 33
Editorial | 33
A Novel Human Image Sequence Synthesis Method by Pose-Shape-Content Inference | 32
IEEE Transactions on Multimedia Publication Information | 32
DetailRecon: Focusing on Detailed Regions for Online Monocular 3D Reconstruction | 32
Exploit the Best of Both End-to-End and Map-Based Methods for Multi-Focus Image Fusion | 32
An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition | 32
Frefusion: Frequency Domain Transformer for Infrared and Visible Image Fusion | 32
End-to-End Blind Video Quality Assessment Based on Visual and Memory Attention Modeling | 32
MorphNeRF: Text-Guided 3D-Aware Editing via Morphing Generative Neural Radiance Fields | 32
Spatial-Temporal Aware-based Unsupervised Network for Infrared Small Target Detection | 32
Dual-Guided Frequency Prototype Network for Few-Shot Semantic Segmentation | 31
Video Compressed Sensing Via Wavelet Residual Sampling and Dual-Domain Fusion | 31
Deep Object Co-Segmentation and Co-Saliency Detection via High-Order Spatial-Semantic Network Modulation | 31
Discover Micro-Influencers for Brands via Better Understanding | 31
Dual-Level Masked Semantic Inference for Semi-Supervised Semantic Segmentation | 31
Multi-Modal Context Propagation for Person Re-Identification With Wireless Positioning | 31
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback | 31
Frequency-Based Matcher for Long-Tailed Semantic Segmentation | 31
Privacy-Preserving Image Acquisition for Neural Vision Systems | 31
Weakly Supervised Few-Shot Semantic Segmentation via Pseudo Mask Enhancement and Meta Learning | 31
GITANet: Group Interactive Threshold-based Attention Network for Hyperspectral Image Classification | 31
Design of a 5G Multimedia Broadcast Application Function Supporting Adaptive Error Recovery | 31
HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | 31
Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching | 31
Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers | 31
Registration of Multiview Point Clouds With Unknown Overlap | 31
A Super-Resolution Flexible Video Coding Solution for Improving Live Streaming Quality | 30
An Active Multi-Target Domain Adaptation Strategy: Progressive Class Prototype Rectification | 30
Learning to Learn With Variational Inference for Cross-Domain Image Classification | 30
Category-Aware Multimodal Attention Network for Fashion Compatibility Modeling | 30
Lifelong Fine-Grained Image Retrieval | 30
Semantic Distance Adversarial Learning for Text-to-Image Synthesis | 30
Learning Sparse and Discriminative Multimodal Feature Codes for Finger Recognition | 30
EgoFish3D: Egocentric 3D Pose Estimation From a Fisheye Camera via Self-Supervised Learning | 30
Multi-Scale Grid Network for Image Deblurring With High-Frequency Guidance | 30
DiffUIE: Learning Latent Global Priors in Diffusion Models for Underwater Image Enhancement | 30
De-END: Decoder-Driven Watermarking Network | 30
Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization | 30
Mutual Dual-Task Generator With Adaptive Attention Fusion for Image Inpainting | 30
Centra-Net: A Centralized Network for Visual Localization Spanning Multiple Scenes | 29
Domain-Oriented Knowledge Transfer for Cross-Domain Recommendation | 29
Long and Recent Preference Learning with Recent-k Items Distribution for Recommender System | 29