IEEE Transactions on Multimedia

Papers
(The TQCC of IEEE Transactions on Multimedia is 19. The table below lists the papers above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-01-01 to 2026-01-01.)
Article | Citations
Disaggregation Distillation for Person Search | 756
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 430
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 360
Improving Vision Anomaly Detection With the Guidance of Language Modality | 289
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 268
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 267
Weakly-Supervised 3D Visual Grounding Based on Visual Language Alignment | 254
Self-Guided Discriminative Locality Preserving Projections | 212
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 210
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 208
Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning | 172
Exploring Kernel Transformations for Implicit Neural Representations | 171
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 168
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 163
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 157
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 153
Watch Where You Move: Region-Aware Dynamic Aggregation and Excitation for Gait Recognition | 151
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 149
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 148
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 147
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 144
Semantic-Aware Triplet Loss for Image Classification | 144
Pixel Bleach Network for Detecting Face Forgery Under Compression | 139
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 137
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 135
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 129
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 129
Structured Attention Network for Referring Image Segmentation | 128
Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes | 123
Towards Fast and Robust Real Image Denoising With Attentive Neural Network and PID Controller | 122
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 120
BASNet: Boundary Assisted Network for Image Splicing Forgery Detection | 118
Semantic Dual-Adversarial Network for Blended-Target Domain Adaptation | 115
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 115
Anomaly-Led Prompting Learning Caption Generating Model and Benchmark | 115
Annealing Genetic GAN for Imbalanced Web Data Learning | 113
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 110
Scale Up Composed Image Retrieval Learning via Modification Text Generation | 109
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 108
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 106
Optimal Transport-Based Patch Matching for Image Style Transfer | 105
ViDR-GNN: Vision Implicit Discriminative Reorganization Graph Neural Networks | 105
A Total Variation With Joint Norms For Infrared and Visible Image Fusion | 105
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 104
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 103
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 102
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 102
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 102
Guided Image-to-Image Translation by Discriminator-Generator Communication | 101
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 99
Mix-Based Training Strategies for Learning Implicit Neural Representations | 99
SLCGC: A lightweight Self-supervised Low-Pass Contrastive Graph Clustering Network for Hyperspectral Images | 98
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 96
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 96
Unsupervised Learning-Based Framework for Deepfake Video Detection | 93
Neighborhood Contrastive Transformer for Change Captioning | 92
Dynamic Contrastive Distillation for Image-Text Retrieval | 90
Progressive Local Filter Pruning for Image Retrieval Acceleration | 88
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency | 87
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | 87
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 87
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval | 86
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 86
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 86
Dual-Task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 86
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 86
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 84
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 84
FoodSAM: Any Food Segmentation | 84
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 84
A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition | 83
Asymptotics-Aware Multi-View Subspace Clustering | 82
JPEG AI Compressed Domain Face Detection: a Multi-scale Bridging Perspective | 82
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 81
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network | 80
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 80
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 78
Towards Neural Codec-Empowered 360$^\circ$ Video Streaming: A Saliency-Aided Synergistic Approach | 78
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 78
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 77
Unsupervised Image and Text Fusion for Travel Information Enhancement | 77
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 77
One-Shot Image-to-Image Translation via Part-Global Learning With a Multi-Adversarial Framework | 76
CMANet: Context-aware Mutual Attention Network for Referring Image Segmentation | 76
DEHand: Deformable Encoding for Photo-Realistic Free-View and Free-Pose Hand Rendering | 76
A Multidimensional Media Adaptation Framework for Live Holographic Communication | 75
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 74
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 74
Modeling Sequential Listening Behaviors With Attentive Temporal Point Process for Next and Next New Music Recommendation | 74
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 73
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 72
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 72
CariMe: Unpaired Caricature Generation With Multiple Exaggerations | 72
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 72
Action-Responsive Contrastive Network for Fine-Grained Skeleton-Based Action Recognition | 70
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 70
Enhanced Context Mining and Filtering for Learned Video Compression | 70
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 69
Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models | 69
Denoised Semantic Features for Local Consistent No-reference Image Quality Assessment | 69
Depth Map Super-Resolution via Deep Cross-Modality and Cross-Scale Guidance | 69
Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing | 69
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 68
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 68
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 66
Video-to-Music Recommendation Using Temporal Alignment of Segments | 66
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 66
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 66
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 66
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 65
Human-Centric Behavior Description in Videos: New Benchmark and Model | 64
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 64
CRSOT: Cross-Resolution Object Tracking Using Unaligned Frame and Event Cameras | 64
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 64
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 63
C-CTX: Cubic-Checkerboard Context Entropy Model for Learned Image Compression | 62
ALCER3D: Adaptive Learning Constraints for Enhanced Retrieval of Complex Indoor 3D Scenarios | 62
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 62
Video Instance Segmentation by Instance Flow Assembly | 62
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 61
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 61
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 60
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 60
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 60
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 60
Underwater Image Enhancement With Cascaded Contrastive Learning | 59
RUL: Region Uncertainty Learning for Robust Face Recognition | 59
Probabilistic Temporal Masked Attention for Cross-View Online Action Detection | 59
High Specificity Guided Cross-Domain Few-Shot Segmentation | 58
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 58
Supervised Contrastive Learning for Indoor Point Cloud Oversegmentation | 58
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 57
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 57
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 57
BVI-DVC: A Training Database for Deep Video Compression | 57
Rate-Adaptive Neural Network for Image Compressive Sensing | 56
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 56
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 56
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 56
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 56
Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection | 56
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 56
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 56
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 55
Multimodal Sentiment Analysis With Image-Text Interaction Network | 55
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 55
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 54
Improving Fine-Grained Image Classification With Multimodal Information | 54
Multi-Localized Sensitive Autoencoder-Attention-LSTM For Skeleton-based Action Recognition | 54
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 53
Reordered $k$-Means: A New Baseline for View-Unaligned Multi-View Clustering | 53
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 53
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 53
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 53
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 53
Motion Deblur by Learning Residual From Events | 53
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment | 52
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection | 52
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data | 52
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 52
Blind Video Quality Assessment at the Edge | 52
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images | 52
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 52
Inexactly Matched Referring Expression Comprehension With Rationale | 51
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 51
Tuning-free High-Resolution Video Diffusion with Spatial-Temporal Latent Grouping | 51
Visibility-Based Geometry Pruning of Neural Plenoptic Scene Representations | 51
OpenSlot: Mixed Open-Set Recognition With Object-Centric Learning | 51
Differentiable Spatial Regression: A Novel Method for 3D Hand Pose Estimation | 51
Exploring Cross-Modal Mutual Prompt Learning for Video Quality Assessment | 50
Neural-Enhanced Rate Adaptation and Computation Distribution for Emerging mmWave Multi-User 3D Video Streaming Systems | 50
Dense Video Captioning With Early Linguistic Information Fusion | 50
Towards Structure-aware Model for Multi-modal Knowledge Graph Completion | 50
A Real-Time Semi-Supervised Deep Tone Mapping Network | 50
SSPNet: Predicting Visual Saliency Shifts | 50
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 49
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 49
Compact-Yet-Separate: Proto-Centric Multi-Modal Hashing With Pronounced Category Differences for Multi-Modal Retrieval | 49
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 49
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 48
MPPM: A Mobile-Efficient Part Model for Object re-ID | 48
CMI-Net: Cross-view Message Token Interaction Network for 3D Shape Recognition | 48
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 47
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 47
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 47
Interpretable Multi-view Representation Learning Towards Complex Scenes: From Homogeneity to Heterogeneity | 47
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 47
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 47
Graph Convolutional Network With Unknown Class Number | 47
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 47
Face De-Occlusion With Deep Cascade Guidance Learning | 46
Self-Supervised Face Image Manipulation by Conditioning GAN on Face Decomposition | 46
View-Invariant Human Action Recognition Via View Transformation Network (VTN) | 46
Ensemble Learning With Manifold-Based Data Splitting for Noisy Label Correction | 46
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning | 46
Toward Intelligent Design: An AI-Based Fashion Designer Using Generative Adversarial Networks Aided by Sketch and Rendering Generators | 46
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 46
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 46
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 46
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 45
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 45
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 45
Retinex-Based Variational Framework for Low-Light Image Enhancement and Denoising | 45
Self-Supervised Graph Convolutional Network for Multi-View Clustering | 45
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 45
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 45
Uncertainty-Aware Unsupervised Domain Adaptation in Object Detection | 45
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 45
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 45
Instruction-Driven 3D Facial Expression Generation and Transition | 44
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 44
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 44
SwimVG: Step-Wise Multimodal Fusion and Adaption for Visual Grounding | 44
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 44
Geometric Continuity and Consistency Learning for Self-Supervised Point Cloud Completion | 44
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction | 43
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition | 43
Improving Visual Object Tracking Through Visual Prompting | 43
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 43
TrackletGait: A Robust Framework for Gait Recognition in the Wild | 43
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 43
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 43
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 43
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 43
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 42
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 42
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection | 42
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model | 42
Collaborative Learning With a Multi-Branch Framework for Feature Enhancement | 42
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance | 42
Heterogeneous Hierarchical Feature Aggregation Network for Personalized Micro-Video Recommendation | 42
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 42
EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images | 42
Point Cloud Soft Multicast for Untethered XR Users | 42
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement | 42
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 41
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 41
IEEE Transactions on Multimedia Publication Information | 41
DetailRecon: Focusing on Detailed Regions for Online Monocular 3D Reconstruction | 41
Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis | 41
Editorial | 41
Uncertainty Modeling for Robust Domain Adaptation Under Noisy Environments | 41
Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval | 41
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved | 41
A Novel Human Image Sequence Synthesis Method by Pose-Shape-Content Inference | 41
Privacy-Preserving Image Acquisition for Neural Vision Systems | 41
Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers | 40
Detection and Localization of Video Transcoding From AVC to HEVC Based on Deep Representations of Decoded Frames and PU Maps | 40
Multi-Modal Depression Detection in Interview via Exploring Emotional Distribution Information | 40
DSLL-Face: Distributed Supervision-Integrated Framework for Low-Light Face Detection | 40