IEEE Transactions on Multimedia

Papers
(The TQCC of IEEE Transactions on Multimedia is 16. The table below lists the papers whose CrossRef citation counts exceed that threshold, up to a maximum of 250 papers. It covers papers published in the past four years, i.e., from 2021-08-01 to 2025-08-01.)
Article | Citations
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection | 611
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations | 345
Optimal Transport-Based Patch Matching for Image Style Transfer | 300
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt | 225
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning | 220
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention | 212
Disaggregation Distillation for Person Search | 200
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment | 188
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis | 180
Semantic-Aware Triplet Loss for Image Classification | 169
Robust Multi-stage Tracking via Multi-scale and Multi-level Representation Learning | 152
Improving Vision Anomaly Detection With the Guidance of Language Modality | 145
Towards Fast and Robust Real Image Denoising With Attentive Neural Network and PID Controller | 140
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation | 134
A Total Variation With Joint Norms For Infrared and Visible Image Fusion | 133
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective | 127
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing | 119
Dynamic Contrastive Distillation for Image-Text Retrieval | 116
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity | 116
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames | 114
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception | 113
Few-Shot Generative Model Adaptation via Style-Guided Prompt | 112
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection | 111
Pixel Bleach Network for Detecting Face Forgery Under Compression | 109
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability | 107
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective | 107
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | 107
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation | 107
Asymptotics-Aware Multi-View Subspace Clustering | 106
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency | 106
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning | 106
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval | 102
Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding | 100
Exploring Kernel Transformations for Implicit Neural Representations | 97
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics | 93
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation | 91
Bidirectional Translation Between UHD-HDR and HD-SDR Videos | 91
Structured Attention Network for Referring Image Segmentation | 88
ICE: Interactive 3D Game Character Facial Editing via Dialogue | 88
Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes | 87
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter | 86
FoodSAM: Any Food Segmentation | 85
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | 85
Annealing Genetic GAN for Imbalanced Web Data Learning | 81
MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion | 81
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization | 81
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval | 81
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection | 81
Semi-Supervised Contrastive Learning With Similarity Co-Calibration | 80
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments | 79
A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition | 78
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning | 78
Unsupervised Learning-Based Framework for Deepfake Video Detection | 77
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features | 77
Progressive Local Filter Pruning for Image Retrieval Acceleration | 76
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing | 75
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval | 75
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes | 74
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion | 74
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks | 73
Neighborhood Contrastive Transformer for Change Captioning | 73
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter | 71
Guided Image-to-Image Translation by Discriminator-Generator Communication | 71
DREAMT: Diversity Enlarged Mutual Teaching for Unsupervised Domain Adaptive Person Re-Identification | 71
Unsupervised Image and Text Fusion for Travel Information Enhancement | 70
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training | 68
Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement | 68
Human-Centric Behavior Description in Videos: New Benchmark and Model | 65
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification | 65
CariMe: Unpaired Caricature Generation With Multiple Exaggerations | 65
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor | 64
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation | 64
Cooperative Bargaining Game Based Adaptive Video Multicast Over Mobile Edge Networks | 63
IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection | 62
Semi-Supervised Authentically Distorted Image Quality Assessment With Consistency-Preserving Dual-Branch Convolutional Neural Network | 62
FedSH: Towards Privacy-Preserving Text-Based Person Re-Identification | 61
Prototypical Bidirectional Adaptation and Learning for Cross-Domain Semantic Segmentation | 61
EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation | 61
Video-to-Music Recommendation Using Temporal Alignment of Segments | 61
Knowledge Distillation-Based Domain-Invariant Representation Learning for Domain Generalization | 60
RSNet: Relation Separation Network for Few-Shot Similar Class Recognition | 59
Cross-Domain Sample Relationship Learning for Facial Expression Recognition | 59
A Two-Stream Hybrid Convolution-Transformer Network Architecture for Clothing-Change Person Re-Identification | 59
High Fidelity Face-Swapping With Style ConvTransformer and Latent Space Selection | 58
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights | 58
Enhanced Context Mining and Filtering for Learned Video Compression | 58
FFFN: Frame-By-Frame Feedback Fusion Network for Video Super-Resolution | 58
Image-Based Structured Vehicle Behavior Analysis Inspired by Interactive Cognition | 57
Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis | 57
Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval | 57
Reconstructed Graph Constrained Auto-Encoders for Multi-View Representation Learning | 57
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach | 57
Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges | 56
Style-Agnostic Representation Learning for Visible-Infrared Person Re-Identification | 56
One-Shot Image-to-Image Translation via Part-Global Learning With a Multi-Adversarial Framework | 56
DA-Net: Density-Aware 3D Object Detection Network for Point Clouds | 54
Multi-View User Preference Modeling for Personalized Text-to-Image Generation | 54
Motion Direction Awareness: A Biomimetic Dynamic Capture Mechanism for Video Prediction | 54
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation | 54
DEHand: Deformable Encoding for Photo-realistic Free-view and Free-pose Hand Rendering | 53
Sentiment-Enhanced Graph-Based Sarcasm Explanation in Dialogue | 53
Can Machines Generate Personalized Music? A Hybrid Favorite-Aware Method for User Preference Music Transfer | 53
Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation | 53
Investigating the Effective Dynamic Information of Spectral Shapes for Audio Classification | 53
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network | 53
Deep Unfolding Network for Image Compressed Sensing by Content-Adaptive Gradient Updating and Deformation-Invariant Non-Local Modeling | 52
STNet: Scale Tree Network With Multi-Level Auxiliator for Crowd Counting | 52
Test-Time Model Adaptation for Visual Question Answering With Debiased Self-Supervisions | 52
Video Instance Segmentation by Instance Flow Assembly | 52
Modeling Sequential Listening Behaviors With Attentive Temporal Point Process for Next and Next New Music Recommendation | 52
Multi-Localized Sensitive Autoencoder-Attention-LSTM For Skeleton-based Action Recognition | 51
TPE-ADE: Thumbnail-Preserving Encryption Based on Adaptive Deviation Embedding for JPEG Images | 51
Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement | 51
GLFF: Global and Local Feature Fusion for AI-Synthesized Image Detection | 51
Synthesize Boundaries: A Boundary-Aware Self-Consistent Framework for Weakly Supervised Salient Object Detection | 51
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition | 51
BVI-DVC: A Training Database for Deep Video Compression | 50
Rate-Adaptive Neural Network for Image Compressive Sensing | 50
Underwater Image Enhancement With Cascaded Contrastive Learning | 50
Multimodal Sentiment Analysis With Image-Text Interaction Network | 49
Personalized Fashion Recommendation With Discrete Content-Based Tensor Factorization | 49
Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection | 48
Motion Deblur by Learning Residual From Events | 48
Exploring Kernel-Based Texture Transfer for Pose-Guided Person Image Generation | 48
Low-Light Image Enhancement via Self-Reinforced Retinex Projection Model | 48
No-Reference Bitstream-Layer Model for Perceptual Quality Assessment of V-PCC Encoded Point Clouds | 48
Improving Fine-Grained Image Classification With Multimodal Information | 47
Reordered $k$-Means: A New Baseline for View-Unaligned Multi-View Clustering | 47
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering | 46
Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression | 46
Blind Video Quality Assessment at the Edge | 46
Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification | 46
Face De-Occlusion With Deep Cascade Guidance Learning | 46
Hierarchical Motion-Enhanced Matching Framework for Few-Shot Action Recognition | 46
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection | 45
VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning | 45
Decoupled Prototype Learning for Reliable Test-Time Adaptation | 45
Neuromorphic Similarity Measurement of Tactile Stimuli in Human–Machine Interface | 45
Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification | 44
Toward Intelligent Design: An AI-Based Fashion Designer Using Generative Adversarial Networks Aided by Sketch and Rendering Generators | 44
Semi-Supervised Knowledge Distillation for Cross-Modal Hashing | 44
Neural-Enhanced Rate Adaptation and Computation Distribution for Emerging Mmwave Multi-User 3D Video Streaming Systems | 44
Develop Then Rival: A Human Vision-Inspired Framework for Superimposed Image Decomposition | 44
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment | 44
RaFPN: Relation-Aware Feature Pyramid Network for Dense Image Prediction | 43
EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images | 43
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation | 43
Progressive Learning Model for Big Data Analysis Using Subnetwork and Moore-Penrose Inverse | 43
Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis | 43
Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval | 43
Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval | 43
Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning | 42
SSPNet: Predicting Visual Saliency Shifts | 42
Edge-Assisted Massive Video Delivery Over Cell-Free Massive MIMO | 42
Graph Convolutional Network With Unknown Class Number | 42
Tensorformer: Normalized Matrix Attention Transformer for High-Quality Point Cloud Reconstruction | 41
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking | 41
Category-Contrastive Fine-Grained Crowd Counting and Beyond | 41
Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval | 41
Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification | 41
Compression of Plenoptic Point Cloud Attributes Using 6-D Point Clouds and 6-D Transforms | 40
Collaborative Learning With a Multi-Branch Framework for Feature Enhancement | 40
MPPM: A Mobile-Efficient Part Model for Object re-ID | 40
Ensemble Learning With Manifold-Based Data Splitting for Noisy Label Correction | 40
Point Cloud Soft Multicast for Untethered XR Users | 40
Unleashing Knowledge Potential of Source Hypothesis for Source-Free Domain Adaptation | 40
Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization | 40
STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement | 40
Inexactly Matched Referring Expression Comprehension With Rationale | 39
Flow Guidance Deformable Compensation Network for Video Frame Interpolation | 39
Compact-yet-Separate: Proto-centric Multi-modal Hashing with Pronounced Category Differences for Multi-modal Retrieval | 39
Going the Extra Mile in Face Image Quality Assessment: A Novel Database and Model | 39
Textual Enhanced Adaptive Meta-Fusion for Few-Shot Visual Recognition | 39
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | 39
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation | 39
Instruction-Driven 3D Facial Expression Generation and Transition | 39
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction | 38
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation | 38
DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection | 38
Simultaneously Training and Compressing Vision-and-Language Pre-Training Model | 38
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images | 38
View-Invariant Human Action Recognition Via View Transformation Network (VTN) | 38
VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning | 38
Self-Supervised Face Image Manipulation by Conditioning GAN on Face Decomposition | 38
LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection | 38
Differentiable Spatial Regression: A Novel Method for 3D Hand Pose Estimation | 37
OpenSlot: Mixed Open-Set Recognition with Object-Centric Learning | 37
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data | 37
CenterTube: Tracking Multiple 3D Objects With 4D Tubelets in Dynamic Point Clouds | 37
Retinex-Based Variational Framework for Low-Light Image Enhancement and Denoising | 37
Uncertainty-Aware Unsupervised Domain Adaptation in Object Detection | 37
Improving Visual Object Tracking Through Visual Prompting | 37
Toward General Cross-Modal Signal Reconstruction for Robotic Teleoperation | 37
A Real-Time Semi-Supervised Deep Tone Mapping Network | 37
USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection | 37
MVL-Net: Pairwise Learning for Multi-View Multiple People Labelling | 37
Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance | 37
Dense Video Captioning With Early Linguistic Information Fusion | 36
A Pixel Distribution Remapping and Multi-Prior Retinex Variational Model for Underwater Image Enhancement | 36
Underwater Adaptive Video Transmissions Using MIMO-Based Software-Defined Acoustic Modems | 36
Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance | 36
FGDNet: Fine-Grained Detection Network Towards Face Anti-Spoofing | 36
Self-Supervised Graph Convolutional Network for Multi-View Clustering | 36
Exploiting Informative Video Segments for Temporal Action Localization | 36
Prune and Merge: Efficient Token Compression for Vision Transformer With Spatial Information Preserved | 36
Heterogeneous Hierarchical Feature Aggregation Network for Personalized Micro-Video Recommendation | 36
Weakly Supervised Distribution Discrepancy Minimization Learning With State Information for Person Re-Identification | 36
Soft Warping Based Unsupervised Domain Adaptation for Stereo Matching | 35
Weakly Supervised Regional and Temporal Learning for Facial Action Unit Recognition | 35
MorphNeRF: Text-Guided 3D-Aware Editing via Morphing Generative Neural Radiance Fields | 35
Spatial-Temporal Aware-based Unsupervised Network for Infrared Small Target Detection | 35
Multi-Modal Context Propagation for Person Re-Identification With Wireless Positioning | 35
Downstream-Pretext Domain Knowledge Traceback for Active Learning | 35
Adaptive Recurrent Forward Network for Dense Point Cloud Completion | 35
End-to-End Blind Video Quality Assessment Based on Visual and Memory Attention Modeling | 35
Dual-Guided Frequency Prototype Network for Few-Shot Semantic Segmentation | 35
GITANet: Group Interactive Threshold-Based Attention Network for Hyperspectral Image Classification | 35
Accurate Head Pose Estimation Using Image Rectification and a Lightweight Convolutional Neural Network | 35
TaoHighlight: Commodity-Aware Multi-Modal Video Highlight Detection in E-Commerce | 35
Editorial | 35
A Novel Human Image Sequence Synthesis Method by Pose-Shape-Content Inference | 35
Frefusion: Frequency Domain Transformer for Infrared and Visible Image Fusion | 35
Weakly Supervised Few-Shot Semantic Segmentation via Pseudo Mask Enhancement and Meta Learning | 35
IEEE Transactions on Multimedia Publication Information | 35
Lifelong Fine-Grained Image Retrieval | 34
Hash Bit Selection With Reinforcement Learning for Image Retrieval | 34
Fast and Effective: Progressive Hierarchical Fusion Classification for Remote Sensing Images | 34
Uncertainty Modeling for Robust Domain Adaptation Under Noisy Environments | 34
Frequency-Based Matcher for Long-Tailed Semantic Segmentation | 34
Domain-Oriented Knowledge Transfer for Cross-Domain Recommendation | 34
DetailRecon: Focusing on Detailed Regions for Online Monocular 3D Reconstruction | 34
Privacy-Preserving Image Acquisition for Neural Vision Systems | 34
Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching | 34
Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization | 34
Design of a 5G Multimedia Broadcast Application Function Supporting Adaptive Error Recovery | 34
De-END: Decoder-Driven Watermarking Network | 33
Semantic Distance Adversarial Learning for Text-to-Image Synthesis | 33
Detection and Localization of Video Transcoding From AVC to HEVC Based on Deep Representations of Decoded Frames and PU Maps | 33
An Active Multi-Target Domain Adaptation Strategy: Progressive Class Prototype Rectification | 33
Co-Saliency Detection Guided by Group Weakly Supervised Learning | 33
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback | 33
Learning Sparse and Discriminative Multimodal Feature Codes for Finger Recognition | 33
DSLL-Face: Distributed Supervision-Integrated Framework for Low-Light Face Detection | 33
Dual Stream Relation Learning Network for Image-Text Retrieval | 33
Utilizing Greedy Nature for Multimodal Conditional Image Synthesis in Transformers | 33
EgoFish3D: Egocentric 3D Pose Estimation From a Fisheye Camera via Self-Supervised Learning | 33
Category-Aware Multimodal Attention Network for Fashion Compatibility Modeling | 33
Knowledge-Enhanced Causal Reinforcement Learning Model for Interactive Recommendation | 33
Discover Micro-Influencers for Brands via Better Understanding | 33
Deep Object Co-Segmentation and Co-Saliency Detection via High-Order Spatial-Semantic Network Modulation | 33