ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2021-04-01 to 2025-04-01.)
Article | Citations
Sniper: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis | 35
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework | 30
SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing | 29
SRSparse: Generating Codes for High-Performance Sparse Matrix-Vector Semiring Computations | 29
VersaTile: Flexible Tiled Architectures via Associative Processors | 28
Ceiba: An Efficient and Scalable DNN Scheduler for Spatial Accelerators | 25
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing | 23
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage | 21
Performance Evaluation of Intel Optane Memory for Managed Workloads | 16
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power | 16
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes | 15
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation | 15
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing | 15
Multiply-and-Fire: An Event-Driven Sparse Neural Network Accelerator | 14
D²Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage | 14
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 14
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 13
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals | 13
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability | 12
A Stable Idle Time Detection Platform for Real I/O Workloads | 12
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 11
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture | 11
An Application-oblivious Memory Scheduling System for DNN Accelerators | 10
Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache | 10
MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers | 9
The Droplet Search Algorithm for Kernel Scheduling | 9
ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM | 9
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 9
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization | 9
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training | 9
XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments | 8
A Survey of General-purpose Polyhedral Compilers | 8
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 8
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 8
Spiking Neural Networks in Spintronic Computational RAM | 8
An Optimized GPU Implementation for GIST Descriptor | 7
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 7
Gator: Accelerating Graph Attention Networks by Jointly Optimizing Attention and Graph Processing | 7
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers | 6
Potamoi: Accelerating Neural Rendering via a Unified Streaming Architecture | 6
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration | 6
Byte-Select Compression | 6
Bubble-Swap Flow Control | 5
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance | 5
RegCPython: A Register-based Python Interpreter for Better Performance | 5
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers | 5
KINDRED: Heterogeneous Split-Lock Architecture for Safe Autonomous Machines | 5
Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration | 5
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs | 5
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs | 5
AIS: An Active Idleness I/O Scheduler to Reduce Buffer-Exhausted Degradation of Solid-State Drives | 5
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner | 5
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems | 5
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 5
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 5
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network | 4
A²: Towards Accelerator Level Parallelism for Autonomous Micromobility Systems | 4
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 4
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer | 4
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 4
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure | 4
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing | 4
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 4
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks | 3
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training | 3
Energy-efficient In-Memory Address Calculation | 3
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories | 3
A Concise Concurrent B+-Tree for Persistent Memory | 3
Access Characteristic-Guided Remote Swapping Across Mobile Devices | 3
Source Matching and Rewriting for MLIR Using String-Based Automata | 3
Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning | 3
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs | 3
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 3
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching | 3
KernelFaRer | 3
DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping | 3
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 3
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs | 3
LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning | 3
Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling | 3
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 3
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign | 3
Conflict Management in Vector Register Files | 2
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs | 2
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors | 2
Fixed-point Encoding and Architecture Exploration for Residue Number Systems | 2
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs | 2
GPU Domain Specialization via Composable On-Package Architecture | 2
Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code | 2
Early Address Prediction | 2
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection | 2
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction | 2
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation | 2
WIPE: A Write-Optimized Learned Index for Persistent Memory | 2
Improving Utilization of Dataflow Unit for Multi-Batch Processing | 2
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip | 2
PICO | 2
MST: Topology-Aware Message Aggregation for Exascale Graph Processing of Traversal-Centric Algorithms | 2
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 2
ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication | 2
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation | 2
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 2
Fast Key-Value Lookups with Node Tracker | 2
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture | 2
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer | 2
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage | 2
Supporting Dynamic Program Sizes in Deep Learning-Based Cost Models for Code Optimization | 2
Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference | 2
Architecting Optically Controlled Phase Change Memory | 2
Automatic Sublining for Efficient Sparse Memory Accesses | 2
CesASMe and Staticdeps: static detection of memory-carried dependencies for code analyzers | 2
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling | 2
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments | 1
Online Application Guidance for Heterogeneous Memory Systems | 1
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications | 1
An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform | 1
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache | 1
Cache Programming for Scientific Loops Using Leases | 1
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search | 1
Scale-out Systolic Arrays | 1
Performance and Power Prediction for Concurrent Execution on GPUs | 1
A Case for Fine-grain Coherence Specialization in Heterogeneous Systems | 1
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 1
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks | 1
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 1
Taming Flexible Job Packing in Deep Learning Training Clusters | 1
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication | 1
Understanding Silent Data Corruption in Processors for Mitigating its Effects | 1
Phronesis: Efficient Performance Modeling for High-dimensional Configuration Tuning | 1
Reducing Minor Page Fault Overheads through Enhanced Page Walker | 1
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey | 1
Domain-Specific Multi-Level IR Rewriting for GPU | 1
ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management | 1
Towards Enhanced System Efficiency while Mitigating Row Hammer | 1
Extension VM: Interleaved Data Layout in Vector Memory | 1
TokenSmart: Distributed, Scalable Power Management in the Many-core Era | 1
PMGraph: Accelerating Concurrent Graph Queries over Streaming Graphs | 1
Lavender: An Efficient Resource Partitioning Framework for Large-Scale Job Colocation | 1
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication | 1
DxPU: Large-scale Disaggregated GPU Pools in the Datacenter | 1
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language | 1
FusionFS: A Contention-Resilient File System for Persistent CPU Caches | 1
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking | 1
Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications | 1
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs | 1
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning | 1
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints | 1
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster | 1
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler | 1
MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing | 1
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing | 1
Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs | 1
Gem5-X | 1
Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints | 1
Locality-Aware CTA Scheduling for Gaming Applications | 1
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications | 1
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory | 1
Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses | 1
TSN Cache: Exploiting Data Localities in Graph Computing Applications | 1
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures | 1
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones | 1
YaConv: Convolution with Low Cache Footprint | 1
Accelerating Video Captioning on Heterogeneous System Architectures | 1
DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration | 1
GraphAttack | 1
LargeGraph | 1
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product | 1
Compiler Support for Sparse Tensor Computations in MLIR | 1
Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads | 1
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 1