ACM Transactions on Architecture and Code Optimization

Papers
(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-01-01 to 2026-01-01.)
Article | Citations
Performance, Energy and NVM Lifetime-Aware Data Structure Refinement and Placement for Heterogeneous Memory Systems | 49
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power | 38
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency | 35
Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization | 27
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes | 24
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework | 24
ESMPC: An Efficient Neural Network Training Framework for Secure Two- and Three-Party Computation | 23
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators | 23
Accelerating Verifiable Queries over Blockchain Database System Using Processing-in-memory | 21
Intra-request Lag-aware Cache Management to Enhance I/O Responsiveness of SSDs | 20
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence | 19
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation | 17
An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication | 17
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor | 16
A Concise Concurrent B+-Tree for Persistent Memory | 16
Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage | 15
SIMD-Matcher: A SIMD-based Arbitrary Matching Framework | 15
Source Matching and Rewriting for MLIR Using String-Based Automata | 13
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs | 13
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign | 13
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product | 12
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster | 12
Accelerating Video Captioning on Heterogeneous System Architectures | 11
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler | 11
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments | 11
Flexible and Effective Object Tiering for Heterogeneous Memory Systems | 10
MLKAPS: Machine Learning and Adaptive Sampling for HPC Kernel Auto-tuning | 10
FlashGEMM: Optimizing Sequences of Matrix Multiplication by Exploiting Data Reuse on CPUs | 10
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning | 9
An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism | 9
SnsBooster: Enhancing Sampling-based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis | 9
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems | 9
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs | 9
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs | 9
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy | 9
Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization | 8
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping | 8
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies | 8
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults | 8
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions | 8
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks | 8
Towards high scalability and fine-grained parallelism on distributed HPC platforms | 8
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks | 8
COX: Exposing CUDA Warp-level Functions to CPUs | 8
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction | 8
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks | 8
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture | 8
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion | 7
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy | 7
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals | 7
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service | 7
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems | 7
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption | 7
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing | 7
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing | 7
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis | 7
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching | 7
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models | 7
SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis | 7
RACER: Avoiding End-to-End Slowdowns in Accelerated Chip Multi-Processors | 6
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes | 6
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs | 6
gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography | 6
A Stable Idle Time Detection Platform for Real I/O Workloads | 6
EDAS: Enabling Fast Data Loading for GPU Serverless Computing | 6
Pac-PIM: A Parallel Communication Framework for Commodity Processing-in-memory Systems | 6
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs | 6
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices | 6
Towards Optimizing Learned Index for High Performance, Memory Efficiency and NUMA Awareness | 6
HiSo: Co-optimizing the Intra-layer and Inter-layer Scheduling Schemes with the Hybrid Data Flow for PIM Architectures | 6
CGCGraph: Efficient CPU-GPU Co-execution for Concurrent Dynamic Graph Processing | 6
Lightweight Code Outlining for Android Applications | 5
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language | 5
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks | 5
Improving Utilization of Dataflow Unit for Multi-Batch Processing | 5
WIPE: A Write-Optimized Learned Index for Persistent Memory | 5
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling | 5
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs | 5
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage | 5
Toward Comprehensive Design Space Exploration on Heterogeneous Multi-core Processors | 5
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation | 5
Efficient and Scalable Hybrid Parallelization of Unstructured Computational Fluid Dynamics with Geometric Multigrid | 5
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing | 5
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations | 5
Capability-Based Efficient Data Transmission Mechanism for Serverless Computing | 4
Rethinking Variable-Length Encoding: Exploiting Bit Sparsity for Parallel Decoding in LLM Accelerators | 4
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication | 4
RaKV: A Write-Optimized LSM Store for Cloud Block Storage with Robust SLA | 4
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU | 4
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources | 4
MetaEC: An Efficient and Resilient Erasure-Coded KV Store on Disaggregated Memory | 4
TSN Cache: Exploiting Data Localities in Graph Computing Applications | 4
Efficient Flexible Edge Inference for Mixed-Precision Quantized DNN using Customized RISC-V Core | 4
Consequence-based Clustered Architecture | 4
Abakus: Accelerating k-mer Counting with Storage Technology | 4
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation | 4
Scale-out Systolic Arrays | 4
Address/Data Instruction Steering in Clustered General Purpose Processors | 4
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs | 4
Architecting Optically Controlled Phase Change Memory | 4
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers | 4
CASHT: Contention Analysis in Shared Hierarchies with Thefts | 4
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications | 4
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability | 3
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy | 3
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance | 3
SpiderSense: Lightweight Last-Level Cache Management via Time Period Tagging for LLC-Critical Workloads | 3
A Pressure-Aware Policy for Contention Minimization on Multicore Systems | 3
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks | 3
An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors | 3
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters | 3
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler | 3
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory | 3
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization | 3
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation | 3
An Optimized GPU Implementation for GIST Descriptor | 3
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation | 3
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems | 3
GraphService: Topology-aware Constructor for Large-scale Graph Applications | 3
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures | 3
Towards Efficient Extendible Perfect Hashing for Hybrid PM-DRAM Memory | 3
Cheetah: Accelerating Dynamic Graph Mining with Grouping Updates | 3
CoNST: Code Generator for Sparse Tensor Networks | 3
Matrix: Multi-Cipher Structures Dataflow for Parallel and Pipelined TFHE Accelerator | 3
Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators | 3
A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs | 3
E-BATCH: Energy-Efficient and High-Throughput RNN Batching | 3
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead | 3
A Case For Intra-rack Resource Disaggregation in HPC | 3
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning | 3
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel | 3
Heterogeneous Confidential Computing System for Large Language Models: A Survey | 3
Compressing and Accelerating Sparse CNNs Using Sign-Reserved Toeplitz Filters and Input Activation Density-aware Dataflow | 3
Optimizing OpenCL Barrier Synchronization and Memory Efficiency on Multi-Core DSPs | 3
3D GNLM: Efficient 3D Non-Local Means Kernel with Nested Reuse Strategies for Embedded GPUs | 3
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access | 3
High-performance Deterministic Concurrency Using Lingua Franca | 3
GenCNN: A Partition-Aware Multi-Objective Mapping Framework for CNN Accelerators Based on Genetic Algorithm | 3
Scheduling Language Chronology: Past, Present, and Future | 2
Assessing the Impact of Compiler Optimizations on GPUs Reliability | 2
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads | 2
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection | 2
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory | 2
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking | 2
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers | 2
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V | 2
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks | 2
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints | 2
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance | 2
LitTLS: Lightweight Thread-Level Speculation on Little Cores | 2
CARL: Compiler Assigned Reference Leasing | 2
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM | 2
A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization | 2
Compiler Support for Sparse Tensor Computations in MLIR | 2
Membrane: Accelerating Database Analytics with DRAM-Based PIM Filtering and Schema Denormalization | 2
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search | 2
HAVIT: An Efficient Hardware-Accelerator for Vision Tra | 2
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs | 2
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training | 2
Conflict Management in Vector Register Files | 2
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors | 2
Supporting Dynamic Program Sizes in Deep Learning-Based Cost Models for Code Optimization | 2
ZNSFQ: An Efficient and High-Performance Fair Queue Scheduling Scheme for ZNS SSDs | 2
DCSolver: Accelerating Sparse Iterative Solvers via Divide-and-Conquer on GPUs | 2
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments | 2
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs | 2
GOLDYLOC: Global Optimizations & Lightweight Dynamic Logic for Concurrency | 2
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication | 2
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction | 2
Partitioned Scheduling and Analysis for a Typed DAG Task on Heterogeneous Multi-Cores | 2
FORTIFY: Feature-Oriented Representation and Graph Topology Integration for Path-Level Vulnerability Detection | 2
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs | 2
Bubble-Swap Flow Control | 2
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture | 2
Understanding Silent Data Corruption in Processors for Mitigating its Effects | 2
In-SRAM Parallel Data Shuffle | 2
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology | 2
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads | 2
DFGAS: Exploring the Balance of HW-SW Scheduling through the DFG-Aware Scheme | 1
Fixed-point Encoding and Architecture Exploration for Residue Number Systems | 1
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters | 1
A Survey of General-purpose Polyhedral Compilers | 1
ASSG: Enhanced Workload Balancing via Adaptive State Scheduling Granularity Approach for Stateful Distributed Stream Processing | 1
HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs | 1
Cppless: Single-Source and High-Performance Serverless Programming in C++ | 1
HotLD: A Workload-Aware Method for Global Code-Layout Optimization of Shared Libraries | 1
Phronesis: Efficient Performance Modeling for High-dimensional Configuration Tuning | 1
MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers | 1
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs | 1
HAKV: A Hotness-Aware Zone Management Approach to Optimizing Performance of LSM-tree-based Key-Value Stores | 1
DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping | 1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory | 1
JUNO++: Optimizing ANNS and Enabling Efficient Sparse Attention in LLM via Ray Tracing Core | 1
Turn-based Spatiotemporal Coherence for GPUs | 1
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems | 1
TianheGraph: Topology-aware Graph Processing | 1
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer | 1
The Design of an Efficient Lossy Compressor for Time Series Databases | 1
ScaleGS: Closing the Gap between Real-time 3D Gaussian Splatting and Real-time XR Rendering | 1
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing | 1
PDGNN: Efficient Micro-batch GNN Training via Degree-Pruned Partitioning and Redundancy Elimination | 1
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing | 1
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs | 1
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning | 1
VersaTile: Flexible Tiled Architectures via Associative Processors | 1
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems | 1
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers | 1
AOBO: A Fast-Switching Online Binary Optimizer on AArch64 | 1
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations | 1
Attack and Defense: Enhancing Robustness of Binary Hyper-Dimensional Computing | 1
Critical Data Backup with Hybrid Flash-Based Consumer Devices | 1
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture | 1
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing | 1
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip | 1
Solving Sparse Assignment Problems on FPGAs | 1
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories | 1
Second-level Caches: Not for Instructions | 1
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching | 1
MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations | 1
The Droplet Search Algorithm for Kernel Scheduling | 1
Symbolic Analysis for Data Plane Programs Specialization | 1
Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code | 1
Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping | 1
An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform | 1
Gator: Accelerating Graph Attention Networks by Jointly Optimizing Attention and Graph Processing | 1
RegCPython: A Register-based Python Interpreter for Better Performance | 1
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization | 1
Energy-efficient In-Memory Address Calculation | 1
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis | 1
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner | 1
Unleashing Parallelism with Elastic-Barriers | 1
A²: Towards Accelerator Level Parallelism for Autonomous Micromobility Systems | 1
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization | 1
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration | 1