publications
* = co-first author, # = corresponding author
2026
- [OSDI] Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning
  Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu#, Jiarong Xing#, and Ying Sheng#
  In 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26), 2026
- [EuroSys] Serverless Replication of Object Storage across Multi-Vendor Clouds and Regions
  Junyi Shu, Xiaolong Huang, Gang Huang, Hong Mei, Xuanzhe Liu, and Xin Jin#
  In Proceedings of the 21st European Conference on Computer Systems, 2026
Cross-cloud data replication is vital for improving reliability and performance. Since cloud providers lack native support, users turn to open-source solutions that rely on VMs. However, these VMs are slow to provision, leading to high replication delays and costs. We propose a serverless approach to data replication using cloud functions, cutting provisioning overhead from tens of seconds to just a few. While functions offer sufficient bandwidth, they suffer from performance asymmetry across clouds and variability among instances. Our system, λReplica, mitigates this uncertainty through proactive planning and adaptive runtime adjustments. Prior to replication, λReplica formulates an SLO-compliant plan. During runtime, it employs decentralized scheduling to manage slow instances and uses changelog propagation and batching to further reduce costs. Implemented on three major clouds, λReplica outperforms existing solutions, reducing replication delay by 61%-99% with cost savings of up to three orders of magnitude. On production traces, it keeps p99.99 replication delay below 10 seconds.
2024
- [OSDI] Burstable Cloud Block Storage with Data Processing Units
  Junyi Shu, Kun Qian, Ennan Zhai, Xuanzhe Liu, and Xin Jin#
  In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024
Cloud block storage (CBS) is a key pillar of public clouds. Today’s CBS distinguishes itself from physical counterparts (e.g., SSDs) by offering unique burst capability as well as enhanced throughput, capacity, and availability. We conduct an initial characterization of our CBS product, a globally deployed cloud block storage service at public cloud provider Alibaba Cloud. A key observation is that the storage agent (SA), which runs on a data processing unit (DPU) and connects user VMs to the backend storage, is the major source of performance fluctuation when burst capability is provided. In this paper, we propose BurstCBS, a hardware-software co-designed I/O scheduling system that addresses load imbalance and tenant interference at the SA. BurstCBS exploits high-performance queue scaling to achieve near-perfect load balancing at line rate. To mitigate tenant interference, we design a novel burstable I/O scheduler that prioritizes resource allocation for base-level usage while supporting bursts. We employ a vectorized I/O cost estimator to comprehensively measure the resources consumed by different types of I/Os. Our evaluation shows that BurstCBS reduces average latency by up to 85% and provides up to 5× throughput for base-level tenants under congestion, with minimal overhead. We verify the benefits of BurstCBS with a database service that internally relies on CBS, observing up to 83% latency reduction on customer workloads.
2023
- [ASPLOS] Disaggregated RAID Storage in Modern Datacenters
  Junyi Shu, Ruidong Zhu, Yun Ma, Gang Huang#, Hong Mei, Xuanzhe Liu, and Xin Jin#
  In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023
RAID (Redundant Array of Independent Disks) has been widely adopted for decades, as it provides enhanced throughput and redundancy beyond what a single disk can offer. Today, enabled by fast datacenter networks, accessing remote block devices with acceptable overhead (i.e., disaggregated storage) has become a reality (e.g., for serverless applications). Combining RAID with remote storage can provide the same benefits while offering better fault tolerance and flexibility than monolithic counterparts. The key challenge of disaggregated RAID is handling the extra network traffic generated by RAID, which can consume a vast amount of NIC bandwidth. We present dRAID, a disaggregated RAID system that achieves near-optimal read and write throughput. dRAID exploits peer-to-peer disaggregated data access to reduce bandwidth consumption in both normal and degraded states. It employs non-blocking multi-stage writes to maximize inter-node parallelism, and applies pipelined I/O processing to maximize inter-device parallelism. We introduce bandwidth-aware reconstruction for better load balancing. We show that dRAID provides up to 3x bandwidth improvement. The results on a lightweight object store show that dRAID brings 1.5x-2.35x throughput improvement on various workloads.
2021
- [SIGCOMM] Cost-Effective Data Analytics across Multiple Cloud Regions
  Junyi Shu, Xin Jin, Yun Ma, Xuanzhe Liu, and Gang Huang
  In Proceedings of the SIGCOMM ’21 Poster and Demo Sessions, 2021
We propose a cloud-native data analytics engine for processing data stored across geographically distributed cloud regions at reduced cost. A job is split into subtasks and placed across regions based on factors including the prices of compute resources and data transmission. We present its architecture, which leverages existing cloud infrastructure, and discuss the major challenges of its system design. Preliminary experiments show a cost reduction of 15.1% for a decision support query on a four-region public cloud setup.