Fast and Scalable Data Transfer Across Data Systems
How hard can it be to move tuples from here to there?
Fast and Scalable Data Transfer Across Data Systems. Haralampos Gavriilidis, Kaustubh Beedkar, Matthias Boehm, and Volker Markl. SIGMOD '25.
Relation Movement
We live in exciting times: unimaginably large language models get better each day, and there is a constant stream of amazing demos. And yet, efficiently transferring a table between heterogeneous systems is still an open research problem! The motivating example from the paper transfers data from PostgreSQL to pandas. Optimizing this transfer time is important and non-trivial.
XDBC
The paper describes a system named XDBC. XDBC software runs on both the source and the destination data management systems (DMS), as illustrated by Fig. 4:
The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning one or more workers (e.g., cores) to each stage. There are many knobs that can affect end-to-end throughput (a hypothetical configuration sketch follows the list):
Number of workers assigned to each pipeline stage
Data interchange format (row-major, column-major, Arrow)
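To make the tuning surface concrete, here is a minimal sketch of what such a configuration could look like. The class and field names are invented for illustration; this is not XDBC's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of XDBC's tuning knobs; names are invented,
# not taken from the actual XDBC interface.
@dataclass
class TransferConfig:
    # Workers (cores) per pipeline stage, e.g. read/deserialize/compress/send.
    workers_per_stage: dict = field(default_factory=lambda: {
        "read": 1, "deser": 1, "comp": 1, "send": 1,
    })
    interchange_format: str = "row"   # "row", "column", or "arrow"
    compression: str = "none"         # e.g. "none", "lz4", "zstd"
```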
Section 4.1 of the paper claims the search space is so large that brute-force search will not work, so a heuristic algorithm is used instead. The heuristic relies on performance models that estimate the throughput of each pipeline stage under a given configuration. These models are built from real-world single-core performance measurements and use Gustafson's law to estimate multi-core scaling.
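As a concrete illustration, a per-stage throughput estimate under Gustafson's law might look like the following sketch; the serial fractions and throughput numbers are made up, not measurements from the paper.

```python
def gustafson_speedup(workers: int, serial_fraction: float) -> float:
    """Gustafson's law: S(n) = n - alpha * (n - 1), where alpha is the
    fraction of the work that does not scale with more workers."""
    return workers - serial_fraction * (workers - 1)

def stage_throughput(single_core_tput: float, workers: int,
                     serial_fraction: float) -> float:
    # Scale a measured single-core throughput to `workers` cores.
    return single_core_tput * gustafson_speedup(workers, serial_fraction)

# Example with invented numbers: a deserialization stage measured at
# 200 MB/s on one core, with a 10% non-scalable fraction.
print(stage_throughput(200.0, 4, 0.1))  # ~740 MB/s on 4 cores
```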
The algorithm starts by assigning one worker to each pipeline stage (on both the client and the server). An iterative process then locates the pipeline stage that is estimated to be the slowest and assigns additional workers to it until it is no longer the bottleneck. This continues until no further improvement is possible, for one of the following reasons (a sketch of this greedy loop appears after the list):
All available CPU cores have been assigned
Network bandwidth is the bottleneck
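Here is a minimal sketch of that greedy loop, reusing the stage_throughput model above. It illustrates the bottleneck-driven idea, not XDBC's actual implementation.

```python
def assign_workers(stages, total_cores, net_bandwidth):
    """Greedy bottleneck-driven worker assignment (illustrative sketch).

    `stages` maps a stage name to (single_core_tput, serial_fraction).
    Returns a dict of workers per stage.
    """
    workers = {name: 1 for name in stages}      # start with 1 worker each
    cores_left = total_cores - len(workers)
    while cores_left > 0:
        # Estimated throughput of each stage under the current assignment.
        tput = {name: stage_throughput(t, workers[name], a)
                for name, (t, a) in stages.items()}
        bottleneck = min(tput, key=tput.get)
        # Stop if the network, not a CPU stage, is the limiting factor.
        if tput[bottleneck] >= net_bandwidth:
            break
        workers[bottleneck] += 1
        cores_left -= 1
    return workers
```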
If the process ends with CPU cores to spare, a hard-coded algorithm determines the best compression algorithm given the number of remaining cores. The data interchange format is then determined by which formats the source and destination DMSs support and by which compression algorithm was chosen.
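A hedged sketch of that final step follows. The thresholds are entirely invented; the paper's actual decision logic is hard-coded from its own measurements.

```python
def pick_compression(cores_left: int) -> str:
    # Invented thresholds, purely for illustration; XDBC's real rules
    # are hard-coded from its own measurements.
    if cores_left >= 4:
        return "zstd"   # heavier compression when cores are plentiful
    if cores_left >= 1:
        return "lz4"    # cheap compression with a spare core
    return "none"

def pick_format(src_formats, dst_formats):
    # Pick an interchange format both DMSs support; in XDBC the chosen
    # compression algorithm can further constrain this choice.
    common = set(src_formats) & set(dst_formats)
    for fmt in ("arrow", "column", "row"):
        if fmt in common:
            return fmt
    raise ValueError("no common interchange format")
```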
XDBC vs Alkali
The XDBC optimizer shares many similarities with the Alkali optimizer. Here are some differences:
Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute both the deser and comp pipeline stages.
Alkali uses an SMT solver to determine the number of cores to assign to each stage.
The Alkali performance model explicitly takes into account inter-core bandwidth requirements.
Alkali doesn’t deal with compression.
Results
Fig. 7(a) shows results from the motivating example (PostgreSQL→pandas). Fig. 7(b) compares XDBC against pandas' built-in functions for reading CSV data over HTTP. connector-x is a more specialized library that specifically supports reading data into Python programs.
Dangling Pointers
There are many search spaces which are too large for brute force. Special-case heuristic algorithms are one fallback, but as the Alkali paper shows, there are other approaches (e.g., LP solvers, ILP solvers, SMT solvers, machine learning models). It would be great to see cross-cutting studies comparing heuristics to other approaches.