Fast, RDMA-capable networks present a “network wall” for data-intensive applications in a data center. Software developers are facing two unpalatable choices: either communicate using messages and re-implement features of TCP/IP in their application, or find creative ways to cope with the microsecond-long latency of a remote memory access. Computer architects have the expertise to address this problem, but the solutions that were used to scale the “memory wall” alone will not be sufficient.
From supercomputers to every server
Modern network adapters expose memory semantics to applications using RDMA and offer very high throughput. 200 Gb/s InfiniBand devices are available today for commodity servers and 400 Gb/s is the next industry milestone. Ethernet is quickly closing the gap with 100 Gb/s speeds and RDMA support. While the bandwidth has quadrupled across generations of network adapters, latency has stubbornly remained a little over 1μs per round-trip. The laws of physics suggest that latency is unlikely to improve further.
The existence of fast networks is not news to the computer architecture community that has studied the performance of interconnection networks. What is remarkable, however, is the pace at which fast networks are gaining broader market acceptance. For an investment of about $1,000 per port, commodity servers today can communicate at speeds that were achievable only in high-end supercomputers a few years ago. The biggest benefit to applications from the commoditization of fast networking is standardization, which promises performance portability across deployments.
But what programming interfaces do applications use to access an RDMA-capable network? The TCP/IP stack in the OS copies data from kernel memory to user memory during a data transfer, and is typically avoided. One solution is MPI, a de facto standard in the high-performance computing community. However, MPI is far from a perfect solution for data-intensive processing. In a recent paper, we found that an MPI implementation only utilizes 50% of the link bandwidth when executing a network-bound SQL query. In addition, debugging such performance problems is cumbersome, as the MPI library is opaque to applications. Furthermore, MPI performs its own memory management, which conflicts with memory management decisions made by the application.
A memory link or a messaging channel?
Many data-intensive applications increasingly opt to use the OpenFabrics Verbs interface (libibverbs) and directly access the network adapter. The Verbs interface supports two communication modes for applications: the channel mode and the memory mode.
In the channel mode, applications eschew the ability to directly access remote memory and instead send or receive messages. This turns out to be a very efficient way to communicate, especially when used with an unreliable datagram protocol. Using a fast network as a message channel is not a panacea, however, because the receiver has to initiate the message transmission. (Otherwise, the network adapter would not know the user space address where it can store an incoming message.) Communicating in the channel mode requires coordination in software to ensure that receivers are ready to receive messages before senders transmit them. This may require substantial algorithmic changes or even be infeasible for certain applications. Furthermore, applications using an unreliable datagram protocol now need to tolerate out-of-order message delivery, carefully tune the window size, avoid “buffer bloat” and network congestion, and gracefully recover when packets are dropped. All of these are features of the TCP/IP stack that applications have come to take for granted.
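The receiver-first constraint of the channel mode can be illustrated with a small toy model. The sketch below is not the Verbs API: the class ToyReceiveQueue and its methods are hypothetical names invented for this illustration, and the choice to silently drop a message that finds no posted buffer is an assumption meant to mirror an unreliable datagram transport.

```python
from collections import deque


class ToyReceiveQueue:
    """Hypothetical model of a receiver's queue of pre-posted buffers.

    This is an illustration only; it does not call any real RDMA library.
    """

    def __init__(self):
        self.posted = deque()   # buffers the receiver has made available
        self.delivered = []     # payloads that were stored successfully
        self.dropped = 0        # messages that found no usable buffer

    def post_receive(self, size):
        """Receiver side: announce a buffer where one message may land."""
        self.posted.append(bytearray(size))

    def deliver(self, payload):
        """Adapter side: called when a message arrives from a sender.

        Returns True if the payload was stored, False if it was dropped
        (assumed behavior, modeled on an unreliable datagram transport).
        """
        if not self.posted or len(self.posted[0]) < len(payload):
            self.dropped += 1
            return False
        buf = self.posted.popleft()
        buf[:len(payload)] = payload
        self.delivered.append(bytes(buf[:len(payload)]))
        return True


if __name__ == "__main__":
    rq = ToyReceiveQueue()
    print(rq.deliver(b"too early"))   # no buffer posted yet
    rq.post_receive(64)
    print(rq.deliver(b"hello"))       # a buffer is waiting this time
    print(rq.delivered, rq.dropped)
```

Even in this simplified form, the sender cannot succeed on its own: the application has to arrange, in software, that post_receive runs before the matching message arrives.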
The alternative is to use the memory mode to transfer data, where applications will directly read from and write to remote memory. The technical challenge is that the latency of a remote memory access is one order of magnitude greater than the latency of a local memory access. Remote manipulations of even the simplest data structures require multiple round-trips in the network, in either a “lock, write, unlock” or a lock-free “fetch-and-add, then write” pattern. The overall latency of such an operation would likely be in the 10μs range. The application needs to be implemented cleverly to hide such microsecond-long latencies, which itself is a challenging, open problem.
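To make the round-trip cost of the two patterns concrete, here is a minimal sketch that counts round-trips for appending one value to a remote array. Everything in it is an assumption made for illustration: ToyRemoteMemory is a hypothetical in-process stand-in for a remote memory region (not a real RDMA binding), each method call is treated as exactly one network round-trip, and the memory layout (a lock word, a tail index, then payload slots) is invented for this example.

```python
class ToyRemoteMemory:
    """Hypothetical stand-in for a remote memory region.

    Each method models one remote operation and is counted as exactly
    one network round-trip (an assumption of this sketch).
    """

    def __init__(self, num_words):
        self.words = [0] * num_words
        self.round_trips = 0

    def write(self, addr, value):
        self.round_trips += 1
        self.words[addr] = value

    def compare_and_swap(self, addr, expected, new):
        self.round_trips += 1
        old = self.words[addr]
        if old == expected:
            self.words[addr] = new
        return old

    def fetch_and_add(self, addr, delta):
        self.round_trips += 1
        old = self.words[addr]
        self.words[addr] = old + delta
        return old


# Assumed layout: word 0 is a lock, word 1 is a tail index, words 2+ hold data.
LOCK, TAIL, DATA = 0, 1, 2


def append_with_lock(mem, slot, value):
    """'lock, write, unlock': acquire a remote lock, write, then release."""
    while mem.compare_and_swap(LOCK, 0, 1) != 0:
        pass  # lock is held elsewhere: every retry costs another round-trip
    mem.write(DATA + slot, value)
    mem.write(LOCK, 0)


def append_lock_free(mem, value):
    """'fetch-and-add, then write': reserve a slot atomically, then fill it."""
    slot = mem.fetch_and_add(TAIL, 1)
    mem.write(DATA + slot, value)
    return slot


if __name__ == "__main__":
    locked = ToyRemoteMemory(8)
    append_with_lock(locked, 0, 42)
    lock_free = ToyRemoteMemory(8)
    append_lock_free(lock_free, 42)
    print("lock, write, unlock round-trips:", locked.round_trips)
    print("fetch-and-add, then write round-trips:", lock_free.round_trips)
```

Multiplying these counts by a round-trip time of a little over 1μs gives only a best-case floor for an uncontended operation. Lock retries under contention, and any reads needed to locate the data first, add further round-trips on top of that floor.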
An opportunity for impact
As fast networks gain broader market acceptance, the scarcity of talent that can develop correct and fast software that uses RDMA is exacerbated. Data-intensive applications desperately need better hardware support for remote memory operations in a data center. The computer architecture community has valuable expertise in characterizing the memory consistency needs of applications and hiding the latency of memory accesses. Existing solutions like simultaneous multi-threading, prefetching, speculative loading, transactional memory and configurable coherence domains have been successful in hiding the memory wall from applications, often without even recompiling them. It appears unlikely that these techniques alone will be sufficient to scale the “network wall” where latency is 10X higher. The computer architecture community could lead the conversation on how to address this emerging application need.
About the author: Spyros Blanas is an Assistant Professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.