The focus of most published research in architecture is on applications implemented in high-performance, “close-to-the-metal” languages essentially developed before computers got fast. These, let’s call them metal languages, include FORTRAN (introduced in 1957), C (1972), and C++ (1985). Despite their age, these languages are far from dead! Programmers continue to write applications in them, and they continue to evolve: the just-approved C++20 standard is the latest example.
An integer in Python consumes 24 bytes instead of 4 bytes (because every object carries around type information, reference counts, and more), while the overhead for data structures like lists or dictionaries can be 4x that of their C++ counterparts.
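To see this overhead on your own machine, here is a minimal sketch that assumes CPython and uses only `sys.getsizeof`. The helper `deep_list_size` is a hypothetical name introduced for illustration, and the exact byte counts depend on the interpreter version and platform, so treat whatever it prints as indicative rather than definitive.

```python
import sys

def deep_list_size(values):
    """Approximate bytes used by a list plus the distinct objects it references."""
    seen = set()
    total = sys.getsizeof(values)  # the list header and its array of pointers
    for v in values:
        if id(v) not in seen:      # count each referenced object only once
            seen.add(id(v))
            total += sys.getsizeof(v)
    return total

int_size = sys.getsizeof(12345)
numbers = list(range(1000, 2000))
per_element = deep_list_size(numbers) / len(numbers)

print(f"one int object: {int_size} bytes")
print(f"list of {len(numbers)} ints: about {per_element:.1f} bytes per element")
```

The per-element figure includes both the pointer stored in the list and the integer object it points to, which is the part a packed C++ array of 4-byte integers does not pay for.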
The execution time story is even worse. As Leiserson et al. pointed out in their recent Science article (“There’s plenty of room at the Top”), which Hennessy also cited in his Turing lecture, a naive implementation of matrix-matrix multiply in Python runs between 100x and 60,000x slower than its counterparts written in a “metal” language (highly optimized C).
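For concreteness, this is roughly what a "naive" matrix-matrix multiply looks like in pure Python: a triple loop over lists of lists. It is a sketch for illustration, not the code used in the Science article; the function name `matmul_naive` and the matrix size are assumptions, and the timing it prints only becomes meaningful when compared against an optimized C kernel, which is not included here.

```python
import time

def matmul_naive(a, b):
    """Multiply two matrices stored as lists of rows, using three nested loops."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c

n = 60  # kept small so the sketch finishes quickly
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul_naive(a, b)
elapsed = time.perf_counter() - start
print(f"{n}x{n} naive multiply took {elapsed:.4f} s")
```

Every iteration of the inner loop goes through the bytecode interpreter, indexes into boxed lists, and allocates boxed floats, which is where much of the slowdown comes from.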
To be clear, these languages were not designed to be fast or space-efficient, but for ease of use. Unfortunately, their inefficiencies have now become a real problem.
Languages like Python have also proven resistant to efficient implementation, partly because of their design, and partly because of limitations imposed by the need to interoperate with C code. The current state-of-the-art Python compiler, PyPy, often yields around a 2x improvement in performance (and sometimes less). This is something, but it’s nowhere close to making up the gap between Python and C. Part of the appeal of Python is that there is a vast array of libraries available for it; when these are written in C, they can go a long way to alleviating Python’s performance problems. But any time memory and application logic move into Python land, it’s game over.
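The difference between staying in C and moving the logic into "Python land" can be sketched with nothing but the standard library. The example below assumes CPython, where the built-in `sum` is implemented in C; `python_sum` is a hypothetical helper that does the same work in an interpreted loop. No particular ratio is claimed; the numbers depend on the machine and interpreter.

```python
import timeit

def python_sum(values):
    """Add up the values with an explicit loop that runs in the interpreter."""
    total = 0
    for v in values:
        total += v
    return total

data = list(range(100_000))

# Same computation, once driven by interpreted bytecode and once by C code.
t_python = timeit.timeit(lambda: python_sum(data), number=20)
t_builtin = timeit.timeit(lambda: sum(data), number=20)

print(f"pure Python loop: {t_python:.4f} s")
print(f"built-in sum:     {t_builtin:.4f} s")
```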
- We need to understand these applications and their interactions with all levels of the stack. Are caches large enough for this code? Can we do something to optimize giant event loops running bytecode interpreters at the architecture level (perhaps by revisiting ideas from long ago)? What about branch prediction tables? Is there room for accelerators? Better performance counters that will yield actionable insights? Who knows! But let’s find out and see if we can help. (There’s some work on hardware proposals for these systems, like Zhu et al., MICRO 15, Gope et al., ISCA 17, and Choi et al., ISCA 17, but we need more!)
- We need research on tool support to help programmers write more efficient applications. For example, existing profilers for these languages are essentially re-implementations of gprof and perf, ignoring the aspects of scripting languages that make them different. Python programmers need to distinguish time and memory spent in pure Python (optimizable) from time and memory spent in C libraries (not so much). They need help tracking down expensive and insidious traffic across the language boundaries (copying and serialization). We have been developing a new profiler for Python called Scalene that does exactly this, but plenty of work remains to be done.
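To make the "traffic across the language boundaries" point concrete, here is a small sketch of the two kinds of hidden cost mentioned above: copying a Python list into a packed C buffer and back, and serializing objects. It uses the standard `array` and `pickle` modules as stand-ins for what happens when data is handed to native code or another process. This is an illustration under those assumptions, not a description of how Scalene works.

```python
import array
import pickle
import time

values = [float(i) for i in range(200_000)]

# Copying: every boxed Python float is unpacked into a C double, then boxed again.
start = time.perf_counter()
packed = array.array("d", values)  # list of objects -> contiguous C doubles
unpacked = packed.tolist()         # C doubles -> new list of Python floats
copy_time = time.perf_counter() - start

# Serialization: the whole list is turned into bytes and rebuilt.
start = time.perf_counter()
payload = pickle.dumps(values)
restored = pickle.loads(payload)
pickle_time = time.perf_counter() - start

print(f"copy to C buffer and back: {copy_time:.4f} s")
print(f"pickle round trip:         {pickle_time:.4f} s ({len(payload)} bytes)")
```

Neither step shows up as "application logic" in the source code, which is why this kind of cost is easy to miss without tool support.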
To sum up: yes, metal languages will continue to be important, but so will the irrational exuberance languages. We as a community should lead the way in developing systems (at the hardware and software levels) that will make them run faster.
About the Author: Emery D. Berger is a professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.