Cliff Click
Chief JVM Architect of Azul Systems
Previously he was with Motorola, where he helped deliver industry-leading SpecInt2000 scores on PowerPC chips, and before that he researched compiler technology at HP Labs. Cliff has been writing optimizing compilers and JITs for over 20 years. He is regularly invited to speak at industry and academic conferences, including JavaOne, ECOOP, JVM, and VEE; serves on the program committees of many conferences (including PLDI and OOPSLA); and has published many papers about HotSpot technology and holds more than a dozen related patents. Cliff holds a PhD in Computer Science from Rice University.
Presentations
Challenges and Directions in Java Virtual Machines
Available core counts are going up, up, up! Intel is shipping quad-core chips; Sun’s Rock has (effectively) 64 CPUs, and Azul’s hardware has nearly a thousand cores. How do we use all those cores effectively? The JVM proper can directly make use of a small number of cores (JIT compilation, profiling), and garbage collection can use about 20 percent more cores than the application is using to make garbage, but this hardly gets us to four cores. Application servers and transactional (J2EE/bean) applications scale well with thread pools to about 40 or 60 CPUs, and then internal locking starts to limit scaling. Unless your application has embarrassingly parallel data (e.g. data mining; risk analysis; or, heaven forbid, Fortran-style weather prediction), how can you use more CPUs to get more performance? How do you debug the million-line concurrent program?
“Locking” paradigms (lock ranking, visual inspection) appear to be nearing the limits of program sizes that are understandable and maintainable. “Transactions,” the hot new academic solution to concurrent-programming woes, has its own unsolved issues (open nesting, “wait,” livelock, significant slowdowns without contention). Neither locks nor transactions provide compiler support for keeping the correct variables guarded by the correct synchronization, such as atomic sets. Application-specific programming, such as stream programming or graphics, is, well, application-specific. Tools (debuggers, static analyzers, profilers) and libraries (e.g. the JDK concurrency utilities) are necessary but not sufficient. Where is the general-purpose concurrent programming model? This session’s speaker claims that we need another revolution in thinking about programs.
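To make the "correct variables guarded by the correct synchronization" point concrete, here is a minimal Java sketch. The class `GuardedPair`, its fields, and the `stress` helper are hypothetical names invented for this illustration, not code from the session. Two fields form an atomic set (their sum must stay constant), and nothing but programmer discipline ties them to the lock that protects them.

```java
class GuardedPair {
    private final Object lock = new Object();

    // Invariant: a + b == 100. Both fields are guarded by `lock`,
    // but only by convention: the compiler does not check that every
    // access to `a` or `b` actually holds the lock.
    private long a = 100;
    private long b = 0;

    // Moves `amount` from a to b as one atomic step.
    void move(long amount) {
        synchronized (lock) {
            a -= amount;
            b += amount;
        }
    }

    // Reads both fields under the same lock so the sum is consistent.
    long total() {
        synchronized (lock) {
            return a + b;
        }
    }

    // Hypothetical stress helper: runs `threads` workers that each call
    // move(1) `moves` times, then returns the final total.
    static long stress(int threads, int moves) throws InterruptedException {
        GuardedPair p = new GuardedPair();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int m = 0; m < moves; m++) {
                    p.move(1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return p.total();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("total after stress = " + stress(8, 100_000));
    }
}
```

If a later maintainer adds a method that touches `a` without taking `lock`, the code still compiles, and the invariant can silently break under load. That gap is the kind of missing compiler support the abstract refers to.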
The Art of (Java) Benchmarking
People write toy Java benchmarks all the time. Nearly always they "get it wrong" -- wrong in the sense that the code they write doesn't measure what they think it does. Oh, it measures something all right -- just not what they want. This session presents some common benchmarking pitfalls, demonstrating pieces of real, bad (and usually really bad) benchmarks such as the following: SpecJVM98 209_db isn't a DB test; it's a bad string-sort test and indirectly a measure of the size of your TLBs and caches. SpecJAppServer2004 is a test of your DB and network speed, not your JVM. SpecJBB2000 isn't a middleware test; it's a perfect young-gen-only garbage collection test. The session goes through some of the steps any programmer would go through to make a canned program run fast -- that is, it shows you how benchmarks get "spammed."
The session is for any programmer who has tried to benchmark anything. It provides specific advice on how to benchmark, stumbling blocks to look out for, and real-world examples of how well-known benchmarks fail to actually measure what they intended to measure.
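As an illustration of the kind of pitfall described above, the following is a minimal hand-rolled timing sketch in Java. It is not taken from the session; the class `BenchSketch`, the toy workload `work`, and the iteration counts are assumptions chosen for the example. It shows two common precautions: warming up the code so the JIT has compiled it before timing begins, and consuming the result so the measured loop cannot be discarded as dead code.

```java
class BenchSketch {
    // Hypothetical toy workload: sum of the integers 0..n-1.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Runs the workload `iters` times and returns a checksum.
    // Returning (and later printing) the checksum keeps the result live,
    // so the JIT cannot remove the whole loop as unused.
    static long runBatch(int iters, int n) {
        long checksum = 0;
        for (int i = 0; i < iters; i++) {
            checksum += work(n);
        }
        return checksum;
    }

    public static void main(String[] args) {
        int iters = 20_000;
        int n = 1_000;

        // Warm-up pass: gives the JIT a chance to compile work() and
        // runBatch() before any timing starts.
        long sink = runBatch(iters, n);

        // Timed pass.
        long start = System.nanoTime();
        sink += runBatch(iters, n);
        long elapsed = System.nanoTime() - start;

        // Printing the checksum is what makes the work observable.
        System.out.println("checksum=" + sink
                + " ns/iter=" + (elapsed / (double) iters));
    }
}
```

Even with these precautions the number is suspect: a JIT may reduce a loop this simple to a closed-form expression, in which case the test times almost nothing. A harness built for the purpose, such as JMH, is usually a safer choice than a hand-written loop.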
Fast Bytecodes for Funny Languages
Several languages target bytecodes and the JVM as their new "assembler," including Scala, Clojure, Jython, JRuby, JavaScript (via Rhino), and JPC. This session takes a quick look at how well these languages sit on a JVM, what their performance is, where it goes, and why.
Some of the results are surprising: Clojure's STM ran a complex concurrent problem with 600 parallel worker threads with perfect scaling on an Azul box without modification. Some of the results are less surprising: the fixnum/bignum checks behind entirely transparent integer math take a substantial performance toll, and the lack of tail-call optimization gives some languages fits. Some of the languages can get "to the metal"; in others, performance takes a backseat to other concerns. This session, for JVM users working in languages other than Java, is a JVM's-eye view of bytecodes, JITs, and code generation, and will give you insight into why a language is (or is not!) as fast as you might expect.
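To give a feel for where the fixnum/bignum cost comes from, here is a small Java sketch of one way a language runtime might implement transparent integer addition. It is an assumption made for illustration and does not describe how any of the languages listed above actually do it; the class `TransparentInt` and its methods are hypothetical. Every addition pays for type checks, boxing, and an overflow check, even when the values would fit comfortably in a machine word.

```java
import java.math.BigInteger;

class TransparentInt {
    // Adds two values, each either a Long ("fixnum") or a BigInteger ("bignum").
    static Number add(Number x, Number y) {
        if (x instanceof Long && y instanceof Long) {
            try {
                // Fast path: overflow-checked 64-bit add, result boxed as Long.
                return Math.addExact((Long) x, (Long) y);
            } catch (ArithmeticException overflow) {
                // Result does not fit in 64 bits: fall through to the slow path.
            }
        }
        // Slow path: arbitrary-precision add.
        // (This sketch does not convert small BigInteger results back to Long.)
        return toBig(x).add(toBig(y));
    }

    // Widens a Long to BigInteger; passes a BigInteger through unchanged.
    static BigInteger toBig(Number n) {
        return (n instanceof BigInteger)
                ? (BigInteger) n
                : BigInteger.valueOf(n.longValue());
    }

    public static void main(String[] args) {
        System.out.println(add(1L, 2L));
        System.out.println(add(Long.MAX_VALUE, 1L));
    }
}
```

A plain Java `long` addition compiles to a single machine instruction, so the gap between that and the code path above is one plausible reading of the "substantial toll" the abstract mentions.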
A Crash Course in Modern Hardware
I walk through a tiny performance example on a modern out-of-order CPU and show that (1) single-threaded performance is tapped out, and that all the action is now in (2) multi-threaded programs and (3) the memory subsystem.
I discuss the Von Neumann architecture, CISC vs RISC, the rise of multicore, Instruction-Level Parallelism (ILP), pipelining, out-of-order dispatch, static vs dynamic ILP, performance impact of cache misses, memory performance, memory vs CPU caching, examples of memory/CPU cache interaction, and tips for improving performance.
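The cache-miss point can be tried at home with a sketch like the one below. It is not the example from the talk; the class `CacheWalk`, the 4096 x 4096 array size, and the number of rounds are assumptions picked for illustration. Both methods add up exactly the same numbers, but one walks memory sequentially while the other jumps to a different row on every access.

```java
class CacheWalk {
    // Visits elements in the order they are laid out within each row,
    // so consecutive accesses tend to hit the same cache line.
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    // Visits one element per row before moving to the next column,
    // so consecutive accesses land in different rows (different cache lines).
    // Assumes a non-empty, rectangular array.
    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        int cols = a[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < a.length; r++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 4096; // about 64 MiB of ints, larger than typical CPU caches
        int[][] a = new int[n][n];
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                a[r][c] = r ^ c;
            }
        }
        // Several rounds, so the early (interpreted, cold) rounds can be ignored.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long s1 = sumRowMajor(a);
            long t1 = System.nanoTime();
            long s2 = sumColumnMajor(a);
            long t2 = System.nanoTime();
            System.out.printf("round %d: row-major %d ms, column-major %d ms (sums %d / %d)%n",
                    round, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
        }
    }
}
```

On most machines the column-major walk is expected to be noticeably slower even though it executes the same number of additions, but the actual ratio depends on the CPU, the cache sizes, and the JVM, so treat any numbers it prints as indicative only.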



