Cliff Click is the CTO and Co-Founder of 0xdata, a firm dedicated to creating a new way to think about web-scale data storage and real-time analytics. Cliff wrote his first compiler when he was 15 (Pascal to TRS Z-80!), although his most famous compiler is the HotSpot Server Compiler (the Sea of Nodes IR). He helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500GB heaps under 10ms, and worked on all aspects of that JVM. Before that, Cliff worked on HotSpot at Sun Microsystems, and he is at least partially responsible for bringing Java into the mainstream.
Cliff is invited to speak regularly at industry and academic conferences and has published many papers about HotSpot technology. He holds a PhD in Computer Science from Rice University and about 15 patents.
I walk through a tiny performance example on a modern out-of-order CPU, and basically show that (1) single-threaded performance is tapped out, (2) all the action is in multi-threaded programs, and (3) performance is dominated by the memory subsystem.
I discuss the Von Neumann architecture, CISC vs RISC, the rise of multicore, Instruction-Level Parallelism (ILP), pipelining, out-of-order dispatch, static vs dynamic ILP, performance impact of cache misses, memory performance, memory vs CPU caching, examples of memory/CPU cache interaction, and tips for improving performance.
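To make the memory point concrete, here is a minimal Java sketch (my illustration, not code from the talk) that walks the same array at stride 1 and at stride 16. The stride-16 pass does one-sixteenth the additions but touches the same number of 64-byte cache lines, so on typical hardware it runs nearly as long: the memory subsystem, not the ALU, sets the pace.

    public class CacheDemo {
        static final int N = 64 * 1024 * 1024;   // 64M ints = 256MB, far larger than any cache

        static long sum(int[] a, int stride) {
            long s = 0;
            for (int i = 0; i < a.length; i += stride) s += a[i];
            return s;
        }

        public static void main(String[] args) {
            int[] a = new int[N];
            for (int pass = 0; pass < 5; pass++) {       // repeat to warm up the JIT
                for (int stride : new int[]{1, 16}) {    // 16 ints = one 64-byte cache line
                    long t0 = System.nanoTime();
                    long s = sum(a, stride);
                    System.out.printf("stride %2d: %6.1f ms (sum=%d)%n",
                                      stride, (System.nanoTime() - t0) / 1e6, s);
                }
            }
        }
    }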
Datasets have grown to PB scale, but the modeling you can do has been limited: it runs on a single node (e.g., R, SAS), stays stuck inside the database, or takes hours on Hadoop-like technologies. We have built a simple clustering package and are using it to do distributed analytics on the sum of all the RAM in a cluster.
This talk focuses on how that clustering technology, plus a Java-based vector math API, is being used to build full algorithms like GLM/GLMNET, Random Forest, and K-means. These algorithms are complex multi-pass programs, and traditional distributed programming models expose the distribution boundaries, making the algorithms hard to reason about. We have a basic JDK for doing at-scale math: we can run most Plain Olde Java in (distributed) inner loops and communicate via a K/V store with exact Java Memory Model consistency (not lazy consistency). Adding more CPUs makes these algorithms run faster, and adding more RAM allows larger datasets. We are bringing back Moore's Law!
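As a flavor of the programming model, here is a minimal single-node Java sketch; the ChunkTask API below is hypothetical, invented for illustration rather than taken from 0xdata's actual library. It mimics the idea described above: map() runs as Plain Olde Java over a local chunk of a vector, reduce() combines partial results, and a real framework would move the computation to the node holding each chunk.

    import java.util.stream.IntStream;

    // Hypothetical stand-in for a distributed map/reduce task; here the
    // "cluster" is just a parallel stream over in-memory chunks.
    abstract class ChunkTask {
        abstract double map(double[] chunk);          // runs where the chunk lives
        abstract double reduce(double a, double b);   // combines partial results

        double doAll(double[][] chunks) {
            return IntStream.range(0, chunks.length).parallel()
                    .mapToDouble(i -> map(chunks[i]))
                    .reduce(this::reduce).orElse(0);
        }
    }

    public class MeanDemo {
        public static void main(String[] args) {
            double[][] chunks = { {1, 2, 3}, {4, 5}, {6, 7, 8, 9} };  // one "vector", three chunks
            double sum = new ChunkTask() {
                double map(double[] c) { double s = 0; for (double v : c) s += v; return s; }
                double reduce(double a, double b) { return a + b; }
            }.doAll(chunks);
            System.out.println("mean = " + sum / 9);  // 5.0
        }
    }

Multi-pass algorithms like GLM or K-means then become a sequence of such tasks, with the K/V store carrying model state between passes.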
Available core counts are going up, up, up! Intel is shipping quad-core chips; Sun’s Rock has (effectively) 64 CPUs, and Azul’s hardware nearly a thousand cores. How do we use all those cores effectively? The JVM proper can directly make use of a small number of cores (JIT compilation, profiling), and garbage collection can use about 20 percent more cores than the application is using to make garbage, but this hardly gets us to four cores. Application servers and transactional (J2EE/bean) applications scale well with thread pools to about 40 or 60 CPUs, and then internal locking starts to limit scaling. Unless your application has embarrassingly parallel data (e.g., data mining; risk analysis; or, heaven forbid, Fortran-style weather prediction), how can you use more CPUs to get more performance? How do you debug the million-line concurrent program?
“Locking” paradigms (lock ranking, visual inspection) appear to be nearing the limits of program sizes that are understandable and maintainable. “Transactions,” the hot new academic solution to concurrent-programming woes, have their own unsolved issues (open nesting, “wait,” livelock, significant slowdowns even without contention). Neither locks nor transactions provide compiler support for keeping the correct variables guarded by the correct synchronization, such as atomic sets. Application-specific programming, such as stream programming or graphics, is, well, application-specific. Tools (debuggers, static analyzers, profilers) and libraries (e.g., the JDK concurrency utilities) are necessary but not sufficient. Where is the general-purpose concurrent programming model? This session’s speaker claims that we need another revolution in thinking about programs.
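The internal-locking ceiling is easy to demonstrate. Below is a minimal Java sketch (mine, not from the talk): the same increment workload run once through a single synchronized counter, which serializes every thread on one lock, and once through a java.util.concurrent LongAdder, which stripes updates across cells and keeps scaling as threads are added.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.LongAdder;

    public class LockCeiling {
        static long locked = 0;
        static synchronized void incLocked() { locked++; }   // one lock shared by all threads

        public static void main(String[] args) throws Exception {
            int threads = Runtime.getRuntime().availableProcessors();
            LongAdder adder = new LongAdder();
            Runnable[] tasks = {
                () -> { for (int i = 0; i < 1_000_000; i++) incLocked(); },      // contended lock
                () -> { for (int i = 0; i < 1_000_000; i++) adder.increment(); } // striped counter
            };
            for (Runnable task : tasks) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                long t0 = System.nanoTime();
                for (int t = 0; t < threads; t++) pool.submit(task);
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.MINUTES);
                System.out.printf("%.1f ms%n", (System.nanoTime() - t0) / 1e6);
            }
        }
    }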
There are several languages that target bytecodes and the JVM as their new “assembler,” including Scala, Clojure, Jython, JRuby, JavaScript (via Rhino), and JPC. This session takes a quick look at how well these languages sit on the JVM, what their performance is, where it goes, and why.
Some of the results are surprising: Clojure's STM ran a complex concurrent problem with 600 parallel worker threads with perfect scaling on an Azul box without modification. Some of the results are less surprising: entirely transparent integer math exacts a substantial toll in fixnum/bignum checks on every math op, and a lack of tail-call optimization gives some languages fits. Some of the languages can get “to the metal,” and sometimes performance takes a backseat to other concerns. This session, for users of non-Java languages on the JVM, is a JVM’s-eye view of bytecodes, JITs, and code generation, and will give you insight into why a language is (or is not!) as fast as you might expect.
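To see why transparent integer math is costly, consider what a dynamic language on the JVM must do for every ‘+’. The sketch below illustrates the general technique, not any particular language's implementation: check the operand types, do an overflow-checked add on the fast path, and fall back to BigInteger on overflow. A plain Java ‘+’ compiles to one machine ADD; this version pays for boxing, type tests, and a branch on every operation.

    import java.math.BigInteger;

    public class Fixnum {
        // What "transparent" integer addition costs per operation (illustrative).
        static Object add(Object a, Object b) {
            if (a instanceof Long x && b instanceof Long y) {
                try {
                    return Math.addExact(x, y);   // fast path: still boxed and overflow-checked
                } catch (ArithmeticException overflow) {
                    return BigInteger.valueOf(x).add(BigInteger.valueOf(y));
                }
            }
            return toBig(a).add(toBig(b));        // slow path: full bignum math
        }

        static BigInteger toBig(Object o) {
            return o instanceof BigInteger b ? b : BigInteger.valueOf((Long) o);
        }

        public static void main(String[] args) {
            System.out.println(add(1L, 2L));                 // 3
            System.out.println(add(Long.MAX_VALUE, 1L));     // silently promotes to bignum
        }
    }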
People write toy Java benchmarks all the time. Nearly always they “get it wrong,” wrong in the sense that the code they write doesn't measure what they think it does. Oh, it measures something all right, just not what they want. This session presents some common benchmarking pitfalls, demonstrating pieces of real, bad (and usually really bad) benchmarks, such as the following: SPECjvm98's _209_db isn't a DB test; it's a bad string-sort test and, indirectly, a measure of the size of your TLBs and caches. SPECjAppServer2004 is a test of your DB and network speed, not your JVM. SPECjbb2000 isn't a middleware test; it's a perfect young-gen-only garbage collection test. The session goes through some of the steps any programmer would go through to make a canned program run fast; that is, it shows you how benchmarks get “spammed.”
The session is for any programmer who has tried to benchmark anything. It provides specific advice on how to benchmark, stumbling blocks to look out for, and real-world examples of how well-known benchmarks fail to actually measure what they intended to measure.
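As a taste of those pitfalls, here is a minimal Java sketch (my illustration, not one of the talk's examples) combining two classics: the first timing computes a result nobody uses, so the JIT is free to delete the loop entirely, and it also measures the interpreter because nothing has warmed up yet.

    public class BadBench {
        static double work() {
            double s = 0;
            for (int i = 1; i < 100_000; i++) s += Math.sqrt(i);
            return s;
        }

        public static void main(String[] args) {
            long t0 = System.nanoTime();
            work();   // BAD: result is dead code, and the JIT hasn't compiled work() yet
            System.out.printf("cold, dead: %8.3f ms%n", (System.nanoTime() - t0) / 1e6);

            double sink = 0;
            for (int i = 0; i < 10_000; i++) sink += work();   // warm up the JIT
            long t1 = System.nanoTime();
            sink += work();   // GOOD: compiled code, and the result is kept alive
            System.out.printf("warm, live: %8.3f ms (sink=%s)%n",
                              (System.nanoTime() - t1) / 1e6, sink);
        }
    }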