
The State of High-Performance Computing in the Open-Source R Ecosystem

Speaker: Drew Schmidt, University of Tennessee

R is a strange language. Dating back to S from Bell Labs, it is the mad science experiment produced by blending a C-inspired programming language with a feature-rich, interactive data analysis package. Primarily developed by statisticians, it has an eclectic mix of programming idioms and syntax styles. One writer describes R as "the most shockingly dreadful and most useful language for data analysis." Yet in spite of (or, as that quote suggests, perhaps because of) its many quirks, it is beloved by many data scientists. Indeed, it is the de facto standard for data analysis in academia and has been steadily gaining popularity in industry for some time. Recently, IEEE Spectrum listed R as the fifth most popular programming language in its rankings. A humble scripting language designed only to be good at data analysis beat out standards like C# and JavaScript* in a general-purpose "language shootout."

So it would seem that R is here to stay, including on the cluster. Unsurprisingly, in the age of big(ger) data, statisticians, scientists, and all other analyzers of data increasingly find themselves in need of high-performance computing (HPC) resources. When they move to small campus clusters, national supercomputing resources, or the cloud, they want to bring R with them. However, R was built with the desktop, not the cluster, in mind. To address this, the open-source R community is steadily developing solutions to transform R from merely a "high productivity" language into a legitimate high-performance language. These external packages enhance R computations with multithreaded and compiled kernels, provide access to coprocessor cards like GPUs and Intel® Xeon Phi™ coprocessors, and even elevate R to large distributed resources, living atop technologies like the message-passing interface (MPI) and Apache Spark*.
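As a flavor of what MPI-backed R code can look like, the sketch below is a minimal, hypothetical example (not taken from the talk) using the pbdMPI package: each MPI rank estimates pi by Monte Carlo from its own random points, and an allreduce combines the counts. It assumes pbdMPI and an MPI runtime are installed; the script name and launch command are illustrative.

    ## A minimal sketch assuming the pbdMPI package and an MPI runtime are
    ## available; launch with something like:  mpirun -np 4 Rscript pi.r
    suppressMessages(library(pbdMPI))
    init()

    ## Give each rank its own random stream, then sample points in the unit square.
    set.seed(1234 + comm.rank())
    n.local <- 1e6
    x <- runif(n.local)
    y <- runif(n.local)
    hits.local <- sum(x * x + y * y <= 1)

    ## Combine the per-rank hit counts with an MPI allreduce and print from rank 0.
    hits <- allreduce(hits.local, op = "sum")
    comm.print(4 * hits / (n.local * comm.size()))

    finalize()

The same single-program, multiple-data style scales from a laptop to a large cluster simply by changing the number of ranks at launch time.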

This talk explores this package landscape, describing the history of R's use on HPC resources as well as the current state of the art.