4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

SUMMARY:

 

This is posted at the request of our good friends at SC11 Education!

 

 

Call for Papers: Deadline Extension: Resilience@Euro-Par 2011

 

PAPER DEADLINE EXTENDED TO FRI JUNE 24!

 

WHAT: 4th Workshop on Resiliency in High Performance Computing (Resilience)

in Clusters, Clouds, and Grids

WHERE: In conjunction with the 17th International European

Conference on Parallel and Distributed Computing (Euro-Par 2011)

Bordeaux France

WHEN: Mon Aug 29 - Fri Sep 2 2011

 

Due to multiple requests, we have extended the paper submission

deadline to Fri June 24 2011.

 

We apologize if you receive multiple copies of this notice.

 

Important Web sites:

 

Resilience 2011 at

http://xcr.cenit.latech.edu/resilience2011

 

Euro-Par 2011 at

http://europar2011.bordeaux.inria.fr

 

Important dates:

Paper submission deadline on Fri June 24 2011

Notification deadline on Tue July 12 2011

Resilience Workshop on Tue Aug 30 2011

Euro-Par conference on Mon Aug 29 - Fri Sep 2 2011

Camera ready deadline is after the workshop.

 

DETAILS:

 

Clusters, Clouds, and Grids are three different computational paradigms

with the intent or potential to support High Performance Computing

(HPC). Currently, they consist of hardware, management, and usage models

particular to different computational regimes, e.g., high performance

cluster systems designed to support tightly coupled scientific

simulation codes typically utilize high-speed interconnects and

commercial cloud systems designed to support software as a service (SaaS)

do not. However, in order to support HPC, all must at least utilize

large numbers of resources, and hence effective HPC in any of these

paradigms must address the issue of resiliency at large scale.

 

Recent trends in HPC systems have clearly indicated that future

increases in performance, in excess of those resulting from improvements

in single-processor performance, will be achieved through corresponding

increases in system scale, i.e., using a significantly larger component

count. As the raw computational performance of these HPC systems

increases from today's tera- and peta-scale to next-generation

multi-peta-scale capability and beyond, their number of computational,

networking, and storage components will grow from the ten-to-one-hundred

thousand compute nodes of today's systems to several hundreds of

thousands of compute nodes and more in the foreseeable future. This

substantial growth in system scale, and the resulting component count,

poses a challenge for HPC system and application software with respect

to fault tolerance and resilience.

 

Furthermore, recent experiences on extreme-scale HPC systems with

non-recoverable soft errors, i.e., bit flips in memory, cache,

registers, and logic, have added another major source of concern. The

probability of such errors not only grows with system size, but also

with increasing architectural vulnerability caused by employing

accelerators, such as FPGAs and GPUs, and by shrinking nanometer

technology. Reactive fault tolerance technologies, such as

checkpoint/restart, are unable to handle high failure rates due to

associated overheads, while proactive resiliency technologies, such as

migration, simply fail because random soft errors cannot be predicted.

Moreover, soft errors may even remain undetected, resulting in silent

data corruption.

 

Prior workshop Web sites:

Resilience 2010 at http://xcr.cenit.latech.edu/resilience2010

Resilience 2009 at http://xcr.cenit.latech.edu/resilience2009

Resilience 2008 at http://xcr.cenit.latech.edu/resilience2008

 

Submission guidelines:

Authors are invited to submit papers electronically in English in PDF

format via EasyChair at

<https://www.easychair.org/conferences/?conf=resilience20110>. Submitted

manuscripts should be structured as technical papers and may not exceed

10 pages, including figures, tables and references, using Springer's

Lecture Notes in Computer Science (LNCS) format at

<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>.

Submissions should include an abstract, keywords, and the e-mail address of

the corresponding author. Papers not conforming to these guidelines may

be returned without review. All manuscripts will be reviewed and will be

judged on correctness, originality, technical strength, significance,

quality of presentation, and interest and relevance to the conference

attendees. Submitted papers must represent original unpublished research

that is not currently under review for any other conference or journal.

Papers not following these guidelines will be rejected without review

and further action may be taken, including (but not limited to)

notifications sent to the heads of the institutions of the authors and

sponsors of the conference. Submissions received after the due date,

exceeding the length limit, or not appropriately structured may also not be

considered. The proceedings will be published in Springer's LNCS as

post-conference proceedings. At least one author of an accepted paper

must register for and attend the workshop for inclusion in the

proceedings. Authors may contact the workshop program chair for more

information.

 

Topics of interest include, but are not limited to:

 

Reports on current HPC system and application resiliency

HPC resiliency metrics and standards

HPC system and application resiliency analysis

HPC system and application-level fault handling and anticipation

HPC system and application health monitoring

Resiliency for HPC file and storage systems

System-level checkpoint/restart for HPC

System-level migration for HPC

Algorithm-based resiliency fundamentals for HPC (not Hadoop)

Fault tolerant MPI concepts and solutions

Soft error detection and recovery in HPC systems

HPC system and application log analysis

Statistical methods to identify failure root causes

Fault injection studies in HPC environments

High availability solutions for HPC systems

Reliability and availability analysis

Hardware for fault detection and recovery

Resource management for system resiliency and availability

 

General Co-Chairs:

Stephen L. Scott, Oak Ridge National Laboratory, USA

Chokchai (Box) Leangsuksun, Louisiana Tech University, USA

 

Program Chair:

Christian Engelmann, Oak Ridge National Laboratory, USA

 

Publication Co-Chairs:

James Brandt, Sandia National Laboratories, USA

Ann Gentile, Sandia National Laboratories, USA

 

Program Committee:

Vassil Alexandrov, Barcelona Supercomputing Center, Spain

David E. Bernholdt, Oak Ridge National Laboratory, USA

George Bosilca, University of Tennessee, USA

Jim Brandt, Sandia National Laboratories, USA

Patrick G. Bridges, University of New Mexico, USA

Greg Bronevetsky, Lawrence Livermore National Laboratory, USA

Kasidit Chanchio, Thammasat University, Thailand

Zizhong Chen, Colorado School of Mines, USA

Nathan DeBardeleben, Los Alamos National Laboratory, USA

Jack Dongarra, University of Tennessee, USA

Christian Engelmann, Oak Ridge National Laboratory, USA

Yung-Chin Fang, Dell, USA

Kurt B. Ferreira, Sandia National Laboratories, USA

Ann Gentile, Sandia National Laboratories, USA

Cecile Germain, University Paris-Sud, France

Rinku Gupta, Argonne National Laboratory, USA

Paul Hargrove, Lawrence Berkeley National Laboratory, USA

Xubin He, Virginia Commonwealth University, USA

Larry Kaplan, Cray, USA

Daniel S. Katz, University of Chicago, USA

Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands

Dieter Kranzlmueller, LMU/LRZ Munich, Germany

Zhiling Lan, Illinois Institute of Technology, USA

Chokchai (Box) Leangsuksun, Louisiana Tech University, USA

Xiaosong Ma, North Carolina State University, USA

Celso Mendes, University of Illinois at Urbana Champaign, USA

Thomas Naughton, Oak Ridge National Laboratory, USA

George Ostrouchov, Oak Ridge National Laboratory, USA

DK Panda, The Ohio State University, USA

Mihaela Paun, Louisiana Tech University, USA

Alexander Reinefeld, Zuse Institute Berlin, Germany

Eric Roman, Lawrence Berkeley National Laboratory, USA

Stephen L. Scott, Oak Ridge National Laboratory, USA

Gregory M. Thorson, SGI, USA

Geoffroy Vallee, Oak Ridge National Laboratory, USA

Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA
