Out of memory error with Cpzgemr2d

Hello everybody, I am trying to distribute a square complex (double precision) matrix over a BLACS grid using the Cpzgemr2d function. The global matrix resides on a single "master" node, and I need to distribute it over the grid as a preliminary step of a linear system solution. Everything works fine for matrices of about 2 GB or smaller, using various BLACS grids (2x2, 3x3, 4x4, etc.) with a row/column block size of 64. However, when I increase the matrix size to, for example, 4 GB, with the same grid and block configuration, the call to Cpzgemr2d fails with an "xxmr2d: out of memory" message and the distribution never completes. Memory availability on my test machine is quite large (more than 40 GB per MPI task), and I am only allocating the memory required for the global and local matrix blocks, so I doubt that physical memory itself is the cause of the problem.

Is there any size limit for the above data distribution procedure? My test machine is a Linux cluster, and I am working with the 64-bit MKL ScaLAPACK library.

Many thanks in advance for your help!

Hi Christopher,

I have experienced the same issue and have still not found a solution or workaround. Hope the Intel guys can help us!

Thanks in advance.

Hi Christopher and Massimiliano B.,

I have the same problem when I use pdgemr2d. Hope the Intel guys can help us!

This is STILL a problem even when I compile everything with 64-bit integers and link to a 64-bit integer MPI library. Normally I can also compile with 64-bit integers and link to a 32-bit integer MPI library (using some wrappers for all MPI calls), but since ScaLAPACK/BLACS calls MPI directly, that is not a possible solution here.

Best Regards

Thomas Kjaergaard

Hi Thomas K. and Massimiliano B.,

I have a question for you (nothing helpful for solving the problem, just out of curiosity…): have you tried building the ScaLAPACK library yourselves, by downloading the source code from netlib.org, and linking your code against that library instead of the MKL one? My problem is still present if I just use the downloaded code as-is; however, the observed error seems to be fixed by the change suggested here

https://software.intel.com/en-us/forums/topic/286499

namely:

********************************
You need to modify the file REDIST/SRC/pgemraux.c and change
void *
mr2d_malloc(n)
unsigned int n;
{

to
void *
mr2d_malloc(n)
unsigned long int n;
{
********************************

Best regards

Christopher

No, I have not tried to compile ScaLAPACK myself; I just link to MKL.

Maybe I should, and maybe I will have to, but Intel really should fix this problem!

TK

Hello Thomas, it seems the problem is on our side; it has been escalated.

Can you give us a reproducer? It will help us to investigate the problem further.

regards, Gennady

Hello everybody,
I have written a main function that uses ScaLAPACK to distribute a square complex matrix, in order to reproduce the error described in my first post. I hope it is useful for finding a solution. Some comments about the code:
1 - The USE_MKL_SCALAPACK variable switches between the MKL and the open-source ScaLAPACK library.
2 - A square complex matrix of size numberGlobalDataRows*numberGlobalDataRows is built by the master task only (the default BLACS grid is a 2x2 task grid). This global matrix is then distributed over the BLACS task grid with the Cpzgemr2d function: the out-of-memory error appears at the call to this function when "large" global matrices are used.
3 - I have included some (conditionally compiled) code lines that print the local matrices owned by the individual tasks after the distribution, in order to check the result.
4 - I hope the code is clear enough (and without critical bugs).

Thank you for your support

Christopher

#include <iostream>
#include "mpi.h"

using namespace std;

#ifdef USE_MKL_SCALAPACK

	#include "mkl_types.h"
	typedef MKL_INT int_type;
	typedef MKL_Complex16 dcomplex_type;

#else

	struct dcomplex
	{
		double r;
		double i;
	};
	typedef int int_type;
	typedef dcomplex dcomplex_type;

#endif


extern "C"
{
	int_type numroc_(int_type* n,
				int_type* nb,
				int_type* iproc,
				int_type* isrcproc,
				int_type* nprocs);

	void  descinit_(int_type* desca,
					int_type* m,
					int_type* n,
					int_type* mb,
					int_type* nb,
					int_type* irsrc,
					int_type* icsrc,
					int_type* context,
					int_type* llda,
					int_type* info);

	int_type Csys2blacs_handle(MPI_Comm i_mpiCommunicator);

	void Cblacs_gridexit(int_type context);

	void Cblacs_gridinfo(int_type context,
						 int_type *nprow,
						 int_type *npcol,
						 int_type *myrow,
						 int_type *mycol);

	void Cblacs_gridinit(int_type *context,
						 char *order,
						 int_type nprow,
						 int_type npcol);

	void Cpzgemr2d(int_type m,
				   int_type n,
				   dcomplex_type* ptrmyblock,
				   int_type ia,
				   int_type ja,
				   int_type* ma,
				   dcomplex_type* ptrmynewblock,
				   int_type ib,
				   int_type jb,
				   int_type* mb,
				   int_type globcontext);
}

int_type c_zero(0);
int_type c_one(1);
char c_blacsGridDefaulOrdering('R');

/////////////////////////////////////////////////////////////////
// Matrix sizes
int_type numberGlobalDataRows(15000);
int_type numberGlobalDataColumns(numberGlobalDataRows);

/////////////////////////////////////////////////////////////////
// BLACS grid structure
int_type numberOfTasksPerRow(2);
int_type numberOfTasksPerColumn(2);

/////////////////////////////////////////////////////////////////
// Block cyclic distribution parameters
int_type numberOfRowsInDataBlock(64);
int_type numberOfColumnsInDataBlock(64);


int main(int argc, char *argv[])
{

	/////////////////////////////////////////////////////////////////
	// Initialize MPI environment
	int rank, size;
	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);
	const bool isMasterTask(rank==0);


	/////////////////////////////////////////////////////////////////
	// Retrieve global BLACS context for the provided communicator
	int_type blacsContext = Csys2blacs_handle(MPI_COMM_WORLD);


	/////////////////////////////////////////////////////////////////
	// Create master-only MPI communicator
	MPI_Group masterTaskOnlyGroup(MPI_GROUP_NULL);
	MPI_Comm masterTaskOnlyCommunicator(MPI_COMM_NULL);
	MPI_Group mpiCommunicatorGroup;
	int mpiFunctionExitCode(MPI_Comm_group(MPI_COMM_WORLD,
										   &mpiCommunicatorGroup));
	if(mpiFunctionExitCode!=0)
	{
		throw;
	}

	int numberNewCommunicatorTasks(1);
	int newCommunicatorTaskIds[1] = {0};
	mpiFunctionExitCode = MPI_Group_incl(mpiCommunicatorGroup,
										 numberNewCommunicatorTasks,
										 newCommunicatorTaskIds,
										 &masterTaskOnlyGroup);
	if(mpiFunctionExitCode!=0)
	{
		throw;
	}

	mpiFunctionExitCode = MPI_Comm_create(MPI_COMM_WORLD,
										  masterTaskOnlyGroup,
										  &masterTaskOnlyCommunicator);
	if(mpiFunctionExitCode!=0)
	{
		throw;
	}


	/////////////////////////////////////////////////////////////////
	// Retrieve BLACS context for the master task only communicator
	int_type masterBlacsContext(-1);
	if(isMasterTask)
	{
		masterBlacsContext = Csys2blacs_handle(masterTaskOnlyCommunicator);
	}


	/////////////////////////////////////////////////////////////////
	// Set linear system parameters and allocate memory
	dcomplex_type* globalMatrix(NULL);
	if(isMasterTask)
	{
		globalMatrix = new dcomplex_type[numberGlobalDataRows*numberGlobalDataColumns];
	}


	/////////////////////////////////////////////////////////////////
	// Initialize master task only BLACS grid
	if(isMasterTask)
	{
		Cblacs_gridinit(&masterBlacsContext,
						&c_blacsGridDefaulOrdering,
						c_one,
						c_one);
	}


	/////////////////////////////////////////////////////////////////
	// Set global matrix descriptor
	int_type globalMatrixDescription[9] = {0,-1,0,0,0,0,0,0,0};
	int_type blacsFunctionExitCode(0);
	if(isMasterTask)
	{
		descinit_(globalMatrixDescription,
				  &numberGlobalDataRows,
				  &numberGlobalDataColumns,
				  &numberGlobalDataRows,
				  &numberGlobalDataColumns,
				  &c_zero,
				  &c_zero,
				  &masterBlacsContext,
				  &numberGlobalDataRows,
				  &blacsFunctionExitCode);

		if(blacsFunctionExitCode!=0)
		{
			throw;
		}
	}


	/////////////////////////////////////////////////////////////////
	// Fill global matrix
	if(isMasterTask)
	{
		// Note: elements are filled row by row, while ScaLAPACK stores
		// matrices column-major; for this diagonal matrix the order is
		// immaterial.
		unsigned int elementIndex(0);
		for(unsigned int rowIndex=0;rowIndex<numberGlobalDataRows;++rowIndex)
		{
			for(unsigned int columnIndex=0;columnIndex<numberGlobalDataColumns;++columnIndex,++elementIndex)
			{
				if(rowIndex == columnIndex)
				{
					#ifdef USE_MKL_SCALAPACK
						globalMatrix[elementIndex].real = 1.;
						globalMatrix[elementIndex].imag = 0.;
					#else
						globalMatrix[elementIndex].r = 1.;
						globalMatrix[elementIndex].i = 0.;
					#endif
				}
				else
				{
					#ifdef USE_MKL_SCALAPACK
						globalMatrix[elementIndex].real = 0.;
						globalMatrix[elementIndex].imag = 0.;
					#else
						globalMatrix[elementIndex].r = 0.;
						globalMatrix[elementIndex].i = 0.;
					#endif
				}
			}
		}
	}


	/////////////////////////////////////////////////////////////////
	// Initialize BLACS task grid
	Cblacs_gridinit(&blacsContext,
					&c_blacsGridDefaulOrdering,
					numberOfTasksPerRow,
					numberOfTasksPerColumn);


	/////////////////////////////////////////////////////////////////
	// Retrieve current task row and column index
	int_type taskRowIndex;
	int_type taskColumnIndex;
	Cblacs_gridinfo(blacsContext,
					&numberOfTasksPerRow,
					&numberOfTasksPerColumn,
					&taskRowIndex,
					&taskColumnIndex);


	/////////////////////////////////////////////////////////////////
	// Retrieve local number of rows and columns according to block
	// cyclic data distribution and global problem size
	int_type numberLocalDataRows = numroc_(&numberGlobalDataRows,
										   &numberOfRowsInDataBlock,
										   &taskRowIndex,
										   &c_zero,
										   &numberOfTasksPerRow);
	int_type numberLocalDataColumns = numroc_(&numberGlobalDataColumns,
											  &numberOfColumnsInDataBlock,
											  &taskColumnIndex,
											  &c_zero,
											  &numberOfTasksPerColumn);


	/////////////////////////////////////////////////////////////////
	// Allocate memory for local data containers
	dcomplex_type* localMatrix = new dcomplex_type[numberLocalDataRows*numberLocalDataColumns];


	/////////////////////////////////////////////////////////////////
	// Initialize local data descriptors
	int_type localMatrixDescription[9];
	descinit_(localMatrixDescription,
			  &numberGlobalDataRows,
			  &numberGlobalDataColumns,
			  &numberOfRowsInDataBlock,
			  &numberOfColumnsInDataBlock,
			  &c_zero,
			  &c_zero,
			  &blacsContext,
			  &numberLocalDataRows,
			  &blacsFunctionExitCode);

	if(blacsFunctionExitCode!=0)
	{
		throw;
	}

	/////////////////////////////////////////////////////////////////
	// Distribute linear system blocks
	Cpzgemr2d(numberGlobalDataRows,
			  numberGlobalDataColumns,
			  globalMatrix,
			  c_one,
			  c_one,
			  globalMatrixDescription,
			  localMatrix,
			  c_one,
			  c_one,
			  localMatrixDescription,
			  blacsContext);

#if 0

	/////////////////////////////////////////////////////////////////
	// Print local matrix
	for(unsigned int tRowIndex=0;tRowIndex<numberOfTasksPerRow;++tRowIndex)
	{
		for(unsigned int tColumnIndex=0;tColumnIndex<numberOfTasksPerColumn;++tColumnIndex)
		{
			if((tRowIndex == taskRowIndex) && (tColumnIndex==taskColumnIndex))
			{
				unsigned int elementIndex(0);
				for(unsigned int rowIndex=0;rowIndex<numberLocalDataRows;++rowIndex)
				{
					for(unsigned int columnIndex=0;columnIndex<numberLocalDataColumns;++columnIndex,++elementIndex)
					{
						#ifdef USE_MKL_SCALAPACK
							cout<<"("<<localMatrix[elementIndex].real<<","<<localMatrix[elementIndex].imag<<") ";
						#else
							cout<<"("<<localMatrix[elementIndex].r<<","<<localMatrix[elementIndex].i<<") ";
						#endif
					}
					cout<<endl;
				}
				cout<<endl;
			}
			MPI_Barrier(MPI_COMM_WORLD);
		}
	}
#endif


	/////////////////////////////////////////////////////////////////
	// Release memory for local data containers
	delete [] localMatrix;


	/////////////////////////////////////////////////////////////////
	// Free BLACS master only context
	if(masterBlacsContext != -1)
	{
		Cblacs_gridexit(masterBlacsContext);
	}


	/////////////////////////////////////////////////////////////////
	// Release global matrix memory
	if(globalMatrix != 0)
	{
		delete [] globalMatrix;
	}

	/////////////////////////////////////////////////////////////////
	// Free MPI resources
	if(masterTaskOnlyCommunicator!=MPI_COMM_NULL)
	{
		MPI_Comm_free(&masterTaskOnlyCommunicator);
	}
	if(masterTaskOnlyGroup!=MPI_GROUP_NULL)
	{
		MPI_Group_free(&masterTaskOnlyGroup);
	}

	/////////////////////////////////////////////////////////////////
	// Free BLACS context
	Cblacs_gridexit(blacsContext);


	/////////////////////////////////////////////////////////////////
	// Finalize MPI environment
	MPI_Finalize();

	return 0;
}

Please try using the latest version of MKL, 11.2. Some changes have been made there to eliminate this sort of problem.
