What collateral/documentation do you want to see?

Do you have questions that you are not finding answers to in our documentation?  Do you need more training or source code examples, and on what topics specifically?   Help us understand what's missing so that we can make sure we develop documentation you care about (what is important, and what is merely nice to have)!   Thank you


I would hope to get better documentation on the exact semantics of the offload pragmas/declspecs/_Cilk* keywords. For instance, _Cilk_shared on a variable means that the value of the variable is shared, while _Cilk_shared on a function means that the code is compiled for host+accelerator. _Cilk_shared on classes apparently means that all class methods are _Cilk_shared and the values of the members of each instance are shared - as long as they can be copied by binary copy.
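To make the three cases concrete, here is a rough sketch of how I read the syntax (the names are invented for illustration, and my understanding of the semantics may well be wrong - which is exactly why a more formal description would help):

    _Cilk_shared int counter = 0;          // variable: its value lives in memory that is
                                           // shared between host and coprocessor

    _Cilk_shared void bump(int by) {       // function: the code is compiled for
        counter += by;                     // both host and accelerator
    }

    class _Cilk_shared Accumulator {       // class: methods are treated as _Cilk_shared and
    public:                                // the members of each instance are shared, as long
        int total;                         // as the object can be copied by binary copy
        void add(int v) { total += v; }
    };

    int main() {
        _Cilk_offload bump(5);             // run the shared function on the coprocessor
        return 0;
    }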

The discussion of these in the compiler documentation and elsewhere is based on examples rather than on a more formal description. Some issues where I still don't have a clear understanding: If a base class is declared for offload, what happens to derived classes? Under which circumstances will the compiler issue the warning "warning #2571: variable has not been declared with compatible "target" attribute" (and when is it important to listen to it)? In which cases will the compiler not emit such a warning even though a target attribute would in fact be necessary?

Just some things that currently bother me.

Georg

We'd like to see some more specific hardware requirements for the host system.  We weren't aware that we would need a special motherboard/BIOS to support the card (due to our lack of research and hastiness :), so now we're trying to find out which options will work for us.

Any information you have about what motherboard/BIOS features are necessary would be very helpful.

@Greg:  thank you -- we are now looking into what you are asking for

@Justin:  Does this link help at all? http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-where-to-buy.html    If not, what information would be more helpful?    If this URL does the trick, then maybe we just need to make a more visible link to it from the mic-developer site......

 

Belinda, 

I had actually seen that page in my searching, and it does help to some extent.  The problem right now is that I don't think many (if any) of those vendors are actually offering Xeon Phi solutions currently.

What would be more useful is a more generic list of BIOS/motherboard requirements so that we could look into putting a server together that would work.  It may be that the Intel boards listed on the Xeon Phi product page are the only ones that are currently compatible, but I don't know for sure if that's true.

Hi Justin,

If you have an empty PCIe 2.0 x16 slot on your motherboard and the power supply has the right specs (the Xeon Phi 5110P needs 225 W), you are in business.

Dragos

Quote:

Justin Bennett wrote:

Belinda, 

I had actually seen that page in my searching, and it does help to some extent.  The problem right now is that I don't think many (if any) of those vendors are actually offering Xeon Phi solutions currently.

What would be more useful is a more generic list of BIOS/motherboard requirements so that we could look into putting a server together that would work.  It may be that the Intel boards listed on the Xeon Phi product page are the only ones that are currently compatible, but I don't know for sure if that's true.

Justin: to add to what was said above,  the motherboard BIOS must support memory mapped I/O above 4GB (large Base Address Register support per the PCIe specification).

 

Some additional suggestions where the current docs are not as useful as they could/should be:

  • The documentation is particularly weak on C++ examples, and some important aspects such as vectorization of STL vectors, alignment of class data, Cilk Array Notation and STL are barely mentioned/considered.
  • Organization of larger projects. For instance: Should I add "#pragma offload_attribute(push, target(mic))/pop" to my .h files (declarations), or would it be better to add it only to the .cpp files (implementation/definition)? What about templates (which usually need their definition in the .h)?
  • How to handle cases where icc wrongly believes it has to compile for offload (internally doing 2 compiler runs), but where the mic side of the compile fails due to missing .h files (e.g. Qt files).

Georg

Hi Georg,

We appreciate your feedback. However, we don't totally understand your third concern: How to handle cases where icc wrongly believes it has to compile for offload (internally doing 2 compiler runs), but where the mic side of the compiler fails due to missing .h files (e.g. Qt files).

Would you please elaborate on your concern? Thank you.

Hi,

I am having problems configuring the card. I get 3D and 3E post codes in dmesg during card initialization and would like to find information about such codes and their meaning. I have posted a support request and some messages on the forum about the issue, but no response yet :-(. If there is a problem with the card, I must know as soon as possible so that I can return it...

Thanks in advance,

Jose

I can't speak to what might be the issue with the card but post codes are available in the MPSS readme (here: http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss) that may help with their meaning.

What is your Premier issue #?  (I will ping the HW support side)

Quote:

Kevin Davis (Intel) wrote:

I can't speak to what might be the issue with the card but post codes are available in the MPSS readme (here: http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss) that may help with their meaning.

What is your Premier issue #?  (I will ping the HW support side)

Thanks for your response. I posted a support issue but just received an acknowledgement mail without any issue #. By the way, I am also unable to find the post codes in the various files located under the download page you provided.

best regards,

Jose

Hi Jose. I'm very sorry for the goose chase. I did not realize the readme files for the public MPSS release were different than internally available versions.

If you purchased through an OEM then please contact them regarding the problem you described w/the card in your other thread. If not then let me know and I'll see how I can help.

Quote:

loc-nguyen (Intel) wrote:

We appreciate your feedback. However, we don't totally understand your third concern: How to handle cases where icc wrongly believes it has to compile for offload (internally doing 2 compiler runs), but where the mic side of the compiler fails due to missing .h files (e.g. Qt files).

Would you please elaborate on your concern? Thank you.

Apparently, icc launches the MIC side of the compiler if it finds something like a "#pragma offload", "_Cilk_shared", an offload attribute, etc. somewhere in the sources it compiles for the host. If I add those annotations to .h header files, this triggers MIC compilation for all .cpp files that include one of those headers - even if the actual code (.cpp) does not have a single place where the MIC is used.

I found this compiler behaviour quite annoying (it may be suitable for small projects). In particular, in C++ you sometimes have to put those annotations into headers, e.g. for header-only implementations or templates. Also, the compiler will complain about methods/classes not available for offload if an offloaded method accesses a class that has not been declared offload (and declaration usually happens in the .h file...).  Plus there are cases where the compiler wrongly assumes that an offload version of a method is required (I opened a Premier issue for this). Unfortunately, adding attributes to headers was the first approach I took, and it failed miserably. This is a kind of catch-22....
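To illustrate (the file and class names are invented): a header like the one below is already enough to trigger the MIC-side compilation in every .cpp file that includes it, even a file that never offloads anything.

    // widget.h (invented example)
    #pragma offload_attribute(push, target(mic))
    class Widget {
    public:
        double sum(const double* data, int n) const;   // wants an offload version
    };
    #pragma offload_attribute(pop)

Any source file that merely includes widget.h now gets a second, MIC-side compiler pass; if that file also includes headers with no MIC port (Qt, in my case), the MIC pass fails even though the file itself contains no offload code. The compiler-option approach described in the next paragraph was my way around this.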

In the end, I changed my makefiles to add the "-no-offload"/"-offload-attribute-target=mic" options, with only very few changes to the .cpp and .h files.

Georg

Belinda,

I've been looking for the source code for the Black-Scholes and Monte Carlo European options case studies, as I would like to compile and run the codes as a learning exercise in how to use the MIC architecture. I think it would be a good idea to have some example codes available to download and play with, to get a hands-on feel for programming on the Xeon Phi.

David

Quote:

kankamuso wrote:

Quote:

Kevin Davis (Intel) wrote:

I can't speak to what might be the issue with the card but post codes are available in the MPSS readme (here: http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss) that may help with their meaning.

What is your Premier issue #?  (I will ping the HW support side)

Thanks for your response. I posted a support issue but just received an acknowledgement mail without any issue #. By the way, I am also unable to find the post codes in the various files located under the download page you provided.

best regards,

Jose

Jose - FYI - the readme files for the new MPSS Gold Update 2 (posted 3/7/2013) now include the post codes.

Hi,

Speaking for a large set of the scientific community, I'd like to see more Fortran examples.  Let's face it, we all learn best by example; detailed documentation is pretty much only useful after you have a pretty good idea of what you're doing.  One of the attractions of the Phi is not having to get too close to the metal (we gave up on converting our code to a GPU), and Intel has been using that compatibility to sell the concept of the Phi.  Most scientific code is constantly changing, and we scientists would like to do more science and less programming.  Fortran is more accommodating to heavy computer users who don't want to be full-time coders.  Some disciplines use it more than others, but there are a lot of HPC folks, like my small group, who are using several hundred CPU hours/week and are moving up to several thousand/week. Some of my Fortran-touting colleagues use hundreds of times that.

thanks,

Quote:

Bruce Weaver wrote:

Hi,

Speaking for a large set of the scientific community, I'd like to see more Fortran examples. 

The advanced labs for the training videos, including Fortran versions, should be making it out to the web late this week or early next.  It's a start at least. Luckily Fortran is a good fit for the Intel(r) Xeon Phi(tm) coprocessor. But yes, we do need to put more Fortran out there.

I would like to see two things:

(1) Documentation of the MSRs and/or PCI Configuration space registers used for the memory controller performance counters.  SEP/VTune provides a "memory bandwidth" measurement set, but I have not found out exactly how this is being measured.

(2) Documentation of the physical address to distributed tag directory hash.   This does not matter for bandwidth (since contiguous addresses are spread across all of the distributed tag directories), but it can be very helpful for choosing addresses to be used for synchronization variables.   I have measured a 3:1 variation in cache-to-cache intervention latency as a function of address when sharing a cache line between two (adjacent) cores (depending on the distance between the cores and the distributed tag directory for each cache line address).   Being able to determine this locality programmatically should allow for the implementation of significantly more effective synchronization constructs.

John D. McCalpin, PhD "Dr. Bandwidth"

Hi, I would find it useful to see some examples that combine MPI processes with OpenMP (or similar) thread generation (and in Fortran).  On HPC clusters I've used in the past, I just matched the # of cores to the # of processes, but it appears the Phi benefits from running fewer processes with many threads.  Perhaps some of these examples are out there already, as I've only just started looking.

Thanks!

For those of you who asked for Fortran examples, please take a look at http://software.intel.com/mic-developer > Training, under "Code Samples"; we recently published some Fortran labs there.

If you are looking for other (and more specific) examples, let us know.

Thank you all so far for the feedback; we are working down the list of requests and will use this thread to report where we are with what you asked for.

In addition to the default installation, it would be nice to have an example of a very minimal root filesystem that keeps most utilities on an NFS share, as well as a tarball with more utilities.

For example, in addition to "base", "common" and "micX" in /opt/intel/mic/filesystem, it would be nice to have an "nfs" directory populated with everything ever compiled for MIC, such as gdb, which currently resides elsewhere. It would also be nice to have a more recent version of perf.

Also, I could not find swapon/swapoff for the Xeon Phi; it would be very nice to have these available.

thank you

Vladimir Dergachev

I would really like to see documentation and/or instructions on how to set up BOINC to run on the Phi. If I had that, I would buy one. I know several others who are interested, but we can't seem to find any info on how to make it work, or even whether it works with BOINC or Folding@home.

BOINC is used to crunch numbers for disease research, such as cancer, as well as dozens of other projects. The potential of the Phi is enormous, since most projects are still CPU bound.

 

We would be pleased if more information on a variety of topics were available:

  • The details of the SBOX control registers. In the appendix of the System Software Developers Guide (May 2013), hundreds of registers are listed, but their contents are not revealed. Reading values such as temperatures is possible but meaningless, since one cannot decode them.
  • The details of the DMA controller, and some examples of how to program it. A variety of applications would gain from direct access to the DMA engine on the MIC.
  • Efficient ways for thread synchronization.

Hi!

A better MPSS source code organization would be welcome, or even an MPSS source organization HOWTO document.

Many thanks.

Javi Roman

I want to find documentation on how to get CPU_CLK_UNHALTED and INSTRUCTIONS_EXECUTED on the MIC. Thanks.

--GHui

Hi, I would like to find the documentation regarding the physical numbering of cores on the Intel MIC. I mean, which physical core is adjacent to which one on the chip.

Desperately looking for demo code.

My job as a test engineer required me to do some playing around with the Intel Xeon Phi cards (3120A and 5110P) on our company's recent workstations. My focus is definitely not parallel computing, so there I was, realizing that "a fool with a tool is still a fool". :-)

When I had to look into CUDA support for the NVIDIA GPUs, there were nice demos to download, so that even without being into parallel computing you could easily push the load on the card to 100% and be impressed by the GFlops the tools were reporting.

On the Intel Xeon Phi there was nothing like that. The nicest thing I found was the monitoring program, which at least showed how the temperature on the 5110P was rising because the cooling was too poor in the first tests. Meanwhile, I found the code samples on http://lotsofcores.com/article/code-samples-now-available so with some of the sample programs there I can put 100% load on the card and do my tests. But I'm really missing some sort of "eye candy" that easily shows off the power of the coprocessor card.

Regards

Rainer

-- Rainer Koenig, Augsburg, Germany

@Pinak P. I don't believe there is any documentation on which core is where on the chip nor is there likely to be. But is your concern more one of information on how your program is distributed around the ring interconnect?

@Rainer K. There are sample programs for the coprocessor provided with the compilers ( /opt/intel/composerxe/Samples/en_US/C++/mic_samples, /opt/intel/composerxe/Samples/en_US/Fortran/mic_samples) and performance workloads provided with the MPSS (not installed by default but you can find the directions for installing them in the readme-en.txt file that comes with the MPSS.) Are these the sort of things you were looking for? If so, then maybe what we need to do is make them easier to find. If not, can you let us know what sort of things we should be looking at adding?

@Frances What do you mean by "But is your concern more one of information on how your program is distributed around the ring interconnect?" I did not exactly follow that. I am working on MPI programs and am looking at the latencies when cache lines are transferred from one core's L2 to another. I was looking for information like what is available for the Cell processor: http://www.ibm.com/developerworks/power/library/pa-cellperf/ (figure 4). I would also like to know the maximum bandwidth the 'ring interconnect' is capable of.

@GHui 

You can use PAPI to get these. On my GitHub there's a wrapper for PAPI that can easily be used in offload or native mode to record CPU_CLK_UNHALTED and INSTRUCTIONS_EXECUTED; check the readme and example here for more info:

https://github.com/TimDykes/IntelMIC/tree/master/papi_wrapper
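For reference, the core of such a measurement with plain PAPI looks roughly like this - a minimal sketch, assuming the two native events are exposed under exactly these names on your PAPI build (check with papi_native_avail on the card):

    #include <papi.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) std::exit(1);

        // Native coprocessor events - names as requested above
        char cyc[]  = "CPU_CLK_UNHALTED";
        char inst[] = "INSTRUCTIONS_EXECUTED";
        int ev_cyc, ev_inst, set = PAPI_NULL;
        if (PAPI_event_name_to_code(cyc,  &ev_cyc)  != PAPI_OK ||
            PAPI_event_name_to_code(inst, &ev_inst) != PAPI_OK) std::exit(2);

        PAPI_create_eventset(&set);
        PAPI_add_event(set, ev_cyc);
        PAPI_add_event(set, ev_inst);

        long long counts[2];
        PAPI_start(set);
        /* ... code under measurement ... */
        PAPI_stop(set, counts);

        std::printf("CPU_CLK_UNHALTED      = %lld\n", counts[0]);
        std::printf("INSTRUCTIONS_EXECUTED = %lld\n", counts[1]);
        return 0;
    }

The wrapper linked above takes care of this kind of boilerplate so it can be dropped into offload or native builds directly.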

Quote:

GHui wrote:

I want to find documentation on how to get CPU_CLK_UNHALTED and INSTRUCTIONS_EXECUTED on the MIC. Thanks.

Here are a couple of points I had issues with while getting my Phi cards configured, etc.  I have not read all of the available documentation, so some of these may already be explained, but they were not in the readme file that came with the MPSS.  Also, my main problems were associated with Linux, but others may have the same issues, so I think it's worth documenting.

  1. If you want to use all the Phi cards on a given piece of hardware, then you need to set up the appropriate bridge (at least if using MPI).  Frances was very helpful pointing me to the relevant documentation, but it would have been nice if there were a better pointer to that documentation in the readme.txt file that is supplied in the MPSS package.
  2. I still have issues getting the processor and coprocessors communicating after some major change.  This is a basic Linux problem for me, as I don't use Linux that often.  While it was pointed out (somewhere) that you need to be able to ssh from CPU to Phi, Phi to CPU, and Phi to Phi, it still seems to plague me when I need to set it up (our lab just changed all IPs to a private network, so I just had to do it again).  Even after public keys are copied around, getting the known_hosts file properly set up also causes issues (I currently just ssh from each node to all the others so that the file gets set up correctly at each location).  There must be a better way of setting this up, but this is where my lack of Linux experience slows me down.
  3. There are a few postings about which Intel libraries need to be copied over to the Phi cards in order for MPI programs to run.  These postings are useful (but not complete, as I had to add a few libraries after getting runtime errors), but this should be better documented.  I'm sure setting up NFS would be a better way to go; I'm still working on that.
  4. Perhaps it's just me, but documentation, postings, etc. seem scattered, so you have to hunt, and sometimes you don't know exactly what you're hunting for.  It would be nice if there were better pointers in the MPSS readme file (that may be too difficult though).
  5. Finally, it would be nice if you could directly search this forum, like you can on other Intel forums.  Limiting a Google search to software.intel.com turns up too much unrelated info.  Many people's questions are already answered, but the answers are nearly impossible to find the way the forum is currently set up.

I will add that this forum is great!  There are many people here who are very helpful and respond amazingly quickly, so thanks (I did get everything working... at least so far :)

cheers,

-joe

Joe, we really appreciate your feedback and we will continue to try to improve the organization of it all.  

What I _can_ tell you is that item (5) (being able to search this forum alone) is now implemented -- this was a recent change, and perhaps could be made more visible -- if you go to http://software.intel.com/en-us/forums/intel-many-integrated-core and look at the bottom of the initial "Announcements" section, there is a "Search within this Forum" link.   I will talk to our web designers to see if we can make that a bit more obvious!

Hi - I agree that the Intel fora are great. I've used them a lot and am also building up tips at http://highendcompute.co.uk/XPhi, so ideas are welcome. Yours, @highendcompute

High End Compute

Hi Belinda, I would like to know more about using external NFS servers -- the documentation seems to imply that you can only NFS mount onto the MICs something that is shared from the local server.  Is that only for an internal bridged network?  For example, if I set up an external bridged network, does the MPSS (v3.1 on Linux) software support NFS mounting something external to the server?  I will be testing this, but I have to get IPs allocated in order to do so, and that takes some time.

Thanks, Sally.

Sally,

If you set up an external bridge, then you should be able to NFS mount from any reachable server. I will make a note that the documentation on this needs work.

Frances 

Frances,

Thanks - that's good news.  I have another question then :-) 

For the external bridged network, how do I specify a particular IP address for each MIC?  The commands in the doc (p. 100) indicate that you have to use consecutive IPs (i.e. the last octet is the bridge, mic0=bridge+1, mic1=bridge+2 - in the example, node 0 is 2, 3 and 4).  Is that the case, or can you specify completely different IPs?

Thanks, Sally

Sally,

As long as everyone is in the same subnet, no, I believe they don't need to be consecutive. If you use a single "micctrl --network" to set all the cards at once, then they will be assigned consecutive addresses by default. But you should be able to set them to individual values by using a different "micctrl --network" command for each card. If that doesn't work for you, let me know. It might be best to start a new forum issue at that point though, so that it will be easier for us to track the issue.

Frances

Belinda, I'd like to see more info on best practices for OpenCL on MIC.  As an example, I'm considering a Phi coprocessor for an OpenCL app, and I'm wondering about local memory caching (like one does for tiled GEMM to minimize memory bandwidth on GPUs). But GPUs actually have dedicated hardware memory per compute unit, whereas on MIC I'm guessing there is no such thing, and using __local buffers wouldn't help, but would probably actually hurt.

James,

Thanks for your feedback.

I assume that the following paper answers part of your request: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor

The paper can be accessed through the "training" tab in the main OpenCL XE page: www.intel.com/software/opencl-xe

Were you aware of that OpenCL XE training page? What should be improved with respect to sharing OpenCL BKMs (in content and format)?

Arik

 

Thanks Arik, that's what I was looking for.

I noticed there is a Windows MPSS. Does this mean that if it is running on Windows 7, a program such as Adobe Premiere which has OpenCL support will be able to use the cores of the Intel Phi?

Will Windows 7 show the cores as extra cores in Windows, as it would with Xeon CPUs?

CIder PC www.ciderpc.com

Quote:

Joe wrote:

Hi, I would find it useful to see some examples that combine MPI processes with OpenMP (or similar) thread generation (and in Fortran).  On HPC clusters I've used in the past, I just matched the # of cores to the # of processes, but it appears the Phi benefits from running fewer processes with many threads.  Perhaps some of these examples are out there already, as I've only just started looking.

Thanks!

Depending on what you have in mind, this may be an interesting topic, but one on which there isn't much interest in attempting to make committee decisions about documentation.

I'm not sure how examples would help you on this.  Assuming that your application scales reasonably well both under OpenMP and under MPI, it's mainly a question of trying the combinations.  As you're likely to be running in what is somewhat illogically termed "symmetric" mode (ranks on both host and coprocessor), you need to balance the work so that MPI barriers are reached at about the same time by all ranks.

As far as the MIC side of this is concerned, you are limited by the increased memory consumption of additional MPI ranks, and by the way in which sharing of VPU resources is done by alternating among threads on each core, making it unlikely that you want more than 1 rank on a core.  Useful applications, unlike simple examples, will run out of RAM, depending on your coprocessor model, with a lot fewer than 1 rank per core.  On the other hand, real OpenMP applications, if there is any use of private arrays, will become starved for stack even with the stack set to "unlimited" (which some Intel experts advise against doing), so the number of useful threads per rank is limited.  Of course, simple OpenMP examples will scale to at least 116 threads with the normal "unlimited" stack.

I have worked with applications which showed a pronounced performance peak (running on MIC alone) at 6 ranks of 30 threads (setting KMP_AFFINITY=balanced so as to spread the threads evenly across the cores assigned to each rank).  If your application runs best on host with 1 rank per core, this may pose a problem in that you already have an excessive number of cores passing messages between host and coprocessor even before you engage multiple nodes, and, for MIC to be useful, the performance of a rank with 30 threads ought to exceed performance of individual host cores.  Further, now that host platforms are available with up to 24 cores, adding the coprocessor capability to support 6 more ranks is not so interesting, if you can't take advantage of those coprocessor ranks being more powerful than host cores.

If you are talking about future MIC products which improve cluster performance and don't depend on matching coprocessor and host performance, that's a different story, but not one on which we will have any details in the near future.
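To make the rank/thread structure concrete, here is a minimal hybrid MPI+OpenMP skeleton (shown in C++; the same shape applies in Fortran). The rank and thread counts are not in the code - they come from the launch, e.g. 6 ranks per coprocessor with OMP_NUM_THREADS=30 and KMP_AFFINITY=balanced as in the example above:

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided, rank, nranks;
        // Funneled threading: only the master thread of each rank calls MPI
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        #pragma omp parallel reduction(+ : local)
        {
            // Each OpenMP thread (e.g. 30 per rank on the coprocessor, spread
            // across its cores by KMP_AFFINITY=balanced) does a slice of the work
            local += 1.0;   // placeholder for the real per-thread work
        }

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("%d ranks, %d threads per rank, %g work units\n",
                        nranks, omp_get_max_threads(), total);

        MPI_Finalize();
        return 0;
    }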

 

Hi Harry,

You are right, OpenCL applications can use all the cores available in Intel(R) Xeon Phi(TM) coprocessors. On the Windows host you only see the host cores, but the MPSS provides APIs/tools which allow you to monitor and use all the cores in the coprocessors. Thank you.

I found out that build scripts aren't provided for MPSS 3.1 (I talked to the Yocto guys, who confirm it). Shouldn't build scripts be included in the source release for the GPL components?

@rhn:  Are you trying to build the host components or the entire coprocessor OS?

 

@Belinda: I'm trying to build a portion of the coprocessor OS (namely, the kernel).

@rhn:  I believe this forum thread contains what you are looking for (scroll down to the answer that was given today, March 4 2014)

http://software.intel.com/comment/1781756

 

Thanks for the kernel info.

I'm still interested in more documentation regarding DMA. My troubles with it are two-pronged: the DMA-related registers are undocumented, and the relation of MIC state changes (booting, resetting) to the DMA state is unknown.

So far, I could utilize DMA by inferring register functions from open-source code, but it only works until first device boot.

The registers I found that have some relation to DMA and aren't fully described in the System Programmer's Manual are: DCAR_*, DAUX_LO_*, DAUX_HI_*, DMA_DTSTAT_*, DMA_DSTATWB_HI_*, DMA_DSTATWB_LO_*, DCHERR_*, DCHERRMSK_*, DCR, MarkerMessage_Send. They are mostly the same as the registers listed in /proc/mic_dma_registers_* and /proc/mic_dma_ring_* (sadly, this debug interface is barely useful without knowing the function of each register).

The state changes that modify the DMA state even without kmod involvement are: booting an ELF (it stops the DMA engine; what initialization is required on the device side?) and reset (sometimes it resets the DMA registers, sometimes not?).

This information would save me (and possibly others) a lot of effort analyzing DMA behaviour.

I'd like to see a detailed technical discussion of the vector pipeline and instruction details to help me better understand where I'm using the resources efficiently and where I'm not.  The two books only give a basic idea of what's going on, and the "Xeon Phi Coprocessor Instruction Set Architecture Reference Manual" doesn't seem to include information such as instruction latencies, pairing restrictions, the relation of the pipeline to the hardware "threads", etc.  Ultimately my goal would be to be able to take a sequence of a dozen or so instructions (mostly straight-line intrinsics, maybe a little scalar control) and map out how the vector pipeline is being utilized.

Is there an example available that shows how to offload the MKL DSS solver to Intel(R) Xeon Phi(TM) coprocessors? I see in several presentations that it can be offloaded, but I cannot find an example of how to do it.
