Simple question about (Xeon Phi) card extension

bustaf's picture

Hi
I have just one small, concrete question.
When this card is installed in the motherboard and its module is started and ready,
could you show what is displayed on screen by the command
cat /proc/cpuinfo

In the exchange "processor or coprocessor?" I read
this answer from James Reinders (Intel):

(The difference between a processor and coprocessor is this: a processor does not require
another compute device to be present in a working system.
A coprocessor requires a processor be in the system in addition to the coprocessor.)

I agree, but along these lines I also understand that a tree exists similar to that of a processor:
physical processor on the socket
.. physical cores
   .. logical cores
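
An illustrative sketch of the socket / core / logical-core tree described above. On x86 Linux, each /proc/cpuinfo entry normally carries `processor`, `physical id` and `core id` fields, and grouping on them rebuilds the tree. The function name `cpu_topology` and the `SAMPLE` text are hypothetical and invented for illustration; the sample is not output from a Xeon Phi card.

```python
from collections import defaultdict


def cpu_topology(cpuinfo_text):
    """Group logical CPUs by socket and core from /proc/cpuinfo-style text.

    Returns {physical_id: {core_id: [logical processor numbers]}}.
    """
    tree = defaultdict(lambda: defaultdict(list))
    # Entries in /proc/cpuinfo are separated by blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical = int(fields["processor"])
        # Assumption: fall back to socket 0 / one core per logical CPU
        # when the kernel does not report the topology fields.
        socket = int(fields.get("physical id", 0))
        core = int(fields.get("core id", logical))
        tree[socket][core].append(logical)
    return {socket: dict(cores) for socket, cores in tree.items()}


# Invented sample: one socket, two cores, the first core with two threads.
SAMPLE = """\
processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 1
physical id\t: 0
core id\t: 0

processor\t: 2
physical id\t: 0
core id\t: 1
"""

if __name__ == "__main__":
    for socket, cores in sorted(cpu_topology(SAMPLE).items()):
        print("socket", socket)
        for core, logical in sorted(cores.items()):
            print("  core", core, "-> logical CPUs", logical)
```

On a real machine the same function could be fed the contents of /proc/cpuinfo read from disk.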

Sorry to ask this question; the process described in your documentation is very good, but
it does not correspond to how we want to use the card.
More precisely, I plan to use a bridging process (TUN driver, bridging driver) on a network card that
has multiple pseudo (dummy) MAC addresses, and to dynamically steer the heterogeneous loads attached to each
pseudo MAC address. This requires only that a specific core be tied to a specific IP address used as an index.
It is a very simple, compact process that I already use (internally or externally) without this
Phi card added to the machine.

Regards  

Tim Prince's picture

/proc/cpuinfo tells you nothing about whether an Intel(r) Xeon Phi(tm) co-processor is present or active.
lspci tells you whether it is basically powered on.
Only the additional utilities in the mpss installation give useful information.

bustaf's picture

Hi
lspci will always answer: it lists the hardware present in a machine whether or not it is ready to work.
You teach me nothing.
Maybe it would be more constructive if you added what lsmod might answer...

Does this card have a data sheet describing its implementation protocol?

I do not work with imposed software over which I do not have complete control.

Maybe also, more fundamentally, you should learn to respect
the conventional forms of politeness in your exchanges.
Regards

robert-reed (Intel)'s picture

You can run an lsmod (I just did to check) but there's not much more it will tell you: I get, among all the unrelated modules, one line describing the "mic" module and its size. Running lspci at least labels the coprocessor device as such, and when the mpss service is started there are protocols started that enable ssh and you can actually run the card and see /proc/cpuinfo there. I won't try to copy the result of that command to this post--it's very large and may contain some details that are embargoed until the imminent public announcement, but it all works and looks just like any other Linux machine but with a whole lotta processors.
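
A minimal sketch of the lsmod check described above, assuming the usual three-column lsmod layout (module name, size, use count). The helper name `loaded_modules` and the sample text, including the sizes, are hypothetical; only the module name "mic" comes from the post.

```python
def loaded_modules(lsmod_text):
    """Return {module_name: size} parsed from lsmod-style output."""
    modules = {}
    for line in lsmod_text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if len(parts) < 2 or not parts[1].isdigit():
            continue
        modules[parts[0]] = int(parts[1])
    return modules


# Invented sample output; the sizes are placeholders, not real values.
SAMPLE_LSMOD = """\
Module                  Size  Used by
mic                   100000  0
e1000e                200000  0
"""

if __name__ == "__main__":
    modules = loaded_modules(SAMPLE_LSMOD)
    print("mic module loaded:", "mic" in modules)
```

As the post notes, a loaded module says little by itself; the card's own /proc/cpuinfo is only reachable once the mpss service is running.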

Also, I don't understand what offense you took from Tim's reply, which seems to me to be rather mild.

bustaf's picture

Hi
About Tim:
I agree that my answer showed some aggression,
but it was nothing against Tim personally;
it is the general attitude of your group toward the public developer community that annoys me.
This place seems like a restaurant that gives you the menu to choose from,
but offers no real service for eating; all the food is reserved for its cooks.
About my question:
I no longer need this information. I have chosen to use another solution that could be more
modern and more promising for the future.
I cannot risk wasting my time on your approach to development management (on the programming side),
which appears to me doomed to produce, once more, a lead parachute.
Regards

robert-reed (Intel)'s picture

I certainly sense your frustration but I'm having some trouble understanding the nature of that frustration or the particular development management issues that you find a waste of your time. We think we have some pretty good solutions here but if you think some other approach is better, it is certainly your choice to take that other path. But the language that separates us also frustrates us as we try to grasp your meaning.

bustaf's picture

Hi
About the frustration that you suspect: do not worry, it does not exist.
For me your card remains just one object among all the others, without real importance.

About the sense of my words, which you seem not to interpret clearly:
I do not want to invest programming time in an object when, I think,
the way you want to direct its functionality has no value to me, or rather to my customers.

I think the technical and functional approach you are trying to implement is not in line
with the new technology requested by the current market.
If you sold this card blank, with its programming left to the customer,
it would perhaps be better, I believe.
Regards

robert-reed (Intel)'s picture

"If you sell this card blank": I suppose we could have chosen to provide the coprocessor without any programming, a blank piece of hardware where ALL the software is the customer's responsibility. I don't think it would be accepted as well as what we did provide: a platform with a full Linux image, capable of running applications either standalone or in one of several hybrid modes including MPI and what we call "offload programming," which makes it much easier for our customers to run their programs on the coprocessor. It is true that this is not a clone of the GPGPU-style architectures favored by our competitors. We think our approach, which does require programmers to consider caches and issues with blocking computation within those caches, will provide better solutions on both host and coprocessor and be much more energy-efficient as well. If you invest more time in understanding our architecture, hopefully you will come to the same conclusion.

bustaf's picture

Hi

To confirm the value of your arguments,
could you compile the PostgreSQL 9.2.2 database sources in two ways to show the difference?
After ./configure and make,
run the command: time make check

First with the card operational in the machine.

Second without the card in the machine (only the GNU gcc compiler).

This test executes its tasks in parallel and could confirm the value of your arguments more concretely.
The xvidcap package lets you record and show a video of the two tests.
Video is better for showing each step of the 131 tests.

This test will easily demonstrate to customers the real value of your card, which is programmable only through your services.

About your supposition on the GPU side:
I don't use a GPU and I don't plan to use one,
but you have some tests here:
http://wiki.postgresql.org/wiki/PGStrom

Regards

bustaf's picture

Hi
I have some customers waiting for the results of the test.
I see that you have not answered me...

Running this test takes less than 15 minutes, including the time to download the sources.

Without an answer, the customers ("frustrated", to reuse your word) could think that your product, which hosts your imposed in-house Linux, is not really as effective as you suppose.

Regards

robert-reed (Intel)'s picture

You talkin' to me?  Sorry, I was on vacation for most of the month of December.  I have only just seen your replies.

I also think that you might not understand the basic nature of this device.  It is not an "accelerator."  It is a coprocessor.  It will not speed the operations occurring on its host processor by its mere presence.  You can get the coprocessor to participate in host-side computations through the explicit modification of your source code to take advantage of the "offload" extensions available in the Intel C++ and Fortran compilers, for parts of the computation that are highly parallel and explicitly identified.  Without those code changes to marshal data to the coprocessor and explicitly call functions compiled for both host and coprocessor, I would not expect to see any changes in performance.  These code changes appear as pragmas or directives, as with OpenMP, easier than the changes necessary, say, to port to a GPU but of a similar nature, to draw together similar computational work.  Or you can crosscompile programs on the host and run them natively on the coprocessor, whose main features are lots of cores, each with extra-wide vector units for high numerical throughput.  Programs requiring configuration utilities such as ./configure can't generally be crosscompiled, but our cluster teams have been able to natively compile many packages on the coprocessor directly.  If the 131 tests you describe are purely parallel loads, you might see a difference between running on the host and running natively on the coprocessor, but understand that many programs have a large serial component, which generally will run much faster on the big-core host than on one of the small-core coprocessor threads.  So then it would likely turn into a test of the serial/parallel balance in the 131 tests.
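
The serial/parallel balance mentioned above can be sketched with a simple Amdahl-style model. This is only a rough illustration: the function name `run_time`, the core counts and the relative core speeds are all invented assumptions, not measurements of any host or coprocessor, and the model ignores vector units, memory and data transfer.

```python
def run_time(serial_fraction, cores, core_speed):
    """Amdahl-style time for one unit of work.

    serial_fraction: share of the work that cannot be parallelized (0..1).
    cores:           number of cores sharing the parallel part.
    core_speed:      relative single-thread speed (1.0 = one host core).
    """
    parallel_fraction = 1.0 - serial_fraction
    serial_time = serial_fraction / core_speed
    parallel_time = parallel_fraction / (cores * core_speed)
    return serial_time + parallel_time


# Hypothetical machines: a few fast cores versus many slow cores.
HOST = {"cores": 8, "core_speed": 1.0}
COPROCESSOR = {"cores": 60, "core_speed": 0.25}

if __name__ == "__main__":
    for serial in (0.0, 0.05, 0.5):
        host = run_time(serial, **HOST)
        copro = run_time(serial, **COPROCESSOR)
        faster = "coprocessor" if copro < host else "host"
        print(f"serial={serial:.2f} host={host:.3f} "
              f"coprocessor={copro:.3f} -> {faster} is faster")
```

Under these assumed numbers the many-core device wins only when the serial share is very small, which is the point about the test becoming a measure of serial/parallel balance.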

Despite all that, I could make some effort to come as close to your proposed tests as I could given adequate time.  Unfortunately, my boss has other plans, so I don't think I can accept your challenge.  And given that you request video confirmation of the results, in line with your previous skepticism, I doubt I would get much satisfaction from you for the effort.  I can only suggest that you learn more from press reports and such as they become available.  Hold onto your skepticism but also try to keep an open mind.  Perhaps someday soon these devices will become ubiquitous enough that you may get a chance to play with one directly.

bustaf's picture

Hi
I think it is you or your team who are closed-minded in your approach to this card.
You want to impose a closed system stored in the NOR flash, but I see that you are not even able to improve a single database engine (written entirely in C/C++, with low-level backend access)...
I have almost 30 years of C/C++ programming and I understand perfectly how this card works.

You have added a ton of literature, but where is your real innovation?
If I summarize the tools you propose:
The Intel compiler is a clone of the GNU compiler; it is not standalone and depends entirely on it.
MPI is a clone of MPICH2.
OpenMP is available to the public with the GNU compiler.
The majority of the utilities that you use come from the GNU community.
And to add to the closure, you have also excluded Debian and its child Ubuntu...

What I know is that if we could install our own system in the NOR flash of this card, we could greatly improve the results of this database engine with new programming, and not with literature only.
I think that only the hardware side of this card has any value or interest for us.
Your imposed system and software are of no interest.
We are able to do much better than you, and something more appropriate for our customers...

I agree, I am provoking you a little, but I do not seek to devalue your card; I am the first to defend Intel hardware, which is my working tool.

I am not ready to open a free or easy way into a potential new market of low-power servers for the competitors.
I prefer to stay on my guard; if I remember... I have already seen your exploits when you managed the mobile side with Linux.

When I find time I could add some videos so that you understand better whether I am on Intel's side or not.
Maybe, at the same time, you and your team who manage this card could take a lesson in Unix and Linux
system programming.
(I can add a video with a sample OpenMP source built with your Intel compiler and with the GNU compiler; you can compare the times...)

I forgot...
The video test I proposed is only for seeing every step; it is not for certifying exact times. I trust your answer even without videos.
   
Regards

The following complement was added.
To start, I add a first video (make check of PostgreSQL) with 1 fork and with 20 forks;
you can observe the difference as the size of the asynchronous test groups increases.

(Sorry, it is a very small 2-core machine; it is the only personal one I could find free.)
Debian 7 does not have videocap in its package database by default;
I also had to compile several external source libraries for it to work correctly.
I have several big customer machines but I cannot compile on them (some libraries are at beta stage),
and there is also the Intel compiler license (non-commercial) that I want to use for future videos.

I used the SWF format, which I have tested as working on W8. I can make AVI or another format,
but, like a badger, I added a protected codec (a flag in the ./configure of some libraries),
so VLC must now be installed for it to work correctly on W8...

I hope it works for everyone; I am not very experienced in the video field.
Please confirm to me whether this first small test video works correctly for you.

I have added an Ogg version so it works in the browser, but I think I must increase the frames per second.
I have also found 3 old Xeon servers in the trash with which to build
a 24-processor cluster, but I don't know if they are all complete...

After testing the Ogg format (HTML5),
I see that only Firefox and Chrome can open my video directly in the browser (on Linux and Microsoft).
I will make all the other videos in Ogg; I think this is the best compromise.
(Finding a better solution requires free time...)
(ffmpeg -h full shows a plethora of options...
it can scroll the length of one or two complete new rolls of toilet paper,
and it is still not finished...)

I have added a new video for you.
I am currently very busy elsewhere; I will add more explanation later.
Regards

Attachments:

Attachment          Size
test-0.swf.bz2      15.43 MB
test-pg.ogg         10.38 MB
vidtorob.ogg        165.08 MB

robert-reed (Intel)'s picture

Thank you for all the effort you have made to communicate with me.  I read through your latest note several times, and watched the videos (well, I watched the small one and then watched the first part of the long one but I found it mostly incomprehensible manipulations of the user interface).  I fear that I am only understanding about 30% of what you are trying to say.  The short video appears to demonstrate what you've asked me to run on the coprocessor, but looking at that output, the tests look like they are measuring Conformance, rather than Performance.  I have resisted taking the time to duplicate the experiment because I did not want to give a false impression of the performance of an application which has not been tuned for this architecture.  I am less concerned about running a conformance test, though I still have the issue of finding some time to do the work.

However, I was not able to unpack your bzip2 archive, getting instead errors that started out like this:

$ tar xjf ~/Downloads/test-0.swf.bz2
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive base-256 value is out of off_t range
tar: Archive contains `\301\200\r\006\001\260\003\002' where numeric mode_t value expected
tar: Archive base-256 value is out of time_t range
tar: Archive contains `\030@\006\001\250\030\0\320' where numeric uid_t value expected

You ask whether our coprocessor  is "able to improve" a "database engine."  I am not sure that it would speed up such an operation.  This coprocessor is really intended to accelerate numeric operations in its vector units, with just enough CPU associated with each of those vector units to handle control logic and marshal data to feed those vector ALUs.  Object-oriented database management does not seem to me to be an application that uses a lot of vectors.  I imagine that such an application would do a lot of individual object fetching and field extraction, and while the presence of hundreds of threads might mean that lots of such object manipulations could happen at the same time, the simplicity of the cores would likely not provide the expected speed increase.  Even worse, that many simultaneous threads would probably not be well tuned for the thread locks and critical sections that must exist in postgres to provide thread-safe multi-core operation.

bustaf's picture

Hi
What I want you to understand from the video is that the new network programming is now mounted behind an interface (asynchronous, wrapped on service port 80...).
I understand that this video is difficult for you to follow when I observe the prehistoric way
you propose, using SSH with the command line...
With a simple browser I can control the whole system
whether I am on W8, Android, AIX or an Apple system, from home or from outside; your card could represent for me only an insignificant
part of the complete system.
I know that when I build a Linux system I do not impose a downgrade so that it stays aligned with the kernel in the NOR flash...
when an upgrade from the external repositories is required.
As I see that you are not able to understand network derivations across several different database engines,
I am passing you the sample OpenMP source used in the video.
Maybe you could run a timing test, with and without your card, with GNU and ICC.
The source is very simple; you can easily modify it to suit your card better.
(I wrote it a long time ago for my evaluation of OpenMP compared with the low-level approach I use.)
Maybe you could even demonstrate one concrete result, something other than literature...
I have other samples, more complex and more interesting, but they must be linked with a database engine.
I have no time to change them to use mmap or ndbm so that they work standalone...
On the Unix or Linux side we mainly use fork (semaphores, FIFOs, IPC...); maybe we are too conservative on this side.
For PostgreSQL do not worry: we are already able to optimize it greatly with the new Intel hardware, and this
without your card being required...
I advise you to read the PostgreSQL source before you add your speculations, which are very, very doubtful,
and I am mincing my words with the term used...
Regards
Regards

(Rename the file to .cc; your site does not have this MIME type.)

I forgot...
About the tar command that you seem to want to use (tar xjf ~/Downloads/test-0.swf.bz2):
it is not a .tar archive file...
You need only bzip2 -d test-0.swf.bz2, which yields the uncompressed test-0.swf;
that is maybe more appropriate, I think...
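
The distinction above (a bare bzip2 stream versus a bzip2-compressed tar archive) can be sketched with Python's standard `bz2` and `tarfile` modules. The helper name `unpack_bz2` and the in-memory payload are hypothetical stand-ins; the actual attachment is not reproduced here.

```python
import bz2
import io
import tarfile


def unpack_bz2(data):
    """Handle bzip2 data that may or may not wrap a tar archive.

    Returns ("tar", member_names) for a .tar.bz2 and
    ("raw", decompressed_bytes) for a plain .bz2 file.
    """
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as archive:
            return "tar", archive.getnames()
    except tarfile.ReadError:
        # Valid bzip2, but the decompressed bytes are not a tar archive:
        # this is the situation `tar xjf` complained about.
        return "raw", bz2.decompress(data)


if __name__ == "__main__":
    # Stand-in for a single compressed file such as test-0.swf.bz2.
    payload = b"not a tar archive " * 40
    kind, content = unpack_bz2(bz2.compress(payload))
    print(kind, len(content))
```

On the command line the equivalent is simply `bzip2 -d test-0.swf.bz2`, as stated in the post.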

Attachments:

Attachment          Size
gnu-ficat9.cpp      21.09 KB

bustaf's picture

Hi

About the derivations that use several database engines:
your reasoning is on the wrong side.
Open this link and read all the commands listed (for example PQsendQueryPrepared):
(http://www.postgresql.org/docs/9.2/static/libpq-async.html)
To reuse your expression... (If you invest more time in understanding ... hopefully you will come to the same conclusion.)
Even with a single Intel Atom N570 (4 cores) connected over the network to help, you can get improved performance...
It suffices that you are able to program it correctly in a C/C++ backend...

I forgot...
About your remark (Thank you for all the effort you have made to communicate with me.):
my communication effort is not aimed at you specifically, but rather at your group.

Regards
