Function suggestion

One can use e.g. section() to provide read-only "peephole" subsets of containers into map(), but there doesn't seem to be a read-write equivalent for map() results except to create a temporary container and then replace() the results. I've tried this and ArBB does not seem aware enough to eliminate the temporary data copies. The performance loss depends on the fraction of data being processed.

Also the other map() inputs must have the same dimensionality and size, which is kind of a waste.

A trivial example would be red-eye reduction for photos; there's no need for map() to know about the whole photo just to update a tiny portion.
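
For illustration, a minimal sketch of the workaround described above, written against the beta-era C++ API; the exact section()/replace() argument orders and the fix_pixel kernel are assumptions, not actual code:

    #include <arbb.hpp>
    using namespace arbb;

    // Hypothetical elemental function operating on one pixel channel.
    void fix_pixel(f32& out, f32 in)
    {
        out = in * 0.5f;
    }

    // The current workaround: copy the region out, map() over the copy, then
    // write it back with replace() -- two region-sized copies that a
    // read-write section would make unnecessary.
    void fix_region(dense<f32>& photo, usize offset, usize length)
    {
        dense<f32> region = section(photo, offset, length);  // argument order assumed
        dense<f32> fixed;
        map(fix_pixel)(fixed, region);
        photo = replace(photo, offset, length, fixed);       // signature assumed
    }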

Does anyone else foresee a need like this?

Best regards,
- paul


That's a good suggestion. Enhancements to map() are near and dear to our ArBB hearts right now and we've been looking for specific use cases from customers. I will make this a feature request and see if we can get it in the product in the future.

BTW, this idea came from thinking about alternate solutions re. my "Unexpected slowing" post.

It seems map() is the clearest and cleanest "portal" to harnessing Intel MIC. What's needed is a way to subset it for operating on a restricted portion of a container, instead of the entire container. (Something like the "checkout divider" we use at the supermarket?)

- paul

Another map() suggestion: often map() is used to iterate a data set until some termination metric is achieved, e.g. max_reduce() of the incremental change. It seems inefficient to process the calculation and then later reprocess the results to quantify termination, when the results were just in the core(s) a moment ago.
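
A minimal sketch of that pattern, assuming ArBB's _while/_end_while control-flow macros; the relax kernel and the tolerance logic are placeholders:

    #include <arbb.hpp>
    using namespace arbb;

    // Hypothetical elemental update: one relaxation step per element.
    void relax(f32& out, f32 in)
    {
        out = 0.5f * (in + 1.0f);
    }

    // Iterate until the largest per-element change falls below a tolerance.
    // The complaint above: the data has just left the cores after map(),
    // yet abs()/max_reduce() streams it through again to test termination.
    void solve(dense<f32>& x, f32 tol)
    {
        f32 delta = tol + 1.0f;
        _while (delta > tol)
        {
            dense<f32> next;
            map(relax)(next, x);                // update pass
            delta = max_reduce(abs(next - x));  // second pass over the same data
            x = next;
        } _end_while;
    }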

ArBB's primary performance is so good, I've become sensitive to the slightest "motion" of data . . . every temporary copy, every "revolving door" scenario.

- paul

Thanks for the additional suggestion. We're continually trying to improve the performance of map internally and these trials add to the list of ways to improve.

Yes, I'm getting to understand "the map() way" of implementing algorithms and appreciating the relative ease with which map() makes effective use of multicore processors. I have even split a map() containing a non-constant for-loop into two map()s in sequence, the first computing an intermediate value from that for-loop and the second computing the final values, with a 2x speed-up overall. It seems counterintuitive to dice up an algorithm into "atomic" operations to be shared across cores, but it works! Now my call() functions consist mainly of a sequence of map() invocations.
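
Roughly what that restructuring looks like; the stage1/stage2 kernels below are hypothetical stand-ins for the actual algorithm, shown only to illustrate the two-map() sequencing:

    #include <arbb.hpp>
    using namespace arbb;

    // Stage 1: compute the intermediate value that previously came out of a
    // data-dependent for-loop inside a single, larger elemental function.
    void stage1(f32& mid, f32 in)
    {
        mid = in * in;              // stand-in for the loop's result
    }

    // Stage 2: compute the final value from the intermediate one.
    void stage2(f32& out, f32 mid, f32 in)
    {
        out = mid + in;
    }

    // ArBB function: the single big map() split into two map()s in sequence.
    void pipeline(dense<f32>& out, dense<f32> in)
    {
        dense<f32> mid;
        map(stage1)(mid, in);
        map(stage2)(out, mid, in);
    }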

I have a suggestion for the reduction (and possibly scan) functions: either overload them to accept the arguments of section() or have the JIT recognize the xxx_reduce(section(...)) idiom -- my preference as it's future-proof to changes in section(). From my timing experiments, the beta 4 release seems to create an unnecessary temporary container.
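
The idiom in question, as it must be written today; the section() argument order here is an assumption:

    #include <arbb.hpp>
    using namespace arbb;

    // Reduce only a window of a container. The request is for the JIT to fuse
    // the two operators so the section() result never materializes as a
    // temporary container.
    void window_sum(f32& sum, const dense<f32>& v, usize offset, usize length)
    {
        sum = add_reduce(section(v, offset, length));  // argument order assumed
    }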

Speaking of reduce-section, why am I not allowed to call that combination in map() on fixed arguments, e.g. containers? It's a read-only operation computing a single value, bloody useful for digital-filter-like algorithms, and likely much faster than the for-loops I am forced to use.

While I've got your attention, I also wish for access to the SSE lower-precision reciprocal "RCPSS/PS" and lower-precision reciprocal square-root "RSQRTSS/PS" instructions, both for single-precision floating point. Ironically, ArBB has functions with the same names that don't seem to be the quick-and-dirty ones.
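
For context, these are the approximate instructions being asked for, reachable today via the standard SSE intrinsics (plain C/C++, outside ArBB):

    #include <xmmintrin.h>

    // RCPPS / RSQRTPS: fast approximations (roughly 12 bits of precision),
    // much cheaper than a full divide or sqrt when that accuracy is enough.
    __m128 approx_recip(__m128 x) { return _mm_rcp_ps(x);   }  // approximate 1/x
    __m128 approx_rsqrt(__m128 x) { return _mm_rsqrt_ps(x); }  // approximate 1/sqrt(x)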

- paul

Hi Paul,

You can rely on "temporary copies of whole containers" inside the context of an Intel ArBB function (see the note on "space efficiency"). Intel ArBB operators operate by value, both in terms of the return value and in terms of the operators' signatures, i.e. parallel operators do not have an "in-place" signature that modifies a container given in a non-const manner. This is by design, and it allows the JIT compiler to make helpful assumptions instead of making guesses (which are not general).
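
In other words, an operator returns a new value rather than mutating its inputs; a trivial illustration (the function name is arbitrary):

    #include <arbb.hpp>
    using namespace arbb;

    // '+' produces a fresh container; neither input is written through.
    // Only the assignment rebinds 'a' to the result.
    void by_value(dense<f32>& a, const dense<f32>& b)
    {
        a = a + b;
    }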

Hans

Hello Paul,

Both models, i.e. elemental functions (map) and element-wise operations on whole containers, are complementary parts of the Intel ArBB programming model. For example, one can use the map operator as often as appropriate or needed, even within a single Intel ArBB function. Another example is given by the data-reordering operators: having the full scope of whole containers is very helpful, in contrast to kernels in the narrow sense (mapping an elemental function). There is nothing in Intel ArBB which is an issue with respect to the Intel MIC architecture. An Intel ArBB program already scales forward to the Intel MIC architecture; no adjustments or limitations to the program code are needed.

Hans

Hi Paul,

Thank you for your feedback on relaxing the requirements of an elemental function. At least in terms of the syntax, element-wise operators and other operators on fixed arguments of an elemental function would be helpful. On the other hand, using map() already exploits the parallelism, which is one of the reasons not to allow additional parallel operators inside of an elemental function.

Hans

"I have a suggestion for the reduction (and possibly scan) functions:
either overload them to accept the arguments of section() or have the
JIT recognize the xxx_reduce(section(...)) idiom -- my preference as
it's future-proof to changes in section(). From my timing experiments,
the beta 4 release seems to create an unnecessarytemporary container."

Fusing two or more operators (section and scan/reduce in this case) is the way to go for Intel ArBB. We want to provide primitive operators, and we want to allow programmers to combine them. It would be contrary to this to provide overloads of scan/reduce that incorporate section into these operators. Here are two other notes on temporaries, #1 and #2.

Hi Hans,

I will reverse the argument and say that more avenues of parallelism are better! In this case, simply taking advantage of vector hardware (SSE/AVX) instead of resorting to clunky for-loops.

- paul

Hi Paul,

Sounds like you are propagating the Intel ArBB programming model. Thank you!
http://software.intel.com/en-us/articles/when-and-when-not-to-use-_for-loops/
http://software.intel.com/file/34410 (slide #19++)

Hans
