input sample

Hi,

If anyone needs sample input, you can get some at ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/

I think it's better than random files (DNA is probably not just a random series of A, T, C and G).

Sequence names may contain spaces (unsupported by Intel's sample code); you can remove them with:

sed -e 's/\s//g' input_file.txt > output_file.txt


Hi,

I have updated the sample package to take sequence names with spaces
(line 14: replace "s >> name;" with "getline(s,name);").
We will not use names with spaces ourselves, but at least you will be able to handle these files directly.
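
For anyone wondering what the change does, here is a minimal sketch of the difference between the two reads. It is not the sample package's code: the helper names read_name_token and read_name_line are hypothetical, and I am assuming the header text is handed over through a string stream without the leading '>'.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: mirrors the old "s >> name;" read.
// operator>> stops at the first whitespace, so a name with spaces is cut short.
std::string read_name_token(const std::string& header) {
    std::istringstream s(header);
    std::string name;
    s >> name;
    return name;
}

// Hypothetical helper: mirrors the new "getline(s,name);" read.
// std::getline reads up to the end of the line, so spaces are kept.
std::string read_name_line(const std::string& header) {
    std::istringstream s(header);
    std::string name;
    std::getline(s, name);
    return name;
}
```

With a header such as "Homo sapiens chr1", the first helper returns only "Homo", while the second returns the whole name.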

Nice to know!

I was just going to open a thread to point that behaviour out, but you fixed it already ;)

Do you think we should check that input contains only A,C,G,T symbols, as in the reference code? For large input, validation may be expensive.

Thank you in advance,
andrei

The input will at least contain newline characters, but if that is the only character we have to ignore, it would be good to know.

Quoting andreib
Do you think we should check that input contains only A,C,G,T symbols, as in the reference code? For large input, validation may be expensive.

Since the comment in the reference file says "ignore every other chars (newline, etc)", I would probably leave the checks in the code. Also, I don't think that they would have much of an effect on the runtime as every character needs to be compared to '>' anyway, so the character would be in the cache or even in a register and additional checks on the same character shouldn't be very expensive.
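
To make the point concrete, here is a rough sketch of a reader that keeps the check. It is not the reference code: the Sequence struct and the read_sequences and parse_text functions are names I made up, and I am assuming a FASTA-like layout where a line starting with '>' carries the sequence name and every other line carries bases. The filter is just a few extra comparisons on a character that has already been loaded.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: one named sequence.
struct Sequence {
    std::string name;
    std::string bases;
};

// Reads FASTA-like input line by line. A line starting with '>' begins a
// new sequence; on all other lines only A, C, G and T are kept and every
// other character (spaces, '\r', 'N', lower case, ...) is ignored.
std::vector<Sequence> read_sequences(std::istream& in) {
    std::vector<Sequence> result;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            result.push_back(Sequence{line.substr(1), std::string()});
        } else if (!result.empty()) {
            for (char c : line) {
                if (c == 'A' || c == 'C' || c == 'G' || c == 'T') {
                    result.back().bases.push_back(c);
                }
            }
        }
    }
    return result;
}

// Convenience wrapper for trying the reader on an in-memory string.
std::vector<Sequence> parse_text(const std::string& text) {
    std::istringstream in(text);
    return read_sequences(in);
}
```

Calling parse_text(">seq 1\nACGT\nNNAC gt\n") gives one sequence named "seq 1" with bases "ACGTAC". Whether the filter costs anything measurable on large inputs would have to be timed; the sketch only shows how little code is involved.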

Thanks a lot.
It is a very nice source.

It would be useful to know if the input follows this exact pattern: only the letters A C T G, with no spaces between them, except newline.

mihaio07, that 'exact pattern' affects the speed of the algorithm rather than its scalability, which is what matters most. And again: it's best if we can stick to the reference program.

I think there's no reason why they would make us waste CPU time on input format checking. As in the previous Acceler8 contest, I think everything will be formatted the same way as in the given example, except maybe the sequence names (I think those can differ, so you can't jump directly to the beginning of the sequence).

On the other hand, if you're reading input the same way they are, it doesn't make much difference.

Best regards,
Nenad
