New test file : Large input

New test file : Large input

Portrait de paul-guermonprez (Intel)

Hello

A quick post to inform you that a new, larger input file is available from :
http://intel-software-academic-program.com/contests/ayc/early2012/test_input_1.tar.bz2
Expected output is provded too.

Note : In order to process big files, you need to be less greedy than our sample program.
In order to do that you can avoid storing non matching pairs, by using for example hash tables instead of arrays.

regards, paul

32 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.
Portrait de mah.nabil

Where is the reference file and what is theminMatchLength please

Portrait de andreas86

Looking at the provided output file, I would guess that the minMatchLength is not larger than 22 and that test_input_1.fa is the reference file and Home_sapiens.... the input file.

Portrait de andreas86

But on a related note, my understanding so far was that the provided sample program was supposed to be our reference and that our program should be able to do exactly the same, although of course faster. Now, it would be nice if someone could clarify which input file sizes our program should be able to handle (i.e. an upper bound on the file sizes that will be used during the evaluation).

Portrait de mah.nabil

test_input_1.fa contain two test sequences

Portrait de andreas86

That's strange. It looks like test_input_1.fa was supposed to be the input file and Homo_sapiens... the reference file. But the first line of the expected output is "19_dna" wich is from Home_sapiens...; however, the sample code prints the names of the sequences in the input file. So it looks to me like they accidentally interchanged the two files.

Portrait de gino_lotta.

On the Russian web page is written that the length of the ref. string is less than 2^32. The number of the input strings is less than 2^32, and the sum of the lengths of the input strings is less than 2^32.

Portrait de Xavier H. (Intel)

You're right, I've just updated the output file to correct that, sorry !

Portrait de andreib

Hi!
Can we assume that there are no "illegal" characters in a sequence?
Do input sequences tend to be a lot smaller than the reference file?

Thank you in advance,
andrei

Portrait de pl

Hello

I am also wondering if we can assume that there are no other characters than A, T, G, C, '>' and \n (expect for the name of the sequence, of course).
Indeed, it simplifies the vectorization for parsing the sequences :)

Thanks

Portrait de Stevan Medić

Sample program crashes when it tries to store the whole homo_sapiens reference sequence text in STL string. I dont understand your tip : ,,In order to do that you can avoid storing non matching pairs, by using for example hash tables instead of arrays." Actually, i know what that means, but the problem is at the begining. Or maybe that happens only to me, i dont know... It tells that the application has requested the runtime to terminate in an unusual way and thats it. So, could anybody help me? Thanks a lot

Portrait de Xavier H. (Intel)

Hi,

It's not the reference sequence that should be to big for your system memory, but the L array that stores information on non matching pairs.

Portrait de hela_umbrella

You said that we should try to figure out a way how to store only information about matching pairs (and not the zeros). I have thought which numbers should I store, so...

I have a question:

If we look at the sample code, does it check, what I think it does, before priniting the result:
(this is a list of arrays)
...
... a b1 ...
... b2 b3 ...
...

Should information about a be printed only in case if this is true:
a>b1 & a>b2 & a>b3 ?

Correct me if I am wrong through an examle.

I have read "Understanding the algorithm", but I am not quite sure that I did.

Portrait de flesti

you are right abouta>b1 & a>b2 & a>b3, the code do : if(a<=b1 | a<=b2 | a<=b3)discard solution which is equivalent to: if(a>b1 & a>b2 & a>b3) keep solution

Portrait de maykelnawar

I think that the meaning of this tipis that the matrix will contain a lot of zero records.
So instead of using a matrix you can use a hash table, where you can insert only non-zero values.

i.e.:
1- instead of writing 0 to the matrix -> add no record to the hash table.
2- instead of writing a value to the matrix (L[i][j] = x) -> add the value(i,j,x) to the hash table.

Portrait de hela_umbrella

Thanks flesti!

I have more questions...:

1. What is the smallest size of minMatchLength for this large file, for which the solution should run properly?
And the biggest? It is important to know if it fits for example integer boundaries, if we get very big input sequences...

2. What is the biggest number of sequences (the sum of sequences (not including reference one) from all command arguments)? The biggest number we should expect after >test_sequence_? is...?

Portrait de andreas86
Quoting hela_umbrella I have another question: what is the smallest size of minMatchLength for this large file, for which the solution should run properly?

If my implementation is correct, it should be 22.

Portrait de mmilena
Hi! About the biggest number of sequences is alreadyposted here:

http://software.intel.com/fr-fr/forums/showpost.php?p=182023 I hope this would help.

Portrait de mah.nabil

I thinkSmallest size of minMatchLength can be 1 , because it is a command line argument as long as the result is equivalent to the reference code uploaded

Portrait de andreas86
Quoting mah.nabil I thinkSmallest size of minMatchLength can be 1 , because it is a command line argument as long as the result is equivalent to the reference code uploaded

That's right. But the 22 I mentioned above was the smallest minMatchLength for the Homo_sapiens sequence such that the output agrees with test_output_1_-_threshold-22.txt. I thought that this was what hela_umbrella asked for.

Portrait de hela_umbrella

Yes, I have asked for the large example file.

Thanks everybody for your answers!

The other question remains a mistery...

Can we get an input file with, for example:
>test_sequence_879
..
>test_sequence_8179
...
>test_sequence_12211
...
?

Portrait de mmilena

Hi,

On the Russian forum http://software.intel.com/ru-ru/articles/contest-accelerate-2012-problem/ it says that the number of all input strings in all files together is less than 2^32, which means you can get input file(s) consist(s) of sequences with arbitrary names as long as the number of all sequences lis ess than 2^32.

Hope this helps!

Portrait de neshone

Hi,

I've seen this russian thread poping up every now and then and I have to say that, in my humble opinion, that kind of requirement is unreasonable. To be able to handle 4 billion input strings, where each string, including the reference one, can be up 4 billion characters long (4 GB in size), you would have to do all kind of stuff that would make your program much slower when it comes to solving inputs of "normal size" (e.g. it would take more time to be able to sort all of the sequences) and you would lose points. I think it's quite enough for everybody to have in mind the sampled "large input" of 60 MB and other inputs of comparable size, larger or smaller. I think this thread was opened in order to have us think a bit outside the box, but not to be be able to handle something of that size.
It would be nice to have an official Intel comment on this subject, this is after all only my opinion.

Best regards,
Nenad

Portrait de andreib

Hi!
I think this question was already posted, on a separate thread. Sorry :(
What other rules does the input follow? If we are not concerned with ignoring chars, the input will be faster (we can also use in-memory mapping, etc.)
Is that a rule that after 70-72 chars there is a newline? I see that in input examples.
On my computer there's a major difference between reading files serially and reading them with mmap() - but still, that brings errors because I bypass validation.

I hope I will find some workarounds, anyway (I am thinking at one or two, but I will apply them later if needed)

Thank you in advance!

Portrait de andreas86

There is a thread about this on the French forum: http://software.intel.com/fr-fr/forums/showthread.php?t=104613

Portrait de andreib

Thank you, Andreas!

Portrait de Fuchs

Hi. Could you give us a reference time, how long should it take for these reference and output, if we start our program SEQUENTIALLY ?

Has anybody of studs testes it ? how long do u need ?

Portrait de shakal187

Does anyone use VS2008? How can i update my compiler so i can use unordered_map?

Portrait de yvanko

As far as I understand, the original program is too unefficient so it even does not work when the input data is as big as in this test. You have to improve the programm, preserving the final result.

Portrait de bdrung

Put at least 261 GB RAM in your system and you can calculate the result with the reference source. ;)

Portrait de andreas86
Quoting bdrung Put at least 261 GB RAM in your system and you can calculate the result with the reference source. ;)

And at the time the computation finishes, most computers will proabably have more than 261GB RAM and it will be very cheap...

Portrait de bdrung
Quoting andreas86

And at the time the computation finishes, most computers will proabably have more than 261GB RAM and it will be very cheap...

Nah, it wouldn't take that long. Processing this files takes a few minutes with an optimized algorithm. It would take at most a few hours with the unoptimized algorithm.

Connectez-vous pour laisser un commentaire.