I used the file generator from http://software.intel.com/fr-fr/forums/showthread.php?t=104505&o=a&s=lr as a basis and adapted it a bit for our needs.
I've attached the source code to this post.
The main differences are:
* only the data is generated randomly
* the number of sequences and the number of files are fixed, not random
* you can specify a prefix for data files instead of a directory
* you can specify the probability with which substrings from the reference string are copied
* lots of tiny clean-ups under the hood
Example:
./datagen 500 4000 3 2 dataset1 0.6
generates:
* a ref sequence with 500 bases
* 2 files, dataset1_input_0.txt and dataset1_input_1.txt, with 3 sequences of 4000 bases each
* with a probability of 0.6 that substrings from the reference are used
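
To make the parameters concrete, here is a rough Python sketch of what such a generator could look like. It is not the attached source. The argument order (reference length, sequence length, sequences per file, number of files, prefix, probability) is read off the example above. The ACGT alphabet, the chunk-wise copying with a hypothetical max_chunk limit, the seed parameter, and all function names are my own assumptions for illustration. It builds the data in memory and does not write any files.

```python
import random

BASES = "ACGT"  # assumed alphabet


def random_bases(rng, n):
    """Return n uniformly random bases."""
    return "".join(rng.choice(BASES) for _ in range(n))


def make_sequence(rng, ref, length, copy_prob, max_chunk=50):
    """Build one sequence chunk by chunk (assumed scheme).

    With probability copy_prob a chunk is a substring copied from ref,
    otherwise it is random. Requires len(ref) >= 1.
    """
    parts = []
    remaining = length
    while remaining > 0:
        chunk = rng.randint(1, min(max_chunk, remaining, len(ref)))
        if rng.random() < copy_prob:
            start = rng.randint(0, len(ref) - chunk)
            parts.append(ref[start:start + chunk])
        else:
            parts.append(random_bases(rng, chunk))
        remaining -= chunk
    return "".join(parts)


def generate(ref_len, seq_len, seqs_per_file, n_files, prefix, copy_prob, seed=0):
    """Return the reference and a dict mapping file name to its sequences."""
    rng = random.Random(seed)
    ref = random_bases(rng, ref_len)
    files = {}
    for i in range(n_files):
        name = f"{prefix}_input_{i}.txt"  # naming taken from the example above
        files[name] = [
            make_sequence(rng, ref, seq_len, copy_prob)
            for _ in range(seqs_per_file)
        ]
    return ref, files


if __name__ == "__main__":
    ref, files = generate(500, 4000, 3, 2, "dataset1", 0.6)
    for name, seqs in files.items():
        print(name, len(seqs), "sequences of", len(seqs[0]), "bases")
```

The fixed seed only makes runs reproducible; whether the real tool seeds its random generator this way is not something I know.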