Solved

creating a huge text file having a repetitive randomized string in C

Posted on 2013-01-30
Last Modified: 2013-02-02
Hi there,

I am trying to create a huge text file containing a repeated randomized string in C.

I can randomize the string values and I can write to the file. What I am trying to do is produce a file of a huge size, say 500 MB.

Is there a smart way to do this?

Regards.
Question by:jazzIIIlove
17 Comments
 
LVL 51

Accepted Solution

by:
Julian Hansen earned 313 total points
ID: 38834885
Is this a once off function or do you need to do this often?

When you say "smart way" what do you mean?

If you need to write the values to the file, buffer the string in a large memory block and then write that block repeatedly. I am assuming this is what you want:

Random string: abcd1234

And you want a 500 MB file comprising multiple repeats of this string, so
abcd1234abcd1234abcd1234 ... abcd1234 = 500 MB

In which case I would create a buffer (say 50K), fill it with the randomized string (make the buffer a multiple of the string length) and then write the buffer repeatedly.
int bufflen = 50 * 1024;
long filesize = 500L * 1024 * 1024;   // target file size in bytes
int strsize;
char * buffer = new char[bufflen];
char randstr[9];
*buffer = '\0';
strcpy(randstr, "abcd1234");
strsize = strlen(randstr);

// fill the buffer with copies of the string (this counts bufflen down)
while (bufflen >= 0) {
  strcat(buffer, randstr);
  bufflen -= strsize;
}
FILE * pfile = fopen("bigfile.txt", "wt");
for(int i=0; i < filesize/bufflen;i++)
{
  fwrite(buffer, 1, bufflen, pfile);
}
fclose(pfile);
delete[] buffer;
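
For readers who want something that compiles as plain C, here is a minimal sketch of the same buffer-and-repeat idea. It is an illustration, not Julian's exact code: the function name write_repeated, its parameters and the 50K buffer cap are assumptions. It keeps the buffer length in its own variable, fills the buffer with memcpy instead of strcat, and opens the file in binary mode so the byte count is exact.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes `pattern` back to back until just under or exactly `target`
   bytes are in `path`. Returns the bytes written, or -1 on failure.
   (write_repeated is an illustrative name, not a library function.) */
long long write_repeated(const char *path, const char *pattern, long long target)
{
    size_t len = strlen(pattern);
    assert(len > 0 && len <= 50 * 1024);

    /* Round the buffer down to a whole number of patterns. */
    size_t bufflen = (50 * 1024 / len) * len;
    char *buffer = malloc(bufflen);
    FILE *fp = fopen(path, "wb");
    if (buffer == NULL || fp == NULL) {
        free(buffer);
        if (fp != NULL)
            fclose(fp);
        return -1;
    }

    /* Copy each pattern to a known offset; nothing is rescanned. */
    for (size_t off = 0; off < bufflen; off += len)
        memcpy(buffer + off, pattern, len);

    /* Write whole buffers while they still fit under the target. */
    long long written = 0;
    while (written + (long long)bufflen <= target) {
        if (fwrite(buffer, 1, bufflen, fp) != bufflen)
            break;                      /* disk full or I/O error */
        written += (long long)bufflen;
    }

    free(buffer);
    if (fclose(fp) != 0)
        return -1;
    return written;
}
```

A call such as write_repeated("bigfile.txt", "abcd1234", 500LL * 1024 * 1024) would then produce a file of about 500 MB made of whole 50K buffers.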


0
 
LVL 13

Assisted Solution

by:Hugh McCurdy
Hugh McCurdy earned 63 total points
ID: 38835049
I think Julian's answer is pretty good.  However, it's hard to know without really understanding the purpose of the project.

For instance, is the string for some sort of security scheme?  If not, what is it for?  The answer could help with finding an answer that suits your actual need, whatever it is.

Also, what do you mean by "repeatedly?"  Repeat the random string several times until you get to 500MB or do you want a very long string that is randomized?

It occurs to me that a simple approach, if you are using the GNU compiler and libraries, is to make your string from a sequence of unique characters and then call strfry(), which will "randomize" your string. Then you can repeatedly write that out (if that's what you mean).
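
strfry() is a GNU C library extension, so it may not be available with other compilers. As a hedged, portable stand-in, a Fisher-Yates shuffle over the characters does a similar job. The name shuffle_string is hypothetical, and rand() with a modulo is assumed to be random enough for test data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shuffles the characters of s in place using Fisher-Yates.
   (shuffle_string is an illustrative name, not a library function.) */
void shuffle_string(char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--) {
        /* Pick a position in 0..i; the modulo adds a slight bias. */
        size_t j = (size_t)rand() % (i + 1);
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

The caller would seed the generator once with srand() and pass a writable array, for example char s[] = "abcd1234", not a string literal.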
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38835615
Hi guys;

I like Julian's approach. For clarification, the file content is as follows. It's a one-off thing.

Test A:
12 0.4
.....
.....
Test B:
....
.....
Test Z:
.....
....
This is the schema, where the letter increments from A to Z but has to stay a single character.
The numbers (12 0.4) are tab- and space-delimited, and they can repeat until the file is huge.

Regards.

I am using VS as the tool and its compiler. Not GNU. I can also use GNU C for this need.
0
 
LVL 13

Expert Comment

by:Hugh McCurdy
ID: 38836049
I think Julian's approach is good too.  I was just concerned about what you are trying to do but now it appears you just want to make some test data.  I think you are good to go with Julian's answer.
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38841490
I also think Julian's approach is good, but I think there is a problem in the while loop: once bufflen reaches 0 or below, the loop exits with bufflen at 0 or a negative value, and the for loop fails.

So I added this code to the solution. It works well with this, but let's see what Julian says.

actual = bufflen;              // remember the full length before counting down
while (bufflen > 0) {
	strcat(buffer, randstr);
	bufflen -= strsize;
}
bufflen = actual;              // restore it for the write loop



Regards.

P.S. Also there is a linked question in the link:
http://www.experts-exchange.com/Programming/Languages/C/Q_28016182.html
0
 
LVL 51

Assisted Solution

by:Julian Hansen
Julian Hansen earned 313 total points
ID: 38841582
My solution was pseudocode to illustrate a point, and was predicated on bufflen being a multiple of strsize.

For the algorithm to work as it is, you need to make the buffer size a multiple of the length of the string you are replicating.
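
One way to guarantee that multiple is to round the buffer length down with integer division. This is a small sketch; the helper name rounded_bufflen is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Largest length not above `capacity` that is a whole multiple of
   `strsize`. (rounded_bufflen is an illustrative name.) */
size_t rounded_bufflen(size_t capacity, size_t strsize)
{
    assert(strsize > 0);
    return (capacity / strsize) * strsize;
}
```

With a 50K capacity, an 8-character string keeps all 51200 bytes, while a 9-character string would use 51192 of them.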
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38842444
Hi;

Thanks for the information.

I have modified your algorithm; the code is below. The problem is the line

for(int i = 0; i < filesize/100;i++)

I am trying to populate the file that way: the inner for loop should be the main source of randomization and repetition. The program runs smoothly, but I cannot get files larger than about 2.5 MB. All in all, the numbers in the file, produced by

for(int i = 1000; i < 20000; i+=1000){

are what should make the file larger. I want to reach at least 500 MB or so, but the file seems to populate extremely slowly and I am afraid there will be a problem with the buffer.

The code follows; its quality seems poor. What do you think? How can I reach around 500 MB or so without getting stuck?

# include <stdio.h>
# include <string.h>
# include <time.h>
# include <stdlib.h>
# include <math.h>
# define filesize 50000

char * construct(char c, int bufflen);

void main()
{		
	int strsize;

	FILE *pfile = fopen("bigfile.txt","wt");		 
	int m = 65;
	srand(time(NULL));
	for(int i = 65; i <= 90;i++)
	{
		int bufflen = 50 * 1024;	
		char * buffer = new char[bufflen];
		*buffer = '\0';
		char * randstr = construct(i, bufflen);
		strsize = strlen(randstr);

		bufflen = strsize;	
		fwrite(randstr, 1, bufflen, pfile);

	}
	fclose(pfile);
}

char * construct(char c,int bufflen)
{		
	char randstr[1000000] = "Operator ";	

	randstr[9] = c;
	strcat(randstr,":\n");

	for(int i = 0; i < filesize/100;i++)
	{
		int j = 0;
		int r[50];
		double p[50];
		double tr[50];
		for(int i = 1000; i < 20000; i+=1000){
			r[j] = rand() % i + 10;
			p[j] = ((double)(rand() % i)/(double)RAND_MAX+rand() % (j+1));	 	

			char integer_string[32];
			sprintf(integer_string, "%d", r[j]);

			char double_string[32];
			sprintf(double_string, "%.2f", p[j]);
			strcat(randstr, integer_string);
			strcat(randstr, " \t");
			strcat(randstr, double_string);
			strcat(randstr, "\n");
			j++;
		}	
	}
	return randstr;
}


0
 
LVL 51

Assisted Solution

by:Julian Hansen
Julian Hansen earned 313 total points
ID: 38842685
There is a lot about your algorithm I don't understand.

Why are you storing values in the p and r arrays? Why not just have a single integer and a single double value, since you don't seem to use the values after you have added them to the string?

I have not compiled and tested this, but is this not in essence what you are trying to do? (We can address the speed issues later; first we need to understand what you are trying to achieve.)
# include <stdio.h>
# include <string.h>
# include <time.h>
# include <stdlib.h>
# include <math.h>
# define filesize 50000

char * construct(char c);

int main(void)
{		
	FILE * pfile = fopen("bigfile.txt","wt");		 
	
	srand((unsigned)time(NULL));
	
	for(int i = 65; i <= 90;i++)	// 'A' to 'Z'
	{
		char * randstr = construct(i);
		int len = strlen(randstr);
		fwrite(randstr, 1, len, pfile);
	}
	
	fclose(pfile);
	return 0;
}

char * construct(char c)
{		
	// static, so the buffer is still valid after the function returns
	static char randstr[1000000];	
	
	// start each section with its header, overwriting the previous contents
	sprintf(randstr, "Operator %c:\n", c);

	for(int i = 0; i < filesize/100;i++)
	{
		int j = 0;
		int r;
		double p;

		for(int i = 1000; i < 20000; i+=1000){
			r = rand() % i + 10;
			p = ((double)(rand() % i)/(double)RAND_MAX+rand() % (++j));	 	

			char result_string[32];
			sprintf(result_string, "%d\t%.2f\n", r, p);
			strcat(randstr, result_string);
		}	
	}
	return randstr;
}


0

 
LVL 84

Assisted Solution

by:ozo
ozo earned 124 total points
ID: 38842841
# include <stdio.h>
# include <string.h>
# include <time.h>
# include <stdlib.h>
# include <math.h>
# define filesize 50000

void construct(char c,FILE *pfile);
int main(void)
{		
	FILE * pfile = fopen("bigfile.txt","wt");		 
	
	srand(time(NULL));
	
	for(int i = 65; i <= 90;i++)
	{
		construct(i, pfile);
	}
	
	fclose(pfile);
}

void construct(char c,FILE *pfile)
{		
  	
	fprintf(pfile,"Operator%c:\n",c);

	for(int i = 0; i < filesize/100;i++)
	{
		int j = 0;
		int r;
		double p;
		for(int i = 1000; i < 20000; i+=1000){
			r = rand() % i + 10;
			p = ((double)(rand() % i)/(double)RAND_MAX+rand() % (++j));	 	

			fprintf(pfile, "%d\t%.2f\n", r, p);
		}	
	}
}


0
 
LVL 51

Assisted Solution

by:Julian Hansen
Julian Hansen earned 313 total points
ID: 38843329
Ozo has optimised further. I think where things got confused is that the original recommendation was to write a pre-created buffer multiple times to the same file to generate a large file.

Your updated code shows that each buffer is essentially different, so there is no benefit in pre-creating the buffer. As Ozo has done, it makes more sense to simply output the data to the file.
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38843374
Yup, that seems true. Sorry, it was evolving in my mind.

The code is clean, yet I cannot get a huge file size. The /100 makes it smaller; if I remove it, my machine seems to get stuck.

Also, is there a need to free the pointer?

Regards.
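
One possible way to reach a chosen size without tuning loop bounds by hand is to count what fprintf returns and keep writing until a per-section budget is met. The sketch below is only an illustration: write_sections and its parameters are hypothetical names, the record values are placeholders rather than the formula used in this thread, and the file is assumed to be opened in binary mode so that the character count equals the byte count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes sections "Operator A:" to "Operator Z:", each filled with
   random records until it holds at least target/26 characters.
   Returns the total written, or -1 on an output error.
   (write_sections is an illustrative name.) */
long long write_sections(FILE *fp, long long target)
{
    assert(fp != NULL && target >= 0);
    long long budget = target / 26;     /* share for each section */
    long long total = 0;

    for (char c = 'A'; c <= 'Z'; c++) {
        int n = fprintf(fp, "Operator %c:\n", c);
        if (n < 0)
            return -1;
        long long section = n;

        while (section < budget) {
            int r = rand() % 1000 + 10;             /* placeholder value */
            double p = (double)rand() / RAND_MAX;   /* placeholder value */
            n = fprintf(fp, "%d\t%.2f\n", r, p);
            if (n < 0)
                return -1;
            section += n;
        }
        total += section;
    }
    return total;
}
```

Calling it with a target of 500LL * 1024 * 1024 on a stream opened with fopen("bigfile.txt", "wb") would overshoot the target by at most one record per section.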
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38845046
Another question:

Do you think a multithreaded approach would help this run faster, or better?

#pragma omp parallel for
	for(int i = 0; i < filesize;i++)

Or should I put this on the outer loop?

#pragma omp parallel for
for(int i = 65; i <= 90;i++)
      {

Or should I open a new question?

Regards.
0
 
LVL 84

Expert Comment

by:ozo
ID: 38845199
Which code seems stuck?
The one that writes directly to the file,
or the one that repeatedly scans to the end of a large buffer in order to append to it?
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38845310
Thanks, your question is extremely wise. When I step through the code line by line in the debugger there is no slowness at all, but when I run it, the loop takes too long to execute. As you can see, I changed the loop condition from filesize/100 to filesize, where filesize is 500000. I can produce around 2 GB with no failure, but the creation takes time.

Do you think I should go for a threaded solution by putting that pragma line on the for loop in the construct function, on the loop in main, or on both? Do you think it can produce a notable improvement?

I got the pragma parallel idea from:
http://stackoverflow.com/questions/4835192/threaded-for-loop-in-c
http://www.viva64.com/en/a/0054/

Regards.
0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 124 total points
ID: 38846210
Unless you are using quadratic time string shuffling operations, which I've told you how to avoid, your bottleneck will probably be  just disk IO.
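
If a buffer is still wanted, the quadratic cost of repeated strcat can be avoided by remembering where the text ends and formatting straight to that position. This is a hedged sketch: append_record is a hypothetical helper, and it assumes a C99 snprintf is available (older Microsoft compilers only provide _snprintf, which behaves differently).

```c
#include <assert.h>
#include <stdio.h>

/* Appends one "r<TAB>p" record at offset *used in buf (capacity cap).
   Returns 1 on success, 0 if the record does not fit.
   (append_record is an illustrative name.) */
int append_record(char *buf, size_t cap, size_t *used, int r, double p)
{
    assert(buf != NULL && used != NULL && *used <= cap);
    size_t room = cap - *used;
    int n = snprintf(buf + *used, room, "%d\t%.2f\n", r, p);
    if (n < 0 || (size_t)n >= room) {
        if (room > 0)
            buf[*used] = '\0';   /* drop the partial record */
        return 0;
    }
    *used += (size_t)n;          /* the next append starts here */
    return 1;
}
```

Each call costs time proportional to the record only, because nothing scans the text already in the buffer.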
0
 
LVL 51

Assisted Solution

by:Julian Hansen
Julian Hansen earned 313 total points
ID: 38846502
Disk IO is always going to be a bottleneck. Parallelising your code is not going to achieve anything, because you are still going to have to wait for the disk operations to complete.

What you want to do instead is make sure that the chunks you write to disk are as big as possible.
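
With stdio, one way to get bigger chunks without changing the fprintf calls is to give the stream a larger buffer through the standard setvbuf function. The sketch below assumes a helper named open_buffered, which is not part of any library, and whether a bigger buffer actually helps depends on the platform and the disk.

```c
#include <assert.h>
#include <stdio.h>

/* Opens `path` for writing with a fully buffered stream of `bufsize`
   bytes. Returns NULL on failure.
   (open_buffered is an illustrative name.) */
FILE *open_buffered(const char *path, size_t bufsize)
{
    assert(path != NULL && bufsize > 0);
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return NULL;
    /* setvbuf has to come before any other operation on the stream.
       Passing NULL lets the library allocate the buffer itself. */
    if (setvbuf(fp, NULL, _IOFBF, bufsize) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

A stream from open_buffered("bigfile.txt", 1 << 20) collects about 1 MB of output before each write reaches the operating system.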
0
 
LVL 12

Author Closing Comment

by:jazzIIIlove
ID: 38847242
Thanks guys.
0
