jlw011597
asked on
On Unix or Windows2K platform, need a random line sampler
I have a 2000+ line text file and need to randomly select 250 lines from the file.
Any STANDARD or FREEWARE/DOWNLOADABLE tools to do this either in the Unix or Windows environments?
I have Unix (Compaq Tru64) or Windows (Win2K) available to do it on. I've also got OpenVMS but I don't expect
that there's anybody here in Expert's Exchange who can answer questions on that venerable OS... if there are,
then fine, on there too -- since that's where I NEED the resultant text file of 250 lines.
Time dependent; have to have this file before Monday.
Perl runs on both Unix and Windows (ActivePerl is best there).
Note that Perl can be useful for CGI, or as a means of using underlying OS features from scripts.
I often find O'Reilly a useful resource. For example:
http://www.perl.com/pub/q/faqs
Sample ReadMe:
http://www.perl.com/CPAN-local/modules/by-module/Crypt/Crypt-Random-1.11.readme
(filename: Crypt-Random-1.11.tar.gz)
http://www.activestate.com/
Has numerous code samples for
Perl
Python
PHP
Tcl
XSLT
-- you can get Windoze versions of needed code from them (yes, even freeware, open source, all them there neat fun words)
250 consecutive lines or random lines?
Are there duplicates allowed?
======
Werner
ASKER
From: griessh
250 consecutive lines or random lines?
Are there duplicates allowed?
Random lines, duplicates not allowed. They're email addresses from a sample
population for a research study. By choosing a random subset from a large population
the researchers hope to defuse complaints of SPAM by the selected individuals.
From: Others.... all suggested PERL.
Ah, well... I don't subscribe to the Unix build-it-yourself school so was hoping
for some actual application that did this. But the researchers, when told it was
a stumbling block, found a website (www.randomizer.org, I think) and had it
supply a set of 250 random numbers, with no duplicates, drawn from the full range
of 2000+ records. A 3rd party then loaded the 2000+ records into an EXCEL spreadsheet
and ran a query selecting the rows that matched the 250 random numbers.
So, done. And delivered back to me via Email to my OpenVMS system where the
resultant file becomes the restricted access mailing list for sending the study request out to those 250 randomly selected members of the 2000+ member population.
jlw
Great! I suggest you go to Community Support at https://www.experts-exchange.com/jsp/qList.jsp?ta=commspt , post a request (with this URL included) and ask them to PAQ this question and refund your points, since you have your own solution.
======
Werner
QBasic can do this rather quickly. It was included with NT4 'Server'!
Interesting, the no-duplicate requirement. Either pre- or post-processing, one presumes.
I disagree with cookre on the array. IMO the random numbers should come one at a time, with no new number drawn until the prior record has been processed. This means, of course, that I MUST disagree with randomizer, which prebuilds a list of numbers prior to runtime. What has the 'appearance' of random actually is not.
But since you are happy, I assume for your purpose it'll reduce your working set satisfactorily.
> the researchers hope
IMO, One unsolicited memo can be research, all subsequent having no opt-in are eSpam. Label it anything they like, if it looks like a duck, walks like a duck.....
> Time dependent; have to have this file before Monday.
hmm, probably a good editor would do as well, just delete lines, at random, until only 250 left... doesn't take all that long. I can delete faster than type.
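For anyone who wants to try the one-number-at-a-time idea above, here is a minimal sketch in Python. It uses selection sampling, which is a substitute technique and not necessarily what SunBow had in mind: one random number is drawn per record as the file is read, and no list of numbers is built in advance. It assumes the total line count is known beforehand. The name `select_streaming` and its parameters are made up for illustration.

```python
import random

def select_streaming(lines, total, k, rng=None):
    """Yield exactly k distinct lines out of `total`, drawing one random
    number per record as it is read (selection sampling)."""
    rng = rng or random.Random()  # seeded from the OS or the clock by default
    chosen = 0
    for seen, line in enumerate(lines):
        if chosen == k:
            break  # quota filled, no need to read further
        # Keep this record with probability (still needed) / (still unread).
        if rng.random() * (total - seen) < (k - chosen):
            chosen += 1
            yield line
```

Since every line is considered only once, duplicates cannot occur, and the selected lines come out in their original file order.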
SunBow, the array IS built at runtime with a seed based on time so one doesn't always get the same sequence. Moreover, the main purpose of the array is to keep track of those numbers already selected so they are not selected again.
jlw, griessh's suggestion is ideal, since the link to randomizer.org will likely be of value to others in the future.
(whew, now I don't have to actually write anything...)
ASKER CERTIFIED SOLUTION
1. Make a sorted array X of 250 random, unique integers between 1 and the number of lines in the file
2. Make one pass through the file, selecting the lines specified by X
I'll post some code in a day or two, if nothing better shows up or if nobody else has the time to do it.
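In the meantime, the two steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the code promised above: the function name `sample_lines` is hypothetical, and the whole file is assumed to fit in memory so the line count is known before the numbers are drawn.

```python
import random

def sample_lines(lines, k, rng=None):
    """Return k distinct, randomly chosen lines in their original order."""
    rng = rng or random.Random()  # seeded from the OS or the clock by default
    lines = list(lines)           # assumption: the file fits in memory
    # Step 1: sorted array X of k unique line numbers between 1 and N.
    x = sorted(rng.sample(range(1, len(lines) + 1), k))
    # Step 2: one pass through the lines, keeping those named in X.
    picked = []
    i = 0
    for lineno, line in enumerate(lines, start=1):
        if i < len(x) and x[i] == lineno:
            picked.append(line)
            i += 1
    return picked
```

For a file it could be called as `sample_lines(open("addresses.txt"), 250)`, where `addresses.txt` is a made-up file name. Each returned line keeps its trailing newline, so the result can be written straight back out with `writelines`. `random.sample` raises `ValueError` if more lines are requested than the file contains.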