How do I randomly delete K lines from an N-line file using a Linux shell script?

Last Modified: 2017-04-28
On Linux I have an arbitrarily sized file of N lines (say 500,000 lines) and I want to randomly delete lines from the file until I have r (say 10,000) lines remaining. How can I do this with a Linux shell?

Basically, I have a data set and I want to reduce it to a more manageable sample.

Thanks,
Chris

Chris Jones, Senior Systems Administrator

Commented:
If it doesn't have to be truly random, the way I would do this is to use tail to take the final 10k lines into a new file:

tail --lines=10000 yourmainfile > newdataset

and then use the new file, which has 10k lines.

Is this good enough or would you like me to consider a bash script for random lines?
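
For the random case, one possible sketch is below. It assumes GNU coreutils (shuf) is installed; the demo input built with seq and the name newdataset_ordered are made up for illustration, and r is set to 100 only to keep the demo small. shuf -n picks r distinct lines at random, which is the same as randomly deleting all the others. A similar result could probably be had with sort -R piped into head, but sort -R groups identical lines together, so shuf seems the safer choice.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical demo input: 500 numbered lines standing in for the real data set.
seq 1 500 > yourmainfile

r=100   # number of lines to keep

# Keep r randomly chosen lines; the output order is shuffled.
shuf -n "$r" yourmainfile > newdataset

# Variant that preserves the original line order:
# pick r distinct line numbers, then print only those lines.
n=$(wc -l < yourmainfile)
shuf -i 1-"$n" -n "$r" |
    awk 'NR == FNR { keep[$1]; next } FNR in keep' - yourmainfile > newdataset_ordered

wc -l < newdataset
wc -l < newdataset_ordered
```

For the file in the question, r would be 10000 and the seq line would be dropped.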
Christopher Schene, System Engineer/Software Engineer

Author

Commented:
OK, I'll give it a try. I'll start out with a smaller sample
Christopher Schene, System Engineer/Software Engineer

Author

Commented:
I reduced a 50000 line file to 1000 and it only took 2-3 sec.

The reason I am doing this is that I have large samples of data that are too big to process, and I want to see whether they are random representations or not.

For example, if a sample of 1 million follows a normal distribution, then I can randomly grab as few as 1,500 entries or so and get a confidence interval of about 3%. My intent was to compare several smaller sample sets and see whether they are truly normal distributions; if they are, I have confidence that the overall samples are normal distributions.
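
A rough sketch of that comparison, assuming GNU shuf and awk are available and that the data is one numeric value per line. The demo input, the sample_N file names, and the stats helper are all made up for illustration. It only compares the mean and standard deviation of a few 1,500-line samples against the full file; it is not a formal normality test.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical demo input: one numeric value per line (here simply 1..50000).
seq 1 50000 > yourmainfile

# Print "count mean stddev" (population standard deviation) for a one-column file.
stats() {
    awk '{ s += $1; ss += $1 * $1 }
         END { m = s / NR; printf "%d %.2f %.2f\n", NR, m, sqrt(ss / NR - m * m) }' "$1"
}

echo "full file: $(stats yourmainfile)"

# Draw three independent random samples and summarise each one.
for i in 1 2 3; do
    shuf -n 1500 yourmainfile > "sample_$i"
    echo "sample_$i: $(stats "sample_$i")"
done
```

If the samples are representative, their means and standard deviations should land close to each other and to the full-file values.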
Chris Jones, Senior Systems Administrator

Commented:
I wouldn't really be able to advise you on the statistics, other than to say that I can't guarantee that the random number generator used by sort gives a truly random representation.

Glad you managed to get such a quick process out of the command :)
Chris Jones, Senior Systems Administrator

Commented:
That is to say "I can't guarantee that the random number generation is truly random".
Christopher Schene, System Engineer/Software Engineer

Author

Commented:
I am trying to think of a way to validate that it truly is random. I need to have a think on it.
Chris Jones, Senior Systems Administrator

Commented:
You probably won't find a way to generate truly random numbers using a computer.

I haven't got time to read through these in detail but they may be what you need to consider:
https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/

"Can a computer (non-quantum) generate true random numbers, or are they pseudo-random?" I know that certain functions are better than others, but it is a good point to make if this is part of a research project. If you are doing research I would bring this up in your appendices and assumptions.
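
If the source of randomness matters for the write-up, GNU shuf and sort both accept a --random-source=FILE option that names where the random bytes come from. The sketch below assumes GNU coreutils and a readable /dev/urandom; the demo input and the names seedfile, run1 and run2 are made up for illustration. Using a saved byte file also makes a sample reproducible, which can be handy to note in the assumptions.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical demo input: 500 numbered lines standing in for the real data set.
seq 1 500 > yourmainfile

# Draw the sample using the kernel's random device as the byte source.
shuf --random-source=/dev/urandom -n 100 yourmainfile > newdataset

# Reproducible variant: the same saved byte stream yields the same sample.
# "seedfile" is a hypothetical name; it only needs to hold enough bytes.
head -c 100000 /dev/urandom > seedfile
shuf --random-source=seedfile -n 100 yourmainfile > run1
shuf --random-source=seedfile -n 100 yourmainfile > run2

cmp run1 run2 && echo "identical samples"
```

This does not make the numbers "truly" random in the sense discussed above; it only makes the source explicit.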
Christopher Schene, System Engineer/Software Engineer

Author

Commented:
This is a pretty good solution and easy to implement.