zhshqzyc

Merge files with dos batch

Hi, I have 23 files that all share the same header (the first line). I want to merge them and keep only one header.
Each file is about 20 MB. The commands I used were:
copy chr1.assoc QT.assoc
for /L %%A in (2,1,23) do more +1 chr%%A.assoc >> QT.assoc

The problem is that it runs for a very long time and never finishes. Even appending just the second file got no response after 12 hours. I am running the batch file on a server with 32 GB of memory.
What is wrong?

Thanks for help.
ReneGe

Try this

 
copy chr1.assoc QT.assoc
FOR /L %%A IN (2,1,23) DO CALL :ReadFile %%A

EXIT

:ReadFile
FOR /F "delims=" %%A IN ('type chr%~1.assoc') DO (
	ECHO %%A>>QT.assoc
	exit /b
)

knightEknight

At 20MB each, this script should not take that long.  What happens if you test it with just one file to the console?

copy chr1.assoc QT.assoc
for /L %%A in (2,1, 2 ) do more +1 chr%%A.assoc

zhshqzyc

ASKER

Did you remove the header in the remaining files?
On line 1, use copy /y instead of just copy:
copy chr1.assoc QT.assoc

Copying the first file works and takes less than one second, but
for /L %%A in (2,1, 2 ) do more +1 chr%%A.assoc

gives no response at all.
I may have found the problem. I tested the code with the attached files:
copy chr1.txt QT.txt
for /L %%A in (2,1,3) do more +1 chr%%A.txt >> QT.txt
pause

The merged result came out as:
header
test1	1test2   2
test3 3

The expected result is:
header
test1	1
test2   2
test3 3

chr1.txt
chr2.txt
chr3.txt
Nice puzzle :) - I created this batch file for you:

:: create file list
dir /b file*.txt >fl.txt

:: get first file - used for creating header
for /f %%f in (fl.txt) do (
  set fname=%%f
  goto HDR 	
)

:HDR
:: get header from first file
for /f %%f in (%fname%) do (
  echo %%f
  goto APD
) > output.txt

:APD
:: append content of files to output
for /f %%f in (fl.txt) do (
   more +1 %%f >> output.txt
)

Put your file pattern on line 2; I used file*.txt for my 3 test files: file1.txt, file2.txt and file3.txt.

I tested with these 3 files:

::file1.txt
header
11
12
13

::file2.txt
header
21
22
23

::file3.txt
header
31
32
33

Output of batch file is this:

::output.txt
header
11
12
13
21
22
23
31
32
33
The problem is the speed. Appending the files is very, very slow. I may write some .NET code to parse the files instead; that might be faster.
Did not get any feedback about my script.
Why write code?

You could install Cygwin. Using a simple shell script, I get performance figures like this:

20 files of 17.5 MB each, merged (as you describe) in about 20 seconds:

$ date ; sh ./ccat.sh ; date
Sat May  7 23:02:37 WEDT 2011
Sat May  7 23:02:58 WEDT 2011

(ccat.sh is a simple shell script I wrote)
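
The ccat.sh script itself was not posted. A minimal sketch of what such a merge might look like in a POSIX shell (for example under Cygwin) is given here. The function name merge_chr, its arguments, and the use of awk are assumptions made for illustration, not the original script. It writes the header from the first file once, then appends everything after line 1 from each file in numeric order. Because awk ends every line it prints with a newline, a file whose last line lacks a trailing newline should not get glued to the first line of the next file.

```sh
#!/bin/sh
# Hypothetical sketch, not the original ccat.sh.
# Usage: merge_chr PREFIX EXT COUNT OUT
# Merges PREFIX1.EXT .. PREFIXCOUNT.EXT into OUT, keeping one header line.
merge_chr() {
    prefix=$1
    ext=$2
    count=$3
    out=$4

    # Write the header (line 1 of the first file) once, truncating OUT.
    awk 'NR == 1 { print; exit }' "${prefix}1.${ext}" > "$out" || return 1

    # Append every line after the first from each file, in numeric order.
    i=1
    while [ "$i" -le "$count" ]; do
        awk 'FNR > 1' "${prefix}${i}.${ext}" >> "$out" || return 1
        i=$((i + 1))
    done
}
```

With the file names from the question, this would be called as: merge_chr chr assoc 23 QT.assoc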
@ReneGe
Your code does not work; it gives the wrong result.

@gerwinjansen
I can't install Cygwin on the server because I lack the permissions. I guess the more command is what costs the time, which is why it is slow.
Would a VBS solution be acceptable? I suspect we could get a faster solution there.

~bp
ASKER CERTIFIED SOLUTION
huacat

This solution is only available to members.
Okay. Thanks, the header in the files is a sentence rather than a word. Is that okay?
A sentence is also OK in the batch file. Please note:
If the sentence contains special keywords or characters, I'm afraid we have to use an escape character.
For example, if the sentence includes |, <, > and so on, each of those characters should be preceded by a ^ character.

Another issue:
If you run the command on the command line, it's difficult to enter the TAB character.
So I recommend you write a batch file and put these commands into it; there we can use the TAB character easily.
@huacat: Yes, type /f is faster :)

@zhshqzyc: You can combine my batchfile with huacat's type /f command.

change line 20
from:
   more +1 %%f >> output.txt
to:
   type %%f | find /v "%header%" >> output.txt

add a line after line 13
  set header=%%f

I tested it: it takes about 10 s per 17.5 MB file, so the total time would be around 4 minutes. Take note of huacat's remarks about special characters in the header line of your files.
@gerwinjansen,

Could you post the entire code so it is clear?
Also, can you add code to delete the file fl.txt after the job is done?
Here it is, let me know if it works on your end.

:: create file list
dir /b test*.txt >fl.txt

:: get first file - used for creating header
for /f %%f in (fl.txt) do (
  set fname=%%f
  goto HDR 	
)

:HDR
:: get header from first file
for /f %%f in (%fname%) do (
  echo %%f
  set header=%%f
  goto APD
) > output.txt

:APD
:: append content of files to output
  for /f %%f in (fl.txt) do (
  type %%f | find /v "%header%" >> output.txt
)

del /q fl.txt

Thanks for your effort, but it is incorrect. The header is
 CHR         SNP   N_MISS   N_GENO   F_MISS

The separators are whitespace. Using the above code, the only header I got was
CHR

Also, the program crashed after copying the first file. That is, copying the first file succeeded (apart from the header), but appending the second file failed: nothing was appended, and then it crashed.
Correction
 
@echo off

SET Output=Output.txt
IF EXIST "%Output%" DEL "%Output%"

FOR /F %%A IN ('dir /b test*.txt') DO Call :GetHeader "%%~fA"
EXIT

:GetHeader
FOR /F "usebackq delims=" %%A IN ("%~1") DO (
	ECHO %%A>>"%Output%"
	EXIT /b
)

@zhshqzyc:

So from reading your comment "07/05/11 11:20 AM, ID: 35712604"

I see that actually you want the second line to be sent to the output. Right?
Also, I see the word Header. Is this the literal word you want there, or does it stand for a common header line found in all the files?

The following will read the second line and put the word "header" in your output file.

@echo off

SET Output=Output.txt

ECHO HEADER>"%Output%"

FOR /F %%A IN ('dir /b chr*.txt') DO Call :GetHeader "%%~fA"
EXIT

:GetHeader
FOR /F "usebackq skip=1 delims=" %%A IN ("%~1") DO (
	ECHO %%A>>"%Output%"
	EXIT /b
)

@ReneGe,

Yes, but the header is not always the word "HEADER". I hope it can be read by the code instead of being set manually.
Do they all have the same header?
Please give examples
Please see the attached.
chr.zip
So I see they all have the same header

@echo off

SET Output=Output.txt

FOR /F %%A IN ('dir /b chr*.txt') DO Call :GetHeader "%%~fA"
FOR /F %%A IN ('dir /b chr*.txt') DO Call :GetFirstLines "%%~fA"
EXIT

:GetHeader
FOR /F "usebackq delims=" %%A IN ("%~1") DO (
	ECHO %%A>"%Output%"
	EXIT /b
)

:GetFirstLines
FOR /F "usebackq Skip=1 delims=" %%A IN ("%~1") DO (
	ECHO %%A>>"%Output%"
	EXIT /b
)

Even if my script resolves your issue, please split points with all contributing experts.
Thanks for your input. It is still wrong, though. I am going to give up and split the points among everybody for your kind help.
Let me explain:
The sample file
chr1.lmiss:
 CHR         SNP   N_MISS   N_GENO   F_MISS
   1   rs4030303        0     2020        0
   1    rs940550        0     2020        0
   1   rs6594028        0     2020        0
   1  rs10458597       20     2020 0.009901
   1   rs9701055     1805     2020   0.8936
   1  rs12565286      562     2020   0.2782
   1  rs11804171      562     2020   0.2782
   1   rs2977670     1992     2020   0.9861

chr2.lmiss
 CHR          SNP   N_MISS   N_GENO   F_MISS
   2   rs11127467       62     2020  0.03069
   2   rs10193286       62     2020  0.03069
   2    rs4632379        7     2020 0.003465
   2    rs7595668       62     2020  0.03069
   2   rs10195681       62     2020  0.03069
   2   rs13386112       62     2020  0.03069
   2    rs7594188       62     2020  0.03069
   2    rs7594567       62     2020  0.03069
   2    rs6548217       10     2020  0.00495

I used this merge code:
@echo off

SET Output=QT.lmiss

FOR /F %%A IN ('dir /b chr*.lmiss') DO Call :GetHeader "%%~fA"
FOR /F %%A IN ('dir /b chr*.lmiss') DO Call :GetFirstLines "%%~fA"
pause

:GetHeader
FOR /F "usebackq delims=" %%A IN ("%~1") DO (
	ECHO %%A>"%Output%"
	EXIT /b
)

:GetFirstLines
FOR /F "usebackq Skip=1 delims=" %%A IN ("%~1") DO (
	ECHO %%A>>"%Output%"
	EXIT /b
)

The result is:
 CHR         SNP   N_MISS   N_GENO   F_MISS
  10   rs12218882       39     2020  0.01931
   4   rs4690249      193     2020  0.09554
   8   rs13276385      381     2020   0.1886
   5   rs10045830        3     2020 0.001485
  11   rs11605246        1     2020 0.000495
   9   rs2811026      506     2020   0.2505
  12   rs2003280       64     2020  0.03168
   6   rs7754266       13     2020 0.006436
   7   rs7457923      272     2020   0.1347
  13   rs2821685     2020     2020        1
  23   rs5939319        4     1999 0.002001
  18   rs7235612        0     2020        0
  14   rs2713521     2020     2020        1
  22   rs11089130     2020     2020        1
  19   rs7247199       10     2020  0.00495
  15   rs12443141     1950     2020   0.9653
   1   rs4030303        0     2020        0
  21   rs885550     2020     2020        1
   2   rs11127467       62     2020  0.03069
  16   rs3743872      163     2020  0.08069
  20   rs4814683       19     2020 0.009406
  17   rs17054921        3     2020 0.001485
   3   rs9756992        2     2020 0.0009901

I have 23 files, and only one line was extracted from each, so it is wrong.
But it doesn't matter; I will try to use C# code to create a batch file.
I appreciate your help, guys.
@gerwinjansen:
The code is still not working, but never mind. Thanks for the help.
THANKS!!!
@zhshqzyc

Since you never answered me I assumed a VBS solution was not desired.

~bp
This is confusing.

So you want to have the content of all your files, but with only one header. Correct?

 
@ECHO OFF

SET Output=Output.txt

FOR /F %%A IN ('dir /b chr*.txt') DO Call :GetHeader "%%~fA"
FOR /F %%A IN ('dir /b chr*.txt') DO FOR /F "usebackq Skip=1 delims=" %%B IN ("%%A") DO ECHO %%B>>"%Output%"
EXIT

:GetHeader
FOR /F "usebackq delims=" %%A IN ("%~1") DO (
	ECHO %%A>"%Output%"
	EXIT /b
)

@bp.
VBS is welcome, but I have already assigned the points and I am not familiar with it. Sorry, I forgot to answer your question. Do you mind if I open a new thread?
Opened a new thread at Merge files
copy chr1.assoc QT.assoc  
for /L %%A in (2,1,23) do type chr%%A.assoc | find /v "CHR      SNP      N_MISS" >> QT.assoc

I put the above code in a .bat file, created 23 files to test it, and it ran correctly.
Please remember: the character after CHR must be a TAB character if your header uses TAB to separate columns.