Solved

[C++] Reading, converting and saving .csv files

Posted on 2008-06-16
Last Modified: 2011-10-19
I have created a simple program in C++ (using Dev-C++) that opens a .csv file, removes every " (apostroph) and saves it to a new file.

This works great except i have some strange input files now.

The input file is 8MB in size, if i open it in notepad for example, copy everything in a new file in notepad and save it, its only 4MB in size...

If i load one of these files in my program, my program only exports a file with 32 bytes of 'blocks' (the char you see in windows when the current font does not have that char).

Looks like such a file is in another format (non ansi text?). Is there an easy way to make my software support this?
Opening and saving in a new text file isn't really helping our users speed up converting these files. That way they could just as well have done a simple find-and-replace in their text editor for the same result.
0
Comment
Question by:JapyDooge
63 Comments
 
LVL 53

Expert Comment

by:Infinity08
ID: 21793161
Any chance the file is in UTF-16 format ?
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793167
>> " (apostroph)
That is not an apostrophe, it is a double quote :)

>> The input file is 8MB in size, if i open it in notepad for example, copy everything in a new file in notepad and save it, its only 4MB in size...
Is it originally a Unicode file that you are converting to a narrow (ANSI) encoding when you copy/paste/save it? (On Windows, wide characters are twice as big as narrow ones.)

>> If i load one of these files in my program, my program only exports a file with 32 bytes of 'blocks'
Are you opening and reading it as a Unicode file? If it is indeed in UTF-16 format (as I suspect it is), that is what you need to do.

>> Looks like such a file is in another format (non ansi text?).
I suspect it's UTF16
http://en.wikipedia.org/wiki/UTF16

>> Is there an easy way to make my software support this
You'll have to open it as a wide-format file and either handle it as such internally or convert it to narrow using something like wcstombs()
http://www.cplusplus.com/reference/clibrary/cstdlib/wcstombs.html
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21793185
>> Any chance the file is in UTF-16 format ?

You can check that by looking at the first two bytes of the file. If they are either FE FF or FF FE, then you have a UTF-16 file :)
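As a minimal sketch of that check (the enum and function names are hypothetical, not from this thread), the first two bytes of a buffer read in binary mode can be compared against the two UTF-16 byte order marks. A match is only a strong hint: a file without a BOM can still be UTF-16, and FF FE 00 00 is also the UTF-32 little-endian mark.

```cpp
#include <string>

// Hypothetical names for illustration only.
enum Utf16Bom { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

// Inspect the first two bytes of a buffer that was read in binary mode.
Utf16Bom detect_utf16_bom(const std::string& bytes)
{
    if (bytes.size() < 2) return BOM_NONE;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFF && b1 == 0xFE) return BOM_UTF16_LE; // FF FE: little-endian
    if (b0 == 0xFE && b1 == 0xFF) return BOM_UTF16_BE; // FE FF: big-endian
    return BOM_NONE;
}
```

The casts to unsigned char matter because plain char may be signed, in which case comparing it directly with 0xFF would never match.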
0
 
LVL 10

Expert Comment

by:peetm
ID: 21793191
Is it unicode perhaps?
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793208
>> You can check that by looking at the first two bytes of the file.
Just to augment: this is the byte order mark (BOM), which is used to determine the endianness of a Unicode file
http://en.wikipedia.org/wiki/Byte-order_mark
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793241
Sounds like that is it.

I post my code here, so maybe someone has an idea what to do.
I think the code will be kinda messy, but usually i'm not a C++ programmer, i wrote this a few months ago.
#include "console.h"

#include "ConfigFile.h"

#include <iostream>

#include <stdarg.h>

#include <stdio.h>

#include <stdlib.h>

#include <string.h>
 

using namespace std;

using std::string;
 

string version = "1.0.0 Build 83";
 

void setrgb(int color)

{

	switch (color)

	{

	case 0:	// White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

		

	//unused colors removed

	

	default : // White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

	}

}
 

int main (int argc, char *argv[])

{

  FILE *fp; 

  char str[128];

  string buffertmp;

  setrgb(0);

  printf ("csvConvert © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());

  if(argv[1]){

   //there are command line options

   string inputfile(argv[1]);

   string outputfile = inputfile;

   if(argv[2]){

    //output file given

    outputfile = argv[2];

   }else{

    //no output file given, save as <input>_output.csv

    for ( int i = 0; i < outputfile.length(); i++){

     if (outputfile[i] =='.'){

      outputfile.replace(i,1,"_output.");

      i = outputfile.length();

     }

    }

   }

   printf ("Input file: %s\n", inputfile.c_str());

   printf ("Output file: %s\n\n", outputfile.c_str());

   if((fp = fopen(inputfile.c_str(), "rb"))==NULL) {

    printf("Cannot open file: %s\n", inputfile.c_str());

    exit(1);

   }

   fstream file_op(outputfile.c_str(),ios::out);

   while(!feof(fp)) {

    if(fgets(str, 126, fp)) 

     buffertmp = str;

     string searchString( "\"" ); 

     string replaceString( "" );

     string::size_type pos = 0;

     while ( (pos = buffertmp.find(searchString, pos)) != string::npos ) {

      buffertmp.replace( pos, searchString.size(), replaceString );

      pos++;

     }

    //write to file

    file_op<<buffertmp.c_str();

    }

   file_op.close();

   fclose(fp);

  }else{

   //there are no command line options

   printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");

   cin.get(); //wait for key

  }

  return 0;

}


0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793262
@evilrix:

Sounds good, but how do i implement that?

wcstombs(newstr, buffertmp.c_str());

something like that?
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793357
I checked the file in a HEX editor and it starts with FF FE (ÿ þ) so i assume its UTF-16
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793384
>> Sounds good, but how do i implement that?
Quick example...
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#define BUFFER_SIZE 100

typedef std::vector<char> charvec_t;

int main( void )
{
	std::wstring ws = L"Hello, world.";

	// One byte per character plus the terminator is enough for ASCII text;
	// other locales may need up to MB_CUR_MAX bytes per character.
	charvec_t cv(ws.size() + 1);

	printf("Convert wide-character string:\n" );

	// wcstombs returns (size_t)-1 if a character cannot be converted
	size_t count = wcstombs(&cv[0], ws.c_str(), cv.size());

	std::cout << "   Characters converted: " << count  << std::endl;
	std::cout << "    Multibyte character: " << &cv[0] << std::endl;
}

0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793407
Gonna try that, thanks in advance. Whoever has more information on this or examples, feel free to respond.
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793432
If you are coding on Windows you can also use WideCharToMultiByte, but this is Microsoft specific and not portable.
http://msdn.microsoft.com/en-us/library/ms776420(VS.85).aspx
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793493
Ok i'm having a few errors i'm not able to get rid of after trying a few things.

At first i just tried executing evilrix's code within my code so here i post my updated-non-working code.

The warning i get:
 D:\My Documents\Cpp Projects\csvConvert\main.cpp In function `int main(int, char**)':
70 D:\My Documents\Cpp Projects\csvConvert\main.cpp conversion from `char[128]' to non-scalar type `std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >' requested
 D:\My Documents\Cpp Projects\csvConvert\Makefile.win [Build Error]  [main.o] Error 1
#include "console.h"

#include "ConfigFile.h"

#include <iostream>

#include <stdarg.h>

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <vector>
 

#define BUFFER_SIZE 100
 

typedef std::vector<char> charvec_t;
 

using namespace std;

using std::string;
 

string version = "1.0.0 Build 83";
 

void setrgb(int color)

{

	switch (color)

	{

	case 0:	// White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

		

	//unused colors removed

	

	default : // White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

	}

}
 

int main (int argc, char *argv[])

{

  FILE *fp; 

  char str[128];

  string buffertmp;

  setrgb(0);

  printf ("csvConvert © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());

  if(argv[1]){

   //there are command line options

   string inputfile(argv[1]);

   string outputfile = inputfile;

   if(argv[2]){

    //output file given

    outputfile = argv[2];

   }else{

    //no output file given, save as <input>_output.csv

    for ( int i = 0; i < outputfile.length(); i++){

     if (outputfile[i] =='.'){

      outputfile.replace(i,1,"_output.");

      i = outputfile.length();

     }

    }

   }

   printf ("Input file: %s\n", inputfile.c_str());

   printf ("Output file: %s\n\n", outputfile.c_str());

   if((fp = fopen(inputfile.c_str(), "rb"))==NULL) {

    printf("Cannot open file: %s\n", inputfile.c_str());

    exit(1);

   }

   fstream file_op(outputfile.c_str(),ios::out);

   

   

   size_t  count;

   std::wstring ws = str;

   charvec_t cv(ws.size() + 1);

   count = wcstombs(&cv[0], ws.c_str(), cv.size());

   

   

   while(!feof(fp)) {

    if(fgets(str, 126, fp)) 

     buffertmp = str;

     string searchString( "\"" ); 

     string replaceString( "" );

     string::size_type pos = 0;

     while ( (pos = buffertmp.find(searchString, pos)) != string::npos ) {

      buffertmp.replace( pos, searchString.size(), replaceString );

      pos++;

     }

    //write to file

    file_op<<buffertmp.c_str();

    }

   file_op.close();

   fclose(fp);

  }else{

   //there are no command line options

   printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");

   cin.get(); //wait for key

  }

  return 0;

}


0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793512
@evilrix: i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793561
>>  i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc
OK, then wcstombs is probably what you want to use :)

Line 70: std::wstring ws = str;

str is a char array, not a wchar_t array, whereas std::wstring is wide -- the two are not compatible.

0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793626
hmm shouldn't i use mbstowcs instead of wsctombs?
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793644
oh hmm now i see, that would to the reverse thing, confusing those terms multi-byte and wide-char.
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793674
>> hmm shouldn't i use mbstowcs instead of wsctombs?
You want to go from wide to narrow, yes?

wcs = wide character string

to
mbs = multibyte (narrow) string

;)
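To make the naming concrete, here is a small sketch of both directions. The helper names narrow and widen are made up for this example; only wcstombs and mbstowcs are standard. It assumes the text has no embedded null characters, and the result depends on the current C locale (in the default "C" locale plain ASCII converts as expected).

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// wcs -> mbs: wide to narrow. Returns an empty string if conversion fails.
std::string narrow(const std::wstring& ws)
{
    // Worst case is MB_CUR_MAX bytes per wide character, plus the terminator.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(&buf[0], ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(&buf[0], n);
}

// mbs -> wcs: narrow to wide. Returns an empty string if conversion fails.
std::wstring widen(const std::string& s)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(&buf[0], s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(&buf[0], n);
}
```

For the question in this thread only narrow() is needed; widen() is shown for contrast.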
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793741
Hmm i found converting my input to wchar_t and back to char / string works fine in my code, altrough i'm not reading files in this format yet. Do i have to do something special for that?
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793753
D:\My Documents\Cpp Projects\csvConvert\main.cpp In function `int main(int, char**)':
84 D:\My Documents\Cpp Projects\csvConvert\main.cpp cannot convert `wchar_t*' to `char*' for argument `1' to `char* fgets(char*, int, FILE*)'
 D:\My Documents\Cpp Projects\csvConvert\Makefile.win [Build Error]  [main.o] Error 1

when i use:
   //...

   wchar_t *pmbtest      = (wchar_t *)malloc( sizeof( wchar_t ));

    if(fgets(pmbtest, 126, fp)){

     //...


0
 
LVL 40

Expert Comment

by:evilrix
ID: 21793845
Try fgetws()
http://www.opengroup.org/onlinepubs/009695399/functions/fgetws.html

Have you tried using wifstream?
#include <fstream>

#include <string>
 
 

int main()

{

	std::wifstream wifs("somefile.csv");

	std::wstring ws;

	std::getline(wifs, ws);

}


0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21793964
>> If you are coding on Windows you can also use WideCharToMultiByte

>> @evilrix: i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc

Just fyi : you can use the Windows API in Dev-C++. Just #include <windows.h>
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21793974
Hmm lol using that code i get:

 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp In function `int main(int, char**)':
12 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp `wifstream' is not a member of `std'
12 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp expected `;' before "wifs"
14 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp `wifs' undeclared (first use this function)
  (Each undeclared identifier is reported only once for each function it appears in.)
 D:\My Documents\Cpp Projects\csvConvertMB\Makefile.win [Build Error]  [main.o] Error 1
#include <cstdlib>

#include <iostream>

#include <fstream>

#include <string>

#include <stdio.h>

#include <wchar.h>
 

using namespace std;
 

int main(int argc, char *argv[])

{

	std::wifstream wifs("KP12062008.csv");

	std::wstring ws;

	std::getline(wifs, ws);
 

    return EXIT_SUCCESS;

}


0
 
LVL 40

Expert Comment

by:evilrix
ID: 21794015
Suffice it to say that it compiles fine in Visual Studio and g++ on Linux. Is Dev-C++ installed correctly? Infinity08 and I once had someone else with similar issues, and installing the latest version fixed it, if I recall correctly. Do you remember that, I8?
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21794143
Hmm well, i downloaded it today to re-edit this project from a few months ago that was made on my old laptop (wich i ritually burned for all the years of suffering after i had a new one, lol).
It's installed with all options in the default folder (C:\Dev-CPP).

Oh and i tried including windows.h on that last few lines of code but still no luck, same errors.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21794270
>> Do you remember that I8?

Well, wide character support isn't great in Dev-C++. I think the standard libraries are compiled without support for a large part of it, hence the errors.

Try using something other than wifstream and wstring. (You can get wstring to work if you want, but I don't think it's possible to get wifstream to work ...)
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21794306
Hmm will it be a better idea to re-create the whole project using Visual-C++ Express Edition?
It is'nt that much code and it looks like i have to do nasty things to get this fully working in Dev-C++.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21794446
You can opt to use the Windows API for doing the reading (instead of the standard wide character streams).
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21794625
Hmm i found this snippet of code that should work i think...

http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.langref.doc/rzan5mst111.htm

Gonna try that tomorrow (its 17:30 here so i'm going home for today).

Thanks for all your help so far! If this code won't work for me i gonna re-code it in Visual C++
#include <errno.h>

#include <stdio.h>

#include <stdlib.h>

#include <wchar.h>

 

int main(void)

{

   FILE    *stream;

   wchar_t  wcs[100];

 

   if (NULL == (stream = fopen("fgetws.dat", "r"))) {

      printf("Unable to open: \"fgetws.dat\"\n");

      exit(1);

   }

 

   errno = 0;

   if (NULL == fgetws(wcs, 100, stream)) {

      if (EILSEQ == errno) {

         printf("An invalid wide character was encountered.\n");

         exit(1);

      }

      else if (feof(stream))

              printf("End of file reached.\n");

           else

              perror("Read error.\n");

   }

   printf("wcs = \"%ls\"\n", wcs);

   fclose(stream);

   return 0;

 

   /************************************************************

      Assuming the file fgetws.dat contains:

 

      This test string should not return -1

 

      The output should be similar to:

 

      wcs = "This test string should not return -1"

   ************************************************************/

}


0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21800769
I'm getting a bit confused... Found lots of code all over the net, and at the moment i have the following code working (read file and print it to the console).
The sad thing my old replace-code does'nt work anymore (offcourse, becouse it was using strings) and i'm yet unable to find any code that works for me.
Am i looking at the wrong place, is Google a bitch today or is there no specific code for this?

Thanks in advance...
#include <errno.h>

#include <stdio.h>

#include <stdlib.h>

#include <wchar.h>

#include <iostream>

#include <stdarg.h>

#include <string.h>

#include <vector>
 

using namespace std;

using std::string;
 

int main(int argc, char *argv[])

{

   FILE    *stream;

   wchar_t  wcs[4];

  if(argv[1]){

   //there are command line options

   string inputfile(argv[1]);

   string outputfile = inputfile;

   if(argv[2]){

    //output file given

    outputfile = argv[2];

   }else{

    //no output file given, save as <input>_output.csv

    for ( int i = 0; i < outputfile.length(); i++){

     if (outputfile[i] =='.'){

      outputfile.replace(i,1,"_output.");

      i = outputfile.length();

     }

    }

   }

   printf ("Input file: %s\n", inputfile.c_str());

   printf ("Output file: %s\n\n", outputfile.c_str());

   

   if (NULL == (stream = fopen(inputfile.c_str(), "r"))) {

      printf("Unable to open: %s\n", inputfile.c_str());

      exit(1);

   }

 

   errno = 0;

   while(!feof(stream)) {

    if (NULL == fgetws(wcs, 2, stream)) {

      if (EILSEQ == errno) {

         printf("An invalid wide character was encountered.\n");

         exit(1);

      }

      else if (feof(stream))

              printf("End of file reached.\n");

           else

              perror("Read error.\n");

   }
 

    printf("%s", wcs);

}

   fclose(stream);

 }else{

   //there are no command line options

   printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");

   cin.get(); //wait for key

 }

 return 0;      

}


0
 
LVL 53

Assisted Solution

by:Infinity08
Infinity08 earned 200 total points
ID: 21800788
Just a question : do these files need to be kept as UTF-16, or can you convert them to normal ASCII ?
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21800816
I can convert them if i want, they come from various custommers/companies using various software.
The main problem that a third party software tool from years ago does'nt support the files with the " in them (well it works but it sees everything as text-only) so this software is some kind of pre-processor.

Worked fine till someone was using another kind of software for exporting them and they are in UTF-16 as i already learned :) so my tiny program does'nt accept them anymore.

For me it would be great if i could convert the chars in the buffer first to a string and then loop it trough my replace-code.
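One way to get from the raw buffer to an ordinary string, sketched under a strong assumption: the data is little-endian UTF-16 (the FF FE mark seen earlier) and anything outside plain ASCII may be replaced with '?'. The function name is hypothetical. Unlike the wcstombs route, this does not depend on the size of wchar_t or on the locale.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16LE bytes (optionally starting with the FF FE BOM) into a
// narrow string. Code units above 0x7F become '?', so this only suits
// data that is essentially ASCII.
std::string utf16le_to_ascii(const std::string& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
        i = 2; // skip the byte order mark

    std::string out;
    out.reserve(bytes.size() / 2);
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned unit = lo | (hi << 8); // low byte comes first
        out += (unit < 0x80) ? static_cast<char>(unit) : '?';
    }
    return out;
}
```

The resulting std::string can then go through an existing find-and-replace loop unchanged.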
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21800838
If you don't care about the format and you just want to remove all " then you could just read this in as a binary blob into a std::string (std::string can hold 8-bit data safely), remove all " and write that back to file as a binary blob.
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21800918
Err, a blob is a database format, is'nt it?

But, if i understand you right, i should read it in a binary mode (not text), and do a find & replace in the same way i did earlier and write it again in binary mode.
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21800948
>> Err, a blob is a database format, is'nt it?
Yes, but I was using it in the more generic sense... as in raw data -- sorry for confusion :)

>> i should read it in a binary mode (not text), and do a find & replace in the same way i did earlier and write it again in binary mode.
Yes, I can't see any reason why that wouldn't work... the only thing you will need to consider is that if it's Unicode you'll need to remove all the bytes that pertain to " -- this is the only tricky bit I can see. Depending upon the endianness (which is indicated by the byte order mark), each " will be either a '"' followed by a '\0' (little-endian) or vice versa (big-endian).
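A rough sketch of that byte-level approach, with a hypothetical function name. It walks the buffer two bytes at a time so that only whole code units equal to U+0022 are dropped; a plain byte-wise remove of '"' would also hit the 0x22 byte inside other characters and leave orphaned '\0' bytes behind. The output stays UTF-16, BOM included.

```cpp
#include <cstddef>
#include <string>

// Remove every UTF-16 code unit equal to U+0022 (") from raw file bytes.
// little_endian selects the byte order, as indicated by the BOM.
std::string strip_quotes_utf16(const std::string& bytes, bool little_endian)
{
    std::string out;
    out.reserve(bytes.size());
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        const char lo = little_endian ? bytes[i] : bytes[i + 1];
        const char hi = little_endian ? bytes[i + 1] : bytes[i];
        if (lo == '"' && hi == '\0') continue; // drop the whole 2-byte unit
        out += bytes[i];
        out += bytes[i + 1];
    }
    if (i < bytes.size()) out += bytes[i]; // keep a stray trailing byte as-is
    return out;
}
```

Because the comparison is done per code unit, a character such as U+2200, whose bytes are 00 22 in little-endian order, is left alone.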
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21800977
Hmm that does'nt sound that easy but i give it a try, can you give me a little idea what commands i should use to read data this way? As i noticed before, i'm not a C/C++ programmer, my expirence was in Visual Basic 5/6/2003.NET/2005.NET years ago so thats a complete other kind of development.

At least i manage to use the right syntax after a few years of PHP.

Already much thanks for all your answers so far, no wonder u both are high ranked xD
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801111
>> Hmm that does'nt sound that easy but i give it a try
Sure it is :)

>> can you give me a little idea what commands i should use to read data this way?
Try something like below. Actually, on reflection, I think using a vector<char> is better since the memory of a string isn't guaranteed to be contiguous and you'll need to read directly into the memory buffer it represents. The C++ Standard guarantees this is safe for a vector.
#include <fstream>

#include <vector>
 

typedef std::vector<char> vec_t;
 

int main()

{

	std::ifstream ifs("datain.txt", std::ios::binary);
 

	ifs.seekg(0, std::ios::end);

	std::streamsize size = ifs.tellg();

	ifs.seekg(0, std::ios::beg);
 

	vec_t data(size);

	ifs.read(&data[0], size);
 

	// process data
 

	std::ofstream ofs("dataout.txt", std::ios::binary);

	ofs.write(&data[0], size);

}


0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801115
^^^ obviously, I left out error handling for the sake of brevity :)
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801157
That builds nice and fast in Dev-C++ and copy's the data in a new file so for now im looking into stripping the "-chars.

If i can't manage to get working code i'll ask here again, if it works, i post the code here and offcourse accept the solutions :)
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801169
There's no rush to close the Q, our main concern is to try and find a solution that works for you :)

Good luck.

-Rx.
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801268
Lol, i found out that:

remove(data.begin(), data.end(), '\"');

Gives me an unreadable text, offcourse becouse it's multibyte and the " is not multibyte or wide of whatever, at least i understand the reason.
First line (CSV headers with no " in them) works fine, next lines don't.

Any ideas? :-)
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801338
>> Any ideas?
Try this
#include <fstream>

#include <vector>
 

typedef std::vector<char> vec_t;

typedef std::vector<wchar_t> wvec_t;
 

int main()

{

	// Open stream (as narrow)

	std::ifstream ifs("datain.txt", std::ios::binary);
 

	// Get size

	ifs.seekg(0, std::ios::end);

	std::streamsize size = ifs.tellg();

	ifs.seekg(0, std::ios::beg);
 

	// Create a wide char vector and read raw data into it

	wvec_t wdata(size);

	ifs.read(reinterpret_cast<char *>(&wdata[0]), size);
 

	// Convert wide to narrow

	vec_t ndata(wdata.size());

	size_t res = wcstombs(&ndata[0], &wdata[2], ndata.size());
 

	// Resize out buffer to the new size

	ndata.resize(res);
 

	// Strip all " chars (not the most efficient way to do it but it's simple!)

	vec_t::iterator itr = ndata.begin();

	while(itr != ndata.end())

	{

		if(*itr == '"')

		{

			itr = ndata.erase(itr);

		}

		else

		{

			++itr;

		}

	}
 

	// Persist new data to file.

	std::ofstream ofs("dataout.txt", std::ios::binary);

	ofs.write(&ndata[0], ndata.size());

}


0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801352
Oops, notices a silly error in that version... please ignore and try this one...

Line 23 assumes the Unicode file starts with a BOM and so ignores it (you'll have to add code to handle the case when it doesn't). Above I was accidentally skipping 2 chars and not one (typo).
#include <fstream>

#include <vector>
 

typedef std::vector<char> vec_t;

typedef std::vector<wchar_t> wvec_t;
 

int main()

{

	// Open stream (as narrow)

	std::ifstream ifs("datain.txt", std::ios::binary);
 

	// Get size

	ifs.seekg(0, std::ios::end);

	std::streamsize size = ifs.tellg();

	ifs.seekg(0, std::ios::beg);
 

	// Create a wide char vector and read raw data into it

	wvec_t wdata(size);

	ifs.read(reinterpret_cast<char *>(&wdata[0]), size);
 

	// Convert wide to narrow

	vec_t ndata(wdata.size());

	size_t res = wcstombs(&ndata[0], &wdata[1], ndata.size());
 

	// Resize out buffer to the new size

	ndata.resize(res);
 

	// Strip all " chars (not the most efficient way to do it but it's simple!)

	vec_t::iterator itr = ndata.begin();

	while(itr != ndata.end())

	{

		if(*itr == '"')

		{

			itr = ndata.erase(itr);

		}

		else

		{

			++itr;

		}

	}
 

	// Persist new data to file.

	std::ofstream ofs("dataout.txt", std::ios::binary);

	ofs.write(&ndata[0], ndata.size());

}


0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801390
Hmm lol it compiles but seems to be a nice endless loop, i had it running for 5 mins were the previous versions were done in 5 secs or so on a 8mb file
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801409
>> but seems to be a nice endless loop
It does? That's odd. It worked fine for me :)

The only place it loops is when it iterates over the data to remove ". It will either ++ the iterator if the char isn't a " or erase it, which then sets the iterator to the new value returned by erase... this is a typical idiom for erasing things from a vector because the original iterator is invalidated.
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801466
BTW, below is my before and after test data... so you can try it yourself to see what happens.
before.txt
after.txt
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801485
Yea i don't really see where it could hang but it nicely loads my machine...
load.png
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801497
Can you provide me a small sample of your data for me to test with? Just enough to cause this problem.
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801519
Okay thats strange... Using your test input file it works great...

The start of the files is the same (#FF #FE) only difference i see mine does'nt have items with " around them in the first line and is bigger... 8MB for this testfile, files can be up to 50MB
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801537
Yea i tried that, i opened my file and removed all lines except of one where i changed the contents. it's also in UTF 16 format (#FF #FE) but this one works correctly... I'm starting to get confused, lol
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801560
^^^ It could just be very slow. Erasing things from a vector is massively slow because each time you do it must shuffle everything in memory down to ensure the items are contiguous. Rather than erasing from current you could try building a new vector, this might be quicker. I would suggest you use the reserve() method on the vector to preallocate memory otherwise you'll get lots of heap allocations, which will also be slow. By using reserve you preallocate memory upfront. Note, this is slightly different from resize, which also adds default items to the vector.
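A small sketch of the build-a-new-vector idea (the function name is made up for this example). reserve() sets aside capacity once while size() stays 0, so each push_back is a cheap append and the whole pass is linear in the input size. The erase-in-a-loop version moves the rest of the buffer on every hit, which is roughly quadratic on a file full of quotes.

```cpp
#include <cstddef>
#include <vector>

// Copy everything except '"' into a fresh vector.
std::vector<char> without_quotes(const std::vector<char>& in)
{
    std::vector<char> out;
    out.reserve(in.size());           // one allocation up front; size() is still 0
    for (std::size_t i = 0; i < in.size(); ++i)
        if (in[i] != '"')
            out.push_back(in[i]);     // append only the characters we keep
    return out;
}
```

Had resize() been used instead of reserve(), the vector would already contain in.size() zero bytes and push_back would append after them.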
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801569
Okay i think i know what goes wrong, can it be some kind of a buffer overflow or other memory issue?

When i make a smaller version of my file, it works fine.

When i make it big, it hangs...

I'll upload two sample files hereby:
doesnt.csv.txt
works.csv.txt
0
 
LVL 40

Accepted Solution

by:
evilrix earned 300 total points
ID: 21801595
>> Okay i think i know what goes wrong, can it be some kind of a buffer overflow or other memory issue?
Unlikely, since every buffer is sized from the file length and wcstombs is told how big the output buffer is :)

>> When i make a smaller version of my file, it works fine.
I think it's performance related -- try the version below, which creates a new vector rather than modifying the existing one. This uses more memory but should be heaps faster.

>> When i make it big, it hangs...
I don't think it's hung -- I suspect it's busy :)

Can you try the code below and let me know how this goes? It uses std::remove_copy to do the work.
http://www.cplusplus.com/reference/algorithm/replace_copy.html
#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <vector>

typedef std::vector<char> vec_t;
typedef std::vector<wchar_t> wvec_t;

int main()
{
	// Open stream (as narrow)
	std::ifstream ifs("indata.txt", std::ios::binary);
	if(!ifs) return 1;

	// Get size (in bytes)
	ifs.seekg(0, std::ios::end);
	std::streamsize size = ifs.tellg();
	ifs.seekg(0, std::ios::beg);
	if(size < 2) return 1; // nothing to convert

	// Create a wide char vector and read raw data into it.
	// NB. Assumes wchar_t is a 16-bit UTF-16 code unit, as on Windows.
	// The vector holds 'size' wide chars but only 'size' bytes are read,
	// so the zero-filled remainder terminates the string for wcstombs.
	wvec_t wdata(size);
	ifs.read(reinterpret_cast<char *>(&wdata[0]), size);

	// Convert wide to narrow
	vec_t ndata(wdata.size());
	size_t res = wcstombs(&ndata[0], &wdata[1], ndata.size()); // NB. Ignores BOM at start of wdata
	if(res == static_cast<size_t>(-1)) return 1; // a character could not be converted

	// Resize out buffer to the new size
	ndata.resize(res);

	// Strip all " chars by copying everything but " to a new vector
	vec_t cdata(ndata.size());
	vec_t::iterator itrEnd = std::remove_copy(ndata.begin(), ndata.end(), cdata.begin(), '"');
	cdata.erase(itrEnd, cdata.end());

	// Persist new data to file.
	std::ofstream ofs("outdata.txt", std::ios::binary);
	if(!cdata.empty()) ofs.write(&cdata[0], cdata.size());
}

0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801666
Gonna try that. I noticed when it's busy, it's not writing to the file yet so it stores the whole file in memory? Maybe that's what makes it slow...
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801684
Yes, it is all being done in memory -- but obviously the OS will page as necessary. When you consider most PCs have a min of 512MB of RAM, 50MB is not a big file to process in memory :)
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801871
Woooow, that worked and faaast :-D

Gonna finish my code to accept input/output files in command line and i'll post it in a few mins :-)
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801884
>> Woooow, that worked and faaast
Hurrah! :)
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21801985
So, as i promised before, i will share my defenitive code.

It works great now and supports drag and drop.
main.cpp:

#include <fstream>

#include <vector>

#include <algorithm>

#include <string.h>

#include "console.h"

 

typedef std::vector<char> vec_t;

typedef std::vector<wchar_t> wvec_t;

 

using namespace std;

using std::string;

 

string version = "1.0.1 Build 1";
 

void setrgb(int color){

	switch (color){

	case 0:	// White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

		

	//unused colors removed

	

	default : // White on Black

		SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |

			FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);

		break;

	}

}
 

int main (int argc, char *argv[]){

  setrgb(0);

  printf ("csvConvert Multi-Byte version © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());

  if(argv[1]){

   //there are command line options

   string inputfile(argv[1]);

   string outputfile = inputfile;

   if(argv[2]){

    //output file given

    outputfile = argv[2];

   }else{

    //no output file given, save as <input>_output.csv

    for ( int i = 0; i < outputfile.length(); i++){

     if (outputfile[i] =='.'){

      outputfile.replace(i,1,"_output.");

      i = outputfile.length();

     }

    }

   }

   printf ("Input file: %s\n", inputfile.c_str());

   printf ("Output file: %s\n\n", outputfile.c_str());

   printf ("Converting...\n\n");

	// Open stream (as narrow)

	std::ifstream ifs(inputfile.c_str(), std::ios::binary);

 

	// Get size

	ifs.seekg(0, std::ios::end);

	std::streamsize size = ifs.tellg();

	ifs.seekg(0, std::ios::beg);

 

	// Create a wide char vector and read raw data into it

	wvec_t wdata(size);

	ifs.read(reinterpret_cast<char *>(&wdata[0]), size);

 

	// Convert wide to narrow

	vec_t ndata(wdata.size());

	size_t res = wcstombs(&ndata[0], &wdata[1], ndata.size()); // NB. Ignores BOM at start of wdata

 

	// Resize out buffer to the new size

	ndata.resize(res);

 

	// Strip all " chars by copying to a new vector everything but ""

	vec_t cdata(ndata.size());

	vec_t::iterator itrEnd = std::remove_copy(ndata.begin(), ndata.end(), cdata.begin(), '"');

	cdata.erase(itrEnd, cdata.end());

 

	// Persist new data to file.

	std::ofstream ofs(outputfile.c_str(), std::ios::binary);

	ofs.write(&cdata[0], cdata.size());

    }else{

     //there are no command line options

     printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvConvertMB <inputfile> <outputfile>\n\nPress any key to exit...");

     cin.get(); //wait for key

    }

}
 

console.h:

// console.h

//

#ifndef CONSOLE_H

#define CONSOLE_H
 

#include <iostream>

#include <iomanip>

#include <cmath>

#include <cstdlib>

#include <windows.h>
 

void clrscr();

void gotoxy(int, int);

void setrgb(int);
 

#endif


0
 
LVL 6

Author Closing Comment

by:JapyDooge
ID: 31467576
You guys helped me really great :-) Thanks for that all!
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21801998
>> and supports drag and drop
Show off ;)

Thanks for sharing your final code... it'll be very good for the PAQ database.

Good luck my friend.

-Rx.
0
 
LVL 6

Author Comment

by:JapyDooge
ID: 21802018
>>>> and supports drag and drop
>>Show off ;)
Bwhehehe :-P

>>Thanks for sharing you final code... it'll be very good for the PAQ databse.
No problem, this is truely a community project and those have to be open source lol xD

>>Good luck my friend.
>>-Rx.
Thanks, same to you xD now i gonna make an employee happy who's otherwise changing the column data types by hand and doing find-and-replaces in Programmers Notepad
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21802360
Wow, seems I missed all the action. Good work, you two !! ;)
0
 
LVL 40

Expert Comment

by:evilrix
ID: 21802394
>> Wow, seems I missed all the action
No doubt, knowing you, busy in some Belgian bar all afternoon ;)

Cheers I8.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21802476
I wish ... lol.
0
