JapyDooge
asked on
[C++] Reading, converting and saving .csv files
I have created a simple program in C++ (using Dev-C++) that opens a .csv file, removes every " (apostroph) and saves it to a new file.
This works great except i have some strange input files now.
The input file is 8MB in size, if i open it in notepad for example, copy everything in a new file in notepad and save it, its only 4MB in size...
If i load one of these files in my program, my program only exports a file with 32 bytes of 'blocks' (the char you see in windows when the current font does not have that char).
Looks like such a file is in another format (non ansi text?). Is there an easy way to make my software support this?
Opening and saving in a new textfile isn't really helping our users speed up converting these files. This way they could have done a simple find-replace in their texteditor for the same result.
Any chance the file is in UTF-16 format ?
>> " (apostroph)
That is not an apostrophe, it is a double quote :)
>> The input file is 8MB in size, if i open it in notepad for example, copy everything in a new file in notepad and save it, its only 4MB in size...
Is it originally a Unicode file, so that when you copy/paste/save it you are converting it to ASCII? (Wide is twice as big as narrow on Windows.)
>> If i load one of these files in my program, my program only exports a file with 32 bytes of 'blocks'
Are you opening it up and reading it as a Unicode file, if it is indeed in UTF-16 format (as I suspect it is)?
>> Looks like such a file is in another format (non ansi text?).
I suspect it's UTF16
http://en.wikipedia.org/wiki/UTF16
>> Is there an easy way to make my software support this
You'll have to open it as a wide format file and either handle it as such internally or convert it to narrow using something like wcstombs()
http://www.cplusplus.com/reference/clibrary/cstdlib/wcstombs.html
>> Any chance the file is in UTF-16 format ?
You can check that by looking at the first two bytes of the file. If they are either FE FF or FF FE, then you have a UTF-16 file :)
Is it unicode perhaps?
>> You can check that by looking at the first two bytes of the file.
Just to augment, this is the byte order mark; it is used to figure out the endianness of a Unicode format file
http://en.wikipedia.org/wiki/Byte-order_mark
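The two-byte check described above can be sketched as follows. This is only an illustration: the enum and the function name `detect_utf16_bom` are made up for this example, and a file without a BOM may still be UTF-16 (and FF FE followed by 00 00 could also be a UTF-32 file), so treat the result as a hint rather than proof.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names, not from the thread: possible results of the check.
enum Utf16Bom { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

// Look at the first two bytes of a raw buffer and classify them.
Utf16Bom detect_utf16_bom(const std::string& bytes)
{
    if (bytes.size() < 2) return BOM_NONE;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFF && b1 == 0xFE) return BOM_UTF16_LE; // FF FE: low byte first
    if (b0 == 0xFE && b1 == 0xFF) return BOM_UTF16_BE; // FE FF: high byte first
    return BOM_NONE;
}
```

The buffer would typically hold the first bytes of a file opened in binary mode; a result of `BOM_UTF16_LE` matches the FF FE start reported later in this thread.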
ASKER
Sounds like that is it.
I post my code here, so maybe someone has an idea what to do.
I think the code will be kinda messy, but usually i'm not a C++ programmer, i wrote this a few months ago.
#include "console.h"
#include "ConfigFile.h"
#include <iostream>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
using namespace std;
using std::string;
string version = "1.0.0 Build 83";
void setrgb(int color)
{
switch (color)
{
case 0: // White on Black
SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
break;
//unused colors removed
default : // White on Black
SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
break;
}
}
int main (int argc, char *argv[])
{
FILE *fp;
char str[128];
string buffertmp;
setrgb(0);
printf ("csvConvert © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());
if(argv[1]){
//there are command line options
string inputfile(argv[1]);
string outputfile = inputfile;
if(argv[2]){
//output file given
outputfile = argv[2];
}else{
//no output file given, save as <input>_output.csv
for ( int i = 0; i < outputfile.length(); i++){
if (outputfile[i] =='.'){
outputfile.replace(i,1,"_output.");
i = outputfile.length();
}
}
}
printf ("Input file: %s\n", inputfile.c_str());
printf ("Output file: %s\n\n", outputfile.c_str());
if((fp = fopen(inputfile.c_str(), "rb"))==NULL) {
printf("Cannot open file: %s\n", inputfile.c_str());
exit(1);
}
fstream file_op(outputfile.c_str(),ios::out);
while(!feof(fp)) {
if(fgets(str, 126, fp))
buffertmp = str;
string searchString( "\"" );
string replaceString( "" );
string::size_type pos = 0;
while ( (pos = buffertmp.find(searchString, pos)) != string::npos ) {
buffertmp.replace( pos, searchString.size(), replaceString );
pos++;
}
//write to file
file_op<<buffertmp.c_str();
}
file_op.close();
fclose(fp);
}else{
//there are no command line options
printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");
cin.get(); //wait for key
}
return 0;
}
ASKER
@evilrix:
Sounds good, but how do i implement that?
wcstombs(newstr, buffertmp.c_str());
something like that?
ASKER
I checked the file in a HEX editor and it starts with FF FE (ÿ þ) so i assume its UTF-16
>> Sounds good, but how do i implement that?
Quick example...
#include <cstdio>    // printf
#include <cstdlib>   // wcstombs
#include <string>
#include <vector>
#include <iostream>
#define BUFFER_SIZE 100
typedef std::vector<char> charvec_t;
int main( void )
{
    size_t count;
    std::wstring ws = L"Hello, world.";
    // One char per wide char plus the terminator (enough for plain ASCII text)
    charvec_t cv(ws.size() + 1);
    printf("Convert wide-character string:\n" );
    count = wcstombs(&cv[0], ws.c_str(), cv.size());
    std::cout << " Characters converted: " << count << std::endl;
    std::cout << " Multibyte character: " << &cv[0] << std::endl;
}
ASKER
Gonna try that, thanks in advance. Whoever has more information on this or examples, feel free to respond.
If you are coding on Windows you can also use WideCharToMultiByte, but this is Microsoft specific and not portable.
http://msdn.microsoft.com/en-us/library/ms776420(VS.85).aspx
ASKER
Ok i'm having a few errors i'm not able to get rid of after trying a few things.
At first i just tried executing evilrix's code within my code so here i post my updated-non-working code.
The warning i get:
D:\My Documents\Cpp Projects\csvConvert\main.cpp In function `int main(int, char**)':
70 D:\My Documents\Cpp Projects\csvConvert\main.cpp conversion from `char[128]' to non-scalar type `std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >' requested
D:\My Documents\Cpp Projects\csvConvert\Makefile.win [Build Error] [main.o] Error 1
#include "console.h"
#include "ConfigFile.h"
#include <iostream>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#define BUFFER_SIZE 100
typedef std::vector<char> charvec_t;
using namespace std;
using std::string;
string version = "1.0.0 Build 83";
void setrgb(int color)
{
switch (color)
{
case 0: // White on Black
SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
break;
//unused colors removed
default : // White on Black
SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
break;
}
}
int main (int argc, char *argv[])
{
FILE *fp;
char str[128];
string buffertmp;
setrgb(0);
printf ("csvConvert © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());
if(argv[1]){
//there are command line options
string inputfile(argv[1]);
string outputfile = inputfile;
if(argv[2]){
//output file given
outputfile = argv[2];
}else{
//no output file given, save as <input>_output.csv
for ( int i = 0; i < outputfile.length(); i++){
if (outputfile[i] =='.'){
outputfile.replace(i,1,"_output.");
i = outputfile.length();
}
}
}
printf ("Input file: %s\n", inputfile.c_str());
printf ("Output file: %s\n\n", outputfile.c_str());
if((fp = fopen(inputfile.c_str(), "rb"))==NULL) {
printf("Cannot open file: %s\n", inputfile.c_str());
exit(1);
}
fstream file_op(outputfile.c_str(),ios::out);
size_t count;
std::wstring ws = str;
charvec_t cv(ws.size() + 1);
count = wcstombs(&cv[0], ws.c_str(), cv.size());
while(!feof(fp)) {
if(fgets(str, 126, fp))
buffertmp = str;
string searchString( "\"" );
string replaceString( "" );
string::size_type pos = 0;
while ( (pos = buffertmp.find(searchString, pos)) != string::npos ) {
buffertmp.replace( pos, searchString.size(), replaceString );
pos++;
}
//write to file
file_op<<buffertmp.c_str();
}
file_op.close();
fclose(fp);
}else{
//there are no command line options
printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");
cin.get(); //wait for key
}
return 0;
}
ASKER
@evilrix: i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc
>> i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc
Ok, then wcstombs is probably what you want to use :)
Line 70: std::wstring ws = str;
str is a char array and not a wchar_t array, while the string is wide -- they are not compatible.
ASKER
hmm shouldn't i use mbstowcs instead of wcstombs?
ASKER
oh hmm now i see, that would do the reverse thing, confusing those terms multi-byte and wide-char.
>> hmm shouldn't i use mbstowcs instead of wcstombs?
You want to go from wide to Narrow, yes?
wcs = wide
to
mbs = narrow
;)
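To make the two directions concrete, here is a small sketch of both conversions. The helper names `narrow` and `widen` are invented for this example; both rely on the current C locale, so anything outside plain ASCII may fail to convert unless a suitable locale has been set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: wide -> narrow (wcs "to" mbs) using the current C locale.
// Returns an empty string if some character cannot be converted.
std::string narrow(const std::wstring& ws)
{
    if (ws.empty()) return std::string();
    // Worst case: MB_CUR_MAX bytes per wide character, plus the terminator.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(&buf[0], ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(&buf[0], n);
}

// Hypothetical helper: narrow -> wide (mbs "to" wcs), the reverse direction.
std::wstring widen(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(&buf[0], s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(&buf[0], n);
}
```

Both functions stop at the first embedded null character, because the underlying C functions work on null-terminated strings.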
ASKER
Hmm i found converting my input to wchar_t and back to char / string works fine in my code, although i'm not reading files in this format yet. Do i have to do something special for that?
ASKER
D:\My Documents\Cpp Projects\csvConvert\main.cpp In function `int main(int, char**)':
84 D:\My Documents\Cpp Projects\csvConvert\main.cpp cannot convert `wchar_t*' to `char*' for argument `1' to `char* fgets(char*, int, FILE*)'
D:\My Documents\Cpp Projects\csvConvert\Makefile.win [Build Error] [main.o] Error 1
when i use:
//...
wchar_t *pmbtest = (wchar_t *)malloc( sizeof( wchar_t ));
if(fgets(pmbtest, 126, fp)){
//...
Try fgetws()
http://www.opengroup.org/onlinepubs/009695399/functions/fgetws.html
Have you tried using wifstream?
#include <fstream>
#include <string>
int main()
{
    std::wifstream wifs("somefile.csv");
    std::wstring ws;
    std::getline(wifs, ws);
}
>> If you are coding on Windows you can also use WideCharToMultiByte
>> @evilrix: i'm coding on Windows but not using Microsoft Visual-C++. I'm using Dev-C++ 4.9.9.2 and gcc
Just fyi : you can use the Windows API in Dev-C++. Just #include <windows.h>
ASKER
Hmm lol using that code i get:
D:\My Documents\Cpp Projects\csvConvertMB\main.cpp In function `int main(int, char**)':
12 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp `wifstream' is not a member of `std'
12 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp expected `;' before "wifs"
14 D:\My Documents\Cpp Projects\csvConvertMB\main.cpp `wifs' undeclared (first use this function)
(Each undeclared identifier is reported only once for each function it appears in.)
D:\My Documents\Cpp Projects\csvConvertMB\Makefile.win [Build Error] [main.o] Error 1
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include <stdio.h>
#include <wchar.h>
using namespace std;
int main(int argc, char *argv[])
{
std::wifstream wifs("KP12062008.csv");
std::wstring ws;
std::getline(wifs, ws);
return EXIT_SUCCESS;
}
Suffice to say that compiles fine in Visual Studio and g++ on Linux. Is Dev-CPP installed correctly? Infinity08 and I once had someone else with similar issues and installing the latest version fixed it, if I recall correctly. Do you remember that I8?
ASKER
Hmm well, i downloaded it today to re-edit this project from a few months ago that was made on my old laptop (which i ritually burned for all the years of suffering after i had a new one, lol).
It's installed with all options in the default folder (C:\Dev-CPP).
Oh and i tried including windows.h on that last few lines of code but still no luck, same errors.
>> Do you remember that I8?
Well, wide character support isn't great in Dev-C++. I think the standard libraries are compiled without support for a large part of it, hence the errors.
Try using something other than wifstream and wstring. (You can get wstring to work if you want, but I don't think it's possible to get wifstream to work ...)
ASKER
Hmm will it be a better idea to re-create the whole project using Visual-C++ Express Edition?
It isn't that much code and it looks like i have to do nasty things to get this fully working in Dev-C++.
You can opt to use the Windows API for doing the reading (instead of the standard wide character streams).
ASKER
Hmm i found this snippet of code that should work i think...
http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.langref.doc/rzan5mst111.htm
Gonna try that tomorrow (its 17:30 here so i'm going home for today).
Thanks for all your help so far! If this code won't work for me i gonna re-code it in Visual C++
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
int main(void)
{
FILE *stream;
wchar_t wcs[100];
if (NULL == (stream = fopen("fgetws.dat", "r"))) {
printf("Unable to open: \"fgetws.dat\"\n");
exit(1);
}
errno = 0;
if (NULL == fgetws(wcs, 100, stream)) {
if (EILSEQ == errno) {
printf("An invalid wide character was encountered.\n");
exit(1);
}
else if (feof(stream))
printf("End of file reached.\n");
else
perror("Read error.\n");
}
printf("wcs = \"%ls\"\n", wcs);
fclose(stream);
return 0;
/************************************************************
Assuming the file fgetws.dat contains:
This test string should not return -1
The output should be similar to:
wcs = "This test string should not return -1"
************************************************************/
}
ASKER
I'm getting a bit confused... Found lots of code all over the net, and at the moment i have the following code working (read file and print it to the console).
The sad thing is my old replace-code doesn't work anymore (of course, because it was using strings) and i'm not yet able to find any code that works for me.
Am i looking at the wrong place, is Google a bitch today or is there no specific code for this?
Thanks in advance...
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <iostream>
#include <stdarg.h>
#include <string.h>
#include <vector>
using namespace std;
using std::string;
int main(int argc, char *argv[])
{
FILE *stream;
wchar_t wcs[4];
if(argv[1]){
//there are command line options
string inputfile(argv[1]);
string outputfile = inputfile;
if(argv[2]){
//output file given
outputfile = argv[2];
}else{
//no output file given, save as <input>_output.csv
for ( int i = 0; i < outputfile.length(); i++){
if (outputfile[i] =='.'){
outputfile.replace(i,1,"_output.");
i = outputfile.length();
}
}
}
printf ("Input file: %s\n", inputfile.c_str());
printf ("Output file: %s\n\n", outputfile.c_str());
if (NULL == (stream = fopen(inputfile.c_str(), "r"))) {
printf("Unable to open: %s\n", inputfile.c_str());
exit(1);
}
errno = 0;
while(!feof(stream)) {
if (NULL == fgetws(wcs, 2, stream)) {
if (EILSEQ == errno) {
printf("An invalid wide character was encountered.\n");
exit(1);
}
else if (feof(stream))
printf("End of file reached.\n");
else
perror("Read error.\n");
}
printf("%s", wcs);
}
fclose(stream);
}else{
//there are no command line options
printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvconvert <inputfile> <outputfile>\n\nPress any key to exit...");
cin.get(); //wait for key
}
return 0;
}
ASKER
I can convert them if i want, they come from various customers/companies using various software.
The main problem is that a third party software tool from years ago doesn't support the files with the " in them (well it works but it sees everything as text-only) so this software is some kind of pre-processor.
Worked fine till someone was using another kind of software for exporting them and they are in UTF-16 as i already learned :) so my tiny program doesn't accept them anymore.
For me it would be great if i could convert the chars in the buffer first to a string and then loop it through my replace-code.
If you don't care about the format and you just want to remove all " then you could just read this in as a binary blob into a std::string (std::string can hold 8 bit data safely), remove all " and write that back to file as a binary blob.
ASKER
Err, a blob is a database format, isn't it?
But, if i understand you right, i should read it in a binary mode (not text), and do a find & replace in the same way i did earlier and write it again in binary mode.
>> Err, a blob is a database format, isn't it?
Yes, but I was using it in the more generic sense... as in raw data -- sorry for confusion :)
>> i should read it in a binary mode (not text), and do a find & replace in the same way i did earlier and write it again in binary mode.
Yes, I can't see any reason why that wouldn't work... the only thing you will need to consider is that if it's Unicode you'll need to remove all the bytes that pertain to " -- this is the only tricky bit I can see. Depending upon the endianness (which is defined by the byte order mark) it'll either be a '\0' followed by a '"' or vice versa.
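A rough sketch of that byte-level approach is shown here; it is not the code that was eventually accepted in this thread. The function name `strip_quotes_utf16` and its `little_endian` flag are invented for the example. It walks the buffer two bytes at a time so that a 0x22 byte belonging to some other character is left alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: drop every UTF-16 code unit equal to U+0022 (")
// from a raw byte buffer. Pass little_endian = true for a FF FE byte order mark.
std::string strip_quotes_utf16(const std::string& bytes, bool little_endian)
{
    std::string out;
    out.reserve(bytes.size());
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned char first = static_cast<unsigned char>(bytes[i]);
        const unsigned char second = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned char lo = little_endian ? first : second;
        const unsigned char hi = little_endian ? second : first;
        if (lo == 0x22 && hi == 0x00) continue; // a double quote: skip both bytes
        out.push_back(bytes[i]);
        out.push_back(bytes[i + 1]);
    }
    if (i < bytes.size()) out.push_back(bytes[i]); // odd trailing byte is kept as-is
    return out;
}
```

Because the byte order mark is just another code unit, it passes through unchanged and the output stays a valid UTF-16 file in the same byte order.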
ASKER
Hmm that doesn't sound that easy but i give it a try, can you give me a little idea what commands i should use to read data this way? As i noted before, i'm not a C/C++ programmer, my experience was in Visual Basic 5/6/2003.NET/2005.NET years ago so that's a completely different kind of development.
At least i manage to use the right syntax after a few years of PHP.
Already much thanks for all your answers so far, no wonder u both are high ranked xD
>> Hmm that doesn't sound that easy but i give it a try
Sure it is :)
>> can you give me a little idea what commands i should use to read data this way?
Try something like below. Actually, on reflection, I think using a vector<char> is better since the memory of a string isn't guaranteed to be contiguous and you'll need to read into the memory buffer it represents directly. The C++ Standard guarantees this is safe in a vector.
#include <fstream>
#include <vector>
typedef std::vector<char> vec_t;
int main()
{
    std::ifstream ifs("datain.txt", std::ios::binary);
    ifs.seekg(0, std::ios::end);
    std::streamsize size = ifs.tellg();
    ifs.seekg(0, std::ios::beg);
    vec_t data(size);
    ifs.read(&data[0], size);
    // process data
    std::ofstream ofs("dataout.txt", std::ios::binary);
    ofs.write(&data[0], size);
}
^^^ obviously, I left out error handling for the sake of brevity :)
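For completeness, one possible way to add the omitted error handling is sketched here. The function names `read_all_bytes` and `write_all_bytes` and the choice to throw `std::runtime_error` are assumptions made for this example, not something proposed in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file as raw bytes, throwing on failure.
std::vector<char> read_all_bytes(const std::string& path)
{
    std::ifstream ifs(path.c_str(), std::ios::binary);
    if (!ifs) throw std::runtime_error("cannot open " + path);

    ifs.seekg(0, std::ios::end);
    const std::streamoff size = ifs.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    ifs.seekg(0, std::ios::beg);

    std::vector<char> data(static_cast<std::size_t>(size));
    if (!data.empty()) { // &data[0] is only valid for a non-empty vector
        ifs.read(&data[0], static_cast<std::streamsize>(data.size()));
        if (ifs.gcount() != static_cast<std::streamsize>(data.size()))
            throw std::runtime_error("short read from " + path);
    }
    return data;
}

// Hypothetical helper: write raw bytes to a file, throwing on failure.
void write_all_bytes(const std::string& path, const std::vector<char>& data)
{
    std::ofstream ofs(path.c_str(), std::ios::binary);
    if (!ofs) throw std::runtime_error("cannot create " + path);
    if (!data.empty())
        ofs.write(&data[0], static_cast<std::streamsize>(data.size()));
    ofs.flush(); // force buffered data out so a write failure shows up here
    if (!ofs) throw std::runtime_error("cannot write " + path);
}
```

A caller would wrap the two calls in a try/catch block and print the exception's `what()` text before exiting.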
ASKER
That builds nice and fast in Dev-C++ and copies the data to a new file, so for now i'm looking into stripping the "-chars.
If i can't manage to get working code i'll ask here again, if it works, i post the code here and of course accept the solutions :)
There's no rush to close the Q, our main concern is to try and find a solution that works for you :)
Good luck.
-Rx.
ASKER
Lol, i found out that:
remove(data.begin(), data.end(), '\"');
Gives me an unreadable text, of course because it's multibyte and the " is not multibyte or wide or whatever, at least i understand the reason.
First line (CSV headers with no " in them) works fine, next lines don't.
Any ideas? :-)
>> Any ideas?
Try this
#include <fstream>
#include <vector>
typedef std::vector<char> vec_t;
typedef std::vector<wchar_t> wvec_t;
int main()
{
// Open stream (as narrow)
std::ifstream ifs("datain.txt", std::ios::binary);
// Get size
ifs.seekg(0, std::ios::end);
std::streamsize size = ifs.tellg();
ifs.seekg(0, std::ios::beg);
// Create a wide char vector and read raw data into it
wvec_t wdata(size);
ifs.read(reinterpret_cast<char *>(&wdata[0]), size);
// Convert wide to narrow
vec_t ndata(wdata.size());
size_t res = wcstombs(&ndata[0], &wdata[2], ndata.size());
// Resize out buffer to the new size
ndata.resize(res);
// Strip all " chars (not the most efficient way to do it but it's simple!)
vec_t::iterator itr = ndata.begin();
while(itr != ndata.end())
{
if(*itr == '"')
{
itr = ndata.erase(itr);
}
else
{
++itr;
}
}
// Persist new data to file.
std::ofstream ofs("dataout.txt", std::ios::binary);
ofs.write(&ndata[0], ndata.size());
}
Oops, noticed a silly error in that version... please ignore and try this one...
Line 23 assumes the Unicode file starts with a BOM and so ignores it (you'll have to add code to handle this for the case when it doesn't). Above I was accidentally skipping 2 chars and not one (typo).
#include <fstream>
#include <vector>
typedef std::vector<char> vec_t;
typedef std::vector<wchar_t> wvec_t;
int main()
{
// Open stream (as narrow)
std::ifstream ifs("datain.txt", std::ios::binary);
// Get size
ifs.seekg(0, std::ios::end);
std::streamsize size = ifs.tellg();
ifs.seekg(0, std::ios::beg);
// Create a wide char vector and read raw data into it
wvec_t wdata(size);
ifs.read(reinterpret_cast<char *>(&wdata[0]), size);
// Convert wide to narrow
vec_t ndata(wdata.size());
size_t res = wcstombs(&ndata[0], &wdata[1], ndata.size());
// Resize out buffer to the new size
ndata.resize(res);
// Strip all " chars (not the most efficient way to do it but it's simple!)
vec_t::iterator itr = ndata.begin();
while(itr != ndata.end())
{
if(*itr == '"')
{
itr = ndata.erase(itr);
}
else
{
++itr;
}
}
// Persist new data to file.
std::ofstream ofs("dataout.txt", std::ios::binary);
ofs.write(&ndata[0], ndata.size());
}
ASKER
Hmm lol it compiles but seems to be a nice endless loop, i had it running for 5 mins where the previous versions were done in 5 secs or so on an 8MB file
>> but seems to be a nice endless loop
It does? That's odd. It worked fine for me :)
The only place it loops is when it iterates over the data to remove " and that will either ++ the iterator if the char isn't a " or erase it, which will then set the iterator to the new value returned by erase... this is a typical idiom for erasing things from a vector because the original iterator is invalidated.
BTW, below is my before and after test data... so you can try it yourself to see what happens.
before.txt
after.txt
ASKER
Yea i don't really see where it could hang but it nicely loads my machine...
load.png
Can you provide me a small sample of your data for me to test with? Just enough to cause this problem.
ASKER
Okay thats strange... Using your test input file it works great...
The start of the files is the same (#FF #FE), the only difference i see is that mine doesn't have items with " around them in the first line and is bigger... 8MB for this testfile, files can be up to 50MB
ASKER
Yea i tried that, i opened my file and removed all lines except of one where i changed the contents. it's also in UTF 16 format (#FF #FE) but this one works correctly... I'm starting to get confused, lol
^^^ It could just be very slow. Erasing things from a vector is massively slow because each time you do it must shuffle everything in memory down to ensure the items are contiguous. Rather than erasing from current you could try building a new vector, this might be quicker. I would suggest you use the reserve() method on the vector to preallocate memory otherwise you'll get lots of heap allocations, which will also be slow. By using reserve you preallocate memory upfront. Note, this is slightly different from resize, which also adds default items to the vector.
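The "build a new vector" idea can be sketched like this. The name `copy_without` is made up for the example; the point is that `reserve()` allocates once while leaving `size()` at zero, so each `push_back` is cheap and nothing is ever shuffled down.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return a copy of src without any element equal to unwanted.
std::vector<char> copy_without(const std::vector<char>& src, char unwanted)
{
    std::vector<char> dst;
    dst.reserve(src.size()); // one allocation up front; size() is still 0
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] != unwanted)
            dst.push_back(src[i]); // never reallocates: capacity already covers src.size()
    }
    return dst;
}
```

This makes a single pass over the data, whereas erasing elements one by one moves the whole tail of the vector for every quote. If an in-place version is preferred, the usual alternative is the erase-remove idiom, `v.erase(std::remove(v.begin(), v.end(), '"'), v.end());`, which is also a single pass.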
ASKER
Okay i think i know what goes wrong, can it be some kind of a buffer overflow or other memory issue?
When i make a smaller version of my file, it works fine.
When i make it big, it hangs...
I'll upload two sample files hereby:
doesnt.csv.txt
works.csv.txt
ASKER
Gonna try that. I noticed when it's busy, it's not writing to the file yet so it stores the whole file in memory? Maybe that's what makes it slow...
Yes, it is all being done in memory -- but obviously the OS will page as necessary. When you consider most PCs have a min of 512MB of RAM, 50MB is not a big file to process in memory :)
ASKER
Woooow, that worked and faaast :-D
Gonna finish my code to accept input/output files in command line and i'll post it in a few mins :-)
>> Woooow, that worked and faaast
Hurrah! :)
ASKER
So, as i promised before, i will share my definitive code.
It works great now and supports drag and drop.
main.cpp:
#include <cstdio>     // printf
#include <cstdlib>    // wcstombs
#include <fstream>
#include <iostream>   // cin
#include <string>
#include <vector>
#include <algorithm>
#include "console.h"
typedef std::vector<char> vec_t;
typedef std::vector<wchar_t> wvec_t;
using namespace std;
using std::string;
string version = "1.0.1 Build 1";
void setrgb(int color){
    switch (color){
        case 0: // White on Black
            SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
                FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
            break;
        //unused colors removed
        default : // White on Black
            SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE),FOREGROUND_INTENSITY |
                FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE);
            break;
    }
}
int main (int argc, char *argv[]){
    setrgb(0);
    printf ("csvConvert Multi-Byte version © 2008 by Jaap-Willem Dooge, version %s\n\n", version.c_str());
    if(argv[1]){
        //there are command line options
        string inputfile(argv[1]);
        string outputfile = inputfile;
        if(argv[2]){
            //output file given
            outputfile = argv[2];
        }else{
            //no output file given, save as <input>_output.csv
            for ( int i = 0; i < outputfile.length(); i++){
                if (outputfile[i] =='.'){
                    outputfile.replace(i,1,"_output.");
                    i = outputfile.length();
                }
            }
        }
        printf ("Input file: %s\n", inputfile.c_str());
        printf ("Output file: %s\n\n", outputfile.c_str());
        printf ("Converting...\n\n");
        // Open stream (as narrow)
        std::ifstream ifs(inputfile.c_str(), std::ios::binary);
        // Get size
        ifs.seekg(0, std::ios::end);
        std::streamsize size = ifs.tellg();
        ifs.seekg(0, std::ios::beg);
        // Create a wide char vector and read raw data into it
        // NB. Relies on a 16-bit wchar_t (as on Windows) and a little-endian UTF-16 file
        wvec_t wdata(size);
        ifs.read(reinterpret_cast<char *>(&wdata[0]), size);
        // Convert wide to narrow
        vec_t ndata(wdata.size());
        size_t res = wcstombs(&ndata[0], &wdata[1], ndata.size()); // NB. Ignores BOM at start of wdata
        // Resize out buffer to the new size
        ndata.resize(res);
        // Strip all " chars by copying to a new vector everything but ""
        vec_t cdata(ndata.size());
        vec_t::iterator itrEnd = std::remove_copy(ndata.begin(), ndata.end(), cdata.begin(), '"');
        cdata.erase(itrEnd, cdata.end());
        // Persist new data to file.
        std::ofstream ofs(outputfile.c_str(), std::ios::binary);
        ofs.write(&cdata[0], cdata.size());
    }else{
        //there are no command line options
        printf ("Strip double quotes from tab-divided text files.\n\nUSAGE: csvConvertMB <inputfile> <outputfile>\n\nPress any key to exit...");
        cin.get(); //wait for key
    }
}
console.h:
// console.h
//
#ifndef CONSOLE_H
#define CONSOLE_H
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <windows.h>
void clrscr();
void gotoxy(int, int);
void setrgb(int);
#endif
ASKER
You guys helped me really great :-) Thanks for that all!
>> and supports drag and drop
Show off ;)
Thanks for sharing your final code... it'll be very good for the PAQ database.
Good luck my friend.
-Rx.
ASKER
>>>> and supports drag and drop
>>Show off ;)
Bwhehehe :-P
>>Thanks for sharing your final code... it'll be very good for the PAQ database.
No problem, this is truly a community project and those have to be open source lol xD
>>Good luck my friend.
>>-Rx.
Thanks, same to you xD now i gonna make an employee happy who's otherwise changing the column data types by hand and doing find-and-replaces in Programmers Notepad
Wow, seems I missed all the action. Good work, you two !! ;)
>> Wow, seems I missed all the action
No doubt, knowing you, busy in some Belgian bar all afternoon ;)
Cheers I8.
I wish ... lol.