Solved

C++ iostream UTF-16 file i/o with CRLF translation

Posted on 2013-11-01
Last Modified: 2013-12-14
I want to read and write UTF-16 files that use CR LF line separators (L"\r\n") with C++ iostreams (Microsoft Visual Studio 2010). Every L"\n" written to the stream should be translated to L"\r\n" transparently. Using the codecvt_utf16 locale facet requires opening the fstream in ios::binary mode, which disables the usual text-mode \n to \r\n translation.

#include <codecvt>
#include <fstream>
#include <locale>

std::wofstream wofs;
wofs.open("try_utf16.txt", std::ios::binary);
wofs.imbue(
    std::locale(
        wofs.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::generate_header>));
wofs << L"Hi!\n"; // I want a '\r' to be inserted before the '\n' in the output file
wofs.close();

I want a solution without needing extra libraries like BOOST.
Question by:j66st
17 Comments
 
LVL 32

Expert Comment

by:sarabande
ID: 39621141
is there a reason why you didn't add a '\r' before the '\n'?

00000000 FE FF 00 48 00 69 00 21 00 0D 00 0A             ...H.i.!....

the output looks fine when I do so.

Sara
0
 

Author Comment

by:j66st
ID: 39622612
Of course, but I have a big program with a lot of string resources and code that writes to a generic ostream with only '\n' separators. So I want the '\n' to '\r\n' translation to happen behind the ostream interface.
0
 

Author Comment

by:j66st
ID: 39622639
Hint: I would think it should be possible with a customized basic_filebuf that overrides the overflow or underflow method, or else with a custom codecvt class.

I am not familiar with this stuff, so I hope to find an expert who did this before, or who can point me to sample code.
0
 

Author Comment

by:j66st
ID: 39622789
OK, increasing points, if that may help. I thought it would be an issue that many have come across.
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39624492
why not simply use a static member of some suitable class of type const std::wstring  which holds the crlf pair?

// global.h
...
class Global
{
public:
    static const std::wstring  lf;
    ....

// global.cpp
...
const std::wstring  Global::lf(L"\r\n");

you could use Global::lf wherever a line break is needed, like

wofs << L"Hi!" << Global::lf; 

overriding parts of the filebuf or codecvt class is difficult, and I don't have a really good idea how you could replace a wide linefeed character by a wide carriage-return/linefeed string without negative side effects. keep in mind that those are internal classes which have no information about the stream objects that use them.

it also makes little sense to derive from wofstream and add new template functions for operator<< that handle wide text from wchar_t * or wstring, because the operator<< functions take a wostream& as their stream argument and not a wofstream& (or your derived class). because of that you would need to provide overloads of operator<< for all possible operands, not only for the wide text types. otherwise code like

mywofstream mwofs;
mwofs << 123<< L"abc\n" << "a " ;

would not use  your template function

wostream& operator<<(mywofstream& os, const wchar_t * & wsz) 

because the above statement resolves to

operator<<(operator<<(mwofs.operator<<(123), L"abc\n"), "a ");

and the most inner call returns a wostream& and not a mywofstream&.
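
To make the overload-resolution point concrete, here is a minimal sketch. The names my_wostream, g_custom_calls and count_custom_calls are hypothetical and exist only for this illustration, and std::wostringstream stands in for a file stream so that the example needs no file.

```cpp
#include <ostream>
#include <sstream>

// Hypothetical derived stream type, standing in for "mywofstream".
struct my_wostream : std::wostringstream {};

int g_custom_calls = 0;  // counts how often the custom overload is chosen

// Custom overload that only matches when the static type is my_wostream&.
std::wostream& operator<<(my_wostream& os, const wchar_t* s) {
    ++g_custom_calls;
    // Forward to the standard overload through the base-class reference.
    return static_cast<std::wostream&>(os) << s;
}

int count_custom_calls() {
    g_custom_calls = 0;
    my_wostream os;
    os << L"first\n";          // static type my_wostream&: custom overload
    os << 123 << L"second\n";  // os << 123 yields std::wostream&: standard overload
    return g_custom_calls;
}
```

Only the first statement reaches the custom overload, because `os << 123` already returns a plain std::wostream&.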

however, you could provide an overload of

std::wostream& operator<<(std::wostream& os, const std::wstring & ws)
{
      std::wstring s;
      size_t pos1, pos2 = 0;
      while ((pos1 = ws.find(L'\n', pos2)) != std::wstring::npos)
      {
            s += ws.substr(pos2, pos1-pos2);
            s += L"\r\n";
            pos2 = pos1+1;
      }
      s += ws.substr(pos2);

      // write() expects a count of characters, not bytes
      os.write(s.c_str(), s.length());
      return os;   
}

which should do the job for wstring (and similarly for const wchar_t*). however, you should consider that the overload would be used for every call of operator<< whose operands fit, not only for your file operations. if that is an issue, you would need to derive from wofstream after all and use that mywofstream class instead of wofstream. the operator<< overloads could then check by means of a dynamic_cast whether the given stream "is a" mywofstream and do the replacements only in that case.
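
A minimal sketch of the dynamic_cast check described above, assuming a hypothetical marker class crlf_wostream (derived from std::wostringstream here so that it runs without a file) and hypothetical helpers expand_lf and crlf_cast_demo. Note that such an overload takes precedence over the library's operator<< for std::wstring wherever it is visible, and that write() ignores any field width set on the stream.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical marker type: only streams of this type get LF -> CR LF expansion.
struct crlf_wostream : std::wostringstream {};

// Hypothetical helper: returns a copy of ws with every L'\n' replaced by L"\r\n".
std::wstring expand_lf(const std::wstring& ws) {
    std::wstring out;
    out.reserve(ws.size());
    for (std::wstring::size_type i = 0; i < ws.size(); ++i) {
        if (ws[i] == L'\n') out += L'\r';
        out += ws[i];
    }
    return out;
}

// Non-template overload; it is preferred over the library template for std::wstring.
std::wostream& operator<<(std::wostream& os, const std::wstring& ws) {
    if (dynamic_cast<crlf_wostream*>(&os) != 0) {
        const std::wstring s = expand_lf(ws);
        os.write(s.data(), static_cast<std::streamsize>(s.size()));
    } else {
        os.write(ws.data(), static_cast<std::streamsize>(ws.size()));
    }
    return os;
}

std::wstring crlf_cast_demo() {
    crlf_wostream tagged;       // marked stream: text is expanded
    std::wostringstream plain;  // ordinary stream: text is written unchanged
    tagged << 1 << std::wstring(L"a\nb");
    plain << std::wstring(L"a\nb");
    return tagged.str() + L"|" + plain.str();
}
```

The check works even after `tagged << 1` has returned a std::wostream&, because dynamic_cast looks at the dynamic type of the stream object. Output through put and write still bypasses it.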

Sara
0
 

Author Comment

by:j66st
ID: 39628596
Defining a simple constant for the "\r\n" string is not a solution, as I said before. The \n characters are embedded in existing string resources and in functions generating dynamic output that I cannot change. Overloading the << operator is not sufficient either; it would not cover output via the put and write methods.

Customizing codecvt looks complicated.

To me, deriving from basic_filebuf seems the way to go. I don't consider this an "internal class"; it is properly documented and designed to be derived from. What would be the negative side effects you are talking about? Perhaps the unget/putback methods need some extra attention?

This will also need a custom "mywofstream" because there is no basic_ofstream constructor accepting an alternate streambuf. So the mywofstream constructor will then call the basic_ostream constructor with "mywfilebuf".
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39629518
This will also need a custom "mywofstream" because there is no basic_ofstream constructor accepting an alternate streambuf.

you found the answer yourself why deriving from basic_filebuf is not the way to go. with "internal class" I didn't mean that it is not accessible but that the creation and usage of the class is outside of your influence. moreover, basic_filebuf has some peculiarities, for example that it always uses type char for template argument regardless of the type the stream class is associated to. see comment in the remarks section of msdn to basic_filebuf:

Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.

basic_filebuf has some constructors and operator members, none of which are virtual, and they can rarely all be replaced by non-virtual functions in your derived class. the only virtual function suitable for your purpose is the setbuf member function, but it is not likely that setbuf is the only function that assigns a text buffer to a stream.

in my opinion, you should try to overload operator<<(std::ostream &, std::wstring) and operator<<(std::ostream&, const wchar_t*) as described before, which should work with little effort.

Sara
0
 

Author Comment

by:j66st
ID: 39632990
in my opinion, you should try to overload operator<<(std::ostream &, std::wstring) and operator<<(std::ostream&, const wchar_t*) as described
I don't think I was clear enough. I want a myofstream class that can be passed to any existing function like
void Greet(std::wostream& wos) 
{ wos << L"Hi!\n"; wos.put(L'\n'); wos.write(L"Bye!\n", 5); }

The function does not know about '\r' conventions nor about any myofstream, so it won't call any custom global operator<<(myofstream&, ...) function. This can only be solved within the myofstream class.

it always uses type char for template argument regardless of the type the stream class is associated to. see comment in the remarks section of msdn to basic_filebuf
An example on that same msdn page shows that this can be solved by a pubsetbuf method call. This could be done in the myofstream constructor.

Thanks for your insights so far, maybe you are right, but I'm not convinced yet that it is overly complicated to make a std::ostream-derived myofstream with a myfilebuf. I'm too busy right now with more urgent matters to give it a try.
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39633077
my suggested solution was to provide overloads of operator<< (wostream &, wstring) and  operator<< (wostream &, wchar_t *). it is an overload of the stl provided operators. it would work for all stream operations done by using <<. the only precondition is that the overloaded functions are available for the compiler.

it doesn't work on write function, though.

the myofstream is optional, in case you want to be able to do the operation only for file operations where a myofstream was involved.

Sara
0
 

Author Comment

by:j66st
ID: 39670387
I think it boils down to using a custom stream buffer: attach a custom wide character buffer by calling pubsetbuf, and override the virtual overflow method so that it inserts a '\r' before every '\n' that comes in. Like this:
	virtual int_type overflow(int_type ch = traits_type::eof())
	{
		if (ch == '\n') {
			int_type iRet = BASE::overflow('\r');
			if (iRet != traits_type::not_eof('\r')) return iRet;
		}
		return BASE::overflow(ch);
	}

The wcrlfofstream derives from std::wostream and its constructor initializes the base class with a new wcrlf_filebuf object.
The stream can then be used by any functions that can write to a generic std::wostream.
 
I did a small test and it seems to work fine. Am I overlooking any issues?
0
 

Accepted Solution

by:
j66st earned 0 total points
ID: 39679318
Here is the complete code:
#include <cstdlib>   // _countof (Microsoft-specific macro)
#include <cstring>   // memset
#include <iostream>
#include <fstream>

// File buffer that writes wchar_t characters and expands '\n' to "\r\n".
class wcrlf_filebuf : public std::basic_filebuf<wchar_t>
{
	typedef std::basic_filebuf<wchar_t> BASE;
	wchar_t awch[128];
	bool bBomWritten;
public:
	wcrlf_filebuf() 
		: bBomWritten(false)
	{ memset(awch, 0, sizeof awch); }

	wcrlf_filebuf(const wchar_t *wszFilespec, std::ios_base::openmode mode = std::ios_base::out) 
		: bBomWritten(false)
	{
		memset(awch, 0, sizeof awch);
		open(wszFilespec, mode);
	}

	// Returns 0 if the file cannot be opened, like basic_filebuf::open.
	wcrlf_filebuf *open(const wchar_t *wszFilespec, std::ios_base::openmode mode = std::ios_base::out)
	{	
		if (!BASE::open(wszFilespec, mode | std::ios_base::binary))
			return 0;
		// Attach the wide character buffer (see the pubsetbuf example on the MSDN basic_filebuf page).
		pubsetbuf(awch, _countof(awch));
		return this;
	}

	// Every character leaving the buffer passes through overflow.
	virtual int_type overflow(int_type ch = traits_type::eof())
	{
		if (!bBomWritten) {
			// Emit the byte order mark once, before the first character.
			bBomWritten = true;
			int_type iRet = BASE::overflow(0xfeff);
			if (iRet != traits_type::not_eof(0xfeff)) return iRet;
		}
		if (ch == L'\n') {
			// Expand LF to CR LF.
			int_type iRet = BASE::overflow(L'\r');
			if (iRet != traits_type::not_eof(L'\r')) return iRet;
		}
		return BASE::overflow(ch);
	}
};

class wcrlfofstream : public std::wostream
{
	typedef std::wostream BASE;
public:
	wcrlfofstream(const wchar_t *wszFilespec, std::ios_base::openmode mode = std::ios_base::out)
		: std::wostream(new wcrlf_filebuf(wszFilespec, mode))
	{
		// Report a failed open through the stream state.
		if (!rdbuf()->is_open()) setstate(std::ios_base::failbit);
	}

	// The stream owns the buffer allocated in the constructor.
	~wcrlfofstream()
	{
		delete rdbuf();
	}

	wcrlf_filebuf* rdbuf()
	{
		return dynamic_cast<wcrlf_filebuf*>(std::wostream::rdbuf());
	}

	void close()
	{
		if (!rdbuf()->close()) setstate(std::ios_base::failbit);
	}
};

Comments are welcome.
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39680422
you should try whether statements like

wcrlfofstream wstrm(L"xxxx.txt", std::ios::binary | std::ios::out);
wstrm << L"xyz\n" << 12345 << L"abc\n";

do work.

in the <ios> and <ostream> header files there are no virtual functions besides the destructors. i would have assumed that the rdbuf and close functions you provided would not be called. but maybe i haven't studied your approach deeply enough.

Sara
0
 

Author Comment

by:j66st
ID: 39690070
Yes, I tried your example, it works as expected.
std::basic_filebuf is defined in file <fstream>, std::basic_streambuf in <streambuf> and they contain many virtual methods.
My solution works because the overflow method is declared virtual in the basic_streambuf template class. Every character which is leaving the buffer is passed through this method.

Alternatively, we could filter characters entering the buffer by overriding the virtual xsputn method. But that solution would be incomplete because the single character basic_ostream::put method is not virtual and will bypass xsputn.
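
The accepted code depends on overflow being invoked for each character, which only happens while the stream buffer has no free space in its put area; whether a file buffer behaves that way is an implementation detail. As a sketch that does not rely on it, the same filtering can be done in a small unbuffered stream buffer that forwards to any other wide stream buffer. The class name crlf_wfilterbuf and the function crlf_filter_demo are hypothetical, and a std::wstringbuf stands in for the file.

```cpp
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter: forwards every character to dest, inserting L'\r' before L'\n'.
class crlf_wfilterbuf : public std::wstreambuf {
public:
    explicit crlf_wfilterbuf(std::wstreambuf* dest) : dest_(dest) {}

protected:
    // No put area is ever set, so sputc() and the default xsputn()
    // both end up here for every single character.
    int_type overflow(int_type ch = traits_type::eof()) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return dest_->pubsync() == 0 ? traits_type::not_eof(ch)
                                         : traits_type::eof();
        if (traits_type::eq_int_type(ch, traits_type::to_int_type(L'\n'))) {
            if (traits_type::eq_int_type(dest_->sputc(L'\r'), traits_type::eof()))
                return traits_type::eof();
        }
        return dest_->sputc(traits_type::to_char_type(ch));
    }

    // Flushing the filter flushes the destination.
    int sync() override { return dest_->pubsync(); }

private:
    std::wstreambuf* dest_;  // not owned
};

// Exercises operator<<, put and write, like the Greet function earlier in the thread.
std::wstring crlf_filter_demo() {
    std::wstringbuf sink;
    crlf_wfilterbuf filter(&sink);
    std::wostream out(&filter);
    out << L"Hi!\n";
    out.put(L'\n');
    out.write(L"Bye!\n", 5);
    out.flush();
    return sink.str();
}
```

In a real program the destination could presumably be the rdbuf() of the binary-mode std::wofstream imbued with codecvt_utf16 from the question, so that the facet still produces the UTF-16 bytes and the header. The price of the unbuffered filter is one virtual call per character.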
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39706003
there are a lot of comments on this question which are worth keeping in the knowledgebase. you should close the question properly instead and accept your own comment as the solution (0 points).

Sara
0
 

Author Comment

by:j66st
ID: 39706014
That's exactly what I did: I clicked on my own comment #a39679318 to accept it as a solution, and added a comment to explain. Status is now "Close request pending", I understand that others get a few days to object. But I am not familiar with how these official procedures work on EE so please tell me if I did wrong.
0
 
LVL 32

Expert Comment

by:sarabande
ID: 39706139
no, you did right. sorry, I misread the closing comment as a request for deletion.

Moderator, please delete my objection.

Sara
0
 

Author Closing Comment

by:j66st
ID: 39718497
I think I clearly explained my goal: providing LF to CR LF expansion in wide character mode behind the basic_ostream interface. sarabande first seemed not to understand my goal, then thought it was not possible, and gave some advice for a workaround. (Sara, thanks for your time, though).
I finally decided to dig into the library source code, and I think I have now found a properly working solution myself.
0
