Molko

XSLT - Plain Text To XML

Is it possible to take a "structured" non xml based plain text file and transform it into XML via XSLT ?
Randy Downs

No it only works with XML.

http://www.xml.com/pub/a/2003/11/26/learnXSLT.html

Extensible Stylesheet Language Transformations or XSLT is a language that allows you to transform XML documents into XML, HTML, XHTML, or plain text documents.
I would argue against this, since there are much better techniques,
using e.g. Python or Ruby regular expressions,
or the various parser builders that exist for various structured formats.

But if the structured text is utf-8 encoded, you could wrap a root tag around it

If you were using XSLT2, you could then use the regex functionality to construct XML

If you are using XSLT2 anyway, there are techniques to read non-XML text formats and use regexes on them. I am still in favour of keeping the heavy lifting out of the XSLT.

Note that tools such as HTML Tidy or TagSoup can be used to transform lousy HTML, or files that look like XML from a distance, into real XML/XHTML. In a next step you can clean up using XSLT if you wish.

I could give some more directions, if you gave us the feeling of what exactly the structured text looked like.

Anyway, if you were just looking for an answer "Is it possible?"
Yes it is,
I just finished an XSLT1 stylesheet that turns an EDI message into properly structured XML... it can be done, but there is more fun in life :-)
@Number-1
No it only works with XML.

given that you reference a 9 year old article on a 12 year old language... there has been some evolution.

Your quote holds true only if you consider the text file unchanged as the input file to an XSLT1 process, not taking into account the extensions some XSLT1 processors had.

You imply a LOT of limitations in your reply, and none of them were implied by the question asked

- unchanged: as I said, you can wrap a root tag around it (simple piping in a command line) and then you have XML (preferably add CDATA sections). Or you could have a preprocess step as suggested before
- input file: you could have a dummy input file (or none at all, since from XSLT2 you can call a named template as the starting point) and pull in the text file as a string param argument (XSLT2 and 1), or read it through the unparsed-text() function (XSLT2 only)
- XSLT1: XSLT2 is stable enough, and for a task like this I don't recommend recursive substring processing when you know you have regular expression functionality in XSLT2
- extensions: some XSLT1 processors have extensions that pull some XSLT2 functionality into XSLT1 already (it is worth looking at www.exslt.org)
Molko

ASKER

I want to take this :

 Volume in drive C has no label.
 Volume Serial Number is 9C8E-C68B

 Directory of C:\Java

28/02/2012  10:30    <DIR>          .
28/02/2012  10:30    <DIR>          ..
28/10/2011  20:57    <DIR>          jre6
23/10/2011  16:50    <DIR>          lib
06/01/2012  15:03    <DIR>          workspace
23/10/2011  16:50                        helloworld.java
               0 File(s)              0 bytes
               7 Dir(s)  696,677,314,560 bytes free

into something like
<disk>
	<dir>
		<name>C:/</name>
		<dir>
			<name>java</name>
			<directory>
				<name>jre6</name>
			</directory>
			<directory>
				<name>lib</name>
			</directory>
			<directory>
				<name>workspace</name>
			</directory>
			<file>
				<name>helloworld.java</name>
			</file>
		</dir>
	</dir>
</disk>

Well, you could wrap a <root> tag around this and add CDATA
<root><![CDATA[....
...]]></root>
and then regex your way through it (XSLT2)
or even substring through it with some recursion (XSLT1)

But is there a reason why you would want to do this?
Because the infrastructure is in place?
Or you don't have other tools available

I would throw some lines of ruby with XML builder and this is done, easy and concise
Molko

ASKER

Thanks.

I might have a look at the <root><![CDATA[.......]]></root> then apply some fancy regex - i'll see how i get on.

Yes, the reason I am considering XSLT2 for this is that the infrastructure is already in place, and if I could do it in XSLT2 it would save a lot of Java coding... well, that's the theory :-)

Not really sure i could use Ruby in a JEE stack...hmm. Failing all the above i'll have to resort to parsing the file in Java
Molko

ASKER

out of interest...how would the CDATA help ? I guess the regex would match on 'Directory of' and then each '<DIR>' ...actually it might be better for the regex to match on the endofline as i would probably need the datetimes as well.

Could you show me a quick example ?....
If this is XSLT2, have a look at unparsed-text()
and use that in a named template that you trigger using -it (initial template)
... it is java, so I assume saxon
ASKER CERTIFIED SOLUTION
Gertone (Geert Bormans)
This solution is only available to members.
The match="/" template is only there for the case where you have a dummy source XML and -it is not set
Molko

ASKER

wow...thanks i'll take a good look at that.

yes, XSLT2 and Saxon.


Thankyou
Molko

ASKER

Hi

Thanks again for this, I am going through it now....

The thing that I am concerned with is that usually i would issue java code like this :

 
       
            Transformer transformer = tFactory.newTransformer(xslt);
            transformer.transform(xml, output);
 

How would that work with the XSLT you have provided? The one you have shown reads a file from a disk location. Just wondering how I could invoke it in the above style?

Thanks again
well, I am not much of a java man.

you could make "xml" a dummy xml document with one empty root tag "<root/>"
the XSLT I did works in two modes
- either pass the dummy xml with the empty root tag as shown
- or don't pass a source XML, but use the -it parameter to indicate that the transform needs to start with the template name="start"

The second option is cleaner, but I have reverted to the earlier option, with the dummy xml, before, in order to allow java developers in my team to reuse the java code as they had it available... it feels more classic

In your question you were referring to a file
non xml based plain text file
So, pass the parameter $input-file-uri the uri of this file
this can either be a url, or simply (if it is on disk) the path to the file,
having the protocol file:/ in front of it (as is in my example)

I would not know how to pass the URI if it were not a file on disk,
but then you could pass in a string as a parameter (you would not even need unparsed-text)
or save it on disk temporarily
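
For the string-as-parameter route, a minimal Java sketch might look like the following. It assumes a stand-in XSLT 1.0 stylesheet, so that the JDK's built-in processor can run it without Saxon on the classpath; the class name StringParamDemo, the listing element and the parameter name input-str are only illustrative and are not taken from the stylesheet discussed in this thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StringParamDemo {

    // Hypothetical stand-in stylesheet: it only copies the parameter into <listing>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:param name='input-str'/>"
      + "<xsl:template match='/'>"
      + "<listing><xsl:value-of select='$input-str'/></listing>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet against a dummy <root/> source and hands the text over as a parameter.
    static String transform(String text) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setParameter("input-str", text);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader("<root/>")),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("28/10/2011  20:57    <DIR>          jre6"));
    }
}
```

Because the text travels as a parameter, it never passes through the XML parser, so no CDATA wrapping is needed; the serializer escapes the "<" of "<DIR>" on output.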
Molko

ASKER

Sorry I perhaps unintentionally misled you somewhat.

The data is stored in a file, but I have already read this into my Java app; I would then apply the XSLT transformation to the contents of the file.
    /** 
     * Simple transformation method. 
     * @param sourcePath - Absolute path to source xml file. 
     * @param xsltPath - Absolute path to xslt file. 
     * @param resultDir - Directory where you want to put resulting files. 
     */  
    public static void simpleTransform(String sourcePath, String xsltPath,  
                                       String resultDir) {  
        TransformerFactory tFactory = TransformerFactory.newInstance();  
        try {  
            Transformer transformer =  
                tFactory.newTransformer(new StreamSource(new File(xsltPath)));  
  
            transformer.transform(new StreamSource(new File(sourcePath)),  
                                  new StreamResult(new File(resultDir)));  
        } catch (Exception e) {  
            e.printStackTrace();  
        }  
    }  

If the information is in a file... leave it there and let the XSLT processor deal with it

If you insist on having it as the XML source, you need to make it XML... and you will hit encoding issues, no doubt.

Loading the file uri as unparsed-text()
- takes away that risk
- saves you a bunch of java code

I know which strategy I would prefer
Molko

ASKER

Thankyou

It is in a 'File' now, because I am working with it locally to get the XSLT to work etc.

In essence it's only in a File now, as I am working with it.

Once i have finished the XSLT, the 'real' data will reach my Java component as a String, which i guess i will need to wrap with something like

<?xml version="1.0" encoding="UTF-8"?>
<root> .....</root>

and CDATA around the 'DIR' etc.

and then instantiate an XML DOM and invoke the SAXON parser to transform the XML DOM with the XSLT.

That's the plan.........:-)

sorry for the confusion re. 'File'.

Thanks
Anyway, you will hit errors because of encoding issues
Just wrap a CDATA around the whole string

<?xml version="1.0" encoding="iso-8859-1"?>
<root><![CDATA[...]]></root>


drop the template match="/"

make this

  <xsl:template name="start">
        <xsl:variable name="input-str" select="unparsed-text($input-file-uri, 'iso..."/>

into

  <xsl:template match="root">
        <xsl:variable name="input-str" select="."/>

and it will work the same

I just hope that parsing the string (that is exactly what will happen: your pseudo XML will hit the XML parser prior to getting to the XSLT) will not kill your new lines

In theory CR and LF are normalized to a single '\n' in XML before parsing... but it will depend on your application to be sure
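
The wrap itself is short in Java, with one catch worth guarding against: a CDATA section ends at the first "]]>", so that sequence has to be split across two sections if it can ever occur in the text. A sketch, assuming the text already is a Java String (so its encoding was settled when the bytes were decoded); CdataWrapDemo, wrap and roundTrip are made-up names, and roundTrip only exists to show what the XML parser hands back:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataWrapDemo {

    // Wraps arbitrary text as <root><![CDATA[...]]></root>.
    static String wrap(String text) {
        // "]]>" may not appear inside a CDATA section, so split it over two sections.
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<root><![CDATA[" + safe + "]]></root>";
    }

    // Parses the wrapped text and returns the string value of <root>,
    // i.e. what select="." would see in a template matching root.
    static String roundTrip(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrap(text))));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String listing = " Directory of D:\\temp\r\n25/02/2012  13:22    <DIR>          BACKUP\r\n";
        // The parser turns each CR LF pair into a single LF; "<DIR>" survives untouched.
        System.out.println(roundTrip(listing));
    }
}
```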
Molko

ASKER

I thought everything would be encoded to UTF-8? No?

The XSLT adjusted to :

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
	<xsl:output indent="yes"/>

	<xsl:template match="root">
		<xsl:variable name="input-str" select="."/>
		<disk>
			<dir>......
......
and XML-ised input like
......

<?xml version="1.0" encoding="UTF-8"?>
<root> 
<![CDATA[Volume in drive D has no label.
 Volume Serial Number is 145C-E872

 Directory of D:\

25/02/2012  13:22             BACKKUP
18/04/2012  11:43                 0 dir.txt
12/09/2009  07:23              Documents and Settings

Running your stuff (as adjusted above) through XML SPY (using built in XSLT transformer) seems to be getting results.
It is very unlikely that a dir command line returns UTF-8;
you need to set the encoding of the generated XML to what you expect from the text file
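
On the Java side this comes down to decoding the raw bytes with the charset the listing was really written in, before it ever becomes a String. A small sketch; the sample bytes and the choice of ISO-8859-1 are assumptions for illustration only, since the real code page depends on the console that produced the dir output:

```java
import java.nio.charset.Charset;

public class DecodeDemo {

    // Decodes raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1.
        byte[] raw = { 0x63, 0x61, 0x66, (byte) 0xE9 };

        String right = decode(raw, "ISO-8859-1");
        String wrong = decode(raw, "UTF-8");   // 0xE9 alone is not valid UTF-8

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false
    }
}
```

Once the text is a correctly decoded String, the encoding pseudo-attribute in a wrapper built from that String matters much less, because the parser is then reading characters rather than bytes.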
Molko

ASKER

fair point. I'll look into that.

Thanks very much for your help, more than I expected....

I have a bit of more work to do on this, as i need to expand the 'pwd'

 Directory of D:\temp\etc  <------------------this bit

25/02/2012  13:22    <DIR>          BACKUP
25/02/2012  13:22    <DIR>          BACKUP1
25/02/2012  13:22                        example.txt


into

<?xml version="1.0" encoding="UTF-8"?>
<disk>
      <dir>
            <name>d:</name>
            <dir>
                  <name>temp</name>
                  <dir>
                        <name>etc</name>
                        <directory>
                              <name>BACKUP</name>
                        </directory>
                        <directory>
                              <name>BACKUP1</name>
                        </directory>
                        <file>
                              <name>example.txt</name>
                        </file>
                  </dir>
            </dir>
      </dir>
</disk>


I'll have a stab at that....

So I will close this question now..

Thanks again for giving me a headstart...
welcome,

splitting that last bit out is not a big task

I would use the replace() function to get the line out from "Directory of" up to "\n",
then call tokenize() on that result, splitting on the ":\"
the first part is the "d", the second part is the rest
It is pretty straightforward. If you have issues with that, I can help you with it

have fun
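
If the path handling ends up on the Java side instead, the same steps (grab the "Directory of" line, split off the drive, split the rest on backslashes) could be sketched with java.util.regex. This is not the XSLT approach described above, only the same idea in Java; PathSplitDemo and pathSegments are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathSplitDemo {

    // Group 1: drive (e.g. "D:"); group 2: rest of the line after an optional backslash.
    private static final Pattern DIR_LINE =
        Pattern.compile("Directory of\\s+(\\w:)\\\\?([^\\r\\n]*)");

    // Returns the drive followed by each folder name, or an empty list if no line matches.
    static List<String> pathSegments(String listing) {
        List<String> segments = new ArrayList<>();
        Matcher m = DIR_LINE.matcher(listing);
        if (m.find()) {
            segments.add(m.group(1));
            for (String part : m.group(2).split("\\\\")) {
                if (!part.isEmpty()) {      // the root directory leaves nothing after the drive
                    segments.add(part);
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String listing = " Directory of D:\\temp\\etc\n\n25/02/2012  13:22    <DIR>          BACKUP\n";
        System.out.println(pathSegments(listing)); // [D:, temp, etc]
    }
}
```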
Molko

ASKER

Thankyou

Excellent.
Molko

ASKER

Hi

Do you want me to open a new question (happy to do so).

This is what I have produced so far, what do you think ?  :

I have a slight issue with the structure of the output that I can't seem to figure out (yet...)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
	<xsl:output indent="yes"/>
	<xsl:template match="root">
		<xsl:variable name="input-str" select="."/>
		<disk>
			<dir>
				<xsl:analyze-string select="$input-str" regex="\n">
					<xsl:matching-substring/>
					<xsl:non-matching-substring>
						<xsl:analyze-string select="." regex="(Directory of\s)(\w:.+)">
							<xsl:matching-substring>
								<xsl:call-template name="process-path">
									<xsl:with-param name="path" select="regex-group(2)"/>
								</xsl:call-template>
							</xsl:matching-substring>
							<xsl:non-matching-substring>
								<xsl:analyze-string select="." regex="(\d+/\d+/\d+\s+\d+:\d+\s+)((&lt;DIR&gt;)?)?\s+(.+)">
									<xsl:matching-substring>
										<xsl:variable name="elem-name" select="if(normalize-space(regex-group(2))) then('directory') else('file')"/>
										<xsl:element name="{$elem-name}">
											<name>
												<xsl:choose>
													<xsl:when test="$elem-name= 'file'">
														<xsl:value-of select="substring-after(normalize-space(regex-group(4)),' ')"/>
													</xsl:when>
													<xsl:otherwise>
														<xsl:value-of select="normalize-space(regex-group(4))"/>
													</xsl:otherwise>
												</xsl:choose>
											</name>
											<date>
												<xsl:value-of select="normalize-space(regex-group(1))"/>
											</date>
											<size>
												<xsl:choose>
													<xsl:when test="$elem-name= 'file'">
														<xsl:value-of select="substring-before(normalize-space(regex-group(4)),' ')"/>
													</xsl:when>
													<xsl:otherwise>
														<xsl:value-of select="'0'"/>
													</xsl:otherwise>
												</xsl:choose>
											</size>
										</xsl:element>
									</xsl:matching-substring>
									<xsl:non-matching-substring/>
								</xsl:analyze-string>
							</xsl:non-matching-substring>
						</xsl:analyze-string>
					</xsl:non-matching-substring>
				</xsl:analyze-string>
			</dir>
		</disk>
	</xsl:template>
	<xsl:template name="process-path">
		<xsl:param name="path"/>
		<xsl:choose>
			<xsl:when test="contains($path, '\')">
				<dir>
					<xsl:choose>
						<xsl:when test="contains(substring-before($path, '\'), ':')">
							<name>
								<xsl:value-of select="substring-before($path, '\')"/>
							</name>
							<xsl:call-template name="process-path">
								<xsl:with-param name="path" select="substring-after($path, '\')"/>
							</xsl:call-template>
						</xsl:when>
						<xsl:otherwise>
							<name>
								<xsl:value-of select="substring-before($path, '\')"/>
							</name>
							<xsl:call-template name="process-path">
								<xsl:with-param name="path" select="substring-after($path, '\')"/>
							</xsl:call-template>
						</xsl:otherwise>
					</xsl:choose>
				</dir>
			</xsl:when>
			<xsl:when test="string-length($path) > 0">
				<dir>
					<name>
						<xsl:value-of select="$path"/>
					</name>
				</dir>
			</xsl:when>
		</xsl:choose>
	</xsl:template>
</xsl:stylesheet>

<?xml version="1.0" encoding="UTF-8"?>
<root><![CDATA[Volume in drive D has no label.
 Volume Serial Number is 145C-E872

 Directory of D:\temp\etc

25/02/2012  13:22    <DIR>          BACKUP
18/04/2012  11:43                 5000 TextFile.txt
12/09/2009  07:23    <DIR>          Documents and Settings
25/02/2012  16:51    <DIR>          Program Files
02/05/2009  17:24    <DIR>          wamp
25/02/2012  16:51    <DIR>          WINDOWS
10/03/2010  20:19    <DIR>          workspace
               1 File(s)              0 bytes
               6 Dir(s)  46,973,202,432 bytes free
]]></root>

You should always use
$path
instead of
string-length($path) > 0
an empty string in a boolean expression evaluates to false
personally I always do normalize-space($path) as a test
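
A quick way to see these rules is to evaluate the tests directly. The sketch below uses the JDK's javax.xml.xpath, which implements XPath 1.0 rather than 2.0; for a single string value the string-to-boolean rule is the same in both. BooleanTestDemo and test are made-up names, and the dummy <root/> document is only there to give the evaluator a context:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class BooleanTestDemo {

    // Evaluates an XPath expression the way test="..." would: the result is converted to a boolean.
    static boolean test(String expr) {
        try {
            InputSource dummy = new InputSource(new StringReader("<root/>"));
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(expr, dummy, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(test("''"));                     // false: empty string
        System.out.println(test("string-length('') > 0"));  // false: same answer, more typing
        System.out.println(test("'   '"));                  // true: whitespace only, but not empty
        System.out.println(test("normalize-space('   ')")); // false: whitespace collapsed first
    }
}
```

The last two lines show why normalize-space() is the more careful test: a path consisting only of spaces counts as true unless it is normalized first.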

I did some smarter regex, to get rid of the chooses (I hate chooses when not necessary),
they clutter the code

Here is what I would make out of this
maybe you like it

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     version="2.0">
    <xsl:output indent="yes"/>
    <xsl:template match="root">
        <xsl:variable name="input-str" select="."/>
        <disk>
            <dir>
                <xsl:analyze-string select="$input-str" regex="\n">
                    <xsl:matching-substring/>
                    <xsl:non-matching-substring>
                        <xsl:analyze-string select="." regex="(Directory of\s)(\w:)\\?(.*)">
                            <xsl:matching-substring>
                                <dir>
                                    <name><xsl:value-of select="regex-group(2)"/></name>
                                    <xsl:call-template name="process-path">
                                        <xsl:with-param name="path" select="tokenize(regex-group(3), '\\')"/>
                                    </xsl:call-template>
                                </dir>
                            </xsl:matching-substring>
                            <xsl:non-matching-substring>
                                <xsl:analyze-string select="." regex="(\d+/\d+/\d+\s+\d+:\d+)\s+(&lt;DIR&gt;|\d+)\s+(.+)">
                                    <xsl:matching-substring>
                                        <xsl:variable name="elem-name" select="if(matches(regex-group(2), '\d+')) then('directory') else('file')"/>
                                        <xsl:element name="{$elem-name}">
                                            <name>
                                                <xsl:value-of select="normalize-space(regex-group(3))"/>
                                            </name>
                                            <date>
                                                <xsl:value-of select="normalize-space(regex-group(1))"/>
                                            </date>
                                            <size>
                                                <xsl:value-of select="number(translate(regex-group(2), '&lt;&gt;DIRdir', '0'))"/>
                                            </size>
                                        </xsl:element>
                                    </xsl:matching-substring>
                                    <xsl:non-matching-substring/>
                                </xsl:analyze-string>
                            </xsl:non-matching-substring>
                        </xsl:analyze-string>
                    </xsl:non-matching-substring>
                </xsl:analyze-string>
            </dir>
        </disk>
    </xsl:template>
    <xsl:template name="process-path">
        <xsl:param name="path"/>
        <xsl:if test="count($path) > 0">
            <dir>
                <name>
                    <xsl:value-of select="$path[1]"/>
                </name>
                <xsl:call-template name="process-path">
                    <xsl:with-param name="path" select="$path[position() > 1]"/>
                </xsl:call-template>
            </dir>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Molko

ASKER

wow....yes, i like !...i like a lot...although you have mixed up directories with files :-)

I am still left with my structural problem though, I can't seem to solve it....

I need the output to be
<?xml version="1.0" encoding="UTF-8"?>
<disk>
		<dir>
			<name>D:</name>
			<dir>
				<name>temp</name>
				<dir>
					<name>etc</name>
					<directory>
						<name>BACKUP</name>
						<date>25/02/2012 13:22</date>
						<size>0</size>
					</directory>
					<file>
						<name>TextFile.txt</name>
						<date>18/04/2012 11:43</date>
						<size>5000</size>
					</file>
					<directory>
						<name>Documents and Settings</name>
						<date>12/09/2009 07:23</date>
						<size>0</size>
					</directory>
					<directory>
						<name>Program Files</name>
						<date>25/02/2012 16:51</date>
						<size>0</size>
					</directory>
					<directory>
						<name>wamp</name>
						<date>02/05/2009 17:24</date>
						<size>0</size>
					</directory>
					<directory>
						<name>WINDOWS</name>
						<date>25/02/2012 16:51</date>
						<size>0</size>
					</directory>
					<directory>
						<name>workspace</name>
						<date>10/03/2010 20:19</date>
						<size>0</size>
					</directory>
				</dir>
			</dir>
		</dir>
</disk>

at the moment it's:

<?xml version="1.0" encoding="UTF-8"?>
<disk>
	<dir>
		<dir>
			<name>D:</name>
			<dir>
				<name>temp</name>
				<dir>
					<name>etc</name>
				</dir>
			</dir>
		</dir>
		<file>
			<name>BACKUP</name>
			<date>25/02/2012 13:22</date>
			<size>0</size>
		</file>
		<directory>
			<name>TextFile.txt</name>
			<date>18/04/2012 11:43</date>
			<size>5000</size>
		</directory>
		<file>
			<name>Documents and Settings</name>
			<date>12/09/2009 07:23</date>
			<size>0</size>
		</file>
		<file>
			<name>Program Files</name>
			<date>25/02/2012 16:51</date>
			<size>0</size>
		</file>
		<file>
			<name>wamp</name>
			<date>02/05/2009 17:24</date>
			<size>0</size>
		</file>
		<file>
			<name>WINDOWS</name>
			<date>25/02/2012 16:51</date>
			<size>0</size>
		</file>
		<file>
			<name>workspace</name>
			<date>10/03/2010 20:19</date>
			<size>0</size>
		</file>
	</dir>
</disk>

I did some restructuring (this way it also works for the root dir)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     version="2.0">
    <xsl:output indent="yes"/>
    <xsl:variable name="input-str" select="/root"/>
    
    <xsl:template match="/">
        <disk>
            <xsl:analyze-string select="$input-str" regex="Directory\s+of\s+(\w:)\\?([^\n]*)\n">
                <xsl:matching-substring>
                    <xsl:variable name="this-path" select="tokenize(regex-group(2), '\\')"/>
                    <dir>
                        <name><xsl:value-of select="regex-group(1)"/></name>
                        <xsl:call-template name="process-path">
                            <xsl:with-param name="path" select="$this-path"/>
                        </xsl:call-template>
                        <xsl:call-template name="process-file-list">
                            <xsl:with-param name="path" select="$this-path"/>
                        </xsl:call-template>
                    </dir>
                </xsl:matching-substring>
                <xsl:non-matching-substring/>
            </xsl:analyze-string>
        </disk>
    </xsl:template>
    <xsl:template name="process-path">
        <xsl:param name="path"/>
        <xsl:if test="count($path) > 0">
            <dir>
                <name>
                    <xsl:value-of select="$path[1]"/>
                </name>
                <xsl:call-template name="process-path">
                    <xsl:with-param name="path" select="$path[position() > 1]"/>
                </xsl:call-template>
            </dir>
        </xsl:if>
        <xsl:call-template name="process-file-list">
            <xsl:with-param name="path" select="$path"/>
        </xsl:call-template>
    </xsl:template>
    
    <xsl:template name="process-file-list">
        <xsl:param name="path"/>
        <xsl:if test="count($path) = 0">
            <xsl:analyze-string select="$input-str" regex="\n">
                <xsl:matching-substring/>
                <xsl:non-matching-substring>
                    <xsl:analyze-string select="." regex="(\d+/\d+/\d+\s+\d+:\d+)\s+(&lt;DIR&gt;|\d+)\s+(.+)">
                        <xsl:matching-substring>
                            <xsl:variable name="elem-name" select="if(matches(regex-group(2), '\d+')) then('file') else('directory')"/>
                            <xsl:element name="{$elem-name}">
                                <name>
                                    <xsl:value-of select="normalize-space(regex-group(3))"/>
                                </name>
                                <date>
                                    <xsl:value-of select="normalize-space(regex-group(1))"/>
                                </date>
                                <size>
                                    <xsl:value-of select="number(translate(regex-group(2), '&lt;&gt;DIRdir', '0'))"/>
                                </size>
                            </xsl:element>
                        </xsl:matching-substring>
                        <xsl:non-matching-substring/>
                    </xsl:analyze-string>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Molko

ASKER

Hi

Is your latest regex correct ?
regex="Directory\s+of\s+(\w:)\\?([^\n]*)\n">

It does not seem to work, however if i change it to
regex="Directory\s+of\s+(\w:)\\?([^\n].*)">

then it seems to be OK.

Just wondering what your changes intended to do.

Thankyou
since I use that on the full input-str, I need this ([^\n]*) for grabbing the filename (and stopping at the end-of line)

([^\n]*) means anything but a \n
([^\n].*) means one character that is not \n and a bunch of other things
That can't be right

If I run the XSLT I posted, this is what I get

<disk>
   <dir>
      <name>D:</name>
      <dir>
         <name>temp</name>
         <dir>
            <name>etc</name>
            <directory>
               <name>BACKUP</name>
               <date>25/02/2012 13:22</date>
               <size>0</size>
            </directory>
...

Molko

ASKER

I  thought '.' would match on any character except \n

(.*) would mean anything but a \n
(.*) would mean anything but a \n

this depends on the mode actually, I tend to do a lot of multiline regexes, so I work in "dot-all" mode from time to time... so I tend to be more prudent than necessary

without modifier you are right
and this should be the same as what I had originally written
<xsl:analyze-string select="$input-str" regex="Directory\s+of\s+(\w:)\\?(.*)">

having no reference at all to the \n
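
The difference is easy to check outside XSLT. The sketch below uses java.util.regex as a stand-in, which is an assumption: Java's regex flavour is not identical to the XPath 2.0 one, but in both the dot stops at a newline by default and a dot-all switch (Pattern.DOTALL in Java, the "s" flag in XPath) removes that limit. DotDemo and group2 are made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotDemo {

    // Returns capture group 2 of the first match, or null if the regex does not match.
    static String group2(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = " Directory of D:\\temp\\etc\n\n"
                     + "25/02/2012  13:22    <DIR>          BACKUP\n";

        String explicit = "Directory\\s+of\\s+(\\w:)\\\\?([^\\n]*)\\n"; // stops at \n by construction
        String dotted   = "Directory\\s+of\\s+(\\w:)\\\\?(.*)";         // relies on the dot's default

        System.out.println(group2(explicit, 0, input));            // temp\etc
        System.out.println(group2(dotted, 0, input));              // temp\etc
        // In dot-all mode the same ".*" runs on past the end of the line.
        System.out.println(group2(dotted, Pattern.DOTALL, input).contains("BACKUP")); // true
    }
}
```

So the explicit [^\n]* form gives the same answer in either mode, while (.*) only does so as long as nobody switches dot-all on.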