Solved

Some tag values in an XML file are to be replaced with their corresponding hashed values: ALGORITHM

Posted on 2009-02-22
Last Modified: 2013-11-11
I have XML files in which the values of some tags are to be anonymised.
Say I will only anonymise the values of the tags <c> and <e>.
How can this be achieved?
I have not been able to construct the algorithm so far. Regular expressions would be a choice, but I am a beginner there.

I have a big XML file, around 100 megabytes. But the number of fields to anonymise is only about 5, so I can set them manually in an array structure before starting processing.


<a>
  <b>value_of_b</b>
  <c>value_of_c</c>
  <d>
      <e attr_e_1="ae1">value_e</e>
      <f attr_f_1="af1" />
  </d>
  <g />
</a>

Question by:rusdemezale
 
LVL 92

Expert Comment

by:objects
ID: 23707062
The easiest way would be to use XSL.

0
 
LVL 86

Expert Comment

by:CEHJ
ID: 23707185
You really need a specialized API to do that. Xalan is a good thing to use. Have a look at

http://xml.apache.org/xalan-j/usagepatterns.html
0
 

Author Comment

by:rusdemezale
ID: 23707247
@CEHJ
Is it not easier to handle it through some regex manipulations?
I don't have much time, and there are only 5-6 tags to handle, so I can process them manually.
I just want a quick, working solution, not a generic, elegant one, so to say...

0
 
LVL 92

Expert Comment

by:objects
ID: 23707250
Xalan is definitely not needed. We often do transformations like that and have never needed to use Xalan.

Let me know how you go and if you have any problems.

0
 
LVL 92

Expert Comment

by:objects
ID: 23707254
what did you want to change the fields to?
0
 

Author Comment

by:rusdemezale
ID: 23707261
I mean, I will read a chunk of 1024 characters and find the first occurrence of <tag_to_hash in, say, <tag_to_hash attr_1="sth">my_value</tag_to_hash>. Then I go to the closing > character of the start tag and read the value up to </tag_to_hash> into a StringBuffer.

Then I change the StringBuffer and write the whole lot as a stream into a second XML file.


0
 

Author Comment

by:rusdemezale
ID: 23707271
@objects
I will anonymise the data of some tags, say hash them via MD5 or a similar algorithm.
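
A minimal sketch of the hashing step described here, assuming the tag value is available as a String and that UTF-8 is an acceptable byte encoding (the thread does not say which encoding to use). The class name Md5Hex and the method md5Hex are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Hex {

    /** Returns the MD5 digest of the text as 32 lowercase hex digits. */
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Assumption: hash the UTF-8 bytes of the value
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc")); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The same input always produces the same digest, so equal values in the file stay equal after anonymisation.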
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 23707288
>>is it not easier to handle it through some regex manipulations?

In the end, using regex to parse tagged markup is a poor and fragile solution. For one thing, you can only apply regexes with any ease to strings, e.g.

>>I mean, I will read a chunk of 1024 characters,

what are you going to do if the tag starts at offset, say, 1023?
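
The offset-1023 problem can be made concrete with a small sketch. It assumes a simple pattern for the <c> tag from the question and a padded sample document built so that the tag starts at offset 1023. The class ChunkBoundaryDemo and its methods are hypothetical names, not code from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkBoundaryDemo {

    // Matches one complete <c>...</c> element
    private static final Pattern TAG = Pattern.compile("<c>(.*?)</c>");

    /** Counts complete <c> elements found in the given string. */
    public static int countMatches(String s) {
        Matcher m = TAG.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Applies the same pattern to each fixed-size chunk independently. */
    public static int countMatchesInChunks(String s, int chunkSize) {
        int n = 0;
        for (int i = 0; i < s.length(); i += chunkSize) {
            n += countMatches(s.substring(i, Math.min(s.length(), i + chunkSize)));
        }
        return n;
    }

    /** Builds a document whose <c> start tag begins at offset 1023. */
    public static String sampleDocument() {
        StringBuilder sb = new StringBuilder("<a>");
        while (sb.length() < 1023) {
            sb.append(' ');
        }
        sb.append("<c>value_of_c</c></a>");
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = sampleDocument();
        System.out.println("whole document: " + countMatches(doc));               // 1
        System.out.println("1024-char chunks: " + countMatchesInChunks(doc, 1024)); // 0
    }
}
```

The first chunk ends with a lone "<" and the second starts with "c>", so neither chunk contains a complete tag and the value is never hashed.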



0
 
LVL 92

Assisted Solution

by:objects
objects earned 135 total points
ID: 23707293
In the long run it will be easier to iterate through the DOM rather than using regexp, and it will be a lot more reliable.


First read the XML into a DOM:
http://helpdesk.objects.com.au/java/how-do-i-create-a-dom-document-from-an-xml-file

Then loop through the nodes, changing attributes as required:
http://www.exampledepot.com/egs/org.w3c.dom/WalkDom.html
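
The two steps above can be sketched roughly as follows, using only the JAXP classes shipped with the JDK. This is a sketch under assumptions: the input is passed as a String, the selected elements contain only text, and their text content (not their attributes) is what gets hashed. DomAnonymiser, anonymise and md5Hex are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomAnonymiser {

    /** MD5 digest of the UTF-8 bytes of the text, as lowercase hex. */
    public static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Replaces the text of every element named in tagNames with its MD5 hash. */
    public static String anonymise(String xml, String... tagNames) {
        try {
            // Step 1: read the XML into a DOM
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Step 2: visit the selected elements and replace their text
            for (String tag : tagNames) {
                NodeList nodes = doc.getElementsByTagName(tag);
                for (int i = 0; i < nodes.getLength(); i++) {
                    nodes.item(i).setTextContent(md5Hex(nodes.item(i).getTextContent()));
                }
            }
            // Serialise the modified tree with an identity transformer
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not anonymise document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b>value_of_b</b><c>value_of_c</c>"
                + "<d><e attr_e_1=\"ae1\">value_e</e><f attr_f_1=\"af1\"/></d><g/></a>";
        System.out.println(anonymise(xml, "c", "e"));
    }
}
```

Note that a DOM holds the whole document in memory, which matters for an input of around 100 megabytes.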

0
 
LVL 92

Expert Comment

by:objects
ID: 23707301
The following shows how to access the attributes:
http://www.exampledepot.com/egs/org.w3c.dom/GetAttr.html
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 225 total points
ID: 23707312
In your case, a SAXFilter would probably be the best solution actually, and more performant than XSLT.
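
One way such a SAX filter could look, as a sketch only: it extends org.xml.sax.helpers.XMLFilterImpl and lets the JDK's identity Transformer write the filtered events back out as XML. It assumes the target elements contain only text (no child elements) and that tags are matched by qualified name. HashingFilter, anonymise and md5Hex are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class HashingFilter extends XMLFilterImpl {

    private final Set<String> targets;
    private StringBuilder text; // non-null only while inside a target element

    public HashingFilter(XMLReader parent, Set<String> targets) {
        super(parent);
        this.targets = targets;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        if (targets.contains(qName)) {
            text = new StringBuilder(); // start collecting the value to hash
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (text != null) {
            text.append(ch, start, length); // hold back the original value
        } else {
            super.characters(ch, start, length); // pass everything else through
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (text != null) {
            char[] hashed = md5Hex(text.toString()).toCharArray();
            super.characters(hashed, 0, hashed.length); // emit the hash instead
            text = null;
        }
        super.endElement(uri, localName, qName);
    }

    /** MD5 digest of the UTF-8 bytes of the text, as lowercase hex. */
    public static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs the filter over an XML string and returns the rewritten document. */
    public static String anonymise(String xml, String... tagNames) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            HashingFilter filter = new HashingFilter(reader, new HashSet<String>(Arrays.asList(tagNames)));
            StringWriter out = new StringWriter();
            // The identity transformer serialises the SAX events coming out of the filter
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not anonymise document", e);
        }
    }
}
```

For a large file, the StringReader and StringWriter would be swapped for file streams. The filter handles events as they arrive, so only the text of the current target element is buffered.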
0
 

Author Comment

by:rusdemezale
ID: 23733651
Thank you very much.
I created the file using routine SAX parsing.
Regex was very impressive but too complicated for this task. The Java source file is attached and working...

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Writer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
 
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
 
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
/**
 * This is a monolithic class that replaces the values of some critical private fields in an XML file with their MD5 hashes.
 * The input is parsed with SAX and the anonymised copy is written next to it, with "_b.xml" appended to the base name.
 * Usage example: >SAXParserHandlerWriter -XMLFileDirPath    "H:\My Documents\1_md5_xml"     -XMLFileName     BZP2SL_Avaloq_20090112_000009.xml
 * 
 * @author meyi
 * @version $Revision: $
 */
public class SAXParserHandlerWriter extends DefaultHandler {
 
    public static String DIRECTORY_PATH = null; // "c:\\My Documents\\md5_xml\\"; // XMLFileDirPath
    public static String FILENAME = null; // "file.xml"; // XMLFileName
    static {
        //TODO give default paths
        if (File.separatorChar == '/') { // Unix
            DIRECTORY_PATH = "/"; // XMLFileDirPath = ;
            FILENAME = "file1.xml"; // XMLFileName      
        }
        else if (File.separatorChar == '\\') { // Win
            DIRECTORY_PATH = "c:\\My Documents\\md5_xml\\"; // XMLFileDirPath = ;
            FILENAME = "file.xml"; // XMLFileName        
        }
    }
 
    static MessageDigest md = null;
    static {
        try {
            md = MessageDigest.getInstance("MD5"); // getting a 'MD5-Instance'
        }
        catch (NoSuchAlgorithmException e) {
            System.out.println("No Such Algorithm Exception!");
            System.exit(-1);
        }
    }
 
    private static final String[] elementsToMD5 = new String[] { "street", "city", "name", 
                                                                 "telefonNumber",
                                                                 "middlename", "surname" };
 
    public static void main(String[] args) {
 
        boolean error = false;
        if (args.length % 2 != 0) {
            error = true;
            System.out.println("invalid options");
            System.exit(-1);
        }
        else {
            for (int i = 0; i < args.length; i++) {
                //even numbers; option name
                if (i % 2 == 0) {
                    if (args[i].equalsIgnoreCase("-XMLFileDirPath")) {
                        i++; //skip
                        DIRECTORY_PATH = args[i];
                        System.out.println("XMLFileDirPath=" + DIRECTORY_PATH);
                    }
                    else if (args[i].equalsIgnoreCase("-XMLFileName")) {
                        i++; //skip
                        FILENAME = args[i];
                        System.out.println("XMLFileName=" + FILENAME);
                    }
                    else { //wrong option spelling
                        System.out.println("Invalid option or spelling in command line argument. Aborting.");
                        error = true;
                        System.exit(-1);
                    }
                }
            } //endfor
        }
 
        File infile = new File(DIRECTORY_PATH, FILENAME);
        String infileStr = infile.toString();
 
        try {
            List entries = null;
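            // indexOf("zip", length) starts searching at the end of the string and never matches,
            // so test the file suffix instead
            if (infileStr.toLowerCase().endsWith(".zip")) {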
                entries = unzipEntries(infileStr);
                if (!entries.isEmpty()) {
                    infileStr = (String) entries.get(0);
                    infile = new File(infileStr);
                }
                else {
                    System.out.println("no File in the ZIP File");
                    System.exit(-1);
                }
            }
        }
        catch (Exception e) {
            System.out.println(e + " " + infileStr);
        }
 
        File outfile = new File(DIRECTORY_PATH, FILENAME.substring(0, FILENAME.length() - 4) + "_b.xml");
        String outfileStr = outfile.toString();
 
        FileInputStream fis = null;
        BufferedInputStream bis = null;
        DataInputStream dis = null;
 
        //        try {
        //            //TODO FileIO.copyFile(infile.toString(), outfile.toString());
        //        }
        //        catch (Exception e) {
        //            System.out.println(outfile);
        //        }
 
        try {
            fis = new FileInputStream(infile);
            // Here BufferedInputStream is added for fast reading.
            bis = new BufferedInputStream(fis);
            dis = new DataInputStream(bis);
 
            // Write the output as ISO-8859-1 so that it matches the XML declaration
            // emitted in startDocument(). Opening the stream truncates any old output file.
            out = new java.io.BufferedWriter(
                new java.io.OutputStreamWriter(new FileOutputStream(outfile), "ISO-8859-1"));
 
            // Create a handler to handle the SAX events generated during parsing
            DefaultHandler handler = new SAXParserHandlerWriter();
            // Parse the file using the handler
            parseXmlFile(infileStr, handler, false);
 
            /*
                       String line = null;
                       // dis.available() returns 0 if the file does not have more lines.
                       while (dis.available() != 0) {
                           // this statement reads the line from the file and print it to
                           // the console.
                           line = dis.readLine();
                           System.out.println(line);
                           
                           String myRegex = "\\d+\\w+"; // This provides for \d+\w+
                           java.util.regex.Pattern p = java.util.regex.Pattern.compile(myRegex);
                           java.util.regex.Matcher m = p.matcher(line);
                           if (m.find()) {
                               String matchedText = m.group();
                               int matchedFrom = m.start();
                               int matchedTo = m.end();
                               System.out.println("matched [" + matchedText + "] " +
                                                  "from " + matchedFrom +
                                                  " to " + matchedTo + ".");
                           }
                           
                           BufferedWriter out = new BufferedWriter(new FileWriter(outfile, true));
                           out.write("aString");
                           out.close();
                       }
                        */
            // dispose all the resources after using them.
            out.close(); // the output writer was otherwise never closed
            fis.close();
            bis.close();
            dis.close();
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    // Parses an XML file using a SAX parser.
    // If validating is true, the contents is validated against the DTD
    // specified in the file.
    public static void parseXmlFile(String filename, DefaultHandler handler, boolean validating) {
        try {
            // Create a builder factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            //factory.setFeature("http://apache.org/xml/features/dom/include-ignorable-whitespace", true);
            // Create the builder and parse the file
            SAXParser parser = factory.newSAXParser();
            parser.parse(new File(filename), handler);
        }
        catch (SAXException e) {
            // A parsing error occurred; the xml input is not valid
            System.out.println(e);
        }
        catch (ParserConfigurationException e) {
            System.out.println(e);
        }
        catch (IOException e) {
            System.out.println(e);
        }
    }
 
    /**
     *
     * @param src Copies src file to dst file.
     * @param dst If the dst file does not exist, it is created
     * @throws IOException
     */
    void copy(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dst);
 
        // Transfer bytes from in to out
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
        }
        in.close();
        out.close();
    }
 
    /**
     * unzip a ZIP File
     *
     * @param zipFileName
     */
    public static final List unzipEntries(String zipFileName) {
        List list = new ArrayList();
        try {
            ZipFile zipFile = new ZipFile(zipFileName);
 
            Enumeration entries = zipFile.entries();
 
            while (entries.hasMoreElements()) {
                ZipEntry entry = (ZipEntry) entries.nextElement();
 
                if (entry.isDirectory()) {
                    // Assume directories are stored parents first then children.
                    System.err.println("Extracting directory: " + entry.getName());
                    // This is not robust, just for demonstration purposes.
                    (new File(entry.getName())).mkdir();
                    continue;
                }
 
                System.err.println("Extracting file: " + entry.getName());
                copyInputStream(zipFile.getInputStream(entry), new BufferedOutputStream(new FileOutputStream(entry
                    .getName())));
                list.add(entry.getName());
 
            }
 
            zipFile.close();
        }
        catch (IOException ioe) {
            System.err.println("Unhandled exception:");
            ioe.printStackTrace();
            return list;
        }
        return list;
    }
 
    /**
     * to be used while unzipping a ZIP File
     * @param in
     * @param out
     * @throws IOException
     */
    public static final void copyInputStream(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        int len;
 
        while ((len = in.read(buffer)) >= 0)
            out.write(buffer, 0, len);
 
        in.close();
        out.close();
    }
 
    // This class listens for startElement SAX events
    Locator locator;
    int indent;
    StringBuffer textBuffer;
    static private Writer out;
    /** <Kunde> is a special case. <Valor> element has also a child element named <name>: not to be MD5'ed */
    boolean isNameFromKunde = false;
 
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }
 
    public void startDocument() {
        try {
            out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
        }
        catch (Exception e) {
            System.out.println(e);
        }
        finally {
 
        }
 
        System.out.println("start document");
    }
 
    public void endDocument() {
        System.out.println("end document");
    }
 
    // This method is called when an element is encountered
    public void startElement(String namespaceURI, String localName, // simple name
                             String qName, // qualified name
                             Attributes atts) throws SAXException {
        emit("<" + qName);
 
        /* <Kunde> detected. <name> element must be MD5'ed later on */
        if (qName.equalsIgnoreCase("Kunde")) {
            isNameFromKunde = true;
        }
        //echoText();
 
        int col, line;
        String publicId = null;
        String systemId = null;
        if (locator != null) {
            col = locator.getColumnNumber();
            line = locator.getLineNumber();
            publicId = locator.getPublicId();
            systemId = locator.getSystemId();
        }
 
        // Get the number of attribute
        int length = atts.getLength();
 
        // Process each attribute
        for (int i = 0; i < length; i++) {
            // Get names and values for each attribute
            String name = atts.getQName(i);
            String value = atts.getValue(i);
 
            // The following methods are valid only if the parser is namespace-aware
 
            // The uri of the attribute's namespace
            String nsUri = atts.getURI(i);
 
            // This is the name without the prefix
            String lName = atts.getLocalName(i);
 
            nl();
            emit(" " + name + "=" + "\"" + value + "\" ");
            nl();
        }
        emit(" >");
        
        System.out.print("start element: " + qName + " textBuffer:" + textBuffer + " ");
    }
 
    public void endElement(String uri, String localName, String qName) throws SAXException {
        /* </Kunde> detected. <name> element must not be MD5'ed later on */
        if (qName.equalsIgnoreCase("Kunde")) {
            isNameFromKunde = false;
        }
        
        if (isAnyOfThese(qName) && isNameFromKunde) {
            echoText(true);
        }
        else {
            echoText(false);
        }
 
        emit("</" + qName + ">");
 
        indent--;
        printIndent();
        System.out.println("end element: " + qName + " textBuffer:" + textBuffer);
    }
 
    public void ignorableWhitespace(char[] ch, int start, int length) {
        printIndent();
        System.out.println("whitespace, length " + length);
    }
 
    public void processingInstruction(String target, String data) {
        printIndent();
        System.out.println("processing instruction: " + target);
    }
 
    public void characters(char buf[], int offset, int len) throws SAXException {
        String s = new String(buf, offset, len);
        if (textBuffer == null) {
            textBuffer = new StringBuffer(s);
        }
        else {
            textBuffer.append(s);
        }
        System.out.print("characters(...) character data: " + s + " len:" + +len + " , textBuffer: " + textBuffer);
    }
 
    void printIndent() {
        for (int i = 0; i < indent; i++) {
            //System.out.print("-");
        }
    }
 
    //===========================================================
    // Utility Methods ...
    //===========================================================
    private void echoText(boolean md5) throws SAXException {
        if (textBuffer == null) {
            return;
        }
 
        //nl();
        //emit("CHARS: |");
 
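        // Trim first: the buffer can also hold the indentation that preceded the start tag,
        // and hashing it would make the digest depend on the file's formatting.
        String s = textBuffer.toString().trim();
        if (md5) {
            s = makeMD5(s);
        }
        // Escape the XML special characters; '&' must be replaced first.
        s = s.replace("&", "&amp;");
        s = s.replace("<", "&lt;");
        s = s.replace(">", "&gt;");
        s = s.replace("\"", "&quot;");
        s = s.replace("'", "&apos;");
        
        emit(s);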
        //emit("|");
        textBuffer = null;
    }
 
    // Wrap I/O exceptions in SAX exceptions, to
    // suit handler signature requirements
    private void emit(String s) throws SAXException {
        try {
 
            out.write(s.trim()); // s.replaceAll("\n", "") );
            out.flush();
        }
        catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }
 
    // Start a new line
    private void nl() throws SAXException {
        String lineEnd = System.getProperty("line.separator");
 
        try {
            out.write(lineEnd);
        }
        catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }
 
    public final boolean isAnyOfThese(String str) {
        boolean ret = false;
        for (int i = 0; i < elementsToMD5.length; i++) {
            if (elementsToMD5[i].equalsIgnoreCase(str)) {
                ret = true;
            }
        }
        return ret;
    }
 
    /**
     * <u>MD5-Hash generate</u>
     * @return
     */
    static final String makeMD5(String text) {
        //TODO md = MessageDigest.getInstance( "MD5" );        // getting a 'MD5-Instance'
        byte[] textAsByteArr = text.getBytes(); // uses the platform default charset
        byte[] digest = md.digest(textAsByteArr); // computing the MD5 hash

        StringBuffer strBuf = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            // Mask to 0..255 so that negative bytes are not sign-extended to eight hex digits
            int unsigned = digest[i] & 0xff;
            if (unsigned < 0x10) {
                strBuf.append('0'); // pad to two hex digits per byte
            }
            strBuf.append(Integer.toHexString(unsigned));
        }

        return strBuf.toString(); // the MD5 hash as 32 lowercase hex digits
    }
}
