Algorithm: replacing the values of some tags in an XML file with their corresponding hashed values

I have XML files in which the values of some tags are to be anonymised.
Say I only want to anonymise the values of tags <c> and <e>.
How can this be achieved?
I have not been able to work out the algorithm so far. Regular expressions would be one choice, but I am a beginner there.

I have a big XML file, around 100 megabytes. But the number of fields to anonymise is only about 5, so I can set them manually in an array structure before processing starts.


<a>
  <b>value_of_b</b>
  <c>value_of_c</c>
  <d>
      <e attr_e_1="ae1">value_e</e>
      <f attr_f_1="af1" />
  </d>
  <g />
</a>


rusdemezale Asked:
 
CEHJ Commented:
In your case, a SAXFilter would probably be the best solution actually, and more performant than XSLT.
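
To illustrate the SAX filter idea, here is a minimal sketch built on the JDK's `org.xml.sax.helpers.XMLFilterImpl`, serialised through an identity `Transformer`. The class name `HashingFilter` and its helper methods are made up for this example, and it assumes the target elements contain text only (no child elements). It works on strings for brevity; for a 100 MB file, file-based sources and results would be used instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: hashes the text of selected elements, passes everything else through. */
final class HashingFilter extends XMLFilterImpl {
    private final Set<String> targets; // element names whose text is hashed
    private StringBuilder pending;     // non-null while inside a target element

    HashingFilter(XMLReader parent, Set<String> targets) {
        super(parent);
        this.targets = targets;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (targets.contains(qName)) {
            pending = new StringBuilder();
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (pending != null) {
            pending.append(ch, start, length); // hold the text back until the end tag
        } else {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (pending != null) { // assumption: target elements are text-only
            char[] hashed = md5Hex(pending.toString()).toCharArray();
            pending = null;
            super.characters(hashed, 0, hashed.length);
        }
        super.endElement(uri, localName, qName);
    }

    /** MD5 of the UTF-8 bytes of the text, as 32 lowercase hex digits. */
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses xml through the filter and serialises the filtered event stream. */
    static String anonymise(String xml, Set<String> targets) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new HashingFilter(reader, targets),
                            new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b>value_of_b</b><c>value_of_c</c>"
                + "<d><e attr_e_1=\"ae1\">value_e</e></d></a>";
        System.out.println(anonymise(xml, new HashSet<>(Arrays.asList("c", "e"))));
    }
}
```

Because the filter only buffers the text of one target element at a time, memory use should stay small regardless of the file size, which is the main attraction of this approach for large inputs.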
 
objects Commented:
The easiest would be to use XSL.

 
CEHJ Commented:
You really need a specialized API to do that. Xalan is a good thing to use. Have a look at:

http://xml.apache.org/xalan-j/usagepatterns.html
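
As a rough sketch of the XSL route, the snippet below runs an identity stylesheet through the standard JAXP `TransformerFactory` (the API that Xalan also implements) and overrides the text of `<c>` and `<e>`. XSLT 1.0 has no built-in hash function, so this example only substitutes a fixed placeholder, `MASKED`; real hashing would need an extension function, which is not shown. The class name `XslMaskDemo` and the placeholder are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XslMaskDemo {
    // Identity transform plus one overriding template for the text of <c> and <e>.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output omit-xml-declaration='yes'/>"
          + "<xsl:template match='@*|node()'>"
          + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match='c/text()|e/text()'>MASKED</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML string and returns the result. */
    static String mask(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(mask("<a><b>value_of_b</b><c>value_of_c</c>"
                + "<d><e attr_e_1=\"ae1\">value_e</e></d></a>"));
    }
}
```

Note that an XSLT processor typically builds an in-memory model of the whole input, which may matter for a file of around 100 megabytes.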

 
rusdemezale (Author) Commented:
@CEHJ
Is it not easier to handle it through some regex manipulations?
I don't have much time, and there are only 5-6 tags to handle, so I can process them manually.
I just want a quick, working solution, not a generic, elegant one, so to say...

 
objects Commented:
Xalan is definitely not needed. We often do transformations like that and have never needed to use Xalan.

Let me know how you go and if you have any problems.

 
objects Commented:
What did you want to change the fields to?
 
rusdemezale (Author) Commented:
I mean, I will read a chunk of 1024 characters and find the first occurrence of <tag_to_hash in, say, <tag_to_hash attr_1="sth">my_value</tag_to_hash>, then go to the closing > character and read the value into a StringBuffer up to </tag_to_hash>.

Then I change the StringBuffer and write the whole lot as a stream into a second XML file.


 
rusdemezale (Author) Commented:
@objects
I will anonymise the data of some tags, say by hashing them via MD5 or a similar algorithm.
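
For reference, hashing a string with MD5 "or a similar algorithm" can be sketched with the JDK's `java.security.MessageDigest`. The class and method names (`DigestDemo`, `hexDigest`) are hypothetical, and the choice of UTF-8 for turning text into bytes is an assumption; whatever encoding is chosen must be used consistently, or equal values will hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestDemo {
    /** Returns the digest of text (UTF-8 bytes) as lowercase hex, e.g. for "MD5" or "SHA-256". */
    static String hexDigest(String algorithm, String text) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm)
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask with 0xff so negative bytes do not sign-extend
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hexDigest("MD5", "abc"));     // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(hexDigest("SHA-256", "abc"));
    }
}
```

Switching algorithms is then a one-word change, which makes it easy to move away from MD5 later if required.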
 
CEHJ Commented:
>>is it not easier to handle it through some regex manipulations?

In the end, using regex to parse tagged markup is a poor and fragile solution. For one thing, you can only apply regexes to strings with any ease, e.g.

>>I mean, I will read a chunk of 1024 characters,

What are you going to do if the tag starts at offset, say, 1023?
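
The boundary problem can be reproduced in a few lines. The sketch below (hypothetical class `ChunkBoundaryDemo`, with a chunk size of 16 instead of 1024 to keep the input short) counts regex matches once per fixed-size chunk and once over the whole string; the element that straddles the chunk boundary is missed in the chunked run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkBoundaryDemo {
    /** Counts matches when the input is scanned in independent fixed-size chunks. */
    static int countInChunks(String xml, Pattern pattern, int chunkSize) {
        int count = 0;
        for (int from = 0; from < xml.length(); from += chunkSize) {
            String chunk = xml.substring(from, Math.min(xml.length(), from + chunkSize));
            Matcher m = pattern.matcher(chunk);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    /** Counts matches over the complete input. */
    static int countWhole(String xml, Pattern pattern) {
        int count = 0;
        Matcher m = pattern.matcher(xml);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<a><c>one</c><c>two</c></a>";
        Pattern tagC = Pattern.compile("<c>([^<]*)</c>");
        // The second <c> element starts at offset 13 and ends after offset 16,
        // so it is split between the first and the second chunk.
        System.out.println("whole input: " + countWhole(xml, tagC));        // 2
        System.out.println("16-char chunks: " + countInChunks(xml, tagC, 16)); // 1
    }
}
```

A chunked regex approach would have to carry unmatched tails over into the next chunk, which is essentially re-implementing part of a parser.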



 
objects Commented:
In the long run it will be easier to iterate through the DOM rather than using regexp, and it will be a lot more reliable.


First, read the XML into a DOM:
http://helpdesk.objects.com.au/java/how-do-i-create-a-dom-document-from-an-xml-file

Then loop through the nodes, changing attributes as required:
http://www.exampledepot.com/egs/org.w3c.dom/WalkDom.html
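
The two DOM steps above can be sketched as follows, using only the JDK's JAXP classes. The class `DomAnonymiser` and its `replaceText` method are invented for this example; the replacement is passed in as a function so any hashing routine can be plugged in. It assumes the target elements hold text only, since `setTextContent` replaces all children, and it loads the whole document into memory, which may be costly for a file of around 100 megabytes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomAnonymiser {
    /** Replaces the text of every element named in tagNames with replacement(oldText). */
    static String replaceText(String xml, List<String> tagNames, UnaryOperator<String> replacement) {
        try {
            // Step 1: read the XML into a DOM
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: visit the matching elements and change their text
            for (String tag : tagNames) {
                NodeList nodes = doc.getElementsByTagName(tag);
                for (int i = 0; i < nodes.getLength(); i++) {
                    Node node = nodes.item(i);
                    node.setTextContent(replacement.apply(node.getTextContent()));
                }
            }

            // Step 3: serialise the modified DOM with an identity transform
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b>value_of_b</b><c>value_of_c</c></a>";
        // Placeholder replacement; a real run would pass a hashing function here
        System.out.println(replaceText(xml, Arrays.asList("c"), s -> "[" + s.length() + " chars]"));
    }
}
```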

 
objects Commented:
The following shows how to access the attributes:
http://www.exampledepot.com/egs/org.w3c.dom/GetAttr.html
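
In case the linked page is unavailable, attribute access in the DOM API looks roughly like this. The sketch reads all attributes of the first element with a given tag name through `getAttributes()`; the class `AttributeDump` and the method `attributesOf` are illustrative names, not part of any library.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AttributeDump {
    /** Returns name/value pairs of the first element called tagName, or an empty map. */
    static Map<String, String> attributesOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            Node first = doc.getElementsByTagName(tagName).item(0);
            if (first == null) {
                return result; // no such element
            }
            NamedNodeMap attrs = first.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                result.put(attr.getNodeName(), attr.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<d><e attr_e_1=\"ae1\">value_e</e><f attr_f_1=\"af1\" /></d>";
        System.out.println(attributesOf(xml, "e")); // {attr_e_1=ae1}
    }
}
```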
 
rusdemezale (Author) Commented:
Thank you very much.
I created the file using routine SAX parsing.
Regex was very impressive but too complicated for this task. The Java source file is attached and working...

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Writer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
 
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
 
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
/**
 * This is a monolithic class that replaces the values of some critical private fields in an XML file
 * with their MD5 hashes. It parses the input with SAX and writes the result to a second XML file.
 * Usage example: >SAXParserHandlerWriter -XMLFileDirPath    "H:\My Documents\1_md5_xml"     -XMLFileName     BZP2SL_Avaloq_20090112_000009.xml
 * 
 * @author meyi
 * @version $Revision: $
 */
public class SAXParserHandlerWriter extends DefaultHandler {
 
    public static String DIRECTORY_PATH = null; // "c:\\My Documents\\md5_xml\\"; // XMLFileDirPath
    public static String FILENAME = null; // "file.xml"; // XMLFileName
    static {
        //TODO give default paths
        if (File.separatorChar == '/') { // Unix
            DIRECTORY_PATH = "/"; // XMLFileDirPath = ;
            FILENAME = "file1.xml"; // XMLFileName      
        }
        else if (File.separatorChar == '\\') { // Win
            DIRECTORY_PATH = "c:\\My Documents\\md5_xml\\"; // XMLFileDirPath = ;
            FILENAME = "file.xml"; // XMLFileName        
        }
    }
 
    static MessageDigest md = null;
    static {
        try {
            md = MessageDigest.getInstance("MD5"); // getting a 'MD5-Instance'
        }
        catch (NoSuchAlgorithmException e) {
            System.out.println("No Such Algorithm Exception!");
            System.exit(-1);
        }
    }
 
    private static final String[] elementsToMD5 = new String[] { "street", "city", "name", 
                                                                 "telefonNumber",
                                                                 "middlename", "surname" };
 
    public static void main(String[] args) {
 
        boolean error = false;
        if (args.length % 2 != 0) {
            error = true;
            System.out.println("invalid options");
            System.exit(-1);
        }
        else {
            for (int i = 0; i < args.length; i++) {
                //even numbers; option name
                if (i % 2 == 0) {
                    if (args[i].equalsIgnoreCase("-XMLFileDirPath")) {
                        i++; //skip
                        DIRECTORY_PATH = args[i];
                        System.out.println("XMLFileDirPath=" + DIRECTORY_PATH);
                    }
                    else if (args[i].equalsIgnoreCase("-XMLFileName")) {
                        i++; //skip
                        FILENAME = args[i];
                        System.out.println("XMLFileName=" + FILENAME);
                    }
                    else { //wrong option spelling
                        System.out.println("invalid option or spelling in command line argument.Aborting)");
                        error = true;
                        System.exit(-1);
                    }
                }
            } //endfor
        }
 
        File infile = new File(DIRECTORY_PATH, FILENAME);
        String infileStr = infile.toString();
 
        try {
            List entries = null;
            // endsWith replaces indexOf("zip", length), which searched from the end and never matched
            if (infileStr.toLowerCase().endsWith(".zip")) {
                entries = unzipEntries(infileStr);
                if (!entries.isEmpty()) {
                    infileStr = (String) entries.get(0);
                    infile = new File(infileStr);
                }
                else {
                    System.out.println("no File in the ZIP File");
                    System.exit(-1);
                }
            }
        }
        catch (Exception e) {
            System.out.println(e + " " + infileStr);
        }
 
        File outfile = new File(DIRECTORY_PATH, FILENAME.substring(0, FILENAME.length() - 4) + "_b.xml");
        String outfileStr = outfile.toString();
 
        FileInputStream fis = null;
        BufferedInputStream bis = null;
        DataInputStream dis = null;
 
        //        try {
        //            //TODO FileIO.copyFile(infile.toString(), outfile.toString());
        //        }
        //        catch (Exception e) {
        //            System.out.println(outfile);
        //        }
 
        try {
            fis = new FileInputStream(infile);
            // Here BufferedInputStream is added for fast reading.
            bis = new BufferedInputStream(fis);
            dis = new DataInputStream(bis);
 
            FileOutputStream fos = new FileOutputStream(outfile);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            DataOutputStream dos = new DataOutputStream(bos);
            out = new FileWriter(outfile, true);
 
            // Create a handler to handle the SAX events generated during parsing
            DefaultHandler handler = new SAXParserHandlerWriter();
            // Parse the file using the handler
            parseXmlFile(infileStr, handler, false);
 
            /*
                       String line = null;
                       // dis.available() returns 0 if the file does not have more lines.
                       while (dis.available() != 0) {
                           // this statement reads the line from the file and print it to
                           // the console.
                           line = dis.readLine();
                           System.out.println(line);
                           
                           String myRegex = "\\d+\\w+"; // This provides for \d+\w+
                           java.util.regex.Pattern p = java.util.regex.Pattern.compile(myRegex);
                           java.util.regex.Matcher m = p.matcher(line);
                           if (m.find()) {
                               String matchedText = m.group();
                               int matchedFrom = m.start();
                               int matchedTo = m.end();
                               System.out.println("matched [" + matchedText + "] " +
                                                  "from " + matchedFrom +
                                                  " to " + matchedTo + ".");
                           }
                           
                           BufferedWriter out = new BufferedWriter(new FileWriter(outfile, true));
                           out.write("aString");
                           out.close();
                       }
                        */
            // dispose all the resources after using them.
            out.close(); // the handler writes its output through this static Writer
            fis.close();
            bis.close();
            dis.close();
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    // Parses an XML file using a SAX parser.
    // If validating is true, the contents is validated against the DTD
    // specified in the file.
    public static void parseXmlFile(String filename, DefaultHandler handler, boolean validating) {
        try {
            // Create a builder factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            //factory.setFeature("http://apache.org/xml/features/dom/include-ignorable-whitespace", true);
            // Create the builder and parse the file
            SAXParser parser = factory.newSAXParser();
            parser.parse(new File(filename), handler);
        }
        catch (SAXException e) {
            // A parsing error occurred; the xml input is not valid
            System.out.println(e);
        }
        catch (ParserConfigurationException e) {
            System.out.println(e);
        }
        catch (IOException e) {
            System.out.println(e);
        }
    }
 
    /**
     *
     * @param src Copies src file to dst file.
     * @param dst If the dst file does not exist, it is created
     * @throws IOException
     */
    void copy(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dst);
 
        // Transfer bytes from in to out
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
        }
        in.close();
        out.close();
    }
 
    /**
     * unzip a ZIP File
     *
     * @param zipFileName
     */
    public static final List unzipEntries(String zipFileName) {
        List list = new ArrayList();
        try {
            ZipFile zipFile = new ZipFile(zipFileName);
 
            Enumeration entries = zipFile.entries();
 
            while (entries.hasMoreElements()) {
                ZipEntry entry = (ZipEntry) entries.nextElement();
 
                if (entry.isDirectory()) {
                    // Assume directories are stored parents first then children.
                    System.err.println("Extracting directory: " + entry.getName());
                    // This is not robust, just for demonstration purposes.
                    (new File(entry.getName())).mkdir();
                    continue;
                }
 
                System.err.println("Extracting file: " + entry.getName());
                copyInputStream(zipFile.getInputStream(entry), new BufferedOutputStream(new FileOutputStream(entry
                    .getName())));
                list.add(entry.getName());
 
            }
 
            zipFile.close();
        }
        catch (IOException ioe) {
            System.err.println("Unhandled exception:");
            ioe.printStackTrace();
            return list;
        }
        return list;
    }
 
    /**
     * to be used while unzipping a ZIP File
     * @param in
     * @param out
     * @throws IOException
     */
    public static final void copyInputStream(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        int len;
 
        while ((len = in.read(buffer)) >= 0)
            out.write(buffer, 0, len);
 
        in.close();
        out.close();
    }
 
    // This class listens for startElement SAX events
    Locator locator;
    int indent;
    StringBuffer textBuffer;
    static private Writer out;
    /** <Kunde> is a special case. <Valor> element has also a child element named <name>: not to be MD5'ed */
    boolean isNameFromKunde = false;
 
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }
 
    public void startDocument() {
        try {
            out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
        }
        catch (Exception e) {
            System.out.println(e);
        }
        finally {
 
        }
 
        System.out.println("start document");
    }
 
    public void endDocument() {
        System.out.println("end document");
    }
 
    // This method is called when an element is encountered
    public void startElement(String namespaceURI, String localName, // simple name
                             String qName, // qualified name
                             Attributes atts) throws SAXException {
        emit("<" + qName);
 
        /* <Kunde> detected. <name> elements inside it must be MD5'ed later on */
        if (qName.equalsIgnoreCase("Kunde")) {
            isNameFromKunde = true;
        }
        //echoText();
 
        int col, line;
        String publicId = null;
        String systemId = null;
        if (locator != null) {
            col = locator.getColumnNumber();
            line = locator.getLineNumber();
            publicId = locator.getPublicId();
            systemId = locator.getSystemId();
        }
 
        // Get the number of attribute
        int length = atts.getLength();
 
        // Process each attribute
        for (int i = 0; i < length; i++) {
            // Get names and values for each attribute
            String name = atts.getQName(i);
            String value = atts.getValue(i);
 
            // The following methods are valid only if the parser is namespace-aware
 
            // The uri of the attribute's namespace
            String nsUri = atts.getURI(i);
 
            // This is the name without the prefix
            String lName = atts.getLocalName(i);
 
            nl();
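            // Re-escape the value: the parser delivers attribute values unescaped
            emit(" " + name + "=\"" + value.replaceAll("&", "&amp;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;") + "\" ");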
            nl();
        }
        emit(" >");
        
        System.out.print("start element: " + qName + " textBuffer:" + textBuffer + " ");
    }
 
    public void endElement(String uri, String localName, String qName) throws SAXException {
        /* </Kunde> detected. <name> element must not be MD5'ed later on */
        if (qName.equalsIgnoreCase("Kunde")) {
            isNameFromKunde = false;
        }
        
        if (isAnyOfThese(qName) && isNameFromKunde) {
            echoText(true);
        }
        else {
            echoText(false);
        }
 
        emit("</" + qName + ">");
 
        indent--;
        printIndent();
        System.out.println("end element: " + qName + " textBuffer:" + textBuffer);
    }
 
    public void ignorableWhitespace(char[] ch, int start, int length) {
        printIndent();
        System.out.println("whitespace, length " + length);
    }
 
    public void processingInstruction(String target, String data) {
        printIndent();
        System.out.println("processing instruction: " + target);
    }
 
    public void characters(char buf[], int offset, int len) throws SAXException {
        String s = new String(buf, offset, len);
        if (textBuffer == null) {
            textBuffer = new StringBuffer(s);
        }
        else {
            textBuffer.append(s);
        }
        System.out.print("characters(...) character data: " + s + " len:" + +len + " , textBuffer: " + textBuffer);
    }
 
    void printIndent() {
        for (int i = 0; i < indent; i++) {
            //System.out.print("-");
        }
    }
 
    //===========================================================
    // Utility Methods ...
    //===========================================================
    private void echoText(boolean md5) throws SAXException {
        if (textBuffer == null) {
            return;
        }
 
        //nl();
        //emit("CHARS: |");
 
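        // Trim before hashing: textBuffer may still hold indentation whitespace
        // collected from the parent element, which must not influence the hash.
        String s = textBuffer.toString().trim();
        if (md5) {
            s = makeMD5(s);
        }
        // Re-escape the characters that are special in XML text
        s = s.replaceAll("&", "&amp;");
        s = s.replaceAll("<", "&lt;");
        s = s.replaceAll(">", "&gt;");
        s = s.replaceAll("\"", "&quot;");
        s = s.replaceAll("\'", "&apos;");
        
        emit(s);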
        //emit("|");
        textBuffer = null;
    }
 
    // Wrap I/O exceptions in SAX exceptions, to
    // suit handler signature requirements
    private void emit(String s) throws SAXException {
        try {
 
            out.write(s.trim()); // s.replaceAll("\n", "") );
            out.flush();
        }
        catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }
 
    // Start a new line
    private void nl() throws SAXException {
        String lineEnd = System.getProperty("line.separator");
 
        try {
            out.write(lineEnd);
        }
        catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }
 
    public final boolean isAnyOfThese(String str) {
        boolean ret = false;
        for (int i = 0; i < elementsToMD5.length; i++) {
            if (elementsToMD5[i].equalsIgnoreCase(str)) {
                ret = true;
            }
        }
        return ret;
    }
 
    /**
     * <u>MD5-Hash generate</u>
     * @return
     */
    static final String makeMD5(String text) {
        String hash;
        byte[] encryptMsg = null;
 
        //TODO md = MessageDigest.getInstance( "MD5" );        // getting a 'MD5-Instance'
        byte[] textAsByteArr = text.getBytes();
        encryptMsg = md.digest(textAsByteArr); // solving the MD5-Hash
 
        // Convert each byte to two lowercase hex digits. Masking with 0xff avoids
        // the sign extension that Integer.toHexString applies to negative bytes.
        StringBuffer strBuf = new StringBuffer();
        for (int i = 0; i < encryptMsg.length; i++) {
            int unsigned = encryptMsg[i] & 0xff;
            if (unsigned < 0x10) {
                strBuf.append('0'); // left-pad single-digit values
            }
            strBuf.append(Integer.toHexString(unsigned));
        }
 
        hash = strBuf.toString(); // String with the MD5-Hash
        return hash; // returns the MD5-Hash
    }
}

Question has a verified solution.
