Suda_RamanaReddy

Need the pattern string that occurred for the regular expression

Hi All,

I am splitting a string based on a regular expression, as shown below.

public List splitString (String str, String regularExp){
 // Doing splitting and segregating it to chunks...
String [] splitArrey = str.split(regularExp);

// Want to get the actual pattern occurring each time...

}
Invoking the above method:
String regEx =  "<?(br|p|-- end --)>";
List resultList = new Splitter().splitString (str, regEx);
----------------------------------------------------------------------------

In the above method, I now want to find out which matching pattern occurred. It could be <br>, <p>, or <-- end -->.

When I try to access the pattern String, I get the regular expression I passed in. Instead, I want the actual pattern string that occurred, because I need to add that pattern string at the end of my chunk.

All I need is access to the pattern string that is splitting the string. Is there any method to get it?
Thanks
dankuck

Unlike Perl, Java does not offer parentheses to solve this problem (or at least, the API documentation doesn't mention it).  If parentheses were included around the regex in Perl, the results would include the delimiters as components of the resulting array.

One way to solve this in Java would be to use Matcher.find to search for the delimiter and then use Matcher.start and Matcher.end to determine where and what the delimiter was.  By keeping a little extra info as we loop, we can determine what the component between the delimiters was.

Example:

To split the String "Four score\tand seven  years ago" using whitespace as the delimiter, the following code could be used.

public static void main(String[] args){
      String t = "Four score\tand seven  years ago";

      Matcher r = Pattern.compile("\\s").matcher(t);

      int previousEnd = 0;

      while (r.find()){
            System.out.println("component : \"" + t.substring(previousEnd, r.start()) + "\"");
            System.out.println("delimiter : \"" + t.substring(r.start(), r.end()) + "\"");
            previousEnd = r.end();
      }
      System.out.println("component : " + t.substring(previousEnd));

}

The previousEnd variable records where the last match ended and therefore where the next component begins.

When the loop is completed it is likely that one more component remains in the String, so the previousEnd variable can be used again to grab all content from the last delimiter to the end of the String.

Note that \s will match a space, a tab, or a newline character, and it will match the double space between "seven" and "years" twice, yielding a single zero-length string between them.

The output of this code would be:

component : "Four"
delimiter : " "
component : "score"
delimiter : "   "
component : "and"
delimiter : " "
component : "seven"
delimiter : " "
component : ""
delimiter : " "
component : "years"
delimiter : " "
component : ago
You can also do it using the indexOf() and substring methods as follows:

public List splitString (String str, String regularExp){
 // Doing splitting and segregating it to chunks...
String [] splitArrey = str.split(regularExp);

String pattern = str.substring(str.indexOf(splitArrey[0]) + splitArrey[0].length(), str.indexOf(splitArrey[1]));

}
indexOf can be used too, but it will not match regular expressions.  Also, it may give misleading results if the same token shows up in the String twice.
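To make the repeated-token problem concrete, here is a small sketch. The class name IndexOfPitfall and both method names are made up for illustration, and both methods assume the split produces at least two chunks. The first method repeats the indexOf idea from the post above; the second searches for the second chunk only after the first one ends.

```java
public class IndexOfPitfall {
    // The indexOf idea from the post above: the text between the first two chunks.
    static String naiveDelimiter(String str, String regex) {
        String[] parts = str.split(regex);
        int from = str.indexOf(parts[0]) + parts[0].length();
        int to = str.indexOf(parts[1]);       // finds the FIRST occurrence of parts[1]
        return str.substring(from, to);       // throws if 'to' lands before 'from'
    }

    // Same idea, but the search for the second chunk starts where the first chunk ended.
    static String offsetDelimiter(String str, String regex) {
        String[] parts = str.split(regex);
        int from = str.indexOf(parts[0]) + parts[0].length();
        int to = str.indexOf(parts[1], from);
        return str.substring(from, to);
    }

    public static void main(String[] args) {
        String input = "a<br>a<p>b";          // the chunk "a" appears twice
        System.out.println(offsetDelimiter(input, "<br>|<p>"));   // prints <br>
        try {
            naiveDelimiter(input, "<br>|<p>");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive version failed: " + e.getMessage());
        }
    }
}
```

Even the offset version is fragile: if a chunk's text also appears inside a delimiter (a chunk "b" after "<br>", for example), indexOf will find it in the wrong place. Reading the delimiter from a Matcher avoids that.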
  public static List<String> splitString(String str, String regularExp) {
        Pattern p = Pattern.compile(regularExp);
       
        List<String> result = new ArrayList<String>();
        Matcher matcher = p.matcher(str);
       
        while (matcher.find()) {
            result.add(matcher.group());
        }
       
        return result;
    }    
Suda_RamanaReddy

ASKER

Hi,

Here is my code......

public class Splitter2 {
            
      /**
       * <p>
       * Method which accepts a String and splits it into chunks whenever a regular expression pattern is found.
       * </p>
       * @param inputString - String to split into chunks
       * @param regex - Pattern String based on which the String needs to be split.
       * @param maximumCount - Maximum number of chunks for a page
       * @param minimumCount - Minimum number of chunks for a page
       */
             
      public List splitString(String inputString, String regex, int maximumCount, int minimumCount) {
                  //String patternStr = "(/<?(br|p|-- end --)>/";
            
                  /*
                   * Pattern pattern = Pattern.compile(regex);
                   * Matcher matcher = pattern.matcher(inputString);
                   *
                   * while(matcher.find()){
                   * }
                   */
            
                  String patternStrg = regex;
                  int maxCount = maximumCount;
                  int minCount = minimumCount -1;
                  
                  LinkedList linkedList = new LinkedList();
                  int limit=0;
                                          
                  // Split String into chunks at all occurences of pattern
                  String[] splitString = inputString.split(patternStrg);  
           
                  // if no of chunks are less than maxCount return a list add details to that...
                   if (splitString.length > maxCount){
                         for (int x=0; x<splitString.length;){
                               String tempStr ="";
                               
                               //System.out.println("SplitString lenght"+splitString.length);
                               
                               if ((splitString.length - x) <= minCount){
                                           String lastStr = (String)linkedList.removeLast();
                                             lastStr += splitString[splitString.length - minCount];
                                             
                                             //System.out.println("last String"+lastStr);
                                             linkedList.addLast(lastStr);                                                                            
                               }else{
                                     limit = Math.min(maxCount,splitString.length);
                                     int newlimit = ((x+limit) > splitString.length) ? splitString.length:(x+limit);
                                     for (int j = x ; j < newlimit; j++){
                                           //str1 += splitString[j]+patternStrg;
                                           tempStr += splitString[j];
                                     }
                                     linkedList.addLast(tempStr);
                               }
                               x += limit;
                       }
                   }
                   else {
                               String str1="";
                                                        
                               for (int k=0; k<splitString.length; k++){                                   
                                     
                                     if(patternStrg.equalsIgnoreCase("<!-- page break -->")){
                                           System.out.println("page break occured");
                                     }                                     
                                     str1 += splitString[k];
                                     System.out.println("pattern String"+ patternStrg);
                               }
                               linkedList.addLast(str1);
                               System.out.println("Split size if less than mincount" + linkedList.size());
                   }
            return linkedList;      
    }
            
      public static void main(String[] args) {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            System.out.println("Please enter a String: ");
            try{
                  String str = in.readLine();
                  //String regex = "<(Br|p|!-- page break --)>+( <Br>)?"; (Working..)
                  
                  String regex = "<(Br|p|!-- page break --)>+( <Br>)?";
                                    
                  List result = new Splitter2().splitString(str,regex,7,2);
                  
                  Iterator i = result.iterator();
            while (i.hasNext()) {
                System.out.println("Test check...." + i.next());
            }
            }
            catch(IOException ioe){
                  System.out.println(ioe);
            }
            catch(Exception e){
                  System.out.println(e);
            }
      }
}

/// I need to get the pattern String, because I have to return the list if <!-- page break --> occurs...

Pattern pattern = Pattern.compile(patternStr);
                  Matcher matcher = pattern.matcher(str);
                  
                  while(matcher.find()){                       
                       count = count+1;
                       String[] splitString = str.split(patternStr);

........................do some thing....}

The problem here is that I may get multiple occurrences of pattern strings (<Br> or <p> or something else).
Earlier, the treatment was just splitting the string irrespective of which pattern string it was, but now if I get <!-- page break --> I have to return the list.

Please let me know the best possible way to achieve it.
There is one more problem:

with while(matcher.find()) {} the output is repeated as many times as there are matched pattern strings in the given String.
> dankuck:
>     Unlike Perl, Java does not offer parentheses to solve this problem (or at least, the API documentation doesn't mention it).

Yes, it does. They are called capturing groups. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html#cg

Coming back to original question:
Pattern p = Pattern.compile(".*(<(a|p|br|tr)) .*");
Matcher m1 = p.matcher("And you can go to <a href=\"http://www.yahoo.com\">yahoo</a> for details");
boolean matches = m1.matches(); // evals to true, modulo my spelling errors
String matchString = m1.group(1); // "<a"
String tag = m1.group(2); // "a"
int start = m1.start(1); // index of '<'
<etc>
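
For the asker's delimiters, the same capturing-group idea can be combined with find() instead of matches(), so every delimiter in the input is visited once. This is only a sketch: the class name DelimiterKinds and the method kinds are invented for illustration, and the regex is a simplified form of the one posted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterKinds {
    // Simplified form of the asker's regex; group 1 is the text between '<' and '>'.
    static final Pattern DELIM = Pattern.compile("<(Br|p|!-- page break --)>");

    // Returns, in order, which alternative matched at each delimiter in the input.
    static List<String> kinds(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = DELIM.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [p, !-- page break --, Br]
        System.out.println(kinds("1st Chunk <p> 2nd Chunk <!-- page break --> 3rd Chunk <Br>"));
    }
}
```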

> freeexpert:
>     Yes, it does. They are called capturing groups...

Ah, I guess I should have said "I don't understand the API documentation".  But I do now, thanks!  Anyway, capture groups don't work with the split method as in some other languages.
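
A quick sketch of that last point, with a made-up class name (SplitKeepDelimiter) and made-up method names. String.split drops the delimiters even when the regex has a capturing group. One alternative, which is not what the posts above use, is to split on a zero-width lookbehind so that each delimiter stays at the end of its chunk.

```java
import java.util.Arrays;

public class SplitKeepDelimiter {
    // Capturing group around the delimiter: Java's split still discards the delimiters.
    static String[] plainSplit(String s) {
        return s.split("(<br>|<p>)");
    }

    // Zero-width lookbehind: split just AFTER each delimiter, so it stays on the chunk.
    static String[] splitAfterDelimiter(String s) {
        return s.split("(?<=<br>|<p>)");
    }

    public static void main(String[] args) {
        String input = "a<br>b<p>c";
        System.out.println(Arrays.toString(plainSplit(input)));          // prints [a, b, c]
        System.out.println(Arrays.toString(splitAfterDelimiter(input))); // prints [a<br>, b<p>, c]
    }
}
```

The lookbehind form only suits delimiters of bounded length, since Java does not accept unbounded quantifiers such as + or * inside a lookbehind.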

Suda_RamanaReddy:

I'm sorry, I don't completely understand the purpose of your code above, however the following method will produce a String[] identically to split, except that it will include the delimiter between each chunk.

Each even-numbered element will be a chunk (0, 2, 4, etc) and each odd-numbered element will be a delimiter (1, 3, 5, etc).  The last element will be a chunk even if it is a zero-length String.

      public static String[] splitIncludingDelimiter(String input, String regex){
            Pattern pattern = Pattern.compile(regex);
            Matcher matcher = pattern.matcher(input);

            List<String> list = new ArrayList<String>();

            int previousEnd = 0;

            while(matcher.find()){
                  String chunk = input.substring(previousEnd, matcher.start());
                  String delimiter = matcher.group();
                  previousEnd = matcher.end();
                  list.add(chunk);
                  list.add(delimiter);
            }
            String chunk = input.substring(previousEnd);
            list.add(chunk);

            String[] results = new String[list.size()];
            list.toArray(results);
            return results;
      }

If you use this method instead of String.split, you'll want to check every odd numbered element to see if it's the "<!-- page break -->" String you're looking for, and treat every even numbered element as you would treat a chunk.

You can check to see if a number is even by using:
if (number % 2 == 0)
   /* even */
else
   /* odd */

(Note: the splitIncludingDelimiter method is not optimized, but instead written for understanding.)
(Note: optimally, the splitIncludingDelimiter method would be written in a more Object-Oriented fashion, perhaps using some type of token or iterator, but here it's written for compatibility with the original String.split method.)
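
Here is a rough usage sketch of the even/odd walk described above. The class name PageBreakDemo and the method pages are hypothetical, the regex is a simplified form of the asker's, and the helper is a compact restatement of splitIncludingDelimiter in which each chunk ends at matcher.start(), so the chunk does not contain its delimiter. It closes a page only on a page break and ignores the chunk-count limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageBreakDemo {
    static final String PAGE_BREAK = "<!-- page break -->";

    // Even indexes hold chunks, odd indexes hold the delimiters between them.
    static String[] splitIncludingDelimiter(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> list = new ArrayList<String>();
        int previousEnd = 0;
        while (matcher.find()) {
            list.add(input.substring(previousEnd, matcher.start())); // chunk, delimiter excluded
            list.add(matcher.group());                               // the delimiter itself
            previousEnd = matcher.end();
        }
        list.add(input.substring(previousEnd));                      // trailing chunk, may be empty
        return list.toArray(new String[0]);
    }

    // Joins chunks into pages, closing a page whenever the delimiter is a page break.
    static List<String> pages(String input, String regex) {
        String[] parts = splitIncludingDelimiter(input, regex);
        List<String> pages = new ArrayList<String>();
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i % 2 == 0) {
                page.append(parts[i]);                  // even: a chunk
            } else if (parts[i].equalsIgnoreCase(PAGE_BREAK)) {
                pages.add(page.toString());             // odd: a page-break delimiter
                page.setLength(0);
            }
        }
        if (page.length() > 0) {
            pages.add(page.toString());                 // whatever is left is the last page
        }
        return pages;
    }

    public static void main(String[] args) {
        String regex = "<(Br|p|!-- page break --)>";
        for (String p : pages("1st Chunk <p> 2nd Chunk <!-- page break --> 3rd Chunk <Br>", regex)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

With the sample input this gives two pages: the first two chunks joined together, then the third chunk. The surrounding spaces are kept as they appear in the input.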
Thanks.
I'm sorry, but I need a somewhat different answer.
I'm doing this for a pagination application, where I have to split the input string based on the regular expression. I need to do different validations based on the type of pattern string.
My string may consist of two or more continuous <Br> tags or <!-- Page Break -->. I have to split it so that I can spread the input across the pages. If the pattern string is only <Br>'s, I just add chunks to a temp String, and once the chunk count is 7, I return it as the first element of the array and continue the process.

The problem is with <!-- Page Break -->. When it occurs, irrespective of the count, I have to return that String so that it will be a new page. For this I used the matcher.find() method, which does the validation each and every time it finds the pattern string. I need this method because I have to find the pattern string that occurred so that I can do the comparison. At the same time, I don't want it to repeat the process every time it occurs.

I hope I explained the problem clearly!

My program is ......
import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Splitter2 {
            
      /**
       * <p>
       * Method which accepts a String and splits it into chunks whenever a regular expression pattern is found.
       * </p>
       * @param inputString - String to split into chunks
       * @param regex - Pattern String based on which the String needs to be split.
       * @param maximumCount - Maximum number of chunks for a page
       * @param minimumCount - Minimum number of chunks for a page
       */
             
      public List splitString(String inputString, String regex, int maximumCount, int minimumCount) {
                              
                  /*
                   * Pattern pattern = Pattern.compile(regex);
                   * Matcher matcher = pattern.matcher(inputString);
                   *
                   * while(matcher.find()){
                   * }
                   */
                  String patternStrg = regex;
                  int maxCount = maximumCount;
                  int minCount = minimumCount -1;
                  
                  LinkedList linkedList = new LinkedList();
                  int limit=0;
                  String subStr ="";
                                                
                  Pattern pattern = Pattern.compile(regex);
                  Matcher matcher = pattern.matcher(inputString);
                  while(matcher.find()){
            
                        int startIndex = matcher.start();
                        int endIndex = matcher.end();
                        
                        subStr= inputString.substring(startIndex, endIndex);
                        //System.out.println("Actual Pattern String"+inputString.substring(startIndex, endIndex));
                        
                        // Split String into chunks at all occurences of pattern
                        //String[] splitString = inputString.split(patternStrg,6);  
                  
                        String[] splitString = inputString.split(patternStrg);
                        
                        // if no of chunks are less than maxCount return a list add details to that...
                         if (splitString.length > maxCount){
                               for (int x=0; x<splitString.length;){
                                     String tempStr ="";
                               
                                     //System.out.println("SplitString lenght"+splitString.length);
                                     
                                     if ((splitString.length - x) <= minCount){
                                                 String lastStr = (String)linkedList.removeLast();
                                                   lastStr += splitString[splitString.length - minCount];
                                                   
                                                   //System.out.println("last String"+lastStr);
                                                   linkedList.addLast(lastStr);                                                                            
                               }else{
                                     limit = Math.min(maxCount,splitString.length);
                                     int newlimit = ((x+limit) > splitString.length) ? splitString.length:(x+limit);
                                     for (int j = x ; j < newlimit; j++){
                                           //str1 += splitString[j]+patternStrg;
                                           tempStr += splitString[j];
                                     }
                                     linkedList.addLast(tempStr);
                               }
                               x += limit;
                       }
                   }
                   else {
                               String str1="";
                               for (int k=0; k<splitString.length; k++){                                   
                                     if(subStr.equalsIgnoreCase("<!-- page break -->")){
                                           str1 += splitString[k];
                                           linkedList.addLast(str1);
                                           System.out.println("page break occured"+subStr);
                                           str1="";
                                     }
                                     else{
                                           str1 += splitString[k];
                                           System.out.println("value of k in else part"+k +"String value"+ str1);
                                     }                                     
                               }
                               linkedList.addLast(str1);
                               System.out.println("Split size if less than mincount" + linkedList.size());
                   }
                  }
            return linkedList;
            
    }
            
      public static void main(String[] args) {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            System.out.println("Please enter a String: ");
            try{
                  String str = in.readLine();
                  //String regex = "<(Br|p|!-- page break --)>+( <Br>)?"; (Working..)
                  
                  String regex = "<(Br|p|!-- page break --)>+( <Br>)*";
                                    
                  List result = new Splitter2().splitString(str,regex,7,2);
                  
                  Iterator i = result.iterator();
            while (i.hasNext()) {
                System.out.println("Test check...." + i.next());
            }
            }
            catch(IOException ioe){
                  System.out.println(ioe);
            }
            catch(Exception e){
                  System.out.println(e);
            }
      }
}
------------------------------------------------------------------------------------------
and the input string is: 1st Chunk <p> 2nd Chunk <!-- page break --> 3rd Chunk <Br> <Br>

I should return the split string only once....
First element of the array is 1st Chunk  2nd Chunk
Second element of the array is 3rd Chunk

Similarly...
1st Chunk <p> 2nd Chunk <!-- page break --> 3rd Chunk <Br> <Br> 4th Chunk <p> 5th Chunk <p> 6th Chunk <Br> <Br> 7 th Chunk <!-- page break --> 8 th Chunk <!-- page break -->

I should return the split string only once....
First element of the array is 1st Chunk  2nd Chunk
Second element of the array is 3rd Chunk 4th Chunk 5th Chunk 6th Chunk 7 th Chunk
Third element of the array is 8 th Chunk.

Thanks in advance for all your help..

 





It might be easier to do it in two classes:

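import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Chunk {
    String _string;
    String _delimiter;   // null for the last chunk, which has no delimiter after it
}

class CoreSplitter {
    private final String _input;
    private final Matcher _matcher;
    private int _previousEnd = 0;   // set to -1 once the last chunk has been returned

    CoreSplitter(String input, String delimiterRegex) {
        _input = input;
        _matcher = Pattern.compile(delimiterRegex).matcher(input);
    }

    // Returns the next chunk with the delimiter that ended it, or null at the end.
    Chunk next() {
        if (_previousEnd < 0) return null;
        Chunk chunk = new Chunk();
        if (_matcher.find()) {
            chunk._string = _input.substring(_previousEnd, _matcher.start());
            chunk._delimiter = _matcher.group();
            _previousEnd = _matcher.end();
        } else {
            chunk._string = _input.substring(_previousEnd);
            _previousEnd = -1;
        }
        return chunk;
    }
}

class PageSplitter {
    static final String PAGE_BREAK = "<!-- page break -->";

    // The delimiter regex is passed in by the caller.
    static Vector<String> split(String input, String delimiterRegex) {
        Vector<String> result = new Vector<String>();
        CoreSplitter splitter = new CoreSplitter(input, delimiterRegex);
        int count = 0;
        StringBuffer page = new StringBuffer();
        Chunk chunk;
        while ((chunk = splitter.next()) != null) {
            page.append(chunk._string);
            // Close the page after 7 chunks or at an explicit page break.
            if (++count > 6 || PAGE_BREAK.equalsIgnoreCase(chunk._delimiter)) {
                result.add(page.toString());
                page = new StringBuffer();
                count = 0;
            }
        }
        if (page.length() > 0) result.add(page.toString());   // flush the last partial page
        return result;
    }
}


There may still be a logical error or two in there, but you get the idea...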
Thanks to all. I have now done this without using the split method of the String API.
ASKER CERTIFIED SOLUTION
freeexpert
