krakatoa


Getting correct find position for a String search in text.

The code finds the String that it's asked to look for within text read in from a file.

But when trying to find the position of that String, it over-reports it by the length of the String. This may have something to do with line feeds etc.? But if it does, then how can this be put right?

I'm using the text of the novel "Pride and Prejudice" at the moment, which is read in char by char. In the method that builds the proposed found String, the variable 'total' should give the number of characters processed. If the String is found in the text, then it should state its (end) position based on this 'total' variable.

If the term being searched for is at the start of the text, then it does find the correct position. But if there is another occurrence of the search term later in the text, then it overreports the position.

Here's the code :


import java.io.BufferedReader ;
import java.io.File ;
import java.io.FileReader ;
import java.util.Iterator ;
import java.util.Set ;
import java.util.Map ;
import java.util.HashMap ;
import java.util.HashSet ;
import java.util.Scanner ;



class CountChars {


static StringBuilder strb = new StringBuilder();
static int total = 0;

static String SOUGHT;
     
        public static void main (String[] args){
        
        Scanner scanner = new Scanner(System.in) ;
        
        System.out.println("\n\nInput your search . . . \n") ;
               
        String fromConsole = scanner.nextLine() ;
        
        if(fromConsole.length()==0){System.exit(0);}
        
        SOUGHT = fromConsole.toString() ;
        
        try{
        
            File file = new File("C:/EE_Q_CODE/PrideAndPrejudice.txt");
            //File file = new File("C:/EE_Q_CODE/TheBible.txt");
            //File file = new File("C:/EE_Q_CODE/WarAndPeace.txt");
            //File file = new File("C:/EE_Q_CODE/Leviathan.txt");
            
            FileReader fr = new FileReader(file);
                        
            BufferedReader br = new BufferedReader(fr);      
            
            Set <Integer> chS = new HashSet<>();
            Map <Character, Integer> count = new HashMap<>();
         
            int c;
            
            while(! ((c=(fr.read()))==-1)){
            
                if(!chS.contains((int)c)){chS.add((int)c);count.put((char)c,0);}
                int i = count.get((char)c);
                if(chS.contains((int)c)){count.replace((char)c,i,i+1);}
                builder(c);
            }
               
            Iterator it = chS.iterator();
            while(it.hasNext()){System.out.print((char)(((Integer)it.next()).intValue())+" ");}
            
            System.out.println();
            System.out.println("\nTotal characters processed : "+total);
            System.out.println();
            
                /* Alternative output method. (I prefer forEach below).
                count.entrySet().stream().forEach(e -> {
                    System.out.format("key: %s, value: %d%n", e.getKey(), e.getValue());
                });
                */
                
            count.forEach((k, v) -> { System.out.format("key: %s, value: %d%n ", k, v); });
            }catch(Exception e){e.printStackTrace();} 
            
            SOUGHT = null;
            fromConsole = null;
            System.gc();
 
        }
              
        
        static void builder(int ch){
        
            int len = strb.length();
            total++;
                
                if((char)ch==SOUGHT.charAt(0)){strb.append((char)ch);return;}
                
                if(len>SOUGHT.length()-1){strb.delete(0,strb.length());return;}
                              
                if ((char)ch==SOUGHT.charAt(len)&&SOUGHT.charAt(len-1)==strb.charAt(len-1)){strb.append((char)ch);}
                else{strb.delete(0,strb.length());return;}
                if(strb.toString().equals(SOUGHT)){System.out.printf("%n\"%s\"%s%d%n",SOUGHT," <<< found at position ",total);strb.delete(0,strb.length());}
            
            
        }
        
 }


builder() is the method that does the searching and reports the position (and the String).
Any ideas on how to iron this out? Thanks.
Java

CEHJ

I must confess I'm in a 'why? phase' looking at the code but 'total' in
 System.out.printf("%n\"%s\"%s%d%n", SOUGHT, " <<< found at position ", total);


is being treated as an offset into the file. The problem could be that it's only being incremented conditionally in your while block and not as a matter of course.
krakatoa

ASKER

Yes CEHJ - it is being treated as an offset; but of course the builder() method does get called for every character, and so it's hard to envisage any leak happening with that, isn't it, or . . . ? It's (as I see 'total') not confined by anything more than it should be as a consequence of being called in the while loop . . . ? I'm maybe being dense again.
CEHJ

'total' is only being incremented conditionally as a consequence of something being found in your Map. Any offset counter should be incremented without fail, i.e. unconditionally.

                if (chS.contains((int) c)) {
                    count.replace((char) c, i, i + 1);
                    total++;
                }


You need to move that out of there and make it the last line before the end of the while loop.
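
To illustrate the point about an unconditional counter, here is a minimal sketch. It is not the code from the question: the class name OffsetCounterSketch and the method countChars are made up for illustration. The offset is bumped exactly once per character read, as the last statement of the loop, whatever the frequency map does. Note that with this kind of counting a Windows line ending (\r\n) contributes two characters to the offset.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class OffsetCounterSketch {

    // Reads every character, updates the per-character frequency map,
    // and returns how many characters were consumed in total.
    static int countChars(Reader in, Map<Character, Integer> counts) {
        int offset = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {
                counts.merge((char) c, 1, Integer::sum); // frequency bookkeeping
                offset++; // unconditional, last statement in the loop
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offset;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        // "\r\n" counts as two characters here
        int total = countChars(new StringReader("ab\r\nab"), counts);
        System.out.println("Total characters processed : " + total);
        System.out.println(counts);
    }
}
```

Map.merge replaces the separate Set and the contains/put/replace sequence, so there is no conditional branch left that the counter could accidentally end up inside.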
krakatoa

ASKER

I'm using a version of the text I mentioned taken from Project Gutenberg. I don't suppose that might have anything to do with my travails, does it ?

BTW, if I input a search string less than 4 characters long, it won't find it at all ! Bit bizarre, isn't it ? 
CEHJ

Be careful as some of those files have a BOM so might throw out your count
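
For what it's worth, here is one hedged way of dealing with a BOM, assuming the file is UTF-8. Java's UTF-8 decoder does not strip the mark, so it arrives as a leading U+FEFF character and pushes every offset out by one. If the file is decoded with a single-byte charset such as windows-1252 instead, the three BOM bytes show up as three junk characters. The class BomSkipSketch and its methods are hypothetical names for this sketch, not part of the Java API.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only.
class BomSkipSketch {

    static final int BOM = 0xFEFF; // how a decoded byte order mark appears

    // Wraps the reader and swallows one leading U+FEFF, if present.
    static Reader skipBom(Reader in) {
        PushbackReader pr = new PushbackReader(in, 1);
        try {
            int first = pr.read();
            if (first != -1 && first != BOM) {
                pr.unread(first); // not a BOM: put the character back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pr;
    }

    // Small utility so the demo can show what remains after skipping.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withBom = "\uFEFFPride";
        System.out.println(withBom.length());                                   // 6
        System.out.println(readAll(skipBom(new StringReader(withBom))).length()); // 5
    }
}
```

With a real file, the Reader passed in would be something like an InputStreamReader created with StandardCharsets.UTF_8, so that the decoding does not depend on the platform default.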
krakatoa

ASKER

Right. That’s a term I’ve seen (Byte Order Mark ?), but I don’t recall seeing it anywhere in the Java API? How is it normally handled then, and coded for?
ASKER CERTIFIED SOLUTION
CEHJ

krakatoa

ASKER

Yes, of course . . . this is a nicer algo you've given here. I was too concerned with ensuring that the previous character from the search term matched the previous character in the StringBuilder . . . but that was naive, whereas you've matched the next character instead. Good work that. I definitely should have thought about the approach more closely.
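
For anyone reading along: the accepted solution's code is not shown in this thread, so the following is not it. It is only a rough sketch of the "match the next character" idea, written as a streaming Knuth-Morris-Pratt matcher so that repeated prefixes (e.g. "aab" in "aaab") are not missed. The class StreamMatcherSketch and its methods are invented for the example. It also spells out the start/end distinction: a running 'total' at the moment of a match is the end of the match, and the start is total minus the length of the search term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical streaming matcher, for illustration only.
class StreamMatcherSketch {

    private final String sought;
    private final int[] fallback; // KMP failure table for 'sought'
    private int matched = 0;      // characters of 'sought' matched so far
    private long total = 0;       // characters consumed so far

    StreamMatcherSketch(String sought) {
        if (sought.isEmpty()) {
            throw new IllegalArgumentException("empty search term");
        }
        this.sought = sought;
        this.fallback = new int[sought.length()];
        for (int i = 1, k = 0; i < sought.length(); i++) {
            while (k > 0 && sought.charAt(i) != sought.charAt(k)) {
                k = fallback[k - 1];
            }
            if (sought.charAt(i) == sought.charAt(k)) {
                k++;
            }
            fallback[i] = k;
        }
    }

    // Feeds one character. Returns the 0-based start offset of a match
    // that ends at this character, or -1 if no match ends here.
    long feed(char ch) {
        total++; // unconditional: one per character fed
        while (matched > 0 && ch != sought.charAt(matched)) {
            matched = fallback[matched - 1]; // fall back instead of restarting from zero
        }
        if (ch == sought.charAt(matched)) {
            matched++;
        }
        if (matched == sought.length()) {
            matched = fallback[matched - 1]; // allow overlapping matches
            return total - sought.length();  // start = end - length
        }
        return -1;
    }

    // Convenience wrapper: all start offsets of 'sought' in 'text'.
    static List<Long> findAll(String text, String sought) {
        StreamMatcherSketch m = new StreamMatcherSketch(sought);
        List<Long> starts = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            long start = m.feed(text.charAt(i));
            if (start >= 0) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("the cat, the hat", "the")); // [0, 9]
        System.out.println(findAll("aaab", "aab"));             // [1]
    }
}
```

In a file-reading loop, feed() would be called once per character read. Because it keeps its own state, search terms of any length, including a single character, go through the same path.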

And the BufferedReader / FileReader issue I didn't think about either.
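
The thread doesn't show exactly what that issue was. One thing visible in the posted code is that it wraps 'fr' in a BufferedReader 'br' but then calls fr.read(), so the buffer is never used. FileReader(File) also decodes with the platform default charset. Assuming those are the points in question, a sketch of reading through the buffered reader with an explicit charset might look like this (BufferedReadSketch and countChars are hypothetical names):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class BufferedReadSketch {

    // Counts the characters in a file, decoding with the given charset.
    static long countChars(Path path, Charset charset) {
        long total = 0;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = Files.newBufferedReader(path, charset)) {
            while (br.read() != -1) { // read from the buffered reader itself
                total++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java BufferedReadSketch <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        System.out.println("As UTF-8      : " + countChars(path, StandardCharsets.UTF_8));
        System.out.println("As ISO-8859-1 : " + countChars(path, StandardCharsets.ISO_8859_1));
    }
}
```

If the two counts differ for a given file, the text contains multi-byte characters, and decoding it with a single-byte charset will inflate any character offset that comes after them.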