Solved

Regex to filter urls w/ certain words inside

Posted on 2011-09-21
Last Modified: 2012-06-21
Hi,

I would like to exclude URLs that contain certain words like "login", "contactus", etc., and keep all the others.


Thanks for help!
Question by:wsyy
25 Comments
 
LVL 4

Expert Comment

by:reijnemans
ID: 36573148
You mean something like this:

import java.util.ArrayList;
import java.util.List;

public class UrlFilter {
	public static void main(String... urls) {
		// each word to exclude must be its own array element
		String[] wordsToExclude = new String[] {"login", "contactus"};
		List<String> acceptedUrls = new ArrayList<String>();

		String regex = toRegex(wordsToExclude);

		for (String url : urls) {
			// matches() tests the whole string, so the regex is wrapped in .* on both sides
			if (!url.matches(regex)) {
				acceptedUrls.add(url);
			}
		}
	}

	// builds .*(word1|word2|...).* ; expects at least one word
	public static String toRegex(String[] wordsToExclude) {
		StringBuilder sb = new StringBuilder(".*(");
		for (String wordToExclude : wordsToExclude) {
			sb.append(wordToExclude);
			sb.append("|");
		}
		// drop the last | sign and close the group
		return sb.substring(0, sb.length() - 1) + ").*";
	}
}

 

Author Comment

by:wsyy
ID: 36573172
I meant that a regular expression can filter the target urls and keep the others.

Say,

String regex="something";
System.out.println("http://www.experts-exchange.com/login.html".matches(regex));
==return FALSE

System.out.println("http://www.experts-exchange.com/Programming/Languages/Java/Q_27319546.html".matches(regex));
==return TRUE
 
LVL 4

Expert Comment

by:reijnemans
ID: 36573211
You mean something like this:

	public static void main(String...urls1) {
		String[] urls = new String[] {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus"};
		
		for (String string : urls) {
			System.out.println(urlOK(string));
		}
	}
	
	public static boolean urlOK(String url) {
		String regex = "(.*)(login|contactus)(.*)";
		return url.matches(regex);
	}


 
LVL 4

Expert Comment

by:reijnemans
ID: 36573215
The example above returns:

true
false
false
true
 
LVL 4

Expert Comment

by:reijnemans
ID: 36573231
BTW, this is a good site to test your regex: http://www.regexpal.com/
 

Author Comment

by:wsyy
ID: 36573465
Unfortunately, you misunderstood.

I want exactly the opposite results.

For your example, the results should be:

false
true
true
false
 

Author Comment

by:wsyy
ID: 36573589
Can anyone help?
 
LVL 47

Expert Comment

by:for_yan
ID: 36574071


Why wouldn't you simply check with indexOf()?

for (String url : urls) {
    boolean good = true;
    if (url.indexOf("login") > -1) good = false;
    if (url.indexOf("contact") > -1) good = false;
    if (!good) continue; // exclude this URL

    // use the good ones here
}
 
LVL 4

Expert Comment

by:reijnemans
ID: 36574343
Instead of

public static boolean urlOK(String url) {
	String regex = "(.*)(login|contactus)(.*)";
	return url.matches(regex);
}

do this:

public static boolean urlOK(String url) {
	String regex = "(.*)(login|contactus)(.*)";
	return !url.matches(regex);
}

Now the result is the opposite.
 

Author Comment

by:wsyy
ID: 36574770
indexOf() doesn't accept regex
 

Author Comment

by:wsyy
ID: 36574779
reijnemans:

Since we have quite a few things to check, we want to make a uniform call to the matches() function. Negating some checks with ! but not others doesn't look like a good choice for us.
 
LVL 47

Expert Comment

by:for_yan
ID: 36574781
No, why do you need regex? This is just a case for indexOf().
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 62 total points
ID: 36575586
I would

a. keep your words in a text file, from which you can make List<String>. That way you can extend/edit your choices without recompilation
http://technojeeves.com/joomla/index.php/free/74-string-list
b. Avoid regex. You don't need the overhead. Just loop through the List calling


boolean excluded = false;
for (String currentWordInList : wordList) {
   excluded = urlString.contains(currentWordInList);
   if(excluded) break;
}
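
The two suggestions above (a word list kept in a text file, checked with contains()) could be combined roughly as follows. This is only a sketch: the class name UrlWordFilter, the method names, and the file name excluded-words.txt are assumptions made for illustration, the file is assumed to hold one word per line, and Files.readAllLines needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlWordFilter {

    // Reads one word per line, skipping blank lines (the file layout is an assumption).
    public static List<String> loadWords(Path file) throws IOException {
        List<String> words = new ArrayList<String>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Returns true if the URL contains any listed word as a plain substring.
    public static boolean isExcluded(String urlString, List<String> wordList) {
        for (String currentWordInList : wordList) {
            if (urlString.contains(currentWordInList)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name; a built-in list is used if the file is absent.
        Path file = Paths.get("excluded-words.txt");
        List<String> wordList = Files.exists(file)
                ? loadWords(file)
                : Arrays.asList("login", "contactus");

        String[] urls = {"www.login.htm", "www.hello.htm", "www.blaat.com/contactus"};
        for (String url : urls) {
            System.out.println(url + (isExcluded(url, wordList) ? " excluded" : " included"));
        }
    }
}
```

Adding a new word then means adding a line to the file, with no change to the code.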

 

Author Comment

by:wsyy
ID: 36578148
for_yan,

need regex as there are a few words to exclude.

CEHJ,

need regex as the word-exclusion regex is commingled with other regex.
 

Author Comment

by:wsyy
ID: 36578149
I think I have actually found a solution.
 
LVL 47

Expert Comment

by:for_yan
ID: 36578159
Put these words into an array or ArrayList and check using indexOf(). I don't think regex will make it any better if you need more words.
 
LVL 47

Expert Comment

by:for_yan
ID: 36578168
Regex is good when you have complicated search conditions, and that is not the case here.
 
LVL 47

Accepted Solution

by:
for_yan earned 63 total points
ID: 36578233
// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String[] badWords = {"login", "contactus"};
// build \b(?:login|contactus)\b from the word list
String patStr = "\\b(?:";
for (String sar : badWords) {
    patStr += sar + "|";
}
patStr = patStr.substring(0, patStr.length() - 1); // drop the trailing |
patStr += ")\\b";
System.out.println(patStr);

String[] urls = new String[] {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus", "www.blaat.com/contactus/index.html"};

// Pattern p11 = Pattern.compile("\\b(?:login|contactus)\\b");
Pattern p11 = Pattern.compile(patStr);

for (String url : urls) {
    Matcher mu = p11.matcher(url);
    // find() looks for the pattern anywhere in the URL
    if (mu.find()) System.out.println(url + " to be excluded");
    else System.out.println(url + " to be included");
}



Output:
\b(?:login|contactus)\b
www.login.htm to be excluded
www.lloogg.htm to be included
www.hello.htm to be included
www.blaat.com/contactus to be excluded
www.blaat.com/contactus/index.html to be excluded

 

Author Comment

by:wsyy
ID: 36578282
Here is the solution I figured out:

(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+

it works so far, but not sure of its performance.
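
For readers who want to try the pattern, here is a minimal sketch that feeds it to String.matches() with the URLs used earlier in this thread. The class name LookaheadDemo and the method name accept are made up for illustration. The negative lookahead (?!...) is tested at the start of the string and fails as soon as any listed word appears anywhere after it, so a plain, un-negated matches() call returns false for the unwanted URLs.

```java
public class LookaheadDemo {

    // The pattern from the comment above, written as a Java string literal.
    static final String REGEX =
            "(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+";

    // Returns true for URLs to keep, false for URLs containing a listed word.
    public static boolean accept(String url) {
        return url.matches(REGEX);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.experts-exchange.com/login.html",
            "http://www.experts-exchange.com/Programming/Languages/Java/Q_27319546.html",
            "www.blaat.com/contactus",
            "www.hello.htm",
            "www.regexpal.com"
        };
        for (String url : urls) {
            System.out.println(accept(url) + "  " + url);
        }
    }
}
```

Note that the words are matched as substrings, so the short entry "reg" also rejects www.regexpal.com, because "regexpal" contains "reg".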
 
LVL 47

Expert Comment

by:for_yan
ID: 36578303
How many URLs do you have?
 
LVL 47

Expert Comment

by:for_yan
ID: 36578329
You probably want to start the internal parentheses with (?: - to make it non-capturing
if you are concerned with performance
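
A sketch of that suggestion, under a few assumptions: the class name CompiledUrlFilter and the method name accept are invented for illustration, the trailing .* inside the lookahead has been dropped because it does not change what the lookahead accepts, and the pattern is compiled once into a static field because String.matches() compiles its regex on every call. Whether any of this matters in practice would have to be measured.

```java
import java.util.regex.Pattern;

public class CompiledUrlFilter {

    // Same word list as the asker's regex, with a non-capturing group (?:...),
    // compiled a single time and reused for every URL.
    private static final Pattern ACCEPT = Pattern.compile(
            "(?!.*(?:login|rss|member|contactus|aboutus|logout|reg|help)).+");

    // Returns true for URLs to keep, false for URLs containing a listed word.
    public static boolean accept(String url) {
        return ACCEPT.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {"www.login.htm", "www.lloogg.htm", "www.hello.htm",
                "www.blaat.com/contactus", "www.blaat.com/contactus/index.html"};
        for (String url : urls) {
            System.out.println(url + (accept(url) ? " to be included" : " to be excluded"));
        }
    }
}
```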

 
LVL 47

Expert Comment

by:for_yan
ID: 36578360
Thinking about performance in theory is not always the best approach.

Start using it, and then you'll know whether performance is an issue.
Very often we spend time saving a few milliseconds, and the bottleneck turns out
to be somewhere quite different.
 
LVL 86

Expert Comment

by:CEHJ
ID: 36578775
>>
Here is the solution I figured out:

(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+

it works so far, but not sure of its performance.
>>

Exactly the same can be done with the approach I suggested, which is more extensible and more performant.
 
LVL 10

Expert Comment

by:gordon_vt02
ID: 36580536
Agreed with CEHJ.  If all your regex is doing is filtering exact words and you don't have a need for actual pattern matching beyond a simple String.contains() feature, iterating over a list of words is much faster and easier to maintain.  Every time you want to add a new filter, you have to modify the regex -- likely in code -- making it more complicated and difficult to read.  A simple List can be easily appended to and stored in an external file (the regex could as well) with one filter per line, making it a lot easier to read and maintain.
 
LVL 10

Expert Comment

by:gordon_vt02
ID: 36580546
Make sure you use the right tool for the job.  Sure, you can flip a screwdriver around and use the handle to knock in a nail, but a hammer is going to be much more efficient.


Question has a verified solution.
