wsyy

Regex to filter urls w/ certain words inside

Hi,

I would like to exclude URLs that contain certain words like "login", "contactus", etc., and keep the rest.


Thanks for help!
reijnemans

You mean something like this:

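import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilter {
	public static void main(String... urls) {
		// each word to exclude is its own array element
		String[] wordsToExclude = new String[] {"login", "contactus"};
		List<String> acceptedUrls = new ArrayList<String>();

		String regex = toRegex(wordsToExclude);

		for (String url : urls) {
			// matches() tests the whole string, hence the .* wrappers in toRegex
			if (!url.matches(regex)) {
				acceptedUrls.add(url);
			}
		}
		System.out.println(acceptedUrls);
	}

	// builds .*(?:word1|word2).* with every word quoted as a literal
	public static String toRegex(String[] wordsToExclude) {
		StringBuilder sb = new StringBuilder(".*(?:");
		for (int i = 0; i < wordsToExclude.length; i++) {
			if (i > 0) {
				sb.append("|");
			}
			sb.append(Pattern.quote(wordsToExclude[i]));
		}
		return sb.append(").*").toString();
	}
}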


wsyy

ASKER

I meant that a regular expression can filter the target urls and keep the others.

Say,

String regex="something";
System.out.println("https://www.experts-exchange.com/login.html".matches(regex));
==return FALSE

System.out.println("https://www.experts-exchange.com/questions/27319546/Regex-to-filter-urls-w-certain-words-inside.html".matches(regex));
==return TRUE
You mean something like this:

	public static void main(String...urls1) {
		String[] urls = new String[] {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus"};
		
		for (String string : urls) {
			System.out.println(urlOK(string));
		}
	}
	
	public static boolean urlOK(String url) {
		String regex = "(.*)(login|contactus)(.*)";
		return url.matches(regex);
	}


The example above returns:

true
false
false
true
BTW, this is a good site for testing your regex: http://www.regexpal.com/
wsyy

ASKER

Unfortunately, you misunderstood.

I want exactly the opposite results.

For your example, the results should be:

false
true
true
false
wsyy

ASKER

Can anyone help?


Why would you not simply check indexOf()?



for (String url : urls) {
	boolean good = true;
	if (url.indexOf("login") > -1) good = false;
	if (url.indexOf("contact") > -1) good = false;
	if (!good) continue; // exclude this URL

	// use the good ones here
}
Instead of

	public static boolean urlOK(String url) {
		String regex = "(.*)(login|contactus)(.*)";
		return url.matches(regex);
	}

you do this:

	public static boolean urlOK(String url) {
		String regex = "(.*)(login|contactus)(.*)";
		// negate the match so URLs containing an excluded word return false
		return !url.matches(regex);
	}

Now the result is the opposite.
wsyy

ASKER

indexOf() doesn't accept regex
wsyy

ASKER

reijnemans:

Since we have quite a few things to check, we want to make a uniform call to matches(). Negating some checks with ! but not others doesn't look like a good choice for us.
No, why do you need regex? This is just a case for indexOf().
SOLUTION
CEHJ

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
wsyy

ASKER

for_yan,

need regex as there are a few words to exclude.

CEHJ,

need regex as the word-exclusion regex is commingled with other regex.
wsyy

ASKER

I think I actually found a solution.
Put these words into an array or ArrayList and check using indexOf(). I don't think regex will make it any better if you need more words.
Regex is good when you have complicated search conditions, and that is not the case here.
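The word-list suggestion above could look roughly like the sketch below. The class name `WordListFilter`, the method `isAccepted`, and the two-word list are illustrative assumptions, not code from the thread; the test URLs are the ones used earlier in the discussion.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the class name and word list are illustrative only.
class WordListFilter {
    // Words whose presence anywhere in a URL causes the URL to be rejected.
    static final List<String> EXCLUDED = Arrays.asList("login", "contactus");

    // Returns true when the URL contains none of the excluded words.
    static boolean isAccepted(String url) {
        for (String word : EXCLUDED) {
            if (url.indexOf(word) > -1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] urls = {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus"};
        for (String url : urls) {
            System.out.println(url + " -> " + isAccepted(url));
        }
    }
}
```

Adding another word means appending to the list; the matching code itself does not change.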
ASKER CERTIFIED SOLUTION
wsyy

ASKER

Here is the solution I figured out:

(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+

it works so far, but not sure of its performance.
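To see how the pattern above behaves with a plain matches() call, here is a small sketch. The regex is copied verbatim from the post; the class name `LookaheadDemo` and the last test URL (on example.com) are assumptions added for illustration. The negative lookahead `(?!...)` is evaluated at the start of the string and fails as soon as any listed word occurs anywhere after it, so no `!` is needed at the call site.

```java
// Hypothetical demo class; only the regex itself comes from the thread.
class LookaheadDemo {
    // The asker's pattern, copied verbatim.
    static final String REGEX = "(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+";

    public static void main(String[] args) {
        System.out.println("www.login.htm".matches(REGEX));           // false: contains "login"
        System.out.println("www.lloogg.htm".matches(REGEX));          // true
        System.out.println("www.hello.htm".matches(REGEX));           // true
        System.out.println("www.blaat.com/contactus".matches(REGEX)); // false: contains "contactus"
        // Short words also match inside longer ones: "regional" contains "reg".
        System.out.println("www.example.com/regional-news".matches(REGEX)); // false
    }
}
```

The last line shows a side effect worth keeping in mind: a short entry such as "reg" rejects any URL containing that substring, not only registration pages. The match is also case-sensitive as written.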
How many URLs do you have?
If you are concerned about performance, you probably want to start the inner parentheses with (?: to make the group non-capturing.
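A sketch of that suggestion, under a couple of assumptions: the class name `CompiledUrlFilter` and method `isAccepted` are made up for illustration, and the redundant trailing `.*` inside the lookahead has been dropped. Besides the non-capturing group, the pattern is compiled once, because String.matches() recompiles its regex on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, the word list comes from the thread.
class CompiledUrlFilter {
    // (?: ... ) groups the alternatives without capturing them.
    // Compiled once and reused, instead of being recompiled on every String.matches() call.
    private static final Pattern ACCEPT =
            Pattern.compile("(?!.*(?:login|rss|member|contactus|aboutus|logout|reg|help)).+");

    // Returns true when the whole URL matches, i.e. none of the words occurs in it.
    static boolean isAccepted(String url) {
        return ACCEPT.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("www.login.htm")); // false
        System.out.println(isAccepted("www.hello.htm")); // true
    }
}
```

Whether either change is measurable depends on how many URLs are processed, which is why the question above matters.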

Thinking about performance in theory is not always the best approach.

Start using it and then you'll know if the performance will be an issue.
Very often we spend time saving a few milliseconds, and the bottleneck turns out
to be somewhere quite different.
>>
Here is the solution I figured out:

(?!.*(login|rss|member|contactus|aboutus|logout|reg|help).*).+

it works so far, but not sure of its performance.
>>

Exactly the same can be done with the approach I suggested, which is more extensible and more performant.
Agreed with CEHJ.  If all your regex is doing is filtering exact words and you don't have a need for actual pattern matching beyond a simple String.contains() feature, iterating over a list of words is much faster and easier to maintain.  Every time you want to add a new filter, you have to modify the regex -- likely in code -- making it more complicated and difficult to read.  A simple List can be easily appended to and stored in an external file (the regex could as well) with one filter per line, making it a lot easier to read and maintain.
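As a rough illustration of the external-file idea, the sketch below reads one filter word per line and checks URLs with String.contains(). Everything named here (`FileBackedFilter`, `load`, `isAccepted`, the temp file used in `main`) is a hypothetical example, not code from the thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; class and method names are illustrative only.
class FileBackedFilter {
    private final List<String> words = new ArrayList<String>();

    // Keeps the non-blank entries, trimmed of surrounding whitespace.
    FileBackedFilter(List<String> lines) {
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    // Reads the filter words from a text file with one word per line.
    static FileBackedFilter load(Path wordFile) throws IOException {
        return new FileBackedFilter(Files.readAllLines(wordFile, StandardCharsets.UTF_8));
    }

    // Returns true when the URL contains none of the loaded words.
    boolean isAccepted(String url) {
        for (String word : words) {
            if (url.contains(word)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real word file in this sketch.
        Path file = Files.createTempFile("excluded-words", ".txt");
        Files.write(file, Arrays.asList("login", "contactus"), StandardCharsets.UTF_8);

        FileBackedFilter filter = load(file);
        System.out.println(filter.isAccepted("www.login.htm"));
        System.out.println(filter.isAccepted("www.hello.htm"));
        Files.delete(file);
    }
}
```

With this layout a new filter word is a one-line change to the text file, with no code change.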
Make sure you use the right tool for the job.  Sure, you can flip a screwdriver around and use the handle to knock in a nail, but a hammer is going to be much more efficient.