Avatar of mmmcdee
mmmcdee
Flag for United States of America asked on

Excel data extraction - seeking numbers between text

I am trying to find the first incidence where a 3, 4, 5 or 6 digit number (that is not a dollar amount) is either followed by the word "miles" or is preceded by the word "miles". In either event, extract that number from cell containing the subject matter and copy that number to an adjacent field.

My attached Excel file has my effort/formula at achieving this. The formula is not fool proof and locks up the sheet as the data grows. Not sure how to move to a regular expression for this.
Miles-Extraction.xlsm
Microsoft Excel

Avatar of undefined
Last Comment
mmmcdee

8/22/2022 - Mon
ASKER CERTIFIED SOLUTION
telyni19

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question
SANTABABY

Added code(in module 1) and a button. The updated workbook is attached.
Miles-Extraction.xlsm
dlmille

The regular expression I came up with is:

(miles \d{3,6} | \d{3,6} miles) - returns any matches 3 to 6 digits with miles before or after the number.  One would get results like 111000 miles or 51000 miles.  So we can wrap another RegExp around that to just get the number.

I used matthewsPatrick's utility RegExpFind to do this - see article, here:
https://www.experts-exchange.com/Programming/Languages/Visual_Basic/A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html

If we're only looking for one instance per line, then we can test for either and use the result as follows.

Use this formula, in K8:

[K8]=RegExpFind(RegExpFind($B8,"(\d{3,6} miles|miles \d{3,6})",1,FALSE,0,FALSE),"\d{3,6}",1,FALSE,0,FALSE)

And copy down.

In a public module, you'll need the RegExpFind() function, which can be found in a new module I created in the file, called RegExpHelpers.

This should be very fast, in comparison to string manipulation.

See attached final solution.

Cheers,

Dave
Miles-Extraction.xlsm
mmmcdee

ASKER
Thanks for the prompt response. The function works great.  Side Question:  Is there a way to use this same Module to pull YEAR (XXXX or XX) year in the description AND if two digit year the module can add "20" to the front?
Experts Exchange is like having an extremely knowledgeable team sitting and waiting for your call. Couldn't do my job half as well as I do without it!
James Murphy
telyni19

The year isn't as well defined as the miles number, is it? It can't be defined as being next to anything in particular. And with a smaller number, it's more likely you'll get an irrelevant number instead of the year. A four-digit year can be defined as a number between 1900 and 2100, perhaps, but a two-digit year is only going to be a number between 0 and 100, and there are other numbers in your text that match that description.
mmmcdee

ASKER
I would say that the majority is a four digit year - no really next to anything - somewhat random. If I could start with he four digit year and worry about the two digit later.  Could the module work on this? - thanks
telyni19

Try this:
Function YearValue(strText As String, Optional strDel As String = " ") As Long
Dim varSplit As Variant

varSplit = Split(strText, strDel)

For i = 0 To UBound(varSplit)
    YearValue = Val(Trim(varSplit(i)))
    If YearValue > 1900 And YearValue < 2100 Then Exit Function
Next i
YearValue = 0 'Default return if no year value found.
End Function

Open in new window

You would use it like this, with no second argument needed:
=yearvalue(B9)

And actually, I uncovered a bug. You don't want to subtract 1 from the UBound value in the for loop, as it already compensates for starting at 0. If you had a number as the last word in the phrase, the code as I gave above would not return it correctly. Sorry about that. It's just as well I put together and tested this alternate function. So line 7 in my previous code should say this:
For i = 0 To UBound(varSplit)

Open in new window

Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
mmmcdee

ASKER
Addendum:  The AdjacentValue() function works well.  Is there a way to wrap an IF statement around the function that will return the smaller number of 2 numbers found from the field?  Example:

"actual price 7000 miles 5000..."

Since the =adjacentvalue(XX,"miles") usually returns the 'price' where the miles is what I am looking to return.  

I need to have the function look for the smaller of the two numbers.  Is this possible?

Thanks -
telyni19

You could change "Max" to "Min" in line 15, but unless there are relevant numbers on both sides of the word, that won't give you the results you want, because it will default to returning zero. There's no guarantee the price will be larger than the miles anyway. In fact, in many cases, I would expect the miles to be larger. Your original example file only included one example out of eight that had less than 10,000 miles.

This function doesn't keep more than one relevant number; it returns the first one that fits the minimum requirement (99). To keep all relevant numbers and return the lowest would require completely re-writing it.

If you have any other scenarios or modifications, you may wish to ask a new question. This question is already closed with a solution.
mmmcdee

ASKER
Thanks telyni19.  I will post another question due to the rewrite. Thanks again
Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes