# Excel data extraction -  seeking numbers between text on
I am trying to find the first incidence where a 3, 4, 5 or 6 digit number (that is not a dollar amount) is either followed by the word "miles" or is preceded by the word "miles". In either event, extract that number from cell containing the subject matter and copy that number to an adjacent field.

My attached Excel file has my effort/formula at achieving this. The formula is not fool proof and locks up the sheet as the data grows. Not sure how to move to a regular expression for this.
Miles-Extraction.xlsm
Comment
Watch Question

Do more with EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Commented:
I recommend a UDF (user defined function) for something as complex as this. Please see the attached for one way to do it. I tested it on the examples and was able to produce the same results as you for each one of them. The exit criteria of greater than 99 is based on your definition of a 3-6 digit number.

Using the UDF in a cell takes the form of just one simple function:
It also allows for the possibility of an alternate delimiter character other than a space.

``````Function AdjacentValue(strText As String, strWord As String, Optional strDel As String = " ") As Long

Dim varSplit As Variant

varSplit = Split(strText, strDel)

For i = 0 To UBound(varSplit) - 1
If varSplit(i) = strWord Then
Select Case i
Case i < 1
Case i > UBound(varSplit) - 1
Case Else
AdjacentValue = WorksheetFunction.Max(Val(Trim(varSplit(i - 1))), Val(Trim(varSplit(i + 1))))
End Select
End If
If AdjacentValue > 99 Then Exit Function
Next i
End Function
``````
Miles-Extraction-1-UDF.xlsm
Software Professional

Commented:
Added code(in module 1) and a button. The updated workbook is attached.
Miles-Extraction.xlsm
Most Valuable Expert 2012
Top Expert 2012

Commented:
The regular expression I came up with is:

(miles \d{3,6} | \d{3,6} miles) - returns any matches 3 to 6 digits with miles before or after the number.  One would get results like 111000 miles or 51000 miles.  So we can wrap another RegExp around that to just get the number.

I used matthewsPatrick's utility RegExpFind to do this - see article, here:
http://www.experts-exchange.com/Programming/Languages/Visual_Basic/A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html

If we're only looking for one instance per line, then we can test for either and use the result as follows.

Use this formula, in K8:

[K8]=RegExpFind(RegExpFind(\$B8,"(\d{3,6} miles|miles \d{3,6})",1,FALSE,0,FALSE),"\d{3,6}",1,FALSE,0,FALSE)

And copy down.

In a public module, you'll need the RegExpFind() function, which can be found in a new module I created in the file, called RegExpHelpers.

This should be very fast, in comparison to string manipulation.

See attached final solution.

Cheers,

Dave
Miles-Extraction.xlsm

Commented:
Thanks for the prompt response. The function works great.  Side Question:  Is there a way to use this same Module to pull YEAR (XXXX or XX) year in the description AND if two digit year the module can add "20" to the front?

Commented:
The year isn't as well defined as the miles number, is it? It can't be defined as being next to anything in particular. And with a smaller number, it's more likely you'll get an irrelevant number instead of the year. A four-digit year can be defined as a number between 1900 and 2100, perhaps, but a two-digit year is only going to be a number between 0 and 100, and there are other numbers in your text that match that description.

Commented:
I would say that the majority is a four digit year - no really next to anything - somewhat random. If I could start with he four digit year and worry about the two digit later.  Could the module work on this? - thanks

Commented:
Try this:
``````Function YearValue(strText As String, Optional strDel As String = " ") As Long
Dim varSplit As Variant

varSplit = Split(strText, strDel)

For i = 0 To UBound(varSplit)
YearValue = Val(Trim(varSplit(i)))
If YearValue > 1900 And YearValue < 2100 Then Exit Function
Next i
YearValue = 0 'Default return if no year value found.
End Function
``````
You would use it like this, with no second argument needed:
=yearvalue(B9)

And actually, I uncovered a bug. You don't want to subtract 1 from the UBound value in the for loop, as it already compensates for starting at 0. If you had a number as the last word in the phrase, the code as I gave above would not return it correctly. Sorry about that. It's just as well I put together and tested this alternate function. So line 7 in my previous code should say this:
``````For i = 0 To UBound(varSplit)
``````

Commented:
Addendum:  The AdjacentValue() function works well.  Is there a way to wrap an IF statement around the function that will return the smaller number of 2 numbers found from the field?  Example:

"actual price 7000 miles 5000..."

Since the =adjacentvalue(XX,"miles") usually returns the 'price' where the miles is what I am looking to return.

I need to have the function look for the smaller of the two numbers.  Is this possible?

Thanks -

Commented:
You could change "Max" to "Min" in line 15, but unless there are relevant numbers on both sides of the word, that won't give you the results you want, because it will default to returning zero. There's no guarantee the price will be larger than the miles anyway. In fact, in many cases, I would expect the miles to be larger. Your original example file only included one example out of eight that had less than 10,000 miles.

This function doesn't keep more than one relevant number; it returns the first one that fits the minimum requirement (99). To keep all relevant numbers and return the lowest would require completely re-writing it.

If you have any other scenarios or modifications, you may wish to ask a new question. This question is already closed with a solution.

Commented:
Thanks telyni19.  I will post another question due to the rewrite. Thanks again

Do more with 