mmmcdee

asked on

# Excel data extraction - seeking numbers between text

I am trying to find the first incidence where a 3, 4, 5 or 6 digit number (that is not a dollar amount) is either followed by the word "miles" or is preceded by the word "miles". In either event, extract that number from cell containing the subject matter and copy that number to an adjacent field.

My attached Excel file has my effort/formula at achieving this. The formula is not fool proof and locks up the sheet as the data grows. Not sure how to move to a regular expression for this.

Miles-Extraction.xlsm

My attached Excel file has my effort/formula at achieving this. The formula is not fool proof and locks up the sheet as the data grows. Not sure how to move to a regular expression for this.

Miles-Extraction.xlsm

ASKER CERTIFIED SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

The regular expression I came up with is:

(miles \d{3,6} | \d{3,6} miles) - returns any matches 3 to 6 digits with miles before or after the number. One would get results like 111000 miles or 51000 miles. So we can wrap another RegExp around that to just get the number.

I used matthewsPatrick's utility RegExpFind to do this - see article, here:

https://www.experts-exchange.com/Programming/Languages/Visual_Basic/A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html

If we're only looking for one instance per line, then we can test for either and use the result as follows.

Use this formula, in K8:

[K8]=RegExpFind(RegExpFind($B8,"(\d{3,6} miles|miles \d{3,6})",1,FALSE,0,FALSE),"\d{3,6}",1,FALSE,0,FALSE)

And copy down.

In a public module, you'll need the RegExpFind() function, which can be found in a new module I created in the file, called RegExpHelpers.

This should be very fast, in comparison to string manipulation.

See attached final solution.

Cheers,

Dave

Miles-Extraction.xlsm

(miles \d{3,6} | \d{3,6} miles) - returns any matches 3 to 6 digits with miles before or after the number. One would get results like 111000 miles or 51000 miles. So we can wrap another RegExp around that to just get the number.

I used matthewsPatrick's utility RegExpFind to do this - see article, here:

https://www.experts-exchange.com/Programming/Languages/Visual_Basic/A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html

If we're only looking for one instance per line, then we can test for either and use the result as follows.

Use this formula, in K8:

[K8]=RegExpFind(RegExpFind

And copy down.

In a public module, you'll need the RegExpFind() function, which can be found in a new module I created in the file, called RegExpHelpers.

This should be very fast, in comparison to string manipulation.

See attached final solution.

Cheers,

Dave

Miles-Extraction.xlsm

ASKER

Thanks for the prompt response. The function works great. Side Question: Is there a way to use this same Module to pull YEAR (XXXX or XX) year in the description AND if two digit year the module can add "20" to the front?

The year isn't as well defined as the miles number, is it? It can't be defined as being next to anything in particular. And with a smaller number, it's more likely you'll get an irrelevant number instead of the year. A four-digit year can be defined as a number between 1900 and 2100, perhaps, but a two-digit year is only going to be a number between 0 and 100, and there are other numbers in your text that match that description.

ASKER

I would say that the majority is a four digit year - no really next to anything - somewhat random. If I could start with he four digit year and worry about the two digit later. Could the module work on this? - thanks

Try this:

=yearvalue(B9)

And actually, I uncovered a bug. You don't want to subtract 1 from the UBound value in the for loop, as it already compensates for starting at 0. If you had a number as the last word in the phrase, the code as I gave above would not return it correctly. Sorry about that. It's just as well I put together and tested this alternate function. So line 7 in my previous code should say this:

```
Function YearValue(strText As String, Optional strDel As String = " ") As Long
Dim varSplit As Variant
varSplit = Split(strText, strDel)
For i = 0 To UBound(varSplit)
YearValue = Val(Trim(varSplit(i)))
If YearValue > 1900 And YearValue < 2100 Then Exit Function
Next i
YearValue = 0 'Default return if no year value found.
End Function
```

You would use it like this, with no second argument needed:=yearvalue(B9)

And actually, I uncovered a bug. You don't want to subtract 1 from the UBound value in the for loop, as it already compensates for starting at 0. If you had a number as the last word in the phrase, the code as I gave above would not return it correctly. Sorry about that. It's just as well I put together and tested this alternate function. So line 7 in my previous code should say this:

`For i = 0 To UBound(varSplit)`

ASKER

Addendum: The AdjacentValue() function works well. Is there a way to wrap an IF statement around the function that will return the smaller number of 2 numbers found from the field? Example:

"actual price 7000 miles 5000..."

Since the =adjacentvalue(XX,"miles") usually returns the 'price' where the miles is what I am looking to return.

I need to have the function look for the smaller of the two numbers. Is this possible?

Thanks -

"actual price 7000 miles 5000..."

Since the =adjacentvalue(XX,"miles")

I need to have the function look for the smaller of the two numbers. Is this possible?

Thanks -

You could change "Max" to "Min" in line 15, but unless there are relevant numbers on both sides of the word, that won't give you the results you want, because it will default to returning zero. There's no guarantee the price will be larger than the miles anyway. In fact, in many cases, I would expect the miles to be larger. Your original example file only included one example out of eight that had less than 10,000 miles.

This function doesn't keep more than one relevant number; it returns the first one that fits the minimum requirement (99). To keep all relevant numbers and return the lowest would require completely re-writing it.

If you have any other scenarios or modifications, you may wish to ask a new question. This question is already closed with a solution.

This function doesn't keep more than one relevant number; it returns the first one that fits the minimum requirement (99). To keep all relevant numbers and return the lowest would require completely re-writing it.

If you have any other scenarios or modifications, you may wish to ask a new question. This question is already closed with a solution.

ASKER

Thanks telyni19. I will post another question due to the rewrite. Thanks again

Miles-Extraction.xlsm