[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
• Status: Solved
• Priority: Medium
• Security: Public
• Views: 2482

# Querying for Roman Numerals

I'm trying to build a Query using Oracle (10g) to identify all fields with a roman numeral - and put that roman numeral in a new cell:

Field                        Roman Numeral
MGM XXIX:              XXIX
MGM XIX, One         XIX

Roman Numerals are limited to I, V, X, L, C, D, M

I haven't found a function for this but...
I'm thinking that the query may be written where a word within a field contains (V, X, L, C, D, M) only - but will need to account for roman numerals like 'VI' -- where there is an I. Also, I think including the 'I' will result in words as 'DIM', 'MID', 'LID' - not what I would want. So - would need to search for I, II, III, IV separately.

Roman Numerals won't appear in the first part of a piece of text... and they may be proceeded by a comma (MGM XIX, one)

0
kwieckii
• 13
• 6
1 Solution

Commented:
may be proceeded by a comma (MGM XIX, one)

that example shows roman numeral being followed by a comma not preceeded.

is preceeding also allowed? or just following?
0

Commented:

SELECT field, REGEXP_REPLACE(REGEXP_SUBSTR(field, ' [IVXLCDM]+,?'), '[ ,]', NULL) roman_numeral
FROM your_table
0

Commented:
a simpler version...

SELECT field, ltrim(REGEXP_SUBSTR(field, ' [IVXLCDM]+')) roman_numeral
FROM your_table
0

Commented:
don't really need the comma check
0

Commented:
Do note, the above can give false positives.

it only checks for a contiguous set of letters that couldbe used to make a roman numeral
it doesn't actually check if it is.

'I LIKE DILL PICKLES' will return DILL  as a roman numeral even though it isn't.

0

Author Commented:
Would there be a way to not consider the other characters? - (to try to get rid of the false positives)?
For example - if there's a B or F in the string - don't put the value in the field...
using:
SELECT field, ltrim(REGEXP_SUBSTR(field, ' [IVXLCDM]+')) roman_numeral
FROM your_table

The word = IMMOBILIEN - returned IMM  - just trying to understand the query.

Thanks
0

Commented:
can you post some more example input strings so I can see what all the various delimiters are...

whitespace, comma, colon, end of field, etc.
0

Commented:
http://forums.devshed.com/python-programming-11/roman-numeral-to-arabic-converter-601261.html

Here they have tried to convert Roman in Arabac

This code could be translated in PL/SQL and used to check if the substring is a real Roman
0

Author Commented:
Here's a few.. please let me know if you'd like more:

with x as
(
select 'ABC CORPORATION LIMITED' CoName from dual union all
select 'ABC CORP XII LTD' CoName from dual union all
select 'ABC CORP XL, LTD' CoName from dual

)
select
CoName,
ltrim(REGEXP_SUBSTR(CoName, ' [IVXLCDM]+')) roman_numeral
from x

Thanks again!!
0

Commented:
using oracle's built in features it's quite easy to write a roman numeral converter (this was a fun exercise,  thanks for the question!)
Note, the function only supports up to 3999 because after that Oracle can't support the conversion from number to roman value.
If you need higher than that then we'll have to write our own conversion from number to roman too.
However, in practice, most roman numerals aren't supported beyond 3999 because 4000 "should" be M followed by some new character representing 5000 but there isn't one.
I have seen converters that simply repeat M's as far as needed and the code below could be modified to do that if you wanted.

however, using this function it makes it easy to eliminate false positives.
I'm not sure what you wanted for  "ABC CORPORATION LIMITED"  since it has no roman numerals, I assume you want those excluded

Try this... (I extended your WITH to include some more examples)

WITH x AS (SELECT   'MGM XXIX:' coname FROM DUAL
UNION ALL
SELECT   'MGM XIX, One' coname FROM DUAL
UNION ALL
SELECT   'MGM XIX' coname FROM DUAL
UNION ALL
SELECT   'MGM D XIX' coname FROM DUAL
UNION ALL
SELECT   'MGM DDDDD XIX' coname FROM DUAL
UNION ALL
SELECT   'ABC CORPORATION LIMITED' coname FROM DUAL
UNION ALL
SELECT   'ABC CORP XII LTD' coname FROM DUAL
UNION ALL
SELECT   'ABC CORP XL, LTD' coname FROM DUAL)
SELECT   coname, roman_numeral
FROM   (SELECT   coname,
REGEXP_REPLACE(REGEXP_SUBSTR(coname, ' [IVXLCDM]+([ ,:]|\$)', 1, n),
'[^IVXLCDM]',
NULL)
roman_numeral,
n
FROM   x, (    SELECT   LEVEL n
FROM   DUAL
CONNECT BY   LEVEL < 10))
WHERE   roman_value(roman_numeral) IS NOT NULL
``````CREATE OR REPLACE FUNCTION roman_value(p_roman_str VARCHAR2)
RETURN INTEGER
IS
v_integer   NUMBER;
BEGIN
SELECT   SUM(z)
INTO   v_integer
FROM   (SELECT   DECODE(p,
'IV',
4,
'IX',
9,
'XL',
40,
'XC',
90,
'CD',
400,
'CM',
900,
'I',
1,
'V',
5,
'X',
10,
'L',
50,
'C',
100,
'D',
500,
'M',
1000)
z
FROM   (SELECT   REGEXP_SUBSTR(p_roman_str,
'IV|IX|XL|XC|CD|CM|I|V|X|L|C|D|M',
1,
n)
p
FROM   (    SELECT   LEVEL n
FROM   DUAL
CONNECT BY   LEVEL <= 15))
WHERE   p IS NOT NULL);

IF TO_CHAR(v_integer, 'fmRN') = UPPER(TRIM(p_roman_str)) THEN
RETURN v_integer;
ELSE
RETURN NULL;
END IF;
END roman_value;
``````
0

Author Commented:
Thanks - I wish I could run this code - I don't have privileges in the db to create functions.  Is there anyway to change this to a select only --- otherwise, I think I'm going to have to pay someone off to run this for me.

I included ABC Corporation Limited - because with the first query - the C was retrieved as the roman numeral.

I can't think of a time where there would be over 3999 - if there is - most likely would be part of the text - or a mistake.

Thanks!!
0

Commented:
try this...
``````WITH x AS (SELECT   'MGM XXIX:' coname FROM DUAL
UNION ALL
SELECT   'MGM XIX, One' coname FROM DUAL
UNION ALL
SELECT   'MGM XIX' coname FROM DUAL
UNION ALL
SELECT   'MGM D XIX' coname FROM DUAL
UNION ALL
SELECT   'MGM DDDDD XIX' coname FROM DUAL
UNION ALL
SELECT   'ABC CORPORATION LIMITED' coname FROM DUAL
UNION ALL
SELECT   'ABC CORP XII LTD' coname FROM DUAL
UNION ALL
SELECT   'ABC CORP XL, LTD' coname FROM DUAL)
SELECT   coname, roman_numeral
FROM   (  SELECT   coname, roman_numeral, SUM(DECODE(p,
'IV',
4,
'IX',
9,
'XL',
40,
'XC',
90,
'CD',
400,
'CM',
900,
'I',
1,
'V',
5,
'X',
10,
'L',
50,
'C',
100,
'D',
500,
'M',
1000))
z
FROM   (SELECT   coname,
roman_numeral,
REGEXP_SUBSTR(roman_numeral, 'IV|IX|XL|XC|CD|CM|I|V|X|L|C|D|M', 1, n)
p
FROM   (SELECT   coname,
REGEXP_REPLACE(
REGEXP_SUBSTR(coname, ' [IVXLCDM]+([ ,:]|\$)', 1, i),
'[^IVXLCDM]',
NULL
)
roman_numeral,
i
FROM   x, (    SELECT   LEVEL i
FROM   DUAL
CONNECT BY   LEVEL < 10)),
(    SELECT   LEVEL n
FROM   DUAL
CONNECT BY   LEVEL <= 15))
GROUP BY   coname, roman_numeral)
WHERE   TO_CHAR(z, 'fmRN') = UPPER(TRIM(roman_numeral))
``````
0

Author Commented:
That works pretty good -- very slow (running against 2 million records).  It took about 23 seconds to return the first 20k.

Some of the false positives now...
Washington DC            DC
A & C Co                      C
A & D Co                      C

Do you think the easiest way to eliminate the C only (probably D and DC also) value would be to not select it after it's calculated or before?... (SELECT * FROM Roman_Numeral_Table where Roman_Numeral <> 'C')

ABC CORPORATION LIMITED didn't return in the results - (with a blank roman numeral)

Thanks again.
0

Commented:
well, that "could" work

you've got some special circumstances there because you have valid roman numerals that are proper delimited but you're eliminating them by "human" recognition
not by parsing rules.

Depending on your indexing it might be to identify special strings and strip them from the input before you start

select * from your table where coname not like 'Washington DC%' and coname not like 'A & C % and coname not like 'A & D %'

or, good roman numerals might follow the bad ones then  strip out those substrings (this will make it slower but accurate)

select regexp_replace(coname,'(Washington DC)|(A & C)|(A &D)',null) coname from your table
0

Commented:
you can remove the UPPER and TRIM functions too, those are carry overs from the function
0

Commented:
you can also make it faster by reducing the counter loops
The first "connect by level" counter is 1..10 so it can check up to 10 candidate strings.  If you'll never have that many, reduce it.

Similarly the roman numeral counter 1..15 can be reduced if you'll have smaller numbers. it can be reduced 14 if you'll never go over 3887
13 if you'll never go over 2887,  12 if you'll never go over 1887, 11 if never over 887, 10 if never over 387, 9 if never over 287, 8 if not over 187,
7 if not over 87,  and so on
0

Author Commented:
Thanks for the quick response!!  I'm still trying to understand how this works:

When I change the Level i counter to 1, I receive the same results - but very quick
(the time on 10k records with level i = 10 was 20 seconds - compared to 1.8 seconds where level i = 1).... What does the Level i counter do?

Is there a way to retrieve the rows where no roman numerals are identified in the data?
0

Commented:
The first "connect by level" counter is 1..10 so it can check up to 10 candidate strings.  If you'll never have that many, reduce it.

it's for counting substrings that look like they "might" be roman numerals.

DDDDD for instance,  has all roman numerals in it but it's not valid.

So if you had "My test string DDDDD  XXIX"

DDDDD would be 1
XXIX would be 2

the DDDDD would be thrown out later

by setting it to 1 you are effectively saying never check beyond the first substring that looks like it might be valid.
if it's that's ok, fine just make sure of your data

0

Author Commented:
Thanks again!! I always pick up something new with your answers (besides the original question)
0

Commented: