Parsing .svg file

I am trying to parse a .svg file to be able to write data for this particular website. A sample file can be found at

The web page is

An extract from the data file is reproduced at the bottom.

Descriptions of the .svg format say that the path data is in the d="..." attribute, where M means "move to" and is followed by coordinates. Then comes a lower-case l (L) followed by a series of coordinates forming a path.

The problem is that in the data following the lower-case l I see many numbers separated by spaces, but between the spaces there are several decimal points, so I do not see how the numbers/coordinates are delimited. Can someone help me out? Is there any other topic area where this question would be relevant?

<path class="border country" id="za" d="M450.942 336.32l.55 2.942.03.68.572 1.374.42 1.906-.004 3.927-.082 1.09-.173.08-1.027-.787-.468.325-.91 1.91-.02 1.072.915.614 1.74.033-.084.774-.432 1.767-.275 1.987-.562 1.414-.582.732-.56.327-.998 1.478-.372.713-.33 1.04-1.654 3.585-.714 1.18-.53.497-1.37 1.786-.61.976-2.24 2.547-1.764 1.568-1.46.8-.99-.163-.758.464-.032.54-1.43-.126-.396.655-.52.018-.93-.38-1.345-.255-.714.322-1.61-.238-.693.198-1.028 1.017-1.216.115-.425-.142-1.195.33-1.146 1.082-.87-.11-1.208-1.35-.6.046-.05-.857-.62-.075-.163.876-.247-.53.005-.837.25-.173-.46-1.563-.236-.21-.626-1.333-.06-.448.9-.588.166-.493-.033-1.282-.226-1.273-1.2-2.426-.744-1.926-.356-1.157-.538-2.334-.474-1.32-.657-1.315.694-.554.4-1.283.275-.128.742 1.08-. 1.474.582 1.658-.01.727.235.246-.54.902-.918.696-.177V344.28l. 2.327.05.835-.425.935.033 1 .22.358 1.566-.345.315-.363.74-2.88 1.215-.55.863-1.147.342-1.798.735-.94 1-.762 1.018-1.687.387-.336 1.03-.378.654-.897.53-.064.67-.16 1.794.612 1.38-.07z"/>
Saqib Husain, Syed (Engineer) asked:

Gertone (Geert Bormans), Information Architect, commented:
In path data you sometimes see a comma and sometimes a space between coordinates; the dot is the decimal indicator inside a coordinate.
Here the space separates the coordinates.
M450.942 336.32
means move to (450.942, 336.32).
After that you have
L.55 2.94
which means you start from the coordinate given at M and draw a line.
After that you draw a line to the next pair of coordinates
(L takes any pair of coordinates and stays implicit until you change the operator).
There seem to be quite a few spaces missing, by the way.
Internet Explorer manages to parse the path attribute, however.
If I analyse the path, I end up with a country, I believe, but I don't know which one.

for more info
mccarl, IT Business Systems Analyst / Software Developer, commented:
Regarding the missing spaces, what they have done is try to minimise the number of characters required. All the spaces have been stripped out except for those that are absolutely required to tell two adjacent numbers apart.

So, as an example, the first part...

M450.942 336.32

the co-ordinates are 450.942 and 336.32. The space is required because otherwise you don't know where the first number ends and the next one starts. However, in the example...

.55 2.942.03.68.572 1.374

there are three sets of co-ordinates: (0.55, 2.942), (0.03, 0.68) and (0.572, 1.374). The space is still there between the 1st and 2nd numbers (for the same reason as above), but between the 2nd and 3rd numbers the space and the 3rd number's leading zero can be omitted, because a number can't have two (or more) decimal points, so the second decimal point must be the start of a new number.

Another way to think about it (especially if you are writing code to parse/extract these numbers) is to look at each character in the path string and accumulate characters until you reach a space or the accumulated text is no longer a valid number, i.e.

.5       - valid
.55    - valid
.55<space>    - ok that is the end of a number, move on to the next

2.942.        - not a valid number so the number is 2.942, and we start again with the decimal as the first
.03.           - not a valid number so the number is .03, and we start again, etc, etc

Oh, and just to be clear, the same thing happens with the "-" character (negative sign): it can start a new number without a preceding space, because a negative sign in the middle of a number is invalid. So ...

1.09-.173.08-1.027-.787-.468.325-.91

is read as

1.09    -0.173    0.08    -1.027    -0.787    -0.468    0.325    -0.91
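
The character-by-character approach described above can be sketched in Python roughly as follows. This is only an illustration, not a full SVG path parser: the function name split_numbers is made up for this example, exponent notation (such as 1e-3) is not handled, and command letters are simply treated as separators.

```python
def split_numbers(run):
    """Split a compact run of SVG path numbers into a list of floats.

    Hypothetical helper for illustration; ignores exponent notation.
    """
    numbers = []
    token = ""

    def flush(tok):
        # Only keep tokens that contain at least one digit,
        # so stray "." or "-" fragments are skipped.
        if any(c.isdigit() for c in tok):
            numbers.append(float(tok))

    for ch in run:
        if ch.isdigit():
            token += ch
        elif ch == "." and "." not in token:
            # First decimal point of the current number.
            token += ch
        else:
            # A space, comma, letter, minus sign or second decimal
            # point ends the current number.
            flush(token)
            # A minus sign or decimal point also starts the next number.
            token = ch if ch in "-." else ""
    flush(token)
    return numbers


print(split_numbers(".55 2.942.03.68.572 1.374"))
print(split_numbers("1.09-.173.08-1.027-.787-.468.325-.91"))
```

The pairs of co-ordinates are then simply consecutive items of the returned list.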

Saqib Husain, Syed (Engineer), author, commented:
Geert Bormans: Thanks for the info.

mccarl: Thanks. Looks pretty much like what I am looking for. Please let me try this out before I close this out.

Gertone (Geert Bormans), Information Architect, commented:
@mccarl, thanks, good point... the "optimisations" that save some space in the text string but make it so much harder to write a parser for it :-)
Thanks for adding that info.
Seems to be an error.  There is a lower-case "L" used in place of a "1"
Here's a regex pattern that seems to work
((?:[ \-A-Z])(?:\d*\.\d+))|(\.\d+)

You can capture the paired groups separately if that will make your job easier.
mccarl, IT Business Systems Analyst / Software Developer, commented:
Seems to be an error.  There is a lower-case "L" used in place of a "1"

No, no error, it IS supposed to be a lower-case "L". Geert's original post is correct, but it describes the upper-case "L" command, whereas the path contains the lower-case "l" command. This is all described in the page that Geert has linked to, but in summary: "L" draws a line from the current pen location to the absolute co-ordinates given after the command, whereas "l" draws a line from the current pen location to the relative co-ordinates given after the command, i.e. an offset from the current pen location. So...

M 1 2 L 3 4     -      will draw a line from (1,2) to (3,4)


M 1 2 l 3 4      -      will draw a line from (1, 2) to (4, 6), ie. it draws a line from (1,2) to a point 3 units in positive x direction and 4 units in positive y direction
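
To make the absolute/relative difference concrete, here is a small Python sketch that turns a list of already-parsed steps into absolute points. It is an assumption-laden illustration: the function name to_absolute and the (letter, x, y) tuple format are invented for this example, and only the M, L and l commands discussed in this thread are handled.

```python
def to_absolute(steps):
    """Convert (letter, x, y) steps into a list of absolute points.

    Hypothetical helper for illustration; supports only M, L and l.
    """
    points = []
    x = y = 0
    for letter, a, b in steps:
        if letter in ("M", "L"):
            # Upper case: the values are absolute co-ordinates.
            x, y = a, b
        elif letter == "l":
            # Lower case: the values are offsets from the current pen location.
            x, y = x + a, y + b
        else:
            raise ValueError("unsupported command: " + letter)
        points.append((x, y))
    return points


print(to_absolute([("M", 1, 2), ("L", 3, 4)]))  # [(1, 2), (3, 4)]
print(to_absolute([("M", 1, 2), ("l", 3, 4)]))  # [(1, 2), (4, 6)]
```

For the country outline in the question, every pair after the "l" would be added to the running position in the same way.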
In that case, here's an updated regex pattern that recognizes both upper-case and lower-case letters.
((?:[ \-A-Za-z])(?:\d*\.\d+))|(\.\d+)
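
One caveat: the pattern above requires a decimal point, so a whole number such as the 1 in ".033 1 .22" would not be matched, and the first group also captures the leading space, sign or letter. A possible alternative, sketched here with Python's re module, matches only the number itself. The names NUMBER and find_numbers are made up for this example, and the pattern is an assumption about what the data contains (optional sign, optional integer part, optional fraction, optional exponent), not a complete SVG path grammar.

```python
import re

# Optional sign, then either digits with an optional fraction or a bare
# fraction such as ".55", then an optional exponent.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def find_numbers(path_data):
    """Return every number found in the path data as a float.

    Command letters are skipped; they would need separate handling.
    """
    return [float(text) for text in NUMBER.findall(path_data)]


print(find_numbers("M450.942 336.32l.55 2.942.03.68"))
print(find_numbers(".033 1 .22"))
```

Because the regex engine matches as many digits as it can and then stops at a second decimal point or a minus sign, the compact runs split in the same way as described earlier in the thread.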

Saqib Husain, Syed (Engineer), author, commented:
Spot on!

Thanks a million.
mccarl, IT Business Systems Analyst / Software Developer, commented:
You're welcome!!