Parsing .svg file

I am trying to parse a .svg file to be able to write data for this particular website. A sample file can be found at

The web page is

An extract from the data file is reproduced at the bottom.

Descriptions of .svg files say that the path data lives in the d="..." attribute, where M means "move to" and is followed by coordinates. Then comes a lower-case l (L) followed by a series of coordinates forming a path.

The problem is that in the data following the lower-case l I see many numbers separated by spaces, but between the spaces there are several decimal points, so I do not see how the numbers/coordinates are delimited. Can someone help me out? Is there any other topic area where this question would be relevant?

<path class="border country" id="za" d="M450.942 336.32l.55 2.942.03.68.572 1.374.42 1.906-.004 3.927-.082 1.09-.173.08-1.027-.787-.468.325-.91 1.91-.02 1.072.915.614 1.74.033-.084.774-.432 1.767-.275 1.987-.562 1.414-.582.732-.56.327-.998 1.478-.372.713-.33 1.04-1.654 3.585-.714 1.18-.53.497-1.37 1.786-.61.976-2.24 2.547-1.764 1.568-1.46.8-.99-.163-.758.464-.032.54-1.43-.126-.396.655-.52.018-.93-.38-1.345-.255-.714.322-1.61-.238-.693.198-1.028 1.017-1.216.115-.425-.142-1.195.33-1.146 1.082-.87-.11-1.208-1.35-.6.046-.05-.857-.62-.075-.163.876-.247-.53.005-.837.25-.173-.46-1.563-.236-.21-.626-1.333-.06-.448.9-.588.166-.493-.033-1.282-.226-1.273-1.2-2.426-.744-1.926-.356-1.157-.538-2.334-.474-1.32-.657-1.315.694-.554.4-1.283.275-.128.742 1.08-. 1.474.582 1.658-.01.727.235.246-.54.902-.918.696-.177V344.28l. 2.327.05.835-.425.935.033 1 .22.358 1.566-.345.315-.363.74-2.88 1.215-.55.863-1.147.342-1.798.735-.94 1-.762 1.018-1.687.387-.336 1.03-.378.654-.897.53-.064.67-.16 1.794.612 1.38-.07z"/>
Saqib Husain, Syed (Engineer) asked:
mccarl (IT Business Systems Analyst / Software Developer) commented:
Regarding the missing spaces: the file has been written to minimise the number of characters. All the spaces have been stripped out except those that are absolutely required to tell two adjacent numbers apart.

So, as an example, the first part...

M450.942 336.32

The co-ordinates are 450.942 and 336.32. The space is required because otherwise you don't know where the first number ends and the next starts. However, in the example...

.55 2.942.03.68.572 1.374

there are three sets of co-ordinates: (0.55, 2.942), (0.03, 0.68) and (0.572, 1.374). The space is still there between the 1st and 2nd numbers (for the same reason as above), but between the 2nd and 3rd numbers the space and the 3rd number's leading zero can be omitted: a number can't have two (or more) decimal points, so the second decimal point must be the start of a new number.

Another way to think about it (especially if you are writing code to parse/extract these numbers) is to look at each character in the path string and keep appending characters until you reach a space or the result is no longer a valid number, i.e.

.5       - valid
.55    - valid
.55<space>    - ok that is the end of a number, move on to the next

2.942.        - not a valid number, so the number is 2.942, and we start again with the decimal point as the first character
.03.           - not a valid number so the number is .03, and we start again, etc, etc

Oh, and just to be clear, the same thing happens with the "-" character (negative sign): it can start a new number without a space before it, because a negative sign is invalid mid-number. So the fragment 1.09-.173.08-1.027-.787-.468.325-.91 from the path is read as

1.09    -0.173    0.08    -1.027    -0.787    -0.468    0.325    -0.91
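
The character-by-character approach described above could be sketched roughly as follows in Python. This is only an illustration: the function name split_numbers is made up for this example, and it assumes the input holds nothing but well-formed numbers, spaces and commas (no command letters such as M or l, and no exponent notation).

```python
def split_numbers(data: str) -> list[float]:
    """Split a run of SVG path co-ordinates into floats.

    Assumption: `data` contains only digits, '.', '-', spaces and commas.
    A number ends at a space or comma, at a second '.', or at a '-'
    that is not the first character of the number.
    """
    numbers = []
    current = ""
    for ch in data:
        if ch in " ,":
            # Explicit separator: close the number in progress, if any.
            if current:
                numbers.append(float(current))
            current = ""
        elif ch == "-" and current:
            # A sign cannot appear mid-number, so it starts a new one.
            numbers.append(float(current))
            current = "-"
        elif ch == "." and "." in current:
            # A second decimal point starts a new number.
            numbers.append(float(current))
            current = "."
        else:
            current += ch
    if current:
        numbers.append(float(current))
    return numbers


print(split_numbers(".55 2.942.03.68.572 1.374"))
# [0.55, 2.942, 0.03, 0.68, 0.572, 1.374]
print(split_numbers("1.09-.173.08-1.027-.787-.468.325-.91"))
# [1.09, -0.173, 0.08, -1.027, -0.787, -0.468, 0.325, -0.91]
```

The numbers would still need to be grouped into (x, y) pairs, and the command letters handled separately.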
Geert Bormans (Information Architect) commented:
Sometimes you see a comma, sometimes a dot, as a decimal indicator inside coordinates.
The space separates the coordinates.
M450.942 336.32
means move to (450.942, 336.32)
after that you have an
L.55 2.94
means you start from the coordinate at M and draw a line
After that you draw a line to the next set of two coordinates
(L takes any pair of coordinates and continues to have the L implicit until you change the operator)
There seem to be quite a few spaces missing, by the way.
Internet Explorer manages to parse the path attribute, however.
If I analyse the path, I end up with a country, I believe, but I don't know which one.

for more info
Saqib Husain, Syed (Engineer, author) commented:
Geert Bormans: Thanks for the info.

mccarl: Thanks. Looks pretty much like what I am looking for. Please let me try this out before I close this out.
Geert Bormans (Information Architect) commented:
@mccarl, thanks, good point... the 'optimisations' that save some space in the text string size but make it so much harder to write a parser for it :-)
Thanks for adding that info
Seems to be an error.  There is a lower-case "L" used in place of a "1"
Here's a regex pattern that seems to work
((?:[ \-A-Z])(?:\d*\.\d+))|(\.\d+)


You can capture the paired groups separately if that will make your job easier.
mccarl (IT Business Systems Analyst / Software Developer) commented:
Seems to be an error.  There is a lower-case "L" used in place of a "1"

No, no error, it IS supposed to be a lower-case "L". While Geert's original post is correct, it refers to the upper-case "L" command, whereas the path contains the lower-case "l" command. This is all described in the page that Geert has linked to, but in summary: "L" draws a line from the current pen location to the absolute co-ordinates given after the command, whereas "l" draws a line from the current pen location to the relative co-ordinates given after the command, i.e. offsets from the current pen location. So...

M 1 2 L 3 4     -      will draw a line from (1,2) to (3,4)


M 1 2 l 3 4      -      will draw a line from (1, 2) to (4, 6), ie. it draws a line from (1,2) to a point 3 units in positive x direction and 4 units in positive y direction
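
To make the relative case concrete, here is a small Python sketch that turns the offsets following an "l" command into absolute points. The function name lines_to_absolute is hypothetical, and the sketch assumes the numbers have already been extracted and paired up. Each offset is applied to the point reached by the previous one, since the pen moves after every segment.

```python
def lines_to_absolute(start, offsets):
    """Turn the relative offsets of an 'l' command into absolute points.

    `start` is the pen position set by the preceding 'M' command;
    `offsets` is a list of (dx, dy) pairs taken from the 'l' command.
    """
    x, y = start
    points = [(x, y)]
    for dx, dy in offsets:
        # Each offset is relative to the current pen location.
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


print(lines_to_absolute((1, 2), [(3, 4)]))
# [(1, 2), (4, 6)]
print(lines_to_absolute((1, 2), [(3, 4), (1, -1)]))
# [(1, 2), (4, 6), (5, 5)]
```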
In that case, here's an updated regex pattern that recognizes both upper-case and lower-case letters.
((?:[ \-A-Za-z])(?:\d*\.\d+))|(\.\d+)
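
As far as I can tell, the pattern above requires a decimal point in every match, so it would skip whole numbers such as the 1 in "1-.762" from the sample path. A possible alternative, sketched in Python with the standard re module, is shown below. The pattern and the tokenize helper are assumptions made for this example, not something taken from the thread, and the sketch does not handle exponent notation (e.g. 1e-5) or the flag arguments of arc commands.

```python
import re

# Assumed pattern (not from the thread): a single command letter, or an
# optionally signed number written as "12", "12.5", "12." or ".5".
TOKEN = re.compile(r"[A-Za-z]|-?(?:\d+\.?\d*|\.\d+)")


def tokenize(d):
    """Return command letters as str and numbers as float, in order."""
    return [t if t.isalpha() else float(t) for t in TOKEN.findall(d)]


print(tokenize("M450.942 336.32l.55 2.942.03.68.572 1.374"))
# ['M', 450.942, 336.32, 'l', 0.55, 2.942, 0.03, 0.68, 0.572, 1.374]
print(tokenize(".735-.94 1-.762"))
# [0.735, -0.94, 1.0, -0.762]
```

Because findall only returns the matched tokens, spaces and commas between numbers are skipped without being part of any match.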


Saqib Husain, Syed (Engineer, author) commented:
Spot on!

Thanks a million.
mccarl (IT Business Systems Analyst / Software Developer) commented:
You're welcome!!