  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 852

Python parsing format similar to json?

OK, I'm trying to write a Python script that will parse a data file. The format is similar to JSON or serialized data, I think.

Anyway, I don't really know what the best approach is, and I'm looking for some help.

Everything is contained within {}, and there are multiple subclasses within each; see the example below.

I was going to try to parse it and convert it to valid JSON somehow, but then I read about custom serialization in Python and think that may work.

Any advice is much appreciated.
nameofclass1
{
	"property1" "value1"
	"property2" "value2"
	nameofclass2
	{
		"property1" "value1"
		"property2" "value2"
	}
	nameofclass2
	{
		"property1" "value1"
		"property2" "value2"
		"property3" "value3"
	}
}
nameofclass3
{
    "property1" "value1"
}

invsman249
Asked:
 
pepr Commented:
Python has a json module in the standard library, available since Python 2.6.  See http://docs.python.org/library/json.html#module-json
 
pepr Commented:
Also, have a look at "Dive into Python 3" by Mark Pilgrim, Chapter 13, Serializing Python Objects (http://diveintopython3.org/serializing.html), namely these parts:

13.8. Saving Data to a JSON File  (http://diveintopython3.org/serializing.html#json-dump)
13.9. Mapping of Python Datatypes to JSON
13.10. Serializing Datatypes Unsupported by JSON
13.11. Loading Data from a JSON File

The book covers Python 3, but Python 2.6 is meant as a transitional version between Python 2 and Python 3, so the json module in 2.6 should behave the same way as in Python 3.
 
invsman249 Author Commented:
I've tried the json library. The problem is that, in its current state, the file won't load because it doesn't have the commas, colons and quotes in the right places.

I have no control over how this data file is output, so I need to work with the format as it is.
I'm not sure the json library will do what I want. How can I, for example, get it to read nameofclass1 when it's not in quotes?
 
pepr Commented:
I see.  What version of Python do you use?  And what result do you want from the parsing?
 
invsman249 Author Commented:
Python 2.6. Ideally, I'd like to be able to access each property and value of each of the classes. I imagine this may need some recursion, as there can be classes within classes.

However, the bulk of what I want is a second-level class that is repeated a few times. For the example, I have called it boxclass.

I would like to be able to loop over the boxclasses and read their properties and values.

Hope that makes sense.
nameofclass1
{
        "property1" "value1"
        "property2" "value2"
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
        }
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
                "property3" "value3"
        }
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
                "property3" "value3"
        }
}

 
pepr Commented:
- How dirty can the solution be?
- Do you have a description of the format, or do you have to guess it?
- Given the "boxclass" example, how do you want to call the future code from your script?
- What exactly do you expect to be returned in this case?

This is not JSON.  You should specify your needs better.  Are the pairs always formed by two strings?  Are they always pairs?  Are they always enclosed in {} preceded by a name that may repeat (say, the same 'boxclass')?
 
invsman249 Author Commented:
- How dirty can the solution be?
For the time being, whatever works.

- Do you have a description of the format, or do you have to guess it?
Here is the real description of the format:
http://developer.valvesoftware.com/wiki/VMF_documentation

- What exactly do you expect to be returned in this case?
A list, dictionary, or array? Some variable that gives access to all the properties and values.

This is not JSON.  You should specify your needs better.  Are the pairs always formed by two strings?  Are they always pairs?  Are they always enclosed in {} preceded by a name that may repeat (say, the same 'boxclass')?

Of course it's not JSON; I never said it was. I said it had a similar format. If it were JSON, I wouldn't have this problem, would I?
 
pepr Commented:
I am checking whether the JSON implementation can be (re)used to create the VMF parser.  Probably tomorrow...
 
invsman249 Author Commented:
Thanks, anything you can help with would be great.
 
LunarNRG Commented:
Given the likelihood of objects with the same name, I'm not sure converting this to JSON or reusing the json parser implementation will be particularly useful. I'm not a pyparsing[1] expert by any means, so trust this code as far as you can throw it, but the following seems to work with your example data and a few other samples from the VMF documentation:

#!/usr/bin/env python
from pyparsing import *

data = """\
nameofclass1
{
        "property1" "value1"
        "property2" "value2"
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
        }
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
                "property3" "value3"
        }
        boxclass
        {
                "property1" "value1"
                "property2" "value2"
                "property3" "value3"
        }
}"""

VMF = Forward()

vmfClassName = Word(alphanums + '_')
vmfProperty = vmfValue = dblQuotedString.setParseAction(removeQuotes)

vmfMember = Group(vmfProperty + vmfValue)
vmfObject = vmfClassName + Suppress('{') + Group(VMF) + Suppress('}')

VMF << OneOrMore(vmfMember | vmfObject)

vmfComment = '//' + restOfLine
VMF.ignore(vmfComment)

if __name__ == '__main__':
    results = VMF.parseString(data)
    from pprint import pprint 
    pprint(results.asList())



FWIW, this is very loosely based on the pyparsing example json parser, here:
  http://pyparsing.wikispaces.com/file/view/jsonParser.py

HTH!

[1] http://pyparsing.wikispaces.com/HowToUsePyparsing
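
To illustrate the concern about repeated names: a JSON object maps onto a Python dict, so several boxclass blocks turned into keys of one object would collapse into the last one. Below is a minimal sketch using a made-up JSON string (an assumption for illustration, not real VMF output). Note that object_pairs_hook was added in Python 2.7 and 3.1, so it is not available on 2.6.

```python
import json

# Hypothetical result of naively converting two boxclass blocks to JSON.
text = '{"boxclass": {"id": "1"}, "boxclass": {"id": "2"}}'

# Default behaviour: objects become dicts, so the later key overwrites the earlier one.
as_dict = json.loads(text)
print(as_dict)

# object_pairs_hook receives each object as an ordered list of (key, value)
# pairs; returning that list unchanged keeps the duplicates.
as_pairs = json.loads(text, object_pairs_hook=list)
print(as_pairs)
```

With the default loader only one boxclass survives, which is why a structure that stores children in a list is likely a better fit for this format.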
 
pepr Commented:
It is definitely a good idea to use an existing parser.  If you want to write your own, it is better to do it in two phases.  The first phase is a lexical analyzer (http://en.wikipedia.org/wiki/Lexical_analysis) that converts the text into more abstract pieces called tokens (or lexemes).  Here is a "dirty" lexical analyzer for the purpose:

b.py
"""Dirty VMF lex parser."""

import re

rexId = re.compile(r'\s*(\w+)\s*', re.MULTILINE)
rexLBrace = re.compile(r'\s*({)', re.MULTILINE)
rexRBrace = re.compile(r'\s*(})', re.MULTILINE)
rexStr = re.compile(r'\s*"(.*?)"', re.MULTILINE)
rexComment = re.compile(r'\s*(//.*?)$', re.MULTILINE)

def lex(s, pos=0):
    while pos < len(s):

        m = rexId.match(s[pos:]) 
        if m is not None:
            yield 'id', m.group(1)
            pos += m.end()
            continue
            
        m = rexLBrace.match(s[pos:])
        if m is not None:
            yield 'lbrace', ''
            pos += m.end()
            continue
            
        m = rexRBrace.match(s[pos:])
        if m is not None:
            yield 'rbrace', ''
            pos += m.end()
            continue
            
        m = rexStr.match(s[pos:])
        if m is not None:
            yield 'str', m.group(1)
            pos += m.end()
            continue
            
        m = rexComment.match(s[pos:])
        if m is not None:
            yield 'comment', m.group(1)
            pos += m.end()
            continue
            
        yield 'error', s[pos:]
        pos = len(s)
   

s = '''  
// This is a comment.
ClassName_1
{
      "Property_1" "Value_1"
      "Property_2" "Value_2"
      ClassName_2
      {
            "Property_1" "Value_1"
      }
}'''
for t in lex(s):
    print t




It produces the following output (one tuple = one token):

C:\tmp\___python\invsman249\Q_27026093>python b.py
('comment', '// This is a comment.')
('id', 'ClassName_1')
('lbrace', '')
('str', 'Property_1')
('str', 'Value_1')
('str', 'Property_2')
('str', 'Value_2')
('id', 'ClassName_2')
('lbrace', '')
('str', 'Property_1')
('str', 'Value_1')
('rbrace', '')
('rbrace', '')
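
As an alternative sketch (not pepr's code), a similar token stream can be produced with one compiled pattern whose named groups act as the token kinds, so the text is scanned once instead of being re-sliced for every token. The group names and the tokenize helper are illustrative assumptions; escaped quotes inside strings are not handled, and comments are skipped rather than yielded.

```python
import re

# One alternation; the name of the group that matched becomes the token kind.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*)     # line comment, skipped
  | (?P<str>"[^"\n]*")        # double-quoted string, escapes not handled
  | (?P<lbrace>\{)
  | (?P<rbrace>\})
  | (?P<id>\w+)               # bare class name
  | (?P<ws>\s+)               # whitespace, skipped
  | (?P<error>.)              # any other single character
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) tuples, skipping whitespace and comments."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind in ('ws', 'comment'):
            continue
        value = m.group(kind)
        if kind == 'str':
            value = value[1:-1]          # strip the surrounding quotes
        elif kind in ('lbrace', 'rbrace'):
            value = ''                   # same shape as the tuples above
        yield kind, value


sample = '''
// This is a comment.
ClassName_1
{
      "Property_1" "Value_1"
      ClassName_2
      {
      }
}
'''
tokens = list(tokenize(sample))
for t in tokens:
    print(t)
```

Unlike the lexer above, an unexpected character produces a single 'error' token and scanning continues, which may or may not be what you want.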

 
pepr Commented:
The second phase is to build the parser (a syntactic analyzer) on top of the lexical analyzer.  There are several approaches, and the choice depends on the complexity of the language.  Simple languages can be analyzed in a "top-down" manner using recursive function calls, where each function reflects one syntactic category.  See http://en.wikipedia.org/wiki/LL_parser.  (However, recursive solutions are usually used only in simple cases.)  Study the following code and the result.  Each element is represented as a dictionary with 'class' (the string identification), 'props' (a dictionary of properties), and 'children' (the list of child class structures):

c.py
"""Dirty VMF lexer and recursive parser."""

import re

rexId = re.compile(r'\s*(\w+)\s*', re.MULTILINE)
rexLBrace = re.compile(r'\s*({)', re.MULTILINE)
rexRBrace = re.compile(r'\s*(})', re.MULTILINE)
rexStr = re.compile(r'\s*"(.*?)"', re.MULTILINE)
rexEmpty = re.compile(r'^\s*$', re.MULTILINE)
rexComment = re.compile(r'\s*(//.*?)$', re.MULTILINE)

def lex(s, pos=0):
    while pos < len(s):

        m = rexId.match(s[pos:])       # class identifier
        if m is not None:
            yield 'id', m.group(1)
            pos += m.end()
            continue
            
        m = rexLBrace.match(s[pos:])
        if m is not None:
            yield 'lbrace', ''
            pos += m.end()
            continue
            
        m = rexRBrace.match(s[pos:])
        if m is not None:
            yield 'rbrace', ''
            pos += m.end()
            continue
            
        m = rexStr.match(s[pos:])
        if m is not None:
            yield 'str', m.group(1)
            pos += m.end()
            continue
            
        m = rexComment.match(s[pos:])
        if m is not None:
            #yield 'comment', m.group(1)  # ignore the comments
            pos += m.end()
            continue
            
        m = rexEmpty.match(s[pos:])
        if m is not None and m.end() > 0:   # a zero-length match would loop forever
            #yield 'empty', m.group()     # ignore the empty lines
            pos += m.end()
            continue
            
        yield 'error', s[pos:]
        pos = len(s)
   

def element(lst):
    """Returns the dictionary and position of the first unparsed token."""
    if len(lst) == 0:
        return {}, 0         # Empty list means empty element (no name, nothing).
        
    elif lst[0][0] == 'id':           # name of the element
        d = {}                        # new element saved in the dictionary
        d['class'] = lst[0][1]        # name stored as the class name
        assert lst[1][0] == 'lbrace'  # ... followed by the left brace 
        pos = 2
        d['props'], p2 = props(lst[pos:])       # extract the properties as a dict
        pos += p2
        d['children'], p2 = children(lst[pos:]) # extract the child elements as a list
        pos += p2
        assert lst[pos][0] == 'rbrace' # the right brace closes the element
        return d, pos + 1
    else:
        raise ValueError('expected a class name, got %r' % (lst[0],))
        
        
def props(lst):
    """Returns the dictionary of properties and position of the first unparsed token."""
    pos = 0
    p={}
    while pos < len(lst) and lst[pos][0] == 'str':  # loop until there are no str tokens
        assert (pos+1) < len(lst)       # a property name needs a value after it
        assert lst[pos+1][0] == 'str'
        p[lst[pos][1]] = lst[pos+1][1]
        pos += 2
    return p, pos


def children(lst):
    """Returns the list of children elements and position of the first unparsed token."""
    pos = 0
    chlst=[]
    while pos < len(lst) and lst[pos][0] == 'id':   # loop until there are no child elements
        d, p2 = element(lst[pos:])
        pos += p2 
        chlst.append(d)
    return chlst, pos


# Let the following be the parsed input.
s = '''  
// This is a comment.
ClassName_1
{
      "Property_1" "Value_1"
      "Property_2" "Value_2"
      ClassName_2
      {
            "Property_1" "Value_1"
      }
}'''


# This way the lexical analyzer can be used.
for t in lex(s):
    print t
    
    
# Make the list out of the lexical symbols and process it by the recursive 
# parser. The element must always be the root structure.
lst = list(lex(s))
print '=' * 60
print len(lst)
print lst
print '-' * 60
e, pos = element(lst)
print e
    
# Now show something about the element e.
print '-' * 60
print 'Element', e['class'], 'with', len(e['props']), 'properties',
print 'and', len(e['children']), 'children.'
print '  Properties:'
for k in e['props']:
    print '    ', k, ':', e['props'][k]
    
print '  Children (names only):'
for child in e['children']:
    print '    ', child['class']



It prints the following:

C:\tmp\___python\invsman249\Q_27026093>python c.py
('id', 'ClassName_1')
('lbrace', '')
('str', 'Property_1')
('str', 'Value_1')
('str', 'Property_2')
('str', 'Value_2')
('id', 'ClassName_2')
('lbrace', '')
('str', 'Property_1')
('str', 'Value_1')
('rbrace', '')
('rbrace', '')
============================================================
12
[('id', 'ClassName_1'), ('lbrace', ''), ('str', 'Property_1'), ('str', 'Value_1'
), ('str', 'Property_2'), ('str', 'Value_2'), ('id', 'ClassName_2'), ('lbrace',
''), ('str', 'Property_1'), ('str', 'Value_1'), ('rbrace', ''), ('rbrace', '')]
------------------------------------------------------------
{'children': [{'children': [], 'class': 'ClassName_2', 'props': {'Property_1': '
Value_1'}}], 'class': 'ClassName_1', 'props': {'Property_1': 'Value_1', 'Propert
y_2': 'Value_2'}}
------------------------------------------------------------
Element ClassName_1 with 2 properties and 1 children.
  Properties:
     Property_1 : Value_1
     Property_2 : Value_2
  Children (names only):
     ClassName_2



That is the result for the analyzed string:

// This is a comment.
ClassName_1
{
      "Property_1" "Value_1"
      "Property_2" "Value_2"
      ClassName_2
      {
            "Property_1" "Value_1"
      }
}




Note that this is a somewhat DIRTY solution, and not a purely recursive one...
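
For the original goal of looping over the boxclass blocks, here is a small sketch that walks the kind of structure shown above. It assumes the dictionary shape produced by element() ('class', 'props', 'children'); the root literal is hand-written sample data, and find_all is a hypothetical helper, not part of c.py.

```python
# Hand-written sample in the shape returned by element().
root = {
    'class': 'nameofclass1',
    'props': {'property1': 'value1', 'property2': 'value2'},
    'children': [
        {'class': 'boxclass',
         'props': {'property1': 'value1', 'property2': 'value2'},
         'children': []},
        {'class': 'boxclass',
         'props': {'property1': 'value1', 'property2': 'value2',
                   'property3': 'value3'},
         'children': []},
        {'class': 'boxclass',
         'props': {'property1': 'value1', 'property2': 'value2',
                   'property3': 'value3'},
         'children': []},
    ],
}


def find_all(elem, name):
    """Yield every descendant of elem whose class name equals name."""
    for child in elem['children']:
        if child['class'] == name:
            yield child
        for hit in find_all(child, name):   # recurse into nested classes
            yield hit


boxes = list(find_all(root, 'boxclass'))
for box in boxes:
    for key in sorted(box['props']):
        print('%s = %s' % (key, box['props'][key]))
    print('-' * 20)
```

Because children are kept in a list, repeated boxclass entries are all preserved and come back in file order.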
 
invsman249 Author Commented:
Thanks, pepr, that's great and gives me something to work with now. Thank you!
Thank you for your input too, LunarNRG!
 
pepr Commented:
I am glad if it helped.  Anyway, in my opinion, LunarNRG deserved at least a portion of the points.  ;)
 
invsman249 Author Commented:
Hmm, my bad. I haven't used Experts Exchange much; I should have given some points to LunarNRG. I don't think it will let me now...
 
pepr Commented:
You can possibly "Request attention" of the zone administrator and then re-assign the points.
 
invsman249 Author Commented:
Thanks, I've tried that, pepr. Hopefully an admin sorts it out. Again, thank you to both of you :)
 
LunarNRG Commented:
I'm glad you found your answer; please don't worry about the points. I benefit from pepr's answers as well, as they are always very comprehensive and on target. Thank you for thinking of me! :)
