
Parsing Delimiters (theory)


  • Parsing Delimiters (theory)

    Doesn't the title make you excited?! (lol)

    I'm working on a JSON(esque) parser that I intend to use for something kinda-sorta Quake related, in the future. For those that are not aware, here is an example of JSON.

    It's nothing more than XML written in object format. You access values with dot syntax, and the lookup in the example returns freddy55. Simple stuff. Flash already has a JSON parser but, as I said, I am making a JSON(esque) parser, so I needed to write my own in order to handle the new syntax/possibilities.

    This means that I am just feeding Flash a string and I need to explain to it how to loop through delimiters and gather information. This posed an immediate puzzle: how do I get the proper delimiters?

    For instance "players" is the root parent object, so I need the first and last delimiters, BUT there are a lot of identical close delimiters between. I solved this problem and the below screenshot is my heavily commented code.

    I'm hoping this post can serve one or more of the following purposes.

    1) my code is right on and others can duplicate it in their language of choice, should they need such a script
    2) someone else who has done this before knows a better way and is inclined to share it (in full - I don't care what language, functions/loops/etc are pretty universal)
    3) someone will post a truck (actually I just threw this one in)

    edit: one hole that needs to be closed is the last else{}. Technically match and count could be off and it would still act like a success. A check needs to be made, and if it fails it needs to display a "script error". The only situation where this could happen is if the programmer didn't properly close/end something.

    I used a screenshot because the syntax highlighting makes it easier to read and understand, but if commented code is still too confusing, here is a walk-through, using an object that is stripped down to its delimiters.



    find :{ - result: :{
    find }; from :{ - result: };
    count :{ between :{ and }; - result: count = 2
    start at }; and find }; - result: };
    find :{ between }; and }; - result none - match+1=1
    start at }; and find }; - result: };
    find :{ between }; and }; - result 1 - match+0=1
    start at }; and find }; - result: };
    find :{ between }; and }; - result none - match+1=2

    match = count and we have the proper delimiter
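
The walk-through above can be approximated in a few lines. This is a minimal Python sketch, not the ActionScript from the screenshot, and it simplifies the bookkeeping: rather than tracking separate count and match variables, it keeps extending to the next close delimiter until the captured span holds equal numbers of open and close delimiters. The name `find_matching_close` and its parameters are made up for illustration; only the `:{` and `};` delimiters come from the walk-through. It also raises the "script error" from the edit note when nothing balances.

```python
def find_matching_close(text, open_d=":{", close_d="};", start=0):
    """Return (open_index, close_index) for the first delimited block at/after start.

    Hypothetical helper: extends the candidate span one close delimiter at a
    time until the open and close counts inside the span are equal.
    """
    first = text.find(open_d, start)
    if first == -1:
        return None  # no open delimiter at all

    end = text.find(close_d, first)
    while end != -1:
        # Snapshot runs from the first open delimiter through this close delimiter.
        snapshot = text[first:end + len(close_d)]
        if snapshot.count(open_d) == snapshot.count(close_d):
            return first, end
        # Counts differ, so this close belongs to a nested block: keep searching.
        end = text.find(close_d, end + len(close_d))

    # Ran out of close delimiters before the counts matched.
    raise ValueError("script error: unbalanced delimiters")
```

For the string `p:{a:{};b:{c:{};};};x` this returns `(1, 18)`, the positions of the root open delimiter and its true close.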
    Last edited by MadGypsy; 12-10-2012, 06:11 AM.

  • #2
    Cool stuff. I'm interested to see where this is going!


    • #3
      Well, for now, let's say that "where this is going" is a JSON matrix that will be parsed to 3d vectors. My "engine" will be far more powerful than simply displaying 3d, but like I said, for now let's say this is a "model format" in the making.

      If you noticed in the snapshot, I am hundreds of lines down in the class. This is because what the class actually does (so far) is:

      1) load xml, txt or json
      2) if xml or txt - return
      3) if json search for imports (ex [someFile.json] )
      4) if imports exist, import them and append in place
      5) when all imports are gathered/included find all text and sequester it in an array leaving behind a "token" in the script
      6) remove all comments
      7) remove all whitespace
      8) start processing delimiters
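
Steps 6 and 7 are easy to sketch once the quoted text has been swapped out for tokens in step 5. The Python below is only an illustration under assumptions: the post never states the comment syntax, so `//` line comments and `/* */` block comments are guesses, and `strip_comments_and_whitespace` is a made-up name.

```python
import re


def strip_comments_and_whitespace(script):
    """Remove comments, then all whitespace, from a script with no quoted text left.

    Assumes // and /* */ comment styles (not confirmed by the post) and that
    string contents were already replaced by tokens, so nothing here can be
    damaged by the blanket whitespace removal.
    """
    script = re.sub(r"/\*.*?\*/", "", script, flags=re.DOTALL)  # block comments
    script = re.sub(r"//[^\n]*", "", script)                    # line comments
    return re.sub(r"\s+", "", script)                           # every whitespace run
```

Doing this after the text is sequestered is what makes the blunt whitespace removal safe: spaces inside strings are no longer in the script at that point.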

      This is where I am now. My next step will be to start converting delimited areas into an actual object that flash recognizes. From there I can begin building my matrix and vector3d classes, followed by plugging the JSON in and having it provide the data that powers the matrix/vector classes.

      I don't know what the fux I'm doing. I wanna build an engine (of sorts), so I'm re-inventing the wheel and figuring it out as I go.

      A quake engine?

      No. It's hard enough building one that works the way I envision it. Trying to make everything understand stuff that I don't even understand is too much work. I'm going completely rogue.
      Last edited by MadGypsy; 12-10-2012, 02:04 PM.


      • #4
        Originally posted by MadGypsy
        It's nothing more than xml written in object format.
        You know what this reminds me of? This reminds me of Computer Programming class! Computer Programming is harder than University Mathematics (in my opinion). If you make one mistake, the system won't run. You have to check everything all over again. (It's like trying to find a penny in a football field.)

        I heard that programmers only write seven lines in one day. Programmers impress me!
        "Through my contact lenses, I have seen them all, I've seen wicked clowns and broken dreams / Crazy men in jumpsuits trying to be extreme and messing around with your computer screen" - Creative Rhyme (03/23/2012)


        • #5
          Hah! I caught an oversight. Thankfully it was incredibly easy to fix.

          I changed:
          del_count = snapshot.match(delimiter).length;
          Why does this matter?

          Well, on forward searches it was just checking for open delimiters and skipping ahead if one was found. Unfortunately, if there were two or more open delimiters in that search, it would result in the wrong end delimiter being found. This is because any more than one open delimiter means more needs to be closed, but I wasn't storing the additional necessary closures.

          The new method counts all delimiters and then follows all the same directives - don't stop til the counts match. I'd say this method is definitive. Everything is counted and processed through a perfect algorithm.

          I'd be interested if anyone has done this purely with regEx. The majority of my code is fancy talk for telling flash to regEx it for me. I'd be hella impressed/interested in the raw regExp's though.
          Last edited by MadGypsy; 12-11-2012, 01:41 AM.


          • #6
            Well, in my comp sci class, I learned one solution is to use a stack... each opening brace pushes (adds) onto the stack, and each closing one pops (removes) an entry. If you pop too many times, too early (without any opening brace), or not enough times, you can detect it. And otherwise, if you have a successful pop, then you know you just went through some text, and the stack size tells you how deep you are.

            Actually, couldn't you have a variable, say "depth" that was increased for every opening brace, and decreased for a closing brace? Simply checking that it's never negative, and at zero after parsing should be a good way to check for errors.
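
The depth idea can be sketched like this. It is a minimal Python illustration, assuming single-character `{` and `}` delimiters by default; the function name `check_depth` is hypothetical. It scans once, never lets the depth go negative, and requires it to be back at zero at the end.

```python
def check_depth(text, open_d="{", close_d="}"):
    """Validate nesting with a single depth counter and return the deepest level seen."""
    depth = 0
    max_depth = 0
    i = 0
    while i < len(text):
        if text.startswith(open_d, i):
            depth += 1
            max_depth = max(max_depth, depth)
            i += len(open_d)
        elif text.startswith(close_d, i):
            depth -= 1
            if depth < 0:
                # A close delimiter appeared before any matching open.
                raise ValueError(f"unexpected close delimiter at index {i}")
            i += len(close_d)
        else:
            i += 1
    if depth != 0:
        raise ValueError(f"{depth} block(s) left unclosed")
    return max_depth
```

The counter plays the role of the stack size: a stack is only needed when you also want to remember where each open delimiter was.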


            • #7
              1) Stack method:

              I am aware of this method, but I believe my method is actually much faster. Instead of traversing the entire string one delimiter at a time and pushing/popping arrays, which uses RAM and processor resources, I store nothing but the perfectly delimited chunk after regEx has determined what that is. I have one loop, but its iterations are not dictated by the number of characters:

              while (delimiter counts don't match)
              //keep searching

              2) depth:

              This is essentially what I'm already doing, just twisted. Your version has subtraction and one var. Mine has 2 vars and compares them for equality. The difference is nominal, because I could convert to your way by changing only 1 or 2 lines of code and the results would be the same.


              Overall, my parse delimiter code is perfect. You can feed it any open and close delimiters you want (that go together) and it will always return the proper contents. It even knows if the programmer fucked up and forgot to delimit something. It does not correct the error, but it does report it. Correcting the error would entail far far too much work. I would have to write a script that can determine where the error began. It may even be impossible, because after all text is sequestered, whitespace is removed, so I don't even have a line break or something to work with.

              I'm at the point now where all Object and Arrays are created and all non object/array type data is stored in thisObjectOrArray.contents, for the Object or Array that those name:value pairs belong to.

              Now, I'm looping through the parent object, finding all objects/arrays within and passing their this.contents to another parser that starts converting the name value pairs to (int, uint, number, hex, string, & boolean).

              Once I get past a lil hump on that end, I will begin adding more complex types like (function, condition, loop, etc). The reason I am putting these types off til last is because up until now everything is name:value pairs that store data types. Adding the ability to externally (and successfully) call external loops, functions, etc is going to prove to be an adventure, I am sure.

              I'm not above hacking it into the system as long as it functions with stability. If I have to bust out a hex editor and traverse Flash op-codes til I find the one for (ex) function - so be it. Honestly, I wouldn't even use Flash at all if I was as much of a beast in some other equally (or more) powerful language. Using a PHP, JavaScript, HTML combo would be incredibly slow, and that is my only other personal option given my current education.


              • #8
                Oh, this is good. This is real good.

                Hell F'n Yes!!

                This is my current state. The red on the left is output directly from the objects my script created. As you can see, it recognized every single name:value pair. Upon further inspection you may say "Wait, it didn't get every name:value pair. Display and Object aren't listed." - HAH! Wrong. All of my name:value pairs are children of those Objects. The fact that they appeared at all proves the objects have been created and the loops are finding them. To say that all my name:value pairs are correct but that somehow Obj.display or Obj.display.object don't exist is ludicrous.

                There is more going on here though. In the blue boxes you will notice I pass an escaped character and it gets retained (\"). In the red box there is a lot of value:"TEXT[num]". That is actually another feature. One of the first things my script does is sequester ALL text (ex "I am in quotes, therefore I am...text") into an array and leave behind a token with the proper array index. I do this because text can contain delimiters that I don't want parsed. However, it has the added bonus that all text is now stored in an array and is easily assignable to any var I want.
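
The sequestering step can be sketched as below. This is a Python illustration, not the original ActionScript: the `TEXT[n]` token shape comes from the output described above, while the function name `sequester_text` and the exact regular expression are assumptions. The pattern treats a backslash plus any character as one unit, so an escaped quote (\") stays inside the string.

```python
import re

# A double-quoted run where \" (or any backslash escape) does not end the string.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')


def sequester_text(source):
    """Replace each quoted string with a TEXT[n] token and return (script, texts)."""
    texts = []

    def stash(match):
        texts.append(match.group(0)[1:-1])   # store contents without the outer quotes
        return f"TEXT[{len(texts) - 1}]"     # token carries the array index

    return STRING_RE.sub(stash, source), texts
```

After this pass, any delimiters that were inside quotes live only in the `texts` list, so the delimiter search cannot trip over them.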

                Wait, there's more (lol). This also proves that my import feature is working perfectly as well.

                There is still more to do. Technically (as is) the script could definitely be used, but it is lacking some real balls. The name:value pairs need to be processed into strict types (int, string, etc) and the Array feature I built is entirely untested (no arrays in my test string). I have to build a listener to determine when to do the final Object build, because as is, it's building the object multiple times. Regardless of these unhandled issues, my Object parser not only already works, it is also very stable. It will only continue to become more so as time goes on.


                • #9

                  Two things.

                  First, if the data is best suited to using a binary format, then I'd advise to just use a binary format. Text-parsing is gross, complex and colossally error-prone; don't be taken in by religious arguments that insist otherwise.

                  Second, if a text format is - despite that - deemed most appropriate, then I'd really really really strongly advise to use Quake's built-in COM_Parse format. It may not be as sexy as the latest and greatest flavour of the month text formats, but it's consistent, it's compatible, and it allows you to reuse a whole bunch of code.
                  IT LIVES!


                  • #10
                    lol - this has nothing to do with Quake, bro. I'm building a JSON engine, which would be impossible to do if I don't write a JSON parser. My "engine" is primarily geared towards being a "browser" of sorts. Consider my JSON the equivalent of HTML (if you will) and my "engine" is the browser in which to view it.

                    Secondly, any 3D anything that I allow my engine to do will be based on Flash's Vector3D and Matrix3D classes. Guess what kind of information these classes expect....(jeopardy music)...that's right, text...not bytes. As a matter of fact, my JSON object could be plugged directly into them (@COM_Parse format).

                    So, I get what you're saying and everything, but I know what I'm building and how well it will work. You probably would hate my version of creating an object from strings encrypted in image data.
                    Last edited by MadGypsy; 12-14-2012, 07:57 PM.


                    • #11
                      If we're just seeing the output of the reading stage, we don't know what actually is the structure of the objects. As in, do the fields within 'display' exist in a way that when given 'app', can we get to 'color'?

                      For robustness ideas- are there checks for empty fields within delimiters, after a colon, or just duplicate names within the same depth?


                      • #12
                        good questions!

                        I'm gonna try and answer them (I'm short on time).

                        for "display" in particular, this name is recognized by my engine. There will be other names like this (tween, event). Since these are recognized names and there is mega potential for them to end up being "same-names" (ie you want more than one "display" in the object), what my engine does is find these names and convert them to an array index. So display in my engine is actually an array, and each index holds the object of the same name in the order they were encountered.
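
A rough sketch of that same-name handling, in Python with plain dicts and lists standing in for Flash Objects and Arrays. The keyword set uses only the names mentioned above; `add_pair` and `RECOGNIZED` are hypothetical names for illustration.

```python
# Names the engine treats as keywords; repeated use is expected for these.
RECOGNIZED = {"display", "tween", "event"}


def add_pair(obj, name, value):
    """Store a name:value pair, collecting recognized names into an ordered list."""
    if name in RECOGNIZED:
        # Each occurrence gets the next index, in the order encountered.
        obj.setdefault(name, []).append(value)
    else:
        obj[name] = value
```

With this, two `display` blocks in one parent end up as `display[0]` and `display[1]` instead of the second silently replacing the first.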

                        You will never have to retrieve an actual display because the engine parses the entire display list into the objects that they have been described to be. In the case where you do need to talk about another display, you will do this within the "target" field using the display objects .name field.

                        I'm doing it this way because the Object itself is more of a "config". When the displays actually get created (the actual display) it will not be held in any index of the Objects array, it will get assigned the given name in the name field and always referenced that way.

                        This will break the coupling from the Object, as no more references are passed through the Object in order to access the element that was actually created.

                        {struggling to explain this in plain english}

                        For the second part of your question, I am a bit confused as to what you're asking. A completely empty field is possible. This would mostly happen in arrays that are not associative. They would have fields that have no name.

                        stuff[2][2] = "ugly"

                        as you see, nothing has a name and I even have an unnamed array within the array. This is the ONLY situation where you can eliminate names (while within an array). Everything else is objects and vars and must have a name. In the case of being inside of an array, I have the index to reference by and names are not necessary.

                        You will absolutely be able to reference the property(ies) of any Object that is created, but my engine will switch your access from tapping the Object for the info to tapping the actual created element for the info.

                        I hope this answered your questions. When dealing with stuff like this, it is hard to simply answer an unillustrated question. What you mean and what I think you mean can easily be 2 different things.

                        On a side note, I am at the part in my code where I am allowing arrays to have nameless content and assuring the content is in the same order that it was provided. I intend to have this finished before the end of the night. Once that is complete I will have a fully qualified Object of vars that I can start building example apps from. I don't need the other functionalities that I intend to include in order to get some things started with its current ability. Also, by utilizing this "Level 1" completion I can take the time to make sure everything so far is working as expected before making it even more complex.

                        Thank you for your interest in my non-Quake project


                        • #13

                          I have it perfectly typing arrays, objects, strings, ints/numbers & booleans. If you compare data you may notice some differences. These differences are good. For instance 0x000000 & 0xA30808 change to 0 & 10684424 (respectively). These are color values that would be converted to an int anyway. I saved time and did it on assignment.

                          You also may notice that question = false, but in its output it says true. If you look at my trace code, I am not tracing its value, I am tracing whether or not it is typed as a boolean, and it is, so it is right.
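
The typing step might look something like this in Python. It is a sketch under assumptions: `coerce` is a made-up name, the order of checks is a guess, and Python's int/float stand in for Flash's int/uint/Number. It does reproduce the conversions quoted above (0x000000 becomes 0 and 0xA30808 becomes 10684424).

```python
def coerce(raw):
    """Turn a raw value string into a bool, int, or float, or leave it as a string."""
    if raw in ("true", "false"):
        return raw == "true"
    try:
        if raw.lower().startswith("0x"):
            return int(raw, 16)  # hex colour values are converted on assignment
        return int(raw)
    except ValueError:
        pass  # not an integer; try a float next
    try:
        return float(raw)
    except ValueError:
        return raw  # anything else (e.g. a TEXT[n] token) stays a string
```

Checking `isinstance(coerce("false"), bool)` gives True even though the value is False, which is the same distinction the trace output makes between a value and its type.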

                          Finally, we can take a look at nesting. In the most deeply nested element of display[1] I have a String inside of a nameless Array, inside of a named Array, inside of a named Object, inside of a named Array, inside of a nameless Object that is represented by an Array index, inside of a named Object (the ultimate parent). The way I trace those values is identical to how I wrote them in the original string.

                          All that being said, I have some clean-up to do and I can put a fork in this for this level. "This level" being a 100% working string to JSON parser that recognizes all the basic var types.

                          note: Obviously ignore the actual object as it is ridiculous. I wasn't focused on a sensical object at this point. I just needed all the parts to be there for testing.


                          • #14
                            I am actually going to add one more feature before I stick a fork in this. I don't know how complicated it will be, but I'm sure I will figure it out.

                            I want the ability to do overwrites on "template" JSON. This way, if "you" describe a certain element in JSON that needs to be used multiple times but only a handful of properties need to be changed, you can simply import that element into the script followed by an "overwrite" (?object?) that will only overwrite the changes. This way, for that button that looks just like the other button except that the text or the event it triggers is different, you won't have to remake it from scratch.
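
One possible shape for the overwrite is a recursive merge applied to the parsed result. The Python below is only a sketch of that idea, with dicts standing in for parsed Objects; `apply_overwrite` is a hypothetical name and this is not how the parser does it (the feature is not built yet).

```python
import copy


def apply_overwrite(template, overwrite):
    """Return a copy of template with only the properties named in overwrite replaced."""
    result = copy.deepcopy(template)  # leave the template reusable for the next button
    for key, value in overwrite.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are objects: merge one level deeper instead of replacing.
            result[key] = apply_overwrite(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result
```

A button template with a nested label object could then be reused by supplying only the new label text and event name; every other property carries over unchanged.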

                            1) Import support
                            2) strips comments
                            3) strips whitespace without affecting strings
                            4) escaped and special characters allowed in strings
                            5) Object, Array, String, Number & Boolean recognition
                            6) maintains all proper parent/child relationships
                            7) Same-name support for recognized names (keywords)
                            8*) Overwrite support

                            ...moving right along.

                            EDIT: My parser can already parse .map files. I haven't told it what to do with that information yet, but I just want to throw that out there. Technically, you can take a .map straight from radiant and my script will perfectly parse it into an object. I'd say I'm a trip through the Vector3D and Matrix3D classes away from being able to display .map files.
                            Last edited by MadGypsy; 12-20-2012, 10:10 PM.


                            • #15
                              Wow, overwriting has proven to be a huge challenge. I can't seem to find a good spot in my current script to start. I could just do it after the entire object has been created, but that poses its own problems and seems sloppy to me.

                              I think I need to rethink the way that I intend to store an overwrite, or attempt to expand the overwrite concept as a whole.


                              However, just because overwrite is not established, it does not mean that my work thus far is without result. In the case of the image above we have an example of a simple shader class that I wrote. As I write more "types", more will be possible.

                              I think I'm going to put overwrite on hold and get the parser to load all images, on compile, directly into the object. This way all assets will be available upon "app start". Plus, most of my classes expect BitmapData, not a URL.