Decompiling every progs.dat


  • #16
    So the parm_start section is:
    -1st parameters (if they exist)
    -2nd locals (if they exist)
    -3rd statement scratchpad (if needed)

    I'm guessing by 'temps' you are referring to what I call a 'scratchpad'. Basically where it writes results of one statement to be used by the next statement.
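A minimal sketch of the layout described above, assuming the vanilla dfunction_t fields parm_start, locals (total 4-byte cells, parameters included) and parm_size (cells per parameter). The Function class and split_frame helper are hypothetical names for illustration, and whether a given compiler puts its scratchpad cells inside this range varies.

```python
from dataclasses import dataclass

@dataclass
class Function:
    """Subset of dfunction_t; field names follow the vanilla header."""
    parm_start: int    # first globals index owned by the function
    locals: int        # total 4-byte cells: parameters plus locals
    parm_size: tuple   # cells per parameter (1, or 3 for a vector)

def split_frame(fn):
    """Return (list of parameter ranges, range of the remaining cells)."""
    params = []
    ofs = fn.parm_start
    for size in fn.parm_size:
        params.append(range(ofs, ofs + size))
        ofs += size
    # Whatever is left after the parameters: locals, possibly temps.
    rest = range(ofs, fn.parm_start + fn.locals)
    return params, rest

# Hypothetical function: one float parameter, one vector parameter, two more cells.
fn = Function(parm_start=100, locals=6, parm_size=(1, 3))
params, rest = split_frame(fn)
```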

    I haven't gotten into how it deals with the stack just yet.

    I've written a program to automate the disassembly, so now I have all functions with their corresponding function names, source file names, and statements (each with opcode names). My questions currently are:
    -is edict self always 0x1c? If not how do you correspond 0x1c with self?
    -is field .impulse always 0xad? (I assume this is a byte offset from the beginning of the structure.)
    -the parameters/locals/scratchpad numbers I assume to be offsets from a block, but where is this block?
    -do function calls refer to function index, function byte_offset, statement index, or statement byte_offset?

    P.S. might want to clear your PMs: Spike has exceeded their stored private messages quota and cannot accept further messages until they clear some space.
    Last edited by Hypersonic; 12-03-2014, 11:49 AM.



    • #17
      dstatement_t a,b,c; generally refers to an index into the globals array. the specified offset is the first float of the variable (ie: offset+0 for any type, offset+1 and offset+2 as well if the instruction expects a vector).
      the globals array is really just a block of memory containing all immediate etc values. most of it will be initialised to 0, but some locals+globals might be preinitialised and yeah, immediates will be found here.
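A rough sketch of what that means for a disassembler, assuming the vanilla 8-byte dstatement_t layout (unsigned 16-bit op, then three signed 16-bit operands, little-endian). The helper names are made up for this example. Masking the operand to 16 bits before using it as a globals index is an assumption about how the engine treats it; jump operands stay signed.

```python
import struct

STATEMENT = struct.Struct("<Hhhh")  # op, a, b, c

def decode_statements(blob):
    """Split the raw statements lump into (op, a, b, c) tuples."""
    return [STATEMENT.unpack_from(blob, i)
            for i in range(0, len(blob), STATEMENT.size)]

def global_index(operand):
    """Treat an operand as an unsigned index into the globals array."""
    return operand & 0xFFFF

def read_float(globals_blob, index):
    """Globals are 4-byte cells, so cell N starts at byte 4*N."""
    return struct.unpack_from("<f", globals_blob, 4 * index)[0]

def read_vector(globals_blob, index):
    """A vector is the cell at index plus the two cells after it."""
    return struct.unpack_from("<3f", globals_blob, 4 * index)

# Made-up data: six globals and a single statement (op 6, a=1, b=2, c=5).
globals_blob = struct.pack("<6f", 0.0, 1.5, 2.0, 3.0, 4.0, 0.0)
statements = decode_statements(struct.pack("<Hhhh", 6, 1, 2, 5))
```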

      the globaldefs struct array specifies the names+types+offsets of the various globals.
      a global might not have a 'globaldefs' value specified, in which case it is commonly a temp, but could also be one of the special OFS_foo globals, or a local (with optimisations), or an immediate (with optimisations: will have the name "IMMEDIATE" with vanilla qcc).

      the globals are just an array of 4-byte elements. types are inferred from the separate globaldefs array or from the instruction that refers to them, as appropriate. the VM must be written in a language that supports+uses unions. you cannot store floats and ints in separate blocks of memory, because the qcc is lazy and does not enforce types.
      the float type is just a float.
      the vector type is 3 consecutive floats. statements will refer to the first element, and will automatically read/write the following two elements as well (as appropriate).
      the field type is just an integer index. the statement specifies a variable the same as any other type. the OP_ADDRESS / OP_LOAD instructions read the index from the variable and index the entity that way. the field's type is inferred from the OP_LOAD_F/V/etc or OP_STORE_F/V/etc instructions. because fields are variables the same as any other, it is legal to use fields as arguments etc - see the find builtin for an example of this. they are variables the same as any other type, but constant fields should also generate an entry in the fielddefs struct array so that loading+saving works correctly (some engines remap fields, so try to ensure that there are no gaps). the const fields should be initialised to the same integer value as specified in the fielddefs->ofs member.
      the entity type is traditionally a byte index into some entity lump allocated by the engine. some QCVMs will use entity indexes instead of byte offsets.
      Pointer types are a figment of your imagination... partly. They should only exist as temps, generated by the OP_ADDRESS instruction and written by the OP_STOREP_* instructions. you cannot read from pointers.
      strings are byte offsets from the string table present in the progs.dat (as are any 'name' types in the progs). the first byte of the string table must always be 0, to ensure that null strings are also empty. the engine might do weird stuff to strings to distinguish between different types of string. the vanilla engine uses negative string offsets to refer to strings within the engine's heap/bss/data segments, which mandates a 32bit address space. some engines thus do special things to avoid this issue, like using the high bits as flags or some such instead of a pure offset from the progs string table.
      function variables are integer indexes into the dfunction_t table. they're variables the same as any other, and you can wrap functions by storing to them or whatever and then calling the original.
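To make the "union" point concrete, here is a small sketch that reads the same 4-byte cell as a float or as an int, and looks up a string by its byte offset into the string table. The helper names and the sample string table are invented for illustration. Negative offsets (engine-side strings) are outside the scope of this sketch and are rejected.

```python
import struct

def as_float(cell):
    """Interpret a 4-byte cell as a little-endian float."""
    return struct.unpack("<f", cell)[0]

def as_int(cell):
    """Interpret the very same 4 bytes as a signed 32-bit integer."""
    return struct.unpack("<i", cell)[0]

def get_string(strings, ofs):
    """Return the NUL-terminated string starting at byte offset ofs."""
    if ofs < 0:
        raise ValueError("engine-side string, not in the progs string table")
    end = strings.index(b"\0", ofs)
    return strings[ofs:end].decode("latin-1")

# Made-up string table; byte 0 is NUL so offset 0 is the empty string.
strings = b"\0self\0impulse\0"
cell = struct.pack("<f", 1.0)
```

Here as_int(cell) gives the raw bit pattern of 1.0, which is why a disassembler has to take the type from the globaldef or the opcode rather than from the cell itself.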

      the 'fun' comes when you have a global with no name or type specified. it may be a (core) system global, an optimised immediate/const, a temp, or an optimised local.
      you can use heuristics to determine if it's an immediate, in that the offset will never be written to. the vanilla qcc will provide type info and the name 'IMMEDIATE' for immediates, but other qccs will not. vanilla qcc will never reuse temps (so it'll be written and then read pretty much immediately), but optimised qccs might reuse the temp from multiple different functions. temps/scratchpads may or may not be considered a local.
      vanilla qcc has a bug of sorts in that it NEVER reuses a temp even within the same function. such temps will only ever be written from one place. more modern qccs will reuse the temp as often as they can, because it's wasteful not to (the 1mb limit fallacy).
      optimising qccs may collapse b=a*b; into a single op_mul instruction, but vanilla qcc will wastefully use an op_mul followed by an op_store so don't assume all qccs will give such friendly hints.
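One way the "never written to" heuristic could be sketched: collect every globals offset that some statement writes, and treat unnamed offsets outside that set as immediate candidates. The opcode numbers below are what I believe the vanilla pr_comp.h enum uses and should be checked against your own opcode table. written_offsets is a hypothetical helper, and it ignores writes the engine does on its own (copying parameters on function entry, builtins).

```python
# Assumed vanilla opcode numbers; verify against your opcode table.
OP_STORE = range(31, 37)    # OP_STORE_F..OP_STORE_FNC: a is copied into b
OP_STORE_V = 32
OP_STOREP = range(37, 43)   # writes through a pointer, not to a global
OP_CALL = range(51, 60)     # OP_CALL0..OP_CALL8: result lands in OFS_RETURN
NO_GLOBAL_WRITE = {0, 43, 49, 50, 60, 61}  # DONE, RETURN, IF, IFNOT, STATE, GOTO
VECTOR_RESULT = {3, 4, 7, 9, 25}  # MUL_FV, MUL_VF, ADD_V, SUB_V, LOAD_V
OFS_RETURN = 1

def written_offsets(statements):
    """Return the set of globals offsets written by any (op, a, b, c)."""
    written = set()
    for op, a, b, c in statements:
        if op in OP_STORE:
            width = 3 if op == OP_STORE_V else 1
            written.update(range(b, b + width))
        elif op in OP_CALL:
            written.update(range(OFS_RETURN, OFS_RETURN + 3))
        elif op in OP_STOREP or op in NO_GLOBAL_WRITE:
            continue
        else:
            width = 3 if op in VECTOR_RESULT else 1
            written.update(range(c, c + width))
    return written

# Made-up statements: MUL_F 10,11 -> 20; STORE_F 20 -> 12; STORE_V 30 -> 40.
sample = [(1, 10, 11, 20), (31, 20, 12, 0), (32, 30, 40, 0)]
written = written_offsets(sample)
```

In this sample, offsets 10, 11 and 30 are read but never written, so they would be the immediate candidates.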

      the vanilla qcvm does not bound-check strings, pointers, fields, entities, etc etc... this potentially means you can do all sorts of things.
      one nice trick is that you can use floating point operations on denormalised floats, basically treating them as 23-bit integers. by hacking entities (byte offsets, remember), you can misalign these floats and implement some sort of add-with-carry logic to provide 32bit integer add/subtracts. with this you can then walk through the engine's heap and hack whatever you want (commonly referred to as qccx hacks). this is a bad thing as it can cripple/limit future engine modifications, leading to stagnation.
      this is one of the major motivations to switch entity types to integers (as well as to bounds-check pointers). thanks to x86 assuming everything is executable, such hacks allow you to inject native x86 instructions into the engine itself, so they're not a good idea to advertise as a feature... hence why decent engines try to block this stuff. actually, might be quite fun to rewrite the qcvm at run time to add extra instructions. maybe a project for another day - probably a crazy one.
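The arithmetic behind the "23-bit integers" remark can be checked in isolation: a float whose exponent bits are all zero has the value mantissa * 2^-149, so adding two such floats adds their bit patterns. The sketch below only demonstrates that property, with helper names of my own choosing, and assumes the platform does not flush denormals to zero.

```python
import struct

def bits_to_float(n):
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", n))[0]

def float_to_bits(x):
    """Reinterpret a float as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def denormal_add(a, b):
    """Add two small integers using only a floating point addition."""
    return float_to_bits(bits_to_float(a) + bits_to_float(b))

total = denormal_add(5, 7)
```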

      scramblers exist that might insert gibberish or strip names or reorder tables in an attempt to confuse decompilers. I can't think of any specific examples now though, people kinda gave up on being dicks when more recent games came out. just be sure to validate any file names so that c:\autoexec.bat or whatever can't get overwritten...
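For the file name check, something along these lines is probably enough for a decompiler that writes out .qc files named by the progs. It is a deliberately strict sketch, not a complete policy: is_safe_name is a hypothetical helper that rejects drive letters, absolute paths and any "." or ".." component, and accepts only plain relative names.

```python
def is_safe_name(name):
    """Accept only relative paths with no drive, root, '.' or '..' parts."""
    if not name or "\0" in name:
        return False
    if ":" in name:  # drive letters such as c:\ and alternate data streams
        return False
    # Treat both separator styles alike, whatever the host OS is.
    parts = name.replace("\\", "/").split("/")
    # An empty part means a leading, trailing or doubled separator.
    return all(part not in ("", ".", "..") for part in parts)

ok = is_safe_name("ai/monster.qc")
bad = is_safe_name("c:\\autoexec.bat")
```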

      saved game support ensures that globals, fields, and functions are all named. but don't expect anything else as it is often optimised out.
      csqc cannot be saved, as might various deathmatch-only mods. such mods might potentially use more extreme optimisations that strip much more info.
      Some Game Thing



      • #18
        I'll have to get back to you on much of what you've just mentioned, I'm still working on some of the basics.

        It seems that a disassembler has to do some cross-referencing to get variable names (assuming they have names). A statement points to where the variable's value is located, and so does a definition. So to get a variable name from a statement, you have to find which definition points to the same location, and hopefully only one definition does. I think I'll create a table that does this association ahead of time: 1st column the address of the var, 2nd column the index of the corresponding definition (from which the name can be gleaned).
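The table described here could be sketched as a dictionary built in one pass over the definitions. The (type, ofs, s_name) tuples below mirror the ddef_t fields, the sample values are invented, and build_xref is a hypothetical name. Each offset maps to a list rather than a single index, since nothing guarantees that only one definition points at a given location.

```python
from collections import defaultdict

def build_xref(globaldefs):
    """Map each globals offset to the indices of the defs that point at it."""
    xref = defaultdict(list)
    for index, (type_, ofs, s_name) in enumerate(globaldefs):
        xref[ofs].append(index)
    return xref

# Made-up definitions as (type, ofs, s_name) tuples.
globaldefs = [(2, 28, 1), (3, 40, 6), (2, 40, 13)]
xref = build_xref(globaldefs)
```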

        It could be dangerous if one can make a qc mod that can write data outside of the sandboxed area by exploiting some flaw. In a way more dangerous than Quake2 game dlls as at least Quake 2 game dlls can be scanned for viruses, but I doubt progs.dat script files are analyzed for malicious code. I assume most modern ports fix the known exploits?



        • #19
          quake2 can run dlls which can have had malware automatically injected into them. progs.dat files are at least specific to quake and thus do not have the same exposure.
          modern ports fix the issues, but not all active ports are modern. and in fixing them, they can break certain expectations, so if you're trying to emulate stuff, be careful of what you expect.

          select (char*)stringtablestart+globaldefs.name where globaldefs.ofs == statement.[a|b|c]
          be careful with unions - specifically vectors which will have one ev_vector entry, and one ev_float entry at the same offset (as well as two additional ev_float entries at the next two offsets too). you will want to consider the variable type that the statement implies to determine which one should be used, in order to try to avoid gibberish. strictly speaking an offset is an offset, but it is less readable and gets in the way of any sanity checks in your assembler.
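A possible way to handle the vector/float overlap: when several defs share an offset, prefer the one whose type matches what the statement implies, and fall back to whatever def exists at that offset. The type numbers are the vanilla etype_t values (ev_float = 2, ev_vector = 3), with the save-global flag in the top bit of the type. pick_def and the sample defs are made up for this sketch.

```python
EV_FLOAT, EV_VECTOR = 2, 3
DEF_SAVEGLOBAL = 1 << 15  # flag bit stored in the def's type field

def pick_def(globaldefs, ofs, wanted_type):
    """Return the index of the best-matching def at ofs, or None."""
    fallback = None
    for index, (type_, def_ofs, s_name) in enumerate(globaldefs):
        if def_ofs != ofs:
            continue
        if (type_ & ~DEF_SAVEGLOBAL) == wanted_type:
            return index
        if fallback is None:
            fallback = index
    return fallback

# A vector and its three float components, as (type, ofs, s_name).
defs = [
    (EV_VECTOR | DEF_SAVEGLOBAL, 40, 6),
    (EV_FLOAT, 40, 13),
    (EV_FLOAT, 41, 22),
    (EV_FLOAT, 42, 31),
]
vector_def = pick_def(defs, 40, EV_VECTOR)
float_def = pick_def(defs, 40, EV_FLOAT)
```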



          • #20
            While the opcode usually implies the correct var type, it doesn't always. For example, when loading up parameters it seems vectors are used as general-purpose containers. Undefined vars are a pain, as you mentioned, and I guess you have to infer what they are from how they are used. But the value they contain is often moved around to other addresses before being operated on, and coding a disassembler to investigate this would be a hassle, especially if it's a global or parameter used in another function.

            As for terminology, I tend to think of variable references as indices. When I think of offset I usually think of byte offset specifically. Byte offsets being how many byte addresses to traverse from a given starting point. Indices being how many structures to traverse from a given starting point. Confusing the 2 caused me many problems in the past.

            I wonder how many bytes are reserved for variables that are strings (not the string constants). Are entities, fields, functions and pointers unsigned ints or shorts?

            Regarding dll vs qc files, I just meant that dll files are scrutinized by virus checkers, while I doubt script files are. Although analyzing the dlls may not be so easy, specifically how it interfaces with the host exe.

            UPDATE:

            Why don't the variables in statements index-reference definitions instead? Then there would be no doubt! In statements the same variable definition can be referenced many times. Just define type as void in the definition if more flexibility was desired. Maybe they didn't want to have the extra overhead of every single variable with definition information?

            UPDATE2:

            I'm a little dumbfounded over OP_ADDRESS. The 6 'OP_STOREP_*' opcodes are the only opcodes that use the pointer it creates. Those opcodes only use 'a' and 'b'; they don't use 'c'. So why not have 'a' be the ent, 'b' be the field, and 'c' be the value to store at that ent/field combo? Unless I'm missing something here, and the pointer is also used elsewhere? Upon further reflection I think I may have answered my own question: to save it for future use, although so far the only instances I've seen of pointers being used are ones where they were immediately stored through.
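For reference, the two-step store can be modelled like this. It is a toy sketch: the pointer is an (entity, field) pair instead of a real address, the entities are plain dictionaries, and the op_* function names are mine, not the engine's.

```python
def op_address(ent, field):
    """OP_ADDRESS: a = entity, b = field; the result (c) is a pointer."""
    return (ent, field)  # stand-in for the engine's real pointer value

def op_storep(entities, value, pointer):
    """OP_STOREP_*: a = value, b = pointer; c is unused."""
    ent, field = pointer
    entities[ent][field] = value

def op_load(entities, ent, field):
    """OP_LOAD_*: a = entity, b = field; the result goes to c."""
    return entities[ent][field]

# Toy world: entity 1 with field 5 holding a float.
entities = {1: {5: 100.0}}
temp = op_address(1, 5)            # pointer lands in a temp
op_storep(entities, 75.0, temp)    # value stored through the temp
loaded = op_load(entities, 1, 5)
```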
            Last edited by Hypersonic; 12-05-2014, 01:58 PM.

