Domato Fuzzer's Generation Engine Internals

Domato is a DOM fuzzer developed by (@ifsecure). It has a generation engine inside that when given grammar rule, engine automatically generates codes which can be used as input to crash your target. While trying to write my own grammar rule for my fuzzing target, I got curious of how it is implemented so I decided to look at the code. Hope this post will help others who are interested in customizing the engine or writing your own grammars. 

Please note that I referenced code examples and explanations from official github page.

So to use the generation engine with a grammar file, you need to create a instance of the engine and import your grammar like stated in the README.

from grammar import Grammar

my_grammar = Grammar()
my_grammar.parse_from_file('input_file.txt')
result_string = my_grammar.generate_symbol('symbol_name')

Let’s see what happens when a grammar file is parsed. 

Special Commands

First, the file's content is opened and each line is parsed inside the function _include_from_string().

There are special commands the engine handles like the following.

!begin lines / !end lines

!begin lines
<new element> = document.getElementById("<string min=97 max=122>");
<element>.doSomething();
!end lines

This command generates the program code. If parser encounters this block, it sets “in_code” value to True, and “helper_lines” to False. Each line inside this block is then parsed by _parse_code_line().

!begin helperlines / !end helperlines

!begin helperlines
#functions
<new eventhandler> = eventhandler1;
<new eventhandler> = eventhandler2;
<new eventhandler> = eventhandler3;
<new eventhandler> = eventhandler4;
<new eventhandler> = eventhandler5;
… more ...
!end helperlines

Similarly, there is "!begin helperlines” which is used to define lines of code which should be only used for generating other real code for fuzzing. For example, jshelpers.txt file defines helperlines as seen on the example.

!begin function / !end function

!begin function savesize
context['size'] = ret_val
!end function

Engine also provides users to define a python code which can be called while generating code. If the parser meets “!begin function” it saves function name in the context and set “in_function” flag to True so next lines can be seen as function code which is concatenated and passed on to _save_function().

_save_function(self, name, source) first eliminates indents in user code to make most higher level to zero indentation. Then it calls compile(source, name, ‘exec’) to get compiled function and saves it to self._functions[name] class variable.

Other Commands

Handlers for other commands are defined inside the code like following.

self._command_handlers = {
            'varformat': self._set_variable_format,
            'include': self._include_from_file,
            'import': self._import_grammar,
            'lineguard': self._set_line_guard,
            'max_recursion': self._set_recursion_depth,
            'var_reuse_prob': self._set_var_reuse_probability,
            'extends': self._set_extends
        }
  • !varformat
    • Saves the format how newly created variables should look like in self._var_format
  • !include
    • Opens the file specified, saves the path in self._definitions_dir and call parse_from_string() like how the main grammar file is parsed
  • !import
    • Parses the file using separate sub Grammar class instance and saves it in self._imports[basename]
  • !lineguard
    • Saves the guard block in self._line_guard
  • !max_recursion
    • Parses string as integer and saves the value in self._recursion_max
  • !var_reuse_prob
    • Sets value in self._var_reuse_prob which is probability value of how often should the engine reuse already generated variable
  • !extends
    • Sets inheritance relationship

Rest of the lines where there is no special symbols (commands), it is considered as a grammar line and _parse_grammar_line() is called.

Parsing Grammar Lines

Grammar line has format of "<symbol> = a mix of constants and <other_symbol>s”

After each line is split into tokens, parsed information fills out python dictionary called "rule”. 

rule = {
      'type': 'grammar',
      'creates': self._parse_tag_and_attributes(match.group(1)),
      'parts': []
}

Let’s look at the grammar example from the README.

<cssrule> = <selector> { <declaration> }
<selector> = a
<selector> = b
<declaration> = width:100%

Each line gets split by the ‘=‘ character.

For example first line is split to tokens, “cssrule”, “<selector> { <declaration>}”. The first token “cssrule”, or left hand side of the ‘=‘ character is then passed to function _parse_tag_attributes() to extract tag name and attributes from the split token. Since “cssrule” has no attributes, “cssrule” becomes the tagname and returned dictionary looks like the following.

{'type': 'tag', 'tagname': 'cssrule’}

The returned dictionary will be saved to “creates” key in the rule dictionary of the parsing grammar line. The right hand side of the ‘=‘ character, which is “<selector> {<declaration>}” is then split again into tokens.  

Here when split, it is always the order of text->tag->text … 

For example, "<foo><bar>” it gets split to “”, “foo”, “”, “bar”,””. Therefore in our case, “<selector> { <declaration>}” will be split to “", “selector", " { ", “declaration", “ }”. Each text field is appended to rule[‘parts’] as type “text” with the token saved as the value of “text” key. 

Tags are parsed by _parse_tag_and_attributes() function and result is appended to rule[‘parts’] list. Also if parsed tag here is same as the tag in the left side of the ‘=‘ character, rule[‘recursive’] is set to True. Following shows the generated rule['parts'] for the first grammar line of our example.

rule['parts'] = 
[
    {'type': 'tag', 'tagname': 'selector'}, 
    {'type': 'text', 'text': ' { '}, 
    {'type': 'tag', 'tagname': 'declaration'}, 
    {'type': 'text', 'text': ' }'}
]

If creating tag for this grammar line is already in the self._creators list, current parsed rule is appended to existing creators of the creating tag. This means there can be multiple creators for a single tag. If there is no creators for the creating tag, new rule is added to the self._creators dictionary.

When ‘nonrecursive’ attribute is set for the creating tag, rule is appended or added to self._nonrecursive_creators dictionary as well. 'nonrecursive' attribute is set on tags which should be forced to be selected when recursion level has reached maximum thus avoiding infinite recursion.

Finally the rule is appended to self._all_rules dictionary and if ‘root’ attribute is set in the creating tag, name of the tag is set as self._root.

Generated rules for this example is as follows.

self._creators['cssrule'] = [
{creates': {'type': 'tag', 'tagname': 'cssrule'}, 'parts': [{'type': 'tag', 'tagname': 'selector'}, {'text': ' { ', 'type': 'text'}, {'type': 'tag', 'tagname': 'declaration'}, {'text': ' }', 'type': 'text'}], 'type': 'grammar', 'recursive': False}
]

self._creators['selector] = [
{'creates': {'type': 'tag', 'tagname': 'selector'}, 'parts': [{'text': 'a', 'type': 'text'}], 'type': 'grammar', 'recursive': False}
]

self._creators['selector'].append(
{'creates': {'type': 'tag', 'tagname': 'selector'}, 'parts': [{'text': 'b', 'type': 'text'}], 'type': 'grammar', 'recursive': False})

self._creators['declaration'] = [
{'creates': {'type': 'tag', 'tagname': 'declaration'}, 'parts': [{'text': 'width:100%', 'type': 'text'}], 'type': 'grammar', 'recursive': False}
]

Parsing Code Lines

Similarly, each code line creates rule of type ‘code’.

rule = {
      'type': 'code’,
      'parts': [],
      'creates': []
}

Each code line is also considered as mixture of texts and tags. Let’s say we have following code. 

!varformat fuzzvar%05d
!lineguard try { <line> } catch(e) {}

!begin lines
<new element> = document.getElementById("<string min=97 max=122>");
<element>.doSomething();
!end lines

Each line is split into tokens and texts are appended to rule[‘parts’] as type “text” whereas tags are parsed by _parse_tag_and_attributes() and appended to rule[‘parts’] as type "tag". If ’new’ attribute is in the parsed tag, parsed tag is appended to rule[‘creates’]. Following parsed rules shows that <new element> = document.getElementById("<string min=97 max=122>"); code line creates tag "element" thus has parsed tag info inside rule['creates'].

self._creators['element'] = [
{'creates': [{'new': 'true', 'type': 'tag', 'tagname': 'element'}], 'parts': [{'new': 'true', 'type': 'tag', 'tagname': 'element'}, {'text': ' = document.getElementById("', 'type': 'text'}, {'max': '122', 'type': 'tag', 'tagname': 'string', 'min': '97'}, {'text': '");', 'type': 'text'}], 'type': 'code'}]

self._creators['line'] = [
{'creates': [{'new': 'true', 'type': 'tag', 'tagname': 'element'}], 'parts': [{'new': 'true', 'type': 'tag', 'tagname': 'element'}, {'text': ' = document.getElementById("', 'type': 'text'}, {'max': '122', 'type': 'tag', 'tagname': 'string', 'min': '97'}, {'text': '");', 'type': 'text'}], 'type': 'code'},
{'creates': [], 'parts': [{'type': 'tag', 'tagname': 'element'}, {'text': '.doSomething();', 'type': 'text'}], 'type': 'code’}]

As explained, if new tag is generated and is interesting type, rule will be appended to the list of creator rules for that specific tag (e.g. self._creators['element']). If the tag has 'nonrecursive' attribute, the rule is appended to self._nonrecursive_creators[tagname] as well. All parsed rules of a code lines are stored in self._creators['line'] (when not a helperline). 

Following is the example of non interesting types.

_NONINTERESTING_TYPES = [
    'short',
    'long',
    'DOMString',
    'boolean',
    'float',
    'double'
]

Finally every parsed rule is appended to self._all_rules[].

Normalizing Probabilities

Next phase is preprocessing probabilities for production rules.

Go through all elements in _creators and _nonrecursive_creators list and get cdf (cumulative distribution functions) using _get_cdf(symbol, creators) and save them in _creators_cdfs[symbol] and _nonrecursivecreator_cdfs[symbol] respectively.

Symbol here means the name of the tag.

Computing Interesting Indices

Last thing to do before expanding production rules is selecting interesting lines for each variable type (tag type).

If ‘line’ is not in self._creators[], which means no code line is parsed, then the function just returns.

For all the parsed rules for code lines (non helperlines), line number is appended to self._all_nonhelpers_lines[]. If there is a token in the rule which is a interesting type tag and is not creating a new tag, then that code line is an interesting line for that specific tag without 'new' attribute. Thus code line number is appended to self._interesting_lines[tagname].  In our previous example, following are interesting lines for tagname 'element'. 

<element>.doSomething();

However, following line is not a interesting line for 'element' tag because there is only an 'element' tag with 'new' attribute. This is a interesting line for "string" tag because there is no 'new' attribute.

<new element> = document.getElementById("<string min=97 max=122>");

Generating Symbols

At this point, you have a Grammar object with parsed rules stored inside. Now it is time to generate symbols.

If "root" symbol is set, you can call like the following.

result_string = my_grammar.generate_root()

Or you can specify which symbol you want to generate like this.

result_string = my_grammar.generate_symbol('symbol_name')

Both cases will generate context information and internally call self._generate(symbolname, context, 0)

Following dictionary saves the context information of the generator.

context = {
            'lastvar': 0,
            'lines': [],
            'variables': {},
            'force_var_reuse': False
}

self._generate() function first checks if current generating symbol is already in the context[‘variables’] and is interesting type (not in _NONINTERESTING_TYPES).  context['variables'] stores names of all variables created so far. 

If current generating symbol is already created, it checks following conditions. 

  • Is force_var_reuse set
  • Randomly generated value is less than self._var_reuse_prob
  • Number of already generated variables of the same type exceeds self._max_vars_of_same_type

If any of above conditions are met, function randomly chooses from already generated variables of the same type instead of creating a new one and sets context[‘force_var_reuse’] to False.

If we have to generate a new variable (conditions are all false),

function selects a creator for the generating symbol using self._select_creator() and use that creator to expand rule by calling self._expand_rule(). All relevant context information is passed to the called function.

Selecting Creator

When selecting the creator, self._select_creator() first checks if there are creators for the symbol and if recursion_depth has reached self._recursion_max and raises error if so.

After that, if argument "force_nonrecursive" is True creators and cdf are chosen from self._nonrecursive_creators[symbol], self._nonrecursivecreator_cdfs[symbol] respectively. If not, they are chosen from self._creators[symbol], self._creator_cdfs[symbol].

Finally if cdf is not set for the symbol we are selecting creator for, it randomly selects a creator from the creators list. Otherwise it selects a creator according to the cdf.

Expanding Rules

After selecting a creator for the genearting symbol, _expand_rule() iterates through all the elements on the right hand side of the rule and replaces them with their text representations or recursively generates non-terminal symbols (or tags).

There are some dictionaries and lists used when expanding the right hand side of the rule.

variable_ids = {}
new_vars = []
ret_vars = []
ret_parts = []

There is a special case, if "id" attribute is specified in the expand rule, check if there is an already expanded symbol with same id (check in variable_ids[]). If there is one, expanded symbol with the same id is appended to ret_parts[]. If there is no previously expanded symbol with the same id, expanded symbol is saved in the variable_ids['id' value].

Also if symbol has "beforeoutput" attribute, which is one way to call a function when expanding the symbol, expansion result is passed on to the function so that the function can do other work with the expansion result like saving the generated int value. 

Lets look at how the function expands the right hand side of the rule.

  • Symbol is 'text'
    • Expand with the actual text value
  • Rule is for code line and symbol creates a new tag
    • Generate a  new variable name with a predefine format (e.g. fuzzvar00001)
    • Append the new variable's name and type to new_vars[]
    • If newly generated variable's type is same as the type of symbol we are currently expanding, append the variable's name to the ret_vars[].
    • Final expanded string is '/* newvar{' + var_name + ':' + var_type + '} */ var ' + var_name
  • Symbol is a constant type
    • Expand the constant type with predefine constant list
  • Symbol is a builtin type
    • Expand the symbol using predefined builtin type generator
  • Symbol is a function call
    • Expand with result of function execution
  • For all other symbols
    • Recursively expand the symbols by calling _generate() with increasing recursion_depth by one

Each expanded result of the right hand side symbols is appended to ret_parts[].

Next, after expanding the symbols in the right hand side of the rule, function checks for all the newly generated variables which are interesting types. Each interesting typed new variables are passed on to the function self._add_variable(). 

self._add_variable() adds newly generated variable information to the generator context.

  • If currently adding variable type is not already in the context, it is added to the context['variables'].
  • Then checks if variable type has interesting lines, or expansion rules that are not creating new  

To the context['interesting_lines'], it adds interesting lines of the adding variable type which is not already in the context. Also variable's name is appended to the context['variables'][var_type].

Lastly if there are parent types that this variable type inherits from, its parent type is also added to the context by recursively calling self._add_variable(var_name, parent_type, context).

All the newly generated variables are filled in a line of format like below and appended to the context['lines'] along with final expanded rule.

"if (!" + v['name'] + ") { " + v['name'] + " = GetVariable(fuzzervars, '" + v['type'] + "'); } else { " + self._get_variable_setters(v['name'], v['type']) + " }")

If the current rule is a ordinary grammar rule, final result of expanded right hand side, is returned.

Example

Let's go back to the simple <cssrule> grammar and see how it is generated.

<cssrule> = <selector> { <declaration> }
<selector> = a
<selector> = b
<declaration> = width:100%

After the grammar rules are parsed, it will have following rules in the self._creators[] list.

 Parsed grammar rules

Parsed grammar rules

When calling generate_symbol("cssrule"), it selects a creator for "cssrule" and start expanding from there.

 Generating symbol

Generating symbol

After recursively calling _generate() when tags are met and joining returned expanded symbol results in the final "b { width:100% }".

Generate Code Directly

You can also directly generate code specifying number of lines to generated.

my_grammar. _generate_code(numlines)

First what it does is randomly chooses a code line number from self._all_nonhelper_lines (contains all line numbers of code line rules) and get the associated creator from self._creators['line']. Then it tries to expand that creator rule. It repeats this process until expanded lines exceeds number of lines it is trying to generate. 

Occasionally when all following conditions meet, function randomly chooses an interesting line for expanding symbols.

  • random.random() < self._interesting_line_prob
  • len(tmp_context['interesting_lines']) > 0
    • tmp_context['interesting_lines']) is filled while expanding rules

When expanding the code line creator rule, symbol 'line' is passed as the expanding symbol. At the end of the _expand_rule() function, there is a check for symbol 'line'. If the function is currently expanding symbol 'line', it returns final expanded string.

Example

Let's look at a simple example to understand the process better.

!varformat fuzzvar%05d
!lineguard try { <line> } catch(e) {}

!begin lines
<new element> = document.getElementById("<string min=97 max=122>");
<new element> = <element>.clone();
<element>.doSomething();
!end lines

After the code line rules are parsed, self._creators['line'] will be something like the following diagram.

Please note that although in the diagram only shows self._creators['line'], all the rules that create a new tag are also stored in self._creators[tagname] as well so that self._select_creator()  function can correctly select creators for expansion.

 Parsed code line rules

Parsed code line rules

Let's say "<new element> = <element>.clone();" was first randomly chosen for expanding symbol 'line'. Then final generated codes will follow similar path.

 Generating code

Generating code

When it encounters <new element> tag it creates a new variable. If ordinary <element> tag is met, it tries to expand that using randomly chosen creator for element tag by recursively calling _generate(). Each expanded code line is saved in the context['lines'] and expanding <element> tag is replaced by newly generated "element" type variable's name (e.g. fuzzvar00003). Order of code line being saved is backwards, the one generated with higher recursion depth saved first. That is why in the generated code, fuzzvar00003 is written first. 

Final Thoughts

Domato is simple but well engineered fuzzer. You can write any custom grammar and basically you can start generating codes and fuzz your target. With more knowledge of how domato's generation engine works, you can write a better grammar for fuzzing. 

If you have any questions, feedbacks or errors in the post feel free to reach out to me.

Thanks!