2020 Jan 7;12038:425–436. doi: 10.1007/978-3-030-40608-0_30

Simplified Parsing Expression Derivatives

Aaron Moss
Editors: Alberto Leporati, Carlos Martín-Vide, Dana Shapira, Claudio Zandron
PMCID: PMC7206630

Abstract

This paper presents a new derivative parsing algorithm for parsing expression grammars; this new algorithm is both simpler and faster than the existing parsing expression derivative algorithm presented by Moss [12]. This new algorithm improves on the worst-case space and runtime bounds of the previous algorithm by a linear factor, as well as decreasing runtime by about half in practice.

Keywords: Parsing, Parsing expression grammar, Derivative parsing

Introduction

A derivative parsing algorithm for parsing expression grammars (PEGs) was first published by Moss [12]; this paper presents a simplified and improved algorithm, as well as a practical comparison of the two algorithms both to each other and to other PEG parsing methods. This new algorithm preserves or improves the performance bounds of the earlier algorithm, trimming a linear factor off the worst-case time and space bounds, while preserving the linear time and constant space bounds for the class of “well-behaved” inputs defined in [12].

Parsing Expression Grammars

Parsing expression grammars are a language formalism similar in power to the more familiar context-free grammars (CFGs). PEGs are a formalization of recursive-descent parsing with limited backtracking and infinite lookahead; Fig. 1 provides definitions of the fundamental parsing expressions. $a$ is a character literal, matching and consuming a single character of input; $\varepsilon$ is the empty expression, which always matches without consuming any input, while $\varnothing$ is the failure expression, which never matches. $A$ is a nonterminal, which is replaced by its corresponding parsing expression $\mathcal{R}(A)$ to provide recursive structure in the formalism. The negative lookahead expression $!\alpha$ provides much of the unique power of PEGs, matching only if its subexpression $\alpha$ does not match, but consuming no input (see footnote 1). The sequence expression $\alpha\beta$ matches $\alpha$ followed by $\beta$, while the alternation expression $\alpha/\beta$ matches either $\alpha$ or $\beta$. Unlike the unordered choice in CFGs, if its first alternative $\alpha$ matches, an alternation expression never backtracks to attempt its second alternative $\beta$; this ordered choice is responsible for the unambiguous nature of PEG parsing.

Fig. 1.

Formal definitions of parsing expressions; $\mathcal{R}(A)$ is the expansion of $A$

Parsing expressions are functions that recognize prefixes of strings, producing either the un-consumed suffix of a match, or Inline graphic on failure. The language Inline graphic of a parsing expression Inline graphic over strings from an alphabet Inline graphic is the set of strings matched by Inline graphic; precisely, Inline graphic. This paper uses the notation Inline graphic for the empty string (distinct from the empty expression Inline graphic) and Inline graphic for the suffix Inline graphic of some string Inline graphic.
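The view of parsing expressions as functions from a string to its unconsumed suffix can be made concrete with a short recursive-descent sketch. The tuple encoding, the function name `match`, and the use of `None` for failure are assumptions made for illustration; they are not taken from the paper or from the author's implementation.

```python
from typing import Optional

# Hypothetical tuple encoding of parsing expressions (illustrative only):
#   ("char", c)      character literal
#   ("eps",)         empty expression, always matches
#   ("fail",)        failure expression, never matches
#   ("nt", name)     nonterminal, expanded through the rules dictionary
#   ("not", e)       negative lookahead
#   ("seq", e1, e2)  sequence
#   ("alt", e1, e2)  ordered choice


def match(expr, s: str, rules: dict) -> Optional[str]:
    """Return the unconsumed suffix of s if expr matches a prefix, else None."""
    kind = expr[0]
    if kind == "char":
        return s[1:] if s.startswith(expr[1]) else None
    if kind == "eps":
        return s
    if kind == "fail":
        return None
    if kind == "nt":
        return match(rules[expr[1]], s, rules)
    if kind == "not":
        # Succeeds without consuming input only if the subexpression fails.
        return s if match(expr[1], s, rules) is None else None
    if kind == "seq":
        rest = match(expr[1], s, rules)
        return None if rest is None else match(expr[2], rest, rules)
    if kind == "alt":
        # Ordered choice: the second alternative is tried only if the first fails.
        rest = match(expr[1], s, rules)
        return rest if rest is not None else match(expr[2], s, rules)
    raise ValueError(f"unknown expression kind: {kind}")


ab = ("seq", ("char", "a"), ("char", "b"))
not_ab_then_a = ("seq", ("not", ab), ("char", "a"))  # the expression !(ab)a
print(match(not_ab_then_a, "ac", {}))  # the lookahead succeeds, then a is consumed
print(match(not_ab_then_a, "ab", {}))  # the lookahead rejects the input
```

Failure is tested with `is None` rather than truthiness because a successful match that consumes the whole input returns the empty string, which is falsy in Python.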

Related Work

A number of recognition algorithms for parsing expression grammars have been presented in the literature, though none have combined efficient runtime performance with good worst-case bounds. Ford [4] introduced both the PEG formalism and two recognition algorithms: recursive descent (a direct translation of the functions in Fig. 1) and packrat (memoized recursive descent). The recursive descent algorithm has exponential worst-case runtime, though it behaves well in practice (as shown in Sect. 6); packrat improves the runtime bound to linear, but at the cost of best-case linear space usage. Ford [5] also showed that there exist PEGs to recognize non-context-free languages (e.g. $a^n b^n c^n$), and conjectured that some context-free languages exist for which there is no PEG. Mizushima et al. [11] have demonstrated the use of manually-inserted “cut operators” to trim memory usage of packrat parsing to a constant, while maintaining the asymptotic worst-case bounds; Kuramitsu [8] and Redziejowski [14] have built modified packrat parsers that use heuristic table-trimming mechanisms to achieve similar real-world performance without manual grammar modifications, but which sacrifice the polynomial worst-case runtime. Medeiros and Ierusalimschy [9] have developed a parsing machine for PEGs, similar in concept to a recursive descent parser, but somewhat faster in practice. Henglein and Rasmussen [7] have proved linear worst-case time and space bounds for their progressive tabular parsing algorithm, with some evidence of constant space usage in practice for a simple JSON grammar, but their work lacks empirical comparisons to other algorithms.

Moss [12] and Garnock-Jones et al. [6] have developed derivative parsing algorithms for PEGs. This paper extends the work of Moss, improving the theoretical quartic time and cubic space bounds by a linear factor each, and halving runtime in practice. Garnock-Jones et al. do not include empirical performance results for their work, but their approach elegantly avoids defining new parsing expressions through use of a nullability combinator to represent lookahead followers as later alternatives of an alternation expression.

Derivative Parsing

Though the backtracking capabilities of PEGs are responsible for much of their expressive power and ease-of-use, backtracking is also responsible for the worst-case resource bounds of existing algorithms. Recursive-descent parsing uses exponential time in the worst case to perform backtracking search, while packrat parsing trades this worst-case time for high best-case space usage. Derivative parsing presents a different trade-off, with low common-case memory usage paired with a polynomial time bound. A derivative parsing approach pursues all backtracking options concurrently, eliminating the repeated backtracking over the same input characteristic of worst-case recursive-descent, but also discarding bookkeeping information for infeasible options, saving space relative to packrat.
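The space cost of packrat parsing mentioned above can be illustrated with a minimal memoized recognizer. This is a sketch under assumed names: the tuple encoding of expressions, the function `packrat_match`, and the example rule `A <- a A / eps` are hypothetical and are not drawn from the implementations compared in this paper.

```python
from typing import Optional, Tuple

# Hypothetical tuple encoding (illustrative only): ("char", c), ("eps",),
# ("fail",), ("nt", name), ("not", e), ("seq", e1, e2), ("alt", e1, e2).


def packrat_match(expr, s: str, rules: dict) -> Tuple[Optional[int], int]:
    """Memoized recursive descent.

    Returns (end index of the match or None, number of memo-table entries).
    """
    memo = {}  # (expression, start index) -> end index, or None on failure

    def go(e, i: int) -> Optional[int]:
        key = (e, i)
        if key in memo:
            return memo[key]
        kind = e[0]
        if kind == "char":
            result = i + 1 if s.startswith(e[1], i) else None
        elif kind == "eps":
            result = i
        elif kind == "fail":
            result = None
        elif kind == "nt":
            result = go(rules[e[1]], i)
        elif kind == "not":
            result = i if go(e[1], i) is None else None
        elif kind == "seq":
            mid = go(e[1], i)
            result = None if mid is None else go(e[2], mid)
        elif kind == "alt":
            first = go(e[1], i)
            result = first if first is not None else go(e[2], i)
        else:
            raise ValueError(f"unknown expression kind: {kind}")
        memo[key] = result
        return result

    return go(expr, 0), len(memo)


# A <- a A / eps, matching any run of the character a.
rules = {"A": ("alt", ("seq", ("char", "a"), ("nt", "A")), ("eps",))}
for n in (4, 8, 16):
    end, entries = packrat_match(("nt", "A"), "a" * n, rules)
    print(n, end, entries)
```

Each (expression, position) pair is evaluated at most once, which is the source of the linear time bound, but the table keeps an entry for every pair it visits, so its size grows with the input even on this simple grammar.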

The essential idea of derivative parsing, first introduced by Brzozowski [3], is to iteratively transform an expression into an expression for the “rest” of the input. For example, given Inline graphic, Inline graphic, the suffixes that can follow Inline graphic in Inline graphic. After one derivative, the first character of the input has been consumed, and the grammar mutated to account for this missing character. Once repeated derivatives have been taken for every character in the input string, the resulting expression can be checked to determine whether or not it represents a match, e.g. Inline graphic, a matching result. Existing work shows how to compute the derivatives of regular expressions [3], context-free grammars [10], and parsing expression grammars [6, 12]. This paper presents a simplified algorithm for parsing expression derivatives, as well as a formal proof of the correctness of this algorithm, an aspect lacking from the earlier presentations.
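For readers unfamiliar with the technique, the following sketch shows Brzozowski's derivative for regular expressions [3], which is the simplest setting for the idea. It is not the parsing expression derivative defined in this paper, and the tuple encoding and function names are assumptions made for illustration.

```python
# Hypothetical tuple encoding of regular expressions (illustrative only):
#   ("empty",)       matches no string
#   ("eps",)         matches only the empty string
#   ("char", c)      matches the single character c
#   ("cat", r1, r2)  concatenation
#   ("alt", r1, r2)  unordered alternation
#   ("star", r)      Kleene star


def nullable(r) -> bool:
    """True if r matches the empty string."""
    kind = r[0]
    if kind in ("eps", "star"):
        return True
    if kind in ("empty", "char"):
        return False
    if kind == "cat":
        return nullable(r[1]) and nullable(r[2])
    if kind == "alt":
        return nullable(r[1]) or nullable(r[2])
    raise ValueError(f"unknown expression kind: {kind}")


def derive(r, c: str):
    """Expression for the suffixes that may follow c in the language of r."""
    kind = r[0]
    if kind in ("empty", "eps"):
        return ("empty",)
    if kind == "char":
        return ("eps",) if r[1] == c else ("empty",)
    if kind == "alt":
        return ("alt", derive(r[1], c), derive(r[2], c))
    if kind == "star":
        return ("cat", derive(r[1], c), r)
    if kind == "cat":
        left = ("cat", derive(r[1], c), r[2])
        # If r1 can match the empty string, c may instead start r2.
        return ("alt", left, derive(r[2], c)) if nullable(r[1]) else left
    raise ValueError(f"unknown expression kind: {kind}")


def matches(r, s: str) -> bool:
    """Take one derivative per input character, then test nullability."""
    for c in s:
        r = derive(r, c)
    return nullable(r)


ab_star = ("star", ("cat", ("char", "a"), ("char", "b")))  # (ab)*
print(matches(ab_star, "abab"))
print(matches(ab_star, "aba"))
```

The sketch performs no simplification of intermediate expressions, so they grow with every step; practical derivative parsers rely on simplification (called compaction later in this paper) to keep them small.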

The difficulty in designing a derivative parsing algorithm for PEGs is simulating backtracking when the input must be consumed at each step, with no ability to re-process earlier input characters. Consider !(ab)a; ab and a must be parsed concurrently, and an initial match of a must be reversed if ab later matches. Alternations introduce further complications; consider Inline graphic: the final a must be parsed concurrently with Inline graphic, but also “held back” until after the a in Inline graphic has been matched. To track the connections among such backtracking choices, Moss [12] used a system of “backtracking generations” to label possible backtracking options for each expression, as well as a complex mapping algorithm to translate the backtracking generations of parsing expressions to the corresponding generations of their parent expressions. The key observation of the simplified algorithm presented here is that an index into the input string is sufficient to label backtracking choices consistently across all parsing expressions.

Typically [3, 10, 12], the derivative Inline graphic is a function from an expression Inline graphic and a character Inline graphic to a derivative expression. Formally, Inline graphic. This paper defines a derivative Inline graphic, adding an index i for the current location in the input. This added index is used as a label to connect backtracking decisions across derivative subexpressions by annotation of certain parsing expressions. A sequence expression Inline graphic must track possible indices where Inline graphic may have stopped consuming characters and Inline graphic began to be parsed; to this end, Inline graphic is annotated with a list of lookahead followers Inline graphic, where Inline graphic is the repeated derivative of Inline graphic starting at each index Inline graphic where Inline graphic may have stopped consuming characters. To introduce this backtracking, Inline graphic and Inline graphic, neither of which consume any characters, become Inline graphic, a match at index j, and Inline graphic, a lookahead expression at index j. These annotated expressions are formally defined in Fig. 2; note that they produce either a string or Inline graphic under the same conditions as their equivalents in Fig. 1. Considered in isolation these extensions appear to introduce a dependency on the string Inline graphic into the expression definition (given that Inline graphic is a suffix of Inline graphic), but within the context of the derivative parsing algorithm any Inline graphic or Inline graphic must be in the Inline graphic subexpression of a sequence expression Inline graphic and paired with a corresponding Inline graphic lookahead follower such that Inline graphic, eliminating the dependency. Figure 3 defines a normalization function Inline graphic to annotate parsing expressions with their indices; derivative parsing of Inline graphic starts by taking Inline graphic.

Fig. 2.

Formal definitions of added parsing expressions

Fig. 3.

Definition of normalization function

Expressions that are known to always match their input provide opportunities for short-circuiting a derivative computation. For instance, if Inline graphic is an expression that is known to match, Inline graphic never tries the Inline graphic alternative, while Inline graphic always fails, allowing these expressions to be replaced by the simpler Inline graphic and Inline graphic, respectively. A similar optimization opportunity arises when expressions that have stopped consuming input are later invalidated; the augmented sequence expression Inline graphic keeps an ongoing derivative Inline graphic of Inline graphic for each start position j that may be needed, so discarding unreachable Inline graphic is essential for performance. Might et al. [10] dub this optimization “compaction” and demonstrate its importance to derivative performance; this work includes compaction in the derivative step based on functions back and match defined in Fig. 4 over normalized parsing expressions. By these definitions, based on [12], Inline graphic is the set of indices where Inline graphic may have stopped consuming input, while Inline graphic is the set of indices where Inline graphic matched. Note that Inline graphic and the definition of Inline graphic depends on the invariant that the Inline graphic alternative is discarded if Inline graphic matches.

Fig. 4.

Definitions of back and match

With these preliminaries established, the derivative is defined in Fig. 5. The derivative consumes character literals, while preserving Inline graphic matches and Inline graphic failures. To a first approximation, the derivative distributes through lookahead and alternation, though match and failure results trigger expression simplification. The bulk of the work done by the algorithm is in the sequence expression Inline graphic derivative. At a high level, the sequence derivative takes the derivative of Inline graphic, then updates the appropriate derivatives of Inline graphic, selecting one if Inline graphic matches. Any index j in Inline graphic where Inline graphic may have stopped consuming input needs to be paired with a corresponding backtrack follower Inline graphic; introducing a new follower Inline graphic involves a normalization operation. Testing for a match at end-of-input is traditionally [3, 6, 10] handled in derivative parsing with a nullability combinator Inline graphic which reduces the grammar to Inline graphic or Inline graphic; this work uses the derivative with respect to an end-of-input character Inline graphic to implement this combinator. As such, if Inline graphic matches at end-of-input, Inline graphic must also be evaluated. As in previous work [10, 12], Inline graphic, Inline graphic, back, and match are all memoized for performance.

Fig. 5.

Definition of derivative step; Inline graphic is end-of-input

The derivative with respect to a character can be extended to the derivative with respect to a string Inline graphic by repeated application: Inline graphic. After augmentation with an initial normalization step and final end-of-input derivative, the overall derivative parsing algorithm is then Inline graphic. If Inline graphic, then Inline graphic, otherwise Inline graphic. As an example, see Fig. 6.

Fig. 6.

Derivative execution example on string Inline graphic

Correctness

There is insufficient space in this paper to include a formal proof of the correctness of the presented algorithm. The author has produced such a proof, however; the general approach is outlined here.

The proof makes extensive use of structural induction, thus it must also show that such induction terminates when applied to recursively-expanded nonterminals. If evaluation of a parsing expression involves a left-recursive call to a nonterminal, this evaluation never terminates; as such, left-recursive grammars are generally excluded from consideration. Ford [5, § 3.6] introduced the notion that a parsing expression is well-formed if it does not occur anywhere in its own recursive left-expansion or have any subexpression that does; Fig. 7 formalizes the immediate left-expansion Inline graphic and the recursive left-expansion Inline graphic consistently with Ford’s definition. The normalization step presented in this paper expands nonterminals left-recursively, eliminating recursive structure from the parsing expressions considered by the derivative algorithm; this expansion is safe for well-formed grammars.

Fig. 7.

Definition of Inline graphic left-expansion function and its transitive closure Inline graphic; LE computed by iteration to a fixed point.
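The fixed-point computation of the recursive left-expansion, and the well-formedness test built on it, can be sketched as follows. The sketch takes the immediate left-expansion relation as given, supplied as a dictionary from each expression node to the nodes it may evaluate first; it does not reproduce the per-expression definitions of Fig. 7. The function names and the example grammars are hypothetical.

```python
def recursive_left_expansion(immediate: dict) -> dict:
    """Transitive closure of an immediate left-expansion relation.

    immediate maps each node to the set of nodes in its immediate
    left-expansion; the closure is found by iteration to a fixed point.
    """
    closure = {node: set(succ) for node, succ in immediate.items()}
    changed = True
    while changed:
        changed = False
        for node, reach in closure.items():
            extra = set()
            for m in reach:
                extra |= closure.get(m, set())
            if not extra <= reach:
                reach |= extra
                changed = True
    return closure


def left_recursive_nodes(immediate: dict) -> set:
    """Nodes that occur in their own recursive left-expansion."""
    closure = recursive_left_expansion(immediate)
    return {node for node, reach in closure.items() if node in reach}


# Hypothetical grammar A <- B a, B <- A b / c: A and B are mutually left-recursive.
ill_formed = {"A": {"B"}, "B": {"A", "c"}, "c": set()}
print(sorted(left_recursive_nodes(ill_formed)))

# Hypothetical grammar S <- a S / b: the recursion is not in leftmost position.
well_formed = {"S": {"a", "b"}, "a": set(), "b": set()}
print(sorted(left_recursive_nodes(well_formed)))
```

Under the definition quoted above, a grammar is well-formed when this set is empty, provided that the dictionary contains an entry for every subexpression and not only for the nonterminals.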

To prove the equivalence of derivative parsing with recursive descent, it must be shown that normalization does not change the semantics of a parsing expression, that the derivative step performs the expected transformation of the language of an expression, and that the end-of-input derivative correctly implements the behavior of an expression on the empty string. In each of these cases, the proof proceeds by treating the relevant parsing expressions as functions over their input and proving that they produce equivalent results.

Proof of correctness of the derivative step depends on a number of invariant properties of the normalized parsing expressions (e.g. there is a lookahead follower Inline graphic in Inline graphic for every Inline graphic that may arise from derivatives of Inline graphic); these properties must be shown to be established by the Inline graphic function and maintained by Inline graphic. Other lemmas needed to support the proof describe the dynamic behavior of the derivative algorithm (e.g. Inline graphic implies that the derivative of Inline graphic eventually becomes a Inline graphic success result).

Without appealing to a formal proof of correctness, it should be noted that the experimental results in Sect. 6 demonstrate successful matching of a large number of strings, and thus a low (possibly zero) false-negative rate for the derivative algorithm; further automated correctness tests are available with the source distribution [13].

Analysis

In [12], Moss demonstrated the polynomial worst-case space and time of his algorithm with an argument based on bounds on the depth and fanout of the DAG formed by his derivative expressions. These bounds, cubic space and quartic time, were improved to constant space and linear time for a broad class of “well-behaved” inputs with constant-bounded backtracking and depth of recursive invocation. This paper includes a similar analysis of the algorithm presented here, improving the worst-case bounds of the previous algorithm by a linear factor, to quadratic space and cubic time, while maintaining the optimal constant space and linear time bounds for the same class of “well-behaved” inputs.

For an input string of length n, the algorithm runs O(n) derivative steps; the cost of each derivative step Inline graphic is the sum of the cost of the derivative algorithm in Fig. 5 on each expression node in the recursive left-expansion Inline graphic of Inline graphic. Since by convention the size of the grammar is a constant, all operations on any expression Inline graphic from the original grammar (particularly Inline graphic) run in constant time and space. It can be observed from the derivative step and index equations in Figs. 5 and 4 that once the appropriate subexpression derivatives have been calculated, the cost of a derivative step on a single expression node Inline graphic is proportional to the size of the immediate left-expansion of Inline graphic, Inline graphic. Let b be the maximum Inline graphic over all Inline graphic; by examination of Fig. 7, Inline graphic is bounded by the number of backtracking followers Inline graphic in the annotated sequence expression. Since no more than one backtracking follower may be added per derivative step, Inline graphic. Assuming Inline graphic is memoized for each i, only a constant number of expression nodes may be added to the expression at each derivative step, therefore Inline graphic. By this argument, the derivative parsing algorithm presented here runs in Inline graphic worst-case space and Inline graphic worst-case time, improving the previous space and time bounds for derivative parsing of PEGs by a linear factor each. This linear improvement over the algorithm presented in [12] is due to the new algorithm only storing O(b) backtracking information in sequence nodes, rather than Inline graphic as in the previous algorithm.
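The counting argument above can be condensed into a few lines. The symbol $N$, standing for the number of expression nodes alive at any derivative step, is introduced here only for illustration and is not notation from the original analysis.

```latex
\begin{align*}
  b &\in O(n) && \text{at most one new backtracking follower per derivative step} \\
  N &\in O(n) && \text{a constant number of new nodes per derivative step} \\
  \text{space} &\in O(N \cdot b) = O(n^2) && \text{each node stores $O(b)$ backtracking information} \\
  \text{time} &\in O(n) \cdot O(N \cdot b) = O(n^3) && \text{$n$ steps, each visiting $N$ nodes at $O(b)$ cost}
\end{align*}
```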

In practical use, the linear time and constant space results presented in [12] for inputs with constant-bounded backtracking and grammar nesting (a class that includes most source code and structured data) also hold for this algorithm. If b is bounded by a constant rather than its linear worst-case, the bounds discussed above are reduced to linear space and quadratic time. Since b is a bound on the size of Inline graphic, it can be seen from Fig. 7 that this is really a bound on sequence expression backtracking choices, which existing work including [12] has shown is often bounded by a constant in practical use.

Given that the bound on b limits the fanout of the derivative expression DAG, a constant bound on the depth of that DAG implies that the overall size of the DAG is similarly constant-bounded. Intuitively, the bound on the depth of the DAG is a bound on recursive invocations of a nonterminal by itself, applying a sort of “tail-call optimization” for right-recursive invocations such as Inline graphic. The conjunction of both of these bounds defines the class of “well-behaved” PEG inputs introduced by Moss in [12], and by the constant bound on derivative DAG size this algorithm also runs in constant space and linear time on such inputs.
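The two restricted cases discussed above follow from the same accounting as the worst-case argument, with space taken as $O(N \cdot b)$ and time as $O(n \cdot N \cdot b)$. Here $N$ denotes the number of live expression nodes, a symbol assumed for this sketch rather than one used in the paper.

```latex
\begin{align*}
  b \in O(1),\ N \in O(n) &\implies \text{space} \in O(n),\quad \text{time} \in O(n^2) \\
  b \in O(1),\ N \in O(1) &\implies \text{space} \in O(1),\quad \text{time} \in O(n)
\end{align*}
```

The first line corresponds to constant-bounded backtracking alone; the second adds the constant bound on DAG depth and is the well-behaved case.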

Experimental Results

In addition to being easier to implement than the previous derivative parsing algorithm, the new parsing expression derivative also has superior performance.

To test this performance, the simplified parsing expression derivative (SPED) algorithm was compared against the parser-combinator-based recursive descent (Rec.) and packrat (Pack.) parsers used in [12], as well as the parsing expression derivative (PED) implementation from that paper. The same set of XML, JSON, and Java inputs and grammars used in [12] is used here; the inputs originally come from [11]. Code and test data are available online [13]. All tests were compiled with g++ 6.2.0 and run on a Windows system with 8 GB of RAM, a 2.6 GHz processor, and SSD main storage.

Figure 8 shows the runtime of all four algorithms on all three data sets, plotted against the input size; Fig. 9 shows the memory usage of the same runs, also plotted against the input size, but on a log-log scale.

Fig. 8.

Algorithm runtime with respect to input size; lower is better.

Fig. 9.

Maximum algorithm memory use with respect to input size; lower is better.

Despite its poor worst-case asymptotic performance, the recursive descent algorithm is the best in practice, running fastest on all tests and using the least memory on all but the largest inputs (where the derivative parsing algorithms gain an edge because they do not need to buffer the input). Packrat parsing is consistently slower than recursive descent, while using two orders of magnitude more memory. The two derivative parsing algorithms have significantly slower runtime, but memory usage closer to that of recursive descent than to that of packrat.

Though on these well-behaved inputs all four algorithms run in linear time and space (constant space for the derivative parsing algorithms), the constant factor differs by both algorithm and grammar complexity. The XML and JSON grammars are of similar complexity, with 23 and 24 nonterminals, respectively, and all uses of lookahead expressions $!\alpha$ and $\&\alpha$ eliminated by judicious use of the more specialized negative character class, end-of-input, and until expressions described in [12]. It is consequently unsurprising that the parsers have similar runtime performance on those two grammars. By contrast, the Java grammar is significantly more complex, with 178 nonterminals and 54 lookahead expressions, and correspondingly poorer runtime performance.

Both the packrat algorithm and the derivative parsing algorithm presented here trade increased space usage for better runtime. Naturally, this trade-off works more in their favour for more complex grammars, particularly those with more lookahead expressions, as suggested by Moss [12]. Grouping the broadly equivalent XML and JSON tests together and comparing mean speedup, recursive descent is 3.3x as fast as packrat and 18x as fast as SPED on XML and JSON, yet only 1.6x as fast as packrat and 3.7x as fast as SPED for Java. Packrat’s runtime advantage over SPED also decreases from 5.5x to 2.3x between XML/JSON and Java.
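The packrat-to-SPED ratios quoted above follow from the two sets of recursive descent speedups, as the short calculation below confirms. The dictionary and function names are illustrative; the four input figures are the mean speedups reported in the text.

```python
# Mean speedup of recursive descent over each algorithm, as reported in the text.
rec_over_packrat = {"XML/JSON": 3.3, "Java": 1.6}
rec_over_sped = {"XML/JSON": 18.0, "Java": 3.7}


def packrat_over_sped(group: str) -> float:
    """Speedup of packrat over SPED, derived from the two ratios above."""
    return rec_over_sped[group] / rec_over_packrat[group]


for group in rec_over_packrat:
    ratio = packrat_over_sped(group)
    print(f"{group}: packrat is about {ratio:.1f}x as fast as SPED")
```

To one decimal place the results are 5.5 for XML/JSON and 2.3 for Java, in agreement with the figures in the text.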

Though the packrat algorithm is a modest constant factor faster than the derivative parsing algorithm across the test suite, it uses as much as 300x as much peak memory on the largest test cases, with the increases scaling linearly in the input size. Derivative parsing, by contrast, maintains a grammar-dependent constant memory usage across all the (well-behaved) inputs tested. This constant memory usage is within a factor of two on either side of the memory usage of the recursive descent implementation on all the XML and JSON inputs tested, and 3–5x more on the more complex Java grammar. The higher memory usage on Java is likely due to the lookahead expressions, which are handled with runtime backtracking in recursive descent, but extra concurrently-processed expressions in derivative parsing.

Derivative parsing in general is known to have poor runtime performance [1, 10], as these results also demonstrate. However, this new algorithm does provide a significant improvement on the current state of the art for parsing expression derivatives, with a 40% speedup on XML and JSON, a 50% speedup on Java, and an up to 13% decrease in memory usage. This improved performance may be beneficial for use cases that specifically require the derivative computation, such as the modular parsers of Brachthäuser et al. [2] or the sentence generator of Garnock-Jones et al. [6].

Conclusion and Future Work

This paper has introduced a new derivative parsing algorithm for PEGs based on the previously-published algorithm in [12]. Its key contributions are simplification of the earlier algorithm and empirical comparison of this new algorithm to previous work. The simplified algorithm also improves the worst-case space and time bounds of the previous algorithm by a linear factor. The author has produced a formal proof of correctness for this simplified algorithm, but was unable to include it in this paper due to space constraints.

While extension of this recognition algorithm to a parsing algorithm remains future work, any such extension may rely on the fact that successfully recognized parsing expressions produce a Inline graphic expression in this algorithm, where e is the index where the last character was consumed. As one approach, Inline graphic might annotate parsing expressions with b, the index where they began to consume characters. By collecting subexpression matches and combining the two indices b and e on a successful match, this algorithm should be able to return a parse tree on match, rather than simply a recognition decision. The parser derivative approach of Might et al. [10] may be useful here, with the added simplification that PEGs, unlike CFGs, have no more than one valid parse tree, and thus do not need to store multiple possible parses in a single node.

Footnotes

1

The positive lookahead expression $\&\alpha$ can be expressed as $!!\alpha$.

Contributor Information

Alberto Leporati, Email: alberto.leporati@unimib.it.

Carlos Martín-Vide, Email: carlos.martin@urv.cat.

Dana Shapira, Email: shapird@g.ariel.ac.il.

Claudio Zandron, Email: zandron@disco.unimib.it.

Aaron Moss, Email: mossa@up.edu.

References

  • 1.Adams, M.D., Hollenbeck, C., Might, M.: On the complexity and performance of parsing with derivatives. In: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2016, pp. 224–236. ACM, New York (2016)
  • 2.Brachthäuser, J.I., Rendel, T., Ostermann, K.: Parsing with first-class derivatives. In: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, pp. 588–606. ACM, New York (2016)
  • 3.Brzozowski, J.A.: Derivatives of regular expressions. J. ACM 11(4), 481–494 (1964). 10.1145/321239.321249
  • 4.Ford, B.: Packrat parsing: a practical linear-time algorithm with backtracking. Master’s thesis, Massachusetts Institute of Technology, September 2002
  • 5.Ford, B.: Parsing expression grammars: a recognition-based syntactic foundation. In: ACM SIGPLAN Notices, vol. 39, no. 1, pp. 111–122. ACM (2004)
  • 6.Garnock-Jones, T., Eslamimehr, M., Warth, A.: Recognising and generating terms using derivatives of parsing expression grammars. arXiv preprint arXiv:1801.10490 (2018)
  • 7.Henglein, F., Rasmussen, U.T.: PEG parsing in less space using progressive tabling and dynamic analysis. In: Proceedings of the 2017 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation, PEPM 2017, pp. 35–46. ACM, New York (2017)
  • 8.Kuramitsu, K.: Packrat parsing with elastic sliding window. J. Inf. Process. 23(4), 505–512 (2015)
  • 9.Medeiros, S., Ierusalimschy, R.: A parsing machine for PEGs. In: Proceedings of the 2008 Symposium on Dynamic Languages, DLS 2008, pp. 2:1–2:12. ACM, New York (2008)
  • 10.Might, M., Darais, D., Spiewak, D.: Parsing with derivatives: a functional pearl. In: ACM SIGPLAN Notices, vol. 46, no. 9, pp. 189–195. ACM (2011)
  • 11.Mizushima, K., Maeda, A., Yamaguchi, Y.: Packrat parsers can handle practical grammars in mostly constant space. In: Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering. pp. 29–36. ACM (2010)
  • 12.Moss, A.: Derivatives of parsing expression grammars. In: Proceedings of the 15th International Conference on Automata and Formal Languages, AFL 2017, Debrecen, Hungary, 4–6 September 2017, pp. 180–194 (2017). 10.4204/EPTCS.252.18
  • 13.Moss, A.: Egg (2018). https://github.com/bruceiv/egg/tree/deriv
  • 14.Redziejowski, R.R.: Parsing expression grammar as a primitive recursive-descent parser with backtracking. Fundam. Inform. 79(3–4), 513–524 (2007)

Articles from Language and Automata Theory and Applications are provided here courtesy of Nature Publishing Group
