Holistic Twig Joins
- The previous Section showed; how one can reduce the total running time by avoiding sort operators and reduces the total running time and/or the size of the intermediary join results, by choosing the order in which to perform the joins.
- A different approach toward the goals of reducing the running time. And the size of the intermediary results consists in devising a new (logical and physical operator), more precisely, an n-ary structural join operator also called a holistic twig join.
- Such an operator builds the result of a tree pattern query in a single pass over all the inputs in parallel.
- This eliminates the need for storing intermediary results and may also significantly reduce the total running time.
PathStack Algorithm ( for linear queries ) : Holistic Twig Joins
- PathStack works by continuously looking for the input operator (let’s call it iopmin, corresponding to the query node nmin labeled amin) whose first ID has the smallest pre-number among all the input streams.
- This amounts to finding the first element (in document order) among those not yet processed, across all inputs.
- Let’s call this element emin.
- Once emin has been found, PathStack inspects all the stacks, looking for nodes of which it can guarantee that they will not contribute to further query results.
- In particular, this is the case for any nodes preceding emin, i.e., ending before the start of emin.
- It is easy to see that such nodes cannot be ancestors either of emin, nor of any of the remaining nodes in any of the inputs, since such nodes have a starting position even bigger than emin’s.
PathStack pops all such entries, from all stacks. : Holistic Twig Joins
- PathStack then pushes the current nmin entry on, if. And only if suitable ancestors of this element have already been identified. And pushed on the stack amin The parent node of amin in the query.
- If this is the case and a new entry pushed on, importantly, a pointer stored from the top entry in, to the new (top) entry in
- Such pointers record the connections between the stack entries matching different query nodes. And will used to build result tuples out.
- Finally, PathStack advances the input operator iopmin and resumes its main loop, identifying again the input operator holding the first element (in document order) not yet processed etc.
- When an entry pushed on the stack corresponding to the query leaf. It is certain that the stacks contain matches for all its ancestors in the query, matches that ancestors of the leaf stack entry.
- At this point, two steps apply:
- Result tuples are built out of the entries on the stacks. And in particular of the new entry on the stack of the query leaf;
- This new entry popped from the stack.
TwigStack Algorithm : Holistic Twig Joins
- This algorithm generalizes PathStack with support for multiple branches.
- The ideas very similar, as the main features (no intermediary results. And space-efficient encoding of twig query results), are the same.
- if the query pattern edges only of type ancestor/descendant. TwigStack is I/O and CPU optimal among all sequential algorithms that read their inputs in their entirety.