Author manuscript; available in PMC: 2018 Jun 21.
Published in final edited form as: Proc ACM SIGMOD Int Conf Manag Data. 2018 Jun;2018:1301–1316. doi: 10.1145/3183713.3183729

Table 7.

Features from Fonduer’s feature library. Example values are drawn from the example candidate in Figure 1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature’s value.

| Feature Type | Arity | Description | Example Value |
|---|---|---|---|
| Structural | Unary | HTML tag of the mention | TAG_<h1> |
| Structural | Unary | HTML attributes of the mention | HTML_ATTR_font-family:Arial |
| Structural | Unary | HTML tag of the mention’s parent | PARENT_TAG_<p> |
| Structural | Unary | HTML tag of the mention’s previous sibling | PREV_SIB_TAG_<td> |
| Structural | Unary | HTML tag of the mention’s next sibling | NEXT_SIB_TAG_<h1> |
| Structural | Unary | Position of a node among its siblings | NODE_POS_1 |
| Structural | Unary | HTML class sequence of the mention’s ancestors | ANCESTOR_CLASS_<s1> |
| Structural | Unary | HTML tag sequence of the mention’s ancestors | ANCESTOR_TAG_<body>_<p> |
| Structural | Unary | HTML IDs of the mention’s ancestors | ANCESTOR_ID_l1b |
| Structural | Binary | HTML tags shared between mentions on the path to the root of the document | COMMON_ANCESTOR_<body> |
| Structural | Binary | Minimum distance between two mentions to their lowest common ancestor | LOWEST_ANCESTOR_DEPTH_1 |
| Tabular | Unary | N-grams in the same cell as the mention^a | CELL_cev^b |
| Tabular | Unary | Row number of the mention | ROW_NUM_5 |
| Tabular | Unary | Column number of the mention | COL_NUM_3 |
| Tabular | Unary | Number of rows the mention spans | ROW_SPAN_1 |
| Tabular | Unary | Number of columns the mention spans | COL_SPAN_1 |
| Tabular | Unary | Row header n-grams in the table of the mention | ROW_HEAD_collector |
| Tabular | Unary | Column header n-grams in the table of the mention | COL_HEAD_value |
| Tabular | Unary | N-grams from all cells in the same row as the given mention^a | ROW_200_[ma]^c |
| Tabular | Unary | N-grams from all cells in the same column as the given mention^a | COL_200_[6]^c |
| Tabular | Binary | Whether two mentions are in the same table | SAME_TABLE^b |
| Tabular | Binary | Row number difference if two mentions are in the same table | SAME_TABLE_ROW_DIFF_1^b |
| Tabular | Binary | Column number difference if two mentions are in the same table | SAME_TABLE_COL_DIFF_3^b |
| Tabular | Binary | Manhattan distance between two mentions in the same table | SAME_TABLE_MANHATTAN_DIST_10^b |
| Tabular | Binary | Whether two mentions are in the same cell | SAME_CELL^b |
| Tabular | Binary | Word distance between mentions in the same cell | WORD_DIFF_1^b |
| Tabular | Binary | Character distance between mentions in the same cell | CHAR_DIFF_1^b |
| Tabular | Binary | Whether two mentions in a cell are in the same sentence | SAME_PHRASE^b |
| Tabular | Binary | Whether two mentions are in different tables | DIFF_TABLE^b |
| Tabular | Binary | Row number difference if two mentions are in different tables | DIFF_TABLE_ROW_DIFF_4^b |
| Tabular | Binary | Column number difference if two mentions are in different tables | DIFF_TABLE_COL_DIFF_2^b |
| Tabular | Binary | Manhattan distance between two mentions in different tables | DIFF_TABLE_MANHATTAN_DIST_7^b |
| Visual | Unary | N-grams of all lemmas visually aligned with the mention^a | ALIGNED_current |
| Visual | Unary | Page number of the mention | PAGE_1 |
| Visual | Binary | Whether two mentions are on the same page | SAME_PAGE |
| Visual | Binary | Whether two mentions are horizontally aligned | HORZ_ALIGNED^b |
| Visual | Binary | Whether two mentions are vertically aligned | VERT_ALIGNED |
| Visual | Binary | Whether two mentions’ left bounding-box borders are vertically aligned | VERT_ALIGNED_LEFT^b |
| Visual | Binary | Whether two mentions’ right bounding-box borders are vertically aligned | VERT_ALIGNED_RIGHT^b |
| Visual | Binary | Whether the centers of two mentions’ bounding boxes are vertically aligned | VERT_ALIGNED_CENTER^b |

^a All N-grams are 1-grams by default.

^b This feature was not present in the example candidate. The values shown are example values from other documents.

^c In this example, the mention is 200, which forms part of the feature prefix. The value is shown in square brackets.
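
As an illustration of how a capitalized template prefix combines with a value, the binary tabular row, column, and Manhattan-distance features from Table 7 might be generated roughly as follows. This is a minimal sketch, not Fonduer's actual implementation: the CellPosition class and the binary_tabular_features function are hypothetical names, each mention is assumed to occupy a single cell (no row or column spans), the differences are assumed to be absolute values, and SAME_CELL is assumed to mean identical row and column within the same table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CellPosition:
    """Hypothetical location of a mention: a table id plus row/column indices."""
    table_id: int
    row: int
    col: int


def binary_tabular_features(a: CellPosition, b: CellPosition) -> List[str]:
    """Build template-prefixed feature strings for a pair of mentions.

    Assumption: differences are absolute, and the Manhattan distance is
    the sum of the row and column differences.
    """
    row_diff = abs(a.row - b.row)
    col_diff = abs(a.col - b.col)
    # The template prefix depends on whether both mentions share a table.
    prefix = "SAME_TABLE" if a.table_id == b.table_id else "DIFF_TABLE"
    features = [
        prefix,
        f"{prefix}_ROW_DIFF_{row_diff}",
        f"{prefix}_COL_DIFF_{col_diff}",
        f"{prefix}_MANHATTAN_DIST_{row_diff + col_diff}",
    ]
    # Assumption: same table, same row, and same column means same cell.
    if prefix == "SAME_TABLE" and row_diff == 0 and col_diff == 0:
        features.append("SAME_CELL")
    return features


if __name__ == "__main__":
    first = CellPosition(table_id=0, row=5, col=3)
    second = CellPosition(table_id=0, row=4, col=0)
    for feature in binary_tabular_features(first, second):
        print(feature)
    # prints:
    # SAME_TABLE
    # SAME_TABLE_ROW_DIFF_1
    # SAME_TABLE_COL_DIFF_3
    # SAME_TABLE_MANHATTAN_DIST_4
```

Encoding the numeric value directly into the feature string, as the table's example values do, makes each distinct distance its own feature rather than a single numeric input.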