Table 7.
Features from Fonduer’s feature library. Example values are drawn from the example candidate in Figure 1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature’s value.
Feature Type | Arity | Description | Example Value |
---|---|---|---|
| |||
Structural | Unary | HTML tag of the mention | TAG_<h1> |
Structural | Unary | HTML attributes of the mention | HTML_ATTR_font-family:Arial |
Structural | Unary | HTML tag of the mention’s parent | PARENT_TAG_<p> |
Structural | Unary | HTML tag of the mention’s previous sibling | PREV_SIB_TAG_<td> |
Structural | Unary | HTML tag of the mention’s next sibling | NEXT_SIB_TAG_<h1> |
Structural | Unary | Position of a node among its siblings | NODE_POS_1 |
Structural | Unary | HTML class sequence of the mention’s ancestors | ANCESTOR_CLASS_<s1> |
Structural | Unary | HTML tag sequence of the mention’s ancestors | ANCESTOR_TAG_<body>_<p> |
Structural | Unary | HTML IDs of the mention’s ancestors | ANCESTOR_ID_l1b |
Structural | Binary | HTML tags shared between mentions on the path to the root of the document | COMMON_ANCESTOR_<body> |
Structural | Binary | Minimum distance from the two mentions to their lowest common ancestor | LOWEST_ANCESTOR_DEPTH_1 |
| |||
Tabular | Unary | N-grams in the same cell as the mentionᵃ | CELL_cevᵇ |
Tabular | Unary | Row number of the mention | ROW_NUM_5 |
Tabular | Unary | Column number of the mention | COL_NUM_3 |
Tabular | Unary | Number of rows the mention spans | ROW_SPAN_1 |
Tabular | Unary | Number of columns the mention spans | COL_SPAN_1 |
Tabular | Unary | Row header n-grams in the table of the mention | ROW_HEAD_collector |
Tabular | Unary | Column header n-grams in the table of the mention | COL_HEAD_value |
Tabular | Unary | N-grams from all cells that are in the same row as the given mentionᵃ | ROW_200_[ma]ᶜ |
Tabular | Unary | N-grams from all cells that are in the same column as the given mentionᵃ | COL_200_[6]ᶜ |
Tabular | Binary | Whether two mentions are in the same table | SAME_TABLEᵇ |
Tabular | Binary | Row number difference if two mentions are in the same table | SAME_TABLE_ROW_DIFF_1ᵇ |
Tabular | Binary | Column number difference if two mentions are in the same table | SAME_TABLE_COL_DIFF_3ᵇ |
Tabular | Binary | Manhattan distance between two mentions in the same table | SAME_TABLE_MANHATTAN_DIST_10ᵇ |
Tabular | Binary | Whether two mentions are in the same cell | SAME_CELLᵇ |
Tabular | Binary | Word distance between mentions in the same cell | WORD_DIFF_1ᵇ |
Tabular | Binary | Character distance between mentions in the same cell | CHAR_DIFF_1ᵇ |
Tabular | Binary | Whether two mentions in a cell are in the same sentence | SAME_PHRASEᵇ |
Tabular | Binary | Whether two mentions are in different tables | DIFF_TABLEᵇ |
Tabular | Binary | Row number difference if two mentions are in different tables | DIFF_TABLE_ROW_DIFF_4ᵇ |
Tabular | Binary | Column number difference if two mentions are in different tables | DIFF_TABLE_COL_DIFF_2ᵇ |
Tabular | Binary | Manhattan distance between two mentions in different tables | DIFF_TABLE_MANHATTAN_DIST_7ᵇ |
| |||
Visual | Unary | N-grams of all lemmas visually aligned with the mentionᵃ | ALIGNED_current |
Visual | Unary | Page number of the mention | PAGE_1 |
Visual | Binary | Whether two mentions are on the same page | SAME_PAGE |
Visual | Binary | Whether two mentions are horizontally aligned | HORZ_ALIGNEDᵇ |
Visual | Binary | Whether two mentions are vertically aligned | VERT_ALIGNED |
Visual | Binary | Whether two mentions’ left bounding-box borders are vertically aligned | VERT_ALIGNED_LEFTᵇ |
Visual | Binary | Whether two mentions’ right bounding-box borders are vertically aligned | VERT_ALIGNED_RIGHTᵇ |
Visual | Binary | Whether the centers of two mentions’ bounding boxes are vertically aligned | VERT_ALIGNED_CENTERᵇ |
ᵃ All n-grams are 1-grams by default.
ᵇ This feature was not present in the example candidate. The values shown are example values from other documents.
ᶜ In this example, the mention is 200, which forms part of the feature prefix. The value is shown in square brackets.
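
To make the template-plus-value naming concrete, the following minimal Python sketch builds a few of the binary tabular feature strings listed above and splits a feature string back into its template and value. It is an illustration only, not Fonduer's implementation: the `Mention` class, `tabular_binary_features`, `split_feature`, and the `TEMPLATES` tuple are hypothetical names, and the sketch assumes each mention occupies a single cell, that differences are taken as absolute values, and that the Manhattan distance is the sum of the row and column differences.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a mention that sits in one table cell."""
    table_id: int
    row: int
    col: int


def tabular_binary_features(m1: Mention, m2: Mention) -> List[str]:
    """Build template-prefixed feature strings for a pair of mentions.

    Assumption: differences are absolute, and the Manhattan distance is
    the row difference plus the column difference.
    """
    same_table = m1.table_id == m2.table_id
    prefix = "SAME_TABLE" if same_table else "DIFF_TABLE"
    row_diff = abs(m1.row - m2.row)
    col_diff = abs(m1.col - m2.col)
    feats = [
        prefix,
        f"{prefix}_ROW_DIFF_{row_diff}",
        f"{prefix}_COL_DIFF_{col_diff}",
        f"{prefix}_MANHATTAN_DIST_{row_diff + col_diff}",
    ]
    if same_table and row_diff == 0 and col_diff == 0:
        feats.append("SAME_CELL")
    return feats


# Hypothetical list of known templates (a small subset of the table above).
TEMPLATES: Tuple[str, ...] = (
    "SAME_TABLE", "SAME_TABLE_ROW_DIFF", "SAME_TABLE_COL_DIFF",
    "SAME_TABLE_MANHATTAN_DIST", "SAME_CELL",
    "DIFF_TABLE", "DIFF_TABLE_ROW_DIFF", "DIFF_TABLE_COL_DIFF",
    "DIFF_TABLE_MANHATTAN_DIST", "ROW_NUM", "COL_NUM",
)


def split_feature(feature: str,
                  templates: Tuple[str, ...] = TEMPLATES) -> Tuple[str, str]:
    """Split a feature string into (template, value).

    Templates themselves contain underscores, so the longest matching
    template is tried first. Valueless features return an empty value.
    """
    for template in sorted(templates, key=len, reverse=True):
        if feature == template:
            return template, ""
        if feature.startswith(template + "_"):
            return template, feature[len(template) + 1:]
    raise ValueError(f"no known template matches {feature!r}")


if __name__ == "__main__":
    a = Mention(table_id=0, row=5, col=3)
    b = Mention(table_id=0, row=6, col=6)
    for feat in tabular_binary_features(a, b):
        print(feat, split_feature(feat))
```

Matching the longest template first matters because some templates are prefixes of others: `SAME_TABLE_ROW_DIFF_1` must resolve to the template `SAME_TABLE_ROW_DIFF` with value `1`, not to `SAME_TABLE` with value `ROW_DIFF_1`.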