Table 13.
Data science code features for classification
Feature | Description | Rationale for inclusion |
---|---|---|
Cell features are extracted from the content of individual cells, their related components (such as the output obtained from the cell execution), and information about neighbouring cells. | ||
filename‡ | Filename indicates the name of the notebook file where the cell comes from. | It provides metadata information about the cell by providing information about where it resides. Cells containing the same filename but with a different cell number provide positional information about the cell that may indicate the presence of certain data science steps. |
cell_number | Cell number indicates the position of the cell in a given notebook. Cell number along with filename uniquely identifies a cell in the dataset. | This information about a code cell may be indicative of the type of the data science action in the cell, as different data science steps are expected to occur at a certain phase in the cycle (Wang et al. 2019a; Zhang et al. 2020a; Aggarwal et al. 2019; Muller et al. 2019b; Souza et al. 2020). For example, data_preparation and data_exploration generally occur in the initial stages of the data science workflow, whereas modelling and evaluation occur at a later stage. |
execution_count | Execution count indicates the order in which a cell was executed by the user in a notebook. | It provides execution-level information about a code cell which could be indicative of a certain data science step. For example, a lower number may suggest a step at the beginning of the data science workflow. |
text | Text contains the content of the cell, which includes either the code text or markdown text depending on the cell_type (markdown or actual executable code). | Different blocks of code implement different data science actions and hence can exhibit patterns that identify the data science step. |
comment | All the comments in a given code cell. This feature is valid only for code cells. | It provides (additional) information about the intended purpose of the code present in a particular cell (Rule et al. 2018). |
output_text | This indicates the content of the output. For example, image/png or text/plain. This feature is valid for the cells of cell_type output. | The output information of a given code cell may be indicative of a certain data science step. For example, image/png may indicate a data_exploration step since data scientists use visualisation techniques to explore data. |
output_type | This indicates the type of the output. For example, display_data. This feature is valid for the cells of cell_type output. | It provides information about the type of the output (stream data or rich mime-type output) that could be indicative of a certain data science step. |
code_line_before | code_line_before indicates the last line of code in a code cell preceding the current cell. In the case of the cells of cell_type markdown or raw_nb, this information is obtained from the code cell preceding it (may or may not be immediately preceding). | The previous code cell’s line of code provides additional context by providing information regarding the surrounding structure of a particular data science step (Bacchelli et al. 2012). |
code_line_after | code_line_after indicates the first line of code in a code cell following the current cell. In the case of the cells of cell_type markdown or raw_nb, this information is obtained from the code cell following the current cell (may or may not be immediately succeeding). | The following code cell’s line of code provides additional contextual information and may provide indications about the previous step. |
markdown_heading | markdown_heading indicates the first line of a markdown cell. In the case of the cells of cell_type code, this information is obtained from the markdown cell preceding it (may or may not be immediately preceding). | Markdown headings are descriptive of the steps in the code cells that follow because data scientists tend to organise their code in sections for readability purposes (Rule et al. 2018; Bacchelli et al. 2012). |
External features containing the detailed textual description of the imported libraries into a cell are included in order to extract information about its purpose. | ||
packages_info | This feature indicates the man-page information about the libraries imported in a given cell. The information contains the description of libraries as given in an external source: The Python Package Index (pypi)a. | It provides information about the libraries imported in the cell using a detailed description, thus providing a larger bag of distinctive words. |
Statistical/Metric features: Statistical or Software Metric-based features are features that contain code metrics for a given cell. The custom metrics below are generated using class:NotebookMetrics and class:CodeCellMetrics. The metric-based features are included under the view that structural information of the notebook can provide information about the data science step. | ||
no_linesofcode | Lines of code indicates the total number of lines in a given cell of cell_type code. | It provides statistical information about the density of a code cell, which could be indicative of the data science step because certain activities require more lines of code than others. For example, data_preprocessing may have a lot of lines of code. |
no_linesofcomment | Lines of comment indicates the total number of comment lines in a given cell of cell_type code. Any line that starts with ’#’ is considered a comment line. | It provides statistical information about the density of comments in a code cell which could be indicative of a certain data science step. For example, data_exploration may have a lot of lines of comments explaining the insights. |
function_count | Function count indicates the number of functions in a given cell of cell_type code. Any line that starts with the keyword ’def’ is considered to be a function. | It provides statistical information about the number of functions created in a code cell, which can indicate a certain data science step (Barstad et al. 2014). For example, helper_functions will not have any functions defined. |
variable_count | Variable count indicates the number of variables in a given cell of cell_type code. A variable is considered to be any keyword composed of alphanumerics and _ with a ’=’ to the right. Parameters are not considered. | It provides statistical information about cell structure and thus may help in identifying the data science step it is associated with (Barstad et al. 2014). For example, data_preprocessing may have a lot of variables compared to helper_functions. |
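
The metric and neighbouring-cell features in the table can be illustrated with a minimal Python sketch. It is not the implementation behind class:NotebookMetrics or class:CodeCellMetrics, whose methods are not described here. The function names `cell_metrics` and `context_features`, the dictionary layout of a cell (`cell_type`, `text`), and the choices to count every line of the cell, to count distinct variable names, and to ignore leading whitespace are all assumptions made for illustration.

```python
import re

# Assumption: a "variable" is a name at the start of a line followed by a
# single '='. This skips '==' and augmented assignments, but it would also
# match a keyword argument placed on its own line in a multi-line call.
_ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=(?!=)")


def cell_metrics(source):
    """Compute the four metric features for the text of one code cell."""
    lines = source.splitlines()
    stripped = [line.strip() for line in lines]
    variables = set()
    for line in lines:
        match = _ASSIGNMENT.match(line)
        if match:
            variables.add(match.group(1))
    return {
        "no_linesofcode": len(lines),  # assumption: all lines are counted
        "no_linesofcomment": sum(s.startswith("#") for s in stripped),
        "function_count": sum(s.startswith("def ") for s in stripped),
        "variable_count": len(variables),  # assumption: distinct names
    }


def _nonblank(text):
    """Return the non-empty lines of a cell's text."""
    return [line for line in text.splitlines() if line.strip()]


def context_features(cells, index):
    """Look up the neighbouring-cell features for cells[index].

    `cells` is a hypothetical list of dicts with the keys "cell_type"
    ("code" or "markdown") and "text", in notebook order.
    """
    before = after = heading = ""
    current = cells[index]
    if current["cell_type"] == "markdown":
        own = _nonblank(current["text"])
        heading = own[0] if own else ""
    # Walk backwards so that the nearest preceding cells are seen first.
    for cell in reversed(cells[:index]):
        lines = _nonblank(cell["text"])
        if not lines:
            continue
        if cell["cell_type"] == "code" and not before:
            before = lines[-1]
        elif cell["cell_type"] == "markdown" and not heading:
            heading = lines[0]
    # Walk forwards to the first following code cell that has content.
    for cell in cells[index + 1:]:
        lines = _nonblank(cell["text"])
        if cell["cell_type"] == "code" and lines:
            after = lines[0]
            break
    return {
        "code_line_before": before,
        "code_line_after": after,
        "markdown_heading": heading,
    }
```

Calling `cell_metrics` on the text of a code cell returns the four counts as a dictionary, and `context_features(cells, i)` returns the three context strings for the cell at position `i`, with an empty string where no suitable neighbour exists.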