. 2022 Nov 19;28(1):7. doi: 10.1007/s10664-022-10229-z

Table 13.

Data science code features for classification

Feature Description Rationale for inclusion
Cell features are extracted from the content of individual cells, their related components (such as the output obtained from the cell execution), and information about neighbouring cells.
filename‡ Filename indicates the name of the notebook file that the cell comes from. It provides metadata about where the cell resides. Cells that share a filename but have different cell numbers provide positional information that may indicate the presence of certain data science steps.
cell_number Cell number indicates the position of the cell in a given notebook. Cell number along with filename uniquely identifies a cell in the dataset. The position of a code cell may be indicative of the type of data science action in the cell, as different data science steps are expected to occur at certain phases in the cycle (Wang et al. 2019a; Zhang et al. 2020a; Aggarwal et al. 2019; Muller et al. 2019b; Souza et al. 2020). For example, data_preparation and data_exploration generally occur in the initial stages of the data science workflow, whereas modelling and evaluation occur at a later stage.
execution_count Execution count indicates the order in which a cell was executed by the user in a notebook. It provides execution-level information about a code cell which could be indicative of a certain data science step. For example, a lower number may suggest a step at the beginning of the data science workflow.
text Text contains the content of the cell, which includes either the code text or markdown text depending on the cell_type (markdown or actual executable code). Different blocks of code implement different data science actions and hence can exhibit patterns that identify the data science step.
comment All the comments in a given code cell. This feature is valid only for code cells. It provides (additional) information about the intended purpose of the code present in a particular cell (Rule et al. 2018).
output_text This indicates the content of the output. For example, image/png or text/plain. This feature is valid for the cells of cell_type output. The output information of a given code cell may be indicative of a certain data science step. For example, image/png may indicate a data_exploration step since data scientists use visualisation techniques to explore data.
output_type This indicates the type of the output. For example, display_data. This feature is valid for the cells of cell_type output. It provides information about the type of the output (stream data or rich mime-type output) that could be indicative of a certain data science step.
code_line_before code_line_before indicates the last line of code in a code cell preceding the current cell. In the case of the cells of cell_type markdown or raw_nb, this information is obtained from the code cell preceding it (may or may not be immediately preceding). The previous code cell’s line of code provides additional context by providing information regarding the surrounding structure of a particular data science step (Bacchelli et al. 2012).
code_line_after code_line_after indicates the first line of code in a code cell following the current cell. In the case of the cells of cell_type markdown or raw_nb, this information is obtained from the code cell following the current cell (may or may not be immediately succeeding). The following code cell’s line of code provides additional contextual information and may provide indications about the previous step.
markdown_heading markdown_heading indicates the first line of a markdown cell. In the case of the cells of cell_type code, this information is obtained from the markdown cell preceding it (may or may not be immediately preceding). Markdown headings are descriptive of the steps in the code cells that follow because data scientists tend to organise their code in sections for readability purposes (Rule et al. 2018; Bacchelli et al. 2012).
External features, which contain the detailed textual description of the libraries imported into a cell, are included in order to extract information about its purpose.
packages_info This feature indicates the man-page information about the libraries imported in a given cell. The information contains the description of the libraries as given in an external source: the Python Package Index (PyPI)a. It describes the libraries imported in the cell in detail, thus providing a larger bag of distinctive words.
Statistical/Metric features: Statistical or software metric-based features contain code metrics for a given cell. The custom metrics below are generated using class:NotebookMetrics and class:CodeCellMetrics. The metric-based features are included under the view that structural information of the notebook can provide information about the data science step.
no_linesofcode Lines of code indicates the total number of lines in a given cell of cell_type code. It provides statistical information about the density of a code cell, which could be indicative of the data science step because certain activities require more lines of code than others. For example, data_preprocessing may have a lot of lines of code.
no_linesofcomment Lines of comment indicates the total number of comment lines in a given cell of cell_type code. Any line that starts with ’#’ is considered a comment line. It provides statistical information about the density of comments in a code cell which could be indicative of a certain data science step. For example, data_exploration may have a lot of lines of comments explaining the insights.
function_count Function count indicates the number of functions in a given cell of cell_type code. Any line that starts with the keyword ’def’ is considered to be a function. It provides statistical information about the number of functions created in a code cell, which can indicate a certain data science step (Barstad et al. 2014). For example, helper_functions will not have any functions defined.
variable_count Variable count indicates the number of variables in a given cell of cell_type code. A variable is considered to be any keyword composed of alphanumerics and _ with a ’=’ to the right. Parameters are not considered. It provides statistical information about cell structure and thus may help in identifying the data science step it is associated with (Barstad et al. 2014). For example, data_preprocessing may have a lot of variables compared to helper_functions.
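The four metric rows above (no_linesofcode, no_linesofcomment, function_count, variable_count) are defined by simple line-level rules, which can be sketched as follows. This is a minimal illustration, not the NotebookMetrics or CodeCellMetrics implementation, which the table does not show. The function name cell_metrics is hypothetical, and several choices are assumptions: blank lines are counted as lines, leading whitespace is ignored when testing for ’#’ and ’def’, distinct variable names are counted rather than occurrences, and parameters are excluded by only matching assignments at the start of a line.

```python
import re

# Assumption: a variable is an identifier at the start of a line (after
# indentation) followed by a single "=". Matching only at the start of a
# line skips keyword arguments such as sep=";" and also "==" comparisons.
_ASSIGNMENT = re.compile(r"^[ \t]*([A-Za-z_]\w*)[ \t]*=(?!=)", re.MULTILINE)


def cell_metrics(source: str) -> dict:
    """Compute rough line-level metrics for the text of one code cell.

    Hypothetical helper that follows the rules stated in the table,
    not the paper's own code.
    """
    lines = source.splitlines()
    stripped = [line.lstrip() for line in lines]
    return {
        # Every line of the cell, including blank and comment lines.
        "no_linesofcode": len(lines),
        # Any line that starts with "#".
        "no_linesofcomment": sum(s.startswith("#") for s in stripped),
        # Any line that starts with the keyword "def".
        "function_count": sum(s.startswith("def ") for s in stripped),
        # Distinct names that appear on the left of an assignment.
        "variable_count": len(set(_ASSIGNMENT.findall(source))),
    }
```

A rule this simple misses tuple unpacking (a, b = ...) and attribute assignments; a parser-based count using Python's ast module would be more precise, at the cost of failing on cells that do not parse.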
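The neighbouring-cell features described in the code_line_before, code_line_after and markdown_heading rows can be derived with one forward and one backward pass over a notebook's cells. The sketch below assumes the cells are given as a list of dictionaries with the cell_type and source keys used by the Jupyter notebook JSON format. The function names are hypothetical, and skipping empty lines when picking the first or last line of a cell is an assumption the table does not state.

```python
def _source_lines(cell):
    """Return the non-empty lines of a cell's source."""
    source = cell.get("source", "")
    if isinstance(source, list):  # notebook JSON may store a list of lines
        source = "".join(source)
    return [line for line in source.splitlines() if line.strip()]


def context_features(cells):
    """Attach neighbouring-cell context to every cell (hypothetical helper)."""
    features = [
        {"code_line_before": None, "code_line_after": None, "markdown_heading": None}
        for _ in cells
    ]

    # Forward pass: remember the last line of the most recent code cell and
    # the first line of the most recent markdown cell.
    last_code_line = None
    last_heading = None
    for i, cell in enumerate(cells):
        lines = _source_lines(cell)
        if cell["cell_type"] == "markdown" and lines:
            last_heading = lines[0]  # a markdown cell uses its own first line
        features[i]["code_line_before"] = last_code_line
        features[i]["markdown_heading"] = last_heading
        if cell["cell_type"] == "code" and lines:
            last_code_line = lines[-1]

    # Backward pass: remember the first line of the next code cell.
    next_code_line = None
    for i in range(len(cells) - 1, -1, -1):
        features[i]["code_line_after"] = next_code_line
        lines = _source_lines(cells[i])
        if cells[i]["cell_type"] == "code" and lines:
            next_code_line = lines[0]

    return features
```

Because each pass carries the most recent value forward, the preceding or following code cell does not have to be adjacent to the current cell, which matches the "may or may not be immediately preceding" wording in the table.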