{
"cells": [
{
"cell_type": "markdown",
"id": "0281b33a",
"metadata": {},
"source": [
"
\n",
"
Development and validation of a model for measuring alcohol consumption from transdermal alcohol content data among college students \n",
" Programmer: Sina Kianersi
\n",
" Email:
skianersi@bwh.harvard.edu\n",
" Objectives: This notebook contains the source code for developing the model described in the manuscript.\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "480fd13e",
"metadata": {},
"outputs": [],
"source": [
"# Import required packages\n",
"import random \n",
"import pandas as pd # version 1.4.4\n",
"import numpy as np # version 1.21.5\n",
"import os\n",
"import datetime as dt \n",
"from pytz import FixedOffset # version 2022.1\n",
"import io\n",
"\n",
"from scipy.stats import randint, beta, uniform, reciprocal # version 1.9.1\n",
"from scipy.signal import find_peaks \n",
"from scipy.integrate import simps \n",
"from scipy.ndimage import median_filter\n",
"\n",
"# Ignore pandas future warnings\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"\n",
"from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin # version 1.2.2\n",
"from sklearn.model_selection import (RandomizedSearchCV, GridSearchCV, GroupKFold, ParameterGrid) \n",
"from sklearn.metrics import (balanced_accuracy_score, mean_absolute_error) \n",
"from sklearn.utils.fixes import loguniform \n",
"from sklearn.preprocessing import StandardScaler \n",
"from sklearn.linear_model import (LinearRegression, SGDRegressor)\n",
"from sklearn.svm import SVR \n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"from lightgbm import LGBMRegressor # version 3.3.5\n",
"import json \n",
"import joblib # version 1.2.0\n",
"\n",
"import multiprocessing # version\n",
"n_jobs = int(max(1, multiprocessing.cpu_count())/2)\n",
"\n",
"# set path to data\n",
"data_path= # add data path"
]
},
{
"cell_type": "markdown",
"id": "f6b9ef75",
"metadata": {},
"source": [
"# Model Development"
]
},
{
"cell_type": "markdown",
"id": "4dd8d1e9",
"metadata": {},
"source": [
"__Figure 1. Summary of model development__\n",
"
\n"
]
},
{
"cell_type": "markdown",
"id": "e5368115",
"metadata": {},
"source": [
"## Procedures I and II"
]
},
{
"cell_type": "markdown",
"id": "1f3bc23f",
"metadata": {},
"source": [
"\n",
"\n",
"
Procedure I: TAC data processing\n",
"
\n",
" - We took three steps to clean and process the TAC data: Step 1) recode negative TAC values as zero, Step 2) implement median filter on the recoded TAC values, step 3) implement moving average on the output of median filter. The number of TAC values that enter the median filtering process at each point is determined by a hyperparameter named window size (named
size
in the code). After implementing median filter on the TAC data, we further attempted to remove remaining noise using moving average. Similarly, for moving average, there was one hyperparameter that we tuned, window size (named window
in the code). Moving average generates a few missing values. Peak detection algorithm (procedure II of our model) may produce wrong results if there are missing values in the TAC timeseries ([REF](https://docs.scipy.org/doc/)). All missing TAC values were removed before running the peak detection algorithm. \n",
" - A total of 1,066,391 negative TAC values were coded as zero. Among negative values, median TAC value was -3.03 ug/L(air) (IQR: -4.02)”
\n",
"
\n",
"\n",
"
Procedure II: Peak detection algorithm
\n",
"
\n",
" - Defining true/false positive/negative cases: The goal of the second procedure was to use a peak detection algorithm to detect drinking events in processed TAC data. Before training the algorithm, we needed to clearly define true/false positive/negative cases.
\n",
" - Five-hour intervals: We divided the week of data collection into equal five-hour intervals starting from 9:00 am of the baseline visit day to 9:00 pm of the endline visit day. This was a necessary step because, unlike daily surveys where the time unit of analysis is clear (i.e., day), there is no unit of analysis for the EMA app data. It is not possible to conceptualize and enumerate some performance measures (true negative/false negative) without a unit of analysis.
\n",
" - If there was no peak detected with the peak detection algorithm and no drinking start time (event) was recorded in the EMA app for a five-hour interval, that interval was counted as a true negative. Otherwise, if there was no peak detected but a drink was recorded on the EMA app, the interval was counted as a false negative. However, when evaluating the number of true positive and false positive values, we did not take the five-hour intervals into consideration because a detected peak and a recorded timestamp could have been close to each other but on different five-hour intervals. Instead, we assessed the time difference between a recorded drinking event start time on the EMA app and that detected with the peak detection model. If this time difference was less than 5 hours, the detected peak was counted as a true positive and otherwise it was a false positive.
\n",
" - Five-hour intervals with no TAC data points were removed from the analysis. For the EMA app data, we excluded participants (and their five-hour intervals) with no recorded EMA app timestamp from the analysis. Participants with one or more recorded EMA app timestamps were included in the model development analysis. Participants could not report no drinking in the EMA app. Therefore, any five-hour interval with no EMA app timestamp was considered an interval with no drinking. However, it was possible that a participant missed recording a drinking timestamp for an interval and we erroneously counted that interval as a no drinking one. To take account of this issue, we further validated our model against daily surveys. Participants could report if they had no drinking in daily surveys. A missing report could be distinguished from a no drinking report with daily surveys (model validation analysis was conducted in a separate notebook available from corresponding author upon request).
\n",
"
\n",
"\n",
"
Figure shows different peak properties.\n",
"
\n",
" distance
: Minimal time difference between neighboring peaks. \n",
" prominence
(minimal required prominence): “The prominence of a peak measures how much a peak stands out from the surrounding baseline of the signal and is defined as the vertical distance between the peak and its lowest contour line.” \n",
" Width
(minimal required width): This is the horizontal width of the peak in samples. \n",
" Wlen
: “A window length in samples that optionally limits the evaluated area for each peak to a subset of timestamps” \n",
"
\n",
"\n",
"Procedures I and II outputs:
\n",
"
\n",
" - Left base is the first point in a peak’s timeseries defined as the drinking start time in our study.
\n",
" - Right base is the last point in a peak’s timeseries.
\n",
" - Peak maximum is the maximum TAC value in a peak’s timeseries.
\n",
" - We further calculated area under the peak curve which is the area between a peak left and right bases. In our study, we hypothesized that this value is correlated with number of drinks consumed in a drinking event.
\n",
"
\n",
"\n",
"quotations are from SciPy peak detection algorithm documentations:
link\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "e72fa97c",
"metadata": {},
"source": [
"
\n"
]
},
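{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"Below is a minimal, illustrative sketch of Procedures I and II on a synthetic TAC series. It is an addition for exposition only; the synthetic signal and the small window sizes are assumptions chosen for readability, not the tuned hyperparameter values used in the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d5",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of Procedures I and II on synthetic data (assumed values).\n",
"rng = np.random.default_rng(0)\n",
"t = np.arange(1000)\n",
"toy_signal = 30 * np.exp(-0.5 * ((t - 500) / 80) ** 2)  # one drinking-like peak\n",
"toy_signal += rng.normal(0, 2, size=t.size)             # sensor noise\n",
"\n",
"# Procedure I, step 1: recode negative TAC values as zero\n",
"toy_tac = np.where(toy_signal < 0, 0, toy_signal)\n",
"# Procedure I, step 2: median filter (window size = `size`)\n",
"toy_med = median_filter(toy_tac, size=9)\n",
"# Procedure I, step 3: moving average (window size = `window`); drop the NaNs it creates\n",
"toy_smooth = pd.Series(toy_med).rolling(window=15).mean().dropna()\n",
"\n",
"# Procedure II: peak detection on the processed series\n",
"toy_peaks, toy_props = find_peaks(toy_smooth.values, distance=100,\n",
"                                  prominence=(5, None), wlen=500, width=(10, None))\n",
"print(\"Detected peak indices:\", toy_peaks)\n",
"print(\"Left bases (drinking start):\", toy_props[\"left_bases\"])"
]
},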
{
"cell_type": "markdown",
"id": "602fea02",
"metadata": {},
"source": [
"\n",
"
Input: \n",
"
\n",
" TAC_df
is a csv dataset containing TAC signal data collected by all participants. \n",
" ref_df
is a csv dataset containing EMA app data (reference standard test). It has one row per participant per five hour interval. \n",
"
\n",
"
Outputs: left base, AUC, and peak maximum \n",
"
\n",
"
train dataset: The train dataset,
train_df
was created using
TAC_df
and
ref_df
. Negative TAC values have already been coded as zeros in
TAC_df
dataset. \n",
"
train_df
has the drinking event start times recorded in the EMA app (reference standard test) as well as TAC signals. Each row in this dataset represents a participant-five hour interval;
train_df.shape
is
(2429, 6)
. Each participant has on average 29 five-hour intervals.\n",
"
Columns in train_df
:\n",
"
\n",
" - participant_id
\n",
" - time: Start timestamp of five hour intervals
\n",
" - drinking_event: the value equals
1
if the interval (time
) encompasses start of a drinking event (reported with the EMA app) and 0
if it was not a drinking event. \n",
" - drinking_timestamp: Timestamp when participant recorded start of a new drinking event on the EMA app (
NaT
if drinking_event == 0
). \n",
" - ema_n_drinks: shows the number of drinks consumed in a drinking event (
NaN
if drinking_event == 0
). \n",
"
\n",
" The following two columns represent TAC signal and each include multiple data points stored as a list in each row of
train_df
. \n",
"
\n",
" - datetime: A list of timestamps recorded with Skyn for a participant within their five hour intervals.
\n",
" - TAC ug/L(air): A list of TAC value which correspond to
datetime
. \n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c28356b6",
"metadata": {
"code_folding": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TAC data shape: (1964713, 3)\n",
"EMA app data shape: (2429, 5)\n"
]
}
],
"source": [
"# Load TAC data\n",
"TAC_df=pd.read_csv(data_path+\"/04_Processed_data/Raw TAC_no negative value.csv\")\n",
"TAC_df.datetime=pd.to_datetime(TAC_df.datetime)\n",
"TAC_df=TAC_df[[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"]] \n",
"TAC_df= TAC_df.sort_values(by=[\"participant_id\",\"datetime\"])\n",
"print(\"TAC data shape:\", TAC_df.shape)\n",
"\n",
"# Load EMA app data (reference standard test)\n",
"ref_df = pd.read_csv(data_path+\"/04_Processed_data/EMA app data ready for model development.csv\")\n",
"ref_df[\"time\"]=pd.to_datetime(ref_df.time)\n",
"ref_df.drinking_timestamp=pd.to_datetime(ref_df.drinking_timestamp)\n",
"ref_df = ref_df[[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]]\n",
"ref_df= ref_df.sort_values(by=[\"participant_id\",\"time\"])\n",
"print(\"EMA app data shape:\",ref_df.shape)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ec9a29e7",
"metadata": {},
"outputs": [],
"source": [
"# Here, we merge TAC_df and ref_df to make train_df.\n",
"# First, create a temporary column in ref_df with the time interval upper bound\n",
"ref_df['time_upper_bound'] = ref_df['time'] + pd.Timedelta(hours=5)\n",
"\n",
"# Merge the two DataFrames based on participant_id\n",
"merged_df = pd.merge(TAC_df, ref_df, on='participant_id', how='right')\n",
"\n",
"# Filter rows within the 5-hour intervals\n",
"merged_df = merged_df[(merged_df['datetime'] >= merged_df['time']) & (merged_df['datetime'] < merged_df['time_upper_bound'])]\n",
"\n",
"# Group the DataFrames by participant_id and time, and aggregate datetime and TAC ug/L(air) values in lists\n",
"grouped_df = merged_df.groupby(['participant_id', 'time']).agg({'datetime': list, 'TAC ug/L(air)': list}).reset_index()\n",
"\n",
"# Merge the aggregated values back to ref_df\n",
"train_df = pd.merge(ref_df, grouped_df, on=['participant_id', 'time'])\n",
"\n",
"# Drop the temporary column in ref_df\n",
"train_df = train_df.drop(columns=['time_upper_bound'])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "22f46d03",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" time | \n",
" drinking_timestamp | \n",
" drinking_event | \n",
" ema_n_drinks | \n",
" datetime | \n",
" TAC ug/L(air) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2021-03-25 09:00:00-04:00 | \n",
" NaT | \n",
" 0 | \n",
" NaN | \n",
" [2021-03-25 11:40:29-04:00, 2021-03-25 11:40:4... | \n",
" [26.35, 3.14, 0.0, 10.45, 13.17, 12.75, 9.83, ... | \n",
"
\n",
" \n",
" 1 | \n",
" 2021-03-25 14:00:00-04:00 | \n",
" NaT | \n",
" 0 | \n",
" NaN | \n",
" [2021-03-25 14:00:14-04:00, 2021-03-25 14:00:3... | \n",
" [2.3, 2.3, 3.97, 2.09, 3.14, 2.72, 2.09, 2.51,... | \n",
"
\n",
" \n",
" 2 | \n",
" 2021-03-25 19:00:00-04:00 | \n",
" NaT | \n",
" 0 | \n",
" NaN | \n",
" [2021-03-25 19:00:14-04:00, 2021-03-25 19:00:3... | \n",
" [0.84, 0.21, 0.63, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... | \n",
"
\n",
" \n",
" 3 | \n",
" 2021-03-26 00:00:00-04:00 | \n",
" NaT | \n",
" 0 | \n",
" NaN | \n",
" [2021-03-26 00:00:14-04:00, 2021-03-26 00:00:3... | \n",
" [12.13, 11.08, 12.34, 13.8, 14.01, 12.75, 12.7... | \n",
"
\n",
" \n",
" 4 | \n",
" 2021-03-26 05:00:00-04:00 | \n",
" NaT | \n",
" 0 | \n",
" NaN | \n",
" [2021-03-26 05:00:14-04:00, 2021-03-26 05:00:3... | \n",
" [0.63, 0.63, 0.84, 0.63, 0.84, 0.63, 0.84, 1.2... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" time drinking_timestamp drinking_event ema_n_drinks \\\n",
"0 2021-03-25 09:00:00-04:00 NaT 0 NaN \n",
"1 2021-03-25 14:00:00-04:00 NaT 0 NaN \n",
"2 2021-03-25 19:00:00-04:00 NaT 0 NaN \n",
"3 2021-03-26 00:00:00-04:00 NaT 0 NaN \n",
"4 2021-03-26 05:00:00-04:00 NaT 0 NaN \n",
"\n",
" datetime \\\n",
"0 [2021-03-25 11:40:29-04:00, 2021-03-25 11:40:4... \n",
"1 [2021-03-25 14:00:14-04:00, 2021-03-25 14:00:3... \n",
"2 [2021-03-25 19:00:14-04:00, 2021-03-25 19:00:3... \n",
"3 [2021-03-26 00:00:14-04:00, 2021-03-26 00:00:3... \n",
"4 [2021-03-26 05:00:14-04:00, 2021-03-26 05:00:3... \n",
"\n",
" TAC ug/L(air) \n",
"0 [26.35, 3.14, 0.0, 10.45, 13.17, 12.75, 9.83, ... \n",
"1 [2.3, 2.3, 3.97, 2.09, 3.14, 2.72, 2.09, 2.51,... \n",
"2 [0.84, 0.21, 0.63, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... \n",
"3 [12.13, 11.08, 12.34, 13.8, 14.01, 12.75, 12.7... \n",
"4 [0.63, 0.63, 0.84, 0.63, 0.84, 0.63, 0.84, 1.2... "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.iloc[:,1:].head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "78ee3312",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 2429.000000\n",
"mean 808.856731\n",
"std 216.738438\n",
"min 1.000000\n",
"25% 900.000000\n",
"50% 900.000000\n",
"75% 900.000000\n",
"max 903.000000\n",
"Name: TAC ug/L(air), dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of TAC values in each time interval\n",
"# Descriptive results for this number were almost the same for 5-hour intervals with/without a drinking event start time\n",
"train_df['TAC ug/L(air)'].apply(lambda x: len(x)).astype('int64').describe()"
]
},
{
"cell_type": "markdown",
"id": "52ad1d7d",
"metadata": {},
"source": [
"\n",
" Make an estimator to conduct procedures I and II,
DrinkDetector
. This estimator follows the scikit-learn API requirements (Ref:
link). \n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "67e492b2",
"metadata": {},
"outputs": [],
"source": [
"class DrinkDetector(BaseEstimator, ClassifierMixin, TransformerMixin):\n",
" \"\"\"A scikit-learn compatible estimator that conducts procedures I and II of the model.\n",
" \n",
" Parameters\n",
" ----------\n",
" size: median_filter kernel size\n",
" window: rolling average window size\n",
" distance: peak minimum distance in peak detection algorithm\n",
" prominence: peak minimum prominence\n",
" wlen: window length\n",
" width: peak minimum width\n",
" \n",
" Please see the following website for more details on the parameters: \n",
" https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.median_filter.html\n",
" https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html\n",
" https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html\n",
" \"\"\"\n",
" \n",
" def __init__(self, size=39, window=206, distance=265, prominence=8.0,\n",
" wlen=2202, width=36):\n",
" self.size = size\n",
" self.window = window\n",
" self.distance = distance\n",
" self.prominence = prominence\n",
" self.wlen = wlen\n",
" self.width = width\n",
" \n",
" def fit(self, X, y=None):\n",
" \"\"\"\n",
" Parameters\n",
" ----------\n",
" X : TAC data collected with Skyn. X is a numpy array with three \n",
" columns \"participant_id\",\"datetime\",\"TAC ug/L(air)\".\n",
" \n",
" y: EMA app/self-reported drinking event start times and number of drinkgs. y is a numpy array\n",
" with columns for \"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"\n",
" \"\"\"\n",
" return self\n",
" \n",
" def predict(self, X):\n",
" X=pd.DataFrame(X,columns=[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"])\n",
"\n",
" # Make X in long format with one row per participan TAC timestamp \n",
" data = [{'participant_id': row.participant_id, 'datetime': dt, 'TAC ug/L(air)': tac}\n",
" for row in X.itertuples() for dt, tac in zip(row.datetime, row._3)]\n",
" X = pd.DataFrame(data)\n",
" X.index=X.datetime\n",
" X = X[[\"participant_id\",\"TAC ug/L(air)\"]]\n",
" \n",
" y_pred = pd.DataFrame(columns=[\"participant_id\",\"index_test\",\"right_bases\",\"peak_maximum\",\"peak_auc\"]) \n",
" for participant_id in X.iloc[:,0].unique():\n",
" # Get data for each participant.\n",
" p_data = X.query('participant_id == @participant_id').copy() # participant level TAC data\n",
" \n",
" # PROCEDURE I (SIGNAL FILTERING)\n",
" ## Median filter\n",
" p_data[\"medfilt\"] = median_filter(p_data[\"TAC ug/L(air)\"], size=self.size)\n",
"\n",
" ## Moving average on median filter\n",
" p_data[\"medfilt_rolling\"] = p_data[\"medfilt\"].rolling(window=self.window).mean() \n",
" ## Remove missing values created in procedure I\n",
" p_data = p_data.loc[p_data[\"medfilt_rolling\"].notna()]\n",
"\n",
" # PROCEDURE II (PEAK DETECTION)\n",
" peaks, properties = find_peaks(p_data[\"medfilt_rolling\"].values, \n",
" distance = self.distance, \n",
" prominence = (self.prominence, None),\n",
" wlen = self.wlen,\n",
" width = (self.width,None))\n",
" \n",
" ## store peak properties: Left bases are the index test, model detected drinking event start times\n",
" index_test = p_data[\"medfilt_rolling\"].index[properties[\"left_bases\"]]\n",
" ## get right bases of detected peak too\n",
" rbs = p_data[\"medfilt_rolling\"].index[properties[\"right_bases\"]]\n",
" max_peak = p_data.loc[p_data.index.isin(p_data[\"medfilt_rolling\"].index[peaks]),\"medfilt_rolling\"].values \n",
" dict_df = {'participant_id': participant_id, 'index_test':index_test,'right_bases':rbs, 'peak_maximum':max_peak} \n",
" peaks_props = pd.DataFrame(dict_df) \n",
" ## AUC: we calculate area under the peak curve and add it to peaks_props\n",
" peaks_props[\"peak_auc\"] = [simps(dx=1, y=p_data[it:rb].medfilt_rolling)\n",
" for it, rb in zip(peaks_props.index_test, peaks_props.right_bases)]\n",
" y_pred = pd.concat([y_pred, peaks_props])\n",
" \n",
" y_pred = y_pred.values\n",
" return y_pred\n",
" \n",
" def main_analysis(self,X, y=None):\n",
" y=pd.DataFrame(y,columns=[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]) \n",
" y.participant_id=y.participant_id.astype(\"int64\")\n",
" y.drinking_event=y.drinking_event.astype(\"int64\")\n",
" y.ema_n_drinks=y.ema_n_drinks.astype(float)\n",
" y.sort_values(by=\"drinking_timestamp\",inplace=True) # EMA app recorded drinking start time\n",
" \n",
" \n",
" y_pred = self.predict(X)\n",
" y_pred = pd.DataFrame(data=y_pred, columns=[\"participant_id\",\"index_test\",\"right_bases\",\"peak_maximum\",\"peak_auc\"])\n",
" y_pred['index_test'] = pd.to_datetime(y_pred['index_test'], \n",
" utc=True).dt.tz_convert(FixedOffset(-240))\n",
"\n",
" y_pred[\"peak_id\"]=np.arange(0,y_pred.shape[0]) # make peak id in y_pred\n",
" y_pred.sort_values(by=\"index_test\",inplace=True)\n",
" y_pred.participant_id=y_pred.participant_id.astype(\"int64\")\n",
" \n",
" \n",
" # make a df for true pos. and false neg.\n",
" TP_FN=pd.merge_asof(left=y[y.drinking_event ==1 ], right=y_pred,\n",
" by=\"participant_id\", left_on=\"drinking_timestamp\", right_on=\"index_test\",\n",
" allow_exact_matches=True, \n",
" direction=\"nearest\", \n",
" tolerance=pd.Timedelta(\"5h\")) \n",
" \n",
" TP_FN[\"time_difference\"]=abs(TP_FN.drinking_timestamp-TP_FN.index_test)\n",
" # for the duplicates, keep the ones with smaller time_difference values\n",
" # To do that, first find duplicate peak_ids with larger time difference\n",
" code_nan=TP_FN.loc[(TP_FN.peak_id.notna())&(TP_FN.duplicated(subset=[\"peak_id\"],keep=False))\n",
" ].groupby(by=\"index_test\").max()\n",
" \n",
" if code_nan.shape[0] > 0:\n",
" print(code_nan.shape[0],\"duplicates were generated while calculating true pos. false neg.\",end='\\r')\n",
" # Set them to nan:\n",
" TP_FN.loc[(TP_FN.peak_id.isin(code_nan.peak_id))&\n",
" (TP_FN.time_difference.isin(code_nan.time_difference)),\n",
" ['index_test','peak_id']]=np.nan\n",
" # drop time_difference\n",
" TP_FN.drop(columns=[\"time_difference\"],inplace=True) \n",
" \n",
" # Next, Make a df for true neg and concat it to TP_FN\n",
" TN=y.loc[y.drinking_timestamp.isna()]\n",
" TP_FN_TN=pd.concat([TP_FN,TN])\n",
" \n",
" # Next, Make a df for false pos. (these are detected peaks not in TP_FN)\n",
" FP=y_pred.loc[~y_pred.peak_id.isin(TP_FN.peak_id)]\n",
" FP=FP.rename(columns={\"index_test\":\"false_p\"})\n",
"\n",
" # merge FP to TP_FN_TN: If more than 1 FP in a 5-hour interval, just one of them is counted\n",
" TP_FN_TN.sort_values(by=\"time\",inplace=True)\n",
" FP.sort_values(by=\"false_p\",inplace=True)\n",
" TP_FN_TN_FP=pd.merge_asof(left=TP_FN_TN, right=FP, suffixes=('_TP', '_FP'), \n",
" by=\"participant_id\", left_on=\"time\", right_on=\"false_p\",\n",
" allow_exact_matches=True, direction=\"forward\",\n",
" tolerance=pd.Timedelta(\"5h\"))\n",
" \n",
" # Make a true_label and pred_label for scoring\n",
" TP_FN_TN_FP.rename(columns={\"drinking_event\":\"true_label\"},inplace=True) \n",
" TP = TP_FN_TN_FP.query('index_test.notnull() &'\n",
" 'drinking_timestamp.notnull()').index\n",
" FN = TP_FN_TN_FP.query('index_test.isnull() &'\n",
" 'drinking_timestamp.notnull()').index\n",
" FP = TP_FN_TN_FP.query('false_p.notnull() &'\n",
" 'index_test.isnull()').index\n",
" TN = TP_FN_TN_FP.query('index_test.isnull() &'\n",
" 'drinking_timestamp.isnull() &'\n",
" 'false_p.isnull()').index\n",
" \n",
" TP_FN_TN_FP.loc[TP,\"pred_label\"]=1 # True pos \n",
" TP_FN_TN_FP.loc[FN,\"pred_label\"]=0 # False neg\n",
" # if for a drinking interval, there are more than one peak, \n",
" # it is counted as one True positive\n",
" TP_FN_TN_FP.loc[FP,\"pred_label\"]=1 # False pos\n",
" TP_FN_TN_FP.loc[TN,\"pred_label\"]=0 # True neg.\n",
"\n",
" # Compute the AUC score based on the binary labels and return it\n",
" score = balanced_accuracy_score(TP_FN_TN_FP['true_label'], TP_FN_TN_FP['pred_label'])\n",
" \n",
" return score, TP_FN_TN_FP\n",
" \n",
" def score(self,X, y=None):\n",
" score, _ = self.main_analysis(X, y)\n",
" return score\n",
" \n",
" def outputs(self,X, y=None):\n",
" _, outputs = self.main_analysis(X, y)\n",
" outputs = outputs[[\"participant_id\",\"true_label\",\"pred_label\",\n",
" \"ema_n_drinks\",\"peak_maximum_TP\",\"peak_auc_TP\"]]\n",
" return outputs"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d33bbb47",
"metadata": {},
"outputs": [],
"source": [
"# set up X,y, and group\n",
"X = np.array(train_df[[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"]])\n",
"y = np.array(train_df[[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]])\n",
"groups = np.array(train_df[\"participant_id\"])"
]
},
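{
"cell_type": "markdown",
"id": "b2c3d4e5",
"metadata": {},
"source": [
"As an illustrative sanity check (an addition for exposition, not part of the original workflow), the untuned `DrinkDetector` can be scored directly; its `score` method returns balanced accuracy over the participant five-hour intervals."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e6",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sanity check (not part of the original workflow):\n",
"# score the DrinkDetector with its default hyperparameters before any tuning.\n",
"baseline_detector = DrinkDetector()\n",
"print(\"Balanced accuracy with default hyperparameters:\",\n",
"      baseline_detector.score(X, y))"
]
},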
{
"cell_type": "markdown",
"id": "bf03ba41",
"metadata": {},
"source": [
"\n",
"
Hyperparameter Optimization\n",
"
\n",
" - We first performed a random grid search to find important subspaces. Random grid search was performed using
RandomizedSearchCV
in scikit-learn. All six hyperparameters were integer-valued and there was no conditional hyperparameter. In this search, hyperparameter values were selected randomly within a range of possible values. When possible, the range for a hyperparameter was determined based on subject matter knowledge. For instance, the range for minimal required width was based on the possible minimum number of hours that alcohol can be detected in TAC, or distance was based on the time difference between two consecutive drinking events. \n",
" - Finetuning was completed with
GridSearchCV
in scikit-learn. Finetuning was defined as changing the values of only one to two hyperparameters, while holding the values for other hyperparameter constant, to find the best value for the changing hyperparameter (this is also known as staged or sequential grid search). After identifying the most important subspace in random grid search (i.e., best estimator in random grid search), we performed finetuning to further improve the model performance. Finetuning encompassed the following steps. \n",
"
\n",
"
\n",
" - Set the hyperparameter values to that from the best set in the random grid search
\n",
" - Finetune a hyperparameter.
\n",
" - Update the best grid search model with the best performing value for the finetuned hyperparameter.
\n",
" - Repeat steps 2 and 3 for each hyperparameter in the following order:
window size (for median filter and moving average) → distance → minimum required prominence → wlen → minimum required width. \n",
"
\n",
"
"
]
},
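{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"The sampling distributions used in the random grid search below can be previewed directly (an illustrative addition; the seed is an assumption). With the `randint` and `beta` objects imported from `scipy.stats` above, `randint(a, b)` draws integers from `[a, b - 1]`, and `beta(a=2, b=2, loc=0, scale=21)` draws floats from `[0, 21]` centered around 10.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e5f7",
"metadata": {},
"outputs": [],
"source": [
"# Preview the search distributions (illustrative; seed is an assumption)\n",
"print(\"size samples:      \", randint(1, 542).rvs(5, random_state=45))\n",
"print(\"prominence samples:\", beta(a=2, b=2, loc=0, scale=21).rvs(5, random_state=45))"
]
},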
{
"cell_type": "code",
"execution_count": 9,
"id": "dbf59c28",
"metadata": {
"code_folding": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 50 candidates, totalling 250 fits\n",
"Best hyperparameters: {'DrinkDetector__distance': 594, 'DrinkDetector__prominence': 2.3074534860188396, 'DrinkDetector__size': 378, 'DrinkDetector__width': 16, 'DrinkDetector__window': 399, 'DrinkDetector__wlen': 2233}\n",
"Best score: 0.8359545390006131\n"
]
}
],
"source": [
"# Create a pipeline\n",
"pipeline = Pipeline([\n",
" ('DrinkDetector', DrinkDetector())\n",
"])\n",
"\n",
"# Set the hyperparameters distributions\n",
"params = {\n",
" # possible values for each each hyperparameter were randomly pooled from a discrete uniform distribution ranging from 1 to 541\n",
" 'DrinkDetector__size': randint(1, 542), \n",
" 'DrinkDetector__window': randint(1, 541), \n",
" 'DrinkDetector__distance': randint(180, 901),\n",
" 'DrinkDetector__prominence': beta(a=2, b=2, loc=0, scale=21),\n",
" 'DrinkDetector__wlen': randint(900, 4321),\n",
" 'DrinkDetector__width': randint(1, 541),\n",
"}\n",
"\n",
"# Create RandomizedSearchCV\n",
"gkf = GroupKFold(n_splits=5)\n",
"rand_search = RandomizedSearchCV(estimator=pipeline,\n",
" param_distributions=params,\n",
" n_iter=50,\n",
" scoring=None, # this would use the score method from the estimator\n",
" cv=gkf, verbose=1, n_jobs = n_jobs,\n",
" random_state=45)\n",
"\n",
"rand_search.fit(X=X, y=y, groups=groups)\n",
"\n",
"print(\"Best hyperparameters:\", rand_search.best_params_)\n",
"print(\"Best score:\", rand_search.best_score_)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "aacec3af",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_fit_time | \n",
" std_fit_time | \n",
" mean_score_time | \n",
" std_score_time | \n",
" param_DrinkDetector__distance | \n",
" param_DrinkDetector__prominence | \n",
" param_DrinkDetector__size | \n",
" param_DrinkDetector__width | \n",
" param_DrinkDetector__window | \n",
" param_DrinkDetector__wlen | \n",
" params | \n",
" split0_test_score | \n",
" split1_test_score | \n",
" split2_test_score | \n",
" split3_test_score | \n",
" split4_test_score | \n",
" mean_test_score | \n",
" std_test_score | \n",
" rank_test_score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.000763 | \n",
" 0.000943 | \n",
" 5.812212 | \n",
" 1.038460 | \n",
" 594 | \n",
" 2.307453 | \n",
" 378 | \n",
" 16 | \n",
" 399 | \n",
" 2233 | \n",
" {'DrinkDetector__distance': 594, 'DrinkDetecto... | \n",
" 0.805026 | \n",
" 0.818333 | \n",
" 0.837618 | \n",
" 0.865300 | \n",
" 0.853496 | \n",
" 0.835955 | \n",
" 0.022085 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 5.619211 | \n",
" 0.575981 | \n",
" 749 | \n",
" 13.280698 | \n",
" 487 | \n",
" 197 | \n",
" 111 | \n",
" 3586 | \n",
" {'DrinkDetector__distance': 749, 'DrinkDetecto... | \n",
" 0.761722 | \n",
" 0.794444 | \n",
" 0.822758 | \n",
" 0.695019 | \n",
" 0.853604 | \n",
" 0.785509 | \n",
" 0.054514 | \n",
" 25 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.003207 | \n",
" 0.006414 | \n",
" 5.445204 | \n",
" 0.607811 | \n",
" 338 | \n",
" 3.11376 | \n",
" 190 | \n",
" 208 | \n",
" 402 | \n",
" 2800 | \n",
" {'DrinkDetector__distance': 338, 'DrinkDetecto... | \n",
" 0.815442 | \n",
" 0.785000 | \n",
" 0.839881 | \n",
" 0.838360 | \n",
" 0.863464 | \n",
" 0.828429 | \n",
" 0.026503 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.000400 | \n",
" 0.000490 | \n",
" 4.533497 | \n",
" 0.581077 | \n",
" 734 | \n",
" 14.609955 | \n",
" 43 | \n",
" 506 | \n",
" 153 | \n",
" 2932 | \n",
" {'DrinkDetector__distance': 734, 'DrinkDetecto... | \n",
" 0.599611 | \n",
" 0.608889 | \n",
" 0.631839 | \n",
" 0.595053 | \n",
" 0.620838 | \n",
" 0.611246 | \n",
" 0.013559 | \n",
" 48 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 5.235737 | \n",
" 0.506070 | \n",
" 413 | \n",
" 15.516487 | \n",
" 204 | \n",
" 317 | \n",
" 311 | \n",
" 3940 | \n",
" {'DrinkDetector__distance': 413, 'DrinkDetecto... | \n",
" 0.751305 | \n",
" 0.794444 | \n",
" 0.823889 | \n",
" 0.753812 | \n",
" 0.811451 | \n",
" 0.786980 | \n",
" 0.029630 | \n",
" 24 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_fit_time std_fit_time mean_score_time std_score_time \\\n",
"0 0.000763 0.000943 5.812212 1.038460 \n",
"1 0.000000 0.000000 5.619211 0.575981 \n",
"2 0.003207 0.006414 5.445204 0.607811 \n",
"3 0.000400 0.000490 4.533497 0.581077 \n",
"4 0.000000 0.000000 5.235737 0.506070 \n",
"\n",
" param_DrinkDetector__distance param_DrinkDetector__prominence \\\n",
"0 594 2.307453 \n",
"1 749 13.280698 \n",
"2 338 3.11376 \n",
"3 734 14.609955 \n",
"4 413 15.516487 \n",
"\n",
" param_DrinkDetector__size param_DrinkDetector__width \\\n",
"0 378 16 \n",
"1 487 197 \n",
"2 190 208 \n",
"3 43 506 \n",
"4 204 317 \n",
"\n",
" param_DrinkDetector__window param_DrinkDetector__wlen \\\n",
"0 399 2233 \n",
"1 111 3586 \n",
"2 402 2800 \n",
"3 153 2932 \n",
"4 311 3940 \n",
"\n",
" params split0_test_score \\\n",
"0 {'DrinkDetector__distance': 594, 'DrinkDetecto... 0.805026 \n",
"1 {'DrinkDetector__distance': 749, 'DrinkDetecto... 0.761722 \n",
"2 {'DrinkDetector__distance': 338, 'DrinkDetecto... 0.815442 \n",
"3 {'DrinkDetector__distance': 734, 'DrinkDetecto... 0.599611 \n",
"4 {'DrinkDetector__distance': 413, 'DrinkDetecto... 0.751305 \n",
"\n",
" split1_test_score split2_test_score split3_test_score split4_test_score \\\n",
"0 0.818333 0.837618 0.865300 0.853496 \n",
"1 0.794444 0.822758 0.695019 0.853604 \n",
"2 0.785000 0.839881 0.838360 0.863464 \n",
"3 0.608889 0.631839 0.595053 0.620838 \n",
"4 0.794444 0.823889 0.753812 0.811451 \n",
"\n",
" mean_test_score std_test_score rank_test_score \n",
"0 0.835955 0.022085 1 \n",
"1 0.785509 0.054514 25 \n",
"2 0.828429 0.026503 2 \n",
"3 0.611246 0.013559 48 \n",
"4 0.786980 0.029630 24 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Cross-validation results in random grid search for procedures I and II (first five rows)\n",
"pd.DataFrame(rand_search.cv_results_).head()"
]
},
{
"cell_type": "markdown",
"id": "8a860533",
"metadata": {},
"source": [
"\n",
"We perform staged finetuning on the best_estimator_
from random grid search.\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "47b81976",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(steps=[('DrinkDetector',\n",
" DrinkDetector(distance=594, prominence=2.3074534860188396,\n",
" size=378, width=16, window=399, wlen=2233))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"Pipeline(steps=[('DrinkDetector',\n",
" DrinkDetector(distance=594, prominence=2.3074534860188396,\n",
" size=378, width=16, window=399, wlen=2233))])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random_search_best_estimator = rand_search.best_estimator_.fit(X)\n",
"random_search_best_estimator"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8edf3ecf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 9 candidates, totalling 45 fits\n",
"Fitting 5 folds for each of 3 candidates, totalling 15 fits\n",
"Fitting 5 folds for each of 4 candidates, totalling 20 fits\n",
"Fitting 5 folds for each of 3 candidates, totalling 15 fits\n",
"Fitting 5 folds for each of 3 candidates, totalling 15 fits\n",
"All fits = 110\n",
"Best hyperparameters after fine-tuning: Pipeline(steps=[('DrinkDetector',\n",
" DrinkDetector(distance=587, prominence=2.29, size=348,\n",
" width=12, window=349, wlen=2223))])\n",
"Best score: 0.8433704994653132\n"
]
}
],
"source": [
"params_stages = [\n",
" {'DrinkDetector__size': [348,363,378], 'DrinkDetector__window': [349,374,399]},\n",
" {'DrinkDetector__distance': [587,591,594]},\n",
" {'DrinkDetector__prominence': [2.29,2.30,2.31,2.32]},\n",
" {'DrinkDetector__wlen': [2223,2233,2243]},\n",
" {'DrinkDetector__width': [12,14,16]}\n",
"]\n",
"\n",
"gkf = GroupKFold(n_splits=5)\n",
"best_estimator = random_search_best_estimator\n",
"cv_results = []\n",
"total_fits = 0\n",
"for i, params in enumerate(params_stages):\n",
" fine_search = GridSearchCV(estimator=best_estimator,\n",
" param_grid=params, scoring=None,\n",
" cv=gkf, verbose=1, n_jobs=n_jobs, refit=True)\n",
" \n",
" num_candidates = len(list(ParameterGrid(params)))\n",
" total_fits += (num_candidates * gkf.n_splits)\n",
" \n",
" fine_search.fit(X=X, y=y, groups=groups)\n",
" best_estimator = fine_search.best_estimator_\n",
" best_score = fine_search.best_score_\n",
" cv_results.append(fine_search.cv_results_)\n",
" \n",
" if i==4:\n",
" print(\"All fits = \",total_fits)\n",
" print(\"Best hyperparameters after fine-tuning:\", best_estimator)\n",
" print(\"Best score:\", best_score)"
]
},
{
"cell_type": "markdown",
"id": "b920b9f9",
"metadata": {},
"source": [
"In random grid search best score was 0.836 and it improved and reached 0.843 after fine-tuning. \n",
"We store the output of best_estimator
and use this in procedure III.
"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "06a98790",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 duplicates were generated while calculating true pos. false neg.\r"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" true_label | \n",
" pred_label | \n",
" ema_n_drinks | \n",
" peak_maximum_TP | \n",
" peak_auc_TP | \n",
"
\n",
" \n",
" \n",
" \n",
" 198 | \n",
" 0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 199 | \n",
" 0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 200 | \n",
" 0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 201 | \n",
" 1 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 27.031862 | \n",
" 12866.220081 | \n",
"
\n",
" \n",
" 202 | \n",
" 1 | \n",
" 0.0 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" true_label pred_label ema_n_drinks peak_maximum_TP peak_auc_TP\n",
"198 0 0.0 NaN NaN NaN\n",
"199 0 0.0 NaN NaN NaN\n",
"200 0 0.0 NaN NaN NaN\n",
"201 1 1.0 2.0 27.031862 12866.220081\n",
"202 1 0.0 6.0 NaN NaN"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DrinkDetector_outputs = best_estimator.named_steps['DrinkDetector'].outputs(X, y)\n",
"DrinkDetector_outputs.iloc[198:,1:].head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "244420f0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" pred_label | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" true_label | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2064 | \n",
" 170 | \n",
"
\n",
" \n",
" 1 | \n",
" 47 | \n",
" 148 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"pred_label 0.0 1.0\n",
"true_label \n",
"0 2064 170\n",
"1 47 148"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(DrinkDetector_outputs.true_label,DrinkDetector_outputs.pred_label)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "03d016e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['procedure_InII.pkl']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Save best pipeline\n",
"best_pipeline = Pipeline([\n",
" ('DrinkDetector', DrinkDetector(distance=587, prominence=2.29, size=348, width=12, \n",
" window=349, wlen=2223))\n",
"])\n",
"\n",
"best_pipeline.fit(X, y)\n",
"joblib.dump(best_pipeline, 'procedure_InII.pkl')"
]
},
{
"cell_type": "markdown",
"id": "72f3f6a6",
"metadata": {},
"source": [
"## Procedure III"
]
},
{
"cell_type": "markdown",
"id": "4bb2da25",
"metadata": {},
"source": [
" This procedure was only conducted on true positives, that is cases where the model detected a peak and participant reported a drinking event. Here, we use
DrinkDetector_outputs
which was created in previous procedure.
\n",
"
Columns in DrinkDetector_outputs
\n",
"
\n",
" - participant_id
\n",
" - ema_n_drinks: Number of standard drinks consumed in a drinking event; participants reported this number with their EMA app. This is
y
in procedure III. \n",
" The following two are X
in procedure III.\n",
" - peak_maximum_TP: peak maximum (detected in procedures I and II)
\n",
" - peak_auc_TP: peak AUC (detected in procedures I and II)
\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "85cffa31",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There were a total of 148 true positice cases.\n"
]
}
],
"source": [
"# restrict the data to true positives (peak prop are only available for true positives)\n",
"DrinkDetector_outputs = DrinkDetector_outputs.query('true_label == 1 & pred_label == 1')\n",
"print(f\"There were a total of {DrinkDetector_outputs.shape[0]} true positice cases.\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "a2618c94",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# No missing data\n",
"DrinkDetector_outputs.isna().sum().sum()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "9695e33b",
"metadata": {},
"outputs": [],
"source": [
"# Set X and y \n",
"X = np.array(DrinkDetector_outputs[[\"peak_maximum_TP\",\"peak_auc_TP\"]])\n",
"y = np.array(DrinkDetector_outputs[\"ema_n_drinks\"])\n",
"groups = np.array(DrinkDetector_outputs[\"participant_id\"])"
]
},
{
"cell_type": "markdown",
"id": "03648937",
"metadata": {},
"source": [
"\n",
" Hyperparameter Optimization\n",
"\n",
"This was similar to that conducted in procedures I and II. \n",
" In random grid search, we find the best regression technique and its hyperparameters. Of note, regression technique was itself a hyperparameter here.\n",
" In finetunning, we tuned the hyperparameters of the best regression technique.\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ec22c915",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\nhkia\\anaconda3\\lib\\site-packages\\sklearn\\model_selection\\_search.py:305: UserWarning: The total space of parameters 4 is smaller than n_iter=50. Running 4 iterations. For exhaustive searches, use GridSearchCV.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best model: SVR\n",
"Best parameters: {'SVR__C': 0.6716456431507717, 'SVR__epsilon': 1.0648019005876224, 'SVR__gamma': 'scale', 'SVR__kernel': 'sigmoid', 'SVR__shrinking': False}\n",
"Best score: 2.201700608754006\n"
]
}
],
"source": [
"models = {\n",
" 'LinearRegression': {\n",
" 'model': LinearRegression(),\n",
" 'params': {\n",
" 'LinearRegression__fit_intercept': [True, False],\n",
" 'LinearRegression__positive': [True, False]\n",
" } \n",
" },\n",
" 'SGDRegressor': {\n",
" 'model': SGDRegressor(),\n",
" 'params': { \n",
" 'SGDRegressor__loss': ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],\n",
" 'SGDRegressor__penalty':[\"l2\",\"l1\",\"elasticnet\"],\n",
" 'SGDRegressor__alpha':loguniform(1e-4, 1e0),\n",
" 'SGDRegressor__fit_intercept': [True, False],\n",
" 'SGDRegressor__learning_rate':[\"constant\",\"optimal\",\"invscaling\",\"adaptive\"],\n",
" 'SGDRegressor__l1_ratio':uniform(0, 1),\n",
" 'SGDRegressor__max_iter': [1000, 5000, 10000],\n",
" 'SGDRegressor__tol': [1e-3, 1e-4, 1e-5]\n",
" }\n",
" },\n",
" 'SVR': {\n",
" 'model': SVR(),\n",
" 'params': {\n",
" 'SVR__kernel': ['rbf', 'sigmoid', 'linear'],\n",
" 'SVR__C': reciprocal(0.1, 10),\n",
" 'SVR__gamma': [\"scale\",\"auto\"],\n",
" 'SVR__epsilon':uniform(0.1,1),\n",
" 'SVR__shrinking':[True, False]\n",
" }\n",
" },\n",
" 'LGBMRegressor': {\n",
" 'model': LGBMRegressor(),\n",
" 'params': { \n",
" 'LGBMRegressor__boosting_type':[\"gbdt\",\"dart\",\"goss\"],\n",
" 'LGBMRegressor__num_leaves': randint(low = 1, high=100), \n",
" 'LGBMRegressor__max_depth': randint(low=-1, high=20), \n",
" 'LGBMRegressor__learning_rate': uniform(0.01,2), \n",
" 'LGBMRegressor__n_estimators': randint(low=1, high=200)\n",
" }\n",
" }\n",
"}\n",
"\n",
"def find_best_model(X, y, groups, models):\n",
" best_score = np.inf\n",
" best_model = None\n",
" best_params = None\n",
"\n",
" gkf = GroupKFold(n_splits=5)\n",
"\n",
" for model_name, model_info in models.items():\n",
" pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" (model_name, model_info['model'])\n",
" ])\n",
"\n",
" randomized_search = RandomizedSearchCV(\n",
" estimator = pipeline,\n",
" param_distributions = model_info['params'],\n",
" scoring = \"neg_mean_absolute_error\",\n",
" cv = gkf,\n",
" n_iter = 50,\n",
" random_state = 45, refit = False\n",
" )\n",
"\n",
" randomized_times = randomized_search.fit(X=X, y=y, groups=groups)\n",
"\n",
" if -randomized_search.best_score_ < best_score:\n",
" best_score = -randomized_search.best_score_\n",
" best_model = model_name\n",
" best_params = randomized_search.best_params_\n",
" cv_res = randomized_search.cv_results_\n",
" \n",
" return best_model, best_params, best_score, cv_res\n",
"\n",
"# Find the best model and hyperparameters\n",
"best_model, best_params, best_score, cv_res = find_best_model(X, y, groups, models)\n",
"print(f\"Best model: {best_model}\")\n",
"print(f\"Best parameters: {best_params}\")\n",
"print(f\"Best score: {best_score}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "1e4b3fdf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_fit_time | \n",
" std_fit_time | \n",
" mean_score_time | \n",
" std_score_time | \n",
" param_SVR__C | \n",
" param_SVR__epsilon | \n",
" param_SVR__gamma | \n",
" param_SVR__kernel | \n",
" param_SVR__shrinking | \n",
" params | \n",
" split0_test_score | \n",
" split1_test_score | \n",
" split2_test_score | \n",
" split3_test_score | \n",
" split4_test_score | \n",
" mean_test_score | \n",
" std_test_score | \n",
" rank_test_score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.006248 | \n",
" 0.007653 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 9.506552 | \n",
" 0.649545 | \n",
" scale | \n",
" sigmoid | \n",
" False | \n",
" {'SVR__C': 9.506551974984317, 'SVR__epsilon': ... | \n",
" -33.852418 | \n",
" -20.713941 | \n",
" -18.279950 | \n",
" -35.767277 | \n",
" -18.220137 | \n",
" -25.366745 | \n",
" 7.786180 | \n",
" 50 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.003123 | \n",
" 0.006247 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.774353 | \n",
" 0.572808 | \n",
" auto | \n",
" sigmoid | \n",
" True | \n",
" {'SVR__C': 0.7743530119612986, 'SVR__epsilon':... | \n",
" -1.890978 | \n",
" -2.152339 | \n",
" -1.450243 | \n",
" -4.037363 | \n",
" -1.727136 | \n",
" -2.251612 | \n",
" 0.921530 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.003121 | \n",
" 0.006243 | \n",
" 0.003128 | \n",
" 0.006256 | \n",
" 0.537941 | \n",
" 0.157238 | \n",
" auto | \n",
" linear | \n",
" True | \n",
" {'SVR__C': 0.5379407995493989, 'SVR__epsilon':... | \n",
" -2.410833 | \n",
" -2.436668 | \n",
" -1.498158 | \n",
" -3.597182 | \n",
" -1.888605 | \n",
" -2.366289 | \n",
" 0.707654 | \n",
" 9 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.003609 | \n",
" 0.003863 | \n",
" 0.000804 | \n",
" 0.000402 | \n",
" 1.996204 | \n",
" 1.090722 | \n",
" scale | \n",
" rbf | \n",
" False | \n",
" {'SVR__C': 1.9962036354162338, 'SVR__epsilon':... | \n",
" -1.935181 | \n",
" -2.266980 | \n",
" -1.962787 | \n",
" -3.998925 | \n",
" -1.896410 | \n",
" -2.412057 | \n",
" 0.804282 | \n",
" 27 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.001795 | \n",
" 0.001466 | \n",
" 0.000604 | \n",
" 0.000802 | \n",
" 4.351994 | \n",
" 0.34072 | \n",
" scale | \n",
" rbf | \n",
" True | \n",
" {'SVR__C': 4.351994273115294, 'SVR__epsilon': ... | \n",
" -2.005662 | \n",
" -2.448613 | \n",
" -2.035063 | \n",
" -3.826020 | \n",
" -1.822290 | \n",
" -2.427530 | \n",
" 0.728634 | \n",
" 34 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_fit_time std_fit_time mean_score_time std_score_time param_SVR__C \\\n",
"0 0.006248 0.007653 0.000000 0.000000 9.506552 \n",
"1 0.003123 0.006247 0.000000 0.000000 0.774353 \n",
"2 0.003121 0.006243 0.003128 0.006256 0.537941 \n",
"3 0.003609 0.003863 0.000804 0.000402 1.996204 \n",
"4 0.001795 0.001466 0.000604 0.000802 4.351994 \n",
"\n",
" param_SVR__epsilon param_SVR__gamma param_SVR__kernel param_SVR__shrinking \\\n",
"0 0.649545 scale sigmoid False \n",
"1 0.572808 auto sigmoid True \n",
"2 0.157238 auto linear True \n",
"3 1.090722 scale rbf False \n",
"4 0.34072 scale rbf True \n",
"\n",
" params split0_test_score \\\n",
"0 {'SVR__C': 9.506551974984317, 'SVR__epsilon': ... -33.852418 \n",
"1 {'SVR__C': 0.7743530119612986, 'SVR__epsilon':... -1.890978 \n",
"2 {'SVR__C': 0.5379407995493989, 'SVR__epsilon':... -2.410833 \n",
"3 {'SVR__C': 1.9962036354162338, 'SVR__epsilon':... -1.935181 \n",
"4 {'SVR__C': 4.351994273115294, 'SVR__epsilon': ... -2.005662 \n",
"\n",
" split1_test_score split2_test_score split3_test_score split4_test_score \\\n",
"0 -20.713941 -18.279950 -35.767277 -18.220137 \n",
"1 -2.152339 -1.450243 -4.037363 -1.727136 \n",
"2 -2.436668 -1.498158 -3.597182 -1.888605 \n",
"3 -2.266980 -1.962787 -3.998925 -1.896410 \n",
"4 -2.448613 -2.035063 -3.826020 -1.822290 \n",
"\n",
" mean_test_score std_test_score rank_test_score \n",
"0 -25.366745 7.786180 50 \n",
"1 -2.251612 0.921530 3 \n",
"2 -2.366289 0.707654 9 \n",
"3 -2.412057 0.804282 27 \n",
"4 -2.427530 0.728634 34 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Cross-validation results in random grid search for procedures III\n",
"pd.DataFrame(cv_res).head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "83522f47",
"metadata": {},
"outputs": [],
"source": [
"# Update pipeline based on results from random grid search\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('SVR', SVR(C = 0.672, epsilon = 1.065, gamma = 'scale', kernel = 'sigmoid',\n",
" shrinking = False))\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "5df93f05",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 32 candidates, totalling 160 fits\n",
"Best hyperparameters after fine-tuning: Pipeline(steps=[('scaler', StandardScaler()),\n",
" ('SVR', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])\n",
"Best score: -2.1887871151690304\n"
]
}
],
"source": [
"# Finetunning on best estimator from random grid search\n",
"reg_params = {\n",
" 'SVR__C': [0.067,0.672],\n",
" 'SVR__epsilon': [1.065, 0.065],\n",
" 'SVR__gamma': [\"scale\",\"auto\" ],\n",
" 'SVR__kernel': ['sigmoid', 'poly'],\n",
" 'SVR__shrinking': [True, False]\n",
"}\n",
"\n",
"gkf = GroupKFold(n_splits=5)\n",
"\n",
"fine_search = GridSearchCV(estimator=pipeline,\n",
" param_grid=reg_params, scoring=\"neg_mean_absolute_error\",\n",
" cv=gkf, verbose=1, n_jobs=n_jobs, refit=True)\n",
"\n",
"fine_search.fit(X = X, y = y, groups = groups)\n",
"best_estimator = fine_search.best_estimator_\n",
"best_score = fine_search.best_score_\n",
"cv_res_ = fine_search.cv_results_\n",
" \n",
"print(\"Best hyperparameters after fine-tuning:\", best_estimator)\n",
"print(\"Best score:\", best_score)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "d04f30bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['procedure_III.pkl']"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Save procedure III results\n",
"scaler = StandardScaler()\n",
"best_model = SVR(C=0.672, epsilon=0.065, gamma = 'scale', kernel='sigmoid', shrinking = True)\n",
"best_pipeline = Pipeline(steps=[('scaler', scaler), ('model', best_model)])\n",
"best_pipeline.fit(X, y)\n",
"\n",
"joblib.dump(best_pipeline, 'procedure_III.pkl')"
]
},
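{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"A minimal sketch (an illustrative addition, not part of the original workflow) of how the two saved pipelines chain together at inference time: Procedures I and II detect peaks and return `peak_maximum` and `peak_auc`, which Procedure III maps to a number of drinks. Here the Procedure III feature matrix `X` is reused as a stand-in for features extracted from new TAC data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f6a8",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative end-to-end use of the saved pipelines (sketch, assumed usage)\n",
"detector = joblib.load('procedure_InII.pkl')      # Procedures I and II\n",
"drink_counter = joblib.load('procedure_III.pkl')  # Procedure III\n",
"\n",
"# For new TAC data, detector.named_steps['DrinkDetector'].predict(new_X) returns\n",
"# columns participant_id, index_test (drinking start), right_bases, peak_maximum,\n",
"# peak_auc; the last two feed Procedure III. Here we reuse X as a stand-in.\n",
"print(\"Predicted number of drinks (first five):\",\n",
"      drink_counter.predict(X)[:5].round(2))"
]
},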
{
"cell_type": "code",
"execution_count": 27,
"id": "b040619e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(steps=[('scaler', StandardScaler()),\n",
" ('model', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"Pipeline(steps=[('scaler', StandardScaler()),\n",
" ('model', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"best_pipeline"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}