{ "cells": [ { "cell_type": "markdown", "id": "0281b33a", "metadata": {}, "source": [ "
\n", "
Development and validation of a model for measuring alcohol consumption from transdermal alcohol content data among college students
\n", " Programmer: Sina Kianersi
\n", " Email: skianersi@bwh.harvard.edu
\n", " Objectives: This notebook contains the source code for developing the model described in the manuscript.\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "id": "480fd13e", "metadata": {}, "outputs": [], "source": [ "# Import required packages\n", "import random \n", "import pandas as pd # version 1.4.4\n", "import numpy as np # version 1.21.5\n", "import os\n", "import datetime as dt \n", "from pytz import FixedOffset # version 2022.1\n", "import io\n", "\n", "from scipy.stats import randint, beta, uniform, reciprocal # version 1.9.1\n", "from scipy.signal import find_peaks \n", "from scipy.integrate import simps \n", "from scipy.ndimage import median_filter\n", "\n", "# Ignore pandas future warnings\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin # version 1.2.2\n", "from sklearn.model_selection import (RandomizedSearchCV, GridSearchCV, GroupKFold, ParameterGrid) \n", "from sklearn.metrics import (balanced_accuracy_score, mean_absolute_error) \n", "from sklearn.utils.fixes import loguniform \n", "from sklearn.preprocessing import StandardScaler \n", "from sklearn.linear_model import (LinearRegression, SGDRegressor)\n", "from sklearn.svm import SVR \n", "from sklearn.pipeline import Pipeline\n", "\n", "from lightgbm import LGBMRegressor # version 3.3.5\n", "import json \n", "import joblib # version 1.2.0\n", "\n", "import multiprocessing # version\n", "n_jobs = int(max(1, multiprocessing.cpu_count())/2)\n", "\n", "# set path to data\n", "data_path= # add data path" ] }, { "cell_type": "markdown", "id": "f6b9ef75", "metadata": {}, "source": [ "# Model Development" ] }, { "cell_type": "markdown", "id": "4dd8d1e9", "metadata": {}, "source": [ "__Figure 1. Summary of model development__\n", "\"Your\n" ] }, { "cell_type": "markdown", "id": "e5368115", "metadata": {}, "source": [ "## Procedures I and II" ] }, { "cell_type": "markdown", "id": "1f3bc23f", "metadata": {}, "source": [ "
\n", "\n", "Procedure I: TAC data processing\n", " \n", "\n", "Procedure II: Peak detection algorithm
\n", " \n", "\n", "Figure shows different peak properties.\n", " \n", "\n", "Procedures I and II outputs:
\n", "
    \n", "
  1. Left base is the first point in a peak’s timeseries, defined as the drinking start time in our study.\n", "
  2. Right base is the last point in a peak’s timeseries.\n", "
  3. Peak maximum is the maximum TAC value in a peak’s timeseries.\n", "
  4. We further calculated the area under the peak curve, which is the area between a peak’s left and right bases. In our study, we hypothesized that this value is correlated with the number of drinks consumed in a drinking event.\n", "
\n", "\n", "quotations are from SciPy peak detection algorithm documentations: link\n", "
" ] }, { "cell_type": "markdown", "id": "e72fa97c", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "b81acde0", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "id": "602fea02", "metadata": {}, "source": [ "
\n", " Input: \n", "
    \n", "
  1. TAC_df is a CSV dataset containing TAC signal data collected from all participants.\n", "
  2. ref_df is a CSV dataset containing EMA app data (the reference standard test). It has one row per participant per five-hour interval.\n", "
\n", " Outputs: left base, AUC, and peak maximum \n", "
\n", " train dataset: The train dataset, train_df was created using TAC_df and ref_df. Negative TAC values have already been coded as zeros in TAC_df dataset. \n", "train_df has the drinking event start times recorded in the EMA app (reference standard test) as well as TAC signals. Each row in this dataset represents a participant-five hour interval; train_df.shape is (2429, 6). Each participant has on average 29 five-hour intervals.\n", " Columns in train_df:\n", " \n", " The following two columns represent TAC signal and each include multiple data points stored as a list in each row of train_df. \n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 3, "id": "c28356b6", "metadata": { "code_folding": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TAC data shape: (1964713, 3)\n", "EMA app data shape: (2429, 5)\n" ] } ], "source": [ "# Load TAC data\n", "TAC_df=pd.read_csv(data_path+\"/04_Processed_data/Raw TAC_no negative value.csv\")\n", "TAC_df.datetime=pd.to_datetime(TAC_df.datetime)\n", "TAC_df=TAC_df[[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"]] \n", "TAC_df= TAC_df.sort_values(by=[\"participant_id\",\"datetime\"])\n", "print(\"TAC data shape:\", TAC_df.shape)\n", "\n", "# Load EMA app data (reference standard test)\n", "ref_df = pd.read_csv(data_path+\"/04_Processed_data/EMA app data ready for model development.csv\")\n", "ref_df[\"time\"]=pd.to_datetime(ref_df.time)\n", "ref_df.drinking_timestamp=pd.to_datetime(ref_df.drinking_timestamp)\n", "ref_df = ref_df[[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]]\n", "ref_df= ref_df.sort_values(by=[\"participant_id\",\"time\"])\n", "print(\"EMA app data shape:\",ref_df.shape)" ] }, { "cell_type": "code", "execution_count": 4, "id": "ec9a29e7", "metadata": {}, "outputs": [], "source": [ "# Here, we merge TAC_df and ref_df to make train_df.\n", "# First, create a temporary column in ref_df with the time interval upper bound\n", "ref_df['time_upper_bound'] = ref_df['time'] + pd.Timedelta(hours=5)\n", "\n", "# Merge the two DataFrames based on participant_id\n", "merged_df = pd.merge(TAC_df, ref_df, on='participant_id', how='right')\n", "\n", "# Filter rows within the 5-hour intervals\n", "merged_df = merged_df[(merged_df['datetime'] >= merged_df['time']) & (merged_df['datetime'] < merged_df['time_upper_bound'])]\n", "\n", "# Group the DataFrames by participant_id and time, and aggregate datetime and TAC ug/L(air) values in lists\n", "grouped_df = merged_df.groupby(['participant_id', 'time']).agg({'datetime': list, 'TAC ug/L(air)': list}).reset_index()\n", "\n", "# Merge the aggregated values back to ref_df\n", "train_df = pd.merge(ref_df, grouped_df, on=['participant_id', 'time'])\n", "\n", "# Drop the temporary column in ref_df\n", "train_df = train_df.drop(columns=['time_upper_bound'])" ] }, { "cell_type": "code", "execution_count": 5, "id": "22f46d03", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timedrinking_timestampdrinking_eventema_n_drinksdatetimeTAC ug/L(air)
02021-03-25 09:00:00-04:00NaT0NaN[2021-03-25 11:40:29-04:00, 2021-03-25 11:40:4...[26.35, 3.14, 0.0, 10.45, 13.17, 12.75, 9.83, ...
12021-03-25 14:00:00-04:00NaT0NaN[2021-03-25 14:00:14-04:00, 2021-03-25 14:00:3...[2.3, 2.3, 3.97, 2.09, 3.14, 2.72, 2.09, 2.51,...
22021-03-25 19:00:00-04:00NaT0NaN[2021-03-25 19:00:14-04:00, 2021-03-25 19:00:3...[0.84, 0.21, 0.63, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
32021-03-26 00:00:00-04:00NaT0NaN[2021-03-26 00:00:14-04:00, 2021-03-26 00:00:3...[12.13, 11.08, 12.34, 13.8, 14.01, 12.75, 12.7...
42021-03-26 05:00:00-04:00NaT0NaN[2021-03-26 05:00:14-04:00, 2021-03-26 05:00:3...[0.63, 0.63, 0.84, 0.63, 0.84, 0.63, 0.84, 1.2...
\n", "
" ], "text/plain": [ " time drinking_timestamp drinking_event ema_n_drinks \\\n", "0 2021-03-25 09:00:00-04:00 NaT 0 NaN \n", "1 2021-03-25 14:00:00-04:00 NaT 0 NaN \n", "2 2021-03-25 19:00:00-04:00 NaT 0 NaN \n", "3 2021-03-26 00:00:00-04:00 NaT 0 NaN \n", "4 2021-03-26 05:00:00-04:00 NaT 0 NaN \n", "\n", " datetime \\\n", "0 [2021-03-25 11:40:29-04:00, 2021-03-25 11:40:4... \n", "1 [2021-03-25 14:00:14-04:00, 2021-03-25 14:00:3... \n", "2 [2021-03-25 19:00:14-04:00, 2021-03-25 19:00:3... \n", "3 [2021-03-26 00:00:14-04:00, 2021-03-26 00:00:3... \n", "4 [2021-03-26 05:00:14-04:00, 2021-03-26 05:00:3... \n", "\n", " TAC ug/L(air) \n", "0 [26.35, 3.14, 0.0, 10.45, 13.17, 12.75, 9.83, ... \n", "1 [2.3, 2.3, 3.97, 2.09, 3.14, 2.72, 2.09, 2.51,... \n", "2 [0.84, 0.21, 0.63, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... \n", "3 [12.13, 11.08, 12.34, 13.8, 14.01, 12.75, 12.7... \n", "4 [0.63, 0.63, 0.84, 0.63, 0.84, 0.63, 0.84, 1.2... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.iloc[:,1:].head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "78ee3312", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2429.000000\n", "mean 808.856731\n", "std 216.738438\n", "min 1.000000\n", "25% 900.000000\n", "50% 900.000000\n", "75% 900.000000\n", "max 903.000000\n", "Name: TAC ug/L(air), dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of TAC values in each time interval\n", "# Descriptive results for this number were almost the same for 5-hour intervals with/without a drinking event start time\n", "train_df['TAC ug/L(air)'].apply(lambda x: len(x)).astype('int64').describe()" ] }, { "cell_type": "markdown", "id": "52ad1d7d", "metadata": {}, "source": [ "
\n", " Make an estimator to conduct procedures I and II, DrinkDetector. This estimator follows the scikit-learn API requirements (Ref: link). \n", "
" ] }, { "cell_type": "code", "execution_count": 7, "id": "67e492b2", "metadata": {}, "outputs": [], "source": [ "class DrinkDetector(BaseEstimator, ClassifierMixin, TransformerMixin):\n", " \"\"\"A scikit-learn compatible estimator that conducts procedures I and II of the model.\n", " \n", " Parameters\n", " ----------\n", " size: median_filter kernel size\n", " window: rolling average window size\n", " distance: peak minimum distance in peak detection algorithm\n", " prominence: peak minimum prominence\n", " wlen: window length\n", " width: peak minimum width\n", " \n", " Please see the following website for more details on the parameters: \n", " https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.median_filter.html\n", " https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html\n", " https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html\n", " \"\"\"\n", " \n", " def __init__(self, size=39, window=206, distance=265, prominence=8.0,\n", " wlen=2202, width=36):\n", " self.size = size\n", " self.window = window\n", " self.distance = distance\n", " self.prominence = prominence\n", " self.wlen = wlen\n", " self.width = width\n", " \n", " def fit(self, X, y=None):\n", " \"\"\"\n", " Parameters\n", " ----------\n", " X : TAC data collected with Skyn. X is a numpy array with three \n", " columns \"participant_id\",\"datetime\",\"TAC ug/L(air)\".\n", " \n", " y: EMA app/self-reported drinking event start times and number of drinkgs. y is a numpy array\n", " with columns for \"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"\n", " \"\"\"\n", " return self\n", " \n", " def predict(self, X):\n", " X=pd.DataFrame(X,columns=[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"])\n", "\n", " # Make X in long format with one row per participan TAC timestamp \n", " data = [{'participant_id': row.participant_id, 'datetime': dt, 'TAC ug/L(air)': tac}\n", " for row in X.itertuples() for dt, tac in zip(row.datetime, row._3)]\n", " X = pd.DataFrame(data)\n", " X.index=X.datetime\n", " X = X[[\"participant_id\",\"TAC ug/L(air)\"]]\n", " \n", " y_pred = pd.DataFrame(columns=[\"participant_id\",\"index_test\",\"right_bases\",\"peak_maximum\",\"peak_auc\"]) \n", " for participant_id in X.iloc[:,0].unique():\n", " # Get data for each participant.\n", " p_data = X.query('participant_id == @participant_id').copy() # participant level TAC data\n", " \n", " # PROCEDURE I (SIGNAL FILTERING)\n", " ## Median filter\n", " p_data[\"medfilt\"] = median_filter(p_data[\"TAC ug/L(air)\"], size=self.size)\n", "\n", " ## Moving average on median filter\n", " p_data[\"medfilt_rolling\"] = p_data[\"medfilt\"].rolling(window=self.window).mean() \n", " ## Remove missing values created in procedure I\n", " p_data = p_data.loc[p_data[\"medfilt_rolling\"].notna()]\n", "\n", " # PROCEDURE II (PEAK DETECTION)\n", " peaks, properties = find_peaks(p_data[\"medfilt_rolling\"].values, \n", " distance = self.distance, \n", " prominence = (self.prominence, None),\n", " wlen = self.wlen,\n", " width = (self.width,None))\n", " \n", " ## store peak properties: Left bases are the index test, model detected drinking event start times\n", " index_test = p_data[\"medfilt_rolling\"].index[properties[\"left_bases\"]]\n", " ## get right bases of detected peak too\n", " rbs = p_data[\"medfilt_rolling\"].index[properties[\"right_bases\"]]\n", " max_peak = p_data.loc[p_data.index.isin(p_data[\"medfilt_rolling\"].index[peaks]),\"medfilt_rolling\"].values \n", " 
dict_df = {'participant_id': participant_id, 'index_test':index_test,'right_bases':rbs, 'peak_maximum':max_peak} \n", " peaks_props = pd.DataFrame(dict_df) \n", " ## AUC: we calculate area under the peak curve and add it to peaks_props\n", " peaks_props[\"peak_auc\"] = [simps(dx=1, y=p_data[it:rb].medfilt_rolling)\n", " for it, rb in zip(peaks_props.index_test, peaks_props.right_bases)]\n", " y_pred = pd.concat([y_pred, peaks_props])\n", " \n", " y_pred = y_pred.values\n", " return y_pred\n", " \n", " def main_analysis(self,X, y=None):\n", " y=pd.DataFrame(y,columns=[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]) \n", " y.participant_id=y.participant_id.astype(\"int64\")\n", " y.drinking_event=y.drinking_event.astype(\"int64\")\n", " y.ema_n_drinks=y.ema_n_drinks.astype(float)\n", " y.sort_values(by=\"drinking_timestamp\",inplace=True) # EMA app recorded drinking start time\n", " \n", " \n", " y_pred = self.predict(X)\n", " y_pred = pd.DataFrame(data=y_pred, columns=[\"participant_id\",\"index_test\",\"right_bases\",\"peak_maximum\",\"peak_auc\"])\n", " y_pred['index_test'] = pd.to_datetime(y_pred['index_test'], \n", " utc=True).dt.tz_convert(FixedOffset(-240))\n", "\n", " y_pred[\"peak_id\"]=np.arange(0,y_pred.shape[0]) # make peak id in y_pred\n", " y_pred.sort_values(by=\"index_test\",inplace=True)\n", " y_pred.participant_id=y_pred.participant_id.astype(\"int64\")\n", " \n", " \n", " # make a df for true pos. and false neg.\n", " TP_FN=pd.merge_asof(left=y[y.drinking_event ==1 ], right=y_pred,\n", " by=\"participant_id\", left_on=\"drinking_timestamp\", right_on=\"index_test\",\n", " allow_exact_matches=True, \n", " direction=\"nearest\", \n", " tolerance=pd.Timedelta(\"5h\")) \n", " \n", " TP_FN[\"time_difference\"]=abs(TP_FN.drinking_timestamp-TP_FN.index_test)\n", " # for the duplicates, keep the ones with smaller time_difference values\n", " # To do that, first find duplicate peak_ids with larger time difference\n", " code_nan=TP_FN.loc[(TP_FN.peak_id.notna())&(TP_FN.duplicated(subset=[\"peak_id\"],keep=False))\n", " ].groupby(by=\"index_test\").max()\n", " \n", " if code_nan.shape[0] > 0:\n", " print(code_nan.shape[0],\"duplicates were generated while calculating true pos. false neg.\",end='\\r')\n", " # Set them to nan:\n", " TP_FN.loc[(TP_FN.peak_id.isin(code_nan.peak_id))&\n", " (TP_FN.time_difference.isin(code_nan.time_difference)),\n", " ['index_test','peak_id']]=np.nan\n", " # drop time_difference\n", " TP_FN.drop(columns=[\"time_difference\"],inplace=True) \n", " \n", " # Next, Make a df for true neg and concat it to TP_FN\n", " TN=y.loc[y.drinking_timestamp.isna()]\n", " TP_FN_TN=pd.concat([TP_FN,TN])\n", " \n", " # Next, Make a df for false pos. 
(these are detected peaks not in TP_FN)\n", " FP=y_pred.loc[~y_pred.peak_id.isin(TP_FN.peak_id)]\n", " FP=FP.rename(columns={\"index_test\":\"false_p\"})\n", "\n", " # merge FP to TP_FN_TN: If more than 1 FP in a 5-hour interval, just one of them is counted\n", " TP_FN_TN.sort_values(by=\"time\",inplace=True)\n", " FP.sort_values(by=\"false_p\",inplace=True)\n", " TP_FN_TN_FP=pd.merge_asof(left=TP_FN_TN, right=FP, suffixes=('_TP', '_FP'), \n", " by=\"participant_id\", left_on=\"time\", right_on=\"false_p\",\n", " allow_exact_matches=True, direction=\"forward\",\n", " tolerance=pd.Timedelta(\"5h\"))\n", " \n", " # Make a true_label and pred_label for scoring\n", " TP_FN_TN_FP.rename(columns={\"drinking_event\":\"true_label\"},inplace=True) \n", " TP = TP_FN_TN_FP.query('index_test.notnull() &'\n", " 'drinking_timestamp.notnull()').index\n", " FN = TP_FN_TN_FP.query('index_test.isnull() &'\n", " 'drinking_timestamp.notnull()').index\n", " FP = TP_FN_TN_FP.query('false_p.notnull() &'\n", " 'index_test.isnull()').index\n", " TN = TP_FN_TN_FP.query('index_test.isnull() &'\n", " 'drinking_timestamp.isnull() &'\n", " 'false_p.isnull()').index\n", " \n", " TP_FN_TN_FP.loc[TP,\"pred_label\"]=1 # True pos \n", " TP_FN_TN_FP.loc[FN,\"pred_label\"]=0 # False neg\n", " # If there is more than one peak in a drinking interval, \n", " # it is counted as one true positive\n", " TP_FN_TN_FP.loc[FP,\"pred_label\"]=1 # False pos\n", " TP_FN_TN_FP.loc[TN,\"pred_label\"]=0 # True neg.\n", "\n", " # Compute the balanced accuracy score from the binary labels and return it\n", " score = balanced_accuracy_score(TP_FN_TN_FP['true_label'], TP_FN_TN_FP['pred_label'])\n", " \n", " return score, TP_FN_TN_FP\n", " \n", " def score(self,X, y=None):\n", " score, _ = self.main_analysis(X, y)\n", " return score\n", " \n", " def outputs(self,X, y=None):\n", " _, outputs = self.main_analysis(X, y)\n", " outputs = outputs[[\"participant_id\",\"true_label\",\"pred_label\",\n", " \"ema_n_drinks\",\"peak_maximum_TP\",\"peak_auc_TP\"]]\n", " return outputs" ] }, { "cell_type": "code", "execution_count": 8, "id": "d33bbb47", "metadata": {}, "outputs": [], "source": [ "# Set up X, y, and groups\n", "X = np.array(train_df[[\"participant_id\",\"datetime\",\"TAC ug/L(air)\"]])\n", "y = np.array(train_df[[\"participant_id\",\"time\",\"drinking_timestamp\",\"drinking_event\",\"ema_n_drinks\"]])\n", "groups = np.array(train_df[\"participant_id\"])" ] }, { "cell_type": "markdown", "id": "bf03ba41", "metadata": {}, "source": [ "
\n", " Hyperparameter Optimization\n", " \n", "
    \n", "
  1. Set the hyperparameter values to those from the best set found in the random grid search.\n", "
  2. Fine-tune one hyperparameter.\n", "
  3. Update the best grid search model with the best-performing value for the fine-tuned hyperparameter.\n", "
  4. Repeat steps 2 and 3 for each hyperparameter in the following order:\n", "    window size (for median filter and moving average) → distance → minimum required prominence → wlen → minimum required width.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "id": "dbf59c28", "metadata": { "code_folding": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 50 candidates, totalling 250 fits\n", "Best hyperparameters: {'DrinkDetector__distance': 594, 'DrinkDetector__prominence': 2.3074534860188396, 'DrinkDetector__size': 378, 'DrinkDetector__width': 16, 'DrinkDetector__window': 399, 'DrinkDetector__wlen': 2233}\n", "Best score: 0.8359545390006131\n" ] } ], "source": [ "# Create a pipeline\n", "pipeline = Pipeline([\n", " ('DrinkDetector', DrinkDetector())\n", "])\n", "\n", "# Set the hyperparameters distributions\n", "params = {\n", " # possible values for each each hyperparameter were randomly pooled from a discrete uniform distribution ranging from 1 to 541\n", " 'DrinkDetector__size': randint(1, 542), \n", " 'DrinkDetector__window': randint(1, 541), \n", " 'DrinkDetector__distance': randint(180, 901),\n", " 'DrinkDetector__prominence': beta(a=2, b=2, loc=0, scale=21),\n", " 'DrinkDetector__wlen': randint(900, 4321),\n", " 'DrinkDetector__width': randint(1, 541),\n", "}\n", "\n", "# Create RandomizedSearchCV\n", "gkf = GroupKFold(n_splits=5)\n", "rand_search = RandomizedSearchCV(estimator=pipeline,\n", " param_distributions=params,\n", " n_iter=50,\n", " scoring=None, # this would use the score method from the estimator\n", " cv=gkf, verbose=1, n_jobs = n_jobs,\n", " random_state=45)\n", "\n", "rand_search.fit(X=X, y=y, groups=groups)\n", "\n", "print(\"Best hyperparameters:\", rand_search.best_params_)\n", "print(\"Best score:\", rand_search.best_score_)" ] }, { "cell_type": "code", "execution_count": 10, "id": "aacec3af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_fit_timestd_fit_timemean_score_timestd_score_timeparam_DrinkDetector__distanceparam_DrinkDetector__prominenceparam_DrinkDetector__sizeparam_DrinkDetector__widthparam_DrinkDetector__windowparam_DrinkDetector__wlenparamssplit0_test_scoresplit1_test_scoresplit2_test_scoresplit3_test_scoresplit4_test_scoremean_test_scorestd_test_scorerank_test_score
00.0007630.0009435.8122121.0384605942.307453378163992233{'DrinkDetector__distance': 594, 'DrinkDetecto...0.8050260.8183330.8376180.8653000.8534960.8359550.0220851
10.0000000.0000005.6192110.57598174913.2806984871971113586{'DrinkDetector__distance': 749, 'DrinkDetecto...0.7617220.7944440.8227580.6950190.8536040.7855090.05451425
20.0032070.0064145.4452040.6078113383.113761902084022800{'DrinkDetector__distance': 338, 'DrinkDetecto...0.8154420.7850000.8398810.8383600.8634640.8284290.0265032
30.0004000.0004904.5334970.58107773414.609955435061532932{'DrinkDetector__distance': 734, 'DrinkDetecto...0.5996110.6088890.6318390.5950530.6208380.6112460.01355948
40.0000000.0000005.2357370.50607041315.5164872043173113940{'DrinkDetector__distance': 413, 'DrinkDetecto...0.7513050.7944440.8238890.7538120.8114510.7869800.02963024
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "0 0.000763 0.000943 5.812212 1.038460 \n", "1 0.000000 0.000000 5.619211 0.575981 \n", "2 0.003207 0.006414 5.445204 0.607811 \n", "3 0.000400 0.000490 4.533497 0.581077 \n", "4 0.000000 0.000000 5.235737 0.506070 \n", "\n", " param_DrinkDetector__distance param_DrinkDetector__prominence \\\n", "0 594 2.307453 \n", "1 749 13.280698 \n", "2 338 3.11376 \n", "3 734 14.609955 \n", "4 413 15.516487 \n", "\n", " param_DrinkDetector__size param_DrinkDetector__width \\\n", "0 378 16 \n", "1 487 197 \n", "2 190 208 \n", "3 43 506 \n", "4 204 317 \n", "\n", " param_DrinkDetector__window param_DrinkDetector__wlen \\\n", "0 399 2233 \n", "1 111 3586 \n", "2 402 2800 \n", "3 153 2932 \n", "4 311 3940 \n", "\n", " params split0_test_score \\\n", "0 {'DrinkDetector__distance': 594, 'DrinkDetecto... 0.805026 \n", "1 {'DrinkDetector__distance': 749, 'DrinkDetecto... 0.761722 \n", "2 {'DrinkDetector__distance': 338, 'DrinkDetecto... 0.815442 \n", "3 {'DrinkDetector__distance': 734, 'DrinkDetecto... 0.599611 \n", "4 {'DrinkDetector__distance': 413, 'DrinkDetecto... 0.751305 \n", "\n", " split1_test_score split2_test_score split3_test_score split4_test_score \\\n", "0 0.818333 0.837618 0.865300 0.853496 \n", "1 0.794444 0.822758 0.695019 0.853604 \n", "2 0.785000 0.839881 0.838360 0.863464 \n", "3 0.608889 0.631839 0.595053 0.620838 \n", "4 0.794444 0.823889 0.753812 0.811451 \n", "\n", " mean_test_score std_test_score rank_test_score \n", "0 0.835955 0.022085 1 \n", "1 0.785509 0.054514 25 \n", "2 0.828429 0.026503 2 \n", "3 0.611246 0.013559 48 \n", "4 0.786980 0.029630 24 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cross-validation results in random grid search for procedures I and II (first five rows)\n", "pd.DataFrame(rand_search.cv_results_).head()" ] }, { "cell_type": "markdown", "id": "8a860533", "metadata": {}, "source": [ "
\n", "We perform staged finetuning on the best_estimator_ from random grid search.\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "id": "47b81976", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('DrinkDetector',\n",
       "                 DrinkDetector(distance=594, prominence=2.3074534860188396,\n",
       "                               size=378, width=16, window=399, wlen=2233))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "Pipeline(steps=[('DrinkDetector',\n", " DrinkDetector(distance=594, prominence=2.3074534860188396,\n", " size=378, width=16, window=399, wlen=2233))])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search_best_estimator = rand_search.best_estimator_.fit(X)\n", "random_search_best_estimator" ] }, { "cell_type": "code", "execution_count": 12, "id": "8edf3ecf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 9 candidates, totalling 45 fits\n", "Fitting 5 folds for each of 3 candidates, totalling 15 fits\n", "Fitting 5 folds for each of 4 candidates, totalling 20 fits\n", "Fitting 5 folds for each of 3 candidates, totalling 15 fits\n", "Fitting 5 folds for each of 3 candidates, totalling 15 fits\n", "All fits = 110\n", "Best hyperparameters after fine-tuning: Pipeline(steps=[('DrinkDetector',\n", " DrinkDetector(distance=587, prominence=2.29, size=348,\n", " width=12, window=349, wlen=2223))])\n", "Best score: 0.8433704994653132\n" ] } ], "source": [ "params_stages = [\n", " {'DrinkDetector__size': [348,363,378], 'DrinkDetector__window': [349,374,399]},\n", " {'DrinkDetector__distance': [587,591,594]},\n", " {'DrinkDetector__prominence': [2.29,2.30,2.31,2.32]},\n", " {'DrinkDetector__wlen': [2223,2233,2243]},\n", " {'DrinkDetector__width': [12,14,16]}\n", "]\n", "\n", "gkf = GroupKFold(n_splits=5)\n", "best_estimator = random_search_best_estimator\n", "cv_results = []\n", "total_fits = 0\n", "for i, params in enumerate(params_stages):\n", " fine_search = GridSearchCV(estimator=best_estimator,\n", " param_grid=params, scoring=None,\n", " cv=gkf, verbose=1, n_jobs=n_jobs, refit=True)\n", " \n", " num_candidates = len(list(ParameterGrid(params)))\n", " total_fits += (num_candidates * gkf.n_splits)\n", " \n", " fine_search.fit(X=X, y=y, groups=groups)\n", " best_estimator = fine_search.best_estimator_\n", " best_score = fine_search.best_score_\n", " cv_results.append(fine_search.cv_results_)\n", " \n", " if i==4:\n", " print(\"All fits = \",total_fits)\n", " print(\"Best hyperparameters after fine-tuning:\", best_estimator)\n", " print(\"Best score:\", best_score)" ] }, { "cell_type": "markdown", "id": "b920b9f9", "metadata": {}, "source": [ "
In the random grid search, the best score was 0.836; after fine-tuning, it improved to 0.843. \n", "We store the outputs of best_estimator and use them in procedure III.
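\n", "The score returned by DrinkDetector is the balanced accuracy, that is, the average of sensitivity and specificity over the 5-hour intervals; a quick sketch of the metric:\n", "\n", "```python\n", "from sklearn.metrics import balanced_accuracy_score\n", "\n", "# Balanced accuracy = (sensitivity + specificity) / 2\n", "balanced_accuracy_score([0, 0, 1, 1], [0, 1, 1, 1])  # (1/2 + 2/2) / 2 = 0.75\n", "```\n", "For the confusion matrix shown below (2064, 170, 47, 148), this works out to (2064/2234 + 148/195)/2, about 0.84.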
" ] }, { "cell_type": "code", "execution_count": 14, "id": "06a98790", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 duplicates were generated while calculating true pos. false neg.\r" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
true_labelpred_labelema_n_drinkspeak_maximum_TPpeak_auc_TP
19800.0NaNNaNNaN
19900.0NaNNaNNaN
20000.0NaNNaNNaN
20111.02.027.03186212866.220081
20210.06.0NaNNaN
\n", "
" ], "text/plain": [ " true_label pred_label ema_n_drinks peak_maximum_TP peak_auc_TP\n", "198 0 0.0 NaN NaN NaN\n", "199 0 0.0 NaN NaN NaN\n", "200 0 0.0 NaN NaN NaN\n", "201 1 1.0 2.0 27.031862 12866.220081\n", "202 1 0.0 6.0 NaN NaN" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DrinkDetector_outputs = best_estimator.named_steps['DrinkDetector'].outputs(X, y)\n", "DrinkDetector_outputs.iloc[198:,1:].head()" ] }, { "cell_type": "code", "execution_count": 15, "id": "244420f0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pred_label0.01.0
true_label
02064170
147148
\n", "
" ], "text/plain": [ "pred_label 0.0 1.0\n", "true_label \n", "0 2064 170\n", "1 47 148" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(DrinkDetector_outputs.true_label,DrinkDetector_outputs.pred_label)" ] }, { "cell_type": "code", "execution_count": 16, "id": "03d016e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['procedure_InII.pkl']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save best pipeline\n", "best_pipeline = Pipeline([\n", " ('DrinkDetector', DrinkDetector(distance=587, prominence=2.29, size=348, width=12, \n", " window=349, wlen=2223))\n", "])\n", "\n", "best_pipeline.fit(X, y)\n", "joblib.dump(best_pipeline, 'procedure_InII.pkl')" ] }, { "cell_type": "markdown", "id": "72f3f6a6", "metadata": {}, "source": [ "## Procedure III" ] }, { "cell_type": "markdown", "id": "4bb2da25", "metadata": {}, "source": [ "
This procedure was conducted only on true positives, that is, cases where the model detected a peak and the participant reported a drinking event. Here, we use DrinkDetector_outputs, which was created in the previous procedure.\n", "Columns in DrinkDetector_outputs: participant_id, true_label, pred_label, ema_n_drinks, peak_maximum_TP, and peak_auc_TP.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 17, "id": "85cffa31", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There were a total of 148 true positice cases.\n" ] } ], "source": [ "# restrict the data to true positives (peak prop are only available for true positives)\n", "DrinkDetector_outputs = DrinkDetector_outputs.query('true_label == 1 & pred_label == 1')\n", "print(f\"There were a total of {DrinkDetector_outputs.shape[0]} true positice cases.\")" ] }, { "cell_type": "code", "execution_count": 18, "id": "a2618c94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# No missing data\n", "DrinkDetector_outputs.isna().sum().sum()" ] }, { "cell_type": "code", "execution_count": 19, "id": "9695e33b", "metadata": {}, "outputs": [], "source": [ "# Set X and y \n", "X = np.array(DrinkDetector_outputs[[\"peak_maximum_TP\",\"peak_auc_TP\"]])\n", "y = np.array(DrinkDetector_outputs[\"ema_n_drinks\"])\n", "groups = np.array(DrinkDetector_outputs[\"participant_id\"])" ] }, { "cell_type": "markdown", "id": "03648937", "metadata": {}, "source": [ "
\n", " Hyperparameter Optimization\n", "\n", "This was similar to that conducted in procedures I and II. \n", " In random grid search, we find the best regression technique and its hyperparameters. Of note, regression technique was itself a hyperparameter here.\n", " In finetunning, we tuned the hyperparameters of the best regression technique.\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "id": "ec22c915", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\nhkia\\anaconda3\\lib\\site-packages\\sklearn\\model_selection\\_search.py:305: UserWarning: The total space of parameters 4 is smaller than n_iter=50. Running 4 iterations. For exhaustive searches, use GridSearchCV.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best model: SVR\n", "Best parameters: {'SVR__C': 0.6716456431507717, 'SVR__epsilon': 1.0648019005876224, 'SVR__gamma': 'scale', 'SVR__kernel': 'sigmoid', 'SVR__shrinking': False}\n", "Best score: 2.201700608754006\n" ] } ], "source": [ "models = {\n", " 'LinearRegression': {\n", " 'model': LinearRegression(),\n", " 'params': {\n", " 'LinearRegression__fit_intercept': [True, False],\n", " 'LinearRegression__positive': [True, False]\n", " } \n", " },\n", " 'SGDRegressor': {\n", " 'model': SGDRegressor(),\n", " 'params': { \n", " 'SGDRegressor__loss': ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],\n", " 'SGDRegressor__penalty':[\"l2\",\"l1\",\"elasticnet\"],\n", " 'SGDRegressor__alpha':loguniform(1e-4, 1e0),\n", " 'SGDRegressor__fit_intercept': [True, False],\n", " 'SGDRegressor__learning_rate':[\"constant\",\"optimal\",\"invscaling\",\"adaptive\"],\n", " 'SGDRegressor__l1_ratio':uniform(0, 1),\n", " 'SGDRegressor__max_iter': [1000, 5000, 10000],\n", " 'SGDRegressor__tol': [1e-3, 1e-4, 1e-5]\n", " }\n", " },\n", " 'SVR': {\n", " 'model': SVR(),\n", " 'params': {\n", " 'SVR__kernel': ['rbf', 'sigmoid', 'linear'],\n", " 'SVR__C': reciprocal(0.1, 10),\n", " 'SVR__gamma': [\"scale\",\"auto\"],\n", " 'SVR__epsilon':uniform(0.1,1),\n", " 'SVR__shrinking':[True, False]\n", " }\n", " },\n", " 'LGBMRegressor': {\n", " 'model': LGBMRegressor(),\n", " 'params': { \n", " 'LGBMRegressor__boosting_type':[\"gbdt\",\"dart\",\"goss\"],\n", " 'LGBMRegressor__num_leaves': randint(low = 1, high=100), \n", " 'LGBMRegressor__max_depth': randint(low=-1, high=20), \n", " 'LGBMRegressor__learning_rate': uniform(0.01,2), \n", " 'LGBMRegressor__n_estimators': randint(low=1, high=200)\n", " }\n", " }\n", "}\n", "\n", "def find_best_model(X, y, groups, models):\n", " best_score = np.inf\n", " best_model = None\n", " best_params = None\n", "\n", " gkf = GroupKFold(n_splits=5)\n", "\n", " for model_name, model_info in models.items():\n", " pipeline = Pipeline([\n", " ('scaler', StandardScaler()),\n", " (model_name, model_info['model'])\n", " ])\n", "\n", " randomized_search = RandomizedSearchCV(\n", " estimator = pipeline,\n", " param_distributions = model_info['params'],\n", " scoring = \"neg_mean_absolute_error\",\n", " cv = gkf,\n", " n_iter = 50,\n", " random_state = 45, refit = False\n", " )\n", "\n", " randomized_times = randomized_search.fit(X=X, y=y, groups=groups)\n", "\n", " if -randomized_search.best_score_ < best_score:\n", " best_score = -randomized_search.best_score_\n", " best_model = model_name\n", " best_params = randomized_search.best_params_\n", " cv_res = randomized_search.cv_results_\n", " \n", " return best_model, best_params, best_score, cv_res\n", "\n", "# Find the best model and hyperparameters\n", "best_model, best_params, best_score, cv_res = find_best_model(X, y, groups, models)\n", "print(f\"Best model: {best_model}\")\n", "print(f\"Best parameters: {best_params}\")\n", "print(f\"Best score: {best_score}\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "1e4b3fdf", "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_fit_timestd_fit_timemean_score_timestd_score_timeparam_SVR__Cparam_SVR__epsilonparam_SVR__gammaparam_SVR__kernelparam_SVR__shrinkingparamssplit0_test_scoresplit1_test_scoresplit2_test_scoresplit3_test_scoresplit4_test_scoremean_test_scorestd_test_scorerank_test_score
00.0062480.0076530.0000000.0000009.5065520.649545scalesigmoidFalse{'SVR__C': 9.506551974984317, 'SVR__epsilon': ...-33.852418-20.713941-18.279950-35.767277-18.220137-25.3667457.78618050
10.0031230.0062470.0000000.0000000.7743530.572808autosigmoidTrue{'SVR__C': 0.7743530119612986, 'SVR__epsilon':...-1.890978-2.152339-1.450243-4.037363-1.727136-2.2516120.9215303
20.0031210.0062430.0031280.0062560.5379410.157238autolinearTrue{'SVR__C': 0.5379407995493989, 'SVR__epsilon':...-2.410833-2.436668-1.498158-3.597182-1.888605-2.3662890.7076549
30.0036090.0038630.0008040.0004021.9962041.090722scalerbfFalse{'SVR__C': 1.9962036354162338, 'SVR__epsilon':...-1.935181-2.266980-1.962787-3.998925-1.896410-2.4120570.80428227
40.0017950.0014660.0006040.0008024.3519940.34072scalerbfTrue{'SVR__C': 4.351994273115294, 'SVR__epsilon': ...-2.005662-2.448613-2.035063-3.826020-1.822290-2.4275300.72863434
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time param_SVR__C \\\n", "0 0.006248 0.007653 0.000000 0.000000 9.506552 \n", "1 0.003123 0.006247 0.000000 0.000000 0.774353 \n", "2 0.003121 0.006243 0.003128 0.006256 0.537941 \n", "3 0.003609 0.003863 0.000804 0.000402 1.996204 \n", "4 0.001795 0.001466 0.000604 0.000802 4.351994 \n", "\n", " param_SVR__epsilon param_SVR__gamma param_SVR__kernel param_SVR__shrinking \\\n", "0 0.649545 scale sigmoid False \n", "1 0.572808 auto sigmoid True \n", "2 0.157238 auto linear True \n", "3 1.090722 scale rbf False \n", "4 0.34072 scale rbf True \n", "\n", " params split0_test_score \\\n", "0 {'SVR__C': 9.506551974984317, 'SVR__epsilon': ... -33.852418 \n", "1 {'SVR__C': 0.7743530119612986, 'SVR__epsilon':... -1.890978 \n", "2 {'SVR__C': 0.5379407995493989, 'SVR__epsilon':... -2.410833 \n", "3 {'SVR__C': 1.9962036354162338, 'SVR__epsilon':... -1.935181 \n", "4 {'SVR__C': 4.351994273115294, 'SVR__epsilon': ... -2.005662 \n", "\n", " split1_test_score split2_test_score split3_test_score split4_test_score \\\n", "0 -20.713941 -18.279950 -35.767277 -18.220137 \n", "1 -2.152339 -1.450243 -4.037363 -1.727136 \n", "2 -2.436668 -1.498158 -3.597182 -1.888605 \n", "3 -2.266980 -1.962787 -3.998925 -1.896410 \n", "4 -2.448613 -2.035063 -3.826020 -1.822290 \n", "\n", " mean_test_score std_test_score rank_test_score \n", "0 -25.366745 7.786180 50 \n", "1 -2.251612 0.921530 3 \n", "2 -2.366289 0.707654 9 \n", "3 -2.412057 0.804282 27 \n", "4 -2.427530 0.728634 34 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cross-validation results in random grid search for procedures III\n", "pd.DataFrame(cv_res).head()" ] }, { "cell_type": "code", "execution_count": 22, "id": "83522f47", "metadata": {}, "outputs": [], "source": [ "# Update pipeline based on results from random grid search\n", "pipeline = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('SVR', SVR(C = 0.672, epsilon = 1.065, gamma = 'scale', kernel = 'sigmoid',\n", " shrinking = False))\n", "])" ] }, { "cell_type": "code", "execution_count": 23, "id": "5df93f05", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 32 candidates, totalling 160 fits\n", "Best hyperparameters after fine-tuning: Pipeline(steps=[('scaler', StandardScaler()),\n", " ('SVR', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])\n", "Best score: -2.1887871151690304\n" ] } ], "source": [ "# Finetunning on best estimator from random grid search\n", "reg_params = {\n", " 'SVR__C': [0.067,0.672],\n", " 'SVR__epsilon': [1.065, 0.065],\n", " 'SVR__gamma': [\"scale\",\"auto\" ],\n", " 'SVR__kernel': ['sigmoid', 'poly'],\n", " 'SVR__shrinking': [True, False]\n", "}\n", "\n", "gkf = GroupKFold(n_splits=5)\n", "\n", "fine_search = GridSearchCV(estimator=pipeline,\n", " param_grid=reg_params, scoring=\"neg_mean_absolute_error\",\n", " cv=gkf, verbose=1, n_jobs=n_jobs, refit=True)\n", "\n", "fine_search.fit(X = X, y = y, groups = groups)\n", "best_estimator = fine_search.best_estimator_\n", "best_score = fine_search.best_score_\n", "cv_res_ = fine_search.cv_results_\n", " \n", "print(\"Best hyperparameters after fine-tuning:\", best_estimator)\n", "print(\"Best score:\", best_score)" ] }, { "cell_type": "code", "execution_count": 25, "id": "d04f30bd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['procedure_III.pkl']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "# Save procedure III results\n", "scaler = StandardScaler()\n", "best_model = SVR(C=0.672, epsilon=0.065, gamma = 'scale', kernel='sigmoid', shrinking = True)\n", "best_pipeline = Pipeline(steps=[('scaler', scaler), ('model', best_model)])\n", "best_pipeline.fit(X, y)\n", "\n", "joblib.dump(best_pipeline, 'procedure_III.pkl')" ] }, { "cell_type": "markdown", "id": "2c4b2b43", "metadata": {}, "source": [ "\n", " \n", "" ] }, { "cell_type": "code", "execution_count": 27, "id": "b040619e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('scaler', StandardScaler()),\n",
       "                ('model', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "Pipeline(steps=[('scaler', StandardScaler()),\n", " ('model', SVR(C=0.672, epsilon=0.065, kernel='sigmoid'))])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "best_pipeline" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }