biopsykit.classification.utils module
Module with utility functions for machine learning and classification applications.
- biopsykit.classification.utils.factorize_subject_id(data, subject_col=None)
Factorize subject IDs, i.e., encode them as an enumerated type or categorical variable.
- Parameters
- Returns
- Return type
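The parameter and return descriptions for this function are not available here, so the following is only a minimal sketch of what factorizing subject IDs means, not the library's implementation. It assumes the subject IDs live in an index level named "subject" (the default name mentioned for prepare_df_sklearn) and uses pandas.factorize; the example dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical example data, indexed by "subject" and "label"
data = pd.DataFrame(
    {"feature_a": [0.1, 0.4, 0.3, 0.9], "feature_b": [1.0, 2.0, 1.5, 0.5]},
    index=pd.MultiIndex.from_tuples(
        [("Vp01", "stress"), ("Vp01", "rest"), ("Vp02", "stress"), ("Vp02", "rest")],
        names=["subject", "label"],
    ),
)

# pandas.factorize maps each distinct subject ID to an integer code,
# numbered in order of first appearance, and also returns the distinct IDs
codes, uniques = pd.factorize(data.index.get_level_values("subject"))

print(codes)          # one integer code per row
print(list(uniques))  # the distinct subject IDs
```

Integer codes of this kind are what group-aware cross-validators typically expect as their groups argument.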
- biopsykit.classification.utils.prepare_df_sklearn(data, label_col=None, subject_col=None, print_summary=False)
Prepare a dataframe for use with sklearn functions and return its individual components.
This function performs the following steps:
Strip dataframe from all index levels and return an array that only contains values (using strip_df())
Extract labels from dataframe (using strip_labels())
Factorize subject IDs so that each subject ID has a unique number (using factorize_subject_id())
- Parameters
data (DataFrame) – Input data as pandas dataframe
label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None
subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None
print_summary (bool, optional) – True to print a summary of the shape of the data and label arrays, the number of groups and the class prevalence of all classes, False otherwise. Default: False
- Returns
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y_data (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
groups (array-like of shape (n_samples,)) – Factorized subject IDs
group_keys (array-like of shape (n_samples,)) – Subject IDs
- Return type
Tuple[numpy.ndarray, …]
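The three steps described for prepare_df_sklearn can be sketched with plain pandas and NumPy. The function prepare_arrays_sketch below is a hypothetical stand-in, not the library's code: it assumes the labels and subject IDs are index levels named "label" and "subject", and it returns the subject IDs per sample to match the documented shape (n_samples,), which is itself an assumption about the real behavior.

```python
import pandas as pd


def prepare_arrays_sketch(data, label_col="label", subject_col="subject"):
    """Hypothetical stand-in mirroring the three documented steps."""
    # Step 1: drop all index levels and keep only the values
    X = data.to_numpy()
    # Step 2: extract the class labels from the index
    y_data = data.index.get_level_values(label_col).to_numpy()
    # Step 3: give every subject ID a unique integer
    subject_ids = data.index.get_level_values(subject_col)
    groups, _ = pd.factorize(subject_ids)
    # Assumption: subject IDs are kept per sample, as the documented shape suggests
    group_keys = subject_ids.to_numpy()
    return X, y_data, groups, group_keys


# Hypothetical example data
data = pd.DataFrame(
    {"feature_a": [0.1, 0.4, 0.3, 0.9], "feature_b": [1.0, 2.0, 1.5, 0.5]},
    index=pd.MultiIndex.from_tuples(
        [("Vp01", "stress"), ("Vp01", "rest"), ("Vp02", "stress"), ("Vp02", "rest")],
        names=["subject", "label"],
    ),
)

X, y_data, groups, group_keys = prepare_arrays_sketch(data)
print(X.shape, y_data.shape, groups.shape, group_keys.shape)
```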
- biopsykit.classification.utils.split_train_test(X, y, train, test, groups=None)
Split data into train and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Data to be split, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
train (ndarray) – The training set indices for that split
test (ndarray) – The test set indices for that split
groups (array-like of shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set or None if group labels should not be considered for splitting. Default: None
- Returns
X_train (ndarray) – Training data
X_test (ndarray) – Test data
y_train (ndarray) – Targets of training data
y_test (ndarray) – Targets of test data
group_train (ndarray) – Group labels of training data (only available if groups is not None)
group_test (ndarray) – Group labels of test data (only available if groups is not None)
- Return type
Tuple[numpy.ndarray, …]
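As an illustration of the splitting described above, the sketch below selects rows of X, y and, optionally, groups by integer index arrays. split_by_indices_sketch is a hypothetical stand-in written for this example, not the library's implementation, and the index arrays are written by hand here; in practice they would typically come from a cross-validator's split() method.

```python
import numpy as np


def split_by_indices_sketch(X, y, train, test, groups=None):
    """Hypothetical stand-in: select train and test rows by index arrays."""
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    if groups is None:
        # Without group labels only four arrays are returned
        return X_train, X_test, y_train, y_test
    return X_train, X_test, y_train, y_test, groups[train], groups[test]


# Hypothetical example data: 6 samples, 2 features, 3 subjects
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Hand-written split that holds out all samples of subject 2
train = np.array([0, 1, 2, 3])
test = np.array([4, 5])

with_groups = split_by_indices_sketch(X, y, train, test, groups=groups)
without_groups = split_by_indices_sketch(X, y, train, test)
print(len(with_groups), len(without_groups))
```

Holding out whole subjects, as in this split, keeps samples from the same subject out of both the training and the test set.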
- biopsykit.classification.utils.strip_df(data)
Strip dataframe from all index levels to only contain values.