biopsykit.classification.utils module
Module with utility functions for machine learning and classification applications.
- biopsykit.classification.utils.factorize_subject_id(data, subject_col=None)
Factorize subject IDs, i.e., encode them as an enumerated type or categorical variable.
- Parameters
- Returns
- Return type
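The idea of factorizing subject IDs can be illustrated with `pandas.factorize`. This is a minimal sketch and not necessarily how `factorize_subject_id()` is implemented; the example dataframe, the subject IDs, and the index level name "subject" are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical example data: one row per sample, indexed by a "subject" level.
data = pd.DataFrame(
    {"feature": [0.1, 0.2, 0.3, 0.4]},
    index=pd.Index(["Vp01", "Vp02", "Vp01", "Vp03"], name="subject"),
)

# pandas.factorize returns one integer code per sample and the unique
# values in order of first appearance.
codes, uniques = pd.factorize(data.index.get_level_values("subject"))

print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['Vp01', 'Vp02', 'Vp03']
```

Samples from the same subject share one integer code, which is the form that group-aware cross-validators in sklearn expect for their `groups` argument.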
- biopsykit.classification.utils.prepare_df_sklearn(data, label_col=None, subject_col=None, print_summary=False)
Prepare a dataframe for usage in sklearn functions and return the single components of the dataframe.
This function performs the following steps:
Strip the dataframe of all index levels and return an array that only contains values (using strip_df())
Extract labels from the dataframe (using strip_labels())
Factorize subject IDs so that each subject ID has a unique number (using factorize_subject_id())
- Parameters
data (DataFrame) – Input data as pandas dataframe
label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None
subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None
print_summary (bool, optional) – True to print a summary of the shape of the data and label arrays, the number of groups and the class prevalence of all classes, False otherwise. Default: False
- Returns
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y_data (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
groups (array-like of shape (n_samples,)) – Factorized subject IDs
group_keys (array-like of shape (n_samples,)) – Subject IDs
- Return type
tuple[numpy.ndarray, …]
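The three documented steps can be sketched with plain pandas. This is an illustration under assumptions, not the library's implementation: `prepare_sketch` is a hypothetical stand-in, the example data are invented, and `group_keys` is returned per sample only to match the shape documented above.

```python
import pandas as pd


def prepare_sketch(data, label_col="label", subject_col="subject"):
    """Hypothetical stand-in that mirrors the three documented steps."""
    # Step 1: drop all index information and keep only the values.
    X = data.to_numpy()
    # Step 2: read the class labels from the label index level.
    y = data.index.get_level_values(label_col).to_numpy()
    # Step 3: map each subject ID to an integer code.
    subject_ids = data.index.get_level_values(subject_col)
    groups, _ = pd.factorize(subject_ids)
    # Assumption: per-sample subject IDs, matching the documented shape.
    group_keys = subject_ids.to_numpy()
    return X, y, groups, group_keys


# Hypothetical input: features indexed by "subject" and "label" levels.
index = pd.MultiIndex.from_tuples(
    [("Vp01", "control"), ("Vp01", "control"), ("Vp02", "stress"), ("Vp03", "stress")],
    names=["subject", "label"],
)
data = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [0.5, 0.6, 0.7, 0.8]}, index=index
)

X, y, groups, group_keys = prepare_sketch(data)
print(X.shape, y.shape, groups.shape)  # (4, 2) (4,) (4,)
```

Keeping the labels and subject IDs in the index, rather than in columns, means the values array contains only features after step 1.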
- biopsykit.classification.utils.split_train_test(X, y, train, test, groups=None)
Split data into train and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Data to be split, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
train (ndarray) – The training set indices for that split
test (ndarray) – The test set indices for that split
groups (array-like of shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set or None if group labels should not be considered for splitting. Default: None
- Returns
X_train (ndarray) – Training data
X_test (ndarray) – Test data
y_train (ndarray) – Targets of training data
y_test (ndarray) – Targets of test data
group_train (ndarray) – Group labels of training data (only available if groups is not None)
group_test (ndarray) – Group labels of test data (only available if groups is not None)
- Return type
tuple[numpy.ndarray, …]
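The described behavior amounts to indexing each array with the train and test indices. The following is a minimal sketch assuming NumPy arrays as input; `split_sketch` is a hypothetical stand-in rather than the library function, and the indices are written by hand where a cross-validator would normally supply them.

```python
import numpy as np


def split_sketch(X, y, train, test, groups=None):
    """Hypothetical stand-in: index the arrays with the given indices."""
    parts = [X[train], X[test], y[train], y[test]]
    # Group labels are only split (and returned) when they are provided.
    if groups is not None:
        parts += [groups[train], groups[test]]
    return tuple(parts)


# Hypothetical data: 6 samples, 2 features, 3 subjects with 2 samples each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Hand-written indices: all samples of subject 2 form the test set.
train = np.array([0, 1, 2, 3])
test = np.array([4, 5])

X_train, X_test, y_train, y_test, group_train, group_test = split_sketch(
    X, y, train, test, groups
)
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)
```

Without `groups`, the sketch returns four arrays instead of six, mirroring the "only available if groups is not None" note in the return description.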
- biopsykit.classification.utils.strip_df(data)
Strip dataframe from all index levels to only contain values.