biopsykit.classification.utils module

Module with utility functions for machine learning and classification applications.

biopsykit.classification.utils.factorize_subject_id(data, subject_col=None)

Factorize subject IDs, i.e., encode them as an enumerated type or categorical variable.

Parameters

data (DataFrame or Series) – input data
subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None

Returns

groups (ndarray) – A numpy array with factorized subject IDs, one integer code per sample. The codes also serve as an indexer into keys.
keys (ndarray) – The unique subject ID values.

Return type

Tuple[numpy.ndarray, numpy.ndarray]
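The effect of factorization can be illustrated with plain pandas. The following is a minimal sketch, assuming the subject IDs live in an index level named "subject". It calls pandas.factorize directly and is not necessarily how the function is implemented internally; the example data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three subjects, indexed by "subject" and "label"
index = pd.MultiIndex.from_tuples(
    [("Vp01", "a"), ("Vp01", "b"), ("Vp02", "a"), ("Vp03", "b"), ("Vp02", "b")],
    names=["subject", "label"],
)
data = pd.DataFrame({"feature": [0.1, 0.4, 0.3, 0.9, 0.5]}, index=index)

# Encode each subject ID as an integer code, in order of first appearance
groups, keys = pd.factorize(data.index.get_level_values("subject"))
keys = np.asarray(keys)

print(groups.tolist())  # [0, 0, 1, 2, 1]
print(list(keys))       # ['Vp01', 'Vp02', 'Vp03']

# Indexing keys with groups recovers the original subject ID of every row
restored = keys[groups]
```

Because groups indexes into keys, the integer codes can be passed to group-aware splitters while the original IDs remain recoverable for reporting.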

biopsykit.classification.utils.prepare_df_sklearn(data, label_col=None, subject_col=None, print_summary=False)

Prepare a dataframe for use in sklearn functions and return its individual components (values, labels, and subject groups) as arrays.

This function performs the following steps:

Strip all index levels from the dataframe and return an array that only contains values (using strip_df())
Extract labels from the dataframe index (using strip_labels())
Factorize subject IDs so that each subject ID has a unique number (using factorize_subject_id())

Parameters

data (DataFrame) – Input data as pandas dataframe
label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None
subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None
print_summary (bool, optional) – True to print a summary of the shape of the data and label arrays, the number of groups and the class prevalence of all classes, False otherwise. Default: False

Returns

X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y_data (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
groups (array-like of shape (n_samples,)) – Factorized subject IDs
group_keys (array-like of shape (n_subjects,)) – Unique subject IDs, i.e., the keys returned by factorize_subject_id()

Return type

Tuple[numpy.ndarray, …]
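The three steps listed above can be sketched with plain pandas and numpy. This is an illustration under assumptions, not the library's implementation: the index levels are assumed to use the default names "subject" and "label", and the feature names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: rows indexed by "subject" and "label"
index = pd.MultiIndex.from_tuples(
    [("Vp01", "stress"), ("Vp01", "rest"), ("Vp02", "stress"), ("Vp02", "rest")],
    names=["subject", "label"],
)
data = pd.DataFrame(
    {"hr_mean": [85.0, 62.0, 91.0, 70.0], "hrv_rmssd": [21.0, 48.0, 18.0, 39.0]},
    index=index,
)

# Step 1: drop all index levels and keep only the feature values
X = data.reset_index(drop=True).to_numpy()

# Step 2: read the class labels from the "label" index level
y = np.asarray(data.index.get_level_values("label"))

# Step 3: map each subject ID to an integer code
groups, group_keys = pd.factorize(data.index.get_level_values("subject"))
group_keys = np.asarray(group_keys)

print(X.shape, y.shape, groups.shape, group_keys.shape)  # (4, 2) (4,) (4,) (2,)
```

In this sketch X, y, and groups have one entry per sample, while group_keys has one entry per distinct subject.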

biopsykit.classification.utils.split_train_test(X, y, train, test, groups=None)

Split data into train and test set.

Parameters

X (array-like of shape (n_samples, n_features)) – Data to be split, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.
train (ndarray) – The training set indices for that split
test (ndarray) – The test set indices for that split
groups (array-like of shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set or None if group labels should not be considered for splitting. Default: None

Returns

X_train (ndarray) – Training data
X_test (ndarray) – Test data
y_train (ndarray) – Targets of training data
y_test (ndarray) – Targets of test data
group_train (ndarray) – Group labels of training data (only available if groups is not None)
group_test (ndarray) – Group labels of test data (only available if groups is not None)

Return type

Tuple[numpy.ndarray, …]
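The index-based split described here can be sketched with numpy alone. The helper name split_by_indices is hypothetical and only mimics the documented inputs and outputs; in practice, the train and test index arrays would typically come from a scikit-learn cross-validation splitter rather than being built by hand as below.

```python
import numpy as np


def split_by_indices(X, y, train, test, groups=None):
    """Hypothetical helper: select train/test rows by integer index arrays."""
    parts = [X[train], X[test], y[train], y[test]]
    if groups is not None:
        # Group labels are split the same way, but only when they are given
        parts += [groups[train], groups[test]]
    return tuple(parts)


# Made-up data: six samples, two features, three subjects
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Hold out all samples of subject 2 for testing
test = np.flatnonzero(groups == 2)
train = np.flatnonzero(groups != 2)

X_train, X_test, y_train, y_test, group_train, group_test = split_by_indices(
    X, y, train, test, groups
)
```

Without groups, the sketch returns only four arrays, which matches the note that the group outputs are only available if groups is not None.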

biopsykit.classification.utils.strip_df(data)

Strip all index levels from a dataframe so that only the values remain.

biopsykit.classification.utils.strip_labels(data, label_col=None)

Strip labels from dataframe index.

Parameters

data (DataFrame or Series) – input data
label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None

Returns

array with labels

Return type

ndarray