biopsykit.classification.utils module

Module with utility functions for machine learning and classification applications.

biopsykit.classification.utils.factorize_subject_id(data, subject_col=None)[source]

Factorize subject IDs, i.e., encode them as an enumerated type or categorical variable.

Parameters
  • data (DataFrame or Series) – input data

  • subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None

Returns

  • groups (ndarray) – A NumPy array of factorized subject IDs, i.e., integer codes that also serve as indices into keys.

  • keys (ndarray) – The unique subject ID values.

Return type

Tuple[numpy.ndarray, numpy.ndarray]
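
The relationship between groups and keys can be illustrated with plain pandas. The sketch below is not taken from the biopsykit source: it assumes a dataframe whose index has a level named "subject" and uses pandas.factorize to produce comparable output. The sample data and the "condition" level are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example data: three subjects, two of them with repeated rows.
index = pd.MultiIndex.from_tuples(
    [("Vp01", "a"), ("Vp01", "b"), ("Vp02", "a"), ("Vp03", "a"), ("Vp02", "b")],
    names=["subject", "condition"],
)
data = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4, 0.5]}, index=index)

# Encode each subject ID as an integer code, numbered by first appearance.
groups, keys = pd.factorize(data.index.get_level_values("subject"))
keys = np.asarray(keys)

print(groups)  # [0 0 1 2 1]

# Indexing keys with groups recovers the original subject ID of every row.
restored = keys[groups]
```

Because groups holds one integer per row, it can be passed as the groups argument of group-aware cross-validation splitters, while keys maps the integers back to readable subject IDs.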

biopsykit.classification.utils.prepare_df_sklearn(data, label_col=None, subject_col=None, print_summary=False)[source]

Prepare a dataframe for use in sklearn functions and return its individual components.

This function performs the following steps:

  • Strip all index levels from the dataframe and return an array that contains only the values (using strip_df())

  • Extract labels from dataframe (using strip_labels())

  • Factorize subject IDs so that each subject ID has a unique number (using factorize_subject_id())

Parameters
  • data (DataFrame) – Input data as pandas dataframe

  • label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None

  • subject_col (str, optional) – name of index level containing subject IDs or None to use default column name (“subject”). Default: None

  • print_summary (bool, optional) – True to print a summary of the shape of the data and label arrays, the number of groups and the class prevalence of all classes, False otherwise. Default: False

Returns

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y_data (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.

  • groups (array-like of shape (n_samples,)) – Factorized subject IDs

  • group_keys (array-like of shape (n_groups,)) – Unique subject IDs, where n_groups is the number of distinct subjects

Return type

Tuple[numpy.ndarray, …]
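
The three steps listed above can be sketched with plain pandas calls. This is an illustration under stated assumptions, not the biopsykit implementation: the dataframe uses the default index level names "subject" and "label", and the feature names and values are invented.

```python
import numpy as np
import pandas as pd

# Made-up data with the default index level names "subject" and "label".
index = pd.MultiIndex.from_tuples(
    [("Vp01", "stress"), ("Vp01", "rest"), ("Vp02", "stress"), ("Vp02", "rest")],
    names=["subject", "label"],
)
data = pd.DataFrame(
    {"hr_mean": [88.0, 64.0, 92.0, 70.0], "hrv_rmssd": [21.0, 45.0, 18.0, 39.0]},
    index=index,
)

# Step 1: keep only the values and drop all index levels.
X = data.to_numpy()
# Step 2: read the class labels from the "label" index level.
y = data.index.get_level_values("label").to_numpy()
# Step 3: encode the subject IDs as integers.
groups, group_keys = pd.factorize(data.index.get_level_values("subject"))

print(X.shape, y.shape, groups.shape, len(group_keys))  # (4, 2) (4,) (4,) 2
```

In this sketch X, y, and groups each have one entry per row of the dataframe, whereas group_keys has one entry per distinct subject.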

biopsykit.classification.utils.split_train_test(X, y, train, test, groups=None)[source]

Split data into train and test set.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Data to be split, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – Target relative to X, i.e. class labels.

  • train (ndarray) – The training set indices for that split

  • test (ndarray) – The test set indices for that split

  • groups (array-like of shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set or None if group labels should not be considered for splitting. Default: None

Returns

  • X_train (ndarray) – Training data

  • X_test (ndarray) – Test data

  • y_train (ndarray) – Targets of training data

  • y_test (ndarray) – Targets of test data

  • group_train (ndarray) – Group labels of training data (only available if groups is not None)

  • group_test (ndarray) – Group labels of test data (only available if groups is not None)

Return type

Tuple[numpy.ndarray, …]
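
The documented behaviour amounts to indexing each array with the train and test index arrays. The sketch below uses a hypothetical helper, split_by_indices, that follows the return order listed above. It is an assumption about how such a split can be written with NumPy, not the library's code, and the data is invented.

```python
import numpy as np


def split_by_indices(X, y, train, test, groups=None):
    """Hypothetical helper: index X, y (and groups, if given) by train/test."""
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    if groups is None:
        # Without group labels only four arrays are returned.
        return X_train, X_test, y_train, y_test
    return X_train, X_test, y_train, y_test, groups[train], groups[test]


# Made-up data: six samples, two features, three subjects (groups).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Index arrays as a cross-validation splitter would yield them;
# here the samples of group 2 are held out for testing.
train = np.array([0, 1, 2, 3])
test = np.array([4, 5])

X_train, X_test, y_train, y_test, group_train, group_test = split_by_indices(
    X, y, train, test, groups
)
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)
print(group_test)  # [2 2]
```

Note that the length of the returned tuple depends on whether groups is passed, so callers need to unpack four or six values accordingly.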

biopsykit.classification.utils.strip_df(data)[source]

Strip all index levels from a dataframe so that only the values remain.

Parameters

data (DataFrame) – input dataframe

Returns

array containing the values of the dataframe, without the index

Return type

ndarray

biopsykit.classification.utils.strip_labels(data, label_col=None)[source]

Strip labels from dataframe index.

Parameters
  • data (DataFrame or Series) – input data

  • label_col (str, optional) – name of index level containing class labels or None to use default column name (“label”). Default: None

Returns

array with labels

Return type

ndarray
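
As a rough illustration of label extraction with a non-default level name, the sketch below reads labels from a Series whose index level is called "group" instead of "label". It relies only on pandas and is an assumption about equivalent behaviour, not the biopsykit source. The level name "group" and the data are made up.

```python
import pandas as pd

# Made-up Series whose class labels live in an index level named "group".
index = pd.MultiIndex.from_tuples(
    [("Vp01", "control"), ("Vp02", "patient"), ("Vp03", "control")],
    names=["subject", "group"],
)
series = pd.Series([0.4, 0.9, 0.5], index=index, name="score")

# Name of the index level holding the labels (plays the role of label_col).
label_col = "group"
labels = series.index.get_level_values(label_col).to_numpy()

print(labels.shape)  # (3,)
```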