biopsykit.classification.model_selection.sklearn_pipeline_permuter module

Module for systematically evaluating different combinations of sklearn pipelines.

class biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter(model_dict=None, param_dict=None, hyper_search_dict=None, random_state=None)[source]

Bases: object

Class for systematically evaluating different sklearn pipeline combinations.

This class can be used to, for instance, evaluate combinations of different feature selection methods (e.g., SelectKBest, SequentialFeatureSelector) with different estimators (e.g., SVC, DecisionTreeClassifier), and much more.

For all combinations, hyperparameter search (e.g., using grid-search or randomized-search) can be performed by passing one joint parameter grid (see Examples).

Parameters
  • model_dict (dict) – Dictionary specifying the different transformers and estimators to evaluate. Each pipeline step corresponds to one dictionary entry and has the name of the pipeline step (str) as key. The values are again dictionaries with the transformer/estimator names as keys and instances of the transformers/estimators as values.

  • param_dict (dict) – Nested dictionary specifying the parameter settings to try per transformer/estimator. The dictionary has the transformer/estimator names (str) as keys and parameter dictionaries as values. Each parameter dictionary has parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

  • hyper_search_dict (dict, optional) – Nested dictionary specifying the method for hyperparameter search (e.g., “grid” for grid-search or “random” for randomized-search) for each estimator. By default, grid-search is used for each estimator unless individually specified otherwise.

  • random_state (int, optional) – Controls the random seed passed to each estimator and each splitter. By default, no random seed is passed. Set this to an integer for reproducible results across multiple program calls.

Examples

>>> import numpy as np
>>>
>>> from sklearn import datasets
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> from sklearn.feature_selection import SelectKBest, RFE
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.model_selection import KFold
>>>
>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>>
>>> breast_cancer = datasets.load_breast_cancer()
>>> X = breast_cancer.data
>>> y = breast_cancer.target
>>>
>>> model_dict = {
>>>    "scaler": {
>>>         "StandardScaler": StandardScaler(),
>>>         "MinMaxScaler": MinMaxScaler(),
>>>     },
>>>     "reduce_dim": {
>>>         "SelectKBest": SelectKBest(),
>>>         "RFE": RFE(SVC(kernel="linear", C=1))
>>>     },
>>>     "clf" : {
>>>         "KNeighborsClassifier": KNeighborsClassifier(),
>>>         "DecisionTreeClassifier": DecisionTreeClassifier(),
>>>         "SVC": SVC(),
>>>         "AdaBoostClassifier": AdaBoostClassifier(),
>>>     }
>>> }
>>>
>>> param_dict = {
>>>     "StandardScaler": None,
>>>     "MinMaxScaler": None,
>>>     "SelectKBest": { "k": [2, 4, 6, 8, "all"] },
>>>     "RFE": { "n_features_to_select": [2, 4, 6, 8, None] },
>>>     "KNeighborsClassifier": { "n_neighbors": [2, 4, 6, 8], "weights": ["uniform", "distance"] },
>>>     "DecisionTreeClassifier": {"criterion": ['gini', 'entropy'], "max_depth": [2, 4, 6, 8, 10] },
>>>     "AdaBoostClassifier": {
>>>         "base_estimator": [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)],
>>>         "n_estimators": np.arange(20, 210, 10),
>>>         "learning_rate": np.arange(0.6, 1.1, 0.1)
>>>     },
>>>     "SVC": [
>>>         {
>>>             "kernel": ["linear"],
>>>             "C": np.logspace(start=-3, stop=3, num=7)
>>>         },
>>>         {
>>>             "kernel": ["rbf"],
>>>             "C": np.logspace(start=-3, stop=3, num=7),
>>>             "gamma": np.logspace(start=-3, stop=3, num=7)
>>>         }
>>>     ]
>>> }
>>>
>>> # AdaBoost hyperparameters should be optimized using randomized-search, all others using grid-search
>>> hyper_search_dict = {
>>>     "AdaBoostClassifier": {"search_method": "random", "n_iter": 30}
>>> }
>>>
>>> pipeline_permuter = SklearnPipelinePermuter(model_dict, param_dict, hyper_search_dict)
>>> pipeline_permuter.fit(X, y, outer_cv=KFold(), inner_cv=KFold())

models: Dict[str, Dict[str, sklearn.base.BaseEstimator]]

Dictionary with pipeline steps and the different transformers/estimators per step.

params: Dict[str, Optional[Union[Sequence[Dict[str, Any]], Dict[str, Any]]]]

Dictionary with parameter sets to test for the different transformers/estimators per pipeline step.

model_combinations: Sequence[Tuple[Tuple[str, str], ...]]

List of model combinations, i.e. permutations of the different transformers/estimators for each pipeline step.

hyper_search_dict: Dict[str, Dict[str, Any]]

Dictionary specifying the selected hyperparameter search method for each estimator.

param_searches: Dict[Tuple[str, str], Dict[str, Any]]

Dictionary with parameter search results for each pipeline step combination.

scoring: biopsykit.utils._types.str_t

Scoring used as metric for optimization during hyperparameter search.

refit: str

random_state: Optional[numpy.random.mtrand.RandomState]

property results

Dataframe with parameter search results of each pipeline step combination.

classmethod from_csv(file_path, num_pipeline_steps=3)[source]

Create a new SklearnPipelinePermuter instance from a csv file with exported results from parameter search.

Parameters
  • file_path (pathlib.Path or str) – path to csv file

  • num_pipeline_steps (int) – integer specifying the number of steps in the pipeline, used to infer the pipeline steps from the MultiIndex in the dataframe. For instance, for a pipeline consisting of the steps “scaler”, “reduce_dim”, and “clf”, pass 3 as num_pipeline_steps

Returns

SklearnPipelinePermuter instance with results from csv file

Return type

SklearnPipelinePermuter
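
Examples

A minimal usage sketch; “pipeline_results.csv” is a hypothetical path to results previously exported with export_pipeline_score_results():

>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>> # load previously exported parameter search results for a 3-step pipeline
>>> permuter = SklearnPipelinePermuter.from_csv("pipeline_results.csv", num_pipeline_steps=3)
>>> permuter.results.head()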

fit(X, y, *, outer_cv, inner_cv, scoring=None, use_cache=True, **kwargs)[source]

Run fit for all pipeline combinations and sets of parameters.

This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.

  • outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.

  • inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.

  • scoring (str, optional) – A str specifying the scoring metric to use for evaluation.

  • use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (the directory name can be provided via the additional parameter cachedir_name), False otherwise. Default: True

  • **kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
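
Examples

A minimal sketch, continuing the class-level example above; “accuracy” is assumed to be the desired sklearn scoring string:

>>> from sklearn.model_selection import KFold
>>> # 5-fold outer CV for performance estimation, 5-fold inner CV for hyperparameter search
>>> pipeline_permuter.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), scoring="accuracy")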

fit_and_save_intermediate(X, y, *, outer_cv, inner_cv, file_path, scoring=None, use_cache=True, **kwargs)[source]

Run fit for all pipeline combinations and sets of parameters and save intermediate results to file.

This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute. After each model combination, the results are saved to a pickle file.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.

  • outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.

  • inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.

  • file_path (pathlib.Path or str) – path to pickle file

  • scoring (str, optional) – A str specifying the scoring metric to use for evaluation.

  • use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (the directory name can be provided via the additional parameter cachedir_name), False otherwise. Default: True

  • **kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
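
Examples

A minimal sketch, continuing the class-level example above; “permuter_results.pkl” is a hypothetical output path:

>>> from sklearn.model_selection import KFold
>>> # results are re-written to the pickle file after each fitted model combination
>>> pipeline_permuter.fit_and_save_intermediate(
>>>     X, y, outer_cv=KFold(5), inner_cv=KFold(5), file_path="permuter_results.pkl"
>>> )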

pipeline_score_results()[source]

Return parameter search results for each pipeline combination.

Returns

dataframe with parameter search results for each pipeline combination

Return type

DataFrame

metric_summary(additional_metrics=None, pos_label=None)[source]

Return summary with all performance metrics for the best-performing estimator of each pipeline combination.

The best-performing estimator for each pipeline combination is the best_estimator_ that GridSearchCV returns for each outer fold, i.e., the pipeline that yielded the highest average test score over all inner folds.

Parameters
  • additional_metrics (str or list of str, optional) – additional metrics to compute. Default: None. Available metrics can be found in scikit-learn’s metrics and scoring module.

  • pos_label (str, optional) – positive label for binary classification, must be specified if additional_metrics is specified.

Returns

dataframe with a performance metric summary for the best estimator of each pipeline combination.

Return type

DataFrame
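
Examples

A minimal sketch; the metric names follow scikit-learn’s metrics module, and the pos_label value is an assumption that depends on your label encoding:

>>> # compute F1 score and precision in addition to the default metrics
>>> summary = pipeline_permuter.metric_summary(additional_metrics=["f1", "precision"], pos_label=1)
>>> summary.head()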

export_pipeline_score_results(file_path)[source]

Export pipeline score results as csv file.

Parameters

file_path (Path or str) – file path to export

Return type

None

export_metric_summary(file_path)[source]

Export performance metric summary as csv file.

Parameters

file_path (Path or str) – file path to export

Return type

None

best_estimator_summary()[source]

Return a dataframe with the best estimator instances of all pipeline combinations for each fold.

Each entry of the dataframe is a list of Pipeline objects, i.e., the best_estimator_ instances that the hyperparameter search returned for each outer fold.

Returns

dataframe with best estimator instances

Return type

DataFrame

mean_pipeline_score_results()[source]

Compute mean score results for each pipeline combination and hyperparameter combination.

Returns

dataframe with mean score results for each pipeline combination and each parameter combination, sorted by the highest mean score.

Return type

DataFrame

Notes

The pipeline with the highest “mean over the mean test scores” does not necessarily correspond to the best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the performance of the best_estimator instances, as selected by scikit-learn, over all outer folds. Hence, the best_estimator instances of the individual folds can each have a different set of hyperparameters.

This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.
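
Examples

A minimal sketch for inspecting hyperparameter combinations, continuing the class-level example above:

>>> mean_scores = pipeline_permuter.mean_pipeline_score_results()
>>> # pipeline and hyperparameter combinations, sorted by the highest mean test score
>>> mean_scores.head()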

best_hyperparameter_pipeline()[source]

Return the evaluation results for the pipeline with the best-performing hyperparameter set.

This returns the pipeline with the unique hyperparameter combination that achieved the highest mean score over all outer folds.

Notes

This best pipeline does not necessarily correspond to the overall best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the performance of the best_estimator instances, as selected by scikit-learn, over all outer folds. Hence, the best_estimator instances of the individual folds can each have a different set of hyperparameters, whereas this function considers only the single hyperparameter combination with the highest mean score.

This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.

Returns

dataframe with the evaluation results of the best pipeline over all outer folds

Return type

DataFrame

metric_summary_to_latex(data=None, metrics=None, pipeline_steps=None, si_table_format=None, highlight_best=None, **kwargs)[source]

Return a latex table with the performance metrics of the pipeline combinations.

By default, this function uses the metric summary of the SklearnPipelinePermuter instance (see metric_summary()). If the data parameter is set, the function uses the dataframe passed as argument instead.

Parameters
  • data (DataFrame, optional) – dataframe with performance metrics if custom data should be used, or None to use the metric summary of the SklearnPipelinePermuter instance. Default: None

  • metrics (list of str, optional) – list of metrics to include in the table or None to use all available metrics in the dataframe. Default: None

  • pipeline_steps (list of str, optional) – list of pipeline steps to include in the table index or None to show all available pipeline steps as table index. Default: None

  • si_table_format (str, optional) – table format for the siunitx package or None to use the default format. Default: None

  • highlight_best (bool or str, optional) – whether to highlight the pipeline with the best value in each column. If highlight_best is a boolean, the best pipeline is highlighted in every column. If highlight_best is a string, the best pipeline is highlighted only in the column with that name.

  • **kwargs – additional keyword arguments passed to to_latex()

Return type

str
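
Examples

A minimal sketch; the metric name “accuracy” and the pipeline step name “clf” are assumptions based on the class-level example above:

>>> # restrict the table to the accuracy column and index it by the classifier step only
>>> latex_str = pipeline_permuter.metric_summary_to_latex(
>>>     metrics=["accuracy"], pipeline_steps=["clf"], highlight_best=True
>>> )
>>> print(latex_str)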

to_pickle(file_path)[source]

Export the current instance as a pickle file.

Parameters

file_path (Path or str) – file path to export

Return type

None

static from_pickle(file_path)[source]

Import a SklearnPipelinePermuter instance from a pickle file.

Parameters

file_path (Path or str) – file path to import

Returns

SklearnPipelinePermuter instance with results from pickle file

Return type

SklearnPipelinePermuter
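
Examples

A minimal round-trip sketch; “permuter.pkl” is a hypothetical file path:

>>> pipeline_permuter.to_pickle("permuter.pkl")
>>> # later, restore the instance including all parameter search results
>>> permuter = SklearnPipelinePermuter.from_pickle("permuter.pkl")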

update_permuter(model_dict=None, param_dict=None, hyper_search_dict=None)[source]

Update the SklearnPipelinePermuter instance with new model and parameter dictionaries.

Parameters
  • model_dict (dict, optional) – dictionary with model classes for each pipeline step

  • param_dict (dict, optional) – dictionary with parameter grids for each pipeline step

  • hyper_search_dict (dict, optional) – dictionary with hyperparameter search settings for each estimator

Returns

updated SklearnPipelinePermuter instance

Return type

SklearnPipelinePermuter
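
Examples

A call sketch; the model and parameter entries are hypothetical illustrations:

>>> from sklearn.ensemble import RandomForestClassifier
>>> # add a RandomForestClassifier with its parameter grid to the "clf" step
>>> pipeline_permuter = pipeline_permuter.update_permuter(
>>>     model_dict={"clf": {"RandomForestClassifier": RandomForestClassifier()}},
>>>     param_dict={"RandomForestClassifier": {"n_estimators": [100, 200]}},
>>> )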

classmethod merge_permuter_instances(permuter)[source]

Merge two (or more) SklearnPipelinePermuter instances.

This function expects at least two SklearnPipelinePermuter instances to merge. It first performs a deep copy of the first instance and then merges the attributes of all remaining permuter instances into the copy. The permuter instances passed to this function are not modified.

Parameters

permuter (list of SklearnPipelinePermuter instances, or list of Path or str) – list of SklearnPipelinePermuter instances to merge, or list of file paths to pickled SklearnPipelinePermuter instances

Returns

merged SklearnPipelinePermuter instance

Return type

SklearnPipelinePermuter
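
Examples

A minimal sketch; the pickle file names are hypothetical:

>>> # merge results from two separately fitted and pickled permuter instances
>>> merged = SklearnPipelinePermuter.merge_permuter_instances(
>>>     ["permuter_svm.pkl", "permuter_knn.pkl"]
>>> )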

compute_additional_metrics(metric_summary, metrics, pos_label)[source]

Compute additional classification metrics.

Parameters
  • metric_summary (DataFrame) – metric summary from metric_summary()

  • metrics (str or list of str) – metric(s) to compute

  • pos_label (str) – positive label for binary classification

Returns

metric summary with additional metrics computed

Return type

DataFrame
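
Examples

A minimal sketch; “recall” is a metric name from scikit-learn’s metrics module, and the pos_label value is an assumption that depends on your label encoding:

>>> summary = pipeline_permuter.metric_summary()
>>> # append recall for the positive class to the existing metric summary
>>> summary = pipeline_permuter.compute_additional_metrics(summary, metrics="recall", pos_label=1)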