biopsykit.classification.model_selection.sklearn_pipeline_permuter module

Module for systematically evaluating different combinations of sklearn pipelines.

class biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter(model_dict=None, param_dict=None, hyper_search_dict=None, random_state=None)[source]

Bases: object

Class for systematically evaluating different sklearn pipeline combinations.

This class can be used to, for instance, evaluate combinations of different feature selection methods (e.g., SelectKBest, SequentialFeatureSelector) with different estimators (e.g., SVC, DecisionTreeClassifier), and much more.

For all combinations, hyperparameter search (e.g., using grid-search or randomized-search) can be performed by passing one joint parameter grid (see Examples).

Parameters
  • model_dict (dict) – Dictionary specifying the different transformers and estimators to evaluate. Each pipeline step corresponds to one dictionary entry and has the name of the pipeline step (str) as key. The values are again dictionaries with the transformer/estimator names as keys and instances of the transformers/estimators as values.

  • param_dict (dict) – Nested dictionary specifying the parameter settings to try per transformer/estimator. The dictionary has the transformer/estimator names (str) as keys and parameter dictionaries as values. Each parameter dictionary has parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

  • hyper_search_dict (dict, optional) – Nested dictionary specifying the method for hyperparameter search (e.g., “grid” for grid-search or “random” for randomized-search) for each estimator. By default, grid-search is used for each estimator unless individually specified otherwise.

  • random_state (int, optional) – Controls the random seed passed to each estimator and each splitter. By default, no random seed is passed. Set this to an integer for reproducible results across multiple program calls.

Examples

>>> import numpy as np
>>>
>>> from sklearn import datasets
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> from sklearn.feature_selection import SelectKBest, RFE
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.model_selection import KFold
>>>
>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>>
>>> breast_cancer = datasets.load_breast_cancer()
>>> X = breast_cancer.data
>>> y = breast_cancer.target
>>>
>>> model_dict = {
>>>    "scaler": {
>>>         "StandardScaler": StandardScaler(),
>>>         "MinMaxScaler": MinMaxScaler(),
>>>     },
>>>     "reduce_dim": {
>>>         "SelectKBest": SelectKBest(),
>>>         "RFE": RFE(SVC(kernel="linear", C=1))
>>>     },
>>>     "clf" : {
>>>         "KNeighborsClassifier": KNeighborsClassifier(),
>>>         "DecisionTreeClassifier": DecisionTreeClassifier(),
>>>         "SVC": SVC(),
>>>         "AdaBoostClassifier": AdaBoostClassifier(),
>>>     }
>>> }
>>>
>>> param_dict = {
>>>     "StandardScaler": None,
>>>     "MinMaxScaler": None,
>>>     "SelectKBest": { "k": [2, 4, 6, 8, "all"] },
>>>     "RFE": { "n_features_to_select": [2, 4, 6, 8, None] },
>>>     "KNeighborsClassifier": { "n_neighbors": [2, 4, 6, 8], "weights": ["uniform", "distance"] },
>>>     "DecisionTreeClassifier": {"criterion": ['gini', 'entropy'], "max_depth": [2, 4, 6, 8, 10] },
>>>     "AdaBoostClassifier": {
>>>         "base_estimator": [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)],
>>>         "n_estimators": np.arange(20, 210, 10),
>>>         "learning_rate": np.arange(0.6, 1.1, 0.1)
>>>     },
>>>     "SVC": [
>>>         {
>>>             "kernel": ["linear"],
>>>             "C": np.logspace(start=-3, stop=3, num=7)
>>>         },
>>>         {
>>>             "kernel": ["rbf"],
>>>             "C": np.logspace(start=-3, stop=3, num=7),
>>>             "gamma": np.logspace(start=-3, stop=3, num=7)
>>>         }
>>>     ]
>>> }
>>>
>>> # AdaBoost hyperparameters should be optimized using randomized-search, all others using grid-search
>>> hyper_search_dict = {
>>>     "AdaBoostClassifier": {"search_method": "random", "n_iter": 30}
>>> }
>>>
>>> pipeline_permuter = SklearnPipelinePermuter(model_dict, param_dict, hyper_search_dict)
>>> pipeline_permuter.fit(X, y, outer_cv=KFold(), inner_cv=KFold())

models: Dict[str, Dict[str, sklearn.base.BaseEstimator]]

Dictionary with pipeline steps and the different transformers/estimators per step.

params: Dict[str, Optional[Union[Sequence[Dict[str, Any]], Dict[str, Any]]]]

Dictionary with parameter sets to test for the different transformers/estimators per pipeline step.

model_combinations: Sequence[Tuple[Tuple[str, str], ...]]

List of model combinations, i.e. permutations of the different transformers/estimators for each pipeline step.

hyper_search_dict: Dict[str, Dict[str, Any]]

Dictionary specifying the selected hyperparameter search method for each estimator.

param_searches: Dict[Tuple[str, str], Dict[str, Any]]

Dictionary with parameter search results for each pipeline step combination.

scoring: biopsykit.utils._types.str_t

Scoring used as metric for optimization during hyperparameter search.

refit: str

random_state: Optional[numpy.random.mtrand.RandomState]

property results

Dataframe with parameter search results of each pipeline step combination.

classmethod from_csv(file_path, num_pipeline_steps=3)[source]

Create a new SklearnPipelinePermuter instance from a csv file with exported results from parameter search.

Parameters
  • file_path (pathlib.Path or str) – path to csv file

  • num_pipeline_steps (int) – integer specifying the number of steps in the pipeline, used to infer the pipeline steps from the MultiIndex in the dataframe. For instance, for a pipeline consisting of the steps “scaler”, “reduce_dim”, and “clf”, pass 3 as num_pipeline_steps

Returns

SklearnPipelinePermuter instance with results from csv file

Return type

SklearnPipelinePermuter
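
Examples

A minimal usage sketch; “pipeline_results.csv” is a hypothetical path to results previously exported with export_pipeline_score_results():

>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>> # load previously exported parameter search results for a 3-step pipeline
>>> permuter = SklearnPipelinePermuter.from_csv("pipeline_results.csv", num_pipeline_steps=3)
>>> permuter.results.head()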

fit(X, y, *, outer_cv, inner_cv, scoring=None, use_cache=True, **kwargs)[source]

Run fit for all pipeline combinations and sets of parameters.

This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.

  • outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.

  • inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.

  • scoring (str, optional) – A str specifying the scoring metric to use for evaluation.

  • use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (the directory name can be provided via the additional parameter cachedir_name), False otherwise. Default: True

  • **kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
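
Examples

A minimal sketch, continuing the class-level example above; “accuracy” is assumed to be the desired sklearn scoring string:

>>> from sklearn.model_selection import KFold
>>> # 5-fold outer CV for performance estimation, 5-fold inner CV for hyperparameter search
>>> pipeline_permuter.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), scoring="accuracy")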

fit_and_save_intermediate(X, y, *, outer_cv, inner_cv, file_path, scoring=None, use_cache=True, **kwargs)[source]

Run fit for all pipeline combinations and sets of parameters and save intermediate results to file.

This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute. After each model combination, the results are saved to a pickle file.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.

  • outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.

  • inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.

  • file_path (pathlib.Path or str) – path to pickle file

  • scoring (str, optional) – A str specifying the scoring metric to use for evaluation.

  • use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (the directory name can be provided via the additional parameter cachedir_name), False otherwise. Default: True

  • **kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
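
Examples

A minimal sketch, continuing the class-level example above; “permuter_results.pkl” is a hypothetical output path:

>>> from sklearn.model_selection import KFold
>>> # results are re-written to the pickle file after each fitted model combination
>>> pipeline_permuter.fit_and_save_intermediate(
>>>     X, y, outer_cv=KFold(5), inner_cv=KFold(5), file_path="permuter_results.pkl"
>>> )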

pipeline_score_results()[source]

Return parameter search results for each pipeline combination.

Returns

dataframe with parameter search results for each pipeline combination

Return type

DataFrame

metric_summary(additional_metrics=None, pos_label=None)[source]

Return summary with all performance metrics for the best-performing estimator of each pipeline combination.

The best-performing estimator for each pipeline combination is the best_estimator_ that GridSearchCV returns for each outer fold, i.e., the pipeline that yielded the highest average test score over all inner folds.

Parameters
  • additional_metrics (str or list of str, optional) – additional metrics to compute. Default: None. Available metrics can be found in scikit-learn’s metrics and scoring module.

  • pos_label (str, optional) – positive label for binary classification, must be specified if additional_metrics is specified.

Returns

dataframe with a performance metric summary for the best estimator of each pipeline combination.

Return type

DataFrame
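
Examples

A minimal sketch; the metric names follow scikit-learn’s metrics module, and the pos_label value is an assumption that depends on your label encoding:

>>> # compute F1 score and precision in addition to the default metrics
>>> summary = pipeline_permuter.metric_summary(additional_metrics=["f1", "precision"], pos_label=1)
>>> summary.head()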

export_pipeline_score_results(file_path)[source]

Export pipeline score results as csv file.

Parameters

file_path (Path or str) – file path to export

Return type

None

export_metric_summary(file_path)[source]

Export performance metric summary as csv file.

Parameters

file_path (Path or str) – file path to export

Return type

None

best_estimator_summary()[source]

Return a dataframe with the best estimator instances of all pipeline combinations for each fold.

Each entry of the dataframe is a list of Pipeline objects, i.e., the best_estimator_ instances that the hyperparameter search returned for each outer fold.

Returns

dataframe with best estimator instances

Return type

DataFrame

mean_pipeline_score_results()[source]

Compute mean score results for each pipeline combination and hyperparameter combination.

Returns

dataframe with mean score results for each pipeline combination and each parameter combination, sorted by the highest mean score.

Return type

DataFrame

Notes

The pipeline with the highest “mean over the mean test scores” does not necessarily correspond to the best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the performance of the best_estimator instances, as selected by scikit-learn, over all outer folds. Hence, the best_estimator instances of the individual folds can each have a different set of hyperparameters.

This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.
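
Examples

A minimal sketch for inspecting hyperparameter combinations, continuing the class-level example above:

>>> mean_scores = pipeline_permuter.mean_pipeline_score_results()
>>> # pipeline and hyperparameter combinations, sorted by the highest mean test score
>>> mean_scores.head()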

best_hyperparameter_pipeline()[source]

Return the evaluation results for the pipeline with the best-performing hyperparameter set.

This returns the pipeline with the unique hyperparameter combination that achieved the highest mean score over all outer folds.

Notes

This best pipeline does not necessarily correspond to the overall best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the performance of the best_estimator instances, as selected by scikit-learn, over all outer folds. Hence, the best_estimator instances of the individual folds can each have a different set of hyperparameters, whereas this function considers only the single hyperparameter combination with the highest mean score.

This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.

Returns

dataframe with the evaluation results of the best pipeline over all outer folds

Return type

DataFrame

metric_summary_to_latex(data=None, metrics=None, pipeline_steps=None, si_table_format=None, highlight_best=None, **kwargs)[source]

Return a latex table with the performance metrics of the pipeline combinations.

By default, this function uses the metric summary of the SklearnPipelinePermuter instance (see metric_summary()). If the data parameter is set, the function uses the dataframe passed as argument instead.

Parameters
  • data (DataFrame, optional) – dataframe with performance metrics if custom data should be used, or None to use the metric summary of the SklearnPipelinePermuter instance. Default: None

  • metrics (list of str, optional) – list of metrics to include in the table or None to use all available metrics in the dataframe. Default: None

  • pipeline_steps (list of str, optional) – list of pipeline steps to include in the table index or None to show all available pipeline steps as table index. Default: None

  • si_table_format (str, optional) – table format for the siunitx package or None to use the default format. Default: None

  • highlight_best (bool or str, optional) – whether to highlight the pipeline with the best value in each column. If highlight_best is a boolean, the best pipeline is highlighted in every column. If highlight_best is a string, the best pipeline is highlighted only in the column with that name.

  • **kwargs – additional keyword arguments passed to to_latex()

Return type

str
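
Examples

A minimal sketch; the metric name “accuracy” and the pipeline step name “clf” are assumptions based on the class-level example above:

>>> # restrict the table to the accuracy column and index it by the classifier step only
>>> latex_str = pipeline_permuter.metric_summary_to_latex(
>>>     metrics=["accuracy"], pipeline_steps=["clf"], highlight_best=True
>>> )
>>> print(latex_str)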

to_pickle(file_path)[source]

Export the current instance as a pickle file.

Parameters

file_path (Path or str) – file path to export

Return type

None

static from_pickle(file_path)[source]

Import a SklearnPipelinePermuter instance from a pickle file.

Parameters

file_path (Path or str) – file path to import

Returns

SklearnPipelinePermuter instance with results from pickle file

Return type

SklearnPipelinePermuter
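
Examples

A minimal round-trip sketch; “permuter.pkl” is a hypothetical file path:

>>> pipeline_permuter.to_pickle("permuter.pkl")
>>> # later, restore the instance including all parameter search results
>>> permuter = SklearnPipelinePermuter.from_pickle("permuter.pkl")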

update_permuter(model_dict=None, param_dict=None, hyper_search_dict=None)[source]

Update the SklearnPipelinePermuter instance with new model and parameter dictionaries.

Parameters
  • model_dict (dict, optional) – dictionary with model classes for each pipeline step

  • param_dict (dict, optional) – dictionary with parameter grids for each pipeline step

  • hyper_search_dict (dict, optional) – dictionary with hyperparameter search settings for each estimator

Returns

updated SklearnPipelinePermuter instance

Return type

SklearnPipelinePermuter
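
Examples

A call sketch; the model and parameter entries are hypothetical illustrations:

>>> from sklearn.ensemble import RandomForestClassifier
>>> # add a RandomForestClassifier with its parameter grid to the "clf" step
>>> pipeline_permuter = pipeline_permuter.update_permuter(
>>>     model_dict={"clf": {"RandomForestClassifier": RandomForestClassifier()}},
>>>     param_dict={"RandomForestClassifier": {"n_estimators": [100, 200]}},
>>> )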

classmethod merge_permuter_instances(permuter)[source]

Merge two (or more) SklearnPipelinePermuter instances.

This function expects at least two SklearnPipelinePermuter instances to merge. It first performs a deep copy of the first instance and then merges the attributes of all remaining permuter instances into the copy. The permuter instances passed to this function are not modified.

Parameters

permuter (list of SklearnPipelinePermuter instances, or list of Path or str) – list of SklearnPipelinePermuter instances to merge, or list of file paths to pickled SklearnPipelinePermuter instances

Returns

merged SklearnPipelinePermuter instance

Return type

SklearnPipelinePermuter
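
Examples

A minimal sketch; the pickle file names are hypothetical:

>>> # merge results from two separately fitted and pickled permuter instances
>>> merged = SklearnPipelinePermuter.merge_permuter_instances(
>>>     ["permuter_svm.pkl", "permuter_knn.pkl"]
>>> )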

compute_additional_metrics(metric_summary, metrics, pos_label)[source]

Compute additional classification metrics.

Parameters
  • metric_summary (DataFrame) – metric summary from metric_summary()

  • metrics (str or list of str) – metric(s) to compute

  • pos_label (str) – positive label for binary classification

Returns

metric summary with additional metrics computed

Return type

DataFrame
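
Examples

A minimal sketch; “recall” is a metric name from scikit-learn’s metrics module, and the pos_label value is an assumption that depends on your label encoding:

>>> summary = pipeline_permuter.metric_summary()
>>> # append recall for the positive class to the existing metric summary
>>> summary = pipeline_permuter.compute_additional_metrics(summary, metrics="recall", pos_label=1)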