biopsykit.classification.model_selection.sklearn_pipeline_permuter module¶
Module for systematically evaluating different combinations of sklearn pipelines.
- class biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter(model_dict=None, param_dict=None, hyper_search_dict=None, random_state=None)[source]¶
Bases: object
Class for systematically evaluating different sklearn pipeline combinations.
This class can be used to, for instance, evaluate combinations of different feature selection methods (e.g., SelectKBest, SequentialFeatureSelector) with different estimators (e.g., SVC, DecisionTreeClassifier), and much more. For all combinations, hyperparameter search (e.g., using grid-search or randomized-search) can be performed by passing one joint parameter grid (see Examples).
- Parameters
model_dict (dict) – Dictionary specifying the different transformers and estimators to evaluate. Each pipeline step corresponds to one dictionary entry and has the name of the pipeline step (str) as key. The values are again dictionaries with the transformer/estimator names as keys and instances of the transformers/estimators as values.
param_dict (dict) – Nested dictionary specifying the parameter settings to try per transformer/estimator. The dictionary has the transformer/estimator names (str) as keys and parameter dictionaries as values. Each parameter dictionary has parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
hyper_search_dict (dict, optional) – Nested dictionary specifying the method for hyperparameter search (e.g., whether to use “grid” for grid-search or “random” for randomized-search) for each estimator. By default, “grid-search” is used for each estimator unless individually specified otherwise.
random_state (int, optional) – Controls the random seed passed to each estimator and each splitter. By default, no random seed is passed. Set this to an integer for reproducible results across multiple program calls.
Examples
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> from sklearn.feature_selection import SelectKBest, RFE
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.model_selection import KFold
>>>
>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>>
>>> breast_cancer = datasets.load_breast_cancer()
>>> X = breast_cancer.data
>>> y = breast_cancer.target
>>>
>>> model_dict = {
>>>     "scaler": {
>>>         "StandardScaler": StandardScaler(),
>>>         "MinMaxScaler": MinMaxScaler(),
>>>     },
>>>     "reduce_dim": {
>>>         "SelectKBest": SelectKBest(),
>>>         "RFE": RFE(SVC(kernel="linear", C=1)),
>>>     },
>>>     "clf": {
>>>         "KNeighborsClassifier": KNeighborsClassifier(),
>>>         "DecisionTreeClassifier": DecisionTreeClassifier(),
>>>         "SVC": SVC(),
>>>         "AdaBoostClassifier": AdaBoostClassifier(),
>>>     },
>>> }
>>>
>>> param_dict = {
>>>     "StandardScaler": None,
>>>     "MinMaxScaler": None,
>>>     "SelectKBest": {"k": [2, 4, 6, 8, "all"]},
>>>     "RFE": {"n_features_to_select": [2, 4, 6, 8, None]},
>>>     "KNeighborsClassifier": {"n_neighbors": [2, 4, 6, 8], "weights": ["uniform", "distance"]},
>>>     "DecisionTreeClassifier": {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6, 8, 10]},
>>>     "AdaBoostClassifier": {
>>>         "base_estimator": [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)],
>>>         "n_estimators": np.arange(20, 210, 10),
>>>         "learning_rate": np.arange(0.6, 1.1, 0.1),
>>>     },
>>>     "SVC": [
>>>         {
>>>             "kernel": ["linear"],
>>>             "C": np.logspace(start=-3, stop=3, num=7),
>>>         },
>>>         {
>>>             "kernel": ["rbf"],
>>>             "C": np.logspace(start=-3, stop=3, num=7),
>>>             "gamma": np.logspace(start=-3, stop=3, num=7),
>>>         },
>>>     ],
>>> }
>>>
>>> # AdaBoost hyperparameters should be optimized using randomized-search, all others using grid-search
>>> hyper_search_dict = {"AdaBoostClassifier": {"search_method": "random", "n_iter": 30}}
>>>
>>> pipeline_permuter = SklearnPipelinePermuter(model_dict, param_dict, hyper_search_dict)
>>> pipeline_permuter.fit(X, y, outer_cv=KFold(), inner_cv=KFold())
- models: Dict[str, Dict[str, sklearn.base.BaseEstimator]]¶
Dictionary with pipeline steps and the different transformers/estimators per step.
- params: Dict[str, Optional[Union[Sequence[Dict[str, Any]], Dict[str, Any]]]]¶
Dictionary with parameter sets to test for the different transformers/estimators per pipeline step.
- model_combinations: Sequence[Tuple[Tuple[str, str], ...]]¶
List of model combinations, i.e. permutations of the different transformers/estimators for each pipeline step.
- hyper_search_dict: Dict[str, Dict[str, Any]]¶
Dictionary specifying the selected hyperparameter search method for each estimator.
- param_searches: Dict[Tuple[str, str], Dict[str, Any]]¶
Dictionary with parameter search results for each pipeline step combination.
- scoring: biopsykit.utils._types.str_t¶
Scoring used as metric for optimization during hyperparameter search.
- random_state: Optional[numpy.random.mtrand.RandomState]¶
Random state passed to each estimator and each splitter for reproducible results, if specified.
- property results¶
Dataframe with parameter search results of each pipeline step combination.
- classmethod from_csv(file_path, num_pipeline_steps=3)[source]¶
Create a new SklearnPipelinePermuter instance from a csv file with exported results from parameter search.
- Parameters
file_path (pathlib.Path or str) – path to csv file
num_pipeline_steps (int) – integer specifying the number of steps in the pipeline. Used to infer pipeline steps from the MultiIndex in the dataframe. For instance, for a pipeline consisting of the steps "scaler", "reduce_dim", and "clf", pass 3 as num_pipeline_steps
- Returns
SklearnPipelinePermuter instance with results from csv file
- Return type
SklearnPipelinePermuter
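A minimal usage sketch, assuming results were previously exported via export_pipeline_score_results(); the file name is illustrative:
>>> from biopsykit.classification.model_selection import SklearnPipelinePermuter
>>> permuter = SklearnPipelinePermuter.from_csv("pipeline_score_results.csv", num_pipeline_steps=3)
>>> permuter.results.head()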
- fit(X, y, *, outer_cv, inner_cv, scoring=None, use_cache=True, **kwargs)[source]¶
Run fit for all pipeline combinations and sets of parameters.
This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.
outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.
inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.
scoring (str, optional) – A str specifying the scoring metric to use for evaluation.
use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (can be provided by the additional parameter cachedir_name), False otherwise. Default: True
**kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
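A minimal sketch of a fit call, continuing the class-level example above; the scoring metric is illustrative:
>>> from sklearn.model_selection import KFold
>>> pipeline_permuter.fit(
>>>     X, y,
>>>     outer_cv=KFold(n_splits=5),
>>>     inner_cv=KFold(n_splits=5),
>>>     scoring="accuracy",
>>> )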
- fit_and_save_intermediate(X, y, *, outer_cv, inner_cv, file_path, scoring=None, use_cache=True, **kwargs)[source]¶
Run fit for all pipeline combinations and sets of parameters and save intermediate results to file.
This function calls nested_cv_param_search() for all Pipeline combinations and stores the results in the param_searches attribute. After each model combination, the results are saved to a pickle file.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples, n_output) or (n_samples,)) – Target (i.e., class labels) relative to X for classification or regression.
outer_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the outer cross-validation.
inner_cv (`CV splitter`_) – Cross-validation object determining the cross-validation splitting strategy of the hyperparameter search.
file_path (pathlib.Path or str) – path to pickle file
scoring (str, optional) – A str specifying the scoring metric to use for evaluation.
use_cache (bool, optional) – True to cache fitted transformer instances of the pipeline in a caching directory (can be provided by the additional parameter cachedir_name), False otherwise. Default: True
**kwargs – Additional arguments that are passed to nested_cv_param_search() and the hyperparameter search class instance (e.g., GridSearchCV or RandomizedSearchCV).
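A minimal sketch, continuing the class-level example; the pickle file name is illustrative:
>>> pipeline_permuter.fit_and_save_intermediate(
>>>     X, y,
>>>     outer_cv=KFold(n_splits=5),
>>>     inner_cv=KFold(n_splits=5),
>>>     file_path="permuter_intermediate.pkl",
>>> )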
- pipeline_score_results()[source]¶
Return parameter search results for each pipeline combination.
- Returns
dataframe with parameter search results for each pipeline combination
- Return type
DataFrame
- metric_summary(additional_metrics=None, pos_label=None)[source]¶
Return summary with all performance metrics for the best-performing estimator of each pipeline combination.
The best-performing estimator for each pipeline combination is the best_estimator_ that GridSearchCV returns for each outer fold, i.e., the pipeline which yielded the highest average test score (over all inner folds).
- Parameters
additional_metrics (str or list of str, optional) – additional metrics to compute. Default: None. Available metrics can be found in scikit-learn's metrics and scoring module.
pos_label (str, optional) – positive label for binary classification; must be specified if additional_metrics is specified.
- Returns
dataframe with performance metric summary of the best estimator of each pipeline combination.
- Return type
DataFrame
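A minimal sketch, continuing the class-level example; the additional metric names and pos_label value are illustrative:
>>> summary = pipeline_permuter.metric_summary(
>>>     additional_metrics=["precision", "recall"],
>>>     pos_label=1,
>>> )
>>> summary.head()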
- export_pipeline_score_results(file_path)[source]¶
Export pipeline score results as csv file.
- Parameters
file_path (pathlib.Path or str) – file path to export
- Return type
None
- export_metric_summary(file_path)[source]¶
Export performance metric summary as csv file.
- Parameters
file_path (pathlib.Path or str) – file path to export
- Return type
None
- best_estimator_summary()[source]¶
Return a dataframe with the best estimator instances of all pipeline combinations for each fold.
Each entry of the dataframe is a list of Pipeline objects, one per outer fold, that were returned as the best estimator of the hyperparameter search.
- Returns
dataframe with best estimator instances
- Return type
DataFrame
- mean_pipeline_score_results()[source]¶
Compute mean score results for each pipeline combination and hyperparameter combination.
- Returns
dataframe with mean score results for each pipeline combination and each parameter combination, sorted by the highest mean score.
- Return type
DataFrame
Notes
The pipeline with the highest "mean over the mean test scores" does not necessarily correspond to the best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the best_estimator instances, as determined by scikit-learn, over all folds. Hence, the best_estimator instances can each have a different set of hyperparameters.
This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.
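A minimal sketch, continuing the class-level example:
>>> mean_scores = pipeline_permuter.mean_pipeline_score_results()
>>> mean_scores.head()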
- best_hyperparameter_pipeline()[source]¶
Return the evaluation results for the pipeline with the best-performing hyperparameter set.
This returns the pipeline with the unique hyperparameter combination that achieved the highest mean score over all outer folds.
Notes
This best pipeline does not necessarily correspond to the overall best-performing pipeline as returned by metric_summary() or best_estimator_summary() because the best-performing pipelines are determined by averaging the best_estimator instances, as determined by scikit-learn, over all folds. Hence, the best_estimator instances can each have a different set of hyperparameters.
This function should only be used if you want to gain a deeper understanding of the different hyperparameter combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper, use metric_summary() or best_estimator_summary() instead.
- Returns
dataframe with the evaluation results of the best pipeline over all outer folds
- Return type
DataFrame
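A minimal sketch, continuing the class-level example:
>>> best_pipeline_results = pipeline_permuter.best_hyperparameter_pipeline()
>>> best_pipeline_results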
- metric_summary_to_latex(data=None, metrics=None, pipeline_steps=None, si_table_format=None, highlight_best=None, **kwargs)[source]¶
Return a latex table with the performance metrics of the pipeline combinations.
By default, this function uses the metric summary of the SklearnPipelinePermuter instance. If the data parameter is set, the function uses the dataframe passed as argument instead.
- Parameters
data (DataFrame, optional) – dataframe with performance metrics if custom data should be used or None to use the metric summary of the SklearnPipelinePermuter instance. Default: None
metrics (list of str, optional) – list of metrics to include in the table or None to use all available metrics in the dataframe. Default: None
pipeline_steps (list of str, optional) – list of pipeline steps to include in the table index or None to show all available pipeline steps as table index. Default: None
si_table_format (str, optional) – table format for the siunitx package or None to use the default format. Default: None
highlight_best (bool or str, optional) – Whether to highlight the pipeline with the best value in each column or not.
* If highlight_best is a boolean, the best pipeline is highlighted in each column.
* If highlight_best is a string, the best pipeline is highlighted in the column with that name.
**kwargs – additional keyword arguments passed to to_latex()
- Return type
str
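A minimal sketch, assuming the metric summary contains a column named "accuracy" (the metric and column names are illustrative):
>>> latex_table = pipeline_permuter.metric_summary_to_latex(
>>>     metrics=["accuracy"],
>>>     highlight_best="accuracy",
>>> )
>>> print(latex_table)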
- to_pickle(file_path)[source]¶
Export the current instance as a pickle file.
- Parameters
file_path (pathlib.Path or str) – file path to export
- Return type
None
- static from_pickle(file_path)[source]¶
Import a SklearnPipelinePermuter instance from a pickle file.
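A minimal round-trip sketch using to_pickle() and from_pickle(); the file name is illustrative:
>>> pipeline_permuter.to_pickle("pipeline_permuter.pkl")
>>> pipeline_permuter = SklearnPipelinePermuter.from_pickle("pipeline_permuter.pkl")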
- update_permuter(model_dict=None, param_dict=None, hyper_search_dict=None)[source]¶
Update the SklearnPipelinePermuter instance with new model and parameter dictionaries.
- Parameters
model_dict (dict, optional) – dictionary specifying the transformers and estimators to evaluate (see SklearnPipelinePermuter)
param_dict (dict, optional) – nested dictionary specifying the parameter settings to try per transformer/estimator (see SklearnPipelinePermuter)
hyper_search_dict (dict, optional) – nested dictionary specifying the hyperparameter search method per estimator (see SklearnPipelinePermuter)
- Returns
updated SklearnPipelinePermuter instance
- Return type
SklearnPipelinePermuter
- classmethod merge_permuter_instances(permuter)[source]¶
Merge two (or more) SklearnPipelinePermuter instances.
This function expects at least two SklearnPipelinePermuter instances to merge. The function first performs a deep copy of the first instance and then merges all attributes of the remaining permuter instances with the copy. The permuter instances passed to this function are not modified.
- Parameters
permuter (list of SklearnPipelinePermuter instances or list of str) – list of SklearnPipelinePermuter instances to merge or list of file paths to pickled SklearnPipelinePermuter instances
- Returns
merged SklearnPipelinePermuter instance
- Return type
SklearnPipelinePermuter
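A minimal sketch, assuming two previously fitted and pickled instances; the file names are illustrative:
>>> permuter_a = SklearnPipelinePermuter.from_pickle("permuter_a.pkl")
>>> permuter_b = SklearnPipelinePermuter.from_pickle("permuter_b.pkl")
>>> merged = SklearnPipelinePermuter.merge_permuter_instances([permuter_a, permuter_b])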
- compute_additional_metrics(metric_summary, metrics, pos_label)[source]¶
Compute additional classification metrics.
- Parameters
metric_summary (DataFrame) – metric summary from metric_summary()
metrics (str or list of str) – metric(s) to compute
pos_label (str) – positive label for binary classification
- Returns
metric summary with additional metrics computed
- Return type
DataFrame
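A minimal sketch, building on metric_summary(); the metric name and pos_label value are illustrative:
>>> summary = pipeline_permuter.metric_summary()
>>> summary = pipeline_permuter.compute_additional_metrics(summary, metrics="f1", pos_label=1)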