biopsykit.stats.stats module

Module for setting up a pipeline for statistical analysis.

class biopsykit.stats.stats.StatsPipeline(steps, params, **kwargs)

Bases: object

Class to set up a pipeline for statistical analysis.

The purpose of such a pipeline is to assemble several steps of a typical statistical analysis procedure while setting different parameters. The parameters passed to this class depend on the statistical functions used. Parameters of the individual steps can be set by joining the step name and the parameter name with a “__”, as in the examples below.

The interface of this class is inspired by the scikit-learn Pipeline for machine learning tasks (sklearn.pipeline.Pipeline).

All functions used are from the pingouin library (https://pingouin-stats.org/) for statistical analysis.

The different steps of statistical analysis can be divided into categories:

Preparatory Analysis (prep): Analyses applied to the data before performing the actual statistical analysis. Currently supported functions are:
- normality: Test whether a random sample comes from a normal distribution. See normality() for further information.
- equal_var: Test equality of variances (homoscedasticity). See homoscedasticity() for further information.
Statistical Test (test): Statistical test to determine differences or similarities in the data. Currently supported functions are:
- pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
- anova: One-way or N-way ANOVA. See anova() for further information.
- welch_anova: One-way Welch-ANOVA. See welch_anova() for further information.
- rm_anova: One-way and two-way repeated measures ANOVA. See rm_anova() for further information.
- mixed_anova: Mixed-design (split-plot) ANOVA. See mixed_anova() for further information.
- kruskal: Kruskal-Wallis H-test for independent samples. See kruskal() for further information.
Posthoc Tests (posthoc): Posthoc tests to determine differences of individual groups if more than two groups are analyzed. Currently supported functions are:
- pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
- pairwise_tukey: Pairwise Tukey-HSD post-hoc test. See pairwise_tukey() for further information.
- pairwise_gameshowell: Pairwise Games-Howell post-hoc test. See pairwise_gameshowell() for further information.

A StatsPipeline consists of a list of tuples specifying the individual steps of the pipeline. The first value of each tuple indicates the category this step belongs to (prep, test, or posthoc), the second value indicates the analysis function to use in this step (e.g., normality, or rm_anova).
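As a rough illustration of the (category, function) tuple format, the following sketch checks a steps list against the function names listed in this section. SUPPORTED and validate_steps are hypothetical names made up for this example; they are not part of biopsykit, and the library may validate steps differently.

```python
# Hypothetical helper, not part of biopsykit: checks that each step is a
# (category, function) tuple built from the names listed in this section.
SUPPORTED = {
    "prep": {"normality", "equal_var"},
    "test": {"pairwise_tests", "anova", "welch_anova", "rm_anova", "mixed_anova", "kruskal"},
    "posthoc": {"pairwise_tests", "pairwise_tukey", "pairwise_gameshowell"},
}


def validate_steps(steps):
    """Raise ValueError for the first step that is not a known (category, function) pair."""
    for category, function in steps:
        if category not in SUPPORTED:
            raise ValueError(f"unknown category: {category!r}")
        if function not in SUPPORTED[category]:
            raise ValueError(f"{function!r} is not listed for category {category!r}")
    return list(steps)


steps = [
    ("prep", "normality"),
    ("prep", "equal_var"),
    ("test", "anova"),
    ("posthoc", "pairwise_tukey"),
]
validated = validate_steps(steps)
```

A list like steps above is what would be passed as the steps argument of StatsPipeline.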

Furthermore, a params dictionary specifying the parameters and variables for statistical analysis needs to be supplied. Parameters can either be specified globally, i.e., for all steps in the pipeline (the default), or locally, i.e., only for one specific category, by prepending the category and separating it from the parameter name by a __. The parameters depend on the type of analysis used in the pipeline.

Examples are:

dv: column name of the dependent variable
between: column name of the between-subject factor
within: column name of the within-subject factor
effsize: type of effect size to compute (if applicable)
multicomp: whether (and how) to apply multi-comparison correction of p-values to the last step in the pipeline (either “test” or “posthoc”) using multicomp(). The arguments for the call to multicomp() are supplied as a dictionary.
…
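The distinction between global and local parameters can be sketched as follows. This is only an illustration of the “__” naming convention described above, not biopsykit's implementation: resolve_params is a hypothetical helper, the column names are invented, and the rule that a local value overrides a global one is an assumption.

```python
def resolve_params(params, category):
    """Collect the parameters that apply to one category ("prep", "test", or "posthoc")."""
    # Keys without "__" are treated as global and apply to every category.
    resolved = {key: value for key, value in params.items() if "__" not in key}
    # Keys of the form "<category>__<name>" apply to that category only.
    # Assumption: a local value takes precedence over a global one.
    prefix = f"{category}__"
    for key, value in params.items():
        if key.startswith(prefix):
            resolved[key[len(prefix):]] = value
    return resolved


# Column names ("score", "group") are made up for this example.
params = {
    "dv": "score",
    "between": "group",
    "effsize": "cohen",
    "posthoc__effsize": "hedges",
}
test_params = resolve_params(params, "test")
posthoc_params = resolve_params(params, "posthoc")
```

Here the “test” category would see effsize set to "cohen", while the “posthoc” category would see "hedges".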

Parameters

steps (list of tuples) – list of tuples specifying statistical analysis pipeline
params (dict) – dictionary with parameter names and their values
**kwargs (dict) –
additional arguments, such as:
- round: Set the default decimal rounding of the output dataframes or None to disable rounding. Default: Rounding to 4 digits on p-value columns only. See round() for further options.

apply(data)

Apply statistical analysis pipeline on input data.

Parameters: data (DataFrame) – input data to apply statistical analysis pipeline on. Must be provided in long-format.
Returns: dictionary with results from all pipeline steps
Return type: dict
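Long format means one row per observation, with the measured value in one column and the condition in another. The sketch below reshapes a small wide table by hand to show the target layout. The column names (subject, time, score) and the wide_to_long helper are invented for illustration; with a real dataframe, pandas.melt() would typically do this reshaping.

```python
# Wide layout: one row per subject, one column per measurement time.
wide_rows = [
    {"subject": "S01", "pre": 4.0, "post": 6.5},
    {"subject": "S02", "pre": 5.5, "post": 5.0},
]


def wide_to_long(rows, id_key, var_name, value_name):
    """Turn each non-id column of a wide row into its own long-format row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue
            long_rows.append({id_key: row[id_key], var_name: column, value_name: value})
    return long_rows


long_rows = wide_to_long(wide_rows, id_key="subject", var_name="time", value_name="score")
```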

results_cat(category)

Return results for pipeline category.

This function filters results from steps belonging to the specified category and returns them.

Parameters: category ({"prep", "test", "posthoc"}) – category name
Returns: dataframe with results from the specified category or dict of such if multiple steps belong to the same category.
Return type: DataFrame or dict
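The "single result or dict of results" return behaviour can be sketched like this. It assumes, purely for illustration, that results are stored in a dict keyed by function name; results_for_category is a hypothetical stand-in, and plain strings take the place of result dataframes.

```python
def results_for_category(steps, results, category):
    """Return the result of the single matching step, or a dict if several steps match."""
    matching = {function: results[function] for cat, function in steps if cat == category}
    if len(matching) == 1:
        # Exactly one step in this category: return its result directly.
        return next(iter(matching.values()))
    return matching


steps = [("prep", "normality"), ("prep", "equal_var"), ("test", "anova")]
# Strings stand in for the result dataframes of each step.
results = {
    "normality": "normality table",
    "equal_var": "equal_var table",
    "anova": "anova table",
}
test_result = results_for_category(steps, results, "test")
prep_result = results_for_category(steps, results, "prep")
```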

display_results(sig_only=None, **kwargs)

Display formatted results of statistical analysis pipeline.

This function displays the results of the statistical analysis pipeline. The output is Markdown-formatted and optimized for Jupyter Notebooks. The output can be configured, for example, to:

only show specific categories
only show statistically significant results for specific categories or for all pipeline categories
display results grouped by a grouper (when the pipeline is applied to multiple groups of data, e.g., to multiple features of the same type independently)

Parameters

sig_only (bool, str, list, or dict, optional) –
whether to only show statistically significant (p < 0.05) results or not. sig_only accepts multiple possible formats:
- str: filter only one specific category or “all” to filter all categories by statistical significance
- bool: True to filter all categories by statistical significance, False otherwise
- list: list of categories whose results should be filtered by statistical significance
- dict: dictionary with category names and bool values to filter (or not) for statistical significance
Default: None (no filtering)
**kwargs –
additional arguments to be passed to the function, such as:
- category names: True to display results of this category, False to skip displaying results of this category. Default: show results from all categories
- grouped: True to group results by the variable “groupby” specified in the parameter dictionary when initializing the StatsPipeline instance.
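One way to reason about the accepted sig_only formats is to expand each of them into a single mapping from category to bool. The sketch below does that; normalize_sig_only is a hypothetical helper written for this explanation and does not claim to mirror how biopsykit handles the argument internally.

```python
CATEGORIES = ("prep", "test", "posthoc")


def normalize_sig_only(sig_only=None):
    """Expand any accepted sig_only format into a {category: bool} mapping."""
    if sig_only is None:
        sig_only = False  # documented default: no filtering
    if isinstance(sig_only, bool):
        return {category: sig_only for category in CATEGORIES}
    if isinstance(sig_only, str):
        if sig_only == "all":
            return {category: True for category in CATEGORIES}
        sig_only = [sig_only]  # a single category name behaves like a one-item list
    if isinstance(sig_only, dict):
        # Assumption: categories missing from the dict are not filtered.
        return {category: bool(sig_only.get(category, False)) for category in CATEGORIES}
    # Remaining case: a list of category names.
    return {category: category in sig_only for category in CATEGORIES}


only_posthoc = normalize_sig_only("posthoc")
```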

export_statistics(file_path)

Export results of statistical analysis pipeline to Excel file.

Each step of the analysis pipeline is saved into its own sheet. The first sheet is an overview of the parameters specified in the analysis pipeline.

Parameters: file_path (Path or str) – path to export file

sig_brackets(stats_category_or_data, stats_effect_type, stats_type=None, plot_type='single', features=None, x=None, subplots=False)

Generate significance brackets used to indicate statistical significance in boxplots.

Parameters

stats_category_or_data ({“prep”, “test”, “posthoc”} or DataFrame) – either a string to specify the pipeline category to use for generating significance brackets or a dataframe with statistical results if significance brackets should be generated from the dataframe.
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
stats_type ({"between", "within", "interaction"}) – deprecated since 0.4.0 and scheduled for removal in 0.5.0; use stats_effect_type instead.
plot_type ({"single", "multi"}) – type of plot for which significance brackets are generated: “multi” if boxplots are grouped (by hue variable), “single” (the default) otherwise.
features (str, list or dict, optional) –
feature(s) used in boxplot. The resulting significance brackets will be filtered accordingly to only contain features present in the boxplot. It can have the following formats:
- str: only one feature is plotted in the boxplot (returns significance brackets of only one feature)
- list: multiple features are combined into one Axes object (returns significance brackets of multiple features)
- dict: dictionary with feature (or list of features) per subplot if boxplots are structured in subplots (subplots is True) (returns dictionary with significance brackets per subplot)
Default: None to return significance brackets of all features
x (str, optional) – name of x variable used to plot data in boxplot. Only required if plot_type is “multi”.
subplots (bool, optional) – True if multiple boxplots are structured in subplots, False otherwise. Default: False

Returns

box_pairs – list with significance brackets (or dict of such if subplots is True)
pvalues – list with p values belonging to the significance brackets in box_pairs (or dict of such if subplots is True)

Return type

tuple[collections.abc.Sequence[tuple[str, str]], collections.abc.Sequence[float]] | tuple[dict[str, collections.abc.Sequence[tuple[tuple[str, str], tuple[str, str]]]], dict[str, collections.abc.Sequence[float]]]
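To make the shape of the return value concrete, the sketch below builds box_pairs and pvalues from pairwise-test rows for the “single” plot type. It is an illustration under assumptions: brackets_from_rows is a hypothetical helper, rows are plain dicts rather than a dataframe, the p-value column is assumed to be named "p-corr", and keeping only rows with p < 0.05 is assumed rather than confirmed behaviour.

```python
def brackets_from_rows(rows, p_column="p-corr", alpha=0.05):
    """Collect (A, B) pairs and their p-values for rows with p < alpha."""
    box_pairs = []
    pvalues = []
    for row in rows:
        if row[p_column] < alpha:
            box_pairs.append((row["A"], row["B"]))
            pvalues.append(row[p_column])
    return box_pairs, pvalues


# Invented pairwise results; "A" and "B" name the two compared measurements.
rows = [
    {"A": "pre", "B": "post", "p-corr": 0.012},
    {"A": "pre", "B": "follow-up", "p-corr": 0.40},
    {"A": "post", "B": "follow-up", "p-corr": 0.003},
]
box_pairs, pvalues = brackets_from_rows(rows)
```

For the “multi” plot type, the return type above indicates that each side of a bracket is itself a tuple of two strings rather than a single label.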

stats_to_latex(stats_test=None, index=None, data=None)

Generate LaTeX output from statistical results.

Parameters

stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”, or None if external statistical results are provided via data
index (str or tuple, optional) – row indexer of statistical result or None to generate LaTeX output for all rows. Default: None
data (DataFrame, optional) – dataframe with optional external statistical results

Returns

LaTeX output that can be copied and pasted into LaTeX documents

Return type

str

Raises

ValueError – if both data and stats_test are None

results_to_latex_table(stats_test, data=None, stats_effect_type=None, unstack_levels=None, collapse_dof=True, si_table_format=None, index_kws=None, column_kws=None, show_a_b=False, **kwargs)

Convert statistical result dataframe to LaTeX table.

This function converts a dataframe from a statistical analysis to a LaTeX table using to_latex().

This function uses the LaTeX package siunitx (https://ctan.org/pkg/siunitx?lang=en) to represent numbers. By default, the column format for columns that contain numbers is “S” which is provided by siunitx. The column format can be configured by the si_table_format argument.

Parameters

stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”.
data (DataFrame, optional) – dataframe with optional external statistical results
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
unstack_levels (str or list of str, optional) – name(s) of dataframe index level(s) to be unstacked in the resulting LaTeX table or None to unstack no level(s)
collapse_dof (bool, optional) – True to collapse degree-of-freedom (dof) from a separate column into the column header of the t- or F-value, respectively, False to keep it as separate “dof” column. This only works if the degrees-of-freedom are the same for all tests in the table. Default: True
si_table_format (str, optional) – table format for the numbers in the LaTeX table.
index_kws (dict, optional) –
dictionary containing arguments to configure how the table index is formatted. Possible arguments are:
- index_italic : bool True to format index columns in italic, False otherwise. Default: True
- index_level_order : list list of index level names indicating the index level order of a MultiIndex in the LaTeX table. If None the index order of the dataframe will be used
- index_value_order : list or dict list of index values if rows in LaTeX table should have a different order than the underlying dataframe or if only specific rows should be exported as LaTeX table. If the table index is a MultiIndex then index_value_order should be a dictionary with the index level names as keys and lists of index values of the specific level as values
- index_rename_map : dict dictionary mapping index values (keys) to the new index values to be exported
- index_level_names_tex : str or list of str names of index levels in the LaTeX table or None to keep the index level names of the dataframe
column_kws (dict, optional) –
dictionary containing arguments to configure how the table columns are formatted. Possible arguments are:
- column_level_order : list list of column level names indicating the column level order of a MultiIndex in the LaTeX table. If None the order of the dataframe will be used
- column_value_order : list or dict list of column values if columns in LaTeX table should have a different order than the underlying dataframe or if only specific columns should be exported as LaTeX table. If the table column index is a MultiIndex then column_value_order should be a dictionary with the column level names as keys and lists of column values of the specific level as values
- column_rename_map : dict dictionary mapping column values (keys) to the new column values to be exported
- column_level_names_tex : str or list of str names of column levels in the LaTeX table or None to keep the column level names of the dataframe
show_a_b (bool, optional) – True to add the names of the measurements (columns “A” and “B”) to the output table, False otherwise. Only needed for pairwise tests (pairwise_tests()). Default: False
kwargs –
additional keywords that are passed to to_latex(). The following default arguments will be passed if not specified otherwise:
- column_format: str The columns format as specified in LaTeX table format e.g. “rcl” for 3 columns. By default, the column format is automatically inferred from the dataframe, with index columns being formatted as “l” and value columns formatted as “S”. If column headers are multi-columns, “|” will be added as separators between each group.
- multicolumn_format : str The alignment for multi-columns. Default: “c”
- escape : bool When set to False, LaTeX special characters in column names are not escaped. If not specified in pandas, the value is read from the pandas config module. Default: False
- position : str The LaTeX positional argument for tables, to be placed after \begin{} in the output. Default: “th!”
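The inferred column_format described above can be pictured with a small sketch: “l” for every index column, “S” for every value column, and “|” between multi-column groups. infer_column_format is a hypothetical helper for illustration only; where exactly the library places separators is an assumption.

```python
def infer_column_format(n_index_columns, value_group_sizes):
    """Build a LaTeX column format: "l" per index column, "S" per value column.

    value_group_sizes lists how many value columns each multi-column group
    contains; groups are separated by "|". A single group needs no separator.
    """
    index_part = "l" * n_index_columns
    value_part = "|".join("S" * size for size in value_group_sizes)
    return index_part + value_part


# One index column and three plain value columns.
flat_format = infer_column_format(1, [3])
# Two index columns and two multi-column groups of two value columns each.
grouped_format = infer_column_format(2, [2, 2])
```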

Returns

LaTeX code of formatted table

Return type

str
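The collapse_dof behaviour, including its restriction to tables in which all tests share the same degrees of freedom, can be sketched as follows. This is not the library's code: collapse_dof_rows is a hypothetical helper, rows are plain dicts, and the header style "T(9)" is an assumed rendering.

```python
def collapse_dof_rows(rows, stat_column="T", dof_column="dof"):
    """Fold a constant dof column into the statistic's header, e.g. "T" -> "T(9)"."""
    dofs = {row[dof_column] for row in rows}
    if len(dofs) != 1:
        # Degrees of freedom differ between tests, so keep the separate column.
        return [dict(row) for row in rows]
    new_header = f"{stat_column}({dofs.pop()})"
    collapsed = []
    for row in rows:
        new_row = {
            key: value
            for key, value in row.items()
            if key not in (stat_column, dof_column)
        }
        new_row[new_header] = row[stat_column]
        collapsed.append(new_row)
    return collapsed


same_dof = [{"T": 2.5, "dof": 9, "p": 0.03}, {"T": 1.1, "dof": 9, "p": 0.30}]
mixed_dof = [{"T": 2.5, "dof": 9, "p": 0.03}, {"T": 1.1, "dof": 8, "p": 0.30}]
collapsed = collapse_dof_rows(same_dof)
unchanged = collapse_dof_rows(mixed_dof)
```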

multicomp(stats_category_or_data, levels=False, method='bonf')

Apply multi-comparison correction to results from statistical analysis.

This function will add a new column p-corr to the dataframe which contains the adjusted p-values. The level(s) on which to perform multi-comparison correction can be specified by the levels parameter.

Parameters

stats_category_or_data (DataFrame) – dataframe with results from statistical analysis
levels (bool, str, or list of str, optional) – index level(s) on which to perform multi-comparison correction, True to perform multi-comparison correction on the whole dataset (i.e., on no particular index level), or False or None to perform multi-comparison correction on all index levels. Default: False
method (str, optional) – method used for testing and adjustment of p-values. See multicomp() for the available methods. Default: “bonf”

Returns

dataframe with adjusted p-values

Return type

DataFrame
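As a worked illustration of what a Bonferroni adjustment (the default method, “bonf”) does when applied per level, consider the sketch below. It is a simplified stand-in, not the pingouin or biopsykit implementation: rows are plain dicts, the uncorrected p-value column name "p-unc" is an assumption, and the level argument of this hypothetical helper is a single key rather than the levels options documented above.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(pvalues)
    return [min(p * n_tests, 1.0) for p in pvalues]


def multicomp_by_level(rows, level=None, p_column="p-unc"):
    """Add a "p-corr" entry to each row, correcting within each value of `level`.

    With level=None all rows are corrected together as one family of tests.
    """
    groups = {}
    for row in rows:
        key = row[level] if level is not None else None
        groups.setdefault(key, []).append(row)
    corrected = []
    for group in groups.values():
        adjusted = bonferroni([row[p_column] for row in group])
        for row, p_corr in zip(group, adjusted):
            corrected.append({**row, "p-corr": p_corr})
    return corrected


# Invented results for two features; feature "a" has two tests, "b" has one.
rows = [
    {"feature": "a", "p-unc": 0.01},
    {"feature": "a", "p-unc": 0.04},
    {"feature": "b", "p-unc": 0.03},
]
per_feature = multicomp_by_level(rows, level="feature")
whole_dataset = multicomp_by_level(rows)
```

Correcting per feature multiplies the p-values of feature "a" by 2 and those of feature "b" by 1, whereas correcting the whole dataset multiplies every p-value by 3.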
