biopsykit.stats.stats module
Module for setting up a pipeline for statistical analysis.
- class biopsykit.stats.stats.StatsPipeline(steps, params, **kwargs)
Bases: object
Class to set up a pipeline for statistical analysis.
The purpose of such a pipeline is to assemble several steps of a typical statistical analysis procedure while setting different parameters. The parameters passed to this class depend on the statistical functions used. Parameters of the various steps can be set using the step's name and the parameter name, separated by a “__”, as in the examples below.
The interface of this class is inspired by the scikit-learn Pipeline for machine learning tasks (Pipeline).
All functions used are from the pingouin library (https://pingouin-stats.org/) for statistical analysis.
The different steps of statistical analysis can be divided into categories:
Preparatory Analysis (prep): Analyses applied to the data before performing the actual statistical analysis. Currently supported functions are:
- normality: Test whether a random sample comes from a normal distribution. See normality() for further information.
- equal_var: Test equality of variances (homoscedasticity). See homoscedasticity() for further information.
Statistical Test (test): Statistical test to determine differences or similarities in the data. Currently supported functions are:
- pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
- anova: One-way or N-way ANOVA. See anova() for further information.
- welch_anova: One-way Welch-ANOVA. See welch_anova() for further information.
- rm_anova: One-way and two-way repeated measures ANOVA. See rm_anova() for further information.
- mixed_anova: Mixed-design (split-plot) ANOVA. See mixed_anova() for further information.
- kruskal: Kruskal-Wallis H-test for independent samples. See kruskal() for further information.
Posthoc Tests (posthoc): Posthoc tests to determine differences of individual groups if more than two groups are analyzed. Currently supported functions are:
- pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
- pairwise_tukey: Pairwise Tukey-HSD post-hoc test. See pairwise_tukey() for further information.
- pairwise_gameshowell: Pairwise Games-Howell post-hoc test. See pairwise_gameshowell() for further information.
A StatsPipeline consists of a list of tuples specifying the individual steps of the pipeline. The first value of each tuple indicates the category this step belongs to (prep, test, or posthoc), the second value indicates the analysis function to use in this step (e.g., normality or rm_anova).
Furthermore, a params dictionary specifying the parameters and variables for statistical analysis needs to be supplied. Parameters can either be specified globally, i.e., for all steps in the pipeline (the default), or locally, i.e., only for one specific category, by prepending the category and separating it from the parameter name by a “__”. The parameters depend on the type of analysis used in the pipeline.
Examples are:
- dv: column name of the dependent variable
- between: column name of the between-subject factor
- within: column name of the within-subject factor
- effsize: type of effect size to compute (if applicable)
- multicomp: whether (and how) to apply multi-comparison correction of p-values to the last step in the pipeline (either “test” or “posthoc”) using multicomp(). The arguments for the call to multicomp() are supplied via a dictionary.
- …
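The global/local naming scheme described above can be illustrated with a small, library-independent sketch. It is not biopsykit's implementation: resolve_params is a hypothetical helper, the column names ("score", "time", "subject") are made up, and the rule that a local value overrides a global one of the same name is an assumption.

```python
# Illustrative sketch only: resolve_params is a hypothetical helper and is
# NOT part of biopsykit. It mimics the "category__parameter" naming scheme.

CATEGORIES = ("prep", "test", "posthoc")

# Pipeline definition in the documented format: (category, function name) tuples.
steps = [("prep", "normality"), ("test", "rm_anova"), ("posthoc", "pairwise_tests")]

# "dv", "within" and "subject" are global parameters (column names are made up);
# "posthoc__padjust" is local to the "posthoc" category.
params = {
    "dv": "score",
    "within": "time",
    "subject": "subject",
    "posthoc__padjust": "bonf",
}


def resolve_params(params, category):
    """Return the parameters that apply to one pipeline category."""
    resolved = {}
    local = {}
    for key, value in params.items():
        prefix, sep, name = key.partition("__")
        if sep and prefix in CATEGORIES:
            # Local parameter: keep it only if it targets this category.
            if prefix == category:
                local[name] = value
        else:
            # Global parameter: applies to every category.
            resolved[key] = value
    # Assumption: a local value overrides a global one with the same name.
    resolved.update(local)
    return resolved


for category, function_name in steps:
    print(category, function_name, resolve_params(params, category))
```

In this sketch only the posthoc step receives padjust, while dv, within, and subject reach every step.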
- Parameters
steps (list of tuples) – list of tuples specifying statistical analysis pipeline
params (dict) – dictionary with parameter names and their values
**kwargs (dict) – additional arguments, such as:
- round: Set the default decimal rounding of the output dataframes, or None to disable rounding. Default: rounding to 4 digits on p-value columns only. See round() for further options.
- results_cat(category)
Return results for pipeline category.
This function filters results from steps belonging to the specified category and returns them.
- Parameters
category ({"prep", "test", "posthoc"}) – category name
- Returns
dataframe with results from the specified category or dict of such if multiple steps belong to the same category.
- Return type
DataFrame or dict
- display_results(sig_only=None, **kwargs)
Display formatted results of statistical analysis pipeline.
This function displays the results of the statistical analysis pipeline. The output is Markdown-formatted and optimized for Jupyter Notebooks. The output can be configured to, for example:
- only show specific categories
- only show statistically significant results for specific categories or for all pipeline categories
- display results grouped by a grouper (when the pipeline is applied to multiple groups of data, e.g., to multiple features of the same type independently)
- Parameters
sig_only (bool, str, list, or dict, optional) – whether to only show statistically significant (p < 0.05) results or not. sig_only accepts multiple possible formats:
- str: filter only one specific category, or “all” to filter all categories by statistical significance
- bool: True to filter all categories by statistical significance, False otherwise
- list: list of categories whose results should be filtered by statistical significance
- dict: dictionary with category names and bool values to filter (or not) for statistical significance
Default: None (no filtering)
**kwargs – additional arguments to be passed to the function, such as:
- category names: True to display results of this category, False to skip displaying results of this category. Default: show results from all categories
- grouped: True to group results by the variable “groupby” specified in the parameter dictionary when initializing the StatsPipeline instance.
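One way to read the four sig_only formats is that each of them reduces to a per-category flag. The sketch below shows that reduction under this assumption; normalize_sig_only is a hypothetical helper written for illustration and does not claim to mirror how biopsykit handles the argument internally.

```python
# Illustrative sketch only: normalize_sig_only is a hypothetical helper,
# not a biopsykit function.

CATEGORIES = ("prep", "test", "posthoc")


def normalize_sig_only(sig_only=None):
    """Map any documented sig_only format to a {category: bool} dictionary."""
    if sig_only is None:
        # Documented default: no filtering.
        sig_only = False
    if isinstance(sig_only, bool):
        return {category: sig_only for category in CATEGORIES}
    if isinstance(sig_only, str):
        if sig_only == "all":
            return {category: True for category in CATEGORIES}
        # A single category name behaves like a one-element list.
        sig_only = [sig_only]
    if isinstance(sig_only, dict):
        # Assumption: categories missing from the dict are not filtered.
        return {category: bool(sig_only.get(category, False)) for category in CATEGORIES}
    # Remaining case: a list of category names.
    selected = set(sig_only)
    return {category: category in selected for category in CATEGORIES}


print(normalize_sig_only("posthoc"))
print(normalize_sig_only(["test", "posthoc"]))
```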
- export_statistics(file_path)
Export results of the statistical analysis pipeline to an Excel file.
Each step of the analysis pipeline is saved into its own sheet. The first sheet is an overview of the parameters specified in the analysis pipeline.
- Parameters
file_path (Path or str) – path to export file
- sig_brackets(stats_category_or_data, stats_effect_type, stats_type=None, plot_type='single', features=None, x=None, subplots=False)
Generate significance brackets used to indicate statistical significance in boxplots.
- Parameters
stats_category_or_data ({“prep”, “test”, “posthoc”} or DataFrame) – either a string to specify the pipeline category to use for generating significance brackets or a dataframe with statistical results if significance brackets should be generated from the dataframe.
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
stats_type ({"between", "within", "interaction"}) –
Note: Deprecated in 0.4.0. stats_type will be removed in 0.5.0; it is replaced by stats_effect_type.
plot_type ({"single", "multi"}) – type of plot for which significance brackets are generated: “multi” if boxplots are grouped (by hue variable), “single” (the default) otherwise.
features (str, list or dict, optional) – feature(s) used in boxplot. The resulting significance brackets will be filtered accordingly to only contain features present in the boxplot. It can have the following formats:
- str: only one feature is plotted in the boxplot (returns significance brackets of only one feature)
- list: multiple features are combined into one Axes object (returns significance brackets of multiple features)
- dict: dictionary with feature (or list of features) per subplot if boxplots are structured in subplots (subplots is True) (returns dictionary with significance brackets per subplot)
Default: None to return significance brackets of all features
x (str, optional) – name of x variable used to plot data in boxplot. Only required if plot_type is “multi”.
subplots (bool, optional) – True if multiple boxplots are structured in subplots, False otherwise. Default: False
- Returns
box_pairs – list with significance brackets (or dict of such if subplots is True)
pvalues – list with p-values belonging to the significance brackets in box_pairs (or dict of such if subplots is True)
- Return type
tuple[collections.abc.Sequence[tuple[str, str]], collections.abc.Sequence[float]] | tuple[dict[str, collections.abc.Sequence[tuple[tuple[str, str], tuple[str, str]]]], dict[str, collections.abc.Sequence[float]]]
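For the simplest case (plot_type "single", no subplots), the shape of the two return values can be pictured with a plain-Python sketch. It assumes pairwise results with columns “A”, “B”, and “p-corr” and assumes that only pairs below the p < 0.05 threshold become brackets; brackets_from_rows is a hypothetical helper and the rows are made-up values, not output of the library.

```python
# Illustrative sketch only: brackets_from_rows is a hypothetical helper,
# and the rows below are made-up values, not real analysis results.


def brackets_from_rows(rows, p_col="p-corr", alpha=0.05):
    """Build (box_pairs, pvalues) from pairwise-test rows.

    Each row is a dict with the compared conditions in "A" and "B"
    and a p-value in ``p_col``.
    """
    box_pairs = []
    pvalues = []
    for row in rows:
        p = row[p_col]
        # Assumption: only significant comparisons get a bracket.
        if p < alpha:
            box_pairs.append((row["A"], row["B"]))
            pvalues.append(p)
    return box_pairs, pvalues


rows = [
    {"A": "pre", "B": "post", "p-corr": 0.003},
    {"A": "pre", "B": "follow-up", "p-corr": 0.210},
    {"A": "post", "B": "follow-up", "p-corr": 0.041},
]

box_pairs, pvalues = brackets_from_rows(rows)
print(box_pairs)
print(pvalues)
```

The two lists stay index-aligned, so the i-th p-value belongs to the i-th bracket, which matches the pairing described for box_pairs and pvalues.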
- stats_to_latex(stats_test=None, index=None, data=None)
Generate LaTeX output from statistical results.
- Parameters
stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”, or None if external statistical results are provided via data
index (str or tuple, optional) – row indexer of statistical result or None to generate LaTeX output for all rows. Default: None
data (DataFrame, optional) – dataframe with optional external statistical results
- Returns
LaTeX output that can be copied and pasted into LaTeX documents
- Return type
str
- Raises
ValueError – if both data and stats_test are None
- results_to_latex_table(stats_test, data=None, stats_effect_type=None, unstack_levels=None, collapse_dof=True, si_table_format=None, index_kws=None, column_kws=None, show_a_b=False, **kwargs)
Convert statistical result dataframe to LaTeX table.
This function converts a dataframe from a statistical analysis to a LaTeX table using to_latex().
This function uses the LaTeX package siunitx (https://ctan.org/pkg/siunitx?lang=en) to represent numbers. By default, the column format for columns that contain numbers is “S”, which is provided by siunitx. The column format can be configured by the si_table_format argument.
- Parameters
stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”.
data (DataFrame, optional) – dataframe with optional external statistical results
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
unstack_levels (str or list of str, optional) – name(s) of dataframe index level(s) to be unstacked in the resulting LaTeX table or None to unstack no level(s)
collapse_dof (bool, optional) – True to collapse degrees of freedom (dof) from a separate column into the column header of the t- or F-value, respectively, False to keep it as a separate “dof” column. This only works if the degrees of freedom are the same for all tests in the table. Default: True
si_table_format (str, optional) – table format for the numbers in the LaTeX table.
index_kws (dict, optional) – dictionary containing arguments to configure how the table index is formatted. Possible arguments are:
- index_italic (bool) – True to format index columns in italic, False otherwise. Default: True
- index_level_order (list) – list of index level names indicating the index level order of a MultiIndex in the LaTeX table. If None, the index order of the dataframe will be used
- index_value_order (list or dict) – list of index values if rows in the LaTeX table should have a different order than the underlying dataframe or if only specific rows should be exported as LaTeX table. If the table index is a MultiIndex, then index_value_order should be a dictionary with the index level names as keys and lists of index values of the specific level as values
- index_rename_map (dict) – dictionary with index values as keys and the new index values to be exported as values
- index_level_names_tex (str or list of str) – names of index levels in the LaTeX table or None to keep the index level names of the dataframe
column_kws (dict, optional) – dictionary containing arguments to configure how the table columns are formatted. Possible arguments are:
- column_level_order (list) – list of column level names indicating the column level order of a MultiIndex in the LaTeX table. If None, the order of the dataframe will be used
- column_value_order (list or dict) – list of column values if columns in the LaTeX table should have a different order than the underlying dataframe or if only specific columns should be exported as LaTeX table. If the table column index is a MultiIndex, then column_value_order should be a dictionary with the column level names as keys and lists of column values of the specific level as values
- column_rename_map (dict) – dictionary with column values as keys and the new column values to be exported as values
- column_level_names_tex (str or list of str) – names of column levels in the LaTeX table or None to keep the column level names of the dataframe
show_a_b (bool, optional) – True to add the names of the measurements (columns “A” and “B”) to the output table, False otherwise. Only needed for pairwise tests (pairwise_tests()). Default: False
kwargs – additional keywords that are passed to to_latex(). The following default arguments will be passed if not specified otherwise:
- column_format (str) – The column format as specified in LaTeX table format, e.g., “rcl” for 3 columns. By default, the column format is automatically inferred from the dataframe, with index columns being formatted as “l” and value columns formatted as “S”. If column headers are multi-columns, “|” will be added as separators between each group.
- multicolumn_format (str) – The alignment for multi-columns. Default: “c”
- escape (bool) – By default, the value will be read from the pandas config module. When set to False, LaTeX special characters in column names are not escaped. Default: False
- position (str) – The LaTeX positional argument for tables, to be placed after \begin{} in the output. Default: “th!”
- Returns
LaTeX code of formatted table
- Return type
str
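The default column format described for column_format (index columns as “l”, value columns as “S”) can be sketched without any dependencies. infer_column_format is a hypothetical helper; passing si_table_format as an option in square brackets after “S” is an assumption about how the two arguments interact, and the “|” separators for multi-column headers are not covered here.

```python
# Illustrative sketch only: infer_column_format is a hypothetical helper,
# not a biopsykit function.


def infer_column_format(n_index_cols, n_value_cols, si_table_format=None):
    """Build a LaTeX column specification for a siunitx-based table.

    Index columns are left-aligned ("l"); value columns use the siunitx
    "S" column type, optionally with options in square brackets.
    """
    if si_table_format is None:
        value_spec = "S"
    else:
        # Assumption: the format string is passed as siunitx column options.
        value_spec = f"S[{si_table_format}]"
    return "l" * n_index_cols + value_spec * n_value_cols


print(infer_column_format(1, 3))
print(infer_column_format(2, 2, "table-format=1.3"))
```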
- multicomp(stats_category_or_data, levels=False, method='bonf')
Apply multi-comparison correction to results from statistical analysis.
This function will add a new column p-corr to the dataframe which contains the adjusted p-values. The level(s) on which to perform multi-comparison correction can be specified by the levels parameter.
- Parameters
stats_category_or_data (DataFrame) – dataframe with results from statistical analysis
levels (bool, str, or list of str, optional) – index level(s) on which to perform multi-comparison correction, True to perform multi-comparison correction on the whole dataset (i.e., on no particular index level), or False or None to perform multi-comparison correction on all index levels. Default: False
method (str, optional) – method used for testing and adjustment of p-values. See multicomp() for the available methods. Default: “bonf”
- Returns
dataframe with adjusted p-values
- Return type
DataFrame
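As a reminder of what the default method “bonf” does to a set of p-values, here is a minimal pure-Python sketch of the Bonferroni adjustment. It only illustrates the arithmetic behind the p-corr column for one group of tests; the bonferroni helper is hypothetical, the input values are made up, and the handling of index levels is not modelled.

```python
# Illustrative sketch only: bonferroni is a hypothetical helper showing the
# arithmetic of the Bonferroni correction, not the library's own code path.


def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values.

    Each p-value is multiplied by the number of tests and capped at 1.
    """
    n_tests = len(pvalues)
    return [min(p * n_tests, 1.0) for p in pvalues]


p_unc = [0.01, 0.04, 0.5]  # made-up uncorrected p-values
p_corr = bonferroni(p_unc)
print(p_corr)
```

With three tests, a p-value must be below 0.05 / 3 to remain significant at the 0.05 level after correction, so in this made-up example only the first comparison survives.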