biopsykit.stats.stats module
Module for setting up a pipeline for statistical analysis.
- class biopsykit.stats.stats.StatsPipeline(steps, params, **kwargs)
Bases: object
Class to set up a pipeline for statistical analysis.
The purpose of such a pipeline is to assemble several steps of a typical statistical analysis procedure while setting different parameters. The parameters passed to this class depend on the statistical functions used. Parameters of the various steps are set using their names and the parameter name separated by a “__”, as in the examples below.
The interface of this class is inspired by the scikit-learn Pipeline for machine learning tasks (Pipeline). All functions used are from the pingouin library (https://pingouin-stats.org/) for statistical analysis.
The different steps of statistical analysis can be divided into categories:
Preparatory Analysis (prep): Analyses applied to the data before performing the actual statistical analysis. Currently supported functions are:
  - normality: Test whether a random sample comes from a normal distribution. See normality() for further information.
  - equal_var: Test equality of variances (homoscedasticity). See homoscedasticity() for further information.
Statistical Test (test): Statistical test to determine differences or similarities in the data. Currently supported functions are:
  - pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
  - anova: One-way or N-way ANOVA. See anova() for further information.
  - welch_anova: One-way Welch-ANOVA. See welch_anova() for further information.
  - rm_anova: One-way and two-way repeated measures ANOVA. See rm_anova() for further information.
  - mixed_anova: Mixed-design (split-plot) ANOVA. See mixed_anova() for further information.
  - kruskal: Kruskal-Wallis H-test for independent samples. See kruskal() for further information.
Posthoc Tests (posthoc): Posthoc tests to determine differences of individual groups if more than two groups are analyzed. Currently supported functions are:
  - pairwise_tests: Pairwise tests (either for independent or dependent samples). See pairwise_tests() for further information.
  - pairwise_tukey: Pairwise Tukey-HSD post-hoc test. See pairwise_tukey() for further information.
  - pairwise_gameshowell: Pairwise Games-Howell post-hoc test. See pairwise_gameshowell() for further information.
A StatsPipeline consists of a list of tuples specifying the individual steps of the pipeline. The first value of each tuple indicates the category this step belongs to (prep, test, or posthoc), the second value indicates the analysis function to use in this step (e.g., normality or rm_anova).
Furthermore, a params dictionary specifying the parameters and variables for statistical analysis needs to be supplied. Parameters can either be specified globally, i.e., for all steps in the pipeline (the default), or locally, i.e., only for one specific category, by prepending the category and separating it from the parameter name by a “__”. The parameters depend on the type of analysis used in the pipeline.
Examples are:
  - dv: column name of the dependent variable
  - between: column name of the between-subject factor
  - within: column name of the within-subject factor
  - effsize: type of effect size to compute (if applicable)
  - multicomp: whether (and how) to apply multi-comparison correction of p-values to the last step in the pipeline (either “test” or “posthoc”) using multicomp(). The arguments for the call to multicomp() are supplied via dictionary.
  - …
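The global/local parameter convention described above can be sketched in plain Python. This is an illustrative sketch only, not biopsykit's implementation: the helper resolve_params, the column names "cortisol" and "time", and the rule that a local parameter overrides a global one of the same name are all assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple


def resolve_params(params: Dict[str, Any], category: str) -> Dict[str, Any]:
    """Return the parameters that apply to one pipeline category.

    Hypothetical helper: keys without "__" are treated as global, keys of the
    form "<category>__<name>" as local to that category. Local values are
    assumed to take precedence over global ones.
    """
    resolved: Dict[str, Any] = {}
    # First pass: global parameters (no category prefix).
    for key, value in params.items():
        if "__" not in key:
            resolved[key] = value
    # Second pass: local parameters of the requested category.
    for key, value in params.items():
        prefix, sep, name = key.partition("__")
        if sep and prefix == category:
            resolved[name] = value
    return resolved


# Steps are (category, function name) tuples, as described above.
steps: List[Tuple[str, str]] = [
    ("prep", "normality"),
    ("test", "rm_anova"),
    ("posthoc", "pairwise_tests"),
]

params = {
    "dv": "cortisol",              # hypothetical column name (global)
    "within": "time",              # hypothetical column name (global)
    "posthoc__effsize": "hedges",  # local: applies to the posthoc category only
}

for category, function_name in steps:
    print(category, function_name, resolve_params(params, category))
```

With this reading, only the posthoc step receives effsize, while dv and within reach every step.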
- Parameters
steps (list of tuples) – list of tuples specifying statistical analysis pipeline
params (dict) – dictionary with parameter names and their values
**kwargs (dict) – additional arguments, such as:
  - round: Set the default decimal rounding of the output dataframes or None to disable rounding. Default: Rounding to 4 digits on p-value columns only. See round() for further options.
- results_cat(category)
Return results for pipeline category.
This function filters results from steps belonging to the specified category and returns them.
- Parameters
category ({"prep", "test", "posthoc"}) – category name
- Returns
dataframe with results from the specified category or dict of such if multiple steps belong to the same category.
- Return type
DataFrame or dict
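The "dataframe or dict of such" return behaviour can be pictured with a small stand-alone sketch. It is not the library's code: the function results_for_category is hypothetical, results are assumed to be stored per function name, and plain strings stand in for the result dataframes.

```python
from typing import Any, Dict, List, Tuple


def results_for_category(
    steps: List[Tuple[str, str]], results: Dict[str, Any], category: str
) -> Any:
    """Return the result of the single matching step, or a dict if several match."""
    matches = {name: results[name] for cat, name in steps if cat == category}
    if len(matches) == 1:
        # Exactly one step in this category: return its result directly.
        return next(iter(matches.values()))
    return matches


steps = [("prep", "normality"), ("prep", "equal_var"), ("test", "anova")]
# Placeholder strings instead of dataframes, to keep the sketch dependency-free.
results = {
    "normality": "normality results",
    "equal_var": "equal_var results",
    "anova": "anova results",
}

print(results_for_category(steps, results, "prep"))  # dict with two entries
print(results_for_category(steps, results, "test"))  # single result
```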
- display_results(sig_only=None, **kwargs)
Display formatted results of statistical analysis pipeline.
This function displays the results of the statistical analysis pipeline. The output is Markdown-formatted and optimized for Jupyter Notebooks. The output can be configured to, for example:
  - only show specific categories
  - only show statistically significant results for specific categories or for all pipeline categories
  - display results grouped by a grouper (when the pipeline is applied to multiple groups of data, e.g., to multiple features of the same type independently)
- Parameters
sig_only (bool, str, list, or dict, optional) – whether to only show statistically significant (p < 0.05) results or not. sig_only accepts multiple possible formats:
  - str: filter only one specific category or “all” to filter all categories by statistical significance
  - bool: True to filter all categories by statistical significance, False otherwise
  - list: list of categories whose results should be filtered by statistical significance
  - dict: dictionary with category names and bool values to filter (or not) for statistical significance
  Default: None (no filtering)
**kwargs – additional arguments to be passed to the function, such as:
  - category names: True to display results of this category, False to skip displaying results of this category. Default: show results from all categories
  - grouped: True to group results by the variable “groupby” specified in the parameter dictionary when initializing the StatsPipeline instance.
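One way to read the accepted sig_only formats is as shorthand for a per-category flag. The sketch below normalizes each format into such a dictionary; the function normalize_sig_only is hypothetical and only illustrates the documented semantics, it is not taken from the library.

```python
from typing import Dict, List, Optional, Union

CATEGORIES = ("prep", "test", "posthoc")
SigOnly = Optional[Union[bool, str, List[str], Dict[str, bool]]]


def normalize_sig_only(sig_only: SigOnly) -> Dict[str, bool]:
    """Map every accepted sig_only format to {category: filter_by_significance}."""
    if sig_only is None:
        # Default: no filtering at all.
        return {cat: False for cat in CATEGORIES}
    if isinstance(sig_only, bool):
        # Same flag for all categories.
        return {cat: sig_only for cat in CATEGORIES}
    if isinstance(sig_only, str):
        if sig_only == "all":
            return {cat: True for cat in CATEGORIES}
        # A single category name behaves like a one-element list.
        sig_only = [sig_only]
    if isinstance(sig_only, dict):
        # Categories missing from the dict are assumed to be unfiltered.
        return {cat: bool(sig_only.get(cat, False)) for cat in CATEGORIES}
    return {cat: cat in sig_only for cat in CATEGORIES}


print(normalize_sig_only("posthoc"))
print(normalize_sig_only(["test", "posthoc"]))
```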
- export_statistics(file_path)
Export results of the statistical analysis pipeline to an Excel file.
Each step of the analysis pipeline is saved into its own sheet. The first sheet is an overview of the parameters specified in the analysis pipeline.
- Parameters
file_path (Path or str) – path to export file
- sig_brackets(stats_category_or_data, stats_effect_type, stats_type=None, plot_type='single', features=None, x=None, subplots=False)
Generate significance brackets used to indicate statistical significance in boxplots.
- Parameters
stats_category_or_data ({"prep", "test", "posthoc"} or DataFrame) – either a string to specify the pipeline category to use for generating significance brackets or a dataframe with statistical results if significance brackets should be generated from the dataframe.
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
stats_type ({"between", "within", "interaction"}) –
  Note: Deprecated in 0.4.0. stats_type will be removed in 0.5.0; it is replaced by stats_effect_type.
plot_type ({"single", "multi"}) – type of plot for which significance brackets are generated: “multi” if boxplots are grouped (by hue variable), “single” (the default) otherwise.
features (str, list or dict, optional) – feature(s) used in boxplot. The resulting significance brackets will be filtered accordingly to only contain features present in the boxplot. It can have the following formats:
  - str: only one feature is plotted in the boxplot (returns significance brackets of only one feature)
  - list: multiple features are combined into one Axes object (returns significance brackets of multiple features)
  - dict: dictionary with feature (or list of features) per subplot if boxplots are structured in subplots (subplots is True) (returns dictionary with significance brackets per subplot)
  Default: None to return significance brackets of all features
x (str, optional) – name of x variable used to plot data in boxplot. Only required if plot_type is “multi”.
subplots (bool, optional) – True if multiple boxplots are structured in subplots, False otherwise. Default: False
- Returns
box_pairs – list with significance brackets (or dict of such if subplots is True)
pvalues – list with p values belonging to the significance brackets in box_pairs (or dict of such if subplots is True)
- Return type
Union[Tuple[Sequence[Tuple[str, str]], Sequence[float]], Tuple[Dict[str, Sequence[Tuple[Tuple[str, str], Tuple[str, str]]]], Dict[str, Sequence[float]]]]
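The simplest return shape, a list of bracket pairs plus a matching list of p-values, can be illustrated as follows. This is a sketch under stated assumptions, not the library's algorithm: rows are plain dictionaries standing in for dataframe rows, the column names "A", "B", and "p-corr" are assumed from the pairwise test output described elsewhere on this page, the 0.05 threshold is assumed, and brackets_from_rows is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple, Union

Row = Dict[str, Union[str, float]]


def brackets_from_rows(
    rows: Sequence[Row], alpha: float = 0.05, p_col: str = "p-corr"
) -> Tuple[List[Tuple[str, str]], List[float]]:
    """Keep only significant comparisons and return (box_pairs, pvalues)."""
    box_pairs: List[Tuple[str, str]] = []
    pvalues: List[float] = []
    for row in rows:
        p = float(row[p_col])
        if p < alpha:
            # One bracket spans the two compared groups.
            box_pairs.append((str(row["A"]), str(row["B"])))
            pvalues.append(p)
    return box_pairs, pvalues


# Made-up rows for illustration only.
rows = [
    {"A": "pre", "B": "post", "p-corr": 0.003},
    {"A": "pre", "B": "follow-up", "p-corr": 0.2},
    {"A": "post", "B": "follow-up", "p-corr": 0.04},
]

box_pairs, pvalues = brackets_from_rows(rows)
print(box_pairs)
print(pvalues)
```

Both lists share the same order, so the i-th p-value belongs to the i-th bracket.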
- stats_to_latex(stats_test=None, index=None, data=None)
Generate LaTeX output from statistical results.
- Parameters
stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”, or None if external statistical results are provided via data
index (str or tuple, optional) – row indexer of statistical result or None to generate LaTeX output for all rows. Default: None
data (DataFrame, optional) – dataframe with optional external statistical results
- Returns
LaTeX output that can be copied and pasted into LaTeX documents
- Return type
- Raises
ValueError – if both data and stats_test are None
- results_to_latex_table(stats_test, data=None, stats_effect_type=None, unstack_levels=None, collapse_dof=True, si_table_format=None, index_kws=None, column_kws=None, show_a_b=False, **kwargs)
Convert statistical result dataframe to LaTeX table.
This function converts a dataframe from a statistical analysis to a LaTeX table using to_latex().
This function uses the LaTeX package siunitx (https://ctan.org/pkg/siunitx?lang=en) to represent numbers. By default, the column format for columns that contain numbers is “S”, which is provided by siunitx. The column format can be configured by the si_table_format argument.
- Parameters
stats_test (str, optional) – name of statistical test in StatsPipeline, e.g., “pairwise_tests” or “anova”.
data (DataFrame, optional) – dataframe with optional external statistical results
stats_effect_type ({"between", "within", "interaction"}) – type of statistical effect (“between”, “within”, or “interaction”). Needed to extract the correct information from the analysis dataframe.
unstack_levels (str or list of str, optional) – name(s) of dataframe index level(s) to be unstacked in the resulting LaTeX table or None to unstack no level(s)
collapse_dof (bool, optional) – True to collapse degree-of-freedom (dof) from a separate column into the column header of the t- or F-value, respectively, False to keep it as separate “dof” column. This only works if the degrees-of-freedom are the same for all tests in the table. Default: True
si_table_format (str, optional) – table format for the numbers in the LaTeX table.
index_kws (dict, optional) – dictionary containing arguments to configure how the table index is formatted. Possible arguments are:
  - index_italic (bool): True to format index columns in italic, False otherwise. Default: True
  - index_level_order (list): list of index level names indicating the index level order of a MultiIndex in the LaTeX table. If None, the index order of the dataframe will be used
  - index_value_order (list or dict): list of index values if rows in the LaTeX table should have a different order than the underlying dataframe or if only specific rows should be exported as LaTeX table. If the table index is a MultiIndex, then index_value_order should be a dictionary with the index level names as keys and lists of index values of the specific level as values
  - index_rename_map (dict): dictionary with index values as keys and the new index values to be exported as values
  - index_level_names_tex (str or list of str): names of index levels in the LaTeX table or None to keep the index level names of the dataframe
column_kws (dict, optional) – dictionary containing arguments to configure how the table columns are formatted. Possible arguments are:
  - column_level_order (list): list of column level names indicating the column level order of a MultiIndex in the LaTeX table. If None, the order of the dataframe will be used
  - column_value_order (list or dict): list of column values if columns in the LaTeX table should have a different order than the underlying dataframe or if only specific columns should be exported as LaTeX table. If the table column index is a MultiIndex, then column_value_order should be a dictionary with the column level names as keys and lists of column values of the specific level as values
  - column_rename_map (dict): dictionary with column values as keys and the new column values to be exported as values
  - column_level_names_tex (str or list of str): names of column levels in the LaTeX table or None to keep the column level names of the dataframe
show_a_b (bool, optional) – True to add the names of the measurements (columns “A” and “B”) to the output table, False otherwise. Only needed for pairwise tests (pairwise_tests()). Default: False
kwargs – additional keywords that are passed to to_latex(). The following default arguments will be passed if not specified otherwise:
  - column_format (str): The column format as specified in LaTeX table format, e.g., “rcl” for 3 columns. By default, the column format is automatically inferred from the dataframe, with index columns being formatted as “l” and value columns formatted as “S”. If column headers are multi-columns, “|” will be added as separators between each group.
  - multicolumn_format (str): The alignment for multi-columns. Default: “c”
  - escape (bool): By default, the value will be read from the pandas config module. When set to False, LaTeX special characters in column names are not escaped. Default: False
  - position (str): The LaTeX positional argument for tables, to be placed after \begin{} in the output. Default: “th!”
- Returns
LaTeX code of formatted table
- Return type
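The collapse_dof option described above only applies when every row shares the same degrees of freedom. The sketch below shows that condition on plain dictionaries standing in for dataframe rows. It is illustrative only: the function collapse_dof_rows, the column names "T", "dof", and "p-unc", and the header layout "T(29)" are assumptions for this example, not the library's actual output format.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def collapse_dof_rows(
    rows: List[Row], stat_col: str = "T", dof_col: str = "dof"
) -> Tuple[str, List[Row]]:
    """Fold a constant dof column into the header of the statistic column."""
    dofs = {row[dof_col] for row in rows}
    if len(dofs) != 1:
        # Degrees of freedom differ (or there are no rows): keep the dof column.
        return stat_col, [dict(row) for row in rows]
    header = f"{stat_col}({dofs.pop()})"
    collapsed = [
        {(header if key == stat_col else key): value
         for key, value in row.items() if key != dof_col}
        for row in rows
    ]
    return header, collapsed


# Made-up values for illustration only.
same_dof = [
    {"T": 2.1, "dof": 29, "p-unc": 0.04},
    {"T": 0.5, "dof": 29, "p-unc": 0.6},
]
mixed_dof = [
    {"T": 2.1, "dof": 29, "p-unc": 0.04},
    {"T": 0.5, "dof": 27, "p-unc": 0.6},
]

print(collapse_dof_rows(same_dof))
print(collapse_dof_rows(mixed_dof))
```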
- multicomp(stats_category_or_data, levels=False, method='bonf')
Apply multi-comparison correction to results from statistical analysis.
This function will add a new column p-corr to the dataframe which contains the adjusted p-values. The level(s) on which to perform multi-comparison correction can be specified by the levels parameter.
- Parameters
stats_category_or_data (DataFrame) – dataframe with results from statistical analysis
levels (bool, str, or list of str, optional) – index level(s) on which to perform multi-comparison correction, True to perform multi-comparison correction on the whole dataset (i.e., on no particular index level), or False or None to perform multi-comparison correction on all index levels. Default: False
method (str, optional) – method used for testing and adjustment of p-values. See multicomp() for the available methods. Default: “bonf”
- Returns
dataframe with adjusted p-values
- Return type
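For orientation, the default method “bonf” refers to Bonferroni correction, which multiplies each p-value by the number of comparisons and caps the result at 1. The stand-alone sketch below shows only that arithmetic on a plain list. It does not reproduce the handling of index levels or dataframes, and the function name bonferroni is chosen for this example.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float]) -> List[float]:
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    n_tests = len(pvalues)
    return [min(p * n_tests, 1.0) for p in pvalues]


# Made-up uncorrected p-values for three comparisons.
p_unc = [0.01, 0.04, 0.5]
p_corr = bonferroni(p_unc)
print(p_corr)
```

A comparison that was significant before correction (here the second one, p = 0.04) may no longer fall below 0.05 afterwards, which is the reason for reporting the adjusted values in a separate column.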