biopsykit.questionnaires.utils module
Module containing utility functions for manipulating and processing questionnaire data.
- biopsykit.questionnaires.utils.bin_scale(data, bins, cols=None, first_min=True, last_max=False, inplace=False, **kwargs)
Bin questionnaire scales.
Questionnaire scales are binned using pandas.cut() according to the bins specified by bins.
- Parameters
bins (int or list of float or IntervalIndex) – The criteria to bin by. bins can have one of the following types:
int : Defines the number of equal-width bins in the range of data. The range of data is extended by 0.1% on each side to include the minimum and maximum values of data.
sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of data is done.
IntervalIndex : Defines the exact bins to be used. Note that the IntervalIndex for bins must be non-overlapping.
cols (list of str or list of int, optional) – column name/index (or list of such) to be binned or None to use all columns (or if data is a series). Default: None
first_min (bool, optional) – whether the minimum value should be added as the leftmost edge of the first bin or not. Only considered if bins is a list. Default: True
last_max (bool, optional) – whether the maximum value should be added as the rightmost edge of the last bin or not. Only considered if bins is a list. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False
**kwargs – additional parameters that are passed to pandas.cut()
- Returns
dataframe (or series) with binned scales or None if inplace is True
- Return type
See also
pandas.cut()
Pandas method to bin values into discrete intervals.
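The three accepted types of bins can be illustrated directly with pandas.cut(), which bin_scale relies on. The following is a minimal sketch that calls pandas only, not bin_scale itself; the scores series and the chosen bin edges are made-up values for illustration.

```python
import pandas as pd

# Hypothetical item scores, used only for illustration.
scores = pd.Series([1, 2, 3, 4, 5, 6])

# int: three equal-width bins spanning the range of the data.
equal_width = pd.cut(scores, bins=3, labels=False).tolist()

# Sequence of scalars: explicit, non-uniform bin edges (right-inclusive by default).
custom_edges = pd.cut(scores, bins=[0, 1, 5, 6], labels=False).tolist()

# IntervalIndex: exact, non-overlapping intervals; the result is categorical,
# so the integer bin codes are read from the .cat accessor.
intervals = pd.IntervalIndex.from_tuples([(0, 3), (3, 6)])
exact_bins = pd.cut(scores, bins=intervals).cat.codes.tolist()

print(equal_width)   # [0, 0, 1, 1, 2, 2]
print(custom_edges)  # [0, 1, 1, 1, 1, 2]
print(exact_bins)    # [0, 0, 0, 1, 1, 1]
```

With labels=False, pandas.cut() returns integer bin indices instead of interval labels, which is often the more convenient form for binned questionnaire scales.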
- biopsykit.questionnaires.utils.compute_scores(data, quest_dict, quest_kwargs=None)
Compute questionnaire scores from dataframe.
This function can be used if multiple questionnaires from a dataframe should be computed at once. If the same questionnaire was assessed at multiple time points, these scores will be computed separately (see Notes and Examples). The questionnaires (and the dataframe columns belonging to the questionnaires) are specified by quest_dict.
Note
If questionnaires were collected at different time points (e.g., pre and post), which should all be computed, then the dictionary keys need to have the following format: “<questionnaire_name>-<time_point>”.
- Parameters
data (DataFrame) – dataframe containing questionnaire data
quest_dict (dict) – dictionary with questionnaire names to be computed (keys) and columns of the questionnaires (values)
quest_kwargs (dict) – dictionary with optional arguments to be passed to questionnaire functions. The dictionary is expected to consist of questionnaire names (keys) and **kwargs dictionaries (values) with arguments per questionnaire
- Returns
dataframe with computed questionnaire scores
- Return type
Examples
>>> from biopsykit.questionnaires.utils import compute_scores
>>> quest_dict = {
...     "PSS": ["PSS_{:02d}".format(i) for i in range(1, 11)],  # PSS: one time point
...     "PASA-pre": ["PASA_{:02d}_T0".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
...     "PASA-post": ["PASA_{:02d}_T1".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
... }
>>> compute_scores(data, quest_dict)
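The key format “<questionnaire_name>-<time_point>” can be illustrated with a small sketch. The helper split_quest_key below is hypothetical and not part of BioPsyKit; it assumes that the first hyphen in a key separates the questionnaire name from the time point, which may differ from how the library parses keys internally.

```python
def split_quest_key(key):
    """Split a key of the form "<questionnaire_name>-<time_point>".

    Hypothetical helper for illustration; assumes the first hyphen separates
    the questionnaire name from the time point.
    """
    name, sep, time_point = key.partition("-")
    # Keys without a hyphen describe a questionnaire assessed only once.
    return name, (time_point if sep else None)


quest_dict = {
    "PSS": ["PSS_{:02d}".format(i) for i in range(1, 11)],
    "PASA-pre": ["PASA_{:02d}_T0".format(i) for i in range(1, 17)],
    "PASA-post": ["PASA_{:02d}_T1".format(i) for i in range(1, 17)],
}

parsed = [split_quest_key(key) for key in quest_dict]
print(parsed)  # [('PSS', None), ('PASA', 'pre'), ('PASA', 'post')]
```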
- biopsykit.questionnaires.utils.crop_scale(data, score_range, set_nan=False, inplace=False)
Crop questionnaire scales, i.e., set values out of range to specific minimum and maximum values or to NaN.
- Parameters
score_range (list of int) – possible score range of the questionnaire items. Values out of score_range are cropped.
set_nan (bool, optional) – whether to set values out of range to NaN or to the values specified by score_range. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False
- Returns
dataframe (or series) with cropped scales or None if inplace is True
- Return type
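The two cropping modes described for crop_scale can be sketched in plain Python. The helper crop_values is hypothetical and operates on a list rather than a dataframe; it only mirrors the documented behavior (clamp to score_range, or replace with NaN when set_nan is True) and is not the BioPsyKit implementation.

```python
import math


def crop_values(values, score_range, set_nan=False):
    """Crop values to score_range (hypothetical helper, not the BioPsyKit code)."""
    lower, upper = score_range
    cropped = []
    for value in values:
        if lower <= value <= upper:
            cropped.append(value)      # in range: keep unchanged
        elif set_nan:
            cropped.append(math.nan)   # out of range: mark as missing
        else:
            # out of range: clamp to the nearest bound of score_range
            cropped.append(min(max(value, lower), upper))
    return cropped


items = [0, 3, 7, 5]
print(crop_values(items, [1, 5]))                # [1, 3, 5, 5]
print(crop_values(items, [1, 5], set_nan=True))  # [nan, 3, nan, 5]
```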
- biopsykit.questionnaires.utils.convert_scale(data, offset, cols=None, inplace=False)
Convert the score range of questionnaire items.
- Parameters
- Returns
dataframe with converted columns or None if inplace is True
- Return type
- Raises
ValidationError – if data is not a dataframe or series
Examples
>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import convert_scale
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> # convert data from range [0, 4] to range [1, 5]
>>> data_out = convert_scale(data_in, offset=1)
>>> data_out["A"].tolist()
[2, 3, 4, 2]
>>> data_out["B"].tolist()
[5, 1, 2, 4]
>>> data_out["C"].tolist()
[1, 4, 3, 4]
>>> data_out["D"].tolist()
[1, 2, 3, 5]
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 2, 1, 3], "C": [3, 3, 2, 3], "D": [4, 1, 2, 4]})
>>> # convert data from range [1, 4] to range [0, 3]
>>> data_out = convert_scale(data_in, offset=-1)
>>> print(data_out)
>>> # convert only specific columns (the parameter is named cols)
>>> data_out = convert_scale(data_in, offset=-1, cols=["A", "C"])
>>> print(data_out)
- biopsykit.questionnaires.utils.find_cols(data, regex_str=None, starts_with=None, ends_with=None, contains=None, zero_pad_numbers=True)
Find columns in dataframe that match a specific pattern.
This function is useful to find all columns that belong to a questionnaire. Column names can be filtered based on one (or a combination) of the following criteria:
starts_with : columns have to start with the specified string
ends_with : columns have to end with the specified string
contains : columns have to contain the specified string
Optionally, the item numbers in the matching column names can be zero-padded, if they are not already.
Note
If zero_pad_numbers is True then the column names returned by this function will be renamed and might thus not match the column names of the original dataframe. To solve this, make sure your original dataframe already has zero-padded columns (by manually renaming them) or convert column names using zero_pad_columns().
Warning
Zero-padding using zero_pad_columns() assumes, by default, that numbers are at the end of column names. If you want to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.
- Parameters
data (DataFrame) – dataframe with columns to be filtered
regex_str (str, optional) – regex string to extract column names. If this parameter is passed the other parameters (starts_with, ends_with, contains) will be ignored. Default: None
starts_with (str, optional) – string columns have to start with. Default: None
ends_with (str, optional) – string columns have to end with. Default: None
contains (str, optional) – string columns have to contain. Default: None
zero_pad_numbers (bool, optional) – whether to zero-pad numbers in column names. Default: True
- Returns
- Return type
Examples
>>> import biopsykit as bp
>>> import pandas as pd
>>> # Option 1: has to start with "XX"
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX")
>>> print(cols)
["XX_01", "XX_02", ..., "XX_10"]
>>> # Option 2: has to end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, ends_with="Post")
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 3: has to start with "XX" and end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post", "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", ends_with="Post")
>>> # WARNING: this will not zero-pad the questionnaire numbers!
>>> print(cols)
["XX_1_Post", "XX_2_Post", "XX_3_Post"]
>>> # Option 4: pass custom regex string
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post", "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, regex_str=r"XX_\d+_\w+")
>>> # here, zero-padding will be possible again
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 5: disable zero-padding
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", zero_pad_numbers=False)
>>> print(cols)
["XX_1", "XX_2", ..., "XX_10"]
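How the starts_with, ends_with, and contains criteria combine can be sketched without BioPsyKit. The helper filter_columns below is hypothetical; it assumes that every criterion that is not None must hold at the same time, and it leaves out both the regex_str option and zero-padding.

```python
def filter_columns(columns, starts_with=None, ends_with=None, contains=None):
    """Return the column names that satisfy every criterion that is not None.

    Hypothetical helper for illustration; not the BioPsyKit implementation.
    """
    selected = []
    for col in columns:
        if starts_with is not None and not col.startswith(starts_with):
            continue
        if ends_with is not None and not col.endswith(ends_with):
            continue
        if contains is not None and contains not in col:
            continue
        selected.append(col)
    return selected


columns = ["XX_1_Pre", "XX_2_Pre", "XX_1_Post", "XX_2_Post", "YY_1_Pre", "YY_1_Post"]
print(filter_columns(columns, starts_with="XX", ends_with="Post"))
# ['XX_1_Post', 'XX_2_Post']
print(filter_columns(columns, contains="_1_"))
# ['XX_1_Pre', 'XX_1_Post', 'YY_1_Pre', 'YY_1_Post']
```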
- biopsykit.questionnaires.utils.zero_pad_columns(data, inplace=False)
Add zero-padding to numbers at the end of column names in a dataframe.
Warning
By default, this function assumes that numbers are at the end of column names. If you need to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.
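Manual zero-padding of trailing numbers can be done with a regular expression. The helper zero_pad_name is hypothetical, and the padding width of 2 is an assumption chosen to match names such as "XX_01"; it also shows why names with string suffixes are left untouched when only trailing numbers are considered.

```python
import re


def zero_pad_name(name, width=2):
    """Zero-pad a number at the very end of a column name.

    Hypothetical helper; the padding width of 2 is an assumption.
    """
    # (\d+)$ only matches digits that terminate the name.
    return re.sub(r"(\d+)$", lambda m: m.group(1).zfill(width), name)


print(zero_pad_name("XX_1"))       # XX_01
print(zero_pad_name("XX_10"))      # XX_10
print(zero_pad_name("XX_1_Post"))  # XX_1_Post (number is not at the end)
```

To pad numbers that are followed by a suffix, the pattern would have to be adapted to the naming scheme of the dataframe at hand.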
- biopsykit.questionnaires.utils.invert(data, score_range, cols=None, inplace=False)
Invert questionnaire scores.
In many questionnaires some items need to be inverted (reversed) before sum scores can be computed. This function can be used to either invert a single column (Series), selected columns in a dataframe (by specifying columns in the cols parameter), or a complete dataframe.
- Parameters
- Returns
dataframe with inverted columns or None if inplace is True
- Return type
DataFrame or None
- Raises
ValidationError – if data is not a dataframe or series, or if score_range does not have length 2
ValueRangeError – if values in data are not in score_range
Examples
>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import invert
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> data_out = invert(data_in, score_range=[0, 4])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[0, 4, 3, 1]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[4, 3, 2, 0]
>>> # Other score range
>>> data_out = invert(data_in, score_range=[0, 5])
>>> data_out["A"].tolist()
[4, 3, 2, 4]
>>> data_out["B"].tolist()
[1, 5, 4, 2]
>>> data_out["C"].tolist()
[5, 2, 3, 2]
>>> data_out["D"].tolist()
[5, 4, 3, 1]
>>> # Invert only specific columns
>>> data_out = invert(data_in, score_range=[0, 4], cols=["A", "C"])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[4, 0, 1, 3]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[0, 1, 2, 4]
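The arithmetic behind score inversion can be written out in a few lines. The helper invert_values is hypothetical and works on plain lists; it assumes the usual reverse-scoring rule, in which each value is mirrored within score_range (minimum + maximum - value), and it raises a plain ValueError where the documentation lists library-specific exceptions.

```python
def invert_values(values, score_range):
    """Reverse-score values (hypothetical helper, plain Python lists)."""
    if len(score_range) != 2:
        raise ValueError("score_range must have length 2")
    lower, upper = score_range
    if any(not lower <= v <= upper for v in values):
        raise ValueError("values must lie within score_range")
    # Mirror each value within the range: lower maps to upper and vice versa.
    return [upper + lower - v for v in values]


print(invert_values([1, 2, 3, 1], [0, 4]))  # [3, 2, 1, 3]
print(invert_values([1, 5, 3], [1, 5]))     # [5, 1, 3]
```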
- biopsykit.questionnaires.utils.to_idx(col_idxs)
Convert questionnaire item indices into array indices.
In questionnaires, item indices start at 1. To avoid confusion in the implementation of questionnaires (because array indices start at 0), all questionnaire indices in BioPsyKit also start at 1 and are converted to 0-based indexing using this function.
- Parameters
col_idxs (list of int) – list of indices to convert to 0-based indexing
- Returns
array with converted indices
- Return type
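The index conversion amounts to subtracting 1 from every item number. The helper to_zero_based below is a hypothetical stand-in that returns a plain list, whereas to_idx is documented to return an array; the item numbers and answers are made up for illustration.

```python
def to_zero_based(col_idxs):
    """Convert 1-based item numbers to 0-based positions (hypothetical helper)."""
    return [idx - 1 for idx in col_idxs]


item_numbers = [1, 4, 7]  # items as numbered in the questionnaire
positions = to_zero_based(item_numbers)
print(positions)  # [0, 3, 6]

answers = ["a", "b", "c", "d", "e", "f", "g"]
print([answers[pos] for pos in positions])  # ['a', 'd', 'g']
```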
- biopsykit.questionnaires.utils.wide_to_long(data, quest_name, levels)
Convert a dataframe from wide-format into long-format.
Warning
This function is deprecated and will be removed in the future! Please use wide_to_long() instead.
- Parameters
data (DataFrame) – pandas DataFrame containing saliva data in wide-format, i.e. one column per saliva sample, one row per subject.
quest_name (str) – questionnaire name, i.e., common name for each column to be converted into long-format.
levels (str or list of str) – index levels of the resulting long-format dataframe.
- Returns
pandas DataFrame in long-format
- Return type
See also
wide_to_long()
convert dataframe from wide to long format