biopsykit.questionnaires.utils module

Module containing utility functions for manipulating and processing questionnaire data.

biopsykit.questionnaires.utils.bin_scale(data, bins, cols=None, first_min=True, last_max=False, inplace=False, **kwargs)

Bin questionnaire scales.

Questionnaire scales are binned using pandas.cut() according to the bins specified by bins.

Parameters

data (DataFrame or Series) – data with scales to be binned
bins (int or list of float or IntervalIndex) –
The criteria to bin by. bins can have one of the following types:
- int : Defines the number of equal-width bins in the range of data. The range of data is extended by 0.1% on each side to include the minimum and maximum values of data.
- sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of data is done.
- IntervalIndex : Defines the exact bins to be used. Note that the IntervalIndex for bins must be non-overlapping.
cols (list of str or list of int, optional) – column name/index (or list of such) to be binned or None to use all columns (or if data is a series). Default: None
first_min (bool, optional) – whether the minimum value should be added as the leftmost edge of the first bin or not. Only considered if bins is a list. Default: True
last_max (bool, optional) – whether the maximum value should be added as the rightmost edge of the last bin or not. Only considered if bins is a list. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False
**kwargs – additional parameters that are passed to pandas.cut()

Returns

dataframe (or series) with binned scales or None if inplace is True

Return type

DataFrame, Series, or None

See also

pandas.cut(): Pandas method to bin values into discrete intervals.
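As a rough illustration of the three accepted forms of bins, the following sketch calls pandas.cut() directly rather than bin_scale(). The sample scores and bin edges are made up for illustration, and labels=False is passed only so that integer bin codes are returned instead of interval labels.

```python
import pandas as pd

# Made-up item scores, for illustration only.
scores = pd.Series([1, 5, 10, 15, 20])

# Sequence of scalars: explicit bin edges, right-inclusive by default,
# i.e. the bins are (0, 5], (5, 10], (10, 20].
by_edges = pd.cut(scores, bins=[0, 5, 10, 20], labels=False)
print(by_edges.tolist())  # [0, 0, 1, 2, 2]

# int: number of equal-width bins over the range of the data.
by_count = pd.cut(scores, bins=2, labels=False)
print(by_count.tolist())  # [0, 0, 0, 1, 1]

# IntervalIndex: the exact, non-overlapping bins to use.
intervals = pd.IntervalIndex.from_breaks([0, 10, 20])
by_interval = pd.cut(scores, bins=intervals)
print(by_interval.cat.codes.tolist())  # [0, 0, 0, 1, 1]
```

Values that fall outside explicit bin edges become NaN in pandas.cut(), which is the situation the first_min and last_max options are meant to address.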

biopsykit.questionnaires.utils.compute_scores(data, quest_dict, quest_kwargs=None)

Compute questionnaire scores from dataframe.

This function can be used if multiple questionnaires from a dataframe should be computed at once. If the same questionnaire was assessed at multiple time points, these scores will be computed separately (see Notes and Examples). The questionnaires (and the dataframe columns belonging to the questionnaires) are specified by quest_dict.

Note

If questionnaires were collected at different time points (e.g., pre and post), which should all be computed, then the dictionary keys need to have the following format: “<questionnaire_name>-<time_point>”.
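The key format described in this note can be pictured with a small sketch. The helper split_quest_key is hypothetical and not part of BioPsyKit; it assumes that the key is split at the first hyphen, which may differ from how compute_scores() parses keys internally.

```python
from typing import Optional, Tuple


def split_quest_key(key: str) -> Tuple[str, Optional[str]]:
    """Split '<questionnaire_name>-<time_point>' into its parts (hypothetical helper)."""
    # Assumption: the first hyphen separates questionnaire name and time point.
    name, sep, time_point = key.partition("-")
    # Keys without a hyphen describe a questionnaire assessed only once.
    return name, (time_point if sep else None)


for key in ["PSS", "PASA-pre", "PASA-post"]:
    print(key, "->", split_quest_key(key))
```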

Parameters

data (DataFrame) – dataframe containing questionnaire data
quest_dict (dict) – dictionary with questionnaire names to be computed (keys) and columns of the questionnaires (values)
quest_kwargs (dict, optional) – dictionary with optional arguments to be passed to questionnaire functions. The dictionary is expected to consist of questionnaire names (keys) and **kwargs dictionaries (values) with arguments per questionnaire. Default: None

Returns

dataframe with computed questionnaire scores

Return type

DataFrame

Examples

>>> from biopsykit.questionnaires.utils import compute_scores
>>> # data: dataframe that contains the questionnaire columns listed in quest_dict
>>> quest_dict = {
...     "PSS": ["PSS_{:02d}".format(i) for i in range(1, 11)],  # PSS: one time point
...     "PASA-pre": ["PASA_{:02d}_T0".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
...     "PASA-post": ["PASA_{:02d}_T1".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
... }
>>> compute_scores(data, quest_dict)

biopsykit.questionnaires.utils.crop_scale(data, score_range, set_nan=False, inplace=False)

Crop questionnaire scales, i.e., set values out of range to specific minimum and maximum values or to NaN.

Parameters

data (DataFrame or Series) – data to be cropped
score_range (list of int) – possible score range of the questionnaire items. Values out of score_range are cropped.
set_nan (bool, optional) – whether to set values out of range to NaN or to the values specified by score_range. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe (or series) with cropped scales or None if inplace is True

Return type

DataFrame, Series, or None
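The two cropping modes can be sketched in plain Python as follows. This is only an illustration of the semantics described above, not the library's implementation; the helper crop_values and the sample values are hypothetical.

```python
import math


def crop_values(values, score_range, set_nan=False):
    """Crop values to score_range, or replace out-of-range values with NaN."""
    lower, upper = score_range
    cropped = []
    for value in values:
        if lower <= value <= upper:
            cropped.append(value)  # in range: keep as-is
        elif set_nan:
            cropped.append(math.nan)  # out of range: mark as missing
        else:
            cropped.append(min(max(value, lower), upper))  # out of range: clamp
    return cropped


raw = [-1, 0, 3, 7]
print(crop_values(raw, score_range=[0, 4]))  # [0, 0, 3, 4]
print(crop_values(raw, score_range=[0, 4], set_nan=True))  # [nan, 0, 3, nan]
```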

biopsykit.questionnaires.utils.convert_scale(data, offset, cols=None, inplace=False)

Convert the score range of questionnaire items.

Parameters

data (DataFrame or Series) – questionnaire data to convert
offset (int) – offset to add to questionnaire items
cols (list of str or list of int, optional) – list of column names or column indices to convert, or None to convert all columns. Default: None
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with converted columns or None if inplace is True

Return type

DataFrame, Series, or None

Raises

ValidationError – if data is neither a dataframe nor a series

Examples

>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import convert_scale
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> # convert data from range [0, 4] to range [1, 5]
>>> data_out = convert_scale(data_in, offset=1)
>>> data_out["A"].tolist()
[2, 3, 4, 2]
>>> data_out["B"].tolist()
[5, 1, 2, 4]
>>> data_out["C"].tolist()
[1, 4, 3, 4]
>>> data_out["D"].tolist()
[1, 2, 3, 5]
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 2, 1, 3], "C": [3, 3, 2, 3], "D": [4, 1, 2, 4]})
>>> # convert data from range [1, 4] to range [0, 3]
>>> data_out = convert_scale(data_in, offset=-1)
>>> print(data_out)
>>> # convert only specific columns
>>> data_out = convert_scale(data_in, offset=-1, cols=["A", "C"])
>>> print(data_out)

biopsykit.questionnaires.utils.find_cols(data, regex_str=None, starts_with=None, ends_with=None, contains=None, zero_pad_numbers=True)

Find columns in dataframe that match a specific pattern.

This function is useful to find all columns that belong to a questionnaire. Column names can be filtered based on one (or a combination) of the following criteria:

starts_with: columns have to start with the specified string
ends_with: columns have to end with the specified string
contains: columns have to contain the specified string

Optionally, the item numbers in the matching column names can be zero-padded, if they are not already.
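How the starts_with, ends_with, and contains criteria combine can be sketched on a plain list of column names. The helper filter_names is hypothetical and only mirrors the description above (all given criteria must hold); it does not reproduce the regex handling or the zero-padding of find_cols().

```python
def filter_names(names, starts_with=None, ends_with=None, contains=None):
    """Keep the names that satisfy every criterion that is not None."""
    matches = []
    for name in names:
        if starts_with is not None and not name.startswith(starts_with):
            continue
        if ends_with is not None and not name.endswith(ends_with):
            continue
        if contains is not None and contains not in name:
            continue
        matches.append(name)
    return matches


names = ["XX_1_Pre", "XX_2_Pre", "XX_1_Post", "XX_2_Post", "YY_1_Pre", "YY_1_Post"]
print(filter_names(names, starts_with="XX", ends_with="Post"))  # ['XX_1_Post', 'XX_2_Post']
print(filter_names(names, contains="_1_"))  # ['XX_1_Pre', 'XX_1_Post', 'YY_1_Pre', 'YY_1_Post']
```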

Note

If zero_pad_numbers is True, the column names returned by this function will be renamed and might thus not match the column names of the original dataframe. To solve this, make sure your original dataframe already has zero-padded columns (by manually renaming them) or convert column names using zero_pad_columns().

Warning

Zero-padding using zero_pad_columns() assumes, by default, that numbers are at the end of column names. If you want to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.

Parameters

data (DataFrame) – dataframe with columns to be filtered
regex_str (str, optional) – regex string to extract column names. If this parameter is passed the other parameters (starts_with, ends_with, contains) will be ignored. Default: None
starts_with (str, optional) – string columns have to start with. Default: None
ends_with (str, optional) – string columns have to end with. Default: None
contains (str, optional) – string columns have to contain. Default: None
zero_pad_numbers (bool, optional) – whether to zero-pad numbers in column names. Default: True

Returns

data_filt (DataFrame) – dataframe with filtered columns that match the specified pattern
cols (Index) – columns that match the specified pattern

Return type

Tuple[pandas.core.frame.DataFrame, Sequence[str]]

Examples

>>> import biopsykit as bp
>>> import pandas as pd
>>> # Option 1: has to start with "XX"
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX")
>>> print(cols)
["XX_01", "XX_02", ..., "XX_10"]
>>> # Option 2: has to end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, ends_with="Post")
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 3: has to start with "XX" and end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post",
...                              "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", ends_with="Post")
>>> # WARNING: this will not zero-pad the questionnaire numbers!
>>> print(cols)
["XX_1_Post", "XX_2_Post", "XX_3_Post"]
>>> # Option 4: pass custom regex string
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post",
...                              "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, regex_str=r"XX_\d+_\w+")
>>> # here, zero-padding will be possible again
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 5: disable zero-padding
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", zero_pad_numbers=False)
>>> print(cols)
["XX_1", "XX_2", ..., "XX_10"]

biopsykit.questionnaires.utils.zero_pad_columns(data, inplace=False)

Add zero-padding to numbers at the end of column names in a dataframe.

Warning

By default, this function assumes that numbers are at the end of column names. If you need to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.
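The trailing-number assumption in this warning can be illustrated with a short regex sketch. The helper zero_pad_trailing_number and the padding width of 2 are assumptions made for illustration; they are not taken from the BioPsyKit implementation.

```python
import re


def zero_pad_trailing_number(name, width=2):
    """Zero-pad the digits at the very end of a column name (hypothetical helper)."""
    # The "$" anchor restricts the match to digits that end the name.
    return re.sub(r"(\d+)$", lambda m: m.group(1).zfill(width), name)


for name in ["XX_1", "XX_10", "XX_1_Post"]:
    print(name, "->", zero_pad_trailing_number(name))
# XX_1 -> XX_01
# XX_10 -> XX_10
# XX_1_Post -> XX_1_Post
```

The last name is left unchanged because its number is followed by a string suffix, which is the case where manual zero-padding is needed.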

Parameters

data (DataFrame) – dataframe with columns to zero-pad
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with zero-padded columns or None if inplace is True

Return type

DataFrame or None

biopsykit.questionnaires.utils.invert(data, score_range, cols=None, inplace=False)

Invert questionnaire scores.

In many questionnaires some items need to be inverted (reversed) before sum scores can be computed. This function can be used to either invert a single column (Series), selected columns in a dataframe (by specifying columns in the cols parameter), or a complete dataframe.
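The arithmetic behind inverting an item can be sketched as "maximum plus minimum minus value". The helper invert_values is hypothetical and written in plain Python; it does not perform the validation or range checks listed under Raises.

```python
def invert_values(values, score_range):
    """Reverse item scores within score_range = [minimum, maximum]."""
    lower, upper = score_range
    # The lowest score maps to the highest one and vice versa.
    return [upper + lower - value for value in values]


print(invert_values([1, 2, 3, 1], score_range=[0, 4]))  # [3, 2, 1, 3]
print(invert_values([1, 2, 3, 4, 5], score_range=[1, 5]))  # [5, 4, 3, 2, 1]
```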

Parameters

data (DataFrame or Series) – questionnaire data to invert
score_range (list of int) – possible score range of the questionnaire items
cols (list of str or list of int, optional) – list of column names or column indices to invert, or None to invert all columns. Default: None
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with inverted columns or None if inplace is True

Return type

DataFrame or None

Raises

ValidationError – if data is neither a dataframe nor a series, or if score_range does not have length 2
ValueRangeError – if values in data are not in score_range

Examples

>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import invert
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> data_out = invert(data_in, score_range=[0, 4])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[0, 4, 3, 1]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[4, 3, 2, 0]
>>> # Other score range
>>> data_out = invert(data_in, score_range=[0, 5])
>>> data_out["A"].tolist()
[4, 3, 2, 4]
>>> data_out["B"].tolist()
[1, 5, 4, 2]
>>> data_out["C"].tolist()
[5, 2, 3, 2]
>>> data_out["D"].tolist()
[5, 4, 3, 1]
>>> # Invert only specific columns
>>> data_out = invert(data_in, score_range=[0, 4], cols=["A", "C"])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[4, 0, 1, 3]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[0, 1, 2, 4]

biopsykit.questionnaires.utils.to_idx(col_idxs)

Convert questionnaire item indices into array indices.

In questionnaires, item indices start at 1. To avoid confusion in the implementation of questionnaires (because array indices start at 0), all questionnaire indices in BioPsyKit also start at 1 and are converted to 0-based indexing using this function.

Parameters: col_idxs (list of int) – list of indices to convert to 0-based indexing
Returns: array with converted indices
Return type: ndarray
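The conversion amounts to subtracting 1 from every item number. The sketch below uses a plain list and a hypothetical helper to_zero_based, whereas to_idx() itself returns an ndarray; the item names are made up for illustration.

```python
def to_zero_based(item_numbers):
    """Convert 1-based questionnaire item numbers to 0-based positions."""
    return [number - 1 for number in item_numbers]


# Made-up item columns of a questionnaire, in questionnaire order.
items = ["item_1", "item_2", "item_3", "item_4"]

# Items 1 and 3 (1-based, as printed in a questionnaire manual) form a subscale.
positions = to_zero_based([1, 3])
print(positions)  # [0, 2]
print([items[p] for p in positions])  # ['item_1', 'item_3']
```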

biopsykit.questionnaires.utils.wide_to_long(data, quest_name, levels)

Convert a dataframe from wide-format into long-format.

Warning

This function is deprecated and will be removed in the future! Please use biopsykit.utils.dataframe_handling.wide_to_long() instead.

Parameters

data (DataFrame) – pandas DataFrame containing questionnaire data in wide-format, i.e., one column per questionnaire item (or time point), one row per subject.
quest_name (str) – questionnaire name, i.e., common name for each column to be converted into long-format.
levels (str or list of str) – index levels of the resulting long-format dataframe.

Returns

pandas DataFrame in long-format

Return type

DataFrame