biopsykit.questionnaires.utils module

Module containing utility functions for manipulating and processing questionnaire data.

biopsykit.questionnaires.utils.bin_scale(data, bins, cols=None, first_min=True, last_max=False, inplace=False, **kwargs)[source]

Bin questionnaire scales.

Questionnaire scales are binned using pandas.cut() according to the bins specified by bins.

Parameters
  • data (DataFrame or Series) – data with scales to be binned

  • bins (int or list of float or IntervalIndex) –

    The criteria to bin by. bins can have one of the following types:

    • int : Defines the number of equal-width bins in the range of data. The range of data is extended by 0.1% on each side to include the minimum and maximum values of data.

    • sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of data is done.

    • IntervalIndex : Defines the exact bins to be used. Note that the IntervalIndex for bins must be non-overlapping.

  • cols (list of str or list of int, optional) – column name/index (or list of such) to be binned or None to use all columns (or if data is a series). Default: None

  • first_min (bool, optional) – whether the minimum value should be added as the leftmost edge of the first bin or not. Only considered if bins is a list. Default: True

  • last_max (bool, optional) – whether the maximum value should be added as the rightmost edge of the last bin or not. Only considered if bins is a list. Default: False

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

  • **kwargs – additional parameters that are passed to pandas.cut()

Returns

dataframe (or series) with binned scales or None if inplace is True

Return type

DataFrame, Series, or None

See also

pandas.cut()

Pandas method to bin values into discrete intervals.
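The three accepted types of bins can be tried out with pandas.cut() directly. The following is a minimal sketch that does not call bin_scale() itself; the scores and bin edges are made-up values chosen only for illustration.

```python
import pandas as pd

# Hypothetical questionnaire scores (illustrative values only)
scores = pd.Series([2, 5, 9, 14, 20])

# int: three equal-width bins spanning the range of the data
by_count = pd.cut(scores, bins=3)

# sequence of scalars: explicit, possibly non-uniform bin edges
by_edges = pd.cut(scores, bins=[0, 5, 10, 20])

# IntervalIndex: exact, non-overlapping intervals
intervals = pd.IntervalIndex.from_breaks([0, 5, 10, 20])
by_intervals = pd.cut(scores, bins=intervals)

# labels=False returns integer bin codes instead of Interval objects
codes = pd.cut(scores, bins=[0, 5, 10, 20], labels=False)
print(codes.tolist())  # [0, 0, 1, 2, 2]
```

Options such as labels=False are pandas.cut() arguments; according to the parameter list above, they would be forwarded through **kwargs.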

biopsykit.questionnaires.utils.compute_scores(data, quest_dict, quest_kwargs=None)[source]

Compute questionnaire scores from dataframe.

This function can be used if multiple questionnaires from a dataframe should be computed at once. If the same questionnaire was assessed at multiple time points, these scores will be computed separately (see Notes and Examples). The questionnaires (and the dataframe columns belonging to the questionnaires) are specified by quest_dict.

Note

If questionnaires were collected at different time points (e.g., pre and post), which should all be computed, then the dictionary keys need to have the following format: “<questionnaire_name>-<time_point>”.
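As an illustration of this key format, the sketch below splits keys into questionnaire name and time point. The helper split_quest_key() is hypothetical and not part of BioPsyKit; how the library parses keys internally is not specified here.

```python
def split_quest_key(key):
    """Split a '<questionnaire_name>-<time_point>' key into its two parts.

    Hypothetical helper for illustration; not part of BioPsyKit.
    Keys without a '-' are treated as having no time point.
    """
    name, sep, time_point = key.partition("-")
    return name, (time_point if sep else None)


quest_dict = {
    "PSS": ["PSS_01", "PSS_02"],
    "PASA-pre": ["PASA_01_T0", "PASA_02_T0"],
    "PASA-post": ["PASA_01_T1", "PASA_02_T1"],
}
for key in quest_dict:
    print(key, "->", split_quest_key(key))
```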

Parameters
  • data (DataFrame) – dataframe containing questionnaire data

  • quest_dict (dict) – dictionary with questionnaire names to be computed (keys) and columns of the questionnaires (values)

  • quest_kwargs (dict) – dictionary with optional arguments to be passed to questionnaire functions. The dictionary is expected to consist of questionnaire names (keys) and **kwargs dictionaries (values) with arguments per questionnaire

Returns

dataframe with computed questionnaire scores

Return type

DataFrame

Examples

>>> from biopsykit.questionnaires.utils import compute_scores
>>> # data: dataframe that contains the questionnaire columns listed below
>>> quest_dict = {
...     "PSS": ["PSS_{:02d}".format(i) for i in range(1, 11)],  # PSS: one time point
...     "PASA-pre": ["PASA_{:02d}_T0".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
...     "PASA-post": ["PASA_{:02d}_T1".format(i) for i in range(1, 17)],  # PASA: two time points (pre and post)
... }
>>> compute_scores(data, quest_dict)

biopsykit.questionnaires.utils.crop_scale(data, score_range, set_nan=False, inplace=False)[source]

Crop questionnaire scales, i.e., set values out of range to specific minimum and maximum values or to NaN.

Parameters
  • data (DataFrame or Series) – data to be cropped

  • score_range (list of int) – possible score range of the questionnaire items. Values out of score_range are cropped.

  • set_nan (bool, optional) – whether to set values out of range to NaN or to the values specified by score_range. Default: False

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe (or series) with cropped scales or None if inplace is True

Return type

DataFrame, Series, or None
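The two cropping modes can be written out in plain Python. This is a minimal sketch under the assumption that "cropping" means clipping to the bounds of score_range; crop_values() is a hypothetical stand-in that works on lists, not BioPsyKit's implementation.

```python
import math


def crop_values(values, score_range, set_nan=False):
    """Crop values to [min, max]; hypothetical stand-in for illustration."""
    lower, upper = score_range
    cropped = []
    for value in values:
        if lower <= value <= upper:
            cropped.append(value)  # in range: keep as is
        elif set_nan:
            cropped.append(math.nan)  # out of range: mark as missing
        else:
            cropped.append(min(max(value, lower), upper))  # clip to the bounds
    return cropped


items = [-1, 0, 3, 4, 7]
print(crop_values(items, score_range=[0, 4]))  # [0, 0, 3, 4, 4]
print(crop_values(items, score_range=[0, 4], set_nan=True))  # [nan, 0, 3, 4, nan]
```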

biopsykit.questionnaires.utils.convert_scale(data, offset, cols=None, inplace=False)[source]

Convert the score range of questionnaire items.

Parameters
  • data (DataFrame or Series) – questionnaire data to convert

  • offset (int) – offset to add to questionnaire items

  • cols (list of str or list of int, optional) – list of column names or column indices to convert, or None to convert all columns. Default: None

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with converted columns or None if inplace is True

Return type

DataFrame, Series, or None

Raises

ValidationError – if data is neither a dataframe nor a series

Examples

>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import convert_scale
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> # convert data from range [0, 4] to range [1, 5]
>>> data_out = convert_scale(data_in, offset=1)
>>> data_out["A"].tolist()
[2, 3, 4, 2]
>>> data_out["B"].tolist()
[5, 1, 2, 4]
>>> data_out["C"].tolist()
[1, 4, 3, 4]
>>> data_out["D"].tolist()
[1, 2, 3, 5]
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 2, 1, 3], "C": [3, 3, 2, 3], "D": [4, 1, 2, 4]})
>>> # convert data from range [1, 4] to range [0, 3]
>>> data_out = convert_scale(data_in, offset=-1)
>>> print(data_out)
>>> # convert only specific columns
>>> data_out = convert_scale(data_in, offset=-1, cols=["A", "C"])
>>> print(data_out)

biopsykit.questionnaires.utils.find_cols(data, regex_str=None, starts_with=None, ends_with=None, contains=None, zero_pad_numbers=True)[source]

Find columns in dataframe that match a specific pattern.

This function is useful to find all columns that belong to a questionnaire. Column names can be filtered based on one (or a combination) of the following criteria:

  • starts_with: columns have to start with the specified string

  • ends_with: columns have to end with the specified string

  • contains: columns have to contain the specified string

Optionally, the item numbers in the matching column names can be zero-padded, if they are not already.
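The filter criteria above can be pictured as anchored regular expressions. The sketch below is a hypothetical helper on plain lists of column names, not BioPsyKit's code; treating the strings as literals (via re.escape) is an assumption, and zero-padding is left out.

```python
import re


def match_columns(columns, starts_with=None, ends_with=None, contains=None):
    """Filter column names by prefix, suffix, and substring (all must match).

    Hypothetical helper mirroring the criteria above; not BioPsyKit's code.
    """
    patterns = []
    if starts_with is not None:
        patterns.append("^" + re.escape(starts_with))  # anchor at the start
    if ends_with is not None:
        patterns.append(re.escape(ends_with) + "$")  # anchor at the end
    if contains is not None:
        patterns.append(re.escape(contains))  # match anywhere
    return [c for c in columns if all(re.search(p, c) for p in patterns)]


columns = ["XX_1_Pre", "XX_1_Post", "YY_1_Pre", "YY_1_Post"]
print(match_columns(columns, starts_with="XX", ends_with="Post"))  # ['XX_1_Post']
```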

Note

If zero_pad_numbers is True then the column names returned by this function will be renamed and might thus not match the column names of the original dataframe. To solve this, make sure your original dataframe already has zero-padded columns (by manually renaming them) or convert column names using zero_pad_columns().

Warning

Zero-padding using zero_pad_columns() assumes, by default, that numbers are at the end of column names. If you want to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.

Parameters
  • data (DataFrame) – dataframe with columns to be filtered

  • regex_str (str, optional) – regex string to extract column names. If this parameter is passed the other parameters (starts_with, ends_with, contains) will be ignored. Default: None

  • starts_with (str, optional) – string columns have to start with. Default: None

  • ends_with (str, optional) – string columns have to end with. Default: None

  • contains (str, optional) – string columns have to contain. Default: None

  • zero_pad_numbers (bool, optional) – whether to zero-pad numbers in column names. Default: True

Returns

  • data_filt (DataFrame) – dataframe with filtered columns that match the specified pattern

  • cols (Index) – columns that match the specified pattern

Return type

Tuple[pandas.core.frame.DataFrame, Sequence[str]]

Examples

>>> import biopsykit as bp
>>> import pandas as pd
>>> # Option 1: has to start with "XX"
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX")
>>> print(cols)
["XX_01", "XX_02", ..., "XX_10"]
>>> # Option 2: has to end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, ends_with="Post")
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 3: has to start with "XX" and end with "Post"
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post",
...                              "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", ends_with="Post")
>>> # WARNING: this will not zero-pad the questionnaire numbers!
>>> print(cols)
["XX_1_Post", "XX_2_Post", "XX_3_Post"]
>>> # Option 4: pass custom regex string
>>> data = pd.DataFrame(columns=["XX_1_Pre", "XX_2_Pre", "XX_3_Pre", "XX_1_Post", "XX_2_Post", "XX_3_Post",
...                              "YY_1_Pre", "YY_2_Pre", "YY_1_Post", "YY_2_Post"])
>>> df, cols = bp.questionnaires.utils.find_cols(data, regex_str=r"XX_\d+_Post")
>>> # here, zero-padding will be possible again
>>> print(cols)
["XX_01_Post", "XX_02_Post", "XX_03_Post"]
>>> # Option 5: disable zero-padding
>>> data = pd.DataFrame(columns=["XX_{}".format(i) for i in range(1, 11)])
>>> df, cols = bp.questionnaires.utils.find_cols(data, starts_with="XX", zero_pad_numbers=False)
>>> print(cols)
["XX_1", "XX_2", ..., "XX_10"]

biopsykit.questionnaires.utils.zero_pad_columns(data, inplace=False)[source]

Add zero-padding to numbers at the end of column names in a dataframe.

Warning

By default, this function assumes that numbers are at the end of column names. If you need to change that behavior (e.g., because the column names have string suffixes), you might need to apply zero-padding manually.

Parameters
  • data (DataFrame) – dataframe with columns to zero-pad

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with zero-padded columns or None if inplace is True

Return type

DataFrame or None
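The trailing-number assumption from the warning above can be illustrated with a small regex sketch. zero_pad_name() is a hypothetical helper, not BioPsyKit's implementation, and the padding width of 2 is an assumption taken from names such as "XX_01" and "XX_10" in the find_cols() examples.

```python
import re


def zero_pad_name(name, width=2):
    """Zero-pad a number at the very end of a column name.

    Hypothetical helper; the width of 2 is an assumption for illustration.
    """
    # (\d+)$ only matches digits that end the name
    return re.sub(r"(\d+)$", lambda m: m.group(1).zfill(width), name)


for name in ["XX_1", "XX_10", "XX_1_Pre"]:
    print(name, "->", zero_pad_name(name))
```

In this sketch "XX_1_Pre" is left unchanged because its number is followed by a string suffix, which is the situation in which manual zero-padding may be needed.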

biopsykit.questionnaires.utils.invert(data, score_range, cols=None, inplace=False)[source]

Invert questionnaire scores.

In many questionnaires some items need to be inverted (reversed) before sum scores can be computed. This function can be used to either invert a single column (Series), selected columns in a dataframe (by specifying columns in the cols parameter), or a complete dataframe.

Parameters
  • data (DataFrame or Series) – questionnaire data to invert

  • score_range (list of int) – possible score range of the questionnaire items

  • cols (list of str or list of int, optional) – list of column names or column indices to invert, or None to invert all columns. Default: None

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with inverted columns or None if inplace is True

Return type

DataFrame or None

Raises
  • ValidationError – if data is neither a dataframe nor a series, or if score_range does not have length 2

  • ValueRangeError – if values in data are not in score_range
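Inverting an item within a score range [min, max] amounts to computing min + max - value. The sketch below shows this on plain lists; invert_scores() is a hypothetical helper for illustration (it raises a plain ValueError rather than the library's ValueRangeError) and is not BioPsyKit's implementation.

```python
def invert_scores(values, score_range):
    """Reverse scores within [min, max]: new = min + max - old.

    Hypothetical helper for illustration; not BioPsyKit's implementation.
    """
    lower, upper = score_range
    if any(not lower <= v <= upper for v in values):
        raise ValueError("values outside score_range")
    return [lower + upper - v for v in values]


print(invert_scores([1, 2, 3, 1], score_range=[0, 4]))  # [3, 2, 1, 3]
print(invert_scores([4, 0, 1, 3], score_range=[0, 5]))  # [1, 5, 4, 2]
```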

Examples

>>> import pandas as pd
>>> from biopsykit.questionnaires.utils import invert
>>> data_in = pd.DataFrame({"A": [1, 2, 3, 1], "B": [4, 0, 1, 3], "C": [0, 3, 2, 3], "D": [0, 1, 2, 4]})
>>> data_out = invert(data_in, score_range=[0, 4])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[0, 4, 3, 1]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[4, 3, 2, 0]
>>> # Other score range
>>> data_out = invert(data_in, score_range=[0, 5])
>>> data_out["A"].tolist()
[4, 3, 2, 4]
>>> data_out["B"].tolist()
[1, 5, 4, 2]
>>> data_out["C"].tolist()
[5, 2, 3, 2]
>>> data_out["D"].tolist()
[5, 4, 3, 1]
>>> # Invert only specific columns
>>> data_out = invert(data_in, score_range=[0, 4], cols=["A", "C"])
>>> data_out["A"].tolist()
[3, 2, 1, 3]
>>> data_out["B"].tolist()
[4, 0, 1, 3]
>>> data_out["C"].tolist()
[4, 1, 2, 1]
>>> data_out["D"].tolist()
[0, 1, 2, 4]

biopsykit.questionnaires.utils.to_idx(col_idxs)[source]

Convert questionnaire item indices into array indices.

In questionnaires, item indices start at 1. To avoid confusion in the implementation of questionnaires (because array indices start at 0), all questionnaire indices in BioPsyKit also start at 1 and are converted to 0-based indexing using this function.

Parameters

col_idxs (list of int) – list of indices to convert to 0-based indexing

Returns

array with converted indices

Return type

ndarray
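The conversion itself is a subtraction of 1 from every index. A minimal plain-Python sketch follows; to_zero_based() is a hypothetical helper that returns a list, whereas to_idx() is documented above to return an ndarray.

```python
def to_zero_based(col_idxs):
    """Convert 1-based questionnaire item numbers to 0-based positions.

    Hypothetical helper for illustration; returns a list, not an ndarray.
    """
    return [idx - 1 for idx in col_idxs]


items = ["item_1", "item_2", "item_3", "item_4"]
positions = to_zero_based([1, 3])
print([items[p] for p in positions])  # ['item_1', 'item_3']
```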

biopsykit.questionnaires.utils.wide_to_long(data, quest_name, levels)[source]

Convert a dataframe from wide format into long format.

Warning

This function is deprecated and will be removed in the future! Please use biopsykit.utils.dataframe_handling.wide_to_long() instead.

Parameters
  • data (DataFrame) – pandas DataFrame containing questionnaire data in wide format, i.e., one column per measurement and one row per subject.

  • quest_name (str) – questionnaire name, i.e., common name for each column to be converted into long-format.

  • levels (str or list of str) – index levels of the resulting long-format dataframe.

Returns

pandas DataFrame in long-format

Return type

DataFrame

See also

wide_to_long()

convert dataframe from wide to long format

biopsykit.questionnaires.utils.get_supported_questionnaires()[source]

List all supported (i.e., implemented) questionnaires.

Returns

dictionary with questionnaire names (keys) and descriptions (values)

Return type

dict