biopsykit.utils.dataframe_handling module

Module providing various functions for advanced handling of pandas dataframes.

biopsykit.utils.dataframe_handling.apply_codebook(data, codebook)

Apply codebook to convert numerical to categorical values.

The codebook is expected to be a dataframe in a standardized format (see CodebookDataFrame for further information).

Parameters

data (DataFrame) – data to apply codebook on
codebook (CodebookDataFrame) – path to codebook or dataframe to be used as codebook

Returns

data with numerical values converted to categorical values

Return type

DataFrame

See also

load_codebook(): load Codebook

Examples

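>>> import pandas as pd
>>> from biopsykit.utils.dataframe_handling import apply_codebook
>>> codebook = pd.DataFrame(
...     {
...         0: [None, None, "Morning"],
...         1: ["Male", "No", "Intermediate"],
...         2: ["Female", "Not very often", "Evening"],
...         3: [None, "Often", None],
...         4: [None, "Very often", None]
...     },
...     index=pd.Index(["gender", "smoking", "chronotype"], name="variable")
... )
>>> # illustrative input (not part of the original example): numerical codes per variable
>>> data = pd.DataFrame({"gender": [1, 2], "smoking": [1, 4], "chronotype": [0, 2]})
>>> data_converted = apply_codebook(data, codebook)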
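
Conceptually, applying a codebook amounts to mapping each numerical code to its label, variable by variable. The following is a minimal pandas-only sketch of that idea, not BioPsyKit's implementation; the example data, the column names, and the loop itself are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical codebook: one row per variable, one column per numerical code.
codebook = pd.DataFrame(
    {1: ["Male", "No"], 2: ["Female", "Often"]},
    index=pd.Index(["gender", "smoking"], name="variable"),
)
# Hypothetical data: "age" has no codebook entry and must stay untouched.
data = pd.DataFrame({"gender": [1, 2, 1], "smoking": [2, 1, 2], "age": [30, 25, 41]})

converted = data.copy()
# Only convert columns that are described in the codebook.
for variable in codebook.index.intersection(data.columns):
    # Drop codes without a label, then map code -> label.
    mapping = codebook.loc[variable].dropna().to_dict()
    converted[variable] = data[variable].map(mapping)

print(converted)
```

Columns without a codebook entry are passed through unchanged in this sketch; whether the library behaves the same way is not stated in this reference.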

biopsykit.utils.dataframe_handling.add_space_to_camel(name)

Add space to string in “camelCase”.

Parameters: name (str) – string to transform
Returns: string with space added
Return type: str

Examples

>>> from biopsykit.utils.dataframe_handling import add_space_to_camel
>>> add_space_to_camel("HelloWorld")
Hello World
>>> add_space_to_camel("ABC")
ABC

biopsykit.utils.dataframe_handling.camel_to_snake(name, lower=True)

Convert string in “camelCase” to “snake_case”.

Note

If all letters in name are capital letters, the string is not converted into snake_case because it is assumed to be an abbreviation.

Parameters

name (str) – string to convert from camelCase to snake_case
lower (bool, optional) – True to convert all capital letters to lower case (“actual” snake_case), False to keep capital letters, if present. Default: True

Returns

string converted into snake_case

Return type

str

Examples

>>> from biopsykit.utils.dataframe_handling import camel_to_snake
>>> camel_to_snake("HelloWorld")
hello_world
>>> camel_to_snake("HelloWorld", lower=False)
Hello_World
>>> camel_to_snake("ABC")
ABC
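
The behavior shown in these examples, including the abbreviation rule from the note, can be approximated with a short regular-expression sketch. The function name camel_to_snake_sketch and the regex are assumptions made for illustration; this is not the library's implementation.

```python
import re


def camel_to_snake_sketch(name: str, lower: bool = True) -> str:
    """Hypothetical stand-in that mimics the documented examples."""
    # All-capital names are treated as abbreviations and returned unchanged.
    if name.isupper():
        return name
    # Insert an underscore at every lowercase/digit -> uppercase boundary.
    snake = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return snake.lower() if lower else snake


print(camel_to_snake_sketch("HelloWorld"))
print(camel_to_snake_sketch("HelloWorld", lower=False))
print(camel_to_snake_sketch("ABC"))
```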

biopsykit.utils.dataframe_handling.snake_to_camel(name)

Convert string in “snake_case” to “camelCase”.

Parameters: name (str) – string to convert from snake_case to camelCase
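
No example is given for this function, so the following is only a sketch of the reverse conversion. It assumes that every underscore-separated part is capitalized (giving "HelloWorld", the spelling used in the other examples of this module); the name snake_to_camel_sketch is hypothetical and the library's actual output may differ.

```python
def snake_to_camel_sketch(name: str) -> str:
    """Hypothetical stand-in: capitalize each part and join without separators."""
    # Assumption: the first part is capitalized as well (upper camel case).
    return "".join(part.capitalize() for part in name.split("_"))


print(snake_to_camel_sketch("hello_world"))
```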

biopsykit.utils.dataframe_handling.convert_nan(data, inplace=False)

Convert missing values to NaN.

Data exported from programs like SPSS often uses negative integers to encode missing values because these negative numbers are “unrealistic” values. Use this function to convert these negative numbers to “actual” missing values: not-a-number (NaN).

Values that will be replaced with NaN are -66, -77, -99 (integer and string representations).

Parameters

data (DataFrame or Series) – input data
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with converted missing values or None if inplace is True

Return type

DataFrame or None
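
The replacement described above can be sketched with plain pandas. The list MISSING_CODES, the function name convert_nan_sketch, and the example data are assumptions made for illustration; the sketch only covers the non-inplace case.

```python
import numpy as np
import pandas as pd

# Codes listed in the description, as integers and as strings.
MISSING_CODES = [-66, -77, -99, "-66", "-77", "-99"]


def convert_nan_sketch(data):
    """Hypothetical stand-in: return a copy with missing-value codes set to NaN."""
    return data.replace(MISSING_CODES, np.nan)


# Hypothetical questionnaire export with one numeric and one string column.
df = pd.DataFrame({"age": [23, -99, 31], "stress": ["-77", "4", "2"]})
cleaned = convert_nan_sketch(df)
print(cleaned)
```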

biopsykit.utils.dataframe_handling.int_from_str_idx(data, idx_levels, regex, func=None)

Extract integers from strings in index levels and set them as new index values.

Parameters

data (DataFrame) – data with index to extract information from
idx_levels (str or list of str) – name of index level or list of index level names
regex (str or list of str) – regex string or list of regex strings to extract integers from strings
func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None

Returns

dataframe with new index values

Return type

DataFrame
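
As a rough illustration of the idea, the sketch below converts one index level from strings such as "Vp01" to integers using pandas string methods. It assumes a regex with exactly one capturing group and named index levels; the helper int_index_sketch and the example data are hypothetical and not part of the library.

```python
import pandas as pd


def int_index_sketch(data, level, regex):
    """Hypothetical stand-in: replace one index level by extracted integers."""
    names = list(data.index.names)
    out = data.reset_index()
    # The single capturing group of `regex` selects the digits to keep.
    out[level] = out[level].str.extract(regex, expand=False).astype(int)
    return out.set_index(names)


index = pd.MultiIndex.from_tuples(
    [("Vp01", "pre"), ("Vp02", "post")], names=["subject", "time"]
)
df = pd.DataFrame({"value": [0.1, 0.2]}, index=index)
result = int_index_sketch(df, "subject", r"Vp(\d+)")
print(result.index.tolist())
```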

biopsykit.utils.dataframe_handling.int_from_str_col(data, col_name, regex, func=None)

Extract integers from strings in a column of a dataframe and return them.

Parameters

data (DataFrame) – data with the column to extract information from
col_name (str) – name of column with string values to extract
regex (str) – regex string used to extract integers from string values
func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None

Returns

series object with extracted integer values

Return type

Series
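
A comparable result can be obtained with pandas string methods, as sketched below. The function name int_from_str_col_sketch, the example column, and the requirement that the regex contains exactly one capturing group are assumptions of this sketch, not documented behavior.

```python
import pandas as pd


def int_from_str_col_sketch(data, col_name, regex, func=None):
    """Hypothetical stand-in: extract integers from a string column."""
    # The single capturing group of `regex` selects the digits to keep.
    values = data[col_name].str.extract(regex, expand=False).astype(int)
    if func is not None:
        # Optional post-processing, e.g. shifting all values by one.
        values = values.apply(func)
    return values


df = pd.DataFrame({"sample": ["T1", "T2", "T10"]})
numbers = int_from_str_col_sketch(df, "sample", r"T(\d+)")
shifted = int_from_str_col_sketch(df, "sample", r"T(\d+)", func=lambda x: x - 1)
print(numbers.tolist(), shifted.tolist())
```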

biopsykit.utils.dataframe_handling.multi_xs(data, keys, level, drop_level=True)

Return cross-section of multiple keys from the dataframe.

This function internally calls the pandas.DataFrame.xs() method, but accepts a list of keys and returns the cross-sections for all of them at once, whereas xs() itself accepts only a single key.

Parameters

data (DataFrame or Series) – input data to get cross-section from
keys (str or list of str) – label(s) contained in the index, or partially in a MultiIndex
level (str, int, or list of such) – in case of keys partially contained in a MultiIndex, indicate which index levels are used. Levels can be referred by label or position.
drop_level (bool, optional) – if False, returns object with same levels as self. Default: True

Returns

cross-section from the original dataframe or series

Return type

DataFrame or Series
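
One way to picture this is to call xs() once per key and concatenate the results. The sketch below does exactly that; multi_xs_sketch and the example data are hypothetical, and the sketch always keeps the selected level so the keys stay distinguishable, which may differ from the library's handling of drop_level.

```python
import pandas as pd


def multi_xs_sketch(data, keys, level):
    """Hypothetical stand-in: cross-section for several keys of one level."""
    if isinstance(keys, str):
        keys = [keys]
    # One xs() call per key; drop_level=False keeps the key in the index.
    return pd.concat([data.xs(key, level=level, drop_level=False) for key in keys])


index = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["pre", "post"]], names=["condition", "time"]
)
df = pd.DataFrame({"value": range(6)}, index=index)
selected = multi_xs_sketch(df, ["A", "C"], level="condition")
print(selected)
```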

biopsykit.utils.dataframe_handling.replace_missing_data(data, target_col, source_col, dropna=False, inplace=False)

Replace missing data in one column by data from another column.

Parameters

data (DataFrame) – input data with values to replace
target_col (str) – target column, i.e., column in which missing values should be replaced
source_col (str) – source column, i.e., column values used to replace missing values in target_col
dropna (bool, optional) – whether to drop rows with missing values in target_col or not. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with replaced missing values or None if inplace is True

Return type

DataFrame or None
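
In plain pandas, the core of this operation corresponds to filling missing values of one column from another. The column names and values below are made up for illustration, and the sketch covers only the default case (no dropna, not inplace).

```python
import pandas as pd

# Hypothetical data: self-reported weight (target) and lab measurement (source).
df = pd.DataFrame(
    {"weight_self": [70.0, None, 82.0], "weight_lab": [71.0, 65.0, None]}
)

filled = df.copy()
# Missing entries in the target column are taken from the source column.
filled["weight_self"] = filled["weight_self"].fillna(filled["weight_lab"])
print(filled)
```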

biopsykit.utils.dataframe_handling.stack_groups_percent(data, hue, stacked, order=None)

Create dataframe with stacked groups.

To create a stacked bar chart, i.e. a plot with different bar charts along a categorical axis, where the variables of each bar chart are stacked along the value axis, the data needs to be rearranged and normalized in percent.

The columns of the resulting dataframe will be the categorical values specified by hue, the index items will be the variables specified by stacked.

Parameters

data (DataFrame) – data to compute stacked group in percent
hue (str) – column name of grouping categorical variable. This typically corresponds to the x axis in a stacked bar chart.
stacked (str) – column name of variable that is stacked along the y axis
order (str, optional) – order of categorical variable specified by hue. Default: None

Returns

dataframe in a format that can be used to create a stacked bar chart

Return type

DataFrame

See also

stacked_barchart(): function to create a stacked bar chart
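
The described layout (hue values as columns, stacked values as index, entries in percent per hue group) can be reproduced with a normalized cross-tabulation. This is a sketch under assumed column names and data, not the function's implementation, and it ignores the order parameter.

```python
import pandas as pd

# Hypothetical data: "condition" plays the role of hue, "smoking" of stacked.
df = pd.DataFrame(
    {
        "condition": ["A", "A", "A", "B", "B", "B", "B"],
        "smoking": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    }
)

# Normalizing per column makes the percentages of each hue group sum to 100.
percent = pd.crosstab(df["smoking"], df["condition"], normalize="columns") * 100
print(percent)
```

A frame in this shape can be transposed and passed to a stacked bar plot, with one bar per hue value.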

biopsykit.utils.dataframe_handling.wide_to_long(data, stubname, levels, sep='_')

Convert a dataframe from wide-format into long-format.

In the wide-format dataframe, the index levels to be converted into long-format are expected to be encoded in the column names and separated by sep. If multiple levels should be converted into long-format, e.g., for a questionnaire with subscales (level subscale) that was assessed pre and post (level time), then the different levels are all encoded into the string. The level order is specified by levels.

Parameters

data (DataFrame) – pandas DataFrame in wide-format, i.e., one column per variable (e.g., per questionnaire item or saliva sample), one row per subject
stubname (str) – common name for each column to be converted into long-format. Usually, this is either the name of the questionnaire (e.g., “PSS”) or the saliva type (e.g., “cortisol”).
levels (str or list of str) – index levels of the resulting long-format dataframe.
sep (str, optional) – character separating index levels in the column names of the wide-format dataframe. Default: _

Returns

pandas DataFrame in long-format

Return type

DataFrame

Examples

>>> import pandas as pd
>>> from biopsykit.utils.dataframe_handling import wide_to_long
>>> data = pd.DataFrame(
...     columns=[
...         "MDBF_GoodBad_pre", "MDBF_AwakeTired_pre", "MDBF_CalmNervous_pre",
...         "MDBF_GoodBad_post", "MDBF_AwakeTired_post", "MDBF_CalmNervous_post"
...     ],
...     index=pd.Index(range(0, 5), name="subject")
... )
>>> data_long = wide_to_long(data, stubname="MDBF", levels=["subscale", "time"], sep="_")
>>> print(data_long.index.names)
['subject', 'subscale', 'time']
>>> print(data_long.index)
MultiIndex([(0,  'AwakeTired', 'post'),
            (0,  'AwakeTired',  'pre'),
            (0, 'CalmNervous', 'post'),
            (0, 'CalmNervous',  'pre'),
            (0,     'GoodBad', 'post'),
            (0,     'GoodBad',  'pre'),
            (1,  'AwakeTired', 'post'),
        ...
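
To make the column-name encoding concrete, the sketch below performs a similar reshaping with plain pandas: the column names are split at the separator, the stub is dropped, and the remaining parts become index levels. The data values are made up, and this is an illustration of the naming scheme rather than the library's implementation.

```python
import pandas as pd

# Hypothetical wide-format data: columns follow "<stub>_<subscale>_<time>".
data = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=[
        "MDBF_GoodBad_pre", "MDBF_AwakeTired_pre",
        "MDBF_GoodBad_post", "MDBF_AwakeTired_post",
    ],
    index=pd.Index([0, 1], name="subject"),
)

wide = data.copy()
# Split each column name at "_" and drop the leading stub ("MDBF").
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("_")[1:]) for col in data.columns],
    names=["subscale", "time"],
)
# Move both column levels into the row index to obtain the long format.
data_long = wide.stack(["subscale", "time"]).rename("MDBF")
print(data_long.index.names)
```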
