biopsykit.utils.dataframe_handling module

Module providing various functions for advanced handling of pandas dataframes.

biopsykit.utils.dataframe_handling.apply_codebook(data, codebook)[source]

Apply codebook to convert numerical to categorical values.

The codebook is expected to be a dataframe in a standardized format (see CodebookDataFrame for further information).

Parameters
  • codebook (CodebookDataFrame) – path to codebook or dataframe to be used as codebook

  • data (DataFrame) – data to apply codebook on

Returns

data with numerical values converted to categorical values

Return type

DataFrame

See also

load_codebook()

load codebook

Examples

>>> import pandas as pd
>>> from biopsykit.utils.dataframe_handling import apply_codebook
>>> codebook = pd.DataFrame(
...     {
...         0: [None, None, "Morning"],
...         1: ["Male", "No", "Intermediate"],
...         2: ["Female", "Not very often", "Evening"],
...         3: [None, "Often", None],
...         4: [None, "Very often", None]
...     },
...     index=pd.Index(["gender", "smoking", "chronotype"], name="variable")
... )
>>> # data: dataframe with numerically coded values, passed as first argument
>>> apply_codebook(data, codebook)
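
As a rough illustration of the mapping step, the following sketch replaces numeric codes with their labels using plain pandas. It is not BioPsyKit's implementation: the helper apply_codebook_sketch and the sample data are hypothetical, and the codebook layout is assumed to be the one from the example above (one row per variable, one column per numeric code).

```python
import pandas as pd

codebook = pd.DataFrame(
    {
        0: [None, None, "Morning"],
        1: ["Male", "No", "Intermediate"],
        2: ["Female", "Not very often", "Evening"],
        3: [None, "Often", None],
        4: [None, "Very often", None],
    },
    index=pd.Index(["gender", "smoking", "chronotype"], name="variable"),
)

# Hypothetical questionnaire data with numerically coded answers
data = pd.DataFrame(
    {"gender": [1, 2, 2], "smoking": [1, 4, 2], "age": [23, 31, 27]},
    index=pd.Index(["Vp01", "Vp02", "Vp03"], name="subject"),
)


def apply_codebook_sketch(data, codebook):
    """Map numeric codes to labels for every column listed in the codebook."""
    out = data.copy()
    for col in out.columns:
        if col in codebook.index:
            # Drop unused codes (missing entries) and build a {code: label} mapping
            mapping = codebook.loc[col].dropna().to_dict()
            out[col] = out[col].map(mapping)
    return out


converted = apply_codebook_sketch(data, codebook)
```

Columns that do not appear in the codebook (here: age) are left unchanged in this sketch.
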
biopsykit.utils.dataframe_handling.add_space_to_camel(name)[source]

Add space to string in “camelCase”.

Parameters

name (str) – string to transform

Returns

string with space added

Return type

str

Examples

>>> from biopsykit.utils.dataframe_handling import add_space_to_camel
>>> add_space_to_camel("HelloWorld")
Hello World
>>> add_space_to_camel("ABC")
ABC
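
The behavior shown in the examples can be approximated with a single regular expression. This is a minimal sketch, not the library's source; add_space_to_camel_sketch is a hypothetical name, and the rule "insert a space where a lowercase letter or digit is followed by a capital letter" is an assumption that happens to reproduce both examples.

```python
import re


def add_space_to_camel_sketch(name: str) -> str:
    """Insert a space wherever a lowercase letter or digit precedes a capital."""
    # Zero-width match: nothing is consumed, a space is only inserted
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


spaced = add_space_to_camel_sketch("HelloWorld")
abbreviation = add_space_to_camel_sketch("ABC")
```
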
biopsykit.utils.dataframe_handling.camel_to_snake(name, lower=True)[source]

Convert string in “camelCase” to “snake_case”.

Note

If all letters in name are capital letters, the string will not be converted into snake_case because it is assumed to be an abbreviation.

Parameters
  • name (str) – string to convert from camelCase to snake_case

  • lower (bool, optional) – True to convert all capital letters into lowercase (“actual” snake_case), False to keep capital letters, if present. Default: True

Returns

string converted into snake_case

Return type

str

Examples

>>> from biopsykit.utils.dataframe_handling import camel_to_snake
>>> camel_to_snake("HelloWorld")
hello_world
>>> camel_to_snake("HelloWorld", lower=False)
Hello_World
>>> camel_to_snake("ABC")
ABC
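
The conversion rule, including the abbreviation exception from the note, might be sketched as follows. The function camel_to_snake_sketch is hypothetical and only reproduces the documented examples; the actual implementation may differ.

```python
import re


def camel_to_snake_sketch(name: str, lower: bool = True) -> str:
    """Convert camelCase to snake_case, leaving all-caps abbreviations alone."""
    # All-caps names are assumed to be abbreviations and returned unchanged
    if name.isupper():
        return name
    # Insert an underscore where a lowercase letter or digit precedes a capital
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return snake.lower() if lower else snake


lowered = camel_to_snake_sketch("HelloWorld")
kept = camel_to_snake_sketch("HelloWorld", lower=False)
abbreviation = camel_to_snake_sketch("ABC")
```
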
biopsykit.utils.dataframe_handling.snake_to_camel(name)[source]

Convert string in “snake_case” to “camelCase”.

Parameters

name (str) – string to convert from snake_case to camelCase
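
The reverse conversion can be sketched in one line. The source does not state whether the first letter is capitalized; this hypothetical snake_to_camel_sketch assumes the upper camel case style ("HelloWorld") used in the other examples of this module.

```python
def snake_to_camel_sketch(name: str) -> str:
    """Join underscore-separated parts, capitalizing the first letter of each."""
    # Assumption: every part, including the first, starts with a capital letter
    return "".join(part.capitalize() for part in name.split("_"))


camel = snake_to_camel_sketch("hello_world")
```
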

biopsykit.utils.dataframe_handling.convert_nan(data, inplace=False)[source]

Convert missing values to NaN.

Data exported from programs like SPSS often uses negative integers to encode missing values because these negative numbers are “unrealistic” values. Use this function to convert these negative numbers to “actual” missing values: not-a-number (NaN).

Values that will be replaced with NaN are -66, -77, -99 (integer and string representations).

Parameters
  • data (DataFrame or Series) – input data

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with converted missing values or None if inplace is True

Return type

DataFrame or None
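
A minimal sketch of the replacement described above, using only pandas.DataFrame.replace(). The helper convert_nan_sketch and the sample values are hypothetical, and the inplace option is omitted for brevity.

```python
import numpy as np
import pandas as pd

# Codes listed in the description, as integers and as strings
MISSING_CODES = [-66, -77, -99, "-66", "-77", "-99"]


def convert_nan_sketch(data):
    """Return a copy of the data with all missing-value codes replaced by NaN."""
    return data.replace(MISSING_CODES, np.nan)


# Hypothetical export with one numeric and one string column
raw = pd.DataFrame({"score": [3, -99, 5], "answer": ["yes", "-77", "no"]})
cleaned = convert_nan_sketch(raw)
```

The input is left untouched, which corresponds to the default inplace=False.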

biopsykit.utils.dataframe_handling.int_from_str_idx(data, idx_levels, regex, func=None)[source]

Extract integers from strings in index levels and set them as new index values.

Parameters
  • data (DataFrame) – data with index to extract information from

  • idx_levels (str or list of str) – name of index level or list of index level names

  • regex (str or list of str) – regex string or list of regex strings to extract integers from strings

  • func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None

Returns

dataframe with new index values

Return type

DataFrame
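
One way to picture the operation for a single index level is sketched below. This is an assumption-laden illustration rather than the library's code: int_from_str_idx_sketch is a hypothetical helper, it handles only one level and one regex, and it assumes the regex contains exactly one capture group around the digits.

```python
import pandas as pd


def int_from_str_idx_sketch(data, idx_level, regex, func=None):
    """Replace the strings of one index level by the integers they contain."""
    names = list(data.index.names)
    out = data.reset_index()
    # Extract the captured digits and convert them to integers
    values = out[idx_level].str.extract(regex, expand=False).astype(int)
    if func is not None:
        values = func(values)
    out[idx_level] = values
    # Restore the original index structure
    return out.set_index(names)


index = pd.MultiIndex.from_tuples(
    [("Vp01", "pre"), ("Vp02", "pre"), ("Vp10", "post")], names=["subject", "time"]
)
data = pd.DataFrame({"score": [4, 7, 5]}, index=index)

converted = int_from_str_idx_sketch(data, "subject", r"Vp(\d+)")
shifted = int_from_str_idx_sketch(data, "subject", r"Vp(\d+)", func=lambda x: x + 1)
```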

biopsykit.utils.dataframe_handling.int_from_str_col(data, col_name, regex, func=None)[source]

Extract integers from strings in a column of a dataframe and return them as a series.

Parameters
  • data (DataFrame) – data with column names to extract information from

  • col_name (str) – name of column with string values to extract

  • regex (str) – regex string used to extract integers from string values

  • func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None

Returns

series object with extracted integer values

Return type

Series
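
The column variant can be sketched with pandas.Series.str.extract(). The helper int_from_str_col_sketch and the sample values are hypothetical; the regex is assumed to contain exactly one capture group.

```python
import pandas as pd


def int_from_str_col_sketch(data, col_name, regex, func=None):
    """Return the integers captured by regex from a string column."""
    values = data[col_name].str.extract(regex, expand=False).astype(int)
    return values if func is None else func(values)


# Hypothetical data: sample labels "S0", "S1", ... should become 1, 2, ...
data = pd.DataFrame({"sample": ["S0", "S1", "S2"], "value": [5.1, 7.3, 6.2]})
sample_ids = int_from_str_col_sketch(data, "sample", r"S(\d+)", func=lambda x: x + 1)
```

Here func is the increment-by-one lambda mentioned in the parameter description.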

biopsykit.utils.dataframe_handling.multi_xs(data, keys, level, drop_level=True)[source]

Return cross-section of multiple keys from the dataframe.

This function internally calls the pandas.DataFrame.xs() method, but it accepts a list of keys and returns the cross-sections for all of them at once, unlike the original xs() method, which accepts only a single key.

Parameters
  • data (DataFrame or Series) – input data to get cross-section from

  • keys (str or list of str) – label(s) contained in the index, or partially in a MultiIndex

  • level (str, int, or list of such) – in case of keys partially contained in a MultiIndex, indicate which index levels are used. Levels can be referred by label or position.

  • drop_level (bool, optional) – if False, returns object with same levels as self. Default: True

Returns

cross-section from the original dataframe or series

Return type

DataFrame or Series
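
Conceptually, this amounts to calling xs() once per key and concatenating the results. The sketch below shows that idea under stated assumptions: multi_xs_sketch is a hypothetical helper, and the ordering and index layout of the real function's result may differ.

```python
import pandas as pd


def multi_xs_sketch(data, keys, level, drop_level=True):
    """Concatenate the cross-sections of several keys."""
    if isinstance(keys, str):
        keys = [keys]
    parts = [data.xs(key, level=level, drop_level=drop_level) for key in keys]
    return pd.concat(parts)


index = pd.MultiIndex.from_product(
    [[1, 2], ["A", "B", "C"]], names=["subject", "condition"]
)
data = pd.DataFrame({"value": range(6)}, index=index)

# Keep the "condition" level so the selected keys stay visible in the index
selected = multi_xs_sketch(data, ["A", "C"], level="condition", drop_level=False)
selected = selected.sort_index()
```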

biopsykit.utils.dataframe_handling.replace_missing_data(data, target_col, source_col, dropna=False, inplace=False)[source]

Replace missing data in one column by data from another column.

Parameters
  • data (DataFrame) – input data with values to replace

  • target_col (str) – target column, i.e., column in which missing values should be replaced

  • source_col (str) – source column, i.e., column values used to replace missing values in target_col

  • dropna (bool, optional) – whether to drop rows with missing values in target_col or not. Default: False

  • inplace (bool, optional) – whether to perform the operation inplace or not. Default: False

Returns

dataframe with replaced missing values or None if inplace is True

Return type

DataFrame or None
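
The replacement can be sketched with pandas.Series.fillna(). The helper replace_missing_data_sketch and the column names are hypothetical, the inplace option is omitted, and dropna is assumed to remove rows that are still missing after the replacement.

```python
import numpy as np
import pandas as pd


def replace_missing_data_sketch(data, target_col, source_col, dropna=False):
    """Fill gaps in target_col with the values from source_col."""
    out = data.copy()
    out[target_col] = out[target_col].fillna(out[source_col])
    if dropna:
        # Assumption: drop rows where neither column provided a value
        out = out.dropna(subset=[target_col])
    return out


data = pd.DataFrame(
    {"time_sensor": [7.0, np.nan, np.nan], "time_diary": [7.5, 6.5, np.nan]}
)
filled = replace_missing_data_sketch(data, "time_sensor", "time_diary")
filled_dropped = replace_missing_data_sketch(
    data, "time_sensor", "time_diary", dropna=True
)
```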

biopsykit.utils.dataframe_handling.stack_groups_percent(data, hue, stacked, order=None)[source]

Create dataframe with stacked groups.

To create a stacked bar chart, i.e. a plot with different bar charts along a categorical axis, where the variables of each bar chart are stacked along the value axis, the data needs to be rearranged and normalized in percent.

The columns of the resulting dataframe will be the categorical values specified by hue, the index items will be the variables specified by stacked.

Parameters
  • data (DataFrame) – data to compute stacked group in percent

  • hue (str) – column name of grouping categorical variable. This typically corresponds to the x axis in a stacked bar chart.

  • stacked (str) – column name of variable that is stacked along the y axis

  • order (str, optional) – order of categorical variable specified by hue. Default: None

Returns

dataframe in a format that can be used to create a stacked bar chart

Return type

DataFrame

See also

stacked_barchart()

function to create a stacked bar chart
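
The rearrangement described above (columns from hue, index from stacked, values in percent per hue group) can be approximated with pandas.crosstab(). This is a sketch under assumptions, not the library's implementation: stack_groups_percent_sketch and the sample data are hypothetical, and order is assumed to be a list of hue labels.

```python
import pandas as pd


def stack_groups_percent_sketch(data, hue, stacked, order=None):
    """Return the percentage of each stacked value within each hue group."""
    # Count each (stacked, hue) combination and normalize within each hue column
    percent = pd.crosstab(data[stacked], data[hue], normalize="columns") * 100
    if order is not None:
        percent = percent.reindex(columns=order)
    return percent


data = pd.DataFrame(
    {
        "condition": ["control"] * 4 + ["intervention"] * 2,
        "smoking": ["yes", "no", "no", "no", "yes", "yes"],
    }
)
percent = stack_groups_percent_sketch(
    data, hue="condition", stacked="smoking", order=["intervention", "control"]
)
```

Each column of the result sums to 100, which is what a stacked bar chart in percent requires.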

biopsykit.utils.dataframe_handling.wide_to_long(data, stubname, levels, sep='_')[source]

Convert a dataframe from wide-format into long-format.

In the wide-format dataframe, the index levels to be converted into long-format are expected to be encoded in the column names and separated by sep. If multiple levels should be converted into long-format, e.g., for a questionnaire with subscales (level subscale) that was assessed pre and post (level time), then the different levels are all encoded into the string. The level order is specified by levels.

Parameters
  • data (DataFrame) – pandas DataFrame containing data (e.g., questionnaire or saliva data) in wide-format, i.e. one column per item or saliva sample, one row per subject

  • stubname (str) – common name for each column to be converted into long-format. Usually, this is either the name of the questionnaire (e.g., “PSS”) or the saliva type (e.g., “cortisol”).

  • levels (str or list of str) – index levels of the resulting long-format dataframe.

  • sep (str, optional) – character separating index levels in the column names of the wide-format dataframe. Default: _

Returns

pandas DataFrame in long-format

Return type

DataFrame

Examples

>>> data = pd.DataFrame(
...     columns=[
...         "MDBF_GoodBad_pre", "MDBF_AwakeTired_pre", "MDBF_CalmNervous_pre",
...         "MDBF_GoodBad_post", "MDBF_AwakeTired_post",  "MDBF_CalmNervous_post"
...     ],
...     index=pd.Index(range(0, 5), name="subject")
... )
>>> data_long = wide_to_long(data, stubname="MDBF", levels=["subscale", "time"], sep="_")
>>> print(data_long.index.names)
['subject', 'subscale', 'time']
>>> print(data_long.index)
MultiIndex([(0,  'AwakeTired', 'post'),
        (0,  'AwakeTired',  'pre'),
        (0, 'CalmNervous', 'post'),
        (0, 'CalmNervous',  'pre'),
        (0,     'GoodBad', 'post'),
        (0,     'GoodBad',  'pre'),
        (1,  'AwakeTired', 'post'),
        ...
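
To make the column-name encoding concrete, the following sketch splits a single wide-format column name into its level values using plain string operations. The helper parse_wide_column is hypothetical and only illustrates the naming scheme; it is not how wide_to_long is implemented.

```python
def parse_wide_column(column, stubname, levels, sep="_"):
    """Split a column name such as "MDBF_GoodBad_pre" into {level: value}."""
    prefix = stubname + sep
    if not column.startswith(prefix):
        raise ValueError(f"'{column}' does not start with '{prefix}'")
    # Everything after the stub name encodes the levels, in the given order
    parts = column[len(prefix):].split(sep)
    if len(parts) != len(levels):
        raise ValueError(f"expected {len(levels)} levels, found {len(parts)}")
    return dict(zip(levels, parts))


parsed = parse_wide_column("MDBF_GoodBad_pre", "MDBF", ["subscale", "time"])
```

For the column above, the subscale level is "GoodBad" and the time level is "pre", matching the index entries in the example output.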