biopsykit.utils.dataframe_handling module
Module providing various functions for advanced handling of pandas dataframes.
- biopsykit.utils.dataframe_handling.apply_codebook(data, codebook)
Apply codebook to convert numerical to categorical values.
The codebook is expected to be a dataframe in a standardized format (see CodebookDataFrame for further information).
- Parameters
codebook (CodebookDataFrame) – path to codebook or dataframe to be used as codebook
data (DataFrame) – data to apply codebook on
- Returns
data with numerical values converted to categorical values
- Return type
See also
load_codebook()
load Codebook
Examples
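>>> import pandas as pd
>>> from biopsykit.utils.dataframe_handling import apply_codebook
>>> codebook = pd.DataFrame(
...     {
...         0: [None, None, "Morning"],
...         1: ["Male", "No", "Intermediate"],
...         2: ["Female", "Not very often", "Evening"],
...         3: [None, "Often", None],
...         4: [None, "Very often", None]
...     },
...     index=pd.Index(["gender", "smoking", "chronotype"], name="variable")
... )
>>> # `data` is a dataframe holding the numerical values (not defined in this example)
>>> apply_codebook(data, codebook)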
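To illustrate what applying a codebook involves, the following sketch maps numerical codes to labels using plain pandas. It does not call `apply_codebook` and is not the library's implementation; the `data` frame, the shortened codebook, and the mapping loop are assumptions made for illustration.

```python
import pandas as pd

# Codebook: one row per variable, one column per numerical code.
codebook = pd.DataFrame(
    {
        1: ["Male", "No"],
        2: ["Female", "Not very often"],
        3: [None, "Often"],
    },
    index=pd.Index(["gender", "smoking"], name="variable"),
)

# Hypothetical questionnaire data containing numerical codes.
data = pd.DataFrame({"gender": [1, 2, 1], "smoking": [3, 1, 2]})

decoded = data.copy()
for col in data.columns.intersection(codebook.index):
    # Build a {code: label} mapping for this variable, skipping unused codes.
    mapping = codebook.loc[col].dropna().to_dict()
    decoded[col] = data[col].map(mapping)

print(decoded)
```

Codes without an entry in the codebook end up as missing values in this sketch, because `Series.map` returns NaN for unmapped keys.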
- biopsykit.utils.dataframe_handling.add_space_to_camel(name)
Add space to string in “camelCase”.
Examples
>>> from biopsykit.utils.dataframe_handling import add_space_to_camel
>>> add_space_to_camel("HelloWorld")
Hello World
>>> add_space_to_camel("ABC")
ABC
- biopsykit.utils.dataframe_handling.camel_to_snake(name, lower=True)
Convert string in “camelCase” to “snake_case”.
Note
If all letters in name are capital letters, the string will not be converted into snake_case because it is assumed to be an abbreviation.
- Parameters
- Returns
string converted into snake_case
- Return type
Examples
>>> from biopsykit.utils.dataframe_handling import camel_to_snake
>>> camel_to_snake("HelloWorld")
hello_world
>>> camel_to_snake("HelloWorld", lower=False)
Hello_World
>>> camel_to_snake("ABC")
ABC
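The conversion rule, including the abbreviation exception from the note, can be sketched with a regular expression. The function `camel_to_snake_sketch` below is a hypothetical re-implementation for illustration and may differ from the library's code in edge cases (for example, names with consecutive capitals such as "HTTPServer").

```python
import re


def camel_to_snake_sketch(name: str, lower: bool = True) -> str:
    """Hypothetical sketch, not the biopsykit implementation."""
    # Names consisting only of capital letters are treated as abbreviations.
    if name.isupper():
        return name
    # Insert an underscore before every capital letter except at the start.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", name)
    return snake.lower() if lower else snake


print(camel_to_snake_sketch("HelloWorld"))
print(camel_to_snake_sketch("HelloWorld", lower=False))
print(camel_to_snake_sketch("ABC"))
```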
- biopsykit.utils.dataframe_handling.snake_to_camel(name)
Convert string in “snake_case” to “camelCase”.
- Parameters
name (str) – string to convert from snake_case to camelCase
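No example is given for this function. The reverse conversion could look like the sketch below. `snake_to_camel_sketch` is a hypothetical helper, and capitalizing the first word (yielding "HelloWorld" rather than "helloWorld") is an assumption based on the spelling used in the `camel_to_snake` examples.

```python
def snake_to_camel_sketch(name: str) -> str:
    """Hypothetical sketch, not the biopsykit implementation."""
    # Capitalize each underscore-separated part and join the parts.
    # Assumption: the first part is capitalized as well.
    return "".join(part.capitalize() for part in name.split("_"))


print(snake_to_camel_sketch("hello_world"))
```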
- biopsykit.utils.dataframe_handling.convert_nan(data, inplace=False)
Convert missing values to NaN.
Data exported from programs like SPSS often uses negative integers to encode missing values because these negative numbers are “unrealistic” values. Use this function to convert these negative numbers to “actual” missing values: not-a-number (NaN).
Values that will be replaced with NaN are -66, -77, -99 (integer and string representations).
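The replacement described above can be approximated with plain pandas. This is a sketch under stated assumptions, not the library's implementation: the example columns are made up, and `isin` plus `mask` is just one way to express the replacement.

```python
import pandas as pd

# Missing-value codes named in the description, as integers and as strings.
MISSING_CODES = [-66, -77, -99, "-66", "-77", "-99"]

# Hypothetical SPSS-style export with coded missing values.
data = pd.DataFrame({"stress": [3, -99, 5], "sleep": ["-77", "7", "8"]})

# Replace every cell that matches one of the codes with NaN.
cleaned = data.mask(data.isin(MISSING_CODES))

print(cleaned)
```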
- biopsykit.utils.dataframe_handling.int_from_str_idx(data, idx_levels, regex, func=None)
Extract integers from strings in index levels and set them as new index values.
- Parameters
data (DataFrame) – data with index to extract information from
idx_levels (str or list of str) – name of index level or list of index level names
regex (str or list of str) – regex string or list of regex strings to extract integers from strings
func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None
- Returns
dataframe with new index values
- Return type
- biopsykit.utils.dataframe_handling.int_from_str_col(data, col_name, regex, func=None)
Extract integers from strings in a column of a dataframe and return them.
- Parameters
data (DataFrame) – data with column names to extract information from
col_name (str) – name of column with string values to extract
regex (str) – regex string used to extract integers from string values
func (function, optional) – function to apply to the extracted integer values. This can, for example, be a lambda function which increments all integers by 1. Default: None
- Returns
series object with extracted integer values
- Return type
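The extraction step can be sketched with `Series.str.extract`. The helper `extract_int` and the sample labels are hypothetical and only illustrate the roles of `regex` and `func`. It is an assumption that the regex contains exactly one capture group around the digits.

```python
import pandas as pd

# Hypothetical column with sample labels.
data = pd.DataFrame({"sample": ["S1", "S2", "S10"]})


def extract_int(column: pd.Series, regex: str, func=None) -> pd.Series:
    """Hypothetical helper: extract the captured digits and cast them to int."""
    values = column.str.extract(regex, expand=False).astype(int)
    # Optionally post-process the integers, e.g. shift them by one.
    return func(values) if func is not None else values


ids = extract_int(data["sample"], r"S(\d+)")
zero_based = extract_int(data["sample"], r"S(\d+)", func=lambda x: x - 1)

print(ids.tolist())
print(zero_based.tolist())
```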
- biopsykit.utils.dataframe_handling.multi_xs(data, keys, level, drop_level=True)
Return cross-section of multiple keys from the dataframe.
This function internally calls the pandas.DataFrame.xs() method, but it can take a list of key arguments to return multiple keys at once, in contrast to the original xs() method, which only takes a single key.
- Parameters
data (DataFrame or Series) – input data to get cross-section from
keys (str or list of str) – label(s) contained in the index, or partially in a MultiIndex
level (str, int, or list of such) – in case of keys partially contained in a MultiIndex, indicate which index levels are used. Levels can be referred to by label or position.
drop_level (bool, optional) – if False, returns object with same levels as self. Default: True
- Returns
cross-section from the original dataframe or series
- Return type
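One plausible way to select several keys with `xs()` is to call it once per key and concatenate the results, as sketched below. This is an illustration with made-up data, not necessarily how `multi_xs` is implemented. `drop_level=False` is used here so that the rows of different keys remain distinguishable.

```python
import pandas as pd

# Hypothetical long-format data with a (subject, time) MultiIndex.
index = pd.MultiIndex.from_product(
    [["Vp01", "Vp02"], ["pre", "mid", "post"]], names=["subject", "time"]
)
data = pd.DataFrame({"cortisol": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

keys = ["pre", "post"]
# One xs() call per key, then stack the partial results on top of each other.
subset = pd.concat(
    [data.xs(key, level="time", drop_level=False) for key in keys]
)

print(subset)
```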
- biopsykit.utils.dataframe_handling.replace_missing_data(data, target_col, source_col, dropna=False, inplace=False)
Replace missing data in one column by data from another column.
- Parameters
data (DataFrame) – input data with values to replace
target_col (str) – target column, i.e., column in which missing values should be replaced
source_col (str) – source column, i.e., column values used to replace missing values in target_col
dropna (bool, optional) – whether to drop rows with missing values in target_col or not. Default: False
inplace (bool, optional) – whether to perform the operation inplace or not. Default: False
- Returns
dataframe with replaced missing values or None if inplace is True
- Return type
DataFrame or None
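The core of this operation corresponds to filling gaps in one column from another, which can be sketched with `Series.fillna`. The column names below are hypothetical, and the sketch does not cover the `dropna` and `inplace` options.

```python
import numpy as np
import pandas as pd

# Hypothetical data: self-reported age with a gap, plus a second source.
data = pd.DataFrame(
    {"age_self": [25.0, np.nan, 31.0], "age_record": [25.0, 40.0, 30.0]}
)

filled = data.copy()
# Only missing entries in the target column are taken from the source column.
filled["age_self"] = filled["age_self"].fillna(filled["age_record"])

print(filled)
```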
- biopsykit.utils.dataframe_handling.stack_groups_percent(data, hue, stacked, order=None)
Create dataframe with stacked groups.
To create a stacked bar chart, i.e. a plot with different bar charts along a categorical axis, where the variables of each bar chart are stacked along the value axis, the data needs to be rearranged and normalized in percent.
The columns of the resulting dataframe will be the categorical values specified by hue, and the index items will be the variables specified by stacked.
- Parameters
data (DataFrame) – data to compute stacked group in percent
hue (str) – column name of grouping categorical variable. This typically corresponds to the x axis in a stacked bar chart.
stacked (str) – column name of variable that is stacked along the y axis
order (str) – order of categorical variable specified by hue
dataframe in a format that can be used to create a stacked bar chart
- Return type
See also
stacked_barchart()
function to create a stacked bar chart
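A table of the described shape (categories of `hue` as columns, categories of `stacked` as index, values in percent) can be produced with `pandas.crosstab`, as in the sketch below. The data and column names are made up, and using `crosstab` is an assumption for illustration rather than the library's implementation.

```python
import pandas as pd

# Hypothetical data: one row per participant.
data = pd.DataFrame(
    {
        "condition": ["A", "A", "A", "A", "B", "B"],
        "chronotype": ["Morning", "Morning", "Morning", "Evening", "Morning", "Evening"],
    }
)

# Rows: stacked variable, columns: hue variable, normalized per column.
percent = pd.crosstab(
    data["chronotype"], data["condition"], normalize="columns"
) * 100

print(percent)
```

Each column sums to 100, so every bar of the resulting stacked chart has the same total height.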
- biopsykit.utils.dataframe_handling.wide_to_long(data, stubname, levels, sep='_')
Convert a dataframe from wide-format into long-format.
In the wide-format dataframe, the index levels to be converted into long-format are expected to be encoded in the column names and separated by sep. If multiple levels should be converted into long-format, e.g., for a questionnaire with subscales (level subscale) that was assessed pre and post (level time), then the different levels are all encoded into the string. The level order is specified by levels.
- Parameters
data (DataFrame) – pandas DataFrame containing saliva data in wide-format, i.e. one column per saliva sample, one row per subject
stubname (str) – common name for each column to be converted into long-format. Usually, this is either the name of the questionnaire (e.g., “PSS”) or the saliva type (e.g., “cortisol”).
levels (str or list of str) – index levels of the resulting long-format dataframe.
sep (str, optional) – character separating index levels in the column names of the wide-format dataframe. Default: _
- Returns
pandas DataFrame in long-format
- Return type
Examples
>>> data = pd.DataFrame(
...     columns=[
...         "MDBF_GoodBad_pre", "MDBF_AwakeTired_pre", "MDBF_CalmNervous_pre",
...         "MDBF_GoodBad_post", "MDBF_AwakeTired_post", "MDBF_CalmNervous_post"
...     ],
...     index=pd.Index(range(0, 5), name="subject")
... )
>>> data_long = wide_to_long(data, stubname="MDBF", levels=["subscale", "time"], sep="_")
>>> print(data_long.index.names)
['subject', 'subscale', 'time']
>>> print(data_long.index)
MultiIndex([(0, 'AwakeTired', 'post'),
            (0, 'AwakeTired', 'pre'),
            (0, 'CalmNervous', 'post'),
            (0, 'CalmNervous', 'pre'),
            (0, 'GoodBad', 'post'),
            (0, 'GoodBad', 'pre'),
            (1, 'AwakeTired', 'post'),
            ...
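The column-name encoding described above can be made concrete with a plain pandas sketch that melts the dataframe and splits the column names at `sep`. This is an illustration under assumptions (made-up values, and `melt` plus `str.split` as the chosen technique), not the library's implementation.

```python
import pandas as pd

# Hypothetical wide-format data: column names encode stub, subscale and time.
wide = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=[
        "MDBF_GoodBad_pre", "MDBF_GoodBad_post",
        "MDBF_AwakeTired_pre", "MDBF_AwakeTired_post",
    ],
    index=pd.Index([0, 1], name="subject"),
)
stubname, levels, sep = "MDBF", ["subscale", "time"], "_"

# One row per (subject, original column); values go into a column named after the stub.
melted = wide.reset_index().melt(
    id_vars="subject", var_name="column", value_name=stubname
)
# Split "MDBF_GoodBad_pre" into ["MDBF", "GoodBad", "pre"]; part 0 is the stub.
parts = melted["column"].str.split(sep, expand=True)
for pos, level in enumerate(levels, start=1):
    melted[level] = parts[pos]

long = (
    melted.drop(columns="column")
    .set_index(["subject", *levels])
    .sort_index()
)

print(long)
```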