biopsykit.utils.data_processing module

Module providing various functions for processing more complex structured data (e.g., collected during a study).

biopsykit.utils.data_processing.split_data(data, time_intervals, include_start=False)

Split data into different phases based on time intervals.

The start and end times of the phases are provided via the time_intervals parameter, which can be a Series, one row of a DataFrame, or a dictionary with start and end times per phase.

Parameters

data (DataFrame) – data to be split
time_intervals (dict or Series or DataFrame) –
time intervals indicating where the data should be split. This can be:
- Series object or one row of a DataFrame with the start times of each phase. The phase names are then derived from the index names in case of a Series or from the column names in case of a DataFrame.
- dictionary with phase names (keys) and tuples with start and end times of the phase (values)
include_start (bool, optional) – True to include data from the beginning of the recording to the start of the first phase as the first phase (this phase will be named “Start”), False to discard this data. Default: False

Returns

dictionary with data split into different phases

Return type

dict

Examples

>>> import pandas as pd
>>> from biopsykit.utils.data_processing import split_data
>>> # read pandas dataframe from csv file and split data based on time interval dictionary
>>> data = pd.read_csv("path-to-file.csv")
>>> # Example 1: define time intervals (start and end) of the different recording phases as dictionary
>>> time_intervals = {"Part1": ("09:00", "09:30"), "Part2": ("09:30", "09:45"), "Part3": ("09:45", "10:00")}
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>> # Example 2: define time intervals as pandas Series. Here, only the start times of the phases are required; it is
>>> # assumed that the phases are back to back
>>> time_intervals = pd.Series(data=["09:00", "09:30", "09:45", "10:00"], index=["Part1", "Part2", "Part3", "End"])
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>>
>>> # Example: Get Part 2 of data_dict
>>> print(data_dict['Part2'])
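
As a rough illustration of the dictionary variant, the splitting step can be approximated with plain pandas. This is a minimal sketch, not biopsykit's implementation: it assumes that data carries a DatetimeIndex, and the helper name split_by_intervals and the sample values are hypothetical.

```python
import pandas as pd


def split_by_intervals(data, time_intervals):
    """Hypothetical helper: return one dataframe slice per phase."""
    return {
        phase: data.between_time(start, end)
        for phase, (start, end) in time_intervals.items()
    }


# Illustrative data: one sample per minute from 09:00 to 09:59
index = pd.date_range("2021-01-01 09:00", periods=60, freq="min")
data = pd.DataFrame({"value": range(60)}, index=index)

time_intervals = {"Part1": ("09:00", "09:30"), "Part2": ("09:30", "09:45")}
data_dict = split_by_intervals(data, time_intervals)
print({phase: len(df) for phase, df in data_dict.items()})
```

DataFrame.between_time includes both endpoints by default, so in this sketch the sample at exactly 09:30 appears in both phases. Whether split_data handles phase boundaries the same way is not stated here.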

biopsykit.utils.data_processing.exclude_subjects(excluded_subjects, index_name='subject', **kwargs)

Exclude subjects from dataframes.

This function can be used to exclude subject IDs for later analysis from different kinds of dataframes, such as:

- dataframes with subject condition information (SubjectConditionDataFrame)
- dataframes with time log information
- dataframes with (processed) data (e.g., biopsykit.utils.datatype_helper.SalivaRawDataFrame or MeanSeDataFrame)

All dataframes can be supplied at once via **kwargs.

Parameters

excluded_subjects (list of str or int) – list with subject IDs to be excluded
index_name (str, optional) – name of dataframe index level with subject IDs. Default: “subject”
**kwargs – data to be cleaned as key-value pairs

Returns

dictionary with cleaned versions of the dataframes passed to the function via **kwargs, or a single dataframe if the function was called with only one dataframe

Return type

DataFrame or dict of such
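
The documented behaviour (drop subject IDs from every dataframe passed as a keyword argument, and return a bare dataframe when only one was passed) can be sketched as follows. The helper name drop_subjects, the column names, and the sample data are hypothetical; this is not the library's own code.

```python
import pandas as pd


def drop_subjects(excluded_subjects, index_name="subject", **kwargs):
    """Hypothetical helper mirroring the behaviour described above."""
    cleaned = {}
    for key, df in kwargs.items():
        # Works for a plain Index and for a MultiIndex with a level `index_name`
        subject_ids = df.index.get_level_values(index_name)
        cleaned[key] = df[~subject_ids.isin(excluded_subjects)]
    # A single dataframe is returned directly, several as a dictionary
    if len(cleaned) == 1:
        return next(iter(cleaned.values()))
    return cleaned


index = pd.MultiIndex.from_product(
    [["S01", "S02", "S03"], ["pre", "post"]], names=["subject", "phase"]
)
saliva = pd.DataFrame({"cortisol": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)
conditions = pd.DataFrame(
    {"condition": ["A", "B", "A"]},
    index=pd.Index(["S01", "S02", "S03"], name="subject"),
)

cleaned = drop_subjects(["S02"], saliva=saliva, conditions=conditions)
```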

biopsykit.utils.data_processing.normalize_to_phase(subject_data_dict, phase)

Normalize time series data per subject to the phase specified by phase.

The result is the relative change (of, for example, heart rate) compared to the mean value in phase.

Parameters

subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e., a dictionary with a PhaseDict for each subject
phase (str or DataFrame) – phase to normalize all other data to. If phase is a string then it is interpreted as the name of a phase present in subject_data_dict. If phase is a DataFrame then the data will be normalized (per subject) to the mean value of the DataFrame.

Returns

dictionary with normalized data per subject

Return type

dict
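
The relative change described above can be sketched for the PhaseDict of a single subject. This is an illustration under stated assumptions: the helper name relative_change and the column name Heart_Rate are hypothetical, and scaling the result to percent is an assumption rather than something this page specifies.

```python
import pandas as pd


def relative_change(phase_dict, reference_phase):
    """Hypothetical helper: change of each phase relative to a reference mean."""
    baseline = phase_dict[reference_phase].mean()  # column-wise mean
    return {
        name: (df - baseline) / baseline * 100  # percent scaling is an assumption
        for name, df in phase_dict.items()
    }


phase_dict = {
    "Baseline": pd.DataFrame({"Heart_Rate": [60.0, 62.0, 58.0]}),
    "Stress": pd.DataFrame({"Heart_Rate": [75.0, 90.0]}),
}
normalized = relative_change(phase_dict, "Baseline")
print(normalized["Stress"])
```

With a baseline mean of 60, the two stress samples correspond to a change of 25 and 50 percent.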

biopsykit.utils.data_processing.resample_sec(data)

Resample input data to a frequency of 1 Hz.

Note

For resampling, the index of data has to be either a DatetimeIndex or an Index with time information in seconds.

Parameters: data (DataFrame or Series) – data to resample. Index of data needs to be a DatetimeIndex or an Index with time information in seconds
Returns: dataframe with data resampled to 1 Hz
Return type: DataFrame
Raises: ValueError – If data is not a DataFrame or Series
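
For data with a DatetimeIndex, a reduction to 1 Hz can be approximated with pandas' own resampling. The sketch below averages all samples within each second; how resample_sec derives the 1 Hz values (for example by averaging or by interpolation) is not stated on this page, so the averaging step is an assumption.

```python
import pandas as pd

# Illustrative 4 Hz signal covering two seconds
index = pd.date_range("2021-01-01 09:00:00", periods=8, freq="250ms")
data = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=index)

# One row per second, each holding the mean of the samples in that second
resampled = data.resample("1s").mean()
print(resampled)
```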

biopsykit.utils.data_processing.resample_dict_sec(data_dict)

Resample all data in the dictionary to 1 Hz data.

This function recursively looks for all dataframes in the dictionary and resamples data to 1 Hz using resample_sec().

Parameters: data_dict (dict) – nested dictionary with data to be resampled
Returns: nested dictionary with data resampled to 1 Hz
Return type: dict

See also

resample_sec(): resample dataframe to 1 Hz

biopsykit.utils.data_processing.select_dict_phases(subject_data_dict, phases)

Select specific phases from SubjectDataDict.

Parameters

subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with PhaseDict for each subject
phases (list of str) – list of phases to select

Returns

SubjectDataDict containing only the phases of interest

Return type

SubjectDataDict
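
Structurally, the selection amounts to filtering the inner dictionary of every subject. A minimal sketch, using strings as stand-ins for dataframes and a hypothetical helper name select_phases:

```python
def select_phases(subject_data_dict, phases):
    """Hypothetical helper: keep only the requested phases for every subject."""
    return {
        subject: {phase: phase_dict[phase] for phase in phases}
        for subject, phase_dict in subject_data_dict.items()
    }


# Strings stand in for the dataframes of a real SubjectDataDict
subject_data_dict = {
    "subject1": {"phase_1": "df_1_1", "phase_2": "df_1_2", "phase_3": "df_1_3"},
    "subject2": {"phase_1": "df_2_1", "phase_2": "df_2_2", "phase_3": "df_2_3"},
}
selected = select_phases(subject_data_dict, ["phase_1", "phase_3"])
print(selected)
```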

biopsykit.utils.data_processing.rearrange_subject_data_dict(subject_data_dict)

Rearrange SubjectDataDict to StudyDataDict.

A StudyDataDict is constructed from a SubjectDataDict by swapping outer (subject IDs) and inner (phase names) dictionary keys.

The input needs to be a SubjectDataDict, a nested dictionary in the following format:

{
    "subject1": {"phase_1": dataframe, "phase_2": dataframe, ...},
    "subject2": {"phase_1": dataframe, "phase_2": dataframe, ...},
    ...
}

The output format will be the following:

{
    "phase_1": {"subject1": dataframe, "subject2": dataframe, ...},
    "phase_2": {"subject1": dataframe, "subject2": dataframe, ...},
    ...
}

Parameters: subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with data from multiple subjects, each containing data from multiple phases (in form of a PhaseDict)
Returns: rearranged SubjectDataDict
Return type: StudyDataDict
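
The key swap described above can be written in a few lines of plain Python. The helper name swap_levels is hypothetical, and strings stand in for dataframes; this is a sketch of the idea, not the library's implementation.

```python
def swap_levels(subject_data_dict):
    """Hypothetical helper: turn {subject: {phase: df}} into {phase: {subject: df}}."""
    study_data_dict = {}
    for subject, phase_dict in subject_data_dict.items():
        for phase, df in phase_dict.items():
            # Create the phase entry on first use, then file the subject under it
            study_data_dict.setdefault(phase, {})[subject] = df
    return study_data_dict


subject_data_dict = {
    "subject1": {"phase_1": "df_1_1", "phase_2": "df_1_2"},
    "subject2": {"phase_1": "df_2_1", "phase_2": "df_2_2"},
}
study_data_dict = swap_levels(subject_data_dict)
print(study_data_dict)
```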

biopsykit.utils.data_processing.cut_phases_to_shortest(study_data_dict, phases=None)

Cut time-series data to shortest duration of a subject in each phase.

To overlay time-series data from multiple subjects in an ensemble plot it is beneficial if all data have the same length. For that reason, data can be cut to the same length using this function.

Parameters

study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.
phases (list of str, optional) – list of phases if only a subset of phases should be cut or None to cut all phases. Default: None

Returns

StudyDataDict with data cut to the shortest duration in each phase

Return type

StudyDataDict
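
The cutting step can be pictured as follows: within each phase, find the shortest dataframe and truncate all others to that number of rows. The helper name cut_to_shortest and the sample data are hypothetical, and the sketch always cuts every phase, so it does not model the phases parameter.

```python
import pandas as pd


def cut_to_shortest(study_data_dict):
    """Hypothetical helper: truncate all subjects to the shortest one per phase."""
    result = {}
    for phase, subject_dict in study_data_dict.items():
        min_len = min(len(df) for df in subject_dict.values())
        result[phase] = {
            subject: df.iloc[:min_len] for subject, df in subject_dict.items()
        }
    return result


study_data_dict = {
    "phase_1": {
        "S01": pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]}),
        "S02": pd.DataFrame({"value": [6.0, 7.0, 8.0]}),
    },
}
cut = cut_to_shortest(study_data_dict)
print({subject: len(df) for subject, df in cut["phase_1"].items()})
```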

biopsykit.utils.data_processing.merge_study_data_dict(study_data_dict, dict_levels=None)

Merge inner dictionary level of StudyDataDict into one dataframe.

This function removes the inner level of the nested StudyDataDict by merging data from all subjects into one dataframe for each phase.

Note

To merge data from different subjects into one dataframe, the data are all expected to have the same length! If this is not the case, all data need to be cut to equal length first, e.g. using cut_phases_to_shortest().

Parameters

study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.
dict_levels (list of str) – list with names of dictionary levels.

Returns

MergedStudyDataDict with data of all subjects merged into one dataframe for each phase

Return type

MergedStudyDataDict
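
One way to picture the merge is a column-wise concatenation per phase, with the subject IDs becoming an outer column level. The layout shown here (subjects as columns, level name "subject") is an assumption for illustration and may differ from the actual MergedStudyDataDict.

```python
import pandas as pd

study_data_dict = {
    "phase_1": {
        "S01": pd.DataFrame({"value": [1.0, 2.0, 3.0]}),
        "S02": pd.DataFrame({"value": [4.0, 5.0, 6.0]}),
    },
}

# Passing a dictionary to pd.concat uses its keys as the outer column level
merged = {
    phase: pd.concat(subject_dict, axis=1, names=["subject"])
    for phase, subject_dict in study_data_dict.items()
}
print(merged["phase_1"])
```

If the dataframes had different lengths, the concatenation would pad the shorter ones with missing values, which is one reason to cut all data to equal length beforehand.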

biopsykit.utils.data_processing.split_dict_into_subphases(data_dict, subphases)

Split dataframes in a nested dictionary into subphases.

By further splitting a dataframe into subphases a new dictionary level is created. The new dictionary level then contains the subphases with their data.

Note

If the duration of the last subphase is unknown (e.g., because it has variable length) this can be indicated by setting the duration of this subphase to 0. The duration of this subphase will then be inferred from the data.

Parameters

data_dict (dict) – dictionary with an arbitrary number of outer levels (e.g., conditions, phases, etc.) as keys and dataframes with data to be split into subphases as values
subphases (dict) – dictionary with subphase names (keys) and subphase durations (values) in seconds

Returns

dictionary where each dataframe in the dictionary is split into the subphases specified by subphases

Return type

dict

biopsykit.utils.data_processing.get_subphase_durations(data, subphases)

Compute subphase durations from dataframe.

The subphases can be specified in two different ways:

- If the dictionary entries in subphases are integers, it’s assumed that subphases are consecutive, i.e., each subphase begins right after the previous one, and the entries indicate the durations of each subphase. The start and end times of each subphase will then be computed from the subphase durations.
- If the dictionary entries in subphases are tuples, it’s assumed that the start and end times of each subphase are directly provided.

Parameters

data (DataFrame) – dataframe with data from one phase. Used to compute the duration of the last subphase if this subphase is expected to have variable duration.
subphases (dict) – dictionary with subphase names as keys and subphase durations (as integers) or start and end times (as tuples of integers) as values in seconds

Returns

list with start and end times of each subphase in seconds relative to beginning of the phase

Return type

list

Examples

>>> from biopsykit.utils.data_processing import get_subphase_durations
>>> # Option 1: Subphases consecutive, subphase durations provided
>>> get_subphase_durations(data, {"Start": 60, "Middle": 120, "End": 60})
>>> # Option 2: Subphase start and end times provided
>>> get_subphase_durations(data, {"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
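
The two ways of specifying subphases can be sketched in plain Python: consecutive durations are turned into start and end times by a running sum, while tuples are passed through unchanged. The helper name subphase_bounds is hypothetical, and the sketch does not cover a last subphase of variable length, which requires the data.

```python
from itertools import accumulate


def subphase_bounds(subphases):
    """Hypothetical helper: (start, end) times in seconds for each subphase."""
    values = list(subphases.values())
    if all(isinstance(value, tuple) for value in values):
        # Start and end times were provided directly
        return values
    # Durations of consecutive subphases: each one starts where the previous ends
    ends = list(accumulate(values))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


from_durations = subphase_bounds({"Start": 60, "Middle": 120, "End": 60})
from_tuples = subphase_bounds({"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
print(from_durations)  # [(0, 60), (60, 180), (180, 240)]
print(from_tuples)     # [(0, 50), (60, 160), (180, 240)]
```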

biopsykit.utils.data_processing.add_subject_conditions(data, condition_list)

Add subject conditions to dataframe.

This function expects a dataframe with data from multiple subjects and information on which subject belongs to which condition.

Parameters

data (DataFrame) – dataframe to which a new index level condition with subject conditions should be added
condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition

Returns

dataframe with new index level condition indicating which subject belongs to which condition

Return type

DataFrame

biopsykit.utils.data_processing.split_subject_conditions(data_dict, condition_list)

Split dictionary with data based on conditions subjects were assigned to.

This function adds a new outer dictionary level with the different conditions as keys and dictionaries belonging to the conditions as values. For that, it expects a dictionary with data from multiple subjects and information on which subject belongs to which condition.

Parameters

data_dict (dict) – (nested) dictionary with data which should be split based on the conditions subjects belong to
condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition

Returns

dictionary with additional outer level indicating which subject belongs to which condition

Return type

dict
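
The added outer level can be illustrated with plain dictionaries. The sketch assumes that the condition information maps each condition name to a list of subject IDs and that data_dict has subject IDs as its outer keys; the helper name split_by_condition is hypothetical, and strings stand in for the per-subject data.

```python
def split_by_condition(data_dict, condition_dict):
    """Hypothetical helper: group the subjects of `data_dict` by condition."""
    return {
        condition: {subject: data_dict[subject] for subject in subjects}
        for condition, subjects in condition_dict.items()
    }


data_dict = {"S01": "data_S01", "S02": "data_S02", "S03": "data_S03"}
# Assumed format: condition name -> list of subject IDs
condition_dict = {"Control": ["S01", "S03"], "Intervention": ["S02"]}

split = split_by_condition(data_dict, condition_dict)
print(split)
```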

biopsykit.utils.data_processing.mean_per_subject_dict(data, dict_levels, param_name)

Compute mean values of time-series data from a nested dictionary.

This function computes the mean value of time-series data in a nested dictionary per subject and combines it into a joint dataframe. The dictionary will be traversed recursively and can thus have arbitrary depth. The innermost level must contain dataframes with time-series data of which mean values will be computed. The names of the dictionary levels are specified by dict_levels.

Parameters

data (dict) – nested dictionary with data on which mean should be computed. The number of nested levels must match the number of levels specified in dict_levels.
dict_levels (list of str) – list with names of dictionary levels.
param_name (str) – type of data from which mean values will be computed. This will also be the column name in the resulting dataframe.

Returns

dataframe with dict_levels as index levels and mean values of time-series data as column values

Return type

DataFrame

biopsykit.utils.data_processing.mean_se_per_phase(data)

Compute mean and standard error over all subjects in a dataframe.

Note

The dataframe in data is expected to have a MultiIndex with at least two levels, one of the levels being the level “subject”!

Parameters: data (DataFrame) – dataframe with MultiIndex from which to compute mean and standard error
Returns: dataframe with mean and standard error over all subjects
Return type: DataFrame
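
The computation can be sketched with a pandas groupby over all index levels except "subject". The column name Heart_Rate, the level name phase, and the output column names are illustrative assumptions; the exact layout of the dataframe returned by mean_se_per_phase is not specified here.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["S01", "S02", "S03"], ["Baseline", "Stress"]], names=["subject", "phase"]
)
data = pd.DataFrame({"Heart_Rate": [60.0, 80.0, 62.0, 84.0, 64.0, 88.0]}, index=index)

# Group by every index level except "subject", then aggregate over the subjects
group_levels = [name for name in data.index.names if name != "subject"]
mean_se = data["Heart_Rate"].groupby(level=group_levels).agg(["mean", "sem"])
print(mean_se)
```

Here "sem" is the standard error of the mean, i.e. the sample standard deviation divided by the square root of the number of subjects.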
