biopsykit.utils.data_processing module

Module providing various functions for processing more complex structured data (e.g., collected during a study).

biopsykit.utils.data_processing.split_data(data, time_intervals, include_start=False)[source]

Split data into different phases based on time intervals.

The start and end times of the phases are provided via the time_intervals parameter, which can be a Series, one row of a DataFrame, or a dictionary with start and end times per phase.

Parameters
  • data (DataFrame) – data to be split

  • time_intervals (dict or Series or DataFrame) –

    time intervals indicating where the data should be split. This can be:

    • Series object or 1 row of a DataFrame with start times of each phase. The phase names are then derived from the index names in case of a Series or from the column names in case of a DataFrame.

    • dictionary with phase names (keys) and tuples with start and end times of the phase (values)

  • include_start (bool, optional) – True to include data from the beginning of the recording to the start of the first phase as the first phase (this phase will be named “Start”), False to discard this data. Default: False

Returns

dictionary with data split into different phases

Return type

dict

Examples

>>> import pandas as pd
>>> from biopsykit.utils.data_processing import split_data
>>> # read pandas dataframe from csv file and split data based on time interval dictionary
>>> data = pd.read_csv("path-to-file.csv")
>>> # Example 1: define time intervals (start and end) of the different recording phases as dictionary
>>> time_intervals = {"Part1": ("09:00", "09:30"), "Part2": ("09:30", "09:45"), "Part3": ("09:45", "10:00")}
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>> # Example 2: define time intervals as pandas Series. Here, only the start times of the phases are required;
>>> # it is assumed that the phases are back to back
>>> time_intervals = pd.Series(data=["09:00", "09:30", "09:45", "10:00"], index=["Part1", "Part2", "Part3", "End"])
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>>
>>> # Example: Get Part 2 of data_dict
>>> print(data_dict['Part2'])
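
The relation between the two input formats can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper starts_to_intervals is hypothetical, and it assumes (as Example 2 suggests) that the last entry only marks the end of the final phase.

```python
def starts_to_intervals(start_times):
    """Turn ordered phase start times into (start, end) tuples per phase.

    Assumption: the last entry only marks the end of the final phase.
    """
    names = list(start_times)
    times = list(start_times.values())
    # pair each start time with the next one; zip stops before the last name
    return {name: (start, end) for name, start, end in zip(names, times, times[1:])}


# same values as the Series in Example 2, written as an ordered dictionary
start_times = {"Part1": "09:00", "Part2": "09:30", "Part3": "09:45", "End": "10:00"}
intervals = starts_to_intervals(start_times)
print(intervals)
```

The result has the same shape as the dictionary used in Example 1.
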
biopsykit.utils.data_processing.exclude_subjects(excluded_subjects, index_name='subject', **kwargs)[source]

Exclude subjects from dataframes.

This function can be used to exclude subject IDs for later analysis from different kinds of dataframes.

All dataframes can be supplied at once via **kwargs.

Parameters
  • excluded_subjects (list of str or int) – list with subject IDs to be excluded

  • index_name (str, optional) – name of dataframe index level with subject IDs. Default: “subject”

  • **kwargs – data to be cleaned as key-value pairs

Returns

dictionary with cleaned versions of the dataframes passed to the function via **kwargs or dataframe if function was only called with one single dataframe

Return type

DataFrame or dict of such

biopsykit.utils.data_processing.normalize_to_phase(subject_data_dict, phase)[source]

Normalize time series data per subject to the phase specified by phase.

The result is the relative change (of, for example, heart rate) compared to the mean value in phase.

Parameters
  • subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e., a dictionary with a PhaseDict for each subject

  • phase (str or DataFrame) – phase to normalize all other data to. If phase is a string then it is interpreted as the name of a phase present in subject_data_dict. If phase is a DataFrame then the data will be normalized (per subject) to the mean value of the DataFrame.

Returns

dictionary with normalized data per subject

Return type

dict
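
The idea of a relative change compared to the mean of a reference phase can be sketched without pandas. The helper relative_change and the sample values are hypothetical, and whether the library reports the result as a fraction or as a percentage is not stated here; the sketch uses a fraction.

```python
from statistics import mean


def relative_change(values, reference):
    """Return the change of each value relative to the mean of `reference`."""
    ref_mean = mean(reference)
    return [(value - ref_mean) / ref_mean for value in values]


# hypothetical heart rate samples (bpm) of one subject; lists stand in for dataframes
subject = {"Baseline": [60.0, 60.0], "Stress": [66.0, 72.0]}

# normalize every phase of this subject to the mean of the "Baseline" phase
normalized = {
    phase: relative_change(values, subject["Baseline"])
    for phase, values in subject.items()
}
print(normalized)
```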

biopsykit.utils.data_processing.resample_sec(data)[source]

Resample input data to a frequency of 1 Hz.

Note

For resampling, the index of data has to be either a DatetimeIndex or an Index with time information in seconds.

Parameters

data (DataFrame or Series) – data to resample. The index of data needs to be a DatetimeIndex or an Index with time information in seconds (see the note above)

Returns

dataframe with data resampled to 1 Hz

Return type

DataFrame

Raises

ValueError – If data is not a DataFrame or Series

biopsykit.utils.data_processing.resample_dict_sec(data_dict)[source]

Resample all data in the dictionary to 1 Hz data.

This function recursively looks for all dataframes in the dictionary and resamples data to 1 Hz using resample_sec().

Parameters

data_dict (dict) – nested dictionary with data to be resampled

Returns

nested dictionary with data resampled to 1 Hz

Return type

dict

See also

resample_sec()

resample dataframe to 1 Hz
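
The recursive traversal described above can be sketched as follows. This is a generic illustration, not the library's code: apply_to_leaves is a hypothetical helper, lists stand in for dataframes, and len stands in for the per-dataframe operation (resampling in the real function).

```python
def apply_to_leaves(data, func):
    """Recursively apply `func` to every non-dict value of a nested dictionary."""
    if isinstance(data, dict):
        # descend one level and keep the dictionary structure intact
        return {key: apply_to_leaves(value, func) for key, value in data.items()}
    # leaf reached: apply the operation
    return func(data)


nested = {
    "subject1": {"phase_1": [1, 2, 3], "phase_2": [4, 5]},
    "subject2": {"phase_1": [6]},
}
result = apply_to_leaves(nested, len)
print(result)
```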

biopsykit.utils.data_processing.select_dict_phases(subject_data_dict, phases)[source]

Select specific phases from SubjectDataDict.

Parameters
  • subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with PhaseDict for each subject

  • phases (list of str) – list of phases to select

Returns

SubjectDataDict containing only the phases of interest

Return type

SubjectDataDict
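
A minimal sketch of the selection on plain dictionaries is shown below. The helper select_phases is hypothetical and strings stand in for the dataframes; it assumes that every requested phase exists for every subject.

```python
def select_phases(subject_data_dict, phases):
    """Keep only the given phases in each subject's inner dictionary."""
    return {
        subject: {phase: phase_dict[phase] for phase in phases}
        for subject, phase_dict in subject_data_dict.items()
    }


subject_data_dict = {
    "subject1": {"Baseline": "df_1a", "Stress": "df_1b", "Recovery": "df_1c"},
    "subject2": {"Baseline": "df_2a", "Stress": "df_2b", "Recovery": "df_2c"},
}
selected = select_phases(subject_data_dict, ["Baseline", "Recovery"])
print(selected)
```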

biopsykit.utils.data_processing.rearrange_subject_data_dict(subject_data_dict)[source]

Rearrange SubjectDataDict to StudyDataDict.

A StudyDataDict is constructed from a SubjectDataDict by swapping outer (subject IDs) and inner (phase names) dictionary keys.

The input needs to be a SubjectDataDict, a nested dictionary in the following format:

{
    "subject1": {"phase_1": dataframe, "phase_2": dataframe, …},
    "subject2": {"phase_1": dataframe, "phase_2": dataframe, …},
}

The output format will be the following:

{
    "phase_1": {"subject1": dataframe, "subject2": dataframe, …},
    "phase_2": {"subject1": dataframe, "subject2": dataframe, …},
}
Parameters

subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with data from multiple subjects, each containing data from multiple phases (in form of a PhaseDict)

Returns

rearranged SubjectDataDict

Return type

StudyDataDict
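
Swapping the outer and inner dictionary keys can be sketched in a few lines of plain Python. The helper swap_levels is hypothetical and strings stand in for the dataframes; it illustrates the rearrangement and is not the library's implementation.

```python
def swap_levels(subject_data_dict):
    """Swap outer (subject IDs) and inner (phase names) dictionary keys."""
    study_data_dict = {}
    for subject, phase_dict in subject_data_dict.items():
        for phase, data in phase_dict.items():
            # create the phase entry on first use, then file the data by subject
            study_data_dict.setdefault(phase, {})[subject] = data
    return study_data_dict


subject_data_dict = {
    "subject1": {"phase_1": "df_1_1", "phase_2": "df_1_2"},
    "subject2": {"phase_1": "df_2_1", "phase_2": "df_2_2"},
}
study_data_dict = swap_levels(subject_data_dict)
print(study_data_dict)
```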

biopsykit.utils.data_processing.cut_phases_to_shortest(study_data_dict, phases=None)[source]

Cut time-series data to shortest duration of a subject in each phase.

To overlay time-series data from multiple subjects in an ensemble plot it is beneficial if all data have the same length. For that reason, data can be cut to the same length using this function.

Parameters
  • study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.

  • phases (list of str, optional) – list of phases if only a subset of phases should be cut or None to cut all phases. Default: None

Returns

StudyDataDict with data cut to the shortest duration in each phase

Return type

StudyDataDict
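
The cutting step can be sketched with lists standing in for time-series dataframes. The helper cut_to_shortest is hypothetical, and it cuts by number of samples; whether the library cuts by sample count or by time index is not stated here.

```python
def cut_to_shortest(study_data_dict, phases=None):
    """Cut all series of a phase to the length of the shortest series in that phase."""
    # None means: cut every phase
    phases = list(study_data_dict) if phases is None else phases
    result = dict(study_data_dict)
    for phase in phases:
        subject_dict = study_data_dict[phase]
        min_len = min(len(data) for data in subject_dict.values())
        result[phase] = {subject: data[:min_len] for subject, data in subject_dict.items()}
    return result


study_data_dict = {
    "phase_1": {"subject1": [1, 2, 3, 4], "subject2": [5, 6, 7]},
    "phase_2": {"subject1": [1, 2], "subject2": [3, 4, 5]},
}
cut_all = cut_to_shortest(study_data_dict)
cut_subset = cut_to_shortest(study_data_dict, phases=["phase_2"])
print(cut_all)
```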

biopsykit.utils.data_processing.merge_study_data_dict(study_data_dict, dict_levels=None)[source]

Merge inner dictionary level of StudyDataDict into one dataframe.

This function removes the inner level of the nested StudyDataDict by merging data from all subjects into one dataframe for each phase.

Note

To merge data from different subjects into one dataframe, the data are all expected to have the same length! If this is not the case, all data need to be cut to equal length first, e.g., using cut_phases_to_shortest().

Parameters
  • study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.

  • dict_levels (list of str, optional) – list with names of dictionary levels. Default: None

Returns

MergedStudyDataDict with data of all subjects merged into one dataframe for each phase

Return type

MergedStudyDataDict

biopsykit.utils.data_processing.split_dict_into_subphases(data_dict, subphases)[source]

Split dataframes in a nested dictionary into subphases.

By further splitting a dataframe into subphases a new dictionary level is created. The new dictionary level then contains the subphases with their data.

Note

If the duration of the last subphase is unknown (e.g., because it has variable length) this can be indicated by setting the duration of this subphase to 0. The duration of this subphase will then be inferred from the data.

Parameters
  • data_dict (dict) – dictionary with an arbitrary number of outer levels (e.g., conditions, phases, etc.) as keys and dataframes with data to be split into subphases as values

  • subphases (dict) – dictionary with subphase names (keys) and subphase durations (values) in seconds

Returns

dictionary where each dataframe in the dictionary is split into the subphases specified by subphases

Return type

dict
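
How splitting adds a new dictionary level can be sketched as follows. Both helpers are hypothetical, lists stand in for dataframes, and the sketch assumes one sample per second so that durations in seconds equal sample counts. A duration of 0 for the last subphase means "until the end of the data".

```python
def split_series(samples, subphases):
    """Split one series into consecutive subphases given their durations."""
    result, start = {}, 0
    for name, duration in subphases.items():
        # a duration of 0 marks a subphase that runs to the end of the data
        end = len(samples) if duration == 0 else start + duration
        result[name] = samples[start:end]
        start = end
    return result


def split_nested(data_dict, subphases):
    """Apply split_series to every leaf of a nested dictionary."""
    if isinstance(data_dict, dict):
        return {key: split_nested(value, subphases) for key, value in data_dict.items()}
    return split_series(data_dict, subphases)


data_dict = {"phase_1": list(range(10)), "phase_2": list(range(8))}
split = split_nested(data_dict, {"Start": 3, "Middle": 4, "End": 0})
print(split)
```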

biopsykit.utils.data_processing.get_subphase_durations(data, subphases)[source]

Compute subphase durations from dataframe.

The subphases can be specified in two different ways:

  • If the dictionary entries in subphases are integers, it’s assumed that subphases are consecutive, i.e., each subphase begins right after the previous one, and the entries indicate the durations of each subphase. The start and end times of each subphase will then be computed from the subphase durations.

  • If the dictionary entries in subphases are tuples, it’s assumed that the start and end times of each subphase are directly provided.

Note

If the duration of the last subphase is unknown (e.g., because it has variable length) this can be indicated by setting the duration of this subphase to 0. The duration of this subphase will then be inferred from the data.

Parameters
  • data (DataFrame) – dataframe with data from one phase. Used to compute the duration of the last subphase if this subphase is expected to have variable duration.

  • subphases (dict) – dictionary with subphase names as keys and subphase durations (as integers) or start and end times (as tuples of integers) as values in seconds

Returns

list with start and end times of each subphase in seconds relative to beginning of the phase

Return type

list

Examples

>>> from biopsykit.utils.data_processing import get_subphase_durations
>>> # Option 1: Subphases consecutive, subphase durations provided
>>> get_subphase_durations(data, {"Start": 60, "Middle": 120, "End": 60})
>>> # Option 2: Subphase start and end times provided
>>> get_subphase_durations(data, {"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
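
The two options, including the inferred last subphase, can be sketched in plain Python. The helper subphase_times is hypothetical and takes the total phase length in seconds (data_length) instead of a dataframe; the exact form of the library's return value may differ.

```python
def subphase_times(data_length, subphases):
    """Return (start, end) times in seconds for each subphase.

    `data_length` is the total phase length in seconds (assumption of this sketch).
    """
    values = list(subphases.values())
    if all(isinstance(value, tuple) for value in values):
        # Option 2: start and end times are provided directly
        return values
    # Option 1: consecutive subphases, durations provided
    times, start = [], 0
    for duration in values:
        # a duration of 0 means: infer the end from the length of the data
        end = data_length if duration == 0 else start + duration
        times.append((start, end))
        start = end
    return times


consecutive = subphase_times(300, {"Start": 60, "Middle": 120, "End": 0})
explicit = subphase_times(300, {"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
print(consecutive)
print(explicit)
```
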
biopsykit.utils.data_processing.add_subject_conditions(data, condition_list)[source]

Add subject conditions to dataframe.

This function expects a dataframe with data from multiple subjects and information on which subject belongs to which condition.

Parameters
  • data (DataFrame) – dataframe to which a new index level condition with the subject conditions should be added

  • condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition

Returns

dataframe with new index level condition indicating which subject belongs to which condition

Return type

DataFrame

biopsykit.utils.data_processing.split_subject_conditions(data_dict, condition_list)[source]

Split dictionary with data based on conditions subjects were assigned to.

This function adds a new outer dictionary level with the different conditions as keys and dictionaries belonging to the conditions as values. For that, it expects a dictionary with data from multiple subjects and information on which subject belongs to which condition.

Parameters
  • data_dict (dict) – (nested) dictionary with data which should be split based on the conditions subjects belong to

  • condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition

Returns

dictionary with additional outer level indicating which subject belongs to which condition

Return type

dict
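
Adding an outer condition level can be sketched as follows. The helper split_by_condition is hypothetical and strings stand in for the data. The sketch assumes that the condition information maps each condition name to a list of subject IDs and that the outer keys of data_dict are subject IDs; neither format is specified on this page.

```python
def split_by_condition(data_dict, condition_dict):
    """Group the subject entries of `data_dict` under their condition."""
    return {
        condition: {subject: data_dict[subject] for subject in subjects if subject in data_dict}
        for condition, subjects in condition_dict.items()
    }


data_dict = {"subject1": "df_1", "subject2": "df_2", "subject3": "df_3"}
# assumed format: condition name -> list of subject IDs
condition_dict = {"Control": ["subject1", "subject3"], "Intervention": ["subject2"]}
split = split_by_condition(data_dict, condition_dict)
print(split)
```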

biopsykit.utils.data_processing.mean_per_subject_dict(data, dict_levels, param_name)[source]

Compute mean values of time-series data from a nested dictionary.

This function computes the mean value of time-series data in a nested dictionary per subject and combines it into a joint dataframe. The dictionary will be traversed recursively and can thus have arbitrary depth. The innermost level must contain dataframes with time-series data of which mean values will be computed. The names of the dictionary levels are specified by dict_levels.

Parameters
  • data (dict) – nested dictionary with data on which mean should be computed. The number of nested levels must match the number of levels specified in dict_levels.

  • dict_levels (list of str) – list with names of dictionary levels.

  • param_name (str) – type of data from which mean values will be computed. This will also be the column name in the resulting dataframe.

Returns

dataframe with dict_levels as index levels and mean values of time-series data as column values

Return type

DataFrame
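
The recursive traversal with named levels can be sketched in plain Python. The helper mean_per_leaf is hypothetical, lists stand in for time-series dataframes, and the result is a list of flat records rather than a dataframe with dict_levels as index levels.

```python
from statistics import mean


def mean_per_leaf(data, dict_levels, param_name, prefix=()):
    """Collect one record per innermost series, holding its level keys and mean."""
    if isinstance(data, dict):
        rows = []
        for key, value in data.items():
            # remember the key of the current level while descending
            rows.extend(mean_per_leaf(value, dict_levels, param_name, prefix + (key,)))
        return rows
    # leaf reached: name the collected keys after the dictionary levels
    row = dict(zip(dict_levels, prefix))
    row[param_name] = mean(data)
    return [row]


data = {
    "subject1": {"phase_1": [60, 62], "phase_2": [70, 74]},
    "subject2": {"phase_1": [58, 60]},
}
rows = mean_per_leaf(data, dict_levels=["subject", "phase"], param_name="HR")
print(rows)
```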

biopsykit.utils.data_processing.mean_se_per_phase(data)[source]

Compute mean and standard error over all subjects in a dataframe.

Note

The dataframe in data is expected to have a MultiIndex with at least two levels, one of the levels being the level “subject”!

Parameters

data (DataFrame) – dataframe with MultiIndex from which to compute mean and standard error

Returns

dataframe with mean and standard error over all subjects

Return type

DataFrame
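
Mean and standard error over subjects can be sketched as follows. The helper mean_se and the values are hypothetical, and the sketch assumes the standard error is the sample standard deviation (n - 1 in the denominator) divided by the square root of the number of subjects.

```python
from math import sqrt
from statistics import mean, stdev


def mean_se(values):
    """Return mean and standard error of the mean of `values`."""
    return mean(values), stdev(values) / sqrt(len(values))


# hypothetical per-subject values for each phase
hr_by_phase = {
    "phase_1": {"subject1": 60.0, "subject2": 64.0, "subject3": 62.0},
    "phase_2": {"subject1": 70.0, "subject2": 70.0, "subject3": 70.0},
}

# aggregate over all subjects within each phase
summary = {phase: mean_se(list(subjects.values())) for phase, subjects in hr_by_phase.items()}
print(summary)
```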