biopsykit.utils.data_processing module
Module providing various functions for processing more complex structured data (e.g., collected during a study).
- biopsykit.utils.data_processing.split_data(data, time_intervals, include_start=False)
Split data into different phases based on time intervals.
The start and end times of the phases are provided via the time_intervals parameter, which can be a Series, one row of a DataFrame, or a dictionary with start and end times per phase.
- Parameters
data (DataFrame) – data to be split
time_intervals (dict or Series or DataFrame) – time intervals indicating where the data should be split, given as a dictionary, a Series, or one row of a DataFrame
include_start (bool, optional) – True to include data from the beginning of the recording to the start of the first phase as the first phase (this phase will be named “Start”), False to discard this data. Default: False
- Returns
dictionary with data split into different phases
- Return type
Examples
>>> import pandas as pd
>>> from biopsykit.utils.data_processing import split_data
>>> # read pandas dataframe from csv file and split data based on time interval dictionary
>>> data = pd.read_csv("path-to-file.csv")
>>> # Example 1: define time intervals (start and end) of the different recording phases as dictionary
>>> time_intervals = {"Part1": ("09:00", "09:30"), "Part2": ("09:30", "09:45"), "Part3": ("09:45", "10:00")}
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>> # Example 2: define time intervals as pandas Series. Here, only the start times of the phases are required;
>>> # it is assumed that the phases are back to back
>>> time_intervals = pd.Series(data=["09:00", "09:30", "09:45", "10:00"], index=["Part1", "Part2", "Part3", "End"])
>>> data_dict = split_data(data=data, time_intervals=time_intervals)
>>>
>>> # Example: Get Part 2 of data_dict
>>> print(data_dict['Part2'])
- biopsykit.utils.data_processing.exclude_subjects(excluded_subjects, index_name='subject', **kwargs)
Exclude subjects from dataframes.
This function can be used to exclude subject IDs for later analysis from different kinds of dataframes, such as:
dataframes with subject condition information (SubjectConditionDataFrame)
dataframes with time log information
dataframes with (processed) data (e.g., biopsykit.utils.datatype_helper.SalivaRawDataFrame or MeanSeDataFrame)
All dataframes can be supplied at once via **kwargs.
- Parameters
- Returns
dictionary with cleaned versions of the dataframes passed to the function via **kwargs, or a single dataframe if the function was called with only one dataframe
- Return type
DataFrame or dict of such
- biopsykit.utils.data_processing.normalize_to_phase(subject_data_dict, phase)
Normalize time series data per subject to the phase specified by phase.
The result is the relative change (of, for example, heart rate) compared to the mean value in phase.
- Parameters
subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e., a dictionary with a PhaseDict for each subject
phase (str or DataFrame) – phase to normalize all other data to. If phase is a string then it is interpreted as the name of a phase present in subject_data_dict. If phase is a DataFrame then the data will be normalized (per subject) to the mean value of the DataFrame.
- Returns
dictionary with normalized data per subject
- Return type
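As a rough illustration of the normalization performed by normalize_to_phase(), the following standard-library sketch computes the relative change of each sample with respect to the mean of a reference phase. It uses plain lists instead of dataframes, the helper name `normalize_to_phase_sketch` is hypothetical, and expressing the change as a fraction (rather than, for example, percent) is an assumption; this is not the library's implementation.

```python
from statistics import mean


def normalize_to_phase_sketch(subject_data, phase):
    """Return the relative change of every sample w.r.t. the mean of ``phase``.

    ``subject_data`` maps subject IDs to {phase name: list of values}.
    """
    normalized = {}
    for subject, phase_dict in subject_data.items():
        baseline = mean(phase_dict[phase])  # per-subject reference value
        normalized[subject] = {
            name: [(value - baseline) / baseline for value in values]
            for name, values in phase_dict.items()
        }
    return normalized


# Made-up heart rate values for two subjects and two phases
heart_rate = {
    "subject1": {"Baseline": [60.0, 60.0], "Stress": [75.0, 90.0]},
    "subject2": {"Baseline": [80.0, 80.0], "Stress": [100.0, 80.0]},
}
result = normalize_to_phase_sketch(heart_rate, "Baseline")
print(result["subject1"]["Stress"])
```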
- biopsykit.utils.data_processing.resample_sec(data)
Resample input data to a frequency of 1 Hz.
Note
For resampling, the index of data has to be either a DatetimeIndex or an Index with time information in seconds.
- Parameters
data (DataFrame or Series) – data to resample. Index of data needs to be a DatetimeIndex
- Returns
dataframe with data resampled to 1 Hz
- Return type
- Raises
ValueError – If data is not a DataFrame or Series
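To illustrate what resampling to 1 Hz means for data indexed in seconds, the sketch below linearly interpolates irregularly spaced samples onto whole seconds. Linear interpolation is an assumption made for this example; the description above does not state which method resample_sec() uses. The helper `interpolate_to_1hz` is hypothetical and works on plain lists.

```python
import math


def interpolate_to_1hz(times, values):
    """Linearly interpolate samples onto whole seconds.

    ``times`` must be strictly increasing and contain at least two entries.
    """
    start = math.ceil(times[0])
    stop = math.floor(times[-1])
    seconds = list(range(start, stop + 1))
    resampled = []
    idx = 0
    for second in seconds:
        # advance to the segment [times[idx], times[idx + 1]] containing `second`
        while times[idx + 1] < second:
            idx += 1
        t0, t1 = times[idx], times[idx + 1]
        v0, v1 = values[idx], values[idx + 1]
        weight = (second - t0) / (t1 - t0)
        resampled.append(v0 + weight * (v1 - v0))
    return seconds, resampled


seconds, resampled = interpolate_to_1hz([0.0, 0.5, 1.5, 2.0], [0.0, 1.0, 3.0, 4.0])
print(seconds, resampled)
```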
- biopsykit.utils.data_processing.resample_dict_sec(data_dict)
Resample all data in the dictionary to 1 Hz.
This function recursively looks for all dataframes in the dictionary and resamples the data to 1 Hz using resample_sec().
- Parameters
data_dict (dict) – nested dictionary with data to be resampled
- Returns
nested dictionary with data resampled to 1 Hz
- Return type
See also
resample_sec()
resample dataframe to 1 Hz
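The recursive traversal used by resample_dict_sec() can be sketched without any library dependency. The helper `apply_to_leaves` and the stand-in function `every_second_sample` are hypothetical; in the real function the leaves are dataframes and the leaf operation is resample_sec().

```python
def apply_to_leaves(data, func):
    """Apply ``func`` to every non-dict value of a nested dictionary."""
    if isinstance(data, dict):
        # descend one level and keep the same keys
        return {key: apply_to_leaves(value, func) for key, value in data.items()}
    return func(data)


def every_second_sample(values):
    """Stand-in for a resampling step: keep every second sample."""
    return values[::2]


nested = {
    "subject1": {"phase_1": [1, 2, 3, 4], "phase_2": [5, 6]},
    "subject2": {"phase_1": [7, 8, 9, 10]},
}
resampled_dict = apply_to_leaves(nested, every_second_sample)
print(resampled_dict)
```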
- biopsykit.utils.data_processing.select_dict_phases(subject_data_dict, phases)
Select specific phases from SubjectDataDict.
- Parameters
subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with PhaseDict for each subject
phases (list of str) – list of phases to select
- Returns
SubjectDataDict containing only the phases of interest
- Return type
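Conceptually, the phase selection done by select_dict_phases() is a nested dictionary comprehension. The sketch below uses lists as stand-ins for dataframes; the helper `select_phases` is hypothetical, and raising a KeyError for a phase that a subject does not have is an assumption of this example.

```python
def select_phases(subject_data, phases):
    """Keep only the requested phases for every subject."""
    return {
        subject: {name: phase_dict[name] for name in phases}
        for subject, phase_dict in subject_data.items()
    }


subject_data = {
    "subject1": {"phase_1": [1, 2], "phase_2": [3, 4], "phase_3": [5, 6]},
    "subject2": {"phase_1": [7, 8], "phase_2": [9, 10], "phase_3": [11, 12]},
}
selected = select_phases(subject_data, ["phase_1", "phase_3"])
print(list(selected["subject1"]))
```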
- biopsykit.utils.data_processing.rearrange_subject_data_dict(subject_data_dict)
Rearrange SubjectDataDict to StudyDataDict.
A StudyDataDict is constructed from a SubjectDataDict by swapping outer (subject IDs) and inner (phase names) dictionary keys.
The input needs to be a SubjectDataDict, a nested dictionary in the following format:
{
    "subject1": {"phase_1": dataframe, "phase_2": dataframe, ...},
    "subject2": {"phase_1": dataframe, "phase_2": dataframe, ...},
    ...
}
The output format will be the following:
{
    "phase_1": {"subject1": dataframe, "subject2": dataframe, ...},
    "phase_2": {"subject1": dataframe, "subject2": dataframe, ...},
    ...
}
- Parameters
subject_data_dict (SubjectDataDict) – SubjectDataDict, i.e. a dictionary with data from multiple subjects, each containing data from multiple phases (in form of a PhaseDict)
- Returns
rearranged SubjectDataDict, i.e., a StudyDataDict
- Return type
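The key swap performed by rearrange_subject_data_dict() can be sketched in a few lines of plain Python. The helper `swap_dict_levels` is hypothetical and strings stand in for dataframes; it only illustrates the rearrangement of the two dictionary levels.

```python
def swap_dict_levels(subject_data):
    """Swap outer (subject) and inner (phase) keys of a two-level dict."""
    study_data = {}
    for subject, phase_dict in subject_data.items():
        for phase, data in phase_dict.items():
            # create the phase entry on first use, then file the subject under it
            study_data.setdefault(phase, {})[subject] = data
    return study_data


subject_dict = {
    "subject1": {"phase_1": "df_1_1", "phase_2": "df_1_2"},
    "subject2": {"phase_1": "df_2_1", "phase_2": "df_2_2"},
}
study_dict = swap_dict_levels(subject_dict)
print(study_dict)
```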
- biopsykit.utils.data_processing.cut_phases_to_shortest(study_data_dict, phases=None)
Cut time-series data to shortest duration of a subject in each phase.
To overlay time-series data from multiple subjects in an ensemble plot it is beneficial if all data have the same length. For that reason, data can be cut to the same length using this function.
- Parameters
study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.
phases (list of str, optional) – list of phases if only a subset of phases should be cut or None to cut all phases. Default: None
- Returns
StudyDataDict with data cut to the shortest duration in each phase
- Return type
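The idea behind cut_phases_to_shortest() can be sketched as follows, with lists standing in for time-series dataframes. The helper `cut_to_shortest` is hypothetical, always cuts all phases (the optional phases argument is left out), and measures duration as the number of samples, which is an assumption of this example.

```python
def cut_to_shortest(study_data):
    """Truncate every subject's data to the shortest length within each phase."""
    result = {}
    for phase, subject_dict in study_data.items():
        shortest = min(len(values) for values in subject_dict.values())
        result[phase] = {
            subject: values[:shortest] for subject, values in subject_dict.items()
        }
    return result


study_data = {
    "phase_1": {"subject1": [1, 2, 3, 4], "subject2": [5, 6, 7]},
    "phase_2": {"subject1": [8, 9], "subject2": [10, 11, 12]},
}
cut_data = cut_to_shortest(study_data)
print(cut_data)
```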
- biopsykit.utils.data_processing.merge_study_data_dict(study_data_dict, dict_levels=None)
Merge inner dictionary level of StudyDataDict into one dataframe.
This function removes the inner level of the nested StudyDataDict by merging data from all subjects into one dataframe for each phase.
Note
To merge data from different subjects into one dataframe, the data are all expected to have the same length! If this is not the case, all data need to be cut to equal length first, e.g. using cut_phases_to_shortest().
- Parameters
study_data_dict (StudyDataDict) – StudyDataDict, i.e. a dictionary with data from multiple phases, each phase containing data from different subjects.
dict_levels (list of str) – list with names of dictionary levels.
- Returns
MergedStudyDataDict with data of all subjects merged into one dataframe for each phase
- Return type
- biopsykit.utils.data_processing.split_dict_into_subphases(data_dict, subphases)
Split dataframes in a nested dictionary into subphases.
By further splitting a dataframe into subphases a new dictionary level is created. The new dictionary level then contains the subphases with their data.
Note
If the duration of the last subphase is unknown (e.g., because it has variable length) this can be indicated by setting the duration of this subphase to 0. The duration of this subphase will then be inferred from the data.
- Parameters
- Returns
dictionary where each dataframe in the dictionary is split into the subphases specified by subphases
- Return type
- biopsykit.utils.data_processing.get_subphase_durations(data, subphases)
Compute subphase durations from dataframe.
The subphases can be specified in two different ways:
If the dictionary entries in subphases are integers, it’s assumed that subphases are consecutive, i.e., each subphase begins right after the previous one, and the entries indicate the durations of each subphase. The start and end times of each subphase will then be computed from the subphase durations.
If the dictionary entries in subphases are tuples, it’s assumed that the start and end times of each subphase are directly provided.
Note
If the duration of the last subphase is unknown (e.g., because it has variable length) this can be indicated by setting the duration of this subphase to 0. The duration of this subphase will then be inferred from the data.
- Parameters
data (DataFrame) – dataframe with data from one phase. Used to compute the duration of the last subphase if this subphase is expected to have variable duration.
subphases (dict) – dictionary with subphase names as keys and subphase durations (as integer) or start and end times (as tuples of integer) as values in seconds
- Returns
list with start and end times of each subphase in seconds relative to beginning of the phase
- Return type
Examples
>>> from biopsykit.utils.data_processing import get_subphase_durations
>>> # Option 1: Subphases consecutive, subphase durations provided
>>> get_subphase_durations(data, {"Start": 60, "Middle": 120, "End": 60})
>>> # Option 2: Subphase start and end times provided
>>> get_subphase_durations(data, {"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
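The two ways of specifying subphases, including the convention that a duration of 0 marks a last subphase of unknown length, can be sketched in plain Python. The helper `subphase_times` is hypothetical; its `total_duration` argument stands in for the length (in seconds) that the real function would derive from the dataframe.

```python
def subphase_times(total_duration, subphases):
    """Return (start, end) tuples in seconds for each subphase.

    ``total_duration`` is only used when the last duration is 0 (unknown).
    """
    values = list(subphases.values())
    if all(isinstance(value, tuple) for value in values):
        return values  # start and end times were given directly
    times = []
    start = 0
    for position, duration in enumerate(values):
        is_last = position == len(values) - 1
        # a duration of 0 for the last subphase means "until the end of the data"
        end = total_duration if (is_last and duration == 0) else start + duration
        times.append((start, end))
        start = end
    return times


consecutive = subphase_times(300, {"Start": 60, "Middle": 120, "End": 0})
explicit = subphase_times(300, {"Start": (0, 50), "Middle": (60, 160), "End": (180, 240)})
print(consecutive)
print(explicit)
```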
- biopsykit.utils.data_processing.add_subject_conditions(data, condition_list)
Add subject conditions to dataframe.
This function expects a dataframe with data from multiple subjects and information on which subject belongs to which condition.
- Parameters
data (DataFrame) – dataframe to which the new index level condition with subject conditions should be added
condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition
- Returns
dataframe with new index level condition indicating which subject belongs to which condition
- Return type
- biopsykit.utils.data_processing.split_subject_conditions(data_dict, condition_list)
Split dictionary with data based on conditions subjects were assigned to.
This function adds a new outer dictionary level with the different conditions as keys and dictionaries belonging to the conditions as values. For that, it expects a dictionary with data from multiple subjects and information on which subject belongs to which condition.
- Parameters
data_dict (dict) – (nested) dictionary with data which should be split based on the conditions subjects belong to
condition_list (SubjectConditionDict or SubjectConditionDataFrame) – SubjectConditionDict or SubjectConditionDataFrame with information on which subject belongs to which condition
- Returns
dictionary with additional outer level indicating which subject belongs to which condition
- Return type
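The regrouping done by split_subject_conditions() can be sketched for the simplest case of a flat dictionary keyed by subject ID. The helper `split_by_condition` is hypothetical, and representing the condition information as a mapping from condition name to a list of subject IDs is an assumption made for this example.

```python
def split_by_condition(data_dict, condition_dict):
    """Add an outer dictionary level keyed by condition name."""
    return {
        condition: {subject: data_dict[subject] for subject in subjects}
        for condition, subjects in condition_dict.items()
    }


data_by_subject = {"subject1": [1, 2], "subject2": [3, 4], "subject3": [5, 6]}
conditions = {"Control": ["subject1", "subject3"], "Intervention": ["subject2"]}
split_data_dict = split_by_condition(data_by_subject, conditions)
print(split_data_dict)
```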
- biopsykit.utils.data_processing.mean_per_subject_dict(data, dict_levels, param_name)
Compute mean values of time-series data from a nested dictionary.
This function computes the mean value of time-series data in a nested dictionary per subject and combines it into a joint dataframe. The dictionary will be traversed recursively and can thus have arbitrary depth. The innermost level must contain dataframes with time-series data of which mean values will be computed. The names of the dictionary levels are specified by dict_levels.
- Parameters
data (dict) – nested dictionary with data on which mean should be computed. The number of nested levels must match the number of levels specified in dict_levels.
dict_levels (list of str) – list with names of dictionary levels.
param_name (str) – type of data from which mean values will be computed. This will also be the column name in the resulting dataframe.
- Returns
dataframe with dict_levels as index levels and mean values of time-series data as column values
- Return type
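How dict_levels relates to the nesting depth in mean_per_subject_dict() can be sketched with a recursive helper that flattens a nested dictionary into rows, one per innermost entry. The helper `mean_per_leaf` is hypothetical, lists stand in for time-series dataframes, and the rows are plain dictionaries rather than a dataframe with index levels.

```python
from statistics import mean


def mean_per_leaf(data, dict_levels, param_name, keys=()):
    """Flatten a nested dict into rows of level keys plus the mean of each leaf."""
    if len(keys) == len(dict_levels):
        # all levels consumed: ``data`` is the innermost time series
        row = dict(zip(dict_levels, keys))
        row[param_name] = mean(data)
        return [row]
    rows = []
    for key, value in data.items():
        rows.extend(mean_per_leaf(value, dict_levels, param_name, keys + (key,)))
    return rows


nested_hr = {"phase_1": {"subject1": [60.0, 62.0], "subject2": [70.0, 74.0]}}
rows = mean_per_leaf(nested_hr, ["phase", "subject"], "HR")
print(rows)
```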
- biopsykit.utils.data_processing.mean_se_per_phase(data)
Compute mean and standard error over all subjects in a dataframe.
Note
The dataframe in data is expected to have a MultiIndex with at least two levels, one of the levels being the level “subject”!
- Parameters
data (DataFrame) – dataframe with MultiIndex from which to compute mean and standard error
- Returns
dataframe with mean and standard error over all subjects
- Return type
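For reference, the two statistics computed by mean_se_per_phase() can be sketched with the standard library. The helper `mean_se` is hypothetical, and using the sample standard deviation (divisor n - 1) for the standard error is an assumption of this example; the values are made up and represent one value per subject in each phase.

```python
from math import sqrt
from statistics import mean, stdev


def mean_se(values):
    """Return the mean and the standard error of the mean (sample std / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


per_phase = {"phase_1": [60.0, 64.0, 68.0], "phase_2": [70.0, 80.0, 90.0]}
summary = {phase: mean_se(values) for phase, values in per_phase.items()}
print(summary)
```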