Withings Sleep Analyzer Data Import

This example explains how to import and parse data retrieved from the Withings Health Mate app.

Note: This notebook only illustrates how to approach such a data wrangling problem in general. The full code (and much more!) is readily available in BioPsyKit: biopsykit.io.sleep_analyzer.load_withings_sleep_analyzer_raw_file().

Setup and Helper Functions

[1]:
from ast import literal_eval

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from fau_colors import cmaps

import biopsykit as bp

%matplotlib inline
%load_ext autoreload
%autoreload 2
[2]:
plt.close("all")

tz = "Europe/Berlin"

palette = sns.color_palette(cmaps.faculties)
sns.set_theme(context="notebook", style="ticks", font="sans-serif", palette=palette)

plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["mathtext.default"] = "regular"

palette

Data Import

Read Data from File

Load the example data (or read your own CSV file into a dataframe using pandas.read_csv()).

[3]:
data = bp.example_data.get_sleep_analyzer_raw_file_unformatted(data_source="heart_rate")
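If you start from your own export instead of the bundled example, the raw file can be read with pandas.read_csv(). The following is a minimal sketch: the CSV content is a hypothetical two-row excerpt modeled on the table shown in this notebook and is read from an in-memory buffer; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical excerpt of a raw export; the list-like cells are quoted
# here because they contain commas themselves.
csv_text = (
    "start,duration,value\n"
    '2020-10-23T22:03:00+02:00,"[60,60]","[66,65]"\n'
    '2020-10-23T22:11:00+02:00,"[60]","[61]"\n'
)

# With a real file, pass the path instead of the StringIO buffer.
raw = pd.read_csv(io.StringIO(csv_text))

# Every cell is still a plain string at this point.
print(raw.dtypes)
print(type(raw.loc[0, "value"]))
```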

We first want to get an impression of what the data looks like by displaying it. In Jupyter Notebooks, if a cell ends with the name of a variable or with the unassigned result of a statement, Jupyter displays that value (in a nice layout) without the need for a print statement.

For example, you can call data to display the whole dataframe or data.head() to display only its first rows.

We see that we have three columns: a ‘start’ column with timestamps, a ‘duration’ column, and a ‘value’ column. We can read this data row-wise as follows: beginning at time ‘start’, the ‘value’ column lists consecutive heart rate values, and the entry at the same position in the ‘duration’ column states how long (in seconds) each value lasts.

[4]:
data.head()
[4]:
start duration value
0 2020-10-23T22:03:00+02:00 [60,60] [66,65]
1 2020-10-23T22:08:00+02:00 [60,60] [65,60]
2 2020-10-23T22:11:00+02:00 [60] [61]
3 2020-10-23T22:16:00+02:00 [60] [56]
4 2020-10-23T22:18:00+02:00 [60] [67]

Data type conversion

All values are imported as strings, so we need to convert these into the correct data types:

  • The string timestamps in the ‘start’ column are converted into datetime objects, which offer extensive functions for handling time series data.

  • The lists in the ‘duration’ and ‘value’ columns are also stored as strings, so we need to convert them into actual lists of numbers. Googling “pandas convert string to array” leads us to this StackOverflow post https://stackoverflow.com/questions/23119472/in-pandas-python-reading-array-stored-as-string, where the accepted answer suggests this:

from ast import literal_eval
df['col2'] = df['col2'].apply(literal_eval)

Finally, we set the ‘start’ column as the new index of the dataframe, sort the data by the index, and rename the index to ‘time’.

[5]:
print(f"Before: {[type(value) for value in data.iloc[0]]}")

data["start"] = pd.to_datetime(data["start"])
data["duration"] = data["duration"].apply(literal_eval)
data["value"] = data["value"].apply(literal_eval)

print(f"After: {[type(value) for value in data.iloc[0]]}")

data = data.set_index("start").sort_index()
# rename index
data.index.name = "time"
Before: [<class 'str'>, <class 'str'>, <class 'str'>]
After: [<class 'pandas._libs.tslibs.timestamps.Timestamp'>, <class 'list'>, <class 'list'>]
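The rows displayed above all carry the same UTC offset (+02:00). If a recording spans a daylight saving time change, the strings carry mixed offsets. One way to handle this, sketched here with two made-up timestamps, is to parse everything as UTC and convert to the target time zone afterwards.

```python
import pandas as pd

tz = "Europe/Berlin"

# Made-up timestamps on either side of the switch from CEST to CET (2020-10-25).
start = pd.Series(["2020-10-24T22:03:00+02:00", "2020-10-25T22:03:00+01:00"])

# utc=True yields one tz-aware dtype regardless of the individual offsets.
start_utc = pd.to_datetime(start, utc=True)
start_local = start_utc.dt.tz_convert(tz)

print(start_local)
```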

Our data now looks like this:

[6]:
data.head()
[6]:
duration value
time
2020-10-11 02:04:00+02:00 [60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6... [52, 61, 65, 66, 65, 68, 65, 67, 67, 67, 65, 6...
2020-10-11 05:04:00+02:00 [60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6... [61, 61, 63, 63, 55, 72, 72, 73, 73, 73, 73, 7...
2020-10-11 06:34:00+02:00 [60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6... [63, 63, 64, 66, 63, 63, 64, 63, 69, 64, 64, 6...
2020-10-11 08:04:00+02:00 [60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6... [63, 65, 64, 67, 64, 66, 67, 60, 60, 60, 60, 6...
2020-10-11 09:34:00+02:00 [60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6... [59, 68, 62, 63, 63, 64, 65, 68, 66, 68, 69, 7...

Explode Arrays

We now want to convert the values stored in the arrays into single values, one per row. Googling “pandas convert list of values to rows” leads us to this StackOverflow post: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe. Here, we do not use the accepted answer but the one below it:

df.explode('column')
[7]:
print("Before Explode:")
display(data["value"].head())
print("")
print("After Explode:")
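# Series.explode() takes no column name; ignore_index=True resets the index to 0, 1, 2, ...
display(data["value"].explode(ignore_index=True).head())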
Before Explode:
time
2020-10-11 02:04:00+02:00    [52, 61, 65, 66, 65, 68, 65, 67, 67, 67, 65, 6...
2020-10-11 05:04:00+02:00    [61, 61, 63, 63, 55, 72, 72, 73, 73, 73, 73, 7...
2020-10-11 06:34:00+02:00    [63, 63, 64, 66, 63, 63, 64, 63, 69, 64, 64, 6...
2020-10-11 08:04:00+02:00    [63, 65, 64, 67, 64, 66, 67, 60, 60, 60, 60, 6...
2020-10-11 09:34:00+02:00    [59, 68, 62, 63, 63, 64, 65, 68, 66, 68, 69, 7...
Name: value, dtype: object

After Explode:
0    52
1    61
2    65
3    66
4    65
Name: value, dtype: object

pd.Series.explode() operates on a single Series, i.e., on one column. To apply it to multiple columns at once, we call pd.DataFrame.apply() and pass the function as an argument.

[8]:
data_explode = data.apply(pd.Series.explode)
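As a side note, pandas 1.3 and later also accept a list of columns in pd.DataFrame.explode(), provided the lists in each row have matching lengths. The sketch below uses a small made-up frame (not the example data) to show that both routes give the same rows; the notebook continues with data_explode from the cell above.

```python
import pandas as pd

# Toy frame standing in for `data`: two list columns of matching lengths per row.
toy = pd.DataFrame(
    {"duration": [[60, 60], [60]], "value": [[66, 65], [61]]},
    index=pd.Index(["a", "b"], name="time"),
)

# Route used in this notebook: explode every column separately.
via_apply = toy.apply(pd.Series.explode)
# Multi-column explode needs pandas >= 1.3 and equal list lengths within each row.
via_explode = toy.explode(["duration", "value"])

print(via_explode)
```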

Our dataframe now looks like this:

[9]:
data_explode.head()
[9]:
duration value
time
2020-10-11 02:04:00+02:00 60 52
2020-10-11 02:04:00+02:00 60 61
2020-10-11 02:04:00+02:00 60 65
2020-10-11 02:04:00+02:00 60 66
2020-10-11 02:04:00+02:00 60 65

However, we now see that the timestamp is the same for each exploded value. The documentation of explode() says the following:

Transform each element of a list-like to a row, *replicating* index values.

To get the correct timestamps, we need to add the ‘duration’ values cumulatively to the timestamps. However, simply summing up all values in ‘duration’ would not work: the cumulative sum must restart for every original row, i.e., it must only be computed within rows that share the same timestamp. One way to achieve this is to group the data by timestamp using pd.DataFrame.groupby(), passing the index name (i.e., time) as the grouping key, and to apply our own function to each group.

[10]:
def explode_timestamps(df):
    # sum up the time durations and subtract the first value from it (so that we start from 0)
    # dur_sum then looks like this: [0, 60, 120, 180, ...]
    dur_sum = df["duration"].cumsum() - df["duration"].iloc[0]
    # Add these time durations to the index timestamps.
    # For that, we need to convert the datetime objects from the pandas DatetimeIndex into a float and add the time onto it
    # (we first need to multiply it with 10^9 because the time in the index is stored in nanoseconds)
    index_sum = df.index.values.astype(float) + 1e9 * dur_sum
    # convert the float values back into a DatetimeIndex
    df["time"] = pd.to_datetime(index_sum)
    # set this as index and convert it back into the right time zone
    df = df.set_index("time")
    df = df.tz_localize("UTC").tz_convert(tz)
    # we don't need the duration column anymore so we can drop it
    df = df.drop(columns="duration")
    return df
[11]:
# call groupby and apply our custom function on each group
df_hr = data_explode.groupby("time", group_keys=False).apply(explode_timestamps)
# rename the value column
df_hr.columns = ["heart_rate"]

df_hr
[11]:
heart_rate
time
2020-10-11 02:04:00+02:00 52
2020-10-11 02:05:00+02:00 61
2020-10-11 02:06:00+02:00 65
2020-10-11 02:07:00+02:00 66
2020-10-11 02:08:00+02:00 65
... ...
2020-10-24 10:29:00+02:00 68
2020-10-24 10:30:00+02:00 70
2020-10-24 10:31:00+02:00 70
2020-10-24 10:32:00+02:00 68
2020-10-24 10:33:00+02:00 64

4863 rows × 1 columns
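The function above converts the timestamps to floats and back. An alternative sketch, which is not the implementation used in BioPsyKit, adds the elapsed time as a timedelta instead, so the time zone stays attached throughout. The helper name add_elapsed_time is hypothetical and the input is made-up toy data. Note one deliberate difference: the offset of each sample is the sum of all preceding durations, which matches explode_timestamps whenever all durations within a group are identical (as in the rows shown here).

```python
import pandas as pd


def add_elapsed_time(df: pd.DataFrame) -> pd.DataFrame:
    """Shift replicated start timestamps by the time elapsed within each group."""
    durations = df["duration"].astype(int)
    # Sum of all *preceding* durations among rows sharing the same start timestamp.
    elapsed = durations.groupby(level=0).cumsum().to_numpy() - durations.to_numpy()
    out = df.drop(columns="duration")
    # 'duration' is interpreted as seconds, as in the cell above.
    out.index = df.index + pd.to_timedelta(elapsed, unit="s")
    out.index.name = "time"
    return out


# Toy input mimicking the exploded frame: start timestamps are replicated.
idx = pd.to_datetime(
    ["2020-10-11T02:04:00+02:00"] * 3 + ["2020-10-11T05:04:00+02:00"] * 2
)
toy = pd.DataFrame(
    {"duration": [60] * 5, "value": [52, 61, 65, 61, 61]},
    index=idx.rename("time"),
)

result = add_elapsed_time(toy)
print(result)
```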

Filtering and plotting

Filter data by day

Assume we want to keep only the data from a particular date, e.g., Oct 11, 2020.

For this, we can slice the index to only include data from this particular date by doing the following steps:

  • Normalize the DatetimeIndex (set the time of every timestamp to midnight)

  • Filter for the desired day

  • Slice the DataFrame

[12]:
df_hr_day = df_hr.loc[df_hr.index.normalize() == "2020-10-11"]
[13]:
df_hr_day
[13]:
heart_rate
time
2020-10-11 02:04:00+02:00 52
2020-10-11 02:05:00+02:00 61
2020-10-11 02:06:00+02:00 65
2020-10-11 02:07:00+02:00 66
2020-10-11 02:08:00+02:00 65
... ...
2020-10-11 10:00:00+02:00 63
2020-10-11 10:01:00+02:00 71
2020-10-11 10:02:00+02:00 65
2020-10-11 10:08:00+02:00 67
2020-10-11 10:09:00+02:00 65

685 rows × 1 columns
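The same selection can also be written with label-based partial string indexing, which pandas supports on a DatetimeIndex. A minimal sketch on made-up data around midnight, comparing it with the normalize-based mask (the date string is interpreted in the time zone of the index):

```python
import pandas as pd

tz = "Europe/Berlin"

# Made-up minute-wise samples starting shortly before midnight.
idx = pd.date_range("2020-10-10 23:58", periods=5, freq="min", tz=tz)
toy = pd.DataFrame({"heart_rate": [60, 61, 62, 63, 64]}, index=idx)

# Mask-based filtering, following the three steps listed above.
mask_based = toy.loc[toy.index.normalize() == pd.Timestamp("2020-10-11", tz=tz)]

# Partial string indexing: select all rows that fall on the given date.
label_based = toy.loc["2020-10-11"]

print(label_based)
```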

Plot this data as an example

[14]:
fig, ax = plt.subplots()
df_hr_day.plot(ax=ax)

ax.legend().remove()
ax.set_ylabel("Heart Rate [bpm]")
ax.set_xlabel("Time");
[Figure: heart rate (y-axis: "Heart Rate [bpm]") over time (x-axis: "Time") for 2020-10-11]

That’s it!

This code is also available in BioPsyKit and can be used like this:

[15]:
sleep_data = bp.example_data.get_sleep_analyzer_raw_example()
/home/docs/checkouts/readthedocs.org/user_builds/biopsykit/envs/latest/lib/python3.10/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
[16]:
sleep_data.keys()
[16]:
dict_keys([np.str_('2020-10-10'), np.str_('2020-10-11'), np.str_('2020-10-12'), np.str_('2020-10-15'), np.str_('2020-10-17'), np.str_('2020-10-19'), np.str_('2020-10-21'), np.str_('2020-10-23')])
[17]:
sleep_data["2020-10-10"].head()
[17]:
heart_rate respiration_rate sleep_state snoring
time
2020-10-11 02:04:00+02:00 52.0 15.0 0.0 0.0
2020-10-11 02:05:00+02:00 61.0 9.0 0.0 0.0
2020-10-11 02:06:00+02:00 65.0 16.0 0.0 0.0
2020-10-11 02:07:00+02:00 66.0 17.0 0.0 0.0
2020-10-11 02:08:00+02:00 65.0 11.0 0.0 0.0
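Because the result is a dictionary of dataframes keyed by date, it can be combined into a single dataframe with pd.concat() when per-date processing is not needed. The sketch below uses a made-up dictionary (toy_sleep_data and make_night are hypothetical names, and the values of the second entry are invented) in place of sleep_data.

```python
import pandas as pd


def make_night(start: str, values: list) -> pd.DataFrame:
    """Build a minute-wise toy dataframe (hypothetical helper for this sketch)."""
    idx = pd.date_range(start, periods=len(values), freq="min", tz="Europe/Berlin")
    return pd.DataFrame({"heart_rate": values}, index=idx.rename("time"))


# Toy stand-in for `sleep_data`: date keys mapping to dataframes.
toy_sleep_data = {
    "2020-10-10": make_night("2020-10-11 02:04", [52.0, 61.0, 65.0]),
    "2020-10-11": make_night("2020-10-12 01:30", [58.0, 60.0]),
}

# The dictionary keys become the outer level of a MultiIndex.
combined = pd.concat(toy_sleep_data, names=["date", "time"])

# Example: mean heart rate per dictionary key.
mean_hr = combined.groupby(level="date")["heart_rate"].mean()
print(mean_hr)
```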

Load only a specific data source (here, from our example data):

[18]:
sleep_state_data = bp.example_data.get_sleep_analyzer_raw_file("sleep_state")

Alternatively: Load your own Sleep Analyzer raw data

[19]:
# sleep_state_data = bp.io.sleep_analyzer.load_withings_sleep_analyzer_raw_file(
#    "<path-to-sleep-analyzer-raw-file.csv>",
#    data_source="sleep_state"
# )
[20]:
sleep_state_data["2020-10-10"].head()
[20]:
sleep_state
time
2020-10-11 02:04:00+02:00 0.0
2020-10-11 02:05:00+02:00 0.0
2020-10-11 02:06:00+02:00 0.0
2020-10-11 02:07:00+02:00 0.0
2020-10-11 02:08:00+02:00 0.0