Accuracy - Fitbit Charge 6 vs home PSG / EEG sleep lab

Introduction

Recently, in a short conversation, Rob ter Horst, known as the Quantified Scientist, highlighted a new study that analysed the accuracy of the Oura ring 3, Fitbit Sense, Dreem 2 and some research-grade accelerometers against PSG. The results are pretty interesting, and since I'm using the Fitbit Charge series as my daily wrist tracker, I can compare it against my home PSG sleep lab. The study has some limitations, and Oura didn't perform as well as in its validation study and did even worse than in my data analysis. That might be due to the study's specific population with unusual sleep patterns, and to a study protocol that was uncomfortable for participants (they had a lot of awake time and only 4 sleep cycles on average, when a healthy adult usually gets 5).

I have a collection of EEG devices: Dreem 2, Muse, Hypnodyne ZMax. In this post I go further in building my personal home sleep research lab, with an OpenBCI Cyton acquiring high-quality 500 Hz EEG from multiple brain areas (frontal, temporal, occipital) using gold-cup electrodes and conductive paste to get medical / research-grade EEG quality. This post focuses on comparing the Fitbit Charge 6 against hypnograms derived from OpenBCI multichannel EEG and from the Hypnodyne ZMax.

Summary

Here is the final confusion matrix against 100+ hours of multichannel EEG:

Deep sleep prediction is not in the EEG-grade range at 70.5%, but it is at the top of non-EEG wearables. A 30% margin of error is significant. On some nights these errors sum to zero and give a correct nightly average, but on other nights they do not, which leads to errors in trends between nights. Anyway, Deep seems to be a bit better than for the Oura ring V3 according to my analysis.
REM sleep shows good results at 77.8%: not yet in the EEG accuracy range, but really near it. It also seems to outperform the Oura ring, but it misses a lot of first REM segments.
Awake is in an okay range, not so different from Oura.
Overall, the device seems to be in a similar accuracy range to the Oura ring V3. That's a cool result, since the finger seems to be a better position than the wrist for heart rate, but Fitbit seems to handle it well.

I started using Fitbits with the Charge 4. In my analysis against Dreem 2 EEG there was no difference between the Charge 4 and the Charge 5: most sleep stages were around 70-75% accuracy, but Awake was only 45-50%. The Charge 6 keeps sleep stage accuracy at the high end of the Charge 4 and Charge 5 range and increases Awake accuracy from 50% to 75%, which is a significant improvement (at least for me)!

Overall this is good news. The Fitbit Charge 6 improves Awake detection but still struggles to push sleep stage accuracy into the EEG range.

Proceed to the Hypnodyne ZMax section to see a summary for 160 more hours (those recordings are not multichannel, which is why I present the OpenBCI data here: it has better quality).

Methods

The Fitbit (test device) was worn on the non-dominant (left) hand, about 5-10 mm from the ulna bone of the wrist.
30-second resolution hypnograms were extracted from the raw exported JSON files.
OpenBCI Cyton (reference device) EEG signals from multiple channels (F7, F8, T4, O2, O1, or a subset of these, with T3 as reference) were passed to YASA, a sleep staging algorithm that was validated on 30,000+ hours of PSG and was indistinguishable from a consensus of sleep specialists' scoring. As a result, a 30-second resolution consensus hypnogram based on the sum of stage probabilities ("majority voting") was built.
Timestamps of each epoch were rounded to 30 seconds, and then the epochs from both devices were joined together (inner join).
A confusion matrix was built; its diagonal represents the percentage of epochs where both devices agree. That's different from the Quantified Scientist's (Rob's) confusion matrices, which do not represent overall device accuracy and present a lot of devices as highly accurate when they are not. For example, suppose there were 60 minutes of deep sleep according to the reference EEG device, and the test device (e.g. an Apple Watch) correctly detected 55 of those minutes and falsely detected an extra 30 minutes of deep sleep where there was none. Then you will see 55/60 = 92% in the Quantified Scientist's confusion matrix and 55/85 = 65% in my confusion matrix.
Overall accuracy is calculated as: Overall Accuracy = Correct / (Correct + Incorrect)
This approach is a bit different from the one in my analysis of the new Oura ring 3.0 sleep algo, because Fitbit has 30-second epoch resolution in contrast to Oura's 300-second resolution. So we don't need to downsample OpenBCI, which results in a more detailed comparison but also reveals some YASA-specific micro-awakenings, which I think are just position changes. I'm not going to focus on single-epoch micro-awakenings and will pay attention mostly to longer awakenings.
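
The consensus and alignment steps above can be sketched roughly as follows. This is a minimal Python illustration, not the published R script: the function names, the channel labels in the comments, and every sample value are made up, and it assumes each channel yields one probability per stage for every epoch.

```python
from datetime import datetime, timedelta

STAGES = ["Awake", "Light", "Deep", "REM"]
EPOCH_S = 30  # epoch length in seconds


def consensus_stage(channel_probs):
    """Sum the per-channel stage probabilities and pick the largest total."""
    totals = {s: sum(p[s] for p in channel_probs) for s in STAGES}
    return max(STAGES, key=lambda s: totals[s])


def round_to_epoch(ts):
    """Round a datetime to the nearest 30-second boundary."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds = (ts - midnight).total_seconds()
    rounded = int(seconds / EPOCH_S + 0.5) * EPOCH_S
    return midnight + timedelta(seconds=rounded)


def inner_join(test, reference):
    """Keep only epochs present in both hypnograms: (time, test, reference)."""
    test_by_ts = {round_to_epoch(ts): stage for ts, stage in test}
    ref_by_ts = {round_to_epoch(ts): stage for ts, stage in reference}
    common = sorted(test_by_ts.keys() & ref_by_ts.keys())
    return [(ts, test_by_ts[ts], ref_by_ts[ts]) for ts in common]


# Hypothetical probabilities for one epoch from three channels.
probs = [
    {"Awake": 0.1, "Light": 0.5, "Deep": 0.3, "REM": 0.1},  # e.g. F7
    {"Awake": 0.1, "Light": 0.3, "Deep": 0.5, "REM": 0.1},  # e.g. F8
    {"Awake": 0.0, "Light": 0.2, "Deep": 0.7, "REM": 0.1},  # e.g. O1
]
print(consensus_stage(probs))  # Deep wins on summed probability

# Hypothetical, slightly misaligned epoch timestamps from the two devices.
fitbit = [
    (datetime(2024, 1, 1, 23, 0, 2), "Light"),
    (datetime(2024, 1, 1, 23, 0, 31), "Deep"),
    (datetime(2024, 1, 1, 23, 1, 1), "Deep"),
]
eeg = [
    (datetime(2024, 1, 1, 23, 0, 29), "Light"),
    (datetime(2024, 1, 1, 23, 0, 58), "Deep"),
    (datetime(2024, 1, 1, 23, 1, 28), "Deep"),
]
pairs = inner_join(fitbit, eeg)
```

Only epochs whose rounded timestamps exist in both recordings survive the join, so gaps such as a discharged battery simply drop out of the comparison.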

Results - OpenBCI

Let's start by exploring the individual nights. I'm going to ignore single-epoch micro-awakenings and highlight only the big disagreements:

The first short REM segment was missed
Just a bit of extra deep sleep was detected
A small part of REM was missed near the end of the night

The last chunk of data is excluded due to OpenBCI battery discharge, but DEEP sleep looks strange :)

Agreement seems to be in a good range for a wrist wearable, but still not in the EEG range. 80% REM looks solid.

The second night highlights some issues:

The first REM was not only missed but also marked as Awake. REM and Awake are not easy to discriminate without EEG, so this is not unexpected.
There is a big chunk of overpredicted DEEP sleep, which is not a good sign. On the previous night the agreement was a lot better. Looking at Fitbit data only, one might conclude that I had more DEEP sleep this night. But that is the wrong conclusion, because the extra DEEP sleep is due to device error. This is how a non-EEG device's margin of error distorts trends and may lead to wrong conclusions.

As a result, accuracy for DEEP sleep during this night is lower, and Light sleep at 71% doesn't look great either. But Awake looks somewhat better than on the first night.

Here Fitbit was able to detect the first REM, a bit bigger than in reality but not bad
Deep sleep during the first sleep cycle was a bit overpredicted, in contrast to the second sleep cycle, where the device detected less Deep sleep. When you look at averages you will not see this: Deep will be the same. But the device has some kind of error, and sometimes the errors go in different directions and cancel out in the averages (which doesn't make the device accurate, because the errors are unstable from night to night). Anyway, I would say that Deep sleep detection was good this night.
Some short REM segments were missed and some REM was mispredicted as Awake, but overall not too much.
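
The cancellation effect can be shown with a toy example. The sketch below is Python with invented 30-second epochs (not data from any of these nights): the test device overpredicts Deep in the first cycle and underpredicts it in the second, so the nightly totals match exactly while the epoch-by-epoch agreement does not.

```python
EPOCH_MIN = 0.5  # one 30-second epoch, in minutes

# Invented hypnograms for two sleep cycles (illustration only).
reference = ["Light"] * 4 + ["Deep"] * 6 + ["Light"] * 4 + ["Deep"] * 6
test = ["Light"] * 2 + ["Deep"] * 8 + ["Light"] * 6 + ["Deep"] * 4


def minutes(hypnogram, stage):
    """Total minutes a hypnogram spends in the given stage."""
    return hypnogram.count(stage) * EPOCH_MIN


def agreed_epochs(test_h, ref_h, stage):
    """Number of epochs where both hypnograms show the given stage."""
    return sum(t == stage and r == stage for t, r in zip(test_h, ref_h))


ref_deep = minutes(reference, "Deep")          # 6.0 minutes
test_deep = minutes(test, "Deep")              # 6.0 minutes, same total
both = agreed_epochs(test, reference, "Deep")  # 10 of the 12 detected epochs

print(f"Deep minutes: reference {ref_deep}, test {test_deep}")
print(f"Detected Deep epochs confirmed by reference: {both}/{test.count('Deep')}")
```

The nightly Deep total is identical for both devices, yet two of the twelve detected Deep epochs are wrong and two real ones are missed, which only an epoch-level comparison reveals.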

So this night's accuracy doesn't look bad at all for all sleep stages except Awake, but that's due to the many micro-awakenings from YASA.

This night looks pretty good! Fitbit just missed a small segment of Deep sleep and a bit of REM. The first REM was detected again. At the end of the night the OpenBCI battery ran out, so I excluded the last hour of data.

Deep and REM look pretty solid, as we already expected from the hypnogram. Light doesn't look good, but most of the errors are due to micro-awakenings (look at the 27.1% in Awake vs Light), so we don't need to focus too much on these; the fewest errors are due to REM / DEEP misdetection.

This night also doesn't look bad, with not too big Deep sleep errors (which cancel each other out again) and some REM detected as LIGHT. Good performance in detecting major awakenings.

But overall this night's accuracy is lower than on the 2 previous nights.

After looking at these 5 nights of data I can clearly notice a significant difference in Deep sleep prediction between the Fitbit Charge 6 and the Oura ring 3.0 with its newest 2.1 sleep algo. Oura tends to overpredict Deep sleep a lot, which is often not the case for the Fitbit Charge 6. The dataset is the same as in the Oura analysis, because I wore the Fitbit Charge 6 and the Oura ring simultaneously during the EEG recordings.

This night doesn't look good. Fitbit missed a big part of Deep sleep in the first 2 sleep cycles and overpredicted a lot of Deep sleep during the 3rd sleep cycle. REM was mostly fine, with a bit of missed REM at the end of the night.

Interestingly, Oura didn't miss the 2nd Deep sleep segment but overpredicted the 3rd in a similar way. So my HRV might have been unusual this night, and the devices weren't able to predict brain activity without EEG.

So this night doesn't look good at all...

This night the only issue was DEEP sleep overprediction during the 3rd sleep cycle; everything else looks pretty good.

Accuracy looks pretty good for REM at 91%, but the DEEP sleep overprediction results in poor agreement for DEEP and, as a consequence, LIGHT.

A little bit of Deep was overpredicted. The first REM and a bit of REM later were missed. But overall it looks good.

A solid night and good accuracy. We can see that on some nights the device is not too far from EEG and on other nights it is far off. That means we can't put much trust in sleep trends, because accuracy doesn't look stable between nights.

This night the OpenBCI battery ran out (I was testing whether it would last 2 days without charging), but I still present it because it reveals some problems during the first half of the night:

The first REM was missed, and some REM was missed at the end of the 2nd and 3rd sleep cycles.
Deep was overpredicted in the 3rd sleep cycle.

That seems to be a lot of errors for these 5 hours of data.

The only thing that was detected with okay accuracy is REM.

This night again we can see Deep sleep errors in both directions. Everything else looks fine.

Deep sleep is not detected well, but Light / REM are in an okay range for a non-EEG wearable.

This night looks quite good, with not much to mention, just some misalignment of the 1st REM.

Accuracy looks good, not too far from the EEG range for that night.

This night the device overpredicted a bit of Deep sleep and had a weird mix-up of REM / Awake with Light at 4:30.

The agreement is in the medium range.

One more night where I can't find major disagreements, only the 1st REM being missed.

Accuracy is not bad at all. This is the last night I have at the time of writing.

Let's summarize all nights:

Overall these data don't look bad: an overall accuracy of 74.7% / F1 0.72 matches the Oura ring 3 at 75% / F1 0.75, with a somewhat more stable distribution of errors across stages than the Oura ring 3.

107 hours of multichannel EEG seems to be enough for some kind of summary for a single person.
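
For readers who want to compute these summary numbers from their own confusion matrix, here is a minimal Python sketch. It is not the published R script. The counts are invented, rows are assumed to be the reference (EEG) stage and columns the test (Fitbit) stage, and macro-averaging of F1 is an assumption, since the post does not say which F1 variant was used.

```python
STAGES = ["Awake", "Light", "Deep", "REM"]


def overall_accuracy(cm):
    """Correct / (Correct + Incorrect): diagonal counts over all counts."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-stage F1 (assumed averaging scheme)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # row sum: reference epochs
        predicted = sum(cm[r][i] for r in range(n))  # column sum: test epochs
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / n


def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    expected = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Invented epoch counts, rows = reference stage, columns = test stage.
example_cm = [
    [60, 30, 2, 8],     # reference Awake
    [25, 400, 40, 35],  # reference Light
    [1, 45, 110, 4],    # reference Deep
    [10, 30, 3, 160],   # reference REM
]
print(f"accuracy={overall_accuracy(example_cm):.3f}")
print(f"macro F1={macro_f1(example_cm):.3f}")
print(f"kappa={cohen_kappa(example_cm):.3f}")
```

Kappa is worth reporting next to accuracy because Light sleep dominates a typical night, so a device that labels nearly everything Light can still score a deceptively high raw accuracy.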

Results - Hypnodyne ZMax

I'm not going to explore specific nights here, because the Hypnodyne ZMax data looks not far from OpenBCI (with even more micro-awakenings) and thus would not add much new information. Let's look at a summary:

Here we can see a similar pattern for ZMax as in the Oura analysis - a lot of disagreement for Awake. ZMax shows a lot more micro-awakenings. That's mostly because its electrode positions are limited to the frontal lobe, which doesn't reveal the alpha rhythm that belongs to awake time. The alpha rhythm originates from the occipital area, which is covered by my OpenBCI electrode positions and seems to overcome the ZMax weakness in detecting awake moments.

Overall, for Deep and REM, Oura performed a bit better than the Fitbit Charge 6. But since the OpenBCI dataset is of better quality (multichannel EEG and a consensus hypnogram), I focus on the final OpenBCI confusion matrix.

Reminder

Just a reminder: do not compare Rob's confusion matrices with the ones presented here; they are calculated differently! When I saw the Eight Sleep Pod video presenting it as having 91% DEEP sleep accuracy, I noticed something weird: his hypnograms clearly show different results, nowhere near 91%! Look at this one for some examples of Eight Pod hypnograms:

You can see that about half of DEEP was wrongly detected, but the device got an insane 91% accuracy in Rob's confusion matrix:

I did not understand how that was possible, so I asked Rob and got his answer:

For example, if the Eight Sleep Pod shows you 100 minutes of deep sleep and the reference device shows 60 minutes of deep sleep, then in Rob's matrix you can see 90% accuracy. I do not accept that weird approach and use the standard scientific approach used in validation / research studies (Overall Accuracy / F1 score / Kappa score etc).
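
The gap between the two kinds of matrix can be reproduced with the numbers from the Methods example (60 minutes of reference deep sleep, 55 of them detected, plus 30 falsely detected minutes). The Python sketch below reads the first figure as the share of reference deep sleep that was found and the second as the share of detected deep sleep that was real. That reading is an interpretation of the example, not something confirmed by either analysis script, and the function names are hypothetical.

```python
def share_of_reference(correct_min, reference_min):
    """Correctly detected minutes as a fraction of what the reference saw."""
    return correct_min / reference_min


def share_of_detected(correct_min, detected_min):
    """Correctly detected minutes as a fraction of what the test device reported."""
    return correct_min / detected_min


# Numbers from the Methods example.
reference_deep = 60   # minutes of deep sleep per reference EEG
correct_deep = 55     # minutes the test device got right
false_deep = 30       # extra minutes the test device invented
detected_deep = correct_deep + false_deep  # 85 minutes shown by the device

by_reference = share_of_reference(correct_deep, reference_deep)
by_detected = share_of_detected(correct_deep, detected_deep)
print(f"normalised by reference: {by_reference:.0%}")  # prints 92%
print(f"normalised by detected:  {by_detected:.0%}")   # prints 65%

# Under the same assumed reading, 90% of 60 reference minutes is 54 minutes,
# so a device reporting 100 minutes would have only 54% of them confirmed.
confirmed = 0.9 * 60
print(f"share of 100 detected minutes confirmed: {confirmed / 100:.0%}")
```

Normalising by the reference never penalises a device for reporting too much of a stage, which is why heavy overprediction can still show up as a high percentage.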

Supplementary

The R script with the analysis is published on GitHub. Comment / correct me if I did something wrong :)

TODO

The post will be updated in the future:

collect more multichannel OpenBCI data
analyse trend accuracy (correlation of average nightly values over a long time, once I build up a big OpenBCI multichannel dataset)
publish raw data (BDF files weigh about 100-300 MB each, so I'm not yet sure how to share them properly)
publish R code on GitHub (done)