Quantified Self Experiments / Accuracy - Fitbit Charge 6 vs home PSG / EEG sleep lab


Recently, in a short conversation, Rob ter Horst, known as the Quantified Scientist, highlighted a new study that analysed the accuracy of the Oura Ring 3, Fitbit Sense, Dreem 2 and some research-grade accelerometers against PSG. The results are pretty interesting, and since I use the Fitbit Charge series as my daily wrist tracker, I can compare it against my home PSG sleep lab. The study has some limitations: Oura didn't perform as well as in its own validation study, and did even worse than in my data analysis. That might be due to the study's specific population with unusual sleep patterns and a protocol that was uncomfortable for participants (they had a lot of awake time and only 4 sleep cycles on average, whereas a healthy adult usually gets 5).

I have a collection of EEG devices: Dreem 2, Muse, Hypnodyne ZMax. In this post I go further in building my personal home sleep research lab, with an OpenBCI Cyton acquiring high-quality 500 Hz EEG from multiple brain areas (frontal, temporal, occipital) using gold-cup electrodes and conductive paste to get medical / research-grade EEG quality. This post focuses on comparing the Fitbit Charge 6 against a hypnogram derived from OpenBCI multichannel EEG and against the Hypnodyne ZMax.


Here is the final confusion matrix against 100+ hours of multichannel EEG:

I started using Fitbits with the Charge 4, and my analyses against Dreem 2 EEG showed no difference between the Charge 4 and Charge 5: most sleep stages were around 70-75% accuracy, but Awake reached only 45-50%. The Charge 6 keeps sleep-stage accuracy at the high end of the Charge 4 / Charge 5 range and increases Awake accuracy from 50% to 75%, which is a significant improvement (at least for me)!

Overall this is good news. The Fitbit Charge 6 improves Awake detection but still struggles to push sleep-stage accuracy into the EEG range.
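For readers who want to see how per-stage percentages like these can be computed, here is a minimal Python sketch (not the published R script). It assumes both hypnograms are already aligned epoch by epoch and that each row is normalized by the EEG reference stage; the stage labels, the function name `confusion_percent` and the toy data are illustrative assumptions, not real recordings.

```python
from collections import Counter

STAGES = ["Awake", "Light", "Deep", "REM"]

def confusion_percent(reference, device):
    """Row-normalized confusion matrix: for each reference (EEG) stage,
    the percentage of its epochs that the device assigned to each stage."""
    if len(reference) != len(device):
        raise ValueError("hypnograms must have the same number of epochs")
    counts = Counter(zip(reference, device))
    matrix = {}
    for ref_stage in STAGES:
        row_total = sum(counts[(ref_stage, d)] for d in STAGES)
        matrix[ref_stage] = {
            d: (100.0 * counts[(ref_stage, d)] / row_total if row_total else 0.0)
            for d in STAGES
        }
    return matrix

# Toy hypnograms (hypothetical, one label per epoch)
eeg    = ["Awake", "Awake", "Awake", "Awake", "Light", "Light", "Deep", "REM"]
fitbit = ["Awake", "Awake", "Awake", "Light", "Light", "Light", "Deep", "Light"]

m = confusion_percent(eeg, fitbit)
print(m["Awake"]["Awake"])  # 75.0: 3 of the 4 reference Awake epochs match
```

The diagonal of such a matrix is the per-stage agreement relative to the reference; it says nothing about how often the device predicts a stage when the reference disagrees.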

Proceed to the Hypnodyne ZMax section for a summary of 160 more hours (those recordings are not multichannel, which is why I present the OpenBCI data here: it is of better quality).


Results - OpenBCI

Let's start by exploring the individual nights. I'm going to ignore single-epoch micro-awakenings and highlight only the big disagreements:

The last chunk of data is excluded because the OpenBCI battery discharged, but DEEP sleep looks strange :)

Agreement seems to be in a good range for a wrist wearable, but still not in the EEG range. 80% for REM looks solid.


The second night highlights some issues:

As a result, accuracy for DEEP sleep during this night is lower, and Light sleep at 71% also doesn't look great. But Awake looks somewhat better than on the first night.


So accuracy for this night is not bad at all for every sleep stage except Awake, but that's due to the many micro-awakenings from YASA.


This night looks pretty good! The device just missed a small segment of Deep sleep and a bit of REM. The first REM was detected again. At the end of the night the OpenBCI battery ran out, so I excluded the last hour of data.

Deep and REM look pretty solid, but we already expected this from the hypnogram. Light doesn't look good, but most of the errors are due to micro-awakenings (look at the 27.1% in Awake vs Light), so there's no need to focus too much on these; only a small share of the errors is due to REM / DEEP misdetection.
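To check how much of the Awake vs Light disagreement comes from micro-awakenings, one option is to relabel isolated single-epoch awakenings in the reference hypnogram before comparing. This is a minimal Python sketch of that idea, not what the published R script does; the function name `drop_single_epoch_awakenings` and the rule (an Awake epoch with non-Awake neighbours on both sides takes the preceding stage) are assumptions made for illustration.

```python
def drop_single_epoch_awakenings(hypnogram, wake="Awake"):
    """Return a copy where isolated one-epoch awakenings are replaced
    by the stage of the preceding epoch. Longer awakenings are kept."""
    smoothed = list(hypnogram)
    for i in range(1, len(hypnogram) - 1):
        isolated = (
            hypnogram[i] == wake
            and hypnogram[i - 1] != wake
            and hypnogram[i + 1] != wake
        )
        if isolated:
            smoothed[i] = hypnogram[i - 1]
    return smoothed

# Toy reference hypnogram (hypothetical)
reference = ["Light", "Awake", "Light", "Awake", "Awake", "Deep", "Awake"]
cleaned = drop_single_epoch_awakenings(reference)
print(cleaned)
```

The first and last epochs are left untouched because they have only one neighbour. Comparing the device against both the raw and the cleaned reference shows how much of the Awake error is driven by these short events.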


This night also doesn't look bad: the Deep sleep errors are not too big (and they cancel each other out again), and some REM was detected as LIGHT. Good performance in detecting major awakenings.

But overall, accuracy for this night is lower than for the 2 previous nights.

After looking at these 5 nights of data I can clearly see a significant difference in Deep sleep prediction between the Fitbit Charge 6 and the Oura Ring 3.0 with its newest 2.1 sleep algorithm. Oura tends to overpredict Deep sleep a lot, which is often not the case for the Fitbit Charge 6. The dataset is the same as in my Oura analysis because I wore the Fitbit Charge 6 and the Oura ring simultaneously during the EEG recordings.


This night doesn't look good. Fitbit missed a big part of Deep sleep in the first 2 sleep cycles and overpredicted a lot of Deep sleep during the 3rd sleep cycle. REM was mostly fine, with a bit of missed REM at the end of the night.

Interestingly, Oura didn't miss the 2nd Deep sleep segment but overpredicted the 3rd in a similar way. So my HRV might have been unusual this night, and the devices weren't able to predict brain activity without EEG.

So this night doesn't look good at all...


This night the only issue was DEEP sleep overprediction during the 3rd sleep cycle; everything else looks pretty good.

Accuracy looks pretty good for REM at 91%, but the DEEP sleep overprediction results in poor agreement for DEEP and, as a consequence, LIGHT.


A little bit of Deep was overpredicted. The first REM and a bit of REM later on were missed. But overall it looks good.

A solid night with good accuracy. We can see that on some nights the device is not too far from EEG and on other nights it is far off. That means we can't put much trust in sleep trends, because accuracy doesn't look stable from night to night.
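Night-to-night stability can also be put into numbers instead of being judged by eye. A minimal Python sketch, assuming aligned per-night hypnograms; the night names and toy labels are hypothetical, and the spread is summarized with a plain population standard deviation, which is just one possible choice.

```python
from statistics import mean, pstdev

def night_accuracy(reference, device):
    """Fraction of epochs where the device stage equals the reference stage."""
    matches = sum(r == d for r, d in zip(reference, device))
    return matches / len(reference)

# Toy data: (reference, device) per night, not real recordings
nights = {
    "night_a": (["Light", "Deep", "REM", "Light"],
                ["Light", "Deep", "REM", "Light"]),
    "night_b": (["Light", "Deep", "REM", "Light"],
                ["Light", "Light", "Light", "Light"]),
}

accuracies = [night_accuracy(ref, dev) for ref, dev in nights.values()]
avg_accuracy = mean(accuracies)
spread = pstdev(accuracies)
print(avg_accuracy, spread)
```

A large spread relative to the mean is what makes night-to-night trends hard to trust, even when the pooled accuracy looks acceptable.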


This night the OpenBCI battery ran out (I was testing whether it would last 2 days without charging), but I still present it because it reveals some problems during the first half of the night:

There seem to be a lot of errors in these 5 hours of data.

The only stage detected with okay accuracy is REM.


This night we again see Deep sleep errors in both directions. Everything else looks fine.

Deep sleep is not detected well, but Light / REM is in an okay range for a non-EEG wearable.


This night looks quite good; not much to mention, just some misalignment of the 1st REM.

Accuracy looks good, not too far from the EEG range for that night.


This night the device overpredicted a bit of Deep sleep and has a weird mix-up of REM / Awake with Light at 4:30.

The agreement is in the medium range.


One more night where I can't find major disagreements; only the 1st REM was missed.

Accuracy is not bad at all. This is the last night I have at the time of writing.

Let's summarize all nights:

Overall the data doesn't look bad: an overall accuracy of 74.7% / F1 0.72 roughly matches the Oura Ring 3 at 75% / F1 0.75, with a somewhat more even distribution of errors across stages than the Oura Ring 3.
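For reference, this is roughly how overall accuracy, F1 and Cohen's kappa can be computed from two aligned hypnograms. It is a minimal Python sketch, not the published R script: the function name `agreement_metrics` is hypothetical, the F1 here is an unweighted (macro) average over stages, which is an assumption about the averaging, and the toy data is not from real recordings.

```python
from collections import Counter

def agreement_metrics(reference, device, stages):
    """Return (overall accuracy, macro-averaged F1, Cohen's kappa)."""
    n = len(reference)
    pairs = Counter(zip(reference, device))
    ref_tot = Counter(reference)
    dev_tot = Counter(device)

    # Overall accuracy: share of epochs with identical labels
    accuracy = sum(pairs[(s, s)] for s in stages) / n

    # Per-stage F1 from precision and recall, then unweighted mean
    f1_scores = []
    for s in stages:
        tp = pairs[(s, s)]
        precision = tp / dev_tot[s] if dev_tot[s] else 0.0
        recall = tp / ref_tot[s] if ref_tot[s] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(stages)

    # Cohen's kappa: agreement corrected for chance agreement
    expected = sum(ref_tot[s] * dev_tot[s] for s in stages) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return accuracy, macro_f1, kappa

stages = ["Awake", "Light", "Deep", "REM"]
eeg    = ["Awake", "Light", "Light", "Deep", "Deep", "REM", "REM", "Light"]
device = ["Awake", "Light", "Deep", "Deep", "Deep", "REM", "Light", "Light"]

acc, f1, kappa = agreement_metrics(eeg, device, stages)
print(round(acc, 3), round(f1, 3), round(kappa, 3))
```

Kappa is worth reporting next to accuracy because Light sleep dominates most nights, so a device that labels almost everything as Light can still reach a deceptively high raw accuracy.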

107 hours of multichannel EEG seems to be enough for some kind of summary for a single person.

Results - Hypnodyne ZMax

I'm not going to explore specific nights here, because the Hypnodyne ZMax data looks close to the OpenBCI data (with even more micro-awakenings) and so doesn't add much new information. Let's look at a summary:

Here we can see a similar pattern for ZMax as in my Oura analysis: a lot of disagreement for Awake. ZMax shows a lot more micro-awakenings, mostly because its electrode positions are limited to the frontal lobe, which doesn't reveal the alpha rhythm that is characteristic of awake time. The alpha rhythm originates in the occipital area, which is among my OpenBCI electrode positions and seems to overcome the ZMax weakness in detecting awake moments.

Overall, for Deep and REM Oura performed a bit better than the Fitbit Charge 6. But since the OpenBCI dataset is of better quality (multichannel EEG and a consensus hypnogram), I focus on the final OpenBCI confusion matrix.


Just a reminder: do not compare Rob's confusion matrix with the ones presented here; they are calculated differently! When I saw the Eight Sleep Pod video presenting it as having 91% DEEP sleep accuracy, I noticed something weird: his hypnograms clearly show different results, nowhere near 91%! Look at this one for some examples of Eight Pod hypnograms:

You can see that about half of DEEP was wrongly detected, but the device got an insane 91% accuracy in Rob's confusion matrix:

I did not understand how that was possible, so I asked Rob and got his answer:

For example, if the Eight Sleep Pod shows you 100 minutes of deep sleep and the reference device shows 60 minutes of deep sleep, then in Rob's matrix you can still see 90% accuracy. I do not accept that weird approach and use the standard scientific approach from validation / research studies (overall accuracy / F1 score / kappa score etc.).
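One way to read the 100 / 60 / 90% example is as the gap between recall and precision. The Python sketch below is only an illustration under stated assumptions: it assumes the percentage is computed relative to the reference minutes, and that 54 of the device's 100 deep minutes overlap the reference's 60 minutes. The overlap figure is hypothetical and chosen to reproduce 90%; it is not taken from Rob's answer or from any recording.

```python
# Assumed numbers for illustration only
device_deep_min = 100     # deep sleep reported by the device
reference_deep_min = 60   # deep sleep in the reference hypnogram
overlap_min = 54          # hypothetical minutes where both say "deep"

# Share of reference deep sleep that the device caught
recall = overlap_min / reference_deep_min

# Share of device deep sleep that really was deep sleep
precision = overlap_min / device_deep_min

# F1 penalizes the overprediction that recall alone hides
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3), round(f1, 3))
```

Under these assumptions the reference-relative figure is 0.9 while almost half of the device's deep sleep is wrong, and F1 lands in between. That is why F1 and kappa are reported in this post.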


The R script with the analysis is published on GitHub. Comment / correct me if I did something wrong :)


This post will be updated in the future: