Quantified Self Experiments / Accuracy - Oura ring 3 new sleep algo vs home PSG / EEG sleep lab

Introduction

Oura claims that its ring is one of the most accurate sleep trackers. Some time ago they introduced a new sleep algorithm and published a validation study that claims 79% agreement (F1 score 0.78) versus PSG. The validation study was small and was run by the manufacturer. And even 79% is not so great; personally I'm looking for at least 84-85% for each stage, in both the long and the short term.

I have a collection of EEG devices: Dreem 2, Muse, Hypnodyne ZMax. In this post I go further in building my personal home sleep research lab, with an OpenBCI Cyton acquiring high-quality 500 Hz EEG from multiple brain areas (frontal, temporal, occipital) using gold-cup electrodes and conductive paste to get medical / research grade EEG quality. Another post explains my 24/7 ECG (512 Hz) acquisition, which allows me to get accurate HRV. Tracking core body temperature 24/7 with Calera complements the EEG and ECG data and allows me to verify the ring from different angles: sleep, HRV and temperature. This post focuses on comparing the Oura ring V3 and its newest sleep staging algorithm against hypnograms derived from OpenBCI multichannel EEG and from Hypnodyne ZMax.

Shortest summary

Here is the final confusion matrix against 60 hours of multichannel EEG:

Overall this is good news. The previous generation ring / algo was able to achieve only 60% overall accuracy in my tests. The new one is better at 75% accuracy, a great improvement. I hope one day it reaches 85% and I can stop saying that a good hypnogram needs EEG :)

Proceed to the Hypnodyne ZMax section to see a summary for 600 more hours. That data is not multichannel, which is why I present the OpenBCI data here (it has better quality), but you will see about the same results.

Methods

Usually a hypnogram has a resolution of 30 seconds per epoch, which is not the case for the Oura ring with its 300-second resolution. This requires some methodological decisions:
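
One way to bring a 30-second hypnogram down to 300-second resolution is a majority vote over each block of ten epochs. The sketch below assumes that rule, hypothetical stage labels and an arbitrary tie-break; the actual analysis may handle ties or partial blocks differently.

```python
from collections import Counter

def downsample_hypnogram(stages, factor=10):
    """Collapse each block of `factor` epochs into one label by majority vote.

    Assumptions for this sketch: ties go to the stage that appears first
    in the block, and a trailing partial block is dropped.
    """
    out = []
    for start in range(0, len(stages) - factor + 1, factor):
        block = stages[start:start + factor]
        counts = Counter(block)
        top = max(counts.values())
        # First label (in block order) that reaches the top count.
        out.append(next(s for s in block if counts[s] == top))
    return out

# Hypothetical 10 minutes of 30-second epochs: 3 REM epochs inside light sleep.
night = ["LIGHT"] * 7 + ["REM"] * 3 + ["DEEP"] * 10
print(downsample_hypnogram(night))  # ['LIGHT', 'DEEP']
```

In this toy example the 90 seconds of REM disappear from the 300-second version, which is the kind of loss any block-wise rule can cause for short segments.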

Results - OpenBCI

Let's start by exploring the individual nights, where I'm going to highlight only the big disagreements:

  1. The 1st REM segment was missed
  2. The 3rd DEEP sleep segment was hugely overpredicted
  3. The 4th DEEP segment was mispredicted

Let's look at the confusion matrix for this night:

The ring overpredicted a lot of deep sleep in the middle of the night. Light sleep detection was better, but 75% agreement for a sleep stage that accounts for half of the night doesn't look good. REM / Awake doesn't impress me. The ring did not reach the advertised 79% accuracy / 0.78 F1 score for any of the stages, with an overall accuracy of 67% and an F1 score of 0.67.
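
For readers who want to reproduce this kind of table, here is a minimal sketch of a confusion matrix with overall accuracy and per-stage precision, recall and F1, computed from two epoch-aligned hypnograms. The stage names, the toy data and the macro-averaged F1 are assumptions for illustration; they are not taken from the published R script, which may normalize or average differently.

```python
STAGES = ["AWAKE", "LIGHT", "DEEP", "REM"]

def confusion_matrix(reference, predicted, labels=STAGES):
    """Rows are reference stages, columns are predicted stages (epoch counts)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix

def summarize(matrix, labels=STAGES):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    per_stage = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        ref_total = sum(matrix[i])                  # epochs the reference scored as this stage
        pred_total = sum(row[i] for row in matrix)  # epochs the ring scored as this stage
        recall = tp / ref_total if ref_total else 0.0
        precision = tp / pred_total if pred_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_stage[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: every stage counts equally (an assumption of this sketch).
    macro_f1 = sum(s["f1"] for s in per_stage.values()) / len(labels)
    return {"accuracy": correct / total, "macro_f1": macro_f1, "per_stage": per_stage}

# Toy data: the "ring" finds all reference deep sleep but also adds extra.
reference = ["LIGHT", "LIGHT", "DEEP", "DEEP", "REM", "AWAKE", "LIGHT", "LIGHT"]
predicted = ["LIGHT", "DEEP", "DEEP", "DEEP", "LIGHT", "AWAKE", "LIGHT", "DEEP"]
result = summarize(confusion_matrix(reference, predicted))
print(result["accuracy"])           # 0.625
print(result["per_stage"]["DEEP"])  # recall 1.0, precision 0.5
```

The DEEP row shows why overprediction matters: recall is perfect because every reference deep epoch was found, while precision is only 0.5 because half of the predicted deep epochs were something else.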

 

Here we can again see deep sleep overprediction. The ring predicted almost all of the deep sleep from the reference device, but the issue is that it also falsely detected some extra deep sleep. Most ring users don't have an EEG device, so they cannot tell whether deep sleep was overpredicted or not.

Another issue is a big REM segment missing at the end of the night.

This night looks better than the first one. Deep is still not good, but Light and Awake seem to be better, resulting in 75% overall accuracy and an F1 score of 0.75, which is not far from the validation study.

 

This night the OpenBCI battery ran out at the end of the night, so I excluded that part from the confusion matrix / accuracy calculation. Here we can see again that the ring tends to overpredict deep sleep. The first REM segment was missed again.

This also doesn't look bad, at least for REM and Awake, which have more than 80% agreement. The same applies to accuracy / F1. Deep / Light seem to be the most problematic stages for the ring.

 

This night looks interesting: OpenBCI missed the first REM segment, while the ring detected a big first REM segment?

Let's look at the spectrogram and the non-downsampled 30-second hypnogram to find out whether there was just a pretty small amount of REM.

It seems that this was the case: there was a small first REM segment which was lost due to downsampling, so it seems the ring detected too much REM here. This night the ring also overpredicted deep sleep, but not too badly.

This night Deep / Light agree better, but REM and Awake are not doing well.

 

This night we can see a similar, too-big first REM segment. During the first 1-2 sleep cycles there is not much REM, in contrast to the 4th-5th sleep cycles, where REM dominates.

Let's look at the high-resolution, non-downsampled spectrogram / hypnogram.

We can see that there was a pretty small REM segment at the start, and it was lost due to downsampling to match the Oura time resolution. So again the problem seems to be the ring overdetecting the first REM segment. At the end of the night REM sleep was poorly detected. Some deep sleep was overpredicted.

Here we can see poor REM detection, and Deep did not agree well either. Awake seems to be OK. Overall accuracy is also not so good: about 30% of sleep was predicted incorrectly.

 

This night Oura did not detect the first REM segment, overpredicted some deep sleep and missed some REM around 6:00. The OpenBCI battery ran out at the end, so I just excluded that segment.

This is the first night where the device reaches the advertised accuracy (but not the F1 score).

 

The first REM was missed again, and deep sleep was overpredicted.

This night doesn't look bad at all. The ring seems to struggle with deep sleep and the first REM, but overall REM seems to be fine.

Someone may notice that there is no 29 Dec in the hypnograms. That's because I forgot to charge the batteries and OpenBCI turned off in the first half of sleep, so I just excluded that day.

Summary of all nights in the analysis:

The ring didn't reach the advertised accuracy, but it is not far from it (75%, F1 0.75). The main contributor to overall accuracy is Light sleep, which is usually not something we focus on. Deep sleep accuracy was low at 66%, and REM / Awake were a bit better at ~74%.

Here we can see the technology improving over time: ring 2.0 had about 60% accuracy compared to Dreem 2 EEG in my past analysis. Achieving 75% overall is a significant increase, and REM detection improved!

Results - Hypnodyne ZMax

Let's explore the Hypnodyne ZMax data:

Here we can see a bit of deep sleep overprediction, and the first REM is overpredicted too. Awake time doesn't match well. Why is that?

Oura detected two major awakenings, but ZMax detected them as smaller. That might be due to downsampling from 30-second to 300-second resolution (to match both devices). Let's explore the raw ZMax spectrogram / hypnogram:

It looks like that was the case, and ZMax did not detect major awakenings. I do not know which one I can trust here. ZMax captures EEG at the frontal area (lobe), but the alpha rhythm, which is typical of being awake with closed eyes, originates from the occipital area (back of the head), so I may not trust either device too much. Since this pattern is pretty stable over nights, I will not focus too much on Awake time. Let's look at the confusion matrix:

Ignoring awake time, the results look pretty good for that night! DEEP / LIGHT / REM are somewhere in the EEG accuracy range!

 

 

Here we can see some missed REM, a bit of deep sleep overprediction and also missing deep sleep at the end of the night.

DEEP looks good, REM looks pretty good. Light doesn't look good, which might be due to mixing with Awake.

The first REM segment was missed, a bit of deep sleep was overpredicted and some REM was missed / mixed with Awake.

This night DEEP sleep doesn't look good; Light / REM look good enough.

ZMax seems to miss or hide the first REM because of downsampling, but the ring detected it fine. The main issue that night is a huge deep sleep overprediction. Some REM was missed, but not too much.

We can see that DEEP sleep was mostly misdetected when there was LIGHT sleep. REM again looks pretty good, and that seems to be a stable pattern here. The ring's REM matches pretty well with ZMax.

Other nights look pretty similar to these, so let's go to the overall metrics computed on a much, much bigger dataset of ~600 hours of sleep:

We can see that the ring reaches 69% accuracy and an F1 score of 0.63 versus Hypnodyne ZMax. The issue with Awake time is related to downsampling, which significantly reduces awake time for ZMax but not for the ring. If we ignore Awake, we see 70% for DEEP, 79% for LIGHT and 78.5% for REM, which is in the range of the OpenBCI results (66% / 82% / 74%). OpenBCI should be more accurate, because the multichannel approach contains more information about brain activity than ZMax, which is limited to the frontal lobe (not optimal for awake detection, since the alpha rhythm originates in the occipital region, far away from the frontal one). Either way, ignoring ZMax's Awake will increase overall accuracy and F1 to OpenBCI levels. 600 hours of sleep seems to be a big enough dataset, and the OpenBCI data (~60 h) supports the results from the ZMax dataset. Seeing roughly similar performance against 2 different references strengthens the results.
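
To illustrate how leaving Awake out can move the overall number, here is a small sketch that computes agreement with and without epochs scored as awake by either device. The exclusion rule and the toy hypnograms are assumptions made for this example, not the rule used for the figures above.

```python
def accuracy(reference, predicted, ignore=()):
    """Share of epochs where both hypnograms agree, skipping epochs where
    either device reports a stage listed in `ignore` (assumed rule)."""
    pairs = [(r, p) for r, p in zip(reference, predicted)
             if r not in ignore and p not in ignore]
    if not pairs:
        raise ValueError("no epochs left to compare")
    return sum(r == p for r, p in pairs) / len(pairs)

# Toy data: the predicted hypnogram reports more awake time than the reference.
reference = ["LIGHT", "LIGHT", "DEEP", "REM", "REM", "LIGHT", "LIGHT", "AWAKE"]
predicted = ["AWAKE", "LIGHT", "DEEP", "REM", "LIGHT", "AWAKE", "LIGHT", "AWAKE"]
print(accuracy(reference, predicted))                    # 0.625
print(accuracy(reference, predicted, ignore={"AWAKE"}))  # 0.8
```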

Conclusions

Looking at the presented data compared to OpenBCI / ZMax, I can conclude:

Not bad at all. I'm pretty sure most devices will not reach even that accuracy, but the ring is still not accurate enough to be in the EEG accuracy range. Why? Dreem 2 dry EEG with its proprietary sleep autoscoring algo has 82.6% accuracy for DEEP, 87.5% for LIGHT, 82.9% for REM and 76.7% for AWAKE, with an overall accuracy of 83.5% and an F1 of 0.83. This is how the diagonal of the confusion matrix should look at minimum for a device to be in the EEG accuracy range. So the Oura ring V3, with 66.1% for DEEP, 81.3% for LIGHT, 73.9% for REM and 72.6% for AWAKE, with overall 75% and F1 0.75, did not reach the EEG accuracy range for anything except REM (74% is not far from Dreem 2 EEG, which has 77%).

Reminder

Just a reminder: do not compare Rob's confusion matrix with the ones presented here; they are calculated differently! When I saw the Eight Sleep Pod video presenting it as having 91% DEEP sleep accuracy, I noticed something weird: his hypnograms clearly show different results, nowhere near 91%! Look at this one for some examples of Eight pod hypnograms:

You can see that about half of DEEP was wrongly detected, but the device got an insane 91% accuracy in Rob's confusion matrix:

I did not understand how that is possible, so I asked Rob and got his answer:

For example, if the Eight Sleep Pod shows you 100 minutes of deep sleep and the reference device shows 60 minutes of deep sleep, then in Rob's matrix you can see 90% accuracy. I do not accept that weird approach and use the standard scientific approach used in validation / research studies (overall accuracy / F1 score / Kappa score etc.).

So in "Oura Ring 3 : Scientific Sleep Lab Test !" you can see 85% deep sleep agreement, but this is just a partial agreement rate, because it ignores the extra deep sleep misdetected by the Oura ring, and this is the main difference. As we can see, Oura overpredicts a huge amount of deep sleep, and Rob's approach presents the device as more accurate than it is in reality! If Rob took overprediction into account, the percentages in his matrix would drop further.

I added these details because I'm pretty sure most readers who come here have already seen Rob's videos, and I want to avoid confusion between results caused by the different approaches we use for confusion matrix percentages.
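
The gap between the two approaches can be put into numbers with a short sketch. It assumes that the 90% figure behaves like recall (the share of reference deep sleep that the device also marked as deep) and uses a hypothetical overlap of 54 minutes, which is what would produce 90% in the 100 vs 60 minute example; the overlap value is not taken from any real recording.

```python
def agreement_rates(device_minutes, reference_minutes, overlap_minutes):
    """Return (recall, precision, f1) for one stage, given total minutes
    scored by each device and the minutes where both agree."""
    recall = overlap_minutes / reference_minutes   # reference deep sleep the device found
    precision = overlap_minutes / device_minutes   # device deep sleep that was really deep
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Hypothetical: device shows 100 min of deep sleep, reference shows 60 min,
# and 54 of those minutes coincide.
recall, precision, f1 = agreement_rates(100, 60, 54)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.3f}")
# recall=0.90 precision=0.54 f1=0.675
```

Under these assumptions a device can score 90% on the recall-style number while 46 of its 100 reported deep-sleep minutes are wrong, which is what precision and F1 pick up.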

Supplementary

The R script with the analysis is published on GitHub.

TODO

The post will be updated in the future: