Accuracy - Oura ring 3 new sleep algo vs home PSG / EEG sleep lab

Introduction

Oura claims that its ring is one of the most accurate sleep trackers. Some time ago they introduced a new sleep staging algorithm and published a validation study which claims 79% agreement (F1 score 0.78) versus PSG. The validation study was small and was run by the manufacturer. And even 79% is not so great: personally I'm looking for at least 84-85% for each stage, in both the long and the short term.

I have a collection of EEG devices: Dreem 2, Muse, Hypnodyne ZMax. In this post I go further in building my personal home sleep research lab, with an OpenBCI Cyton acquiring high quality 500Hz EEG from multiple brain areas (frontal, temporal, occipital) using gold-cup electrodes and conductive paste to get medical / research grade EEG quality. Another post explains my 24/7 ECG (512Hz) acquisition, which allows me to get accurate HRV. Tracking core body temperature 24/7 with Calera complements the EEG and ECG data and allows me to verify the ring from different angles: sleep, HRV and temperature. This post focuses on comparing the Oura ring V3 with its newest sleep staging algorithm against a hypnogram derived from OpenBCI multichannel EEG and against Hypnodyne ZMax.

Shortest summary

Here is the final confusion matrix against 60 hours of multichannel EEG:

Deep sleep prediction is not great, 66%. The ring overpredicts a lot of deep sleep: when the ring shows 100 minutes of deep sleep, in reality there were about 66 minutes, so ~34 minutes are falsely detected extra deep sleep. A margin of error of 34/66 ≈ 52% is pretty high for predicted deep sleep.
REM sleep shows better results, not yet in the EEG accuracy range but not so far from it. The same applies to Awake.
The ring is not far from the advertised 79% overall accuracy in my case, but I'm mostly interested in deep sleep, and that is where the device seems to be inaccurate.
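
The deep sleep arithmetic above can be written out as a small Python sketch. This is only an illustration, not part of the published R analysis; the function name `overprediction_margin` is made up for this example, and it assumes the 66% on the diagonal means "share of ring-predicted deep sleep that the EEG confirms".

```python
def overprediction_margin(predicted_min, agreement):
    """Split ring-predicted stage minutes into confirmed and extra minutes.

    agreement is assumed to be the share of predicted minutes that the
    reference EEG confirms (the diagonal value of the confusion matrix).
    """
    confirmed = predicted_min * agreement   # minutes the EEG agrees with
    extra = predicted_min - confirmed       # falsely detected extra minutes
    margin = extra / confirmed              # error relative to real deep sleep
    return confirmed, extra, margin


confirmed, extra, margin = overprediction_margin(100, 0.66)
print(f"{confirmed:.0f} min confirmed, {extra:.0f} min extra, margin {margin:.0%}")
```

With 100 predicted minutes and 66% agreement, the error is measured against the 66 confirmed minutes, which is why it comes out at about half rather than a third.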

Overall this is good news. The previous generation ring / algo was able to achieve only 60% overall accuracy in my tests. The new one is better at 75% - a great improvement. I hope one day it reaches 85% and I can stop saying that a good hypnogram needs EEG :)

Proceed to the Hypnodyne ZMax section to see a summary for 600 more hours. That data is not multichannel, which is why I present the OpenBCI data here (it has better quality), but you will see about the same results.

Methods

Usually a hypnogram has a resolution of 30 seconds per epoch, which is not the case for the Oura ring with its 300-second resolution. This requires some methodological decisions:

The Oura ring (test device) hypnogram was upsampled to 30-sec epochs (a single 300s epoch of deep sleep is split into ten 30-sec epochs of deep sleep). The ring was worn on the right ring finger; correct positioning was checked before bedtime and adjusted if needed.
OpenBCI Cyton (reference device) EEG signals from multiple channels (F7, F8, T4, O2 with T3 as reference) were passed to YASA, a sleep staging algorithm which was validated on 30000+ hours of PSG and was indistinguishable from the scoring consensus of sleep specialists. The hypnogram (multichannel majority-voting consensus) was then downsampled to 300s by selecting the most frequent sleep stage within each period, which gives the same 300s resolution as the Oura ring. It was then upsampled back to 30s epochs. OpenBCI was compared against nights scored with Oura sleep algo (SA) version 2.1 (do not confuse with the ring version).
Timestamps of each epoch were rounded to 30 seconds and then the epochs of both devices were joined (inner join).
A similar approach applies to Hypnodyne ZMax, which is analyzed separately and compared to version 2.0 of the algo, because 2.1 was only released in December 2023.
A confusion matrix was built in which the diagonal represents the percentage of epochs where both devices agree. That's different from the Quantified Scientist's (Rob) confusion matrices, which do not represent overall device accuracy and present a lot of devices as highly accurate when they are not. For example, if there were 60 mins of deep sleep according to the reference EEG device, and the test device (e.g. Apple Watch) detected 55 of those minutes but also falsely detected an extra 30 minutes of deep sleep when there was none, then you will see 55/60 = 92% in the Quantified Scientist confusion matrix and 55/85 = 65% in my confusion matrix.
Overall accuracy is calculated as: Overall Accuracy = Correct / (Correct + Incorrect)
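
The upsampling, timestamp rounding, inner join and overall accuracy steps listed above can be sketched in plain Python. This is a simplified illustration, not the published R script: the function names, the dict-based hypnogram format (`{start_timestamp_in_seconds: stage}`) and the toy data are all assumptions made for this example.

```python
EPOCH = 30         # seconds, standard scoring epoch
RING_EPOCH = 300   # seconds, ring hypnogram resolution


def upsample(hypnogram_300s):
    """Expand {start_ts: stage} at 300 s into ten 30 s epochs per entry."""
    out = {}
    for start, stage in hypnogram_300s.items():
        for k in range(RING_EPOCH // EPOCH):
            out[start + k * EPOCH] = stage
    return out


def round_to_epoch(ts):
    """Round a timestamp in seconds to the nearest 30 s boundary."""
    return int((ts + EPOCH / 2) // EPOCH) * EPOCH


def inner_join(ring, reference):
    """Keep only epochs whose timestamp exists in both hypnograms."""
    shared = sorted(ring.keys() & reference.keys())
    return [(ts, ring[ts], reference[ts]) for ts in shared]


def overall_accuracy(pairs):
    """Correct / (Correct + Incorrect) over the joined epochs."""
    correct = sum(1 for _, ring_stage, ref_stage in pairs if ring_stage == ref_stage)
    return correct / len(pairs)


# Toy data: two ring epochs of 300 s, and a reference recording whose
# clock is 4 s off and which stops earlier than the ring.
ring = upsample({0: "DEEP", 300: "LIGHT"})
reference = {
    round_to_epoch(t + 4): ("DEEP" if t < 240 else "LIGHT")
    for t in range(0, 450, EPOCH)
}
pairs = inner_join(ring, reference)
print(len(pairs), round(overall_accuracy(pairs), 3))
```

The inner join silently drops any epoch that only one device recorded, which is also what happens on nights where one recording ends early.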

Results - OpenBCI

Let's start by exploring the individual nights, where I'm going to highlight only the big disagreements:

The 1st REM segment was missed
The 3rd DEEP sleep segment was hugely overpredicted
The 4th DEEP segment was mispredicted

Let's look at the confusion matrix for this night:

The ring overpredicted a lot of deep sleep in the middle of the night. Light sleep detection was better, but 75% agreement for a sleep stage which accounts for half of the night doesn't look good. REM / Awake don't impress me. The ring did not reach the advertised 79% accuracy / 0.78 F1 score for any of the stages, with an overall accuracy of 67% and an F1 score of 0.67.

Here we can again see deep sleep overprediction. The ring predicted almost all of the deep sleep found by the reference device, but the issue is that it also falsely detected some extra deep sleep. Most ring users don't have an EEG device, so they cannot tell whether deep sleep was overpredicted or not.

Another issue is a big REM segment missing at the end of the night.

This night looks better than the first one. Deep is still not good, but Light and Awake seem to be better, resulting in 75% overall accuracy and an F1 score of 0.75, which is not far from the validation study.

This night the OpenBCI battery ran out at the end of the night, so I excluded that part from the confusion matrix / accuracy calculation. Here we can see again that the ring tends to overpredict deep sleep. The first REM segment was missed again.

This also doesn't look bad, at least for REM and Awake, which have more than 80% agreement. The same applies to accuracy / F1. Deep / Light seem to be the most problematic stages for the ring.

This night looks interesting: OpenBCI missed the first REM segment and the ring detected a big first REM segment?

Let's look at the spectrogram and the non-downsampled 30-sec hypnogram to find out whether there was just a pretty small amount of REM:

It seems that this was the case: there was a small first REM segment which was lost due to downsampling, so it seems the ring detected too much REM here. This night the ring also overpredicted deep sleep, but not too badly.
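
Why a short REM segment can vanish after downsampling is easy to show with a minimal Python sketch of majority voting over ten 30-sec epochs. The function name and the toy night are invented for illustration, and ties are resolved by whichever stage appears first in the block, which may differ from the actual R script.

```python
from collections import Counter


def downsample_majority(stages_30s, factor=10):
    """Collapse each block of `factor` 30 s epochs to its most frequent stage."""
    blocks = [stages_30s[i:i + factor] for i in range(0, len(stages_30s), factor)]
    return [Counter(block).most_common(1)[0][0] for block in blocks]


# Hypothetical 15 minutes: 2 minutes of REM embedded in light sleep.
night = ["LIGHT"] * 12 + ["REM"] * 4 + ["LIGHT"] * 14
coarse = downsample_majority(night)
print(coarse)
```

Any segment shorter than half of a 300s window is outvoted by its neighbours, so a real but brief first REM period disappears from the reference hypnogram at ring resolution.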

This night Deep / Light agree better, but REM and Awake are not doing well.

This night we can see a similar, too big first REM segment. During the first 1-2 sleep cycles there is not much REM, in contrast to the 4th-5th sleep cycles where REM dominates.

Let's look at the high resolution, non-downsampled spectrogram / hypnogram.

We can see that there was a pretty small REM segment at the start and it was lost due to downsampling to match the Oura time resolution. So again the problem seems to be the ring overdetecting the first REM segment. At the end of the night REM sleep was poorly detected. Some deep sleep was overpredicted.

Here we can see poor REM detection, and DEEP also did not agree well. Awake seems to be ok. Overall accuracy is also not so good: about 30% of sleep was predicted incorrectly.

This night Oura did not detect the first REM segment, overpredicted some deep sleep and missed some REM around 6:00. The OpenBCI battery ran out at the end, so I just excluded that segment.

This is the first night where the device reaches the advertised accuracy (but not the F1 score).

The first REM was missed again, and deep sleep was overpredicted.

This night doesn't look bad at all. The ring seems to struggle with deep sleep and the first REM, but overall REM seems to be fine.

Someone may notice that there is no 29 Dec in the hypnograms. That's because I forgot to charge the batteries and the OpenBCI turned off in the first half of sleep, so I just excluded that day.

Summary of all nights in the analysis:

The ring didn't reach the advertised accuracy, but is not far from it (75%, F1 0.75). The main contributor to overall accuracy is Light sleep, which is usually not something we focus on. Deep sleep accuracy was low at 66%, and REM / Awake were a bit better at ~74%.

Here we can see improvements in the technology over time - ring 2.0 had about 60% accuracy in comparison to Dreem 2 EEG in my past analysis. Achieving 75% overall is a significant increase, and REM detection improved!

Results - Hypnodyne ZMax

Let's explore the Hypnodyne ZMax data:

Here we can see a bit of deep sleep overprediction, and the first REM is overpredicted too. Awake time doesn't match well, why is that?

Oura detected two major awakenings, but ZMax detected them as smaller. That might be due to downsampling from 30 secs to 300 secs resolution (to match both devices). Let's explore the raw ZMax spectrogram / hypnogram:

It looks like that was the case, and ZMax did not detect major awakenings. I do not know which device I can trust here. ZMax captures EEG at the frontal area (lobe), but the alpha rhythm, which is typical for being awake with closed eyes, originates from the occipital area (back of the head), so I don't fully trust either device here. Since this pattern is pretty stable over nights, I will not focus too much on Awake time. Let's look at the confusion matrix:

Ignoring awake time, the results look pretty good for that night! DEEP / LIGHT / REM are somewhere in the EEG accuracy range!

Here we can see some missed REM, a bit of deep sleep overprediction and also missing deep sleep at the end of the night.

DEEP looks good, REM looks pretty good. Light doesn't look good, which might be due to mixing with Awake.

The first REM segment was missed, deep sleep was a bit overpredicted and some REM was missed / mixed with Awake.

This night DEEP sleep doesn't look good; Light / REM look good enough.

ZMax seems to miss or hide the first REM because of downsampling, but the ring detected it fine. The main issue this night is a huge deep sleep overprediction. Some REM was missed, but not too much.

We can see that DEEP sleep was mostly misdetected when there was LIGHT sleep. REM again looks pretty good, which seems to be a stable pattern here. The ring's REM matches ZMax pretty well.

The other nights look pretty similar to these ones, so let's go to the overall metrics computed on a much bigger dataset of ~600 hours of sleep:

We can see that the ring reaches 69% accuracy and an F1 score of 0.63 versus Hypnodyne ZMax. The issue with Awake time is related to downsampling, which significantly reduces awake time for ZMax but not for the ring. If we ignore Awake we see 70% for DEEP, 79% for LIGHT and 78.5% for REM, which is in the range of the OpenBCI results (66% / 82% / 74%). OpenBCI should be more accurate, because the multichannel approach contains more information about brain activity than ZMax, which is limited to the frontal lobe (not optimal for awake detection, since the alpha rhythm originates in the occipital region, far away from the frontal one). In any case, ignoring ZMax's Awake would bring overall accuracy and F1 up to OpenBCI levels. 600 hours of sleep seems to be a big enough dataset, and the OpenBCI data (~60h) supports the results from the ZMax dataset. Seeing roughly similar performance against 2 different references strengthens the results.
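
One possible way to check how much Awake disagreement drags down overall accuracy is to recompute accuracy with Awake epochs left out. The Python sketch below is only an assumption about how this could be done; it is not taken from the R script, the `accuracy` helper is hypothetical and the epoch pairs are invented toy data.

```python
def accuracy(pairs, ignore=()):
    """Share of agreeing epochs, skipping pairs where either device
    scored a stage listed in `ignore`."""
    kept = [(ref, ring) for ref, ring in pairs
            if ref not in ignore and ring not in ignore]
    return sum(ref == ring for ref, ring in kept) / len(kept)


# Toy (reference, ring) pairs: the ring scores Awake where the
# downsampled reference shows Light.
pairs = (
    [("LIGHT", "LIGHT")] * 6
    + [("DEEP", "DEEP")] * 2
    + [("LIGHT", "DEEP")]
    + [("LIGHT", "AWAKE")] * 3
)
with_awake = accuracy(pairs)
without_awake = accuracy(pairs, ignore=("AWAKE",))
print(round(with_awake, 3), round(without_awake, 3))
```

Dropping epochs also shrinks the dataset, so a number computed this way describes agreement on sleep stages only, not on the whole night.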

Conclusions

Looking at the presented data compared to OpenBCI / ZMax, I can conclude:

The ring tends to overpredict DEEP sleep by about a half. Accuracy over a few nights is about 66% with an overpredicting tendency, so when Oura shows 100 minutes of deep sleep, in reality there were about 66 minutes. That is a pretty big margin of error (a 34 minute error on a 66 minute base is about half).
The ring does not handle the first REM segment well. Sometimes it just misses it and sometimes it detects too big a REM segment. Overall REM detection is better than DEEP, but not so stable (accuracy varies by about 20% between nights).
Awake accuracy is not stable, being in the range of 60-80%, sometimes as low as 40%.
The most accurate stage seems to be LIGHT, but it accounts for half of sleep, so each percentage point for LIGHT does not mean the same as for DEEP / REM (they account for 15-20% of sleep each, in contrast to ~50% for LIGHT).
This instability / variability of accuracy between nights likely makes trends not so useful because of a masking effect. One night we see 90 mins of REM, the next night 70 mins - but is that due to less REM or lower accuracy? I will add that analysis when more data comes in.
The good thing is that the first 2 deep sleep segments are well detected. So the ring can be used to calculate HRV at the midpoint of the first deep sleep segment, which seems to be a better marker of the parasympathetic nervous system (PNS) than whole night HRV. Cool!
The ring doesn't reach the advertised accuracy but is not too far from it, which is a good sign.
ZMax Awake doesn't look great after downsampling, and thus ZMax seems less suitable as a reference device than OpenBCI, whose multi-channel approach covers most brain areas and represents Awake better.
ZMax results don't differ too much (except Awake), and this increases my confidence in an overall accuracy somewhere around 75%.

Not bad at all - I'm pretty sure most devices will not reach even that accuracy, but the ring is still not accurate enough to be in the EEG accuracy range. Why? Dreem 2 dry EEG with its proprietary sleep autoscoring algo has 82.6% accuracy for DEEP, 87.5% for LIGHT, 82.9% for REM and 76.7% for AWAKE, with an overall accuracy of 83.5% and F1 of 0.83. This is how the diagonal of a confusion matrix should look at a minimum for a device to be in the EEG accuracy range. So the Oura ring V3, with 66.1% for DEEP, 81.3% for LIGHT, 73.9% for REM and 72.6% for AWAKE, and overall 75% / F1 0.75, did not reach the EEG accuracy range for anything except AWAKE (72.6% is not far from the 76.7% of Dreem 2 EEG).

Reminder

Just a reminder - do not compare Rob's confusion matrix with the ones presented here, they are calculated differently! When I saw the Eight Sleep Pod video presenting it as having 91% DEEP sleep accuracy, I noticed something weird - his hypnograms clearly show different results, nowhere near 91%! Look at this one for some examples of Eight Pod hypnograms:

You can see that about a half of DEEP was wrongly detected, but the device got an insane 91% accuracy in Rob's confusion matrix:

I did not understand how that is possible, so I asked Rob and got his answer:

For example, if the Eight Sleep Pod shows you 100 minutes of deep sleep and the reference device shows 60 minutes of deep sleep, then in Rob's matrix you can still see 90% accuracy, because the extra predicted minutes are not counted against the device. I do not accept that weird approach and use the standard scientific approach used in validation / research studies (Overall Accuracy / F1 score / Kappa score etc).

So in Oura Ring 3 : Scientific Sleep Lab Test ! you can see 85% deep sleep agreement, but this is just a partial agreement rate, because it ignores the extra deep sleep misdetected by the Oura ring, and this is the main difference. As we can see, Oura overpredicts a huge amount of deep sleep, and Rob's approach presents the device as more accurate than it is in reality! If Rob took overprediction into account, the percentages in his matrix would drop.

I added these details because I'm pretty sure most readers who come here have already seen Rob's videos, and I want to avoid confusion caused by the different approaches we use for confusion matrix percentages.
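
The difference between the two kinds of percentages can be reproduced with the numbers from the Methods example (60 reference minutes, 55 of them detected, 30 extra minutes falsely detected). The Python sketch below is an interpretation, not Rob's or my actual code: it assumes his percentage is normalized by reference minutes (recall) and mine by predicted minutes (precision), and the function name `stage_scores` is made up.

```python
def stage_scores(hit_min, missed_min, extra_min):
    """Return recall, precision and F1 for a single sleep stage.

    hit_min    - minutes both devices scored as the stage
    missed_min - reference minutes the test device scored as something else
    extra_min  - minutes the test device scored as the stage, the reference did not
    """
    recall = hit_min / (hit_min + missed_min)     # normalized by reference minutes
    precision = hit_min / (hit_min + extra_min)   # normalized by predicted minutes
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


recall, precision, f1 = stage_scores(hit_min=55, missed_min=5, extra_min=30)
print(f"recall {recall:.0%}, precision {precision:.0%}, F1 {f1:.2f}")
```

Recall stays high no matter how much extra deep sleep a device reports, while precision and F1 both drop, which is why overprediction is invisible in a matrix normalized only by the reference.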

Supplementary

The R script with the analysis is published on GitHub.

TODO

This post will be updated in the future:

collect more multichannel OpenBCI data
analyse trend accuracy (correlation of average nightly values over a long period, once I build up a big multichannel OpenBCI dataset)
publish raw data (bdf files are about 100-300MB each, so I'm not yet sure how to share them properly)
publish R code on GitHub (done)