Wolf Garbe, CEO and co-founder of SeekStorm

The mystery of an 85% performance drop, lost in translation between CPU and OS.

Photo by Igor Omilaev

TLDR: An 85% performance drop caused by a faulty Windows 11 24H2 software update
that changed the task scheduler's behavior so that it under-utilizes the P-cores in favor of the E-cores on Intel hybrid CPUs.

Stakeholders and their interplay: Hardware, OS, Compiler, Code

This story is about performance, here specifically about MS Windows 11 and Intel hybrid CPUs, or rather their non-optimal interplay.
It is a story about the little surprises of all the actors secretly interfering with your work in ways that you might not always be aware of.

I know, I might be an outlier, developing on a Windows laptop, knowing that the software will ultimately run on a Linux server in most cases.
But that way, cross-platform development is internalized, not only an afterthought.

And I think the tale might be interesting even for Linux purists who never leave their beloved command line.
When developing high-performance software we know that architecture, data structures, algorithms, and hardware (processor, RAM, SSD) will determine the outcome.

Sometimes the operating system influences the performance as well, depending on drivers, the file system implementation, memory mapping, or OS-specific features like io_uring or IoRing.

Sometimes, CPU vendors add promising features, like the AVX512 VP2INTERSECT,
that made a brief appearance on Tiger Lake only to disappear from Intel altogether afterward.
So we can’t rely on what we have today still being usable tomorrow, by us and our users.

Between our algorithms and the hardware there is also a compiler and possibly a .NET runtime or Java virtual machine
that might influence the efficiency of the produced code, and might even prohibit us from using, for example, _mm_cmpistrm,
a SIMD hardware acceleration feature present in our CPU that is very useful for intersection, but that some programming languages did not support for years.
The latter was the final straw that made me switch from C# to Rust.

Those are the things we are aware of. But then there are things that might catch us by unpleasant surprise.

Windows 11 offers 3 power modes: best power efficiency, balanced, and best performance. Choosing the balance of performance/noise/energy consumption - your call.
For developing and testing I always have best performance switched on.
And yet, suddenly something was off…

The problem

During benchmarking, I suddenly experienced an 85% performance drop in index time, query latency, and throughput.

At first, I thought I had introduced some problem in the code, but I later realized that all the old SeekStorm versions, as well as other open-source search libraries that had been fine before,
suddenly exhibited the same performance degradation on my Windows 11 laptop with an Intel Core i7 13700H (14 cores, divided into 6 performance cores and 8 efficiency cores).

I benchmarked the code on another laptop with AMD Ryzen 7 6800H (8 cores), and sure enough, everything was fine.

I wrote a simple benchmark in Rust, to make sure the processor (or rather the OS as we will see later) is the culprit and not the SSD.
The simple benchmark just tries to saturate all processor cores with concurrent, multi-threaded dummy calculations. For the full benchmark code see below.

On the Intel Core i7 13700H, the Task Manager screenshots show that at first all cores are 100% utilized,
but after a short time the utilization drops to 25%, with only the 8 efficiency cores busy, and even those not fully saturated, while all performance cores are idle.

Screenshot 1: ThinkPad P1 Gen 6 / Intel Core i7 13700H (Performance degradation)
Elapsed time: 106.714 s

Intel Core i7 13700H

On the AMD Ryzen 7 6800H, by contrast, all cores are properly saturated.

Screenshot 2: Lenovo Legion 5 Pro / AMD Ryzen 7 6800H (Performance unchanged)
Elapsed time: 29.737 s

AMD Ryzen 7 6800H

I hadn’t tinkered with the hardware, nor had I installed or updated any software recently. I checked the Windows update log, but there was nothing suspicious there either.
I even cleaned the dust from the fans and installed all BIOS and OS updates, but the issue remained.

Finding the culprit

So my first working hypothesis was that my mobile Intel processor might exhibit the same performance problems that were previously described mainly for the desktop versions of the Intel Core 13th and 14th Gen CPUs.

I was almost ready to give in and buy a new laptop, spending 3000 to 4000 Euro and going through the annoyance of setting up all the software again.
I researched the latest advancements in processor technology and narrowed the choice down to an Intel Core Ultra 9 275HX or an AMD Ryzen 9 9955HX3D.
The AMD was not yet available, and I was leaning toward the Intel despite its lack of AVX512 VP2INTERSECT.

Before making the final decision, I read some comparison reviews. It hit me when I suddenly spotted the following comment: “Avoiding the P core vs E core scheduling issues on this 13900HX…”.
Could those “P core vs E core scheduling issues” be the root cause of my 85% performance drop?

Software usually doesn’t use the highest thread priority, to leave room for concurrently running programs and keep the overall system responsive.
That approach has worked for a long time, but it has unintended side effects that are not well understood once hybrid CPUs enter the game.
After more research, I found that Windows 11 24H2 and the subsequent updates silently introduced a task scheduler problem that under-utilizes the P-cores in favor of the E-cores.

The fix

How to fix it? The surprisingly easy way is to start your program as an administrator or with Sudo for Windows.
Alternatively, you can start the program with high priority: start "benchmark" /high "benchmark.exe" (the first quoted argument is the window title).
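
The same kind of launch can also be done from a small Rust launcher instead of the command line. The following is only a sketch under assumptions: the helper names priority_class_for_switch and command_with_priority are made up for illustration, and the constants are the priority class values from the Win32 headers. It relies on the standard library's Windows-only CommandExt::creation_flags; on other platforms the flag is simply ignored.

```rust
use std::process::Command;

// Win32 priority class values, passed as process creation flags.
pub const IDLE_PRIORITY_CLASS: u32 = 0x0000_0040;
pub const BELOW_NORMAL_PRIORITY_CLASS: u32 = 0x0000_4000;
pub const NORMAL_PRIORITY_CLASS: u32 = 0x0000_0020;
pub const ABOVE_NORMAL_PRIORITY_CLASS: u32 = 0x0000_8000;
pub const HIGH_PRIORITY_CLASS: u32 = 0x0000_0080;
pub const REALTIME_PRIORITY_CLASS: u32 = 0x0000_0100;

/// Hypothetical helper: maps the priority switches of `start`
/// (/low, /belownormal, /normal, /abovenormal, /high, /realtime)
/// to the matching priority class value.
pub fn priority_class_for_switch(switch: &str) -> Option<u32> {
    match switch.trim_start_matches('/').to_ascii_lowercase().as_str() {
        "low" => Some(IDLE_PRIORITY_CLASS),
        "belownormal" => Some(BELOW_NORMAL_PRIORITY_CLASS),
        "normal" => Some(NORMAL_PRIORITY_CLASS),
        "abovenormal" => Some(ABOVE_NORMAL_PRIORITY_CLASS),
        "high" => Some(HIGH_PRIORITY_CLASS),
        "realtime" => Some(REALTIME_PRIORITY_CLASS),
        _ => None,
    }
}

/// Hypothetical helper: builds a command that starts `program` with the
/// given priority class. Outside Windows the priority class is ignored.
pub fn command_with_priority(program: &str, priority_class: u32) -> Command {
    let mut command = Command::new(program);
    #[cfg(windows)]
    {
        use std::os::windows::process::CommandExt;
        command.creation_flags(priority_class);
    }
    #[cfg(not(windows))]
    let _ = priority_class; // no equivalent creation flag on this platform
    command
}
```

Calling command_with_priority("benchmark.exe", HIGH_PRIORITY_CLASS).spawn() would then start the benchmark roughly the way the start command with /high does, assuming benchmark.exe can be found.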

And of course, there is a Rust fix for that too - one that doesn’t require the user’s intervention.

// Use the winapi crate with the processthreadsapi and winbase features activated:
// winapi = { version = "0.3.9", features = ["processthreadsapi", "winbase"] }

use std::error::Error;
#[cfg(target_os = "windows")]
use winapi::um::{
    processthreadsapi::{GetCurrentProcess, SetPriorityClass},
    winbase::ABOVE_NORMAL_PRIORITY_CLASS,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    // Fixes an 85% performance drop caused by a faulty Windows 11 24H2 software update
    // that changed the task scheduler behavior into under-utilizing the P-cores in favor of the E-cores of Intel hybrid CPUs.
    // This is a workaround until Microsoft fixes the issue in a future update.
    // https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setpriorityclass
    #[cfg(target_os = "windows")]
    unsafe {
        // Pseudo handle for the current process; it does not need to be closed.
        let process = GetCurrentProcess();
        SetPriorityClass(process, ABOVE_NORMAL_PRIORITY_CLASS);
    }

    // ... the rest of the program goes here ...
    Ok(())
}
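
The snippet above ignores the return value of SetPriorityClass, which is nonzero on success. As a variation, here is a dependency-free sketch that declares the two kernel32 functions by hand instead of using the winapi crate and reports whether the call succeeded. The helper name raise_process_priority and the win32 module are assumptions made for this illustration, and the sketch assumes a reasonably recent Rust toolchain that accepts unsafe extern blocks.

```rust
// Hand-written declarations of the two Win32 functions used by the fix.
// This replaces the winapi crate only for the sake of a self-contained sketch.
#[cfg(windows)]
mod win32 {
    use std::ffi::c_void;

    // Value of ABOVE_NORMAL_PRIORITY_CLASS from the Win32 headers.
    pub const ABOVE_NORMAL_PRIORITY_CLASS: u32 = 0x0000_8000;

    #[link(name = "kernel32")]
    unsafe extern "system" {
        // HANDLE GetCurrentProcess(void);
        pub fn GetCurrentProcess() -> *mut c_void;
        // BOOL SetPriorityClass(HANDLE hProcess, DWORD dwPriorityClass);
        pub fn SetPriorityClass(process: *mut c_void, priority_class: u32) -> i32;
    }
}

/// Hypothetical helper: raises the priority class of the current process
/// and returns true only if Windows accepted the change.
#[cfg(windows)]
pub fn raise_process_priority() -> bool {
    // SAFETY: GetCurrentProcess returns a pseudo handle that is always valid
    // for the calling process, and SetPriorityClass only reads its arguments.
    unsafe {
        win32::SetPriorityClass(
            win32::GetCurrentProcess(),
            win32::ABOVE_NORMAL_PRIORITY_CLASS,
        ) != 0
    }
}

/// On other operating systems the workaround is not needed, so nothing is
/// changed and false is returned.
#[cfg(not(windows))]
pub fn raise_process_priority() -> bool {
    false
}
```

Calling raise_process_priority() once at the top of main and logging a false result on Windows would make it visible when the workaround did not take effect.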

Et voilà - the full performance is back, 4000 Euro saved and I can cure my headache.

Screenshot 3: ThinkPad P1 Gen 6 / Intel Core i7 13700H (Full performance after fix)
Elapsed time: 16.520 s

Intel Core i7 13700H
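
The elapsed times of the dummy benchmark line up with the 85% figure. The sketch below assumes that the drop is expressed as lost throughput for a fixed workload, which the text does not spell out; the function names are made up for illustration.

```rust
/// Fraction of throughput lost when a fixed workload takes `slow_secs`
/// instead of `fast_secs`.
pub fn throughput_drop(slow_secs: f64, fast_secs: f64) -> f64 {
    1.0 - fast_secs / slow_secs
}

/// How many times longer the slow run takes than the fast run.
pub fn slowdown_factor(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}
```

With the 106.714 s from Screenshot 1 and the 16.520 s from Screenshot 3, throughput_drop gives about 0.845 (roughly 85%) and slowdown_factor about 6.46.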

Conclusion

Unbelievably, my capable laptop almost landed at the scrapyard, for no other reason than a careless software update.
One would think that the two giant companies involved in this disaster would have more thorough testing in place before jeopardizing their customers’ equipment.
They brag when they achieve a 10% performance increase between processor generations. But here, an 85% drop - and silence!
I guess many users who aren’t as performance-obsessed as I am are not fully aware of the extent to which their computers have become lame and half-dead.
They will feel that their computer has become slower and laggier, but many are used to certain operating systems becoming slower over time.
Even worse, users trying SeekStorm on Windows for a quick evaluation would blame it on our software.

Let’s hope that the Windows issue finally gets fixed, and things get back to normal without the need for the admin rights hack or fiddling with priorities.
Priorities are for prioritizing between multiple applications. But wilfully forcing most of the processor cores into an idle or under-utilized state,
even though a multi-threaded, performance-critical application is running that could fully saturate all processor cores if only the OS would let it?
That’s not a genius move, especially if the computer is connected to mains power and best performance is selected.

But now that this is out of the way, let’s get back to building the future of search 🙂.

Benchmark program (Rust)

// Requires the tokio crate (features: rt-multi-thread, macros, sync) and the num-format crate.
use std::{error::Error, sync::Arc, time::Instant};

use num_format::{Locale, ToFormattedString};
use tokio::sync::Semaphore;

// CPU-bound dummy workload that keeps one core busy.
pub async fn load() {
    let mut result = 0.0;
    for i in 1..=20_000 { //50_000_000
        result += f64::from(i).tan();
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    // Maximum number of tasks that run concurrently.
    let thread_number = 20;
    let permits = Arc::new(Semaphore::new(thread_number));

    let start_time = Instant::now();

    for i in 0..1_000_000 {
        // Wait for a free slot before spawning the next task.
        let permit_thread = permits.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            load().await;
            drop(permit_thread);
        });

        if i % 10_000 == 0 {
            println!("docs {}", i.to_formatted_string(&Locale::en));
        }
    }

    // Wait for all tasks to finish: once every permit can be acquired, no task is still running.
    let mut permit_vec = Vec::new();
    for _i in 0..thread_number {
        permit_vec.push(permits.acquire().await.unwrap());
    }

    let elapsed_time = start_time.elapsed().as_nanos();
    println!("Elapsed time: {:.3} s", elapsed_time as f64 / 1_000_000_000.0);

    println!("Press Enter to exit...");
    let mut line = String::new();
    std::io::stdin().read_line(&mut line).unwrap();
    Ok(())
}
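
The benchmark hard-codes thread_number = 20, which matches the 20 logical processors of the i7 13700H (6 P-cores with two threads each plus 8 E-cores). For running the same kind of test on other machines, the following std-only sketch derives the worker count from the machine instead. It is a simplified variant and not the benchmark used for the measurements above; the function name saturate_all_cores and the fallback to a single worker are assumptions of this sketch.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// CPU-bound dummy work, mirroring the tan() loop of the benchmark.
fn load(iterations: u32) -> f64 {
    let mut result = 0.0;
    for i in 1..=iterations {
        result += f64::from(i).tan();
    }
    result
}

/// Hypothetical helper: spreads `tasks` units of work over one OS thread per
/// logical processor and returns the worker count and the elapsed time.
pub fn saturate_all_cores(tasks: usize, iterations: u32) -> (usize, Duration) {
    // available_parallelism() can fail in restricted environments;
    // falling back to one worker is a choice made for this sketch.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    let start = Instant::now();
    // Scoped threads are joined automatically when the scope ends.
    thread::scope(|scope| {
        for worker in 0..workers {
            scope.spawn(move || {
                // Static striping: worker k handles tasks k, k + workers, ...
                for _ in (worker..tasks).step_by(workers) {
                    // black_box keeps the compiler from discarding the result.
                    std::hint::black_box(load(iterations));
                }
            });
        }
    });
    (workers, start.elapsed())
}
```

Calling saturate_all_cores(1_000_000, 20_000) would run a workload of the same size as the benchmark above, although the timings are not directly comparable because no async runtime is involved.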