This was mainly an exercise on coming up ways how I could optimise my code. I took prime computation as example it made the scope smaller.
Problem statement: Make Prime Computation by Sieve of Eratosthenes faster
If you might need a refresher Sieve of Eratosthenes says
- First start with 2 and then mark all its multiples as not prime
- The next unmarked number is prime, take that number and mark all of its multiple as not primes
- Keep doing this any unmarked number is a prime; mark all its multiple. if a number is marked we don’t really care about.
- Finally return all the unmarked numbers.
- Note: this is a sequential problem
When i first looked at it, I wouldn’t lie I thought it would be just straight forward: I would just use goroutines and then marking all multiples and it will make the program faster.
Of course I was very wrong.
Before going into that I will give a brief explanation about how I have structured my code
There are three parts
- Registry - this is where I mark if a number is prime or not. think of an array of boolean. Initialised all values as true except 0 and 1
- MarkAllmultiplesAsNotPrime - a function which marks all multiples of a prime as not primes. Takes Registry and number as inputs
- Result - Iterating through registry and gathering all still marked as true.
This was the initial implementation to compare to.
You can check the code here. The sequential version is in file: computePrimeSingle.go
Our Base Benchmark
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=10-8 13815188 75.59 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=20-8 10208707 120.1 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=100-8 3170428 383.4 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=1000-8 394338 3085 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=100000-8 2769 428106 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=1000000-8 268 4609889 ns/op
Using goroutines for speed
The idea was simple to add multiple goroutines to mark numbers concurrently. Code is here: computePrimeWithGoroutines.go
The issue that immediate came up was that if i move markMultiplesAsNotPrime in goroutines is that ordering is not guaranteed. It can very well happen that we pick a number that is not yet marked as multiple and we start marking its multiples; this would be extra work which we don’t want to do.
So to counter that I added a isReallyPrime function; this would quickly check if a number is truly prime and then we can correctly mark its multiple as not primes. The tradeoff
- Initially we would have to do extra work to verify primes
- But after a particular number of primes we would have filtered out significant number of numbers as not prime and mostly only for truly primes we would be running isReallyPrime
I also add a RWLock to registry, so as we don’t face race condition.
Another addition was wait groups to make sure are processing is really complete.
The result was an extremely slow implementation.
Let’s compare the benchmark
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=10-8 799340 1427 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=20-8 373285 2998 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=100-8 62986 18970 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=1000-8 3364 350988 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=100000-8 19 53346985 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=1000000-8 2 617334458 ns/op
Looking into the cpu profile of this implementation.
File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:04 IST
Duration: 802.40ms, Total samples = 1.05s (130.86%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.04s, 99.05% of 1.05s total
Showing top 10 nodes out of 65
flat flat% sum% cum cum%
0.48s 45.71% 45.71% 0.48s 45.71% runtime.usleep
0.33s 31.43% 77.14% 0.33s 31.43% runtime.pthread_cond_wait
0.16s 15.24% 92.38% 0.16s 15.24% runtime.pthread_cond_signal
0.01s 0.95% 93.33% 0.02s 1.90% internal/sync.(*Mutex).lockSlow
0.01s 0.95% 94.29% 0.01s 0.95% main.isReallyPrime
0.01s 0.95% 95.24% 0.28s 26.67% runtime.lock2
0.01s 0.95% 96.19% 0.01s 0.95% runtime.nanotime1
0.01s 0.95% 97.14% 0.01s 0.95% runtime.releasem
0.01s 0.95% 98.10% 0.24s 22.86% runtime.stealWork
0.01s 0.95% 99.05% 0.03s 2.86% runtime.unlock2
Observations
- I am just firing goroutines and all of them and a lot of time is spent on scheduling goroutines.
- Also goroutines are trying to write on the same underlying data structure which would be causing a lot of contention.
- After scheduling, the second most time spent is with the lock.
- Next was the isReallyPrime function, we are doing extra work here.
Using goroutines; but limiting them
Keeping the goroutine code intact, I added buffered channel for bounded concurrency. Code is here: computePrimesWithGoroutinesBoundedConcurrecy.go
Keeping the channel size as runtime.NumCPU() to make sure we are not causing any delay due to scheduler having to manage large number of goroutines.
Benchmark
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=10-8 689505 1773 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=20-8 326319 3622 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=100-8 42302 28656 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=1000-8 2812 421174 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=100000-8 28 38294449 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=1000000-8 3 392406611 ns/op
Result:
- No drastic improvement.
- I saw very minor improvement for larger value of n.
- For smaller value of n it was basically a degradation.
It kind of made sense, for smaller value of n’s we won’t be having a lot of concurrent markings going on.
CPU Profiles
File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 502.56ms, Total samples = 970ms (193.01%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 920ms, 94.85% of 970ms total
Showing top 10 nodes out of 71
flat flat% sum% cum cum%
450ms 46.39% 46.39% 450ms 46.39% runtime.usleep
260ms 26.80% 73.20% 260ms 26.80% runtime.pthread_cond_wait
110ms 11.34% 84.54% 110ms 11.34% runtime.pthread_cond_signal
30ms 3.09% 87.63% 120ms 12.37% internal/sync.(*Mutex).lockSlow
20ms 2.06% 89.69% 110ms 11.34% runtime.wakep
10ms 1.03% 90.72% 130ms 13.40% internal/sync.(*Mutex).Lock
10ms 1.03% 91.75% 10ms 1.03% main.isReallyPrime
10ms 1.03% 92.78% 10ms 1.03% runtime.(*semaRoot).dequeue
10ms 1.03% 93.81% 10ms 1.03% runtime.(*timers).check
10ms 1.03% 94.85% 10ms 1.03% runtime.cansemacquire (inline)
Next steps
- The improvement was minor, still not as fast as not using goroutines
- We have two major points that we can fix
- Remove the lock
- Remove the reallyPrime function.
Going lockless, using atomics
In the earlier operation, I was locking the entire array for any writes and we can remove that by using atomics.
So I changed the registry to an array of atomics and remove the lock.
The rest of the code was still the same. Code is here: primeComputeAtomic.go
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=10-8 966913 1228 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=20-8 538166 2236 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=100-8 196950 6215 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=1000-8 27308 42110 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=100000-8 278 4250877 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=1000000-8 19 59738417 ns/op
Result
- A noticeable bump in performance
- Still not as good enough as the single version
CPU Profiles
File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 202.48ms, Total samples = 150ms (74.08%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 150ms, 100% of 150ms total
Showing top 10 nodes out of 29
flat flat% sum% cum cum%
60ms 40.00% 40.00% 60ms 40.00% runtime.usleep
30ms 20.00% 60.00% 30ms 20.00% runtime.madvise
20ms 13.33% 73.33% 20ms 13.33% main.isReallyPrime (inline)
20ms 13.33% 86.67% 20ms 13.33% main.markMultipleNotPrimeAtomic (inline)
10ms 6.67% 93.33% 10ms 6.67% runtime.casgstatus
10ms 6.67% 100% 10ms 6.67% runtime.pthread_cond_wait
0 0% 100% 20ms 13.33% main.computePrimeGoroutinesAtomic
0 0% 100% 20ms 13.33% main.computePrimeGoroutinesAtomic.func1
0 0% 100% 20ms 13.33% main.main
0 0% 100% 20ms 13.33% main.main.func4
Atomic With bounded concurrency
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=10-8 764203 1562 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=20-8 358052 2858 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=100-8 127882 9371 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=1000-8 18506 65526 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=100000-8 224 5140853 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=1000000-8 19 61960467 ns/op
Even with limiting the number of goroutines, we don’t see much improvement. Which again makes sense as the goroutines are not competing for the same resource. (Because we are not locking the entire array)
Code is here: primeComputeAtomicBounded.go
File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 202.51ms, Total samples = 170ms (83.95%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 170ms, 100% of 170ms total
Showing top 10 nodes out of 32
flat flat% sum% cum cum%
100ms 58.82% 58.82% 100ms 58.82% runtime.usleep
20ms 11.76% 70.59% 20ms 11.76% main.markMultipleNotPrimeAtomic (inline)
20ms 11.76% 82.35% 20ms 11.76% runtime.madvise
10ms 5.88% 88.24% 20ms 11.76% runtime.chansend
10ms 5.88% 94.12% 40ms 23.53% runtime.lock2
10ms 5.88% 100% 10ms 5.88% runtime.pthread_cond_wait
0 0% 100% 20ms 11.76% main.computePrimeGoroutinesAtomicBounded
0 0% 100% 50ms 29.41% main.computePrimeGoroutinesAtomicBounded.func1
0 0% 100% 30ms 17.65% main.computePrimeGoroutinesAtomicBounded.func1.1
0 0% 100% 20ms 11.76% main.main
Avoiding the isReallyPrime function all together
I wanted to remove isReallyPrime function but due to no ordering guaratee in goroutines I was forced to use it.
But one thing struck me
If I discover a prime p, that means I know all the prime numbers from 2 to p.
using only this I can mark all the multiples up to p^2
That meant if I know prime numbers till p, I can find all prime numbers till p^2.
So if I know 2 is prime, then I can find all prime number till 4 if i know 3 is prime I can find all prime numbers till 9 I then know 7 is prime from there I can find all prime numbers till 49
You get the drift.
And here I don’t have to worry about ordering
if I know 2,3,5, and 7 are prime they can independently go and mark there multiples as not prime and I don’t need to synchronize anything.
Using this approach I went ahead with it, the registry will just be an array of bool, I also stored some additional book keeping info like lastPrime and computedTill and also waitgroup to know a segments processing is done or not.
We had finally removed the need of isReallyPrime function
Code is here: chunkedPrimeCompute.go
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=10-8 1000000 1187 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=20-8 940856 1409 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=100-8 458127 2617 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=1000-8 159067 8282 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=100000-8 2592 413818 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=1000000-8 361 3193429 ns/op
Results
- Finally we have a speedier implementation than the sequential version.
- For small size numbers it was not better than single routine version.
- For large sizes it was much better than the single routine version.
- Note: Although we have a race condition, due to no synchronisation. But since we are only changing data once it does not matter, just for this usecase.
PS: I did add a version which is atomic and chunked version and it was less performant as single go-routine version. Code is here: atomicChunkedPrimeCompute.go