Just concurrency will not make your go code faster

This was mainly an exercise on coming up ways how I could optimise my code. I took prime computation as example it made the scope smaller.

Problem statement: Make Prime Computation by Sieve of Eratosthenes faster

If you might need a refresher Sieve of Eratosthenes says

First start with 2 and then mark all its multiples as not prime
The next unmarked number is prime, take that number and mark all of its multiple as not primes
Keep doing this any unmarked number is a prime; mark all its multiple. if a number is marked we don’t really care about.
Finally return all the unmarked numbers.
Note: this is a sequential problem

When i first looked at it, I wouldn’t lie I thought it would be just straight forward: I would just use goroutines and then marking all multiples and it will make the program faster.

Of course I was very wrong.

Before going into that I will give a brief explanation about how I have structured my code

There are three parts

Registry - this is where I mark if a number is prime or not. think of an array of boolean. Initialised all values as true except 0 and 1
MarkAllmultiplesAsNotPrime - a function which marks all multiples of a prime as not primes. Takes Registry and number as inputs
Result - Iterating through registry and gathering all still marked as true.

This was the initial implementation to compare to.

You can check the code here. The sequential version is in file: computePrimeSingle.go

Our Base Benchmark

BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=10-8         	13815188	        75.59 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=20-8         	10208707	       120.1 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=100-8        	 3170428	       383.4 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=1000-8       	  394338	      3085 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=100000-8     	    2769	    428106 ns/op
BenchmarkComputePrimeSingle/Using_compute_prime_single_to_find_primes_till_n=1000000-8    	     268	   4609889 ns/op

Using goroutines for speed

The idea was simple to add multiple goroutines to mark numbers concurrently. Code is here: computePrimeWithGoroutines.go

The issue that immediate came up was that if i move markMultiplesAsNotPrime in goroutines is that ordering is not guaranteed. It can very well happen that we pick a number that is not yet marked as multiple and we start marking its multiples; this would be extra work which we don’t want to do.

So to counter that I added a isReallyPrime function; this would quickly check if a number is truly prime and then we can correctly mark its multiple as not primes. The tradeoff

Initially we would have to do extra work to verify primes
But after a particular number of primes we would have filtered out significant number of numbers as not prime and mostly only for truly primes we would be running isReallyPrime

I also add a RWLock to registry, so as we don’t face race condition.

Another addition was wait groups to make sure are processing is really complete.

The result was an extremely slow implementation.

Let’s compare the benchmark

BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=10-8         	  799340	      1427 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=20-8         	  373285	      2998 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=100-8        	   62986	     18970 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=1000-8       	    3364	    350988 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=100000-8     	      19	  53346985 ns/op
BenchmarkComputePrimeWithGoroutines/Using_compute_primes_with_goroutines_to_find_primes_till_n=1000000-8    	       2	 617334458 ns/op

Looking into the cpu profile of this implementation.

File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:04 IST
Duration: 802.40ms, Total samples = 1.05s (130.86%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.04s, 99.05% of 1.05s total
Showing top 10 nodes out of 65
      flat  flat%   sum%        cum   cum%
     0.48s 45.71% 45.71%      0.48s 45.71%  runtime.usleep
     0.33s 31.43% 77.14%      0.33s 31.43%  runtime.pthread_cond_wait
     0.16s 15.24% 92.38%      0.16s 15.24%  runtime.pthread_cond_signal
     0.01s  0.95% 93.33%      0.02s  1.90%  internal/sync.(*Mutex).lockSlow
     0.01s  0.95% 94.29%      0.01s  0.95%  main.isReallyPrime
     0.01s  0.95% 95.24%      0.28s 26.67%  runtime.lock2
     0.01s  0.95% 96.19%      0.01s  0.95%  runtime.nanotime1
     0.01s  0.95% 97.14%      0.01s  0.95%  runtime.releasem
     0.01s  0.95% 98.10%      0.24s 22.86%  runtime.stealWork
     0.01s  0.95% 99.05%      0.03s  2.86%  runtime.unlock2

Observations

I am just firing goroutines and all of them and a lot of time is spent on scheduling goroutines.
Also goroutines are trying to write on the same underlying data structure which would be causing a lot of contention.
After scheduling, the second most time spent is with the lock.
Next was the isReallyPrime function, we are doing extra work here.

Using goroutines; but limiting them

Keeping the goroutine code intact, I added buffered channel for bounded concurrency. Code is here: computePrimesWithGoroutinesBoundedConcurrecy.go

Keeping the channel size as runtime.NumCPU() to make sure we are not causing any delay due to scheduler having to manage large number of goroutines.

Benchmark

BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=10-8         	  689505	      1773 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=20-8         	  326319	      3622 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=100-8        	   42302	     28656 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=1000-8       	    2812	    421174 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=100000-8     	      28	  38294449 ns/op
BenchmarkComputePrimeWithGoroutinesBoundedConcurrency/Using_compute_primes_with_goroutines_bounded_concurrency_to_find_primes_till_n=1000000-8    	       3	 392406611 ns/op

Result:

No drastic improvement.
I saw very minor improvement for larger value of n.
For smaller value of n it was basically a degradation.

It kind of made sense, for smaller value of n’s we won’t be having a lot of concurrent markings going on.

CPU Profiles

File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 502.56ms, Total samples = 970ms (193.01%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 920ms, 94.85% of 970ms total
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
     450ms 46.39% 46.39%      450ms 46.39%  runtime.usleep
     260ms 26.80% 73.20%      260ms 26.80%  runtime.pthread_cond_wait
     110ms 11.34% 84.54%      110ms 11.34%  runtime.pthread_cond_signal
      30ms  3.09% 87.63%      120ms 12.37%  internal/sync.(*Mutex).lockSlow
      20ms  2.06% 89.69%      110ms 11.34%  runtime.wakep
      10ms  1.03% 90.72%      130ms 13.40%  internal/sync.(*Mutex).Lock
      10ms  1.03% 91.75%       10ms  1.03%  main.isReallyPrime
      10ms  1.03% 92.78%       10ms  1.03%  runtime.(*semaRoot).dequeue
      10ms  1.03% 93.81%       10ms  1.03%  runtime.(*timers).check
      10ms  1.03% 94.85%       10ms  1.03%  runtime.cansemacquire (inline)

Next steps

The improvement was minor, still not as fast as not using goroutines
We have two major points that we can fix
- Remove the lock
- Remove the reallyPrime function.

Going lockless, using atomics

In the earlier operation, I was locking the entire array for any writes and we can remove that by using atomics.

So I changed the registry to an array of atomics and remove the lock.

The rest of the code was still the same. Code is here: primeComputeAtomic.go

BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=10-8                                            	  966913	      1228 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=20-8                                            	  538166	      2236 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=100-8                                           	  196950	      6215 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=1000-8                                          	   27308	     42110 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=100000-8                                        	     278	   4250877 ns/op
BenchmarkComputePrimeGoroutinesAtomic/Using_compute_prime_goroutines_atomic_to_find_primes_till_n=1000000-8                                       	      19	  59738417 ns/op

Result

A noticeable bump in performance
Still not as good enough as the single version

CPU Profiles

File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 202.48ms, Total samples = 150ms (74.08%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 150ms, 100% of 150ms total
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
      60ms 40.00% 40.00%       60ms 40.00%  runtime.usleep
      30ms 20.00% 60.00%       30ms 20.00%  runtime.madvise
      20ms 13.33% 73.33%       20ms 13.33%  main.isReallyPrime (inline)
      20ms 13.33% 86.67%       20ms 13.33%  main.markMultipleNotPrimeAtomic (inline)
      10ms  6.67% 93.33%       10ms  6.67%  runtime.casgstatus
      10ms  6.67%   100%       10ms  6.67%  runtime.pthread_cond_wait
         0     0%   100%       20ms 13.33%  main.computePrimeGoroutinesAtomic
         0     0%   100%       20ms 13.33%  main.computePrimeGoroutinesAtomic.func1
         0     0%   100%       20ms 13.33%  main.main
         0     0%   100%       20ms 13.33%  main.main.func4

Atomic With bounded concurrency

BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=10-8                             	  764203	      1562 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=20-8                             	  358052	      2858 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=100-8                            	  127882	      9371 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=1000-8                           	   18506	     65526 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=100000-8                         	     224	   5140853 ns/op
BenchmarkComputePrimeGoroutinesAtomicBounded/Using_compute_prime_goroutines_atomic_bounded_to_find_primes_till_n=1000000-8                        	      19	  61960467 ns/op

Even with limiting the number of goroutines, we don’t see much improvement. Which again makes sense as the goroutines are not competing for the same resource. (Because we are not locking the entire array)

Code is here: primeComputeAtomicBounded.go

File: primeCompute
Type: cpu
Time: 2026-06-14 14:41:05 IST
Duration: 202.51ms, Total samples = 170ms (83.95%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 170ms, 100% of 170ms total
Showing top 10 nodes out of 32
      flat  flat%   sum%        cum   cum%
     100ms 58.82% 58.82%      100ms 58.82%  runtime.usleep
      20ms 11.76% 70.59%       20ms 11.76%  main.markMultipleNotPrimeAtomic (inline)
      20ms 11.76% 82.35%       20ms 11.76%  runtime.madvise
      10ms  5.88% 88.24%       20ms 11.76%  runtime.chansend
      10ms  5.88% 94.12%       40ms 23.53%  runtime.lock2
      10ms  5.88%   100%       10ms  5.88%  runtime.pthread_cond_wait
         0     0%   100%       20ms 11.76%  main.computePrimeGoroutinesAtomicBounded
         0     0%   100%       50ms 29.41%  main.computePrimeGoroutinesAtomicBounded.func1
         0     0%   100%       30ms 17.65%  main.computePrimeGoroutinesAtomicBounded.func1.1
         0     0%   100%       20ms 11.76%  main.main

Avoiding the isReallyPrime function all together

I wanted to remove isReallyPrime function but due to no ordering guaratee in goroutines I was forced to use it.

But one thing struck me

If I discover a prime p, that means I know all the prime numbers from 2 to p.

using only this I can mark all the multiples up to p^2

That meant if I know prime numbers till p, I can find all prime numbers till p^2.

So if I know 2 is prime, then I can find all prime number till 4 if i know 3 is prime I can find all prime numbers till 9 I then know 7 is prime from there I can find all prime numbers till 49

You get the drift.

And here I don’t have to worry about ordering

if I know 2,3,5, and 7 are prime they can independently go and mark there multiples as not prime and I don’t need to synchronize anything.

Using this approach I went ahead with it, the registry will just be an array of bool, I also stored some additional book keeping info like lastPrime and computedTill and also waitgroup to know a segments processing is done or not.

We had finally removed the need of isReallyPrime function

Code is here: chunkedPrimeCompute.go

BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=10-8                                   1000000	      1187 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=20-8                                   940856	      1409 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=100-8                                  458127	      2617 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=1000-8                                  159067	      8282 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=100000-8                                  2592	    413818 ns/op
BenchmarkComputePrimeChunked/Using_compute_prime_chunked_to_find_primes_till_n=1000000-8                                  361	   3193429 ns/op

Results

Finally we have a speedier implementation than the sequential version.
For small size numbers it was not better than single routine version.
For large sizes it was much better than the single routine version.
Note: Although we have a race condition, due to no synchronisation. But since we are only changing data once it does not matter, just for this usecase.

PS: I did add a version which is atomic and chunked version and it was less performant as single go-routine version. Code is here: atomicChunkedPrimeCompute.go