Breaking down some functions that break memory contiguity, and the performance cost of those breaks
Author
Kevin Bird
Published
February 1, 2026
I was optimizing some image processing code recently when I hit a wall. My tobytes() call was taking three times longer than expected.
I profiled. I tweaked. I stared at the code. Nothing helped.
I tried different buffer sizes. I experimented with memory allocation patterns. I even suspected thermal throttling. Nothing made a difference.
Eventually, I mentioned the problem to a chatbot. It asked me to add a log right before the call: img.flags['C_CONTIGUOUS']. To my surprise, this returned False. I deployed a version without any flipping or rotating, and suddenly tobytes() was fast again.
This led me to learn about contiguous memory. Hopefully this post helps others avoid the same issue.
The Misleading Profile
Here’s a simplified version of what my code looked like:
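In rough outline, it was a pattern like this (the image size here is a stand-in, not the real one):

```python
import time
import numpy as np

# Stand-in for the real image; the 1000x2000 uint8 size is assumed for illustration
img = np.zeros((1000, 2000), dtype=np.uint8)

start = time.perf_counter()
rotated = np.rot90(img)
print(f"rot90:   {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
data = rotated.tobytes()
print(f"tobytes: {time.perf_counter() - start:.4f}s")
```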
The rotation takes a fraction of a millisecond. The tobytes() takes much longer. So obviously tobytes() is the problem, right?
But watch what happens when I skip the rotation:
```python
start = time.perf_counter()
data = img.tobytes()
print(f"tobytes (no rotation): {time.perf_counter() - start:.4f}s")
```
tobytes (no rotation): 0.0014s
Now tobytes() is fast. The rotation wasn’t slow. It was too fast. It didn’t actually rearrange the data; it just changed how NumPy views the data. And that left tobytes() to do the hard work of gathering scattered bytes.
What Is Contiguous Memory?
Imagine a bookshelf with novels arranged side by side: Book 1, Book 2, Book 3, and so on. If you need to grab all of them, you can sweep your hand across and scoop them up in one motion. That’s contiguous memory, where data is stored in one unbroken sequence.
Now imagine those same books scattered across the room: one on the couch, one on the kitchen table, one under the bed. Getting them all requires walking around and picking up each one individually. Same books, same information, but much more work.
When data is contiguous in memory, operations like copying or converting can happen quickly. The computer can grab everything at once. When it’s not contiguous, every piece requires a separate lookup.
Seeing It in Code
NumPy makes this easy to observe. Let’s create an array and check if it’s contiguous:
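Something along these lines works (a sketch; the small integer array and the horizontal flip are chosen just for illustration):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr)
print(arr.flags['C_CONTIGUOUS'])      # True: a freshly created array is contiguous

flipped = np.flip(arr, axis=1)        # reverse each row
print(flipped.flags['C_CONTIGUOUS'])  # what does this say?
```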
It says False! Flipping doesn’t actually move any data; it just changes how NumPy reads the existing data. It creates a view. Back to our bookshelf: it’s as if we wrote down the titles in reverse order but never moved the books on the shelf. That’s perfectly fine until someone asks us to hand over the stack in that reversed order. At that point we have to physically reverse the books, and that reshuffling is the operation that takes time. I’ll set the book analogy aside now, but I hope it has helped frame things up to this point.
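Now the same check on a rotation of that array (continuing the sketch):

```python
rotated = np.rot90(arr)
print(rotated)
# [[ 3  7 11]
#  [ 2  6 10]
#  [ 1  5  9]
#  [ 0  4  8]]
print(rotated.flags['C_CONTIGUOUS'])
```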
Also False, but for a different reason. With flip, the data is still in rows, just read backward. With rot90, the data that should be in rows is actually scattered across columns of the original array.
Look at what rotation does: the first row of output [3, 7, 11] comes from the last column of the input. In memory, those values are 4 elements apart from each other. The second row [2, 6, 10] pulls from the second-to-last column, and again each value is separated by 4 memory positions.
When you call tobytes() on this view, NumPy can’t just stream through memory. It has to jump around, grabbing one element, jumping four positions ahead, grabbing the next. This scattered access pattern is what makes non-contiguous arrays slow to serialize.
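One way to see the jumping is to look at the strides, the number of bytes NumPy steps to reach the next element along each axis (continuing the sketch; the byte counts assume the default 8-byte integers):

```python
print(arr.strides)      # (32, 8): the next row is 32 bytes away, the next element 8
print(rotated.strides)  # (-8, 32): the "next element" in a row is now a 32-byte jump,
                        # and stepping to the next row walks backwards through memory
```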
OpenCV Returns Contiguous Arrays
OpenCV’s functions return new contiguous arrays instead of views:
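Here is a sketch of that kind of comparison, assuming a 1000×2000 uint8 image and a horizontal flip (rotation shows the same pattern; see the summary table below):

```python
import cv2
import numpy as np

img = np.zeros((1000, 2000), dtype=np.uint8)   # size assumed from the summary table

%timeit np.flip(img, 1)                        # only builds a view
%timeit np.ascontiguousarray(np.flip(img, 1))  # view first, then a strided gather into a copy
%timeit cv2.flip(img, 1)                       # flips and copies in one optimized pass
```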
1.7 μs ± 5.63 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
555 μs ± 115 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
53 μs ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
cv2.flip comes out roughly 10x faster here than flipping with NumPy and then forcing the result contiguous, and cv2.rotate shows the same pattern (see the summary table below). The difference: np.ascontiguousarray has to traverse a non-contiguous strided view, hopping around memory, while OpenCV does the transformation and the copy in a single optimized C++ pass.
PyTorch Behavior
PyTorch has different behavior depending on the operation:
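Again, a sketch of the comparison, with the tensor size assumed to match the summary table:

```python
import torch

t = torch.zeros((1000, 2000), dtype=torch.uint8)   # size assumed, matching the table

%timeit torch.flip(t, [1])              # torch.flip copies, so the result is contiguous
%timeit torch.rot90(t).contiguous()     # non-contiguous result forced into a fresh copy
%timeit torch.rot90(t)                  # result comes back non-contiguous
```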
300 μs ± 66.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.77 ms ± 125 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
285 μs ± 37.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The .contiguous() call adds significant overhead: it has to copy the non-contiguous, strided result into a fresh contiguous tensor.
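The contiguity checks tell the same story (continuing the sketch above):

```python
print(torch.flip(t, [1]).is_contiguous())             # True
print(torch.rot90(t).is_contiguous())                 # False
print(torch.rot90(t).contiguous().is_contiguous())    # True, at the cost of another copy
```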
The Fix
When I called tobytes() on my rotated image, NumPy had to hop around memory collecting scattered pieces. Each value it needed was in a different memory location rather than sitting in a nice sequential block.
The solution? Use OpenCV instead of NumPy for transformations when you need contiguous output:
```python
# Instead of this (slow when you need contiguous):
rotated = np.rot90(img)
data = rotated.tobytes()  # forced to gather scattered bytes

# Use this (counterclockwise matches np.rot90's default direction):
rotated = cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE)
data = rotated.tobytes()  # already contiguous, fast
```
OpenCV’s functions do the transformation and produce contiguous memory in a single optimized C++ pass, about 8x faster than creating a NumPy view and then forcing it contiguous with np.ascontiguousarray().
NumPy’s np.ascontiguousarray() is still useful when you already have a non-contiguous array and can’t go back to change how it was created. But if you control the transformation step, OpenCV is the better choice.
Summary
If you’re doing image processing and need contiguous output:
| Library | Function | Contiguous? | Time (1000×2000) |
|---------|----------|-------------|------------------|
| NumPy   | np.flip | No (view) | 1.6 μs |
| NumPy   | np.flip + ascontiguousarray | Yes | 557 μs |
| NumPy   | np.rot90 | No (view) | 5 μs |
| NumPy   | np.rot90 + ascontiguousarray | Yes | 1.17 ms |
| OpenCV  | cv2.flip | Yes (copy) | 53 μs |
| OpenCV  | cv2.rotate | Yes (copy) | 150 μs |
| PyTorch | torch.flip | Yes (copy) | 323 μs |
| PyTorch | torch.rot90 | No (not contiguous) | 328 μs |
| PyTorch | torch.rot90 + .contiguous() | Yes | 2.32 ms |
OpenCV is ~8x faster than NumPy when you need contiguous output.
When chasing performance, the slow operation isn’t always the guilty one. My tobytes() was the victim, not the culprit. The rotation upstream had silently broken the memory layout, and tobytes() was left doing the hard work of reassembling it.
If something seems inexplicably slow, check your contiguity. That .flags['C_CONTIGUOUS'] might just solve your mystery.