Contiguous Memory: A Cautionary Tale

technical
performance
python
Breaking down some functions that break memory contiguity, and the performance implications when they do
Author

Kevin Bird

Published

February 1, 2026

I was optimizing some image processing code recently when I hit a wall. My tobytes() call was taking three times longer than expected.

I profiled. I tweaked. I stared at the code. Nothing helped.

I tried different buffer sizes. I experimented with memory allocation patterns. I even suspected thermal throttling. Nothing made a difference.

Eventually, I mentioned the problem to a chatbot. It asked me to add a log right before the call: img.flags['C_CONTIGUOUS']. To my surprise, this returned False. I deployed a version without any flipping or rotating, and suddenly tobytes() was fast again.

This led me to learn about contiguous memory. Hopefully this post helps others avoid the same issue.

The Misleading Profile

Here’s a simplified version of what my code looked like:

import numpy as np
import time

img = np.random.randint(0, 256, (1000, 2000), dtype=np.uint8)

# Step 1: Rotate
start = time.perf_counter()
rotated = np.rot90(img)
print(f"rotate: {time.perf_counter() - start:.4f}s")

# Step 2: Convert to bytes
start = time.perf_counter()
data = rotated.tobytes()
print(f"tobytes: {time.perf_counter() - start:.4f}s")
rotate: 0.0001s
tobytes: 0.0068s

The rotation takes a fraction of a millisecond. The tobytes() takes much longer. So obviously tobytes() is the problem, right?

But watch what happens when I skip the rotation:

start = time.perf_counter()
data = img.tobytes()
print(f"tobytes (no rotation): {time.perf_counter() - start:.4f}s")
tobytes (no rotation): 0.0014s

Now tobytes() is fast. The rotation wasn’t slow. It was too fast. It didn’t actually rearrange the data; it just changed how NumPy views the data. And that left tobytes() to do the hard work of gathering scattered bytes.
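A quick way to verify that the rotation hands back a view rather than a copy is np.shares_memory, which NumPy provides for exactly this check. A small sketch:

```python
import numpy as np

img = np.random.randint(0, 256, (1000, 2000), dtype=np.uint8)
rotated = np.rot90(img)

# The "rotation" is just new view metadata over the same buffer
print(np.shares_memory(img, rotated))   # True: no data was copied
print(rotated.flags['C_CONTIGUOUS'])    # False: the view is strided
```

No bytes moved, which is why the rotation itself clocks in at microseconds.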

What Is Contiguous Memory?

Imagine a bookshelf with novels arranged side by side: Book 1, Book 2, Book 3, and so on. If you need to grab all of them, you can sweep your hand across and scoop them up in one motion. That’s contiguous memory, where data is stored in one unbroken sequence.

Now imagine those same books scattered across the room: one on the couch, one on the kitchen table, one under the bed. Getting them all requires walking around and picking up each one individually. Same books, same information, but much more work.

When data is contiguous in memory, operations like copying or converting can happen quickly. The computer can grab everything at once. When it’s not contiguous, every piece requires a separate lookup.

Seeing It in Code

NumPy makes this easy to observe. Let’s create an array and check if it’s contiguous:

arr = np.arange(12).reshape(3, 4)
print(arr)
print(f"Contiguous: {arr.flags['C_CONTIGUOUS']}")
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Contiguous: True

The data is laid out in memory exactly as you’d read it: 0, 1, 2, 3, 4, 5… in one unbroken sequence.
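You can confirm that layout directly: serializing the contiguous array simply streams the sequence out. A small sketch, using an explicit uint8 dtype so each element is exactly one byte:

```python
import numpy as np

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Row-major (C-order) layout: the raw buffer is literally 0..11 in sequence
print(arr.tobytes() == bytes(range(12)))   # True

# Strides in bytes: step 4 to move down a row, 1 to move across a column
print(arr.strides)                         # (4, 1)
```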

Flip vs Rotate

Let’s flip the array horizontally:

flipped = np.flip(arr, axis=1)
print(flipped)
print(f"Contiguous: {flipped.flags['C_CONTIGUOUS']}")
[[ 3  2  1  0]
 [ 7  6  5  4]
 [11 10  9  8]]
Contiguous: False

It says False! Flipping doesn’t actually move any data; it just changes how NumPy reads the existing data. It creates a view. Back to our bookshelf: it’s as if we wrote down the titles in reverse order but didn’t change where the books sit on the shelf. That’s perfectly fine until somebody asks us to hand them the stack in that reversed order. Before we can, we have to physically reverse the books, and that is the operation that takes time. I’m going to move away from the book analogy now, but I hope it has helped you think about things up to this point.
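One way to see exactly what the view is doing is to look at its strides: the buffer is shared, and the column stride simply goes negative. A quick sketch, again with uint8 so strides are in single bytes:

```python
import numpy as np

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)
flipped = np.flip(arr, axis=1)

# Same buffer as arr...
print(np.shares_memory(arr, flipped))   # True

# ...but the column stride is now -1: NumPy walks each row backward
print(flipped.strides)                  # (4, -1)
```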

Now let’s rotate:

rotated = np.rot90(arr)
print(rotated)
print(f"Contiguous: {rotated.flags['C_CONTIGUOUS']}")
[[ 3  7 11]
 [ 2  6 10]
 [ 1  5  9]
 [ 0  4  8]]
Contiguous: False

Also False, but for a different reason. With flip, the data is still in rows, just read backward. With rot90, the data that should be in rows is actually scattered across columns of the original array.

Look at what rotation does: the first row of output [3, 7, 11] comes from the last column of the input. In memory, those values are 4 elements apart from each other. The second row [2, 6, 10] pulls from the second-to-last column, and again each value is separated by 4 memory positions.

When you call tobytes() on this view, NumPy can’t just stream through memory. It has to jump around, grabbing one byte here, skipping 4, grabbing another. This scattered access pattern is what makes non-contiguous arrays slow to serialize.
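To make the scattered pattern concrete, here is a small sketch that checks the view’s strides and then forces a contiguous copy with np.ascontiguousarray, the standard NumPy way to materialize a strided view:

```python
import numpy as np

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)
rotated = np.rot90(arr)

# Walking along an output row steps -1 byte through the original buffer;
# walking down an output column steps +4 bytes (one original row)
print(rotated.strides)                  # (-1, 4)

# Forcing a copy physically rearranges the data into a fresh buffer
copied = np.ascontiguousarray(rotated)
print(copied.flags['C_CONTIGUOUS'])     # True
print(np.shares_memory(arr, copied))    # False
```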

OpenCV Returns Contiguous Arrays

OpenCV’s functions return new contiguous arrays instead of views:

import cv2

arr_uint8 = np.arange(12).reshape(3, 4).astype(np.uint8)

flipped_cv2 = cv2.flip(arr_uint8, 1)  # horizontal flip
print(f"cv2.flip contiguous: {flipped_cv2.flags['C_CONTIGUOUS']}")

rotated_cv2 = cv2.rotate(arr_uint8, cv2.ROTATE_90_CLOCKWISE)
print(f"cv2.rotate contiguous: {rotated_cv2.flags['C_CONTIGUOUS']}")
cv2.flip contiguous: True
cv2.rotate contiguous: True

Both stay contiguous because cv2 copies the data into a new, properly-laid-out array.

Performance Comparison

You might expect NumPy views to be faster since they avoid copying. For the view operation itself, that’s true:

big = np.random.randint(0, 256, (1000, 2000), dtype=np.uint8)

%timeit np.rot90(big)
%timeit cv2.rotate(big, cv2.ROTATE_90_CLOCKWISE)
4.96 μs ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
152 μs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But if you need contiguous data afterward (for tobytes(), GPU transfer, etc.), you have to pay the copy cost eventually:

%timeit np.ascontiguousarray(np.rot90(big))
%timeit cv2.rotate(big, cv2.ROTATE_90_CLOCKWISE)
1.16 ms ± 293 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
152 μs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

And for flip operations:

%timeit np.flip(big, axis=1)
%timeit np.ascontiguousarray(np.flip(big, axis=1))
%timeit cv2.flip(big, 1)
1.7 μs ± 5.63 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
555 μs ± 115 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
53 μs ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

cv2.rotate is significantly faster when you need contiguous output. The difference: np.ascontiguousarray has to traverse a non-contiguous strided view, hopping around memory. cv2.rotate does the rotation and copy in a single optimized C++ pass.

PyTorch Behavior

PyTorch has different behavior depending on the operation:

import torch

t = torch.randint(0, 256, (1000, 2000), dtype=torch.uint8)
print(f"Original contiguous: {t.is_contiguous()}")

t_rot = torch.rot90(t)
print(f"After rot90 contiguous: {t_rot.is_contiguous()}")

t_flip = torch.flip(t, [1])
print(f"After flip contiguous: {t_flip.is_contiguous()}")
Original contiguous: True
After rot90 contiguous: False
After flip contiguous: True

torch.flip returns a contiguous copy, while torch.rot90 returns a non-contiguous view (like NumPy).

For rotation timing:

%timeit torch.rot90(t)
%timeit torch.rot90(t).contiguous()
%timeit torch.flip(t, [1])
300 μs ± 66.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.77 ms ± 125 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
285 μs ± 37.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The .contiguous() call adds significant overhead as it copies the strided view into a new contiguous tensor.

The Fix

When I called tobytes() on my rotated image, NumPy had to hop around memory collecting scattered pieces. Each value it needed was in a different memory location rather than sitting in a nice sequential block.

The solution? Use OpenCV instead of NumPy for transformations when you need contiguous output:

# Instead of this (slow when you need contiguous):
rotated = np.rot90(img)
data = rotated.tobytes()  # forced to gather scattered bytes

# Use this:
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
data = rotated.tobytes()  # already contiguous, fast

OpenCV’s functions do the transformation and produce contiguous memory in a single optimized C++ pass, about 8x faster than creating a NumPy view and then forcing it contiguous with np.ascontiguousarray().

NumPy’s np.ascontiguousarray() is still useful when you already have a non-contiguous array and can’t go back to change how it was created. But if you control the transformation step, OpenCV is the better choice.
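Conveniently, np.ascontiguousarray only copies when it needs to: on an already-contiguous input it returns the array unchanged, so calling it defensively is cheap. A quick sketch:

```python
import numpy as np

a = np.zeros((100, 100), dtype=np.uint8)   # already contiguous
b = np.ascontiguousarray(a)
print(b is a)                              # True: no copy was made

v = np.rot90(a)                            # non-contiguous view
c = np.ascontiguousarray(v)
print(np.shares_memory(v, c))              # False: a copy was made
print(c.flags['C_CONTIGUOUS'])             # True
```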

Summary

If you’re doing image processing and need contiguous output:

Library   Function                        Contiguous?   Time (1000×2000)
NumPy     np.flip                         No (view)     1.6 μs
NumPy     np.flip + ascontiguousarray     Yes           557 μs
NumPy     np.rot90                        No (view)     5 μs
NumPy     np.rot90 + ascontiguousarray    Yes           1.17 ms
OpenCV    cv2.flip                        Yes (copy)    53 μs
OpenCV    cv2.rotate                      Yes (copy)    150 μs
PyTorch   torch.flip                      Yes (copy)    323 μs
PyTorch   torch.rot90                     No (view)     328 μs
PyTorch   torch.rot90 + .contiguous()     Yes           2.32 ms

OpenCV is ~8x faster than NumPy when you need contiguous output.

When chasing performance, the slow operation isn’t always the guilty one. My tobytes() was the victim, not the culprit. The rotation upstream had silently broken the memory layout, and tobytes() was left doing the hard work of reassembling it.

If something seems inexplicably slow, check your contiguity. That .flags['C_CONTIGUOUS'] might just solve your mystery.