Memory Snapshot
Modal can save the state of your Function’s memory right after initialization and restore it directly later, skipping initialization work.
These “memory snapshots” can dramatically improve cold start performance for Modal Functions.
During initialization, your code might read many files from the file system, which is quite expensive.
For example, the `torch` package is hundreds of MiB and requires over 20,000 file operations to load!
Such Functions typically start several times faster with memory snapshots enabled.
The memory snapshot feature has two variants. GPU memory snapshots (alpha) provide full GPU access before the snapshot is taken, while CPU memory snapshots do not.
CPU Memory Snapshot
CPU memory snapshots capture the state of a container and save it to disk. This saved snapshot can then be used to quickly restore new containers to the exact same state.
Basic usage
You can enable memory snapshots for your Function with the `enable_memory_snapshot=True` parameter:
```python
@app.function(enable_memory_snapshot=True)
def my_func():
    print("hello")
```
Then deploy the App with `modal deploy`. Memory snapshots are created only for deployed Apps.
When using classes decorated with `@cls`, `@modal.enter()` hooks are not included in the snapshot by default. Add `snap=True` to include them:
```python
@app.cls(enable_memory_snapshot=True)
class MyCls:
    @modal.enter(snap=True)
    def load(self):
        ...
```
Any code executed in global scope, such as top-level imports, will also be captured by the memory snapshot.
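For instance, here is a minimal sketch of that behavior (the app name and the choice of `numpy` as the "expensive" import are illustrative):

```python
import modal

image = modal.Image.debian_slim().pip_install("numpy")
app = modal.App("snapshot-globals", image=image)

with image.imports():
    # Runs in global scope at container startup, before the snapshot is
    # taken, so the loaded module is captured and restores skip this cost.
    import numpy as np


@app.function(enable_memory_snapshot=True)
def compute():
    # numpy is already imported when the container restores from a snapshot.
    print(np.ones(3).sum())
```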
CPU memory snapshots for GPU workloads
CPU memory snapshots don’t support direct GPU memory capture, but GPU Functions can still benefit from memory snapshots through a two-stage initialization process. This involves refactoring your initialization code to run across two separate `@modal.enter` methods: one that runs before creating the snapshot (`snap=True`), and one that runs after restoring from the snapshot (`snap=False`). Load model weights into CPU memory in the `snap=True` method, then move the weights onto GPU memory in the `snap=False` method.
Here’s an example using the `sentence-transformers` package:
```python
import modal

image = modal.Image.debian_slim().pip_install("sentence-transformers")
app = modal.App("sentence-transformers", image=image)

with image.imports():
    from sentence_transformers import SentenceTransformer

model_vol = modal.Volume.from_name("sentence-transformers-models", create_if_missing=True)


@app.cls(gpu="a10g", volumes={"/models": model_vol}, enable_memory_snapshot=True)
class Embedder:
    model_id = "BAAI/bge-small-en-v1.5"

    @modal.enter(snap=True)
    def load(self):
        # Create a memory snapshot with the model loaded in CPU memory.
        self.model = SentenceTransformer(f"/models/{self.model_id}", device="cpu")

    @modal.enter(snap=False)
    def setup(self):
        self.model.to("cuda")  # Move the model to a GPU!

    @modal.method()
    def run(self, sentences: list[str]):
        embeddings = self.model.encode(sentences, normalize_embeddings=True)
        print(embeddings)


@app.local_entrypoint()
def main():
    Embedder().run.remote(sentences=["what is the meaning of life?"])


if __name__ == "__main__":
    # Call the deployed class directly, e.g. from a local script.
    cls = modal.Cls.from_name("sentence-transformers", "Embedder")
    cls().run.remote(sentences=["what is the meaning of life?"])
```
Even without GPU snapshotting, this workaround reduces the time it takes for `Embedder.run` to start up by about 3x, from ~6 seconds down to just ~2 seconds.
GPU availability during the memory snapshot phase
If you are using the GPU memory snapshot feature (`enable_gpu_snapshot`), then GPUs are available within `@enter(snap=True)`. If you are using memory snapshots without `enable_gpu_snapshot`, GPUs will not be available within the `@enter(snap=True)` method.
```python
import modal

app = modal.App(image=modal.Image.debian_slim().pip_install("torch"))


@app.cls(enable_memory_snapshot=True, gpu="A10")
class GPUAvailability:
    @modal.enter(snap=True)
    def no_gpus_available_during_snapshots(self):
        import torch

        print(f"GPUs available: {torch.cuda.is_available()}")  # False

    @modal.enter(snap=False)
    def gpus_available_following_restore(self):
        import torch

        print(f"GPUs available: {torch.cuda.is_available()}")  # True

    @modal.method()
    def demo(self):
        import torch  # function-local imports don't share scope, so import again

        print(f"GPUs available: {torch.cuda.is_available()}")  # True
```
Known limitations
The `torch.cuda` module has multiple functions which, if called during snapshotting, will initialize CUDA as having zero GPU devices. Such functions include `torch.cuda.is_available` and `torch.cuda.get_device_capability`. If you’re using a framework that calls these methods during its import phase, it may not be compatible with memory snapshots. The problem can manifest as confusing “cuda not available” or “no CUDA-capable device is detected” errors. We have found that importing PyTorch twice solves the problem in some cases:
```python
@app.cls(enable_memory_snapshot=True, gpu="A10")
class GPUAvailability:
    @modal.enter(snap=True)
    def pre_snap(self):
        import torch

        ...

    @modal.enter(snap=False)
    def post_snap(self):
        import torch  # re-import to re-init GPU availability state

        ...
```
In particular, `xformers` is known to call `torch.cuda.get_device_capability` on import, so if it is imported during snapshotting it can unhelpfully initialize CUDA with zero GPUs. The workaround for this is to set the `XFORMERS_ENABLE_TRITON` environment variable to `1` in your `modal.Image`.
```python
image = modal.Image.debian_slim().pip_install("xformers>=0.28")  # for instance
image = image.env({"XFORMERS_ENABLE_TRITON": "1"})
```
GPU Memory Snapshot
With our experimental GPU memory snapshot feature, we are able to capture the entire GPU state too. This makes for simpler initialization logic and even faster cold starts.
Pass the additional option `experimental_options={"enable_gpu_snapshot": True}` to your Function or class to enable GPU snapshotting. These Functions have full GPU and CUDA access.
```python
@app.function(
    gpu="a10",
    enable_memory_snapshot=True,
    experimental_options={"enable_gpu_snapshot": True},
)
def my_gpu_func():
    import torch

    print(f"GPUs available: {torch.cuda.is_available()}")  # True
```
Here’s what the above `SentenceTransformer` example looks like with GPU memory snapshot enabled:
```python
@app.cls(
    gpu="a10g",
    volumes={"/models": model_vol},
    enable_memory_snapshot=True,
    experimental_options={"enable_gpu_snapshot": True},
)
class Embedder:
    model_id = "BAAI/bge-small-en-v1.5"

    @modal.enter(snap=True)
    def load(self):
        # Create a memory snapshot with the model loaded in GPU memory.
        self.model = SentenceTransformer(f"/models/{self.model_id}", device="cuda")
```
To achieve even faster cold starts, we recommend warming up your model by running a few forward passes on sample data in the `@enter(snap=True)` method.
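As a rough sketch of what that warm-up could look like (the sample inputs, batch size, and iteration count here are illustrative, not prescriptive), extending the `load` method above:

```python
@modal.enter(snap=True)
def load(self):
    self.model = SentenceTransformer(f"/models/{self.model_id}", device="cuda")
    # Run a few forward passes so CUDA kernels and allocator state are
    # warm before the snapshot is captured, rather than on every restore.
    for _ in range(3):
        self.model.encode(["warming up the model"] * 8, normalize_embeddings=True)
```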
Refer to the code sample here for a more complete example. Our blog post also provides more useful details.
Known limitations
GPU memory snapshots are in alpha. We’ve seen that they can massively reduce cold boot time but we are still exploring their limitations. Try it for yourself and let us know how it goes!
Memory Snapshot FAQ
When are snapshots updated?
Redeploying your Function with new configuration (e.g. a new GPU type) or new code will cause previous snapshots to become obsolete. Subsequent invocations to the new Function version will automatically create new snapshots with the new configuration and code.
Changes to Modal Volumes do not cause snapshots to update. Note, however, that deleting files in a Volume that are used during restore will cause restore failures.
I haven’t changed my Function. Why do I still see snapshots being created sometimes?
Modal recaptures snapshots to keep up with the platform’s latest runtime and security changes.
Additionally, you may observe your Function being snapshotted multiple times during its first few invocations. This happens because memory snapshots are specific to the underlying worker type that created them (e.g. low-level processor details), and Modal Functions run across a handful of worker types.
Snapshots may add a small amount of latency to Function initialization.
CPU-only Functions need around 6 snapshots for full coverage, and Functions targeting a specific GPU (e.g. A100) need 2-3.
How do snapshots handle randomness?
If your application depends on uniqueness of state, you must evaluate your Function code and verify that it is resilient to snapshotting operations. For example, if a variable is randomly initialized and then snapshotted, that variable will be identical after every restore, possibly breaking uniqueness expectations of the subsequent Function code.
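One possible pattern (a sketch, assuming an `app` as in the earlier examples; the class, hook, and attribute names are illustrative) is to reinitialize random state in a `snap=False` hook, since those hooks run after every restore:

```python
@app.cls(enable_memory_snapshot=True)
class RandomnessSafe:
    @modal.enter(snap=False)
    def reseed(self):
        # Runs after every restore, so each restored container gets fresh
        # entropy instead of reusing the state captured in the snapshot.
        import os
        import random
        import uuid

        random.seed(os.urandom(16))
        self.instance_id = uuid.uuid4().hex  # unique per restored container
```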