Stan and OCI Images

2024-03-15

CmdStan is a heavy dependency when building docker/OCI images. On my machine using the latest debian:testing-slim, it requires 1.3G of space.

root@49237339e848:~# du -h -s ~/.cmdstan
1.3G    /root/.cmdstan

Let’s see how small we can make an image to run a Stan model, first whilst using CmdStanR and second by ditching the R interface in favor of a barebones approach.

CmdStanR

The typical compilation pipeline for CmdStan is to go from Stan -> C++ -> Executable Binary. If your Stan model can be pre-compiled as part of the image build, then you can use a multi-stage build to compile the model binary and copy it into a fresh image without the CmdStan libraries.

The only caveat is that we need to include any shared libraries that may be linked into the model binary. We can get a list using ldd. In this case I’ll use the example model found in examples/bernoulli/bernoulli.stan in the CmdStan installation directory.

root@49237339e848:~/.cmdstan/cmdstan-2.34.1/examples/bernoulli# ldd bernoulli
        linux-vdso.so.1 (0x00007ffc979bc000)
        libtbb.so.2 => /root/.cmdstan/cmdstan-2.34.1/stan/lib/stan_math/lib/tbb/libtbb.so.2 (0x000078044df98000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000078044dc00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000078044deb6000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000078044de89000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000078044da1e000)
        /lib64/ld-linux-x86-64.so.2 (0x000078044e1c0000)

The output shows that CmdStan links to a locally bundled copy of the Intel TBB library, the exact path depends on where you’ve installed CmdStan. Regardless of whether your model uses TBB, Stan will always link to it and set the rpath to the local CmdStan directory.

So, we’ll also copy over libtbb.so.2 and I’ll use patchelf to modify the rpath of the model binary to point to the new directory.

All together, we end up with the following Containerfile where again I’m using the example bernoulli model included with CmdStan.

# -*- mode: dockerfile-ts -*-
FROM debian:testing-slim AS base

RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential ca-certificates curl patchelf && \
    rm -rf /var/lib/apt/lists/*

FROM base AS cmdstan

ARG CMDSTAN_VERSION=2.34.1

RUN curl -LO https://github.com/stan-dev/cmdstan/releases/download/v${CMDSTAN_VERSION}/cmdstan-${CMDSTAN_VERSION}.tar.gz \
    && mkdir -p cmdstan \
    && tar -xzf cmdstan-${CMDSTAN_VERSION}.tar.gz --strip 1 -C cmdstan

WORKDIR /cmdstan

RUN make -j$(nproc) examples/bernoulli/bernoulli && \
    patchelf --set-rpath /usr/local/lib examples/bernoulli/bernoulli && \
    strip -s examples/bernoulli/bernoulli

FROM base

RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base-core && \
    rm -rf /var/lib/apt/lists/*

RUN mkdir /project
WORKDIR /project

RUN Rscript -e "install.packages('cmdstanr', repos = c('https://mc-stan.org/r-packages/', getOption('repos')))"
COPY analysis.R ./
COPY --from=cmdstan /cmdstan/examples/bernoulli/bernoulli bernoulli
COPY --from=cmdstan /cmdstan/stan/lib/stan_math/lib/tbb/libtbb.so.2 /usr/local/lib/libtbb.so.2

CMD Rscript analysis.R

The analysis.R file that is also being copied into the image is just a basic script that generates some simulated data and runs the pre-compiled model using cmdstanr. However, there are two important things to note. First, we use the exe_file argument to cmdstan_model to specify that our stan model has already been compiled.

Second, cmdstanr will error if there is not a local copy of CmdStan available. This is not because we actually need to call out to CmdStan, but rather the function cmdstan_version, which is used for several version checks, relies on cmdstan_path being properly set. Since we know exactly which version we’ve used to compile our model, we’re going to override the cmdstan_version function within the cmdstanr package namespace to return a constant using the R function assignInNamespace.

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library(cmdstanr))
options(mc.cores = parallel::detectCores())

# Force cmdstanr to not check the version using a locally installed
# copy of CmdStan
assignInNamespace("cmdstan_version", function(...) "2.34.1", ns = "cmdstanr")

data <- list(N = 20, y = rbinom(20, 1, 0.3))

mod <- cmdstan_model(exe_file = "bernoulli")
fit <- mod$sample(data = data, refresh = 0)

fit$summary()

Let’s build our image.

$ podman build --jobs $(nproc) -t cmdstanr -f Containerfile.cmdstanr .
$ podman images
REPOSITORY                TAG           IMAGE ID      CREATED            SIZE
localhost/cmdstanr        latest        e5f6927fe990  3 minutes ago      568 MB
localhost/full-cmdstanr   latest        a3dc48251dde  About an hour ago  1.81 GB
$ podman run stan
Running MCMC with 4 chains, at most 16 in parallel...

Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.3 seconds.

# A tibble: 2 x 10
  variable    mean  median     sd    mad       q5     q95  rhat ess_bulk
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <dbl>    <dbl>
1 lp__     -12.3   -12.0   0.736  0.325  -13.8    -11.8    1.00    1649.
2 theta      0.227   0.220 0.0871 0.0890   0.0960   0.378  1.00    1402.

The full-cmdstanr:latest image is also derived from debian:testing-slim; however, it does not use a multi-stage build and instead includes a copy of CmdStan installed through install_cmdstan(). When compared to our slimmed down image, a reduction from 1.81 GB to 568 MB is not too bad.

CmdStan

Let’s keep going. We’ll still use a multi-stage pre-compilation of our Stan model, but this time we’ll switch out the base image from Debian to Google’s distroless. We’ll also ditch cmdstanr in favor of running the model binary directly.

The only hitch is that we need to ensure compatible glibc versions between build stage and the distroless base image, which is based on Debian 12, and we’ll need to serialize our input data to json so that is can be read in by our model binary. For the sake of the example, we’ll just use the bernoulli data distributed directly with CmdStan.

# -*- mode: dockerfile-ts -*-

# Our build stage is essentially the same as before
FROM debian:12 as cmdstan

RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential ca-certificates curl patchelf && \
    rm -rf /var/lib/apt/lists/*

ARG CMDSTAN_VERSION=2.34.1

RUN curl -LO https://github.com/stan-dev/cmdstan/releases/download/v${CMDSTAN_VERSION}/cmdstan-${CMDSTAN_VERSION}.tar.gz \
    && mkdir -p cmdstan \
    && tar -xzf cmdstan-${CMDSTAN_VERSION}.tar.gz --strip 1 -C cmdstan

WORKDIR /cmdstan

RUN make -j$(nproc) examples/bernoulli/bernoulli && \
    patchelf --set-rpath / examples/bernoulli/bernoulli && \
    strip -s examples/bernoulli/bernoulli

FROM gcr.io/distroless/cc-debian12

COPY --from=cmdstan /cmdstan/examples/bernoulli/bernoulli.data.json .
COPY --from=cmdstan /cmdstan/examples/bernoulli/bernoulli .
COPY --from=cmdstan /cmdstan/stan/lib/stan_math/lib/tbb/libtbb.so.2 .

ENTRYPOINT ["/bernoulli", "sample", "num_chains=4", "data", "file=/bernoulli.data.json"]

Building this image now gets the size down to 29.8 MB.

$ podman build -t stan-distroless -f Containerfile.distroless .
$ podman images
REPOSITORY                    TAG           IMAGE ID      CREATED            SIZE
localhost/stan-distroless     latest        5c7cb50cf5c9  1 minutes ago      29.8 MB
localhost/cmdstanr            latest        e5f6927fe990  3 minutes ago      568 MB
localhost/full-cmdstanr       latest        a3dc48251dde  About an hour ago  1.81 GB