
Generated image is missing files generated via RUN #3123

Open
clemenskol opened this issue Apr 19, 2024 · 16 comments
Labels
differs-from-docker issue/missing-files kind/bug Something isn't working priority/p0 Highest priority. Break user flow. We are actively looking at delivering it. works-with-docker

Comments

@clemenskol

Actual behavior

Files generated via a RUN command should be included in the final image (e.g., regardless of file generation timestamp). This seems not to be the case.

I have generated a minimal Dockerfile to demonstrate this:

# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version

When building the image, python3.11 is not properly installed in the generated image, although it is clearly present while building.

My build command:

/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --snapshot-mode=full --cache=true

The output of the final 2 commands can be seen in the build output:

-rwxr-xr-x 1 root root 6890080 Aug 12  2022 /usr/bin/python3.11
Python 3.11.0rc1

When the generated image is then run, the file (python3.11) is not found; it simply does not exist.

To test if this has to do with file timestamps, I have done the following modification:

# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version && \
    touch `which python3.11`

In this case, the python3.11 binary is in the generated image, but since it's not just the binary itself that is missing (but essentially most files installed via apt), the image is completely non-functional:

docker run --rm -ti <my-repo>:test-tag
root@92196457ce8a:/# python3.11 --version
python3.11: error while loading shared libraries: libexpat.so.1: cannot open shared object file: No such file or directory

Note that I have tried various alternatives, both with and without --cache and with different --snapshot-mode values.
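
For reference, the kind of variations tried looks roughly like this (a sketch only; the destination tag is a placeholder, and --snapshot-mode accepts full, redo, and time):

# caching disabled
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --cache=false --snapshot-mode=full

# redo snapshotter
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --cache=true --snapshot-mode=redo

# time-based snapshotter
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --cache=true --snapshot-mode=time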

Expected behavior

All files are stored in the generated image.

If I build the image using the Dockerfile above via docker buildx build the image works as expected:

docker run --rm -ti <my-repo>:test-tag
root@93076a150249:/# which python3.11
/usr/bin/python3.11
root@93076a150249:/# python3.11 --version
Python 3.11.0rc1

To Reproduce
Steps to reproduce the behavior:

  1. Use Dockerfile above
  2. Build with kaniko using the command above
  3. Launch the image and run python3.11; observe the failure (python missing or incorrectly installed). A consolidated repro script is sketched below.
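
The steps above as a single sketch (assumes the Dockerfile is saved under /workspace and <my-repo> is a registry you can push to and pull from):

# inside the kaniko executor container
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --snapshot-mode=full --cache=true

# on a host with docker
docker run --rm <my-repo>:test-tag python3.11 --version   # fails: python3.11 not found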

Additional Information

Kaniko version: v1.22.0

| Description | Yes/No |
| --- | --- |
| Please check if this is a new feature you are proposing | |
| Please check if the build works in docker but not in kaniko | |
| Please check if this error is seen when you use --cache flag | |
| Please check if your dockerfile is a multistage dockerfile | |
@clemenskol
Author

FYI, I looked at other tickets with a similar problem (e.g., #2336), but either the root-cause described in those tickets is different or the proposed work-around did not work for me.

I have tried many different workarounds, and none of them worked for me (aside from touching every file in the filesystem, which is not an option for me).
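
For illustration only, the "touch every file" workaround would look roughly like this as a final Dockerfile step (a sketch; it rewrites every mtime on the root filesystem, which is exactly why it is not acceptable here):

# hypothetical sketch: force new mtimes so the snapshotter notices the files
RUN find / -xdev -type f -exec touch {} + || true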

@clemenskol
Author

clemenskol commented Apr 19, 2024

Another observation: if I change this to a multi-stage build AND I do more than just use RUN commands, then it sometimes works. If I execute the same build twice in a row, it seems I have about a 50% chance of getting a working container image.

Dockerfile:

# syntax=docker/dockerfile:1
FROM amd64/ubuntu as stage
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11
FROM stage as final
ADD .ignore /.ignore

Build command (note that I'm not pushing to a remote registry just to save round-trip time; pushing to a remote registry has the same outcome):

/kaniko/executor \
    --context /workspace \
    --dockerfile Dockerfile \
    --destination kamiko-test:3 \
    --no-push \
    --tar-path output.tar \
    --target final \
    -v debug

Test command:

docker image rm kamiko-test:3
docker load -i output.tar
docker run --rm -ti kamiko-test:3 bash -c '/usr/bin/python3.11 --version'

In some cases we get this result (installation worked):

Untagged: kamiko-test:3
Deleted: sha256:c42801f9c6b74e0dd7002f9439d0e2675fddc2070665f5646b0303e5e9277a01
Deleted: sha256:58ee2628caa0ebb2dd0b9ee2893bb7f6a3996ed8b41177a209154b270e2952f5
Deleted: sha256:c6a78351595ae2bb76e7284ec47f720e5b7d7e9f66ffab997d24436d143c491d
e2b5084e6f6a: Loading layer [==================================================>]  49.89MB/49.89MB
e74d10928493: Loading layer [==================================================>]     259B/259B
Loaded image: kamiko-test:3
Python 3.11.0rc1

In other cases we get this result (installed files were not committed to the snapshot/image):

Untagged: kamiko-test:3
Deleted: sha256:ac778b382fa91f37cfb3d35e2d56d0a52531fb42082b7e2226e44858b0167f29
Deleted: sha256:a1a681b7fa20e5528304dfe34897ebac67a8f4ff3ecceaf6774445c6fd37fe18
Deleted: sha256:6262b815a55b0dc3bb6679ac18aa94d9aa3fa1074357640627318925a53d05af
e9f9bcb2687e: Loading layer [==================================================>]  6.344MB/6.344MB
f81778963cd0: Loading layer [==================================================>]     252B/252B
Loaded image: kamiko-test:3
bash: line 1: /usr/bin/python3.11: No such file or directory

As said, it's random, with roughly a 50% chance of either outcome. Even more weirdly, it seems to alternate between working and failing, as if a cache were corrupting and then un-corrupting itself (note that in these experiments the cache is off).

I have captured the stdout (build command output) and stderr (kaniko debug-verbosity logs) from a successful and a failing build.
The stdout build command output is essentially identical (aside from the download/timing info from apt).
The stderr kaniko debug output is very different, however, and the expected binary appears in one of the logs but not in the other.
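
The capture itself is roughly the following (a sketch; the log file names are placeholders and the build command is the same as above):

# run the same build twice, keeping stdout and stderr separately
/kaniko/executor --context /workspace --dockerfile Dockerfile --destination kamiko-test:3 \
    --no-push --tar-path output.tar --target final -v debug \
    > run1.stdout.log 2> run1.stderr.log

# then compare the debug logs of a good and a bad run
diff run1.stderr.log run2.stderr.log | less
grep -c python3.11 run1.stderr.log run2.stderr.log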

@anoop142
Contributor

anoop142 commented Jun 3, 2024

Hi @clemenskol , did you find any workaround?

@clemenskol
Author

> Hi @clemenskol, did you find any workaround?

Unfortunately no. We had to move away from kaniko - it was the only "solution" that worked.

@jrevillard

Same issue here, and I'm pretty sure that we are not the only ones having it...

@anoop142, is anybody able to reproduce this on your side?

@anoop142
Contributor

anoop142 commented Jun 6, 2024

> Same issue here, and I'm pretty sure that we are not the only ones having it...
>
> @anoop142, is anybody able to reproduce this on your side?

For me, the basic case that fails is:

# Fails
RUN <<EOF
echo "foo" > /home/foo
EOF

RUN grep foo /home/foo

grep: /home/foo: No such file or directory

While this works

# Works
RUN echo "foo" > /home/foo

RUN grep foo /home/foo

It seems like kaniko is skipping the layer when heredoc (EOF) syntax is used for RUN.
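
A complete minimal Dockerfile for the failing case might look like this (a sketch; the base image is arbitrary):

FROM ubuntu

# heredoc form: the file is written during the build, but the layer appears to be skipped
RUN <<EOF
echo "foo" > /home/foo
EOF

# this step then fails with: grep: /home/foo: No such file or directory
RUN grep foo /home/foo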

@jrevillard

OK. At least that seems not to be the case for @clemenskol.

As far as I'm concerned, I don't use EOF either, but in case it matters, I'm running this inside a GitLab CI job.

Best,
Jérôme

@anoop142
Contributor

@jrevillard you are right, skipping the EOF (heredoc) command is indeed a different issue: #1713.

@aaron-prindle aaron-prindle added issue/missing-files priority/p0 Highest priority. Break user flow. We are actively looking at delivering it. differs-from-docker works-with-docker kind/bug Something isn't working labels Jun 14, 2024
@kakliniew

kakliniew commented Aug 28, 2024

I seem to be encountering the same problem. Only in my case one out of dozens of images is broken. A couple of files from the base image are not available in the final image. It looks as if the last layer (also produced by a RUN command) is not properly snapshotted: it is 2 MB instead of 150 MB. All images use the same base image and run on different machines.
The same Dockerfile and the same source files can produce a wrong image (a replayed GitLab pipeline from the same source).
Images are built with the kaniko-project/executor:v1.21.1-debug docker image in a GitLab pipeline. When an image is broken, its log is missing the part about ignoring sockets (the rest stays the same):

INFO[1181] Taking snapshot of full filesystem...        
INFO[1199] Ignoring socket signalapp.00, not adding to tar 
INFO[1199] Ignoring socket signalapp.01, not adding to tar 
INFO[1199] Ignoring socket signalapp.02, not adding to tar 
...
INFO[1574] Pushing image to ....

I upgraded to the newest kaniko, 1.23.2-debug, and will observe the results.
I don't know what the cause could be, maybe something with the cache, but I don't use any additional flags.

I can't share my Dockerfile and base image, but it's not multistage. This is very difficult to debug, as it happens quite rarely.
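
A quick way to spot such a truncated last layer is to compare layer sizes (a sketch; the image reference is a placeholder, and it assumes docker, crane, and jq are available):

# local image: per-layer sizes
docker history --no-trunc <registry>/<image>:<tag>

# remote image: layer sizes straight from the manifest
crane manifest <registry>/<image>:<tag> | jq '.layers[].size'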

@kakliniew

Unfortunately, the missing files have appeared again in the latest version of kaniko. I will try to add an image test as the next pipeline stage.

@fmoessbauer

> Unfortunately, the missing files have appeared again in the latest version of kaniko.

The biggest issue I see is the lost trust in kaniko. If there is no guarantee that the filesystem is identical to the one produced by buildx or buildah (at least semantically), I simply can't use kaniko. In production, it is almost impossible to check whether all needed files are there or not.
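
One way to regain a bit of confidence is to diff the file lists of a kaniko-built image against a buildx build of the same Dockerfile (a sketch; the image names are placeholders, and the comparison ignores metadata such as timestamps and permissions):

docker run --rm <image-built-with-buildx> sh -c 'find / -xdev | sort' > buildx.files
docker run --rm <image-built-with-kaniko> sh -c 'find / -xdev | sort' > kaniko.files
diff buildx.files kaniko.files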

@kakliniew

kakliniew commented Sep 19, 2024

I tried the --single-snapshot flag, because sometimes the error "error building image: error building stage: failed to take snapshot: archive/tar: write too long" appeared, as described here. Adding the flag didn't help, and it also built me an image with missing files. I added RUN ls /file/location (for the files that were sometimes missing) at the end of the Dockerfile, and out of 300 builds they all look fine (except that there has sometimes been a problem with archive/tar: write too long). I will keep observing.

@Silvanoc
Contributor

I'm trying to reproduce this issue; does anybody have a very simple reproduction setup? The fewer files are written, the easier it will be to diagnose the root cause. The reported one, which installs Python, changes/adds so many files that it is harder to diagnose.

@kakliniew

kakliniew commented Oct 14, 2024

--single-snapshot didn't solve the issue. I even added RUN ls /file/location; it lists the files during the build, but they are not available in the built image. The final broken image is lighter: 1.4 GB instead of 1.6 GB. The files that are missing are present in the source image (FROM source_image). This problem occurs once every few dozen builds, so it is difficult to say why it occurs when it does. Just as randomly, the archive/tar: write too long problem occurs.

I have changed the pipeline so that I now build the image to a 'candidate' tag; the next GitLab stage opens this image and runs the ls /file/location command, and if everything is OK, I use crane cp source destination to push it to the main tag (crane doesn't change the digest). Unfortunately I can't share my Dockerfile.
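
In shell form, the verification flow is roughly this (a sketch; the image variable, tags, and the checked path are placeholders):

# stage 1: build to a candidate tag with kaniko
/kaniko/executor --context "$CI_PROJECT_DIR" --dockerfile Dockerfile --destination "$IMAGE:candidate"

# stage 2: verify the candidate actually contains the expected files
docker run --rm "$IMAGE:candidate" ls /file/location

# stage 3: promote without rebuilding (crane copies the image as-is, so the digest is unchanged)
crane cp "$IMAGE:candidate" "$IMAGE:main"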

@Silvanoc
Contributor

Since I cannot reproduce this issue, I am providing the tooling I use to try to reproduce it. It might help you, but you will probably need to adapt it to your system.

I am using NetBSD mtree to get a "snapshot" of the built rootfs from "inside" the Kaniko build.

This is the Dockerfile of the image to be built (the one used to report this issue + mtree):

# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.12 \
        mtree-netbsd && \
    ls -l `which python3.12` && \
    python3.12 --version && \
    md5sum `which python3.12` > python3.12.md5
COPY ./mtree-excludes /tmp/
RUN \
    mtree \
        -c \
        -x \
        -K md5 > rootfs.built.mtree

This is a script to build the image using Kaniko and then instantiate a container using the built image. It then tries to find out if some files have "disappeared":

#!/usr/bin/env bash

set -eu

TOOL="finch"
IMG="reproduce-kaniko-3123.tar"
CONT_IMG="/workspace/${IMG}"
LOCAL_IMG="${IMG}"

echo ; echo "*********************"
echo "Building the image..." ; echo

"${TOOL}" run \
    -v $PWD:/workspace \
    gcr.io/kaniko-project/executor:latest \
        --dockerfile /workspace/Dockerfile \
        --no-push \
        --context dir:///workspace/ \
        --tar-path "${CONT_IMG}"

echo ; echo "********************"
echo "Loading the image..." ; echo

"${TOOL}" load \
    -i "${LOCAL_IMG}"

echo ; echo "***********************"
echo "Comparing the rootfs..." ; echo

echo "> python3.12 binary checksum as reported by md5sum from Kaniko"
"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
        cat /python3.12.md5

"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
        cat /rootfs.built.mtree > rootfs.built.mtree

echo ; echo "> python3.12 binary checksum as reported by mtree from Kaniko"
grep -A 1 "^    python3.12  " rootfs.built.mtree | head -n 2

"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
        mtree -f /rootfs.built.mtree > rootfs-changes.mtree \
|| true

echo ; echo "****************"
echo "Missing files..." ; echo
grep "^missing: " rootfs-changes.mtree

It runs mtree in the container to find out which changes have happened at the file-system level (including timestamps, permissions, MD5 checksums, ...). On my system, after over 10 runs, I have not been able to detect any unexpected changes (apart from Kaniko files being removed).

Let's see if, with this help, someone can provide some more insight into what is going on...

@Silvanoc
Contributor

Silvanoc commented Nov 8, 2024

Assuming we ever get to properly diagnose this issue, find the root cause, and even write a patch to fix it... will we ever see that fix getting integrated into Kaniko?
