Base image based on Alpine #99

Open
dragospopa420 opened this issue Mar 15, 2023 · 7 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@dragospopa420

dragospopa420 commented Mar 15, 2023

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/core

Feature

The base image is based on Debian, which has a much bigger footprint than Alpine Linux.
So I was thinking the included Dockerfile could be based on Alpine Linux instead, for faster deployment and testing.
The apify/actor-node-puppeteer-chrome image is 2.53 GB, while my version is 698 MB.

Motivation

I'm building an infrastructure of spiders based on Crawlee and I wanted to have the fastest possible deployment time.

Ideal solution or implementation, and any additional constraints

FROM node:current-alpine

# Set workdir
WORKDIR /usr/src/app

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY package*.json ./

# Change rights for package-lock.json
RUN chmod 744 package-lock.json

# Install Chromium and its dependencies; nodejs is also listed here to make sure it is up to date
RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn

# This tells puppeteer to not download chrome again
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY . ./

# Required for Crawlee
ENV CRAWLEE_CHROME_EXECUTABLE_PATH=/usr/bin/chromium-browser
RUN chmod 744 /usr/bin/chromium-browser

# Run the image.
CMD npm start 

Alternative solutions or implementations

No response

Other context

No response

@ivanvs

ivanvs commented Mar 21, 2023

Hi @dragospopa420,

I think the Docker images are not part of this repo. You should probably raise the issue in the apify-actor-docker repo instead; if I understand correctly, that is the repository for Apify's Docker images.

Source code of the image that you are referencing is here: https://github.com/apify/apify-actor-docker/tree/master/node-puppeteer-chrome

B4nan transferred this issue from apify/crawlee Mar 21, 2023
mtrunkat added the t-tooling label Mar 22, 2023
@dragospopa420
Author

Thanks @ivanvs .
Thanks @B4nan for transferring the issue
Thanks @mtrunkat

I've also had some time to test this image and it seems to perform well. I haven't found anything wrong with it.

@mtrunkat
Member

mtrunkat commented Mar 28, 2023

Thanks, @dragospopa420. The image size is currently something we plan to look into.

CC @fnesveda @B4nan, please take a look

@B4nan
Member

B4nan commented Mar 28, 2023

I was asking @vladfrangu to take a closer look last week. IIRC the reason we use Ubuntu was to support Chromium; the rest of the browsers should be fine with Debian?

@dragospopa420
Author

> I was asking @vladfrangu to take a closer look last week. IIRC the reason we use Ubuntu was to support Chromium; the rest of the browsers should be fine with Debian?

This image is using Alpine.
Chromium works fine on Alpine. Deployed it in some production environments already.
From what I see Firefox is in the community repo of Alpine and it works properly.
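
For anyone who wants to try the Firefox part, a minimal sketch follows. It assumes the same node:current-alpine base as the Dockerfile above and that the community repository is enabled in the image, and it installs the distribution's own firefox package. I have not verified it against the Apify images.

```dockerfile
# Assumption: same base image as the Dockerfile proposed in this issue.
FROM node:current-alpine

# Install Firefox from the Alpine package repositories.
# ttf-freefont is the same font package used for Chromium above.
RUN apk add --no-cache \
      firefox \
      ttf-freefont
```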

@vladfrangu
Member

Super sorry for the late response! The main image that (probably) can't use Alpine is WebKit (Safari). It's good to know that Chromium works on Alpine, but does Chrome work on it too? 👀

@fnesveda
Member

I believe the main reason we use Debian in the base images is compatibility with user libraries. Debian uses glibc, while Alpine uses an alternative libc implementation, musl libc, which is not 100% compatible. While musl libc behaves more correctly according to standards, most software is written targeting glibc and all its quirks, and could break when used with musl libc (or would have to be recompiled at least). So I would recommend staying with Debian for these compatibility reasons.

I believe most of the size difference between the image produced by @dragospopa420's Dockerfile and what we have is down to other differences:

  • Chrome is much larger than Chromium
  • for some reason we have multiple installations of Chrome in the image
  • we also have multiple node_modules in the folder, some of which seem unnecessary
  • we have some operations split between multiple steps, which creates multiple layers and each layer's diff adds to the image size (e.g. the chrome.deb download, install and removal should be in the same step ideally)
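
To illustrate the last point, here is a rough sketch of doing the chrome.deb download, install and removal in a single step. The base image tag and the /tmp/chrome.deb path are assumptions for illustration, not what the current Apify Dockerfiles use, and the download URL is the amd64 build only.

```dockerfile
# Assumption: any Debian-based image; this tag is illustrative only.
FROM debian:bookworm-slim

# Download, install and delete chrome.deb in ONE RUN step, so the .deb
# file is never committed to a layer and does not add to the image size.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates wget \
    && wget -q -O /tmp/chrome.deb https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb \
    && apt-get install -y --no-install-recommends /tmp/chrome.deb \
    && rm -f /tmp/chrome.deb \
    && rm -rf /var/lib/apt/lists/*
```

If the download and the removal were separate RUN instructions, the layer created by the download would still contain the file, and deleting it later would not shrink the image.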

You can use the great dive tool to inspect the images layer by layer and see what's taking up most of the size.

6 participants