Running puppeteer and headless chrome on AWS lambda with Serverless

Running headless chrome on AWS lambda is a problem that can be sliced in many ways, and it is well documented. Things become a little more complicated when you introduce puppeteer and the serverless framework. I recently had to go through that exercise and figured out a few solutions and optimizations on my own.

Bundling headless chrome with serverless

If you try to bundle the headless chrome version that comes with puppeteer into your serverless deployment zip, you will soon exceed the 50MB limit AWS sets. The solution comes in the form of an excellent serverless plugin, serverless-plugin-chrome, which packages a reasonably sized dist of headless chrome.

So let’s add it to the plugins section of your serverless.yml:

plugins:
  - serverless-plugin-chrome

At the same time, you want to exclude the chromium dist that comes with puppeteer from your bundle. You can do that in the package section of your serverless.yml. (NOTE: at the time of writing, the latest version of puppeteer keeps chromium at this path. This can change in the future, so be sure to ls and double check.)

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/** # exclude puppeteer chrome if exists

Starting headless chrome and getting puppeteer to connect with it

To launch chrome and connect puppeteer in our project, we’ll use the @serverless-chrome/lambda package. (NOTE: you will need to install this yourself because we import it in our project, even though the aforementioned serverless plugin also bundles it.)

Our objective is to use this package and get the debugger url of the running chrome process.

import * as launchChrome from "@serverless-chrome/lambda";
import * as request from "superagent";

const getChrome = async () => {
  const chrome = await launchChrome();

  const response = await request
    .get(`${chrome.url}/json/version`)
    .set("Content-Type", "application/json");

  const endpoint = response.body.webSocketDebuggerUrl;

  return {
    endpoint,
    instance: chrome
  };
};

export default getChrome;

There we go. We now have a utility which gives us the chrome process endpoint and a reference to the instance. Now let’s connect puppeteer to it in our handler code.

const chrome = await getChrome();

const browser = await puppeteer.connect({
  browserWSEndpoint: chrome.endpoint
});

const page = await browser.newPage(); // and we go...

That’s it. Now puppeteer can connect to the correct chrome dist. Note that you will do this instead of puppeteer.launch().
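The getChrome utility shown earlier passes response.body.webSocketDebuggerUrl straight through. If the response ever lacks that field, the endpoint is undefined and the failure only shows up later, at connect time. A small guard can surface the problem earlier. This is only a sketch: readDebuggerEndpoint and VersionInfo are hypothetical names introduced here, not part of puppeteer or @serverless-chrome/lambda.

```typescript
// Only the field we care about from chrome's /json/version response.
interface VersionInfo {
  webSocketDebuggerUrl?: unknown;
}

// Hypothetical helper: returns the debugger endpoint, or throws a
// descriptive error if the response body does not contain one.
function readDebuggerEndpoint(body: unknown): string {
  const info = (typeof body === "object" && body !== null
    ? body
    : {}) as VersionInfo;
  const url = info.webSocketDebuggerUrl;

  // Accept ws:// and wss:// URLs only; anything else is treated as missing.
  if (typeof url !== "string" || !/^wss?:\/\//.test(url)) {
    throw new Error("chrome did not report a webSocketDebuggerUrl");
  }
  return url;
}
```

Inside getChrome, this would replace the direct property access: const endpoint = readDebuggerEndpoint(response.body);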

Cleaning up after our handler

After we’re done, we have to tear down the chromium process. This is because, in AWS Lambda, stateless doesn’t mean no state. In other words, lambda will not cold boot your handler each and every time it runs; it reuses container instances from previous invocations. Therefore, our handler code should be idempotent. Leaving the process running, I found, leads to inconsistent results across multiple executions.

Our clean-up code will look like this:

await browser.close();
setTimeout(() => chrome.instance.kill(), 0);

The setTimeout over there is a hack to defer killing the chrome process to the end of the execution stack in the javascript thread. This is because in certain *nix environments (I’ve tested this on macOS), killing the chrome instance tends to kill the node process which launched it before the chrome instance itself gets killed. This leaves the chrome instance orphaned and produces inconsistent results in subsequent runs. Therefore, I’d recommend including it to avoid any head-scratching when you or someone else does some local testing.
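If the page work throws, the two clean-up lines above never run. One way to guard against that is to wrap connection, work and teardown in a try/finally. The following is a sketch under assumptions, not the only way to structure a handler: withChrome, BrowserLike and ChromeLike are hypothetical names, and the structural types exist only so the snippet compiles without puppeteer installed.

```typescript
// Minimal structural types standing in for the puppeteer browser and
// the object returned by the getChrome utility.
interface BrowserLike {
  close(): Promise<void>;
}
interface ChromeLike {
  endpoint: string;
  instance: { kill(): unknown };
}

// Hypothetical helper: runs `work` against a connected browser and
// always tears everything down afterwards, even if `work` throws.
async function withChrome<B extends BrowserLike, T>(
  getChrome: () => Promise<ChromeLike>,
  connect: (endpoint: string) => Promise<B>,
  work: (browser: B) => Promise<T>
): Promise<T> {
  const chrome = await getChrome();
  let browser: B | undefined;
  try {
    browser = await connect(chrome.endpoint);
    // `await` here so that `finally` runs only after the work settles.
    return await work(browser);
  } finally {
    if (browser) {
      // Swallow close errors so they do not mask the original failure.
      await browser.close().catch(() => undefined);
    }
    // Same deferred kill as in the clean-up code above.
    setTimeout(() => chrome.instance.kill(), 0);
  }
}
```

In the handler, the connect argument would be (endpoint) => puppeteer.connect({ browserWSEndpoint: endpoint }), and work would receive the browser and call browser.newPage() as before.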

Further considerations for your serverless lambda function

  1. Make sure your serverless function has sufficient memory to execute headless chrome. I recommend 1536MB. If your handler does a lot of work, you may have to try a higher limit. Generally, the higher the limit, the faster the execution.

  2. Introduce a reasonable timeout for your function. In case headless chrome crashes for whatever reason, and puppeteer does not exit, you will get billed for the hanging process.

functions:
  my-headless-chrome-expriment:
    handler: lib/my-headless-chrome-expriment/handler.exec
    timeout: 15
    memorySize: 1536

  3. Make sure you invoke chrome only for the functions that need it, by adding a custom.chrome segment that serverless-plugin-chrome will read.

custom:
  chrome: # chrome plugin only enabled for these functions
    functions:
      - my-headless-chrome-expriment
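The function-level timeout in point 2 is a hard stop: when it fires, your clean-up code does not get a chance to run. A complementary option is an in-handler deadline that rejects a little earlier, so the handler can still tear chrome down. The sketch below assumes nothing beyond standard promises and timers; withTimeout is a hypothetical helper name, not something provided by puppeteer or the serverless framework.

```typescript
// Hypothetical helper: settles with the result of `work`, or rejects
// if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
    work.then(
      (value) => {
        clearTimeout(timer); // work finished first, cancel the deadline
        resolve(value);
      },
      (err) => {
        clearTimeout(timer); // work failed first, pass its error through
        reject(err);
      }
    );
  });
}
```

With the 15 second function timeout above, something like await withTimeout(doPageWork(page), 10000) would leave a few seconds for clean-up. Both doPageWork and the 10000 figure are placeholders for illustration. Note that this only stops waiting on the work; it does not cancel it, so the chrome process still has to be killed afterwards.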

Written on June 10, 2018