
Deterministic Deployments

with Node.js

October 10, 2015, Desenzano

Massimiliano Mantione

About Myself

An enthusiastic software engineer

Passionate about languages and compilers

Worked on the V8 team at Google

Overall, worked on JIT compilers for more than 7 years

Now working on scalable, fault-tolerant,
web-facing distributed systems

JavaScript and me

I started as a JavaScript hater

When I saw JavaScript for the first time (in 1995) I vowed that I would never touch such an abomination, even with a ten-foot pole

Eventually, I changed my mind, but...

Back to the actual topic!

Warning:

Intense trip ahead!

We'll see...

What an artifact repository is
(and why we need it)

How docker works internally

How we can do better than that!

And even something relevant to node.js land!

Deployments

They should be...

Reliable

Fast

Always successful

Reliable

Most of all, deterministic

Deploying the same build multiple times
must produce the same result

Fast

It means efficient

Should not waste bandwidth (time)

Should not waste disk (space)

Determinism means...

When you deploy a build more than once, you are sure to deploy the same files every time

Wait... did we say build?

Do we have a

build process?

Usually with node.js you do not need a build step, right?

JavaScript is immediately executable...

(we all know that front-end code does need a build)

But what about npm install?

Suppose we have a

build process...

Is npm install part of the build?

Or is it part of the deployment?

Let's define build and deployment

Build

Transforming code so that it can be executed

Must happen every time the developer touches the code
(to test it)

Should be followed by a QA process
(to thoroughly test it!)

Deployment

Transferring the result of a build to other machines
where it will be executed...

...and then actually running that code

QA should already have happened (especially if the deployment is to production servers)

Decouple build and deploy

Put a QA phase in the middle

Assume the QA lasts days
(validation on a staging server)

How can you put in production the same build you tested in staging?

How can you put the same build on other nodes if you want dynamic scaling?

You store the result of the build somewhere

Then we need...

An artifact
repository!

Do you...

Use a source code repository?

Check build artifacts
into it?
BAD IDEA!

Use an artifact repository?

Feel the need for an artifact repository?

Deterministic deployments

are easy

Store your builds in an artifact repository

Transfer them to the production machine...

...and then start them

Most of all,
stick to that!

Does it
make sense...

To perform build steps during the deployment?

No, it does not

It can be non-deterministic

It will be slow

Say you understand that you need an artifact repository

Which one?

An artifact repository...

Looks a lot like a key-value store

You store build results into it

You identify them by name@version (a unique key)

You can fetch them when and where you need them
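As a rough Node.js sketch (the on-disk layout and the name@version key scheme below are illustrative assumptions, not a real product):

  // Minimal key-value artifact store sketch (layout is hypothetical)
  const fs = require('fs')
  const path = require('path')

  const REPO_DIR = '/var/artifact-repo'   // assumed storage location

  // store a build result (e.g. a tarball) under its unique key
  function store (name, version, tarball) {
    const key = name + '@' + version
    fs.writeFileSync(path.join(REPO_DIR, key + '.tgz'), tarball)
    return key
  }

  // fetch it when and where you need it
  function fetch (key) {
    return fs.readFileSync(path.join(REPO_DIR, key + '.tgz'))
  }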

In a node-centric world

If all your artifacts are npm modules,
a private npm repo could be a good artifact repository

Otherwise, make npm install part of the build!

and you still need an artifact repository...

You need to make npm install

deterministic

anyway

Use npm shrinkwrap
and possibly greenkeeper.io
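For instance, running npm shrinkwrap after npm install writes an npm-shrinkwrap.json that pins the whole dependency tree; the module name and versions below are made up, but the shape is roughly this:

  {
    "name": "my-service",
    "version": "1.2.0",
    "dependencies": {
      "some-module": {
        "version": "3.10.1",
        "from": "some-module@^3.10.0",
        "resolved": "https://registry.npmjs.org/some-module/-/some-module-3.10.1.tgz"
      }
    }
  }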

If npm install is part of the build,
what is the artifact repository?

What about docker?

We all love it!

But what are we using it for?

Docker is amazing!

It brings determinism to deployment

It makes it feasible to deploy a new "VM"
every time you need to start something

The initial state of each container is fully under your control

It is the ultimate artifact repository!

Say goodbye to...

  • was this package on the production machine?
  • at which version was it?

Yes, it is amazing

HOWEVER

While running containers is blazingly fast,
preparing them is a chore

Say that I must wait for
docker build to complete...

...one more time!

Should I also wait...

...for docker pull?

Docker is "almost there"

Running is fast

Deployment is deterministic

Can we fix its build issues?

Understanding
Docker images

What's in a docker image?

You put there a mix of...

the initial OS image

every "tech package" you need
(nodejs, nginx, redis...)

your own code
(build artifacts)

possibly, a "state" volume

finally, maybe configuration files

A docker image is a

Merkle tree

A way to uniquely identify a piece of immutable data

Each piece of data is identified by its cryptographic hash

If the data is composed of other pieces, it is cheaper to hash their hashes instead of their contents!

Linear example

A list of messages

Each item is:

  • the message data
  • the list of previous messages

Each item's hash should be the hash of both

For the list of previous messages,
hashing the hash is enough!

The actual list

M1
  data: "Hi"
  prev: null
  hash: H1 = H("Hi", null)
M2
  data: "Welcome"
  prev: M1
  hash: H2 = H("Welcome", H1)
M3
  data: "Thank you"
  prev: M2
  hash: H3 = H("Thank you", H2)
  

Use hashes as identifiers

H1
  data: "Hi"
  prev: null
  hash: H1 = H("Hi", null)
H2
  data: "Welcome"
  prev: H1
  hash: H2 = H("Welcome", H1)
H3
  data: "Thank you"
  prev: H2
  hash: H3 = H("Thank you", H2)
  
  

An alternative list

H1
  data: "Hi"
  prev: null
  hash: H1 = H("Hi", null)
H2a
  data: "Wat?"
  prev: H1
  hash: H2a = H("Wat?", H1)
H3a
  data: "Sorry..."
  prev: H2a
  hash: H3a = H("Sorry...", H2a)
  

Let's merge them...

H1 {data: "Hi", prev: null}            // Start
  H2 {data: "Welcome", prev: H1}       // 1st list
    H3 {data: "Thank you", prev: H2}   // ... (1st)
  H2a {data: "Wat?", prev: H1}         // 2nd list
    H3a {data: "Sorry...", prev: H2a}  // ... (2nd)

It's a tree!
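A minimal Node.js sketch of the hashing rule used above (sha256 and the string encoding are arbitrary choices here); both chains share H1, because items are identified purely by their content:

  const crypto = require('crypto')

  // hash the message data together with the *hash* of the previous item
  function itemHash (data, prevHash) {
    return crypto.createHash('sha256')
      .update(data)
      .update(prevHash || '')
      .digest('hex')
  }

  const H1  = itemHash('Hi', null)
  const H2  = itemHash('Welcome', H1)     // 1st list
  const H3  = itemHash('Thank you', H2)
  const H2a = itemHash('Wat?', H1)        // 2nd list, sharing H1
  const H3a = itemHash('Sorry...', H2a)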

This is what

docker does

Docker images (and containers) use layered file systems
(AUFS, BTRFS, Device Mapper, Overlayfs, VFS)

Each layer represents the result of one step in the docker build
(which means one line in the Dockerfile)

Docker hashes Dockerfile "RUN" lines!
(this gives us a Merkle tree)

Hopefully those lines give deterministic results

Except when they don't...
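As a toy model (not Docker's actual cache-key computation), each layer id can be seen as the hash of its parent's id plus the Dockerfile line that produced it:

  const crypto = require('crypto')

  // toy model: a layer id depends on the parent id and the build step,
  // not on the content that the step actually produced
  function layerId (parentId, dockerfileLine) {
    return crypto.createHash('sha256')
      .update(parentId || '')
      .update(dockerfileLine)
      .digest('hex')
  }

  const base = layerId(null, 'FROM ubuntu:14.04')
  const node = layerId(base, 'RUN apt-get install -y nodejs')
  const app  = layerId(node, 'RUN npm install --production')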

File system layers...

Can get large

Can waste space on files deleted by subsequent layers

Do not share space well between images

Most of all, they can have only one parent

This is because they model
changes and not contents

Git also uses

Merkle trees

Except that they are DAGs
(Directed Acyclic Graphs)

The Git object store represents trees

Each tree is the content of a git revision

However, trees can share subtrees

This is because they model
contents and not changes

(changes are represented in the revision graph)

The advantage of hashing content...

...is that you can share more content!

Is there a way to put this to our advantage?

Let's recap...

Docker would be an ideal artifact repository

however

Docker run is fast

Docker build and pull are slow

Could we do
push and pull
less often?

Frequency of change of deployed contents

underlying OS image: weeks
(for security updates)

technology stack: months
(new versions of nodejs, nginx...)

our own artifacts: minutes!
(every time you run the result of a build)

should we handle them all in the same way?

What if we handled our build artifacts with a Git-like content-addressable store, distinct from the Docker one?

A different approach

Two distinct artifact repositories

Use docker images for the OS and the "tech stack"

Use a Git-like repository for build artifacts

Mount the build artifacts as a read-only volume when you deploy the Docker image

Using git as

artifact repository

Create a new branch for every commit
(so you can forget history)

Do shallow clones to create the client repo
(so you don't clone it all)

Pull and checkout at every deploy

Mount the checked-out directory as a read-only volume into the Docker container
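Put together, a deploy script could look roughly like this (repository URL, branch and image names are placeholders):

  // rough sketch of the deploy steps above (all names are placeholders)
  const execSync = require('child_process').execSync
  const run = function (cmd) { execSync(cmd, { stdio: 'inherit' }) }

  // first deploy: shallow clone of the branch created for this build
  run('git clone --depth 1 --branch build-1234 git@example.com:artifacts.git /srv/artifacts')

  // later deploys: fetch and check out the wanted build instead
  // run('git -C /srv/artifacts fetch --depth 1 origin build-1234')
  // run('git -C /srv/artifacts checkout FETCH_HEAD')

  // mount the checked-out directory as a read-only volume in the container
  run('docker run -d -v /srv/artifacts:/app:ro my-stack-image node /app/server.js')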

Let's go deeper!

We are still wasting space on the servers

The artifact files are both in the local git repo and in every checked-out copy

Can we have a repository that
shares every identical
file instance?

ARTS

(ARTifact Store)

A Merkle-DAG-based artifact store

Checked-out files are hard links to files in the local repo
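The idea in Node.js terms (paths and layout here are illustrative, not ARTs' actual format): each file lives once in a content-addressed store, and a checkout just hard-links it into place.

  const crypto = require('crypto')
  const fs = require('fs')
  const path = require('path')

  const STORE = '/var/arts/objects'   // illustrative local repo location

  // store a file once, keyed by the hash of its content
  function putBlob (content) {
    const hash = crypto.createHash('sha256').update(content).digest('hex')
    const blobPath = path.join(STORE, hash)
    if (!fs.existsSync(blobPath)) fs.writeFileSync(blobPath, content)
    return hash
  }

  // "checkout": a hard link, so every identical file shares one copy on disk
  function checkoutFile (hash, destination) {
    fs.linkSync(path.join(STORE, hash), destination)
  }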

ARTs implementation

still experimental

Modular backends

Backends implemented so far: local files, remote S3, and LevelDB

Each backend is about 100 lines of code

It already works!

ARTs commands

archive

pull

checkout

copy - trim - gc

list - remove - check

Takeaway

We have seen...

Why we need an artifact repository

How docker works internally
(Merkle trees)

How we can do better than that!
(Merkle DAGs hashing pure content)

And that implementing it is not that hard!

BTW

We are hiring!

Come and talk to me
or email jobs@hyperfair.com

That's All, Folks

code, docs and slides are on github

twitter: @M_a_s_s_i, #metascript

Thanks for following!