October 10 2015, Desenzano
Massimiliano Mantione
Passionate about languages and compilers
Worked on the V8 team at Google
Overall, worked on JIT compilers for more than 7 years
Now working on scalable, fault-tolerant,
web-facing distributed systems
I started as a Javascript hater
When I saw Javascript for the 1st time (in 1995) I vowed that I would never touch such an abomination, even with a ten-foot pole
Eventually, I changed my mind, but...
Back to the actual topic!
We'll see...
What an artifact repository is
(and why we need it)
How docker works internally
How we can do better than that!
And even something relevant to node.js land!
Reliable
Fast
Always successful
Most of all, deterministic
Deploying the same build multiple times
must produce the same result
It means efficient
Should not waste bandwidth (time)
Should not waste disk (space)
When you deploy a build more than once, you are sure to deploy the same files every time
Wait... did we say build?
Usually with node.js you do not need a build step, right?
Javascript is immediately executable...
(we all know that front end code does need a build)
But what about npm install?
Is npm install part of the build?
Or is it part of the deployment?
Let's define build and deployment
Transforming code so that it can be executed
Must happen every time the developer touches the code
(to test it)
Should be followed by a QA process
(to thoroughly test it!)
Transferring the result of a build to other machines
where it will be executed...
...and then actually running that code
QA should already have happened (especially if the deployment is to production servers)
Put a QA phase in the middle
Assume the QA lasts days
(validation on a staging server)
How can you put in production the same build you tested in staging?
How can you put the same build on other nodes if you want dynamic scaling?
You store the result of the build somewhere
Use a source code repository?
Check build artifacts into it? BAD IDEA!
Use an artifact repository?
Feel the need of an artifact repository?
Store your builds in an artifact repository
Transfer them to the production machine...
...and then start them
Most of all,
stick to that!
To perform build steps during the deployment?
It can be non-deterministic
It will be slow
Looks a lot like a key-value store
You store build results into it
You identify them by name@version (a unique key)
You can fetch them when and where you need them
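A rough node.js sketch of that key-value view (the names ArtifactRepo, publish and fetch are made up for illustration; a real repository would persist to disk or S3, not memory):

// Hypothetical key-value view of an artifact repository
function ArtifactRepo() {
  this.store = {} // a real repository would persist to disk or S3
}

// Store a build result under its unique name@version key
ArtifactRepo.prototype.publish = function (name, version, data) {
  this.store[name + '@' + version] = data
}

// Fetch the exact same bytes, when and where you need them
ArtifactRepo.prototype.fetch = function (name, version) {
  return this.store[name + '@' + version]
}

var repo = new ArtifactRepo()
repo.publish('my-service', '1.2.3', Buffer.from('...build output...'))
var artifact = repo.fetch('my-service', '1.2.3')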
If all your artifacts are npm modules,
a private npm repo could be a good artifact repository
Otherwise, make npm install part of the build!
and you still need an artifact repository...
Use npm shrinkwrap
and possibly greenkeeper.io
If npm install is part of the build,
what is the artifact repository?
But what are we using it for?
It brings determinism to deployment
It makes it feasible to deploy a new "VM"
every time you need to start something
The initial state of each container is fully under your control
It is the ultimate artifact repository!
Say goodbye to...
While running containers is blazingly fast,
preparing them is a chore
Say that I must wait for
docker build to complete...
...one more time!
Should I also wait...
...for docker pull?
Running is fast
Deployment is deterministic
Can we fix its build issues?
You put there a mix of...
the initial OS image
every "tech package" you need
(nodejs, nginx, redis...)
your own code
(build artifacts)
eventually, a "state" volume
finally, maybe configuration files
A way to uniquely identify a piece of immutable data
Each piece of data is identified by its cryptographic hash
If the data is composed of other pieces, it is cheaper to hash their hashes instead of their contents!
A list of messages
Each item is: some data, plus a link to the previous item
Each item's hash should be the hash of both
For the list of previous messages,
hashing the hash is enough!
M1
data: "Hi"
prev: null
hash: H1 = H("Hi", null)
M2
data: "Welcome"
prev: M1
hash: H2 = H("Welcome", H1)
M3
data: "Thank you"
prev: M2
hash: H3 = H("Thank you", H2)
H1
data: "Hi"
prev: null
hash: H1 = H("Hi", null)
H2
data: "Welcome"
prev: H1
hash: H2 = H("Welcome", H1)
H3
data: "Thank you"
prev: H2
hash: H3 = H("Thank you", H2)
H1
data: "Hi"
prev: null
hash: H1 = H("Hi", null)
H2a
data: "Wat?"
prev: H1
hash: H2a = H("Wat?", H1)
H3a
data: "Sorry..."
prev: H2a
hash: H3a = H("Sorry...", H2a)
H1 {data: "Hi", prev: null} // Start
H2 {data: "Welcome", prev: H1} // 1s2 list
H3 {data: "Thank you", prev: H2} // ... (1st)
H2a {data: "Wat?", prev: H1} // 2nd list
H3a {data: "Sorry...", prev: H2a} // ... (2nd)
Docker images (and containers) use layered file systems
(AUFS, BTRFS, Device Mapper, Overlayfs, VFS)
Each layer represents the result of one step in the docker build
(which means one line in the docker file)
Docker hashes Dockerfile "RUN" lines!
(this gives us a Merkle tree)
Hopefully those lines give deterministic results
Except when they don't...
Can get large
Can waste space on files deleted by subsequent layers
Do not share space well between images
Most of all, they can have only one parent
This is because they model
changes and not contents
Except that they are DAGs
(Directed Acyclic Graphs)
The Git object store represents trees
Each tree is the content of a git revision
However, trees can share subtrees
This is because they model
contents and not changes
(changes are represented in the revision graph)
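A rough sketch of that content-based hashing over a directory tree (the serialization here is simplified and is not Git's actual object format): identical subtrees get identical hashes, so they can be shared.

var crypto = require('crypto')
var fs = require('fs')
var path = require('path')

function sha1(buffer) {
  return crypto.createHash('sha1').update(buffer).digest('hex')
}

// Hash a directory by hashing its children's hashes:
// two identical subtrees produce the same hash and can be stored once
function hashTree(dir) {
  var entries = fs.readdirSync(dir).sort().map(function (name) {
    var full = path.join(dir, name)
    var hash = fs.statSync(full).isDirectory()
      ? hashTree(full)               // hash of the subtree
      : sha1(fs.readFileSync(full))  // hash of the file contents
    return name + ' ' + hash
  })
  return sha1(Buffer.from(entries.join('\n')))
}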
Is there a way to put this to our advantage?
Docker would be an ideal artifact repository
however
Docker run is fast
Docker build and pull are slow
underlying OS image: weeks
(for security updates)
technology stack: months
(new versions of nodejs, nginx...)
our own artifacts: minutes!
(every time you run the result of a build)
should we handle them all in the same way?
What if we handled our build artifacts with a Git-like content addressable storage distinct from the docker one?
Two distinct artifact repositories
Use docker images for the OS and the "tech stack"
Use a Git-like repository for build artifacts
Mount the build artifact as a read-only volume in a docker image when you deploy it
Create a new branch for every commit
(so you can forget history)
Do shallow clones to create the client repo
(so you don't clone it all)
Pull and checkout at every deploy
Mount the checked out directory as a read only volume into the docker container
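A rough sketch of those steps as a node.js deploy script (the repository URL, branch name, paths and image are made up for the example):

var execSync = require('child_process').execSync

var artifactsRepo = 'git@example.com:acme/build-artifacts.git'
var buildBranch = 'build-1234'               // one branch per build
var checkoutDir = '/srv/artifacts/build-1234'

// Shallow clone: we only need this build, not the whole history
execSync('git clone --depth 1 --branch ' + buildBranch + ' ' +
         artifactsRepo + ' ' + checkoutDir)

// Mount the checked out directory as a read-only volume in the container
execSync('docker run -d -v ' + checkoutDir + ':/app:ro node:4 node /app/server.js')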
We are still wasting space on the servers
The artifact files are both on the local git repo and in every checked out copy
Can we have a repository that
shares every identical
file instance?
A Merkle DAG based artifact store
Checked out files are hard links to files in the local repo
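A minimal sketch of a hard-link based checkout (the repo layout, with blobs addressed by hash under the repo directory, is hypothetical):

var fs = require('fs')
var path = require('path')

// Instead of copying, link the checked out file to the blob in the local repo:
// both paths point to the same inode, so no extra disk space is used
function checkoutFile(repoDir, hash, targetPath) {
  var blobPath = path.join(repoDir, 'blobs', hash)
  fs.linkSync(blobPath, targetPath)
}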
still experimental
Modular backends
Backends implemented: local files, remote S3, and LevelDB
Each backend is about 100 lines of code
It already works!
archive
pull
checkout
copy - trim - gc
list - remove - check
We have seen...
Why we need an artifact repository
How docker works internally
(Merkle trees)
How we can do better than that!
(Merkle DAGs hashing pure content)
And that implementing it is not that hard!
Come and talk to me
or email jobs@hyperfair.com
code, docs and slides are on github
twitter: @M_a_s_s_i, #metascript