Runtime All-Hands June 2017 Summary

All of Mozilla met in San Francisco last week for a work week. Unlike the last few All-Hands, we spent the week mostly informally rather than in meetings, hacking in rooms together on near-term work.

The Runtime engineering team focused on landing patches for the Quantum Flow, Quantum DOM and Quantum Networking efforts. We had exciting changes related to Speedometer v2, both in improving how we measure and in landing key patches. The Security Engineering team invited the Tor Project to join us for a deep dive into the Android version of the browser (OrFox, based on Fennec). The rest of the Runtime team was landing patches, reconnecting with colleagues across the org, and making exciting, measurable progress toward a great launch of Firefox 57.

I asked several team leads to send me their highlights from the week; I’ve summarized them below. If I missed something that was important to you, please get in touch.

Project Quantum highlights

“Watching my laptop race HTTP network queries against the disk cache and seeing that it was choosing the right transactions to have the network actually be faster.” -Patrick McManus

  • QF team fixed 26 Quantum Flow bugs since last Friday, June 23
  • Landed budget-based background tab throttling (preffed off; we’re going to do a pref experiment for rollout) (meta bug)
  • Joel Maher and his “army of automation” have helped correct Speedometer reporting.
  • Got a bunch of people from different teams in a room and figured out the easiest/best architecture for supporting the moz-page-thumbs protocol in e10s (i.e., the protocol that supports everything you see when you open a new tab). Same for nsITraceableListener support, which is a must for 57: it’s needed to support the NoScript add-on.
  • Incremental table sweeping bug fixes landed that should reduce GC pause times.
  • The bytecode cache landed and is on for 5% of the Nightly population; this project was in progress for more than a year.
  • We now have a name for almost every runnable in Firefox.

Security/Privacy Highlights

“At Mozilla all hands this week. They are excited to work with us.” –Mike Perry, Tor Project

  • Tor Browser for Android was updated during the workweek to be based on Firefox 52 (from 45). The update is in QA now.
  • Patch written (and being rewritten) for constant blinding in the JIT.
  • A patch for integrating Tor into Focus was hacked up for discussion.
  • Got the TLS Canary (tool for testing changes to our crypto stack on Alexa-top-100 websites) running in TaskCluster.
  • Had first successful use of OneCRL administrative workflow

Other Runtime Highlights

“The culture of focusing on performance is in effect! Performance was a big part of every discussion and review.” -Andrew Overholt

  • “Making my first interoperable handshake and encrypted data for Mozilla’s IETF QUIC.” -Patrick McManus
  • JavaScript classes are done and fully optimized.
  • The GeckoView example is now being tested in automation.
  • Added security certificate information to GeckoView for use in PWA and Custom Tabs.
  • Taught a bunch of people how to profile at the two Quantum Flow profiler office hours sessions.

Thanks everyone for a productive week!

TaskCluster 2016Q2 Retrospective

The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system, and look ahead with experiments that might enable fully automated VM deployment on hardware in the future.

We also brought on five interns. For a team of eight engineers and one manager, this was a tremendous accomplishment. We are working closely with interns on the Engineering Productivity and Release Engineering teams as well, resulting in much more communication than in months past.

We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.

We also created try pushes for Mac OS X tests, and integrated them with the Mac OS X cross-compiled builds. This also meant deep diving into the cross-compiled builds to green them up in Q3 after some compiler changes.

A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.

This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.

Much of the work on TaskCluster is like building a 747 in-flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into their own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (intree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapt and mature.

I can’t wait to see what this team accomplishes in Q3!

Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or send an email to our tools-taskcluster mailing list on lists.mozilla.org with questions or comments!

Things we did this quarter:

  • initial investigation and timing data around using sccache for linux builds
  • released update for sccache to allow working in a more modern python environment
  • created taskcluster managed s3 buckets with appropriate policies
  • tested linux builds with patched version of sccache
  • tested docker-worker on packet.net for on hardware testing
  • worked with jmaher on talos testing with docker-worker on releng hardware
  • created livelog plugin for taskcluster-worker (just requires tests now)
  • added reclaim logic to taskcluster-worker
  • converted gecko and gaia in-tree tasks to use new v2 treeherder routes
  • Updated gaia-taskcluster to allow github repos to use new taskcluster-treeherder reporting
  • move docs, schemas, references to https
  • refactor documentation site into tutorial / manual / reference
  • add READMEs to reference docs
  • switch from a * certificate to a SAN certificate for taskcluster.net
  • increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration
  • use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
  • allow non-employees to login with Okta, improve authentication experience
  • named temporary credentials
  • use npm shrinkwrap everywhere
  • enable coalescing
  • reduce the artifact retention time for try jobs (to reduce S3 usage)
  • support retriggering via the treeherder API
  • document azure-entities
  • start using queue dependencies (big-graph-scheduler); a minimal sketch appears after this list
  • worked with NSS team to have tasks scheduled and displayed within treeherder
  • Improve information within docker-worker live logs to include environment information (ip address, instance type, etc)
  • added hg fingerprint verification to decision task
  • Responded and deployed patches to security incidents discovered in q2
  • taskcluster-stats-collector running with signalfx
  • most major services using signalfx and sentry via new monitoring library taskcluster-lib-monitor
  • Experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine
  • QEMU/KVM engine for taskcluster-worker
  • Implemented Task Group Inspector
  • Organized efforts around front-end tooling
  • Re-wrote and generalized the build process for taskcluster-tools and future front-end sites
  • Created the Migration Dashboard
  • Organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site
  • First Windows tasks in production – NSS builds running on Windows 2012 R2
  • Windows Firefox desktop builds running in production (currently shown on staging treeherder)
  • new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)
  • many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs
  • CI cleanup https://travis-ci.org/taskcluster
  • support for relative definitions in jsonschema2go
  • schema/references cleanup
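
One item above, “start using queue dependencies (big-graph-scheduler)”, refers to tasks declaring their upstream tasks directly on the task definition instead of going through the old scheduler. As a rough sketch (the dependencies and requires fields follow the queue’s task definition as I understand it; the taskId, worker type and everything else here are made up for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of a queue task definition, showing only the fields
// relevant to queue dependencies. Field names follow the TaskCluster queue
// API; the concrete values below are invented for this example.
type taskDef struct {
	ProvisionerID string   `json:"provisionerId"`
	WorkerType    string   `json:"workerType"`
	Dependencies  []string `json:"dependencies"` // taskIds this task waits on
	Requires      string   `json:"requires"`     // "all-completed" or "all-resolved"
	Metadata      struct {
		Name string `json:"name"`
	} `json:"metadata"`
}

func main() {
	// A test task that should only run once its build task has completed.
	test := taskDef{
		ProvisionerID: "aws-provisioner-v1",
		WorkerType:    "desktop-test",                 // illustrative worker type
		Dependencies:  []string{"exampleBuildTaskId"}, // made-up upstream taskId
		Requires:      "all-completed",                // only run if the build succeeded
	}
	test.Metadata.Name = "example test task"

	out, _ := json.MarshalIndent(test, "", "  ")
	fmt.Println(string(out))
}
```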

Paying down technical debt

  • Fixed numerous issues/requests within mozilla-taskcluster
  • properly schedule and retrigger tasks using new task dependency system
  • add more supported repositories
  • Align job state between treeherder and taskcluster better (i.e. cancels)
  • Add support for additional platform collection labels (pgo/asan/etc)
  • fixed retriggering of github tasks in treeherder
  • Reduced space usage on workers using docker-worker by removing temporary images
  • fixed issues with the gaia decision task that had prevented it from running since March 30th.
  • Improved robustness of image creation
  • Fixed all linter issues for taskcluster-queue
  • finished rolling out shrinkwrap to all of our services
  • began trial of having travis publish our libraries (rolled out to 2 libraries now; talking to npm to fix a bug for a 3rd)
  • turned on greenkeeper everywhere then turned it off again for the most part (it doesn’t work with shrinkwrap, etc)
  • “modernized” (newer node, lib-loader, newest config, directory structure, etc) most of our major services
  • fix a lot of subtle background bugs in tc-gh and improve logging
  • shared eslint and babel configs created and used in most services/libraries
  • instrumented taskcluster-queue with statistics and error reporting
  • fixed issue where task dependency resolver would hang
  • Improved error message rendering on taskcluster-tools
  • Web notifications for one-click-loaner UI on taskcluster-tools
  • Migrated stateless-dns server from tutum.co to docker cloud
  • Moved provisioner off azure storage development account
  • Moved our npm package to a single npm organization

TaskCluster Platform Team: Q1 retrospective

TaskCluster Platform team did a lot of foundational work in Q1, to set the stage for some aggressive goals in Q2 around landing new OS support and migrating as fast as we can out of Buildbot.

The two big categories of work we had were “Moving Forward” — things that move TaskCluster forward in terms of developing our team and adding cool features, and “Paying debt” — upgrading infra, improving security, cleaning up code, improving existing interfaces and spinning out code into separate libraries where we can.

As you’ll see, there’s quite a lot of maintenance that goes into our services at this point. There’s probably some overlap of features in the “paying debt” section. Despite a little bit of fuzziness in the definitions, I think this is an interesting way to examine our work, and a way for us to prioritize features that eliminate certain classes of unpleasant debt-paying work. I’m planning to do a similar retrospective for Q2 in July.

I’m quite proud of the foundational work we did on taskcluster-worker, and it’s already paying off in rapid progress with OS X support on hardware in Q2. We’re making fairly good progress on Windows in AWS as well, but we had to pay down years of technical debt around Windows configuration to get our builds running in TaskCluster. Making a choice on our monitoring systems was also a huge win, paying off in much better dashboarding and attention to metrics across services. We’re also excited to have shipped the “Big Graph Scheduler”, which enables cross-graph dependencies and arbitrarily large task graphs (previous graphs were limited to about 1300 tasks). Our team also grew by 2 people – we added Dustin Mitchell, who will continue to do all kinds of work around our systems, focus on security-related issues and will ship a new intree configuration in Q2, and Eli Perelman, who will focus on front end concerns.

The TaskCluster Platform team put the following list together at the start of Q2.

Moving forward:

  • Kicked off and made excellent progress on the taskcluster-worker, a new worker with more robust abstractions and our path forward for worker support on hardware and AWS (the OS X worker implementation currently in testing uses this)
  • Shipped task.dependencies in the queue and will be shipping the rest of the “big graph scheduler” changes just in time to support some massive release promotion graphs
  • Deployed the first sketch for monitoring dashboard
  • Shipped login v3 (welcome, dustin!)
  • Rewrote and tested a new method for mirroring data between AWS regions (cloud-mirror)
  • Researched a monitoring solution and made a plan for Q2 rollout of signalFX
  • Prototyped and deployed aggregation service: statsum (and client for node.js)
  • Contributed to upstream open source tools and libraries in golang and node ecosystem
  • Brought bstack and rthijssen up to speed, brought Dustin onboard!
  • Working with both GSoC and Outreachy, and Mozilla’s University recruiting to bring five interns into our team in Q2/Q3

Paying debt:

  • Shipped better error messages related to schema violations
  • Rolled out formalization of error messages: {code: “…”, message: “…”, details: {…}}
  • Sentry integration — you see a 5xx error with an incidentId, we see it too!
  • Automatic creation of sentry projects, and rotation of credentials
  • go-got — simple HTTP client for go with automatic retries
  • queue.listArtifacts now takes a continuationToken for paging (a rough sketch of the paging loop follows this list)
  • queue.listTaskGroup refactored for correctness (also returns more information)
  • Pre-compilation of queue, index and aws-provisioner with babel-compile (no longer using babel-node)
  • One-click loaners, (related work by armenzg and jmaher to make loaners awesome: instructions + special start mode)
  • Various UI improvements to tools.taskcluster.net (react.js upgrade, favicons, auth tools, login-flow, status, previous taskIds, more)
  • Upgrade libraries for taskcluster-index (new config loader, component loader)
  • Fixed stateless-dns case-sensitivity (livelogs work with DNS resolvers from German ISPs too)
  • Further greening of travis for our repositories
  • Better error messages for insufficient scope errors
  • Upgraded heroku stack for events.taskcluster.net (pulse -> websocket bridge)
  • Various fixes to automatic retries in go code (httpbackoff, proxy in docker-worker, taskcluster-client-go)
  • Moved towards shrinkwrapping all of the node services (integrity checks for packages)
  • Added worker level timestamps to task logs
  • Added metrics for docker/task image download and load times
  • Added artifact expiration error handling and saner default values in docker-worker
  • Made a version jump from docker 1.6 to 1.10 in production (included version upgrades of packages and kernel, refactoring of some existing logic)
  • Improved taskcluster and treeherder integration (retrigger errors, prep for offloading resultset creation to TH)
  • Rolling out temp credential support in docker-worker
  • Added mach support for downloading task image for local development
  • Client support for temp credentials in go and java client
  • JSON schema cleanups
  • CI cleanup (all green) and turning off circle CI
  • Enhancements to jsonschema2go
  • Windows build work by rob and pete for getting windows builds migrated off Buildbot
  • Added stability levels to APIs
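
To make the queue.listArtifacts continuationToken item above concrete, here is a rough sketch of the paging loop against the queue’s REST endpoint. This is not code from our services: the URL, query parameters and response fields are written from memory of the queue’s pagination convention, so treat them as assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// Subset of the listArtifacts response we care about; field names assumed
// from the queue's usual pagination pattern.
type artifactsPage struct {
	Artifacts []struct {
		Name        string `json:"name"`
		ContentType string `json:"contentType"`
	} `json:"artifacts"`
	ContinuationToken string `json:"continuationToken,omitempty"`
}

// listAllArtifacts pages through every artifact for one run of a task,
// passing the continuation token back until none is returned.
func listAllArtifacts(taskID string, runID int) ([]string, error) {
	base := fmt.Sprintf("https://queue.taskcluster.net/v1/task/%s/runs/%d/artifacts", taskID, runID)
	var names []string
	token := ""
	for {
		q := url.Values{"limit": {"100"}}
		if token != "" {
			q.Set("continuationToken", token)
		}
		resp, err := http.Get(base + "?" + q.Encode())
		if err != nil {
			return nil, err
		}
		var page artifactsPage
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		for _, a := range page.Artifacts {
			names = append(names, a.Name)
		}
		if page.ContinuationToken == "" {
			return names, nil // no token means this was the last page
		}
		token = page.ContinuationToken
	}
}

func main() {
	names, err := listAllArtifacts("exampleTaskId", 0) // made-up taskId
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```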

[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop starts when the worker is invoked. It spins up a task claimer and manager responsible for claiming tasks up to its available capacity and running them to completion. You can find details in this commit. We’re still working on some high-level documentation.
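
To make the shape of that loop concrete, here is a heavily simplified sketch in Go (the language tc-worker is written in). The types and helper functions are invented for illustration; the real claimer and manager live in the commit referenced above.

```go
package main

import (
	"log"
	"time"
)

// Task and the helpers below are placeholders standing in for the real
// queue client and engine; only the loop structure matters here.
type Task struct{ ID string }

func claimTasks(n int) []Task { /* ask the queue for up to n tasks */ return nil }
func runToCompletion(t Task)  { /* run the task via its engine and plugins */ }

// taskLoop claims tasks up to the worker's available capacity and runs each
// claimed task to completion in its own goroutine.
func taskLoop(capacity int, stop <-chan struct{}) {
	slots := make(chan struct{}, capacity) // available capacity as a semaphore
	for {
		select {
		case <-stop:
			return
		default:
		}
		slots <- struct{}{} // block until a slot is free
		claimed := claimTasks(1)
		if len(claimed) == 0 {
			<-slots // nothing to run; give the slot back and wait a bit
			time.Sleep(5 * time.Second)
			continue
		}
		go func(t Task) {
			defer func() { <-slots }() // free the slot when the task finishes
			log.Printf("running task %s", t.ID)
			runToCompletion(t)
		}(claimed[0])
	}
}

func main() {
	stop := make(chan struct{})
	go taskLoop(4, stop) // e.g. a capacity of 4 concurrent tasks
	time.Sleep(time.Minute)
	close(stop)
}
```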

We did some cleanups to make it easier to download the worker and get started with builds. We fixed up the packages related to generating Go types from JSON schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface, which allows attachment and detachment of web-hooks to an internet-exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.
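
For readers who haven’t looked at the package, this is roughly the shape of that interface; the method name and signature here are illustrative, not copied from PR 37.

```go
package webhookserver

import "net/http"

// WebHookServer exposes locally registered handlers through an
// internet-facing server, so that features like livelog and interactive
// sessions can be reached from outside the worker.
// Note: illustrative sketch only, not the actual interface from PR 37.
type WebHookServer interface {
	// AttachHook publishes handler at a fresh public URL and returns that
	// URL along with a detach function that removes the hook again.
	AttachHook(handler http.Handler) (url string, detach func())
}
```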

engine: hello, world

Greg created a proof of concept and pushed a successful task that emits a hello, world artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.

plugin: artifact uploads

This plugin will support artifact uploads to S3 for all engines and is based on generic-worker code. This work was started in PR 55.

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code review, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main theme is to add OS X engine support to taskcluster-worker, continue work on refactoring intree config and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason why we’re investing so much time in the worker is two fold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.

[berlin] TaskCluster Platform: A Year of Development

Back in September, the TaskCluster Platform team held a workweek in Berlin to discuss upcoming feature development, focus on platform stability and monitoring and plan for the coming quarter’s work related to Release Engineering and supporting Firefox Release. These posts are documenting the many discussions we had there.

Jonas kicked off our workweek with a brief look back on the previous year of development.

Prototype to Production

In the last year, TaskCluster went from an idea with a few tasks running to running all of FirefoxOS aka B2G continuous integration, which is about 40 tasks per minute in the current environment.

Architecture-wise, not a lot of major changes were made. We went from CloudAMQP to Pulse (in-house RabbitMQ). And shortly, Pulse itself will be moving its backend to CloudAMQP! We introduced task statuses, and then simplified them.

On the implementation side, however, a lot changed. We added many features and addressed a ton of docker worker bugs. We killed Postgres and added Azure Table Storage. We rewrote the provisioner almost entirely, and moved to ES6. We learned a lot about babel-node.

We introduced the first alternative to the Docker worker, the Generic worker. For the first time, we had Release Engineering create a worker: the Buildbot Bridge.

We have several new users of TaskCluster! Brian Anderson from Rust created a system for testing all Cargo packages for breakage against release versions. We’ve had a number of external contributors create builds for FirefoxOS devices. We’ve had a few Github-based projects jump on taskcluster-github.

Features that go beyond BuildBot

One of the goals of creating TaskCluster was to not just get feature parity, but go beyond and support exciting, transformative features to make developer use of the CI system easier and fun.

Some of the features include:

Features coming in the near future to support Release

Release is a special use case that we need to support in order to take on Firefox production workload. The focus of development work in Q4 and beyond includes:

  • Secrets handling to support Release and ops workflows. In Q4, we should see secrets.taskcluster.net go into production and UI for roles-based management.
  • Scheduling support for coalescing, SETA and cache locality. In Q4, we’re focusing on an external data solution to support coalescing and SETA.
  • Private data hosting. In Q4, we’ll be using a roles-based solution to support these.

TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS), Mac OS X cross-compiled builds as Tier2 CI builds
    • Completed and documented prototype Windows 2012 builds in AWS, along with their task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in the TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, intree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others’ use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely with him to get him started on a new project, hooks.taskcluster.net. To take on these two bits of work from RelEng, I pushed TaskCluster’s roadmap for generic-worker features back a quarter, and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.

Publications

Our team published a few blog posts and videos this quarter:

TaskCluster migration: about the Buildbot Bridge

Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.

I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues; TaskCluster uses RabbitMQ and Azure). We will also be moving “decision tasks” in-tree, meaning that they will be closer to developer environments, which should make it easier to keep developer and build system environments in sync.

Here are my notes:

Why have the bridge?

  • Allows a graceful transition
  • We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster
  • It’s not practical and sometimes just not possible to move everything at the same time. This lets us reimplement Buildbot schedulers as task graphs. Buildbot builds are tasks on the task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.
  • One of the driving forces is the build promotion project (funsize, anti-virus scanning and binary moving): this is going to be implemented in TaskCluster tasks, but the rest will be in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

There are three services:

  • TC Listener: responds to things happening in TC
  • BuildBot Listener: responds to BB events
  • Reflector: takes care of things that can’t be done in response to events — it reclaims tasks periodically, for example. TC expects claimed Tasks to be reclaimed regularly; if a Task stops being reclaimed, TC considers that Task dead.

BBB has a small database that associates build requests with TC taskids and runids.
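
A sketch of the kind of record that database holds (the field names and types are guesses for illustration, not the bridge’s actual schema):

```go
package bbb

// buildRequestMapping is the association the bridge keeps for every task it
// manages: which Buildbot BuildRequest corresponds to which TaskCluster task
// and run. Illustrative only; not the real schema.
type buildRequestMapping struct {
	BuildRequestID int64  // Buildbot build request id in SchedulerDB
	TaskID         string // TaskCluster taskId
	RunID          int    // TaskCluster runId, bumped when a task is rerun
}
```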

BBB is designed to be multihomed. It is currently deployed but not running on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.

The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).

Taskcluster Listener

Reacts to events coming from TC Pulse exchanges.

Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into the BB SchedulerDB and pending jobs appear in BB. BBB cancels BuildRequests as well; this can happen because of timeouts or someone explicitly cancelling in TC.
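
Sketching that task-pending path in Go: a consumer bound to the queue’s task-pending exchange creates a BuildRequest for each message. The Pulse URL, exchange name and message handling below are assumptions for illustration, and the SchedulerDB insert is stubbed out.

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

// insertBuildRequest would write a row into Buildbot's SchedulerDB; stubbed
// here since that schema belongs to Buildbot, not to this sketch.
func insertBuildRequest(taskID string) error {
	log.Printf("would insert BuildRequest for task %s", taskID)
	return nil
}

func main() {
	// Connect to Pulse (URL and credentials are placeholders).
	conn, err := amqp.Dial("amqps://user:pass@pulse.mozilla.org:5671/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// Bind a private queue to the queue's task-pending exchange.
	q, err := ch.QueueDeclare("", false, true, true, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := ch.QueueBind(q.Name, "#", "exchange/taskcluster-queue/v1/task-pending", false, nil); err != nil {
		log.Fatal(err)
	}

	msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// For every pending task, create a corresponding Buildbot BuildRequest.
	for m := range msgs {
		taskID := string(m.Body) // real messages are JSON; parsing elided here
		if err := insertBuildRequest(taskID); err != nil {
			log.Println("insert failed:", err)
		}
	}
}
```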

Buildbot Listener

Responds to events coming from the BB Pulse exchanges.

Claims a Task when its build starts. Attaches BuildBot Properties to the Task as artifacts, along with the buildslave name and other information/metadata. It resolves those Tasks.

Buildbot and TC don’t have a 1:1 mapping between BB statuses and TC resolutions. These also need to be coordinated with Treeherder colors. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

  • Runs on a timer – every 60 seconds (a minimal sketch of this loop follows the list)
  • Reclaims tasks: this needs to happen every 30-60 minutes
  • Cancels Tasks when a BuildRequest is cancelled on the BB side (we have to trawl through the BB DB to detect this state if it is cancelled on the Buildbot side)
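
Here is that timer loop as a minimal sketch; the helper functions are stand-ins for the bridge’s real queue client and SchedulerDB access.

```go
package main

import (
	"log"
	"time"
)

// Stand-ins for the bridge's real queue client and SchedulerDB access;
// only the periodic structure of the Reflector is shown here.
func reclaimRunningTasks()        { log.Println("reclaiming running tasks") }
func cancelTasksForDeadRequests() { log.Println("checking BB DB for cancelled BuildRequests") }

func main() {
	// The Reflector wakes up every 60 seconds and handles the work that
	// cannot be done purely in response to Pulse events.
	tick := time.NewTicker(60 * time.Second)
	defer tick.Stop()
	for range tick.C {
		reclaimRunningTasks()        // keeps claims alive (every 30-60 minutes per task)
		cancelTasksForDeadRequests() // cancels Tasks whose BuildRequests were cancelled in BB
	}
}
```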

Scenarios

  • A successful build!

A Task is created. The Task in TC is pending, and nothing exists yet in BB. The TCListener picks up the event and creates a BuildRequest (pending).

BB creates a Build. BBListener receives buildstarted event, claims the Task.

Reflector reclaims the Task while the Build is running.

Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.

  • Build fails initially, succeeds upon retry

(500 from hg – common reason to retry)

Same through Reflector.

The BB build fails and is marked as RETRY. The BBListener receives the log uploaded event, reports an exception to TaskCluster and calls rerun on the Task.

BB has already started a new Build. The TCListener receives the task-pending event, updates the runid and does not create a new BuildRequest.

The Build completes successfully. The Buildbot Listener receives the log uploaded event and reports success to TaskCluster.

  • Task exceeds deadline before Build starts

The Task is created. The TCListener receives the task-pending event and creates a BuildRequest. Nothing happens; the Task goes past its deadline and TaskCluster cancels it. The TCListener receives the task-exception event and cancels the BuildRequest through Self-serve.

QUESTIONS:

  • TC deadline, what is it? In the Queue, a task past its deadline is marked as timeout/deadline-exceeded.

On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph. Wherever we retrigger, you get everything below it, so you’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB, and TC doesn’t have a concept of retries.

  • How do we avoid duplicate reporting? TC will be considered source of truth in the future. Unsure about interim. Maybe TH can ignore duplicates since the builder names will be the same.

  • Replacing the scheduler: what does that mean exactly?

    • Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
    • Remove all scheduling from BuildBot and Hg polling

Roll-out plan

  • Connected to the Alder branch currently
  • Replacing some of the Alder schedulers with TaskGraphs
  • All the BB Alder schedulers are disabled, and we were able to get a push to generate a TaskGraph!

Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.

TaskCluster migration: a “hello, world” for worker task creator

On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

  • You can run the whole configuration locally by copy and pasting a shell script that’s output by the TaskCluster tools
  • There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.