GPU Process in Beta 53

A few months ago I reported on an experiment I had run in Firefox Nightly 53 comparing stability (i.e. crashes) between users with and without GPU Process. Since then we've fixed some bugs and are now days away from releasing GPU Process in Firefox 53. In anticipation of this moment I wanted to make sure we were ready, so I organized a repeat of the previous experiment, this time on our Beta users.

As anticipated, the results were different from what we saw on Nightly but still favourable. I'd like to highlight some of those results in this post.

GPU Process is supported for anyone on Windows 7 Platform Update or later, with a graphics card that has not previously been blacklisted, and with multi-process Firefox enabled. As it stands this represents approximately 25% of users. Since GPU Process is enabled by default for these users, the experiment randomly selects half of them and turns off GPU Process by flipping a pref. There is a small amount of noise in the data (+/- ~0.5%) since flipping the pref doesn't actually turn off GPU Process until the first restart following the pref flip.
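(For readers curious what a 50/50 split like this looks like mechanically, here is a minimal sketch of deterministic cohort assignment. This is an illustration only, not the actual experiment code; the pref name and client id are placeholders I made up.)

```python
import hashlib

# Illustrative sketch only: not Mozilla's actual experiment mechanism.
GPU_PROCESS_PREF = "layers.gpu-process.enabled"  # assumed pref name, for illustration

def assign_cohort(client_id: str) -> str:
    """Deterministically split eligible users 50/50 by hashing a client id."""
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    # Even hash: leave the default (GPU Process on). Odd hash: flip the pref
    # off; the change only takes effect at the user's next restart.
    return "enabled" if int(digest, 16) % 2 == 0 else "disabled"

print(assign_cohort("example-client-id"))
```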

17% fewer driver-related crashes

In the grand scheme of things, graphics driver crashes represented about 3.17% of all crashes reported. For those with GPU Process enabled the percentage was much lower (2.81%) than for those with GPU Process disabled (3.54%). While these numbers are low, it's worth noting that not all users are created equal. Some users may experience more driver-related crashes than others based on a multitude of factors (how modern the OS and drivers are, hardware, mixture of third-party software, etc.). It is conceivable, although not provable by this experiment, that the impact of this change would be more noticeable to users more prone to driver-related issues. It's also worth noting that we managed to make this change without introducing any new driver-related crashes in the UI process, which means Firefox should be much less prone to crashing entirely because of an interaction with the driver, although content may still be affected.

22% fewer Direct3D-related crashes

Another category of crashes that sees improvement from GPU Process is D3D-related crashes. This category typically involves hardware-accelerated content on the web. In the past we'd see these occurring in the UI process, which resulted in Firefox crashing completely. Now, with GPU Process, we see about a one-fifth reduction in these crashes, and those that remain tend not to happen in the UI process anymore. The end-user impact is that you might have to reload a page, but Firefox dies less often.

11% fewer Direct3D-accelerated video crashes

More stable hardware-accelerated video is another interesting benefit of GPU Process. We see about 11% fewer DXVA (DirectX Video Acceleration) related crashes in the test group with GPU Process enabled than in the test group with it disabled. The end result should be slightly fewer crashes taking down the browser when viewing hardware-accelerated video on sites like YouTube.

Top crashes

Looking at the topcrash charts from Socorro shows expected movement in overall crash volumes. Browser topcrashes are down approximately 10% overall when GPU Process is enabled. Meanwhile, Content topcrashes are up approximately 8% and Plugin topcrashes are down 18%. Topcrashes in the GPU process only account for 0.13% of the overall topcrash volume. These numbers are in line with what we expected based on prior testing.

Telemetry

All of the previous metrics are based on anonymous data we receive from Socorro, i.e. crash reports submitted by users. This data is extremely useful for digging into very specific details about crashes, but it biases towards users who submit crash reports. Telemetry gives us less detailed information but is better at determining the broader impact of features since it represents a broader user population.

For the purposes of the experiment I compared the overall crash rate for each test group, where crash rate is defined as the number of crashes per 1,000 hours of aggregate browser usage across the entire test group population. The findings from the experiment are interesting in that we saw a slight increase in the Browser process crash rate (up 5.9% in the Enabled cohort), a smaller increase in the Content process crash rate (up 2.5% in the Enabled cohort), and a large decrease in the Plugin crash rate (down 20.6% in the Enabled cohort).

Overall, however, comparing the Enabled cohort to Firefox 52.0.2 does show more expected results, with a 0.51% lower browser crash rate, an 18.1% higher content crash rate, and an 18% lower plugin crash rate. It's also worth noting that the crash rate for GPU Process crashes is relatively very low, at just 0.07 crashes per 1,000 usage hours. Put another way, that's one GPU Process crash every 14,285 hours compared to one Browser process crash every 353 hours.
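To make the rate arithmetic concrete, here is a quick sketch of the conversions used above; the helper names are mine and the 100,000-hour figure is made up for the example.

```python
def crash_rate(crashes: int, usage_hours: float) -> float:
    """Crashes per 1,000 hours of aggregate browser usage."""
    return crashes / (usage_hours / 1000.0)

def hours_between_crashes(rate_per_1000h: float) -> float:
    """Invert a per-1,000-hour crash rate into mean hours between crashes."""
    return 1000.0 / rate_per_1000h

print(crash_rate(7, 100_000))       # 7 crashes in 100,000 hours -> 0.07
print(hours_between_crashes(0.07))  # -> ~14,286 hours per GPU Process crash
print(hours_between_crashes(2.83))  # rate implied by the 353-hour Browser figure
```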

Conclusions

In the end I think we have accomplished our goal: introducing a GPU process (a foundational piece of the Quantum project) without regressing overall product stability. We've reduced the overall volume of some categories of graphics-related crashes while making others less prone to taking down the entire browser. One of the primary fears was that in doing so we'd introduce new ways to take down the browser, but I've not yet found evidence that this has happened.

Of course the numbers themselves don't tell the whole story. Now begins a deeper investigation of which crashes (i.e. signatures) have changed significantly so that we can improve GPU Process further, but I think we have an excellent foundation to build from.

And, of course, we'll be watching what happens as we roll out to Release users with Firefox 53 in the coming days.

GPU Process Experiment Results

Update: It was pointed out that it was hard to know what the charts measure due to unlabeled axes. For all charts measuring crashes (i.e. not the percentage charts) the Y-axis represents crash rate, where crash rate for Telemetry data is defined as "crashes per 1,000 hours of usage" and crash rate for Socorro data is defined as "crashes per number of unique installations". The latter really only applies to the Breakdown by Vendor chart and the vendor bars in the Breakdown of GPU Process Crashes chart. The X-axis for all charts is the date.

GPU Process has landed with 4% fewer crash reports overall!

  • 1.2% fewer Browser crashes
  • 5.6% fewer Content crashes
  • 5.1% fewer Shutdown crashes
  • 5.5% more Plugin crashes
  • 45% fewer GPU driver crash reports!

Thanks to David Anderson, Ryan Hunt, George Wright, Felipe Gomes, Jukka Jylänki
Data sources available below

Several months ago Mozilla’s head of Platform Engineering, David Bryant, wrote a post on Medium detailing a project named Quantum. Built on the foundation of quality and stability we’d built up over the previous year, and using some components of what we learned through Servo, this project seeks to enable developers to “tap into the full power of the underlying device”. As David points out, “it’s now commonplace for devices to incorporate one or more high-performance GPUs”.

It may surprise you to learn that one of these components has already landed in Nightly and has been there for several weeks: GPU Process. Without going into too much technical detail, this basically adds a separate Firefox process set aside exclusively for things we want the GPU (graphics processing unit) to handle.

I started doing quality assurance with Mozilla in 2007 and have seen a lot of pretty bad bugs over the years, from subtle rendering issues to more aggressive ones such as the screen going completely black, forcing the user to restart their computer. Even something as innocuous as playing a game with Firefox running in the background was enough to create a less than desirable situation.

Unfortunately many of these issues stem from interactions with drivers which are out of our control, especially in cases where users are stuck with older, unmaintained drivers. A lot of the time our only option is to blacklist the device and/or driver. This forces users down the software rendering path, which often results in a sluggish experience for some content, or missing out altogether on higher-end experiences, all in an effort to at least stop them from crashing.

While GPU Process won't in and of itself prevent these types of bugs, it should enable Firefox to handle these situations much more gracefully. If you're on Nightly today and you're using a system that qualifies (currently Windows 7 SP1 or later, a D3D9-capable graphics card, and whitelisted for using multi-process Firefox, aka e10s), you've probably had GPU Process running for several weeks and didn't even notice. If you want to check for yourself it can be found in the Graphics section of the about:support page. To try it out, do something that normally requires the video card (high-quality video, a WebGL game, etc.) and click the Terminate GPU Process button; you may experience a slight delay but Firefox should recover and continue without crashing.

Before I go any further I would like to thank David Anderson, Ryan Hunt, and George Wright for doing the development work to get GPU Process working. I also want to thank Felipe Gomes and Jukka Jylänki for helping me work through some bugs in the experimentation mechanism so that I could get this experiment up and running in time.

The Experiment

As a first milestone for GPU Process we wanted to make sure it did not introduce a serious regression in stability and so I unleashed an experiment on the Nightly channel. For two weeks following Christmas, half of the users who had GPU Process enabled on Nightly were reverted to the old, one browser + one content process model. The purpose of this experiment was to measure the anonymous stability data we receive through Telemetry and Socorro, comparing this data between the two user groups. The expectation was that the stability numbers would be similar between the two groups and the hope was that GPU Process actually netted some stability improvements.

Now that the experiment has wrapped up I'd like to share the findings. Before we dig in I would like to explain a key distinction between Telemetry and Socorro data. While we get more detailed data through Socorro (crash signatures, graphics card information, etc.), the data relies heavily on users clicking the Report button when a crash occurs; no reports = no data. As a result Socorro is not always a true representation of the entire user population. On the other hand, Telemetry gives us a much more accurate representation of the user population since data is submitted automatically (Nightly uses an opt-out model for Telemetry). However, we don't get as much detail; for example, we know how many crashes users are experiencing but not necessarily which crashes they happen to be hitting.

I refer to both data sets in this report as they are each valuable on their own but also for checking assumptions based on a single source of data. I’ve included links to the data I used at the end of this post.

As a note, I will use the term "control" to refer to those in the experiment who were part of the control group (i.e. users with GPU Process enabled) and "disabled" to refer to those who were part of the test group (i.e. users with GPU Process disabled). Each group represents a few thousand Nightly users.

Daily Trend

To start with, I'd like to present the daily trend data. This data comes from Socorro and is graphed on my own server using the MetricsGraphics.js framework. As you can see, day-to-day data from Socorro can be quite noisy. However, when we look at the trend over time we can see that overall the control group reported roughly 4% fewer crashes than those with GPU Process disabled.

Trend in daily data coming from Socorro. [source]
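For anyone wanting to reproduce a chart like this, the daily counts can be pulled from Socorro's public SuperSearch API. The sketch below shows the general shape of such a query; the filter values and the response layout are my best recollection, not the exact query behind this chart.

```python
import requests

URL = "https://crash-stats.mozilla.org/api/SuperSearch/"
params = {
    "product": "Firefox",
    "release_channel": "nightly",
    "date": [">=2016-12-25", "<2017-01-09"],  # placeholder two-week window
    "_histogram.date": "product",  # bucket crash reports by day
    "_results_number": 0,          # skip individual reports, keep aggregates
}
resp = requests.get(URL, params=params)
resp.raise_for_status()
for bucket in resp.json()["facets"]["histogram_date"]:
    print(bucket["term"], bucket["count"])  # one line per day
```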

Breakdown of GPU Process Crashes

Crashes in the GPU Process itself compare favourably across the board: well below 1.0 crashes per 1,000 hours of usage, and much less than the crash rates we see from other Firefox processes (I'll get into this more below). The following chart is very helpful in illustrating where our challenges might lie and may well inform roll-out plans in the future. It's clear to see that Windows 8.1 and AMD hardware are the worst of the bunch while Windows 7 and Intel are the best.

Vendor data comes from Socorro crash reports, all other data comes from Telemetry [source]

Breakdown by Process Type

Of course, the point of GPU Process is not just to see how stable the process is itself but also to see what impact it has on crashes in other processes. Here we can see that stability in other processes is improved almost universally by 5%, except for plugin crashes, which are up by 5%.

A negative percentage represents an improvement in the Control group as it compares to the Disabled group while a positive percentage represents a regression. All data comes from Telemetry [source]

GPU Driver Crashes

One of the areas we expected to see the biggest wins was in GPU driver crashes. The theory is that driver crashes would move to the GPU Process and no longer take down the entire browser. The user experience of drivers crashing in the GPU Process still needs to be vetted, but there does appear to be a noticeable impact, with driver crash reports reduced by 45% overall.

All data comes from Socorro crash reports [source]
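For context, "driver crash" here means a crash whose signature implicates a graphics driver module. A rough sketch of that kind of classification, using a hand-picked (and incomplete) set of well-known driver DLL names and made-up signatures:

```python
# Partial list of well-known graphics driver modules (NVIDIA, AMD, Intel).
DRIVER_DLLS = ("nvwgf2um", "nvd3dum", "atidxx32", "atiumdag", "igd10umd32", "igdumd32")

def is_driver_crash(signature: str) -> bool:
    """Heuristic: does the crash signature mention a known driver DLL?"""
    sig = signature.lower()
    return any(dll in sig for dll in DRIVER_DLLS)

# Made-up example signatures, for illustration only.
signatures = [
    "nvwgf2um.dll@0x1a2b3c",
    "mozilla::gfx::DrawTargetD2D1::FillRect",
    "igd10umd32.dll@0x4d5e",
]
driver_related = [s for s in signatures if is_driver_crash(s)]
print(f"{len(driver_related)} of {len(signatures)} signatures are driver-related")
```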

Breakdown by Platform

Now we dig in deeper to see how GPU Process impacts stability on a platform level. Windows 8.0 with GPU Process disabled is the worst, especially when Plugin crashes are factored in, while Windows 10 with GPU Process enabled seems to be quite favourable overall.

A negative percentage represents an improvement in the Control group as it compares to the Disabled group while a positive percentage represents a regression. All data comes from Telemetry [source]

Breakdown by Architecture

Breaking the data down by architecture, we can see that 64-bit seems to be much more stable than 32-bit overall. 64-bit sees improvement across the board except for plugin process crashes, which regress significantly. 32-bit sees an inverse effect, albeit at a smaller scale.

A negative percentage represents an improvement in the Control group as it compares to the Disabled group while a positive percentage represents a regression. Data comes from Telemetry [source]

Breakdown by Compositor

Taking things down to the compositor level, we can see that D3D11 performs best overall, with or without GPU Process, but does seem to benefit from having it enabled. There is a significant 36% regression in Plugin process crashes, though, which needs to be looked at; we don't see this with D3D9 or Basic compositing. D3D9 itself seems to carry regressions in Content and Shutdown crashes as well. These are challenges we need to address and keep in mind as we determine what to support as we get closer to release.

A negative percentage represents an improvement in the Control group as it compares to the Disabled group while a positive percentage represents a regression. Data comes from Telemetry [source]

Breakdown by Graphics Card Vendor

Looking at the data broken down by graphics card vendor, there is significant improvement across the board from GPU Process, the only exceptions being a 6% regression in Browser crashes on AMD hardware and a 12% regression in Shutdown crashes on Intel hardware. However, considering this data comes from Socorro, we cannot say that these results are universal. In other words, these are regressions in the number of crashes reported, which do not necessarily map one-to-one to the number of crashes that occurred.

A negative percentage represents an improvement in the Control group as it compares to the Disabled group while a positive percentage represents a regression. Data comes from Socorro crash reports [source]

Signature Comparison

As a final comparison, since some of the numbers above varied quite a lot between the Control and Test groups, I wanted to look at the top signatures reported by these separate groups of users to see where they did and did not overlap.

This first table shows the signatures that saw the greatest improvement. In other words these crashes were much less likely to be reported in the Control group. Of note there are multiple signatures related to Adobe Flash and some related to graphics drivers in this list.

Crashes reported less frequently with GPU Process. Signatures highlighted green are improved by more than 50%, blue is 10-50% improvement, and yellow is 0-10% improvement. Data comes from Socorro crash reports [source].

This next table shows the inverse: the crashes which were more likely to be reported with GPU Process enabled. Here we see a bunch of JS- and DOM-related signatures appearing more frequently.

Crashes reported more frequently with GPU Process. Signatures in red are more than 50% worse, orange are 10-50% worse, and yellow are 0-10% worse. Data comes from Socorro crash reports. [source]

These final tables break down the signatures that showed up in only one of the cohorts. The top table represents crashes which were only reported when GPU Process was enabled, while the second table shows those which were only reported when GPU Process was disabled. Of note, there are more signatures related to driver DLLs in the Disabled group and more Flash-related signatures in the Enabled group.

Crashes which show up only when GPU Process is enabled, or only when disabled. Data comes from Socorro crash reports. [source]

Conclusions

In this first attempt at a GPU process things are looking good. I wouldn't say we're release-ready, but it's probably good enough to ship to a wider test audience. We were hoping that stability would be on par overall, with most GPU-related crashes moving to the GPU Process and hopefully being much more recoverable (i.e. not taking down the browser). The data seems to indicate this has happened, with a 5% win overall. However, this has come at the cost of a 5% regression in plugin stability, and GPU Process seems to perform worse under certain system configurations once you dig deeper into the data. These are concerns that will need to be evaluated.

It’s worth pointing out that this data comes from Nightly users, users who trend towards more modern hardware and more up to date software. We might see swings in either direction once this feature reaches a wider, more diverse population. In addition this experiment only hit half of those users who qualified to use GPU Process which, as it turns out, is only a few thousand users. Finally, this experiment only measured crash occurrences and not how gracefully GPU Process crashes — a critical factor that will need to be vetted before we release this feature to users who are less regression tolerant.

As I close out this post I want to take another moment to thank David Anderson, Ryan Hunt, and George Wright for all the work they've put into making this first attempt at GPU Process. In the long run I think this has the potential to make Firefox a lot more stable and faster, not only than previous Firefox versions but potentially than the competition as well. It is a stepping stone to what David Bryant calls the "next-generation web platform". I also want to thank Felipe Gomes and Jukka Jylänki for their help getting this experiment live. Both of them helped me work through some bugs, despite the All-hands in Hawaii and the Christmas holidays that followed; without them this experiment might not have happened in time for the Firefox 53 merge to Aurora.


If you made it this far, thank you for reading. Feel free to leave me feedback or questions in the comments. If you think there’s something I’ve reported here that needs to be investigated further please let me know and I’ll file a bug report.

Crash rate data by architecture, compositor, platform, process c/o Mozilla Telemetry
Vendor and Topcrash data c/o Mozilla Socorro


Visualizing Crash Data in Bugzilla

Since joining the Platform Graphics team as a QA engineer several months ago I’ve dabbled in visualizing Graphics crash data using the Socorro supersearch API and the MetricsGraphics.js visualization library.

After I gained a better understanding of the API and MG.js, I set up a github repo as a sandbox to play around with visualizing different types of crash data. Some of these experiments include a tool to visualize top crash signatures for vendor/device/driver combinations, a tool to compare crash rates between Firefox and Fennec, a tool to track crashes from the graphics startup test, and a tool to track crashes with a WebGL context.
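To give a flavour of the queries these tools are built on, here is a hedged sketch of a SuperSearch request faceting the top signatures for one device/driver combination. The adapter field names and the device id are my assumptions from memory of Socorro's schema, so treat them accordingly.

```python
import requests

URL = "https://crash-stats.mozilla.org/api/SuperSearch/"
params = {
    "product": "Firefox",
    "adapter_vendor_id": "0x8086",           # Intel
    "adapter_device_id": "0x0166",           # assumed id for the HD 4000
    "adapter_driver_version": "10.18.10.4276",
    "_facets": "signature",                  # aggregate into top signatures
    "_results_number": 0,
}
data = requests.get(URL, params=params).json()
for sig in data["facets"]["signature"][:10]:
    print(sig["count"], sig["term"])
```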

Top crash dashboard for our most common device/driver combination (Intel HD 4000 w/driver 10.18.10.4276)

Fast forward to June, when I had the opportunity to present some of this work at the Mozilla All-hands in London. As a result of this presentation I had a fruitful conversation with Benoit Girard, a fellow engineer on the Graphics team. We talked about integrating my visualization tool with Bugzilla by way of a Bugzilla Tweaks add-on; this would both improve the functionality of Bugzilla and improve awareness of my tool. To my surprise this was actually pretty easy and I had a working prototype within 24 hours.

Since then I’ve iterated a few times, fixing some bugs based on reviews for the AMO Editors team. With version 0.3 I am satisfied enough to publicize it as an experimental add-on.

Bugzilla Socorro Lens (working title) appends a small snippet to the Crash Signatures field of bug reports, visualizing 365 days worth of aggregate crash data for the signatures in the bug. With BSL installed it becomes more immediately evident when a crash started being reported, if/when it was fixed, how the crash is trending, and whether the crash is spiking, all without having to manually search Socorro.
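Conceptually the add-on just bridges two public APIs: read the signatures out of a bug's Crash Signatures field, then ask Socorro for a per-day histogram of each. A rough Python equivalent of the first half might look like this (the cf_crash_signature field name is my assumption about BMO's REST schema):

```python
import requests

def crash_signatures(bug_id: int) -> list[str]:
    """Read the Crash Signatures field of a bugzilla.mozilla.org bug.
    Assumes the REST field is named cf_crash_signature."""
    url = f"https://bugzilla.mozilla.org/rest/bug/{bug_id}"
    resp = requests.get(url, params={"include_fields": "cf_crash_signature"})
    bug = resp.json()["bugs"][0]
    # The field typically holds one "[@ signature]" entry per line.
    return [line.strip(" []@") for line in bug["cf_crash_signature"].splitlines() if line.strip()]

print(crash_signatures(1241921))
```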

Socorro snippet integration on https://bugzil.la/1241921

Of course if you want to see the data in Socorro you can. Simply click a data-point on the visualization and a new tab will be opened to Socorro showing the crash reports for that date. This is particularly useful when you want to see what may be driving a spike.

At the moment BSL is an experimental add-on. I share it with you today to see if it’s useful and collect feedback. If you encounter a bug or have a feature request I invite you to submit an issue on my github repo. Since this project is a learning experience for me, as much as it is a productivity exercise, I am not accepting pull requests at this time. I welcome your feedback and look forward to improving my coding skills by resolving your issues.

You can get the add-on from addons.mozilla.org.

[Update] Nicholas Nethercote informed me of an issue where the chart won’t display if you have the “Experimental Interface” enabled in Bugzilla. I have filed an issue in my github repo and will take a look at this soon. In the meantime, you’ll have to use the default Bugzilla interface to make use of this add-on. Sorry for the inconvenience.

Reducing the NVIDIA Blacklist

We recently relaxed our graphics blocklist for users with NVIDIA hardware using certain older graphics drivers. The original blocklist entry blocked all versions less than v8.17.11.8265 due to stability issues with NVIDIA 182.65 and earlier. We recently learned, however, that the first two numbers in the version string indicate the platform version and the latter numbers refer to the actual driver version. As a result we were inadvertently blocking newer drivers on older platforms.

We have since opened up the blocklist for versions newer than 8.15.11.8265 (Win XP) and 8.16.11.8265 (Vista/Win7) via bug 1284322, effectively allowing drivers released after mid-2009. This change currently exists only on Nightly, but we expect it to ride the trains unless some critical regression is discovered.
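In code form, my understanding of the mapping (an assumption on my part, so verify before relying on it) is that the marketing driver version is encoded in the last five digits of the string:

```python
def nvidia_marketing_version(driver_version: str) -> str:
    """Recover NVIDIA's marketing version (e.g. 182.65) from a Windows
    driver version string (e.g. 8.15.11.8265). Assumes the scheme described
    above: the first two fields are platform, the rest encode the driver."""
    parts = driver_version.split(".")
    digits = "".join(parts[2:])      # "11" + "8265" -> "118265"
    tail = digits[-5:]               # last five digits -> "18265"
    return f"{tail[:3]}.{tail[3:]}"  # -> "182.65"

# Both old thresholds decode to the same driver, 182.65, on different platforms.
assert nvidia_marketing_version("8.15.11.8265") == "182.65"  # Win XP platform prefix
assert nvidia_marketing_version("8.17.11.8265") == "182.65"  # newer platform prefix
```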

If you are triaging bugs and user feedback, or are engaging with users on social media, please keep an eye out for users with NVIDIA hardware. If a user does have NVIDIA hardware, please have them check the Graphics section of about:support to confirm whether they are using a driver version that was previously blocked. If they are, try to help them update to the most recent driver version. If the issue persists, have them disable hardware acceleration to see if the issue goes away.

The same goes if you are a user experiencing quality issues (crashes, hangs, black screening, checkerboarding, etc) on NVIDIA hardware with these drivers. Please make sure your drivers are up to date.

In either case, if the issue persists please file a bug so we can investigate what is happening.

Feel free to email me if you have any questions.

Thank you for your help!

London Calling

I’m happy to share that I will be hosting my first ever All-hands session, Graphics Stability – Tools and Measurements (12:30pm on Tuesday in the Hilton Metropole), at the upcoming Mozilla All-hands in London. The session is intended to be a conversation about taking a more collaborative approach to data-driven decision-making as it pertains to improving Graphics stability.

I will begin by presenting how the Graphics team is using data to tackle the graphics stability problem, reflecting on the problem at hand and the various approaches we've taken to date. My hope is that this serves as a catalyst for lively discussion for the remainder of the session, resulting in a plan for more effective data-driven decision-making in the future through collaboration.

I am extending an invitation to those outside the Graphics team, to draw on a diverse range of backgrounds and expertise. As someone with a background in QA, data analysis is an interesting diversion (some may call it an evolution) in my career; it's something I just fell into after a fairly lengthy and difficult transitional period. While I've learned a lot recently, I am an amateur data scientist at best and could certainly benefit from more developed expertise.

I hope you’ll consider being a part of this conversation with the Graphics team. It should prove to be both educational and insightful. If you cannot make it, not to worry, I will be blogging more on this subject after I return from London.

Feel free to reach out to me if you have questions.

9 Years

Today I pause to reflect.

9 years ago last Saturday was my first day at Mozilla as an intern. It was also the first time I set foot in Silicon Valley, the first time I set foot in California, even the first time I set foot on the west coast of North America. It was a day of many emotions: anticipation, excitement, fear, and pride.

Taking big risks is not something that comes easy to me. Deciding to hop on a plane and travel to California not knowing what my future held. Deciding only a few months earlier to take a risk with my career and venture into open source software. Deciding just two years before to risk leaving a stable career in the Canadian military to earn a software development degree at Seneca College. These are things that took a lot of personal courage, and I suspect much courage for my parents as well.

Mozilla has changed a lot since that early Spring day in 2007 but so have I. Through all the things I’ve experienced. The ups and downs. The re-orgs and pivots. The flamewars. The burnout. The one thing that hasn’t changed is the people of Mozilla — they are my rock.

They provided guidance and mentorship when I needed it most. They helped me learn from my mistakes, celebrate my victories, and saved me from burning out on more than one occasion. In short, I would not be here today without them.

Some of my best personal relationships are those I’ve developed at Mozilla. I hold the highest of respect for these people and while many have moved on I still think of them frequently. I strive to be what they were to me: a mentor in work and a mentor in life.

If you’re reading this and we’ve interacted at all in the past, I write this for you. Every experience we’ve shared, every discussion we’ve had, it changes us. This I believe fundamentally.

When people ask me why I work for Mozilla (and why I’ll continue to for the foreseeable future), the answer is simple: the people. To me they are like family, nothing else matters.


The Testday Brand

Over the last few months I've been surveying people who've participated in testdays. The purpose of this effort is to develop an understanding of the current "brand" that testdays present. I've come to realize that our goal to "re-invigorate" the testdays program was based on the assumption that testdays were both well-known and misunderstood. I wanted to cast aside that assumption and make progress on a new plan which includes developing a positive brand.

The survey itself was quite successful as I received over 200 responses, 10x what I normally get out of my surveys. I suspect this was because I kept it short, under a minute to complete; something I will keep in mind for the future.

Who Shared the Most?


When looking at who responded most, the majority were unaffiliated with QA (53%). Of the 47% who were affiliated with QA, nearly two thirds were volunteers.

How do they see themselves?


When looking at how respondents self-identified, only people who identified as volunteers did not self-identify as Mozillians. In terms of vouching, people affiliated with QA seem to have a higher proportion of vouched Mozillians than those unaffiliated with QA. This tells me that we need to be doing a better job of converting new contributors into Mozillians and active contributors into vouched Mozillians.

What do they know about Testdays?


When looking at how familiar people are with the concept of testdays, people affiliated with QA are most aware while people outside of QA are least aware. No group was 100% familiar with testdays, which tells me we need to do a better job of educating people about testdays.

What do they think about Testdays?


Most respondents identified testdays with some sort of activity (30%), a negative feeling (22%), a community aspect (15%), or a specific product (15%). Positive characteristics were lowest on the list (4%). This was probably the most telling question I asked, as it really helps me see the current state of the testday brand, beyond just the responses I received. Reading between the lines, looking for what is not there, I can see testdays relate poorly to anything outside the scope of blackbox testing on Firefox (e.g. automation, services, web QA, security QA, etc.).

Where do I go from here?

1. We need to diversify the testday brand so that it's about more than testing Firefox, expanding it to enable testing across all areas in need.

2. We need to address some of the negative brand associations by making activities more understandable and relevant, by having shorter events more frequently, and by rewarding contributions (even work that doesn't net a bug).

3. We need to teach people that testdays are about more than just testing. Things like writing tests, writing new documentation, updating and translating existing documentation, and mentoring newcomers are all part of what testdays can enable.

4. Once we’ve identified the brand we want to put forward, we need to do a much better job of frequently educating and re-educating people about testdays and the value they provide.

5. We need to enable testdays to facilitate converting newcomers into Mozillians and active contributors into vouched Mozillians.

My immediate next step is to have the lessons I’ve learned here integrated into a plan of action to rebrand testdays. Rest assured I am going to continue to push my peers on this, to be an advocate for improving the ways we collaborate, and to continually revisit the brand to make sure we aren’t losing sight of reality.

I’d like to end with a thank you to everyone who took the time to respond to my survey. As always, please leave a comment below if you have any interesting insights or questions.

Thank you!

Report on Recognition

Earlier this year I embarked on a journey to investigate how we could improve participation in the Mozilla QA community. I was growing concerned that we were failing in one key component of a vibrant community: recognition.

Before I could begin, I needed to better understand Mozilla QA's recognition story. To this end I published a survey to get anonymous feedback, and today I'd like to share some of that feedback.

Profile of Participants

The first question I asked was intended to profile the respondents in terms of how long they’d been involved with Mozilla and whether they were still contributing.

This revealed that we have a larger proportion of contributors who've been involved for more than a couple of years. I think this indicates we need to be doing a better job of developing long-term relationships with new contributors.

When asked which projects contributors identified with, 100% of respondents identified as being volunteers with the Firefox QA team. The remaining teams break down fairly evenly between 11% and 33%. I think this indicates most people are contributing to more than one team, and that teams at the lower end of the scale have an excellent opportunity for growth.

Recognizing Recognition

The rest of the questions were focused more on evaluating the forms of recognition we’ve employed in the past.


When looking at how we’ve recognized contributors it’s good to see that everyone is being recognized in some form or another, in many cases receiving multiple forms of recognition. However I suspect the results are somewhat skewed (ie. people who haven’t been recognized are probably long gone and did not respond to the survey). In spite of that, it appears that seemingly simple things, like being thanked in a meeting, are well below what I’d expect to see.

When looking at the impact of being recognized, it seems that more people found recognition to be nice but not necessarily a motivation for continuing to contribute. 44% found recognition to be either ineffective or very ineffective while 33% found it to be either effective or very effective. This could point to a couple of different factors: either our forms of recognition are not compelling, or people are motivated by the work itself. I don't have a good answer here so it's probably worth following up.

What did we learn?

When all is said and done, here is what I learned from doing this survey.

1. We need to be focused on building long-term relationships. Helping people through their first year and making sure people don't get lost long-term.

2. Most people are contributing to multiple projects. We should have a framework in place that facilitates contribution (and recognition of contribution) across QA. Teams with less participation can then scale more quickly.

3. We need to be more proactive in our recognition, especially in its simplest form. There is literally no excuse for not thanking someone for work done.

4. People like to be thanked for their work but it isn't necessarily a definitive motivator for participation. We need to learn more about what drives individuals and make sure we provide them whatever they need to stay motivated.

5. Recognition is not as well "baked in" to QA as it is with other teams; we should work with these teams to improve recognition within QA and across Mozilla.

6. Contributors find testing to be difficult due to inadequate descriptions of how to test. In some cases, people spend considerable amounts of time and energy figuring out what and how to test, presenting a huge hurdle to newcomers in particular. We should make sure contribution opportunities are clearly documented so that anyone can get involved.

7. We should be engaging with Mozilla Reps to build a better, more regional network of QA contributors, beginning with giving local leaders the opportunity to lead.

Next Steps

In closing, I'd like to thank everyone who took the time to share their feedback. The survey remains open if you missed the opportunity. I'm hoping this blog post will help kickstart a conversation about improving recognition of contributions to Mozilla QA, and in particular about making progress on some of the lessons learned.

As always, I welcome comments and questions. Feel free to leave a comment below.

Cheers!

Ninety Days with DOM

Last quarter marked a fairly significant change in my career at Mozilla. I spent most of the quarter adjusting to multiple re-orgs which left me as the sole QA engineer on the DOM team. Fortunately, as the quarter wraps up I feel like I’ve begun to adjust to my new role and started to make an impact.

Engineering Impact

My main objective this quarter was to improve the flow of DOM bugs in Bugzilla by developing and documenting some QA processes. A big part of that work was determining how I was going to measure impact, so I decided the simplest way to do that was to take the queries I was going to be working with and plot the data in Google Docs.

The solution was fairly primitive and lacked the ability to feed into a dashboard in any meaningful way, but as a proof of concept it was good enough. I established a baseline using week-by-week numbers going back a couple of years. What follows is a crude representation of these figures and how the first quarter of 2015 compares to the three years of history I've now recorded.
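The queries themselves were ordinary Bugzilla searches; a sketch of pulling one such count over BMO's REST API might look like the following (the filter values are illustrative, not my exact queries, and the limit behaviour is an assumption):

```python
import requests

URL = "https://bugzilla.mozilla.org/rest/bug"
params = {
    "product": "Core",
    "component": "DOM",
    "keywords": "regression",
    "resolution": "---",        # still unresolved
    "include_fields": "id",     # ids only; we just want the count
    "limit": 0,                 # assumed to mean "no row cap" on BMO
}
bugs = requests.get(URL, params=params).json()["bugs"]
print(f"unresolved DOM regressions: {len(bugs)}")
```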

Volume of unresolved Regressions & Crashes
Regressions +55%, Crashes +188% since 2012

Year-over-Year trend in Regressions and Crashes
Regressions +9%, Crashes +68% compared to same time last year.

Regressions and Crashes in First Quarters
Regressions -0.6%, Crashes +19% compared to previous 1st Quarters

Resolution Rate of Regressions and Crashes
90% of Regressions resolved (+2.5%), 80% of Crashes resolved (-7.0%)

Change in Resolution Rate compared to total Volume
Regression resolution +2.5%, Crash resolution -6.9%, Total volume +68%

I know that’s a lot of data to digest but I believe they show embedding QA with the DOM team is having some initial success.

It’s important to acknowledge the DOM team for maintaining a very high resolution rate (90% for regressions, 80% for crashes) in the face of aggressive gains in total bug volume (68% in three years). They have done this largely on their own with minimal assistance from QA over the years, giving us a solid foundation from which we could build.

For DOM regressions I focused on making existing bug reports actionable, with less focus on filing new regression bugs; this has been a two-part effort. The first part focused on finding regression windows for known regression bugs, the second on converting unconfirmed bugs into actionable regression reports. I believe this is why we see a marginal increase in the regression resolution rate (+0.4% last quarter).

For DOM crashes I focused on filing previously unreported crashes (basically anything above a 1% report threshold). Naturally this has led to an increase in reports, but it has also led to some crashes being fixed that wouldn't have been otherwise. Overall the crash resolution rate declined by 2.6% last quarter, but I believe this should ultimately lead to a more stable product in the future.

The Older Gets Older

The final chart below plots the median age of unresolved DOM bugs week over week, which currently sits at 542 days: an increase of 4.8% this past quarter and 241% since January 1, 2012. I include it here not as a visualization of impact but as a general curiosity.

Median Age of Unresolved DOM Bugs
Median age is 542 days, +4.8% last quarter, +241% since 2012

I have not yet figured out what this means in terms of overall quality or whether it's something we need to address. I suspect recently reported bugs tend to get fixed sooner since they tend to be more immediately visible than older bugs, a fact that is likely common to most, if not all, components in Bugzilla. It might be interesting to see how this breaks down in terms of the age of the bugs being fixed.
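For reference, the metric here is just the median of (today minus creation date) across unresolved bugs; a minimal sketch with made-up dates:

```python
from datetime import date
from statistics import median

def median_age_days(creation_dates: list[date], today: date) -> float:
    """Median age, in days, of unresolved bugs given their creation dates."""
    return median((today - created).days for created in creation_dates)

# Made-up sample: three unresolved bugs of varying age.
sample = [date(2012, 5, 1), date(2014, 2, 10), date(2015, 1, 20)]
print(median_age_days(sample, date(2015, 4, 1)))  # -> 415
```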

What’s Next

My plan for the second quarter is to identify a subset of these to take outside of Google Docs and convert into a proof of concept dashboard. I’m hoping my peers on the DOM team can help me identify at least a couple that would be both interesting and useful. If it works out, I’d like to aim for expanding this to more Bugzilla components later in the year so more people can benefit.

If you share my interest and have any insights please leave a comment below.

As always, thank you for reading.

[UPDATE: I decided to quickly mock up a chart showing the age breakdown of bugs fixed this past quarter. Younger bugs account for a much greater proportion of the bugs being fixed, perhaps expectedly.]


New Beginnings

…or trying to adapt to the inevitability of change.


Change is a reality of life common to all things; we all must adapt to change or risk obsolescence. I try to look at change as a defining moment, an opportunity to reflect, to learn, and to make an impact. It is in these moments that I reflect on the road I’ve traveled and attempt to gain clarity of the road ahead. This is where I find myself today.

How Did I Get Here?

In my younger days I wasted years in college jumping from program to program before eventually dropping out. I obviously did not know what I wanted to do with my life and I wasn’t going to spend thousands of dollars while I figured it out. This led to a frank and difficult discussion with my parents about my future which resulted in me enlisting in the Canadian military. As it happens, this provided me the space I needed to think about what I wanted to do going forward, who I wanted to be.

I served for three years before moving back to Ontario to pursue a degree in software development at the college I left previously. I had chosen a path toward working in the software industry. I had come to terms with a reality that I would likely end up working on some proprietary code that I didn’t entirely care for, but that would pay the bills and I would be happier than I was as a soldier.

After a couple of years following this path I met David Humphrey, a man who would change my life by introducing me to the world of open source software development. On a whim, I attended his crash-course, sacrificing my mid-semester week off. It was here that I discovered a passion for contributing to an open source project.

Up until this point I was pretty ignorant about open source. I had been using Linux for a couple of years but I didn't identify it as "open source"; it was merely a free-as-in-beer alternative to Windows. At this point I hadn't even heard of Mozilla Firefox. It was David who opened my eyes to this world: a world of continuous learning and collaboration, contributing to a freer and more open web. I quickly realized that choosing this path was about more than a job opportunity, more than a career; I was committing myself to a world view and my part to play in shaping it.

Over the last eight years I have continued to follow this path, from volunteering nights at school, through internships, a contract position, and finally full-time employment in 2010.

Change is a way of life at Mozilla

Since I began my days at Mozilla I have always been part of the same team. Over the years I have seen my team change dramatically but it has always felt like home.

We started as a small team of specialists working as a cohesive unit on a single product. Over time Mozilla’s product offering grew and so did the team, eventually leading to multiple sub-teams being formed. As time moved on and demands grew, we were segmented into specialized teams embedded on different products. We were becoming more siloed but it still felt like we were all part of the QA machine.

This carried on for a couple of years but I began to feel my connection to people I no longer worked with weaken. As this feeling of disconnectedness grew, my passion for what I was working on decreased. Eventually I felt like I was just going through the motions. I was demoralized and drifting.

This all changed for me again last year when Clint Talbert, our newly appointed Director and a mentor of mine since the beginning, developed a vision for tearing down those silos. It appeared as though we were going to get back to what made us great: a connected group of specialists. I felt nostalgic for a brief moment. Unfortunately this would not come to pass.

Moving into 2015 our team began to change again. After “losing” the B2G QA folks to the B2G team in 2014, we “lost” the Web and Services QA folks to the Cloud Services team. Sure the people were still here but it felt like my connection to those people was severed. It then became a waiting game, an inevitability that this trend would continue, as it did this week.

The Road Ahead

Recently I’ve had to come to terms with the reality of some departures from Mozilla.  People I’ve held dear for, and sought mentorship from, for many years have decided to move on as they open new chapters in their lives. I have seen many people come and go over the years but those more recently have been difficult to swallow. I know they are moving on to do great things and I’m extremely happy for them, but I’ll also miss them intensely.

Over the years I’ve gone from reviewing add-ons to testing features to driving releases to leading the quality program for the launch of Firefox Hello. I’ve grown a lot over the years and the close relationships I’ve held with my peers are the reason for my success.

Starting this week I am no longer a part of a centralized QA team, I am now the sole QA member of the DOM engineering team. While this is likely one of the more disruptive and challenging changes I’ve ever experienced, it’s also exciting to me.

Overcoming the Challenge

As I reflect on this entire experience I become more aware of my growth and the opportunity that has been presented. It is an opportunity to learn, to develop new bonds, to impact Mozilla’s mission in new and exciting ways. I will remain passionate and engaged as long as this opportunity exists. However, this change does not come without risk.

The greatest risk to Mozilla is if we are unable to maintain our camaraderie, to share our experiences, to openly discuss our challenges, to engage participation, and to visualize the broader quality picture. We need to strengthen our bonds, even as we go our separate ways. The QA team meeting will become ever more important as we become more decentralized and I hope that it continues.

Looking Back, Looking Forward

I’ve experienced a lot of change in my life and it never gets any less scary. I can’t help but fear reaching another “drifting point”. However, I’ve also learned that change is inevitable and that I reach my greatest potential by adapting to it, not fighting it.

I’m entering a new chapter in my life as a Mozillian and I’m excited for the road ahead.