I am a Quality Engineer at Mozilla, contributing since 2006, currently focused on improving Graphics quality in Gecko. I currently reside in British Columbia but grew up Ontario. I enjoy hiking and photography in my spare time.
If you use the Bugzilla Socorro Lens add-on, please uninstall it and read on…
Just over a year ago I wrote a post about an add-on I created for Firefox dubbed Bugzilla Socorro Lens. In case you’ve never used it, it basically injects a chart (powered by MetricsGraphics.js) into the Crash Signatures section of a bug report. As someone who does bug triage frequently it is extremely useful to be able to see at a glance the activity of a particular crash. Anecdotally, I’ve received a lot of compliments about the add-on which have reinforced my belief in how useful it is to a certain group of Mozillians.
Over the past year usage of my add-on has grown 64%, recently peaking at 115 daily active users. I spent last quarter fixing a bunch of nagging bugs and even found time to work on a couple of longstanding feature requests from my most engaged users.
This quarter I have embarked on a journey of integrating this functionality natively into Bugzilla. With help from Dylan Hardison should be live on Tuesday October 31, 2017. I am hopeful this will put the power of Bugzilla Socorro Lens in the hands of many more people while adding minimal overhead to bugzilla.mozilla.org itself.
Please uninstall the Bugzilla Socorro Lens add-on as I will no longer be maintaining it. Thank you everyone who provided compliments, feedback and bug reports over the last year. You helped me turn an accidental pet project into something that is useful for the many who use it, allowing me to contribute to Mozilla in a way I’d never foreseen.
Going forward I will be focused on making fixes and improvements to the integrated product. Bugs Welcome.
[Opinions expressed herein are my own and have not been endorsed by Mozilla]
This week Mozillians from around the world are taking part in the biannual tradition of descending on a city for a week-long all-hands meeting. In this case the location is San Francisco. Unfortunately I’m not participating in this tradition for the first time in 9 years.
So what, What are you really missing?
On the surface it’s basically a 4-day meeting marathon but to me the meetings are the least valuable part of the week. To me these events are about relationships. All-hands is an opportunity to meet friends new and old, to share stories of survival and to laugh in the face of our failures. It is a time to talk candidly with leadership, to participate in real-time discussions with people I normally speak with over IRC or e-mail, and to talk with people I never get a chance to otherwise. Sharing stories in person is the best thing about these events.
This sounds awesome, why would you not go?
I want to begin by crediting Mozilla for standing up to the cultural regression that is occurring. They came out against Trump’s travel ban and have taken extra steps to ensure legal support during our travels and security at the All-hands. I commend them for taking these steps but it is still not enough to persuade me to risk crossing the border.
First and foremost I know some Mozillians have been barred from going simply because of their background. I chose not to go out in protest of Trump’s draconian policies and in solidarity with those of my peers who haven’t been given the same choice.
Secondly I am concerned that the attitudes in the Whitehouse have given license to bigotry. As someone who identifies as a member of the LGBTQ community I am worried about a regressive, draconian executive order being signed targeting this community. If this were to happen I would fully expect there to be protests in San Francisco. This would become a distraction from everything I’d come to accomplish as I wouldn’t think twice about joining in these protests, possibly resulting in my deportation.
Finally I want to avoid US border guards. I know myself, I respect authority as long as they respect my dignity and treat me as a human-being. I will not hesitate to fight back and make my voice heard if I feel mistreated. The outcome of which is likely to be detainment and/or being escorted out. Statistically the potential of this happening is low but it’s not zero. I have chosen to avoid this situation altogether. I’m just not willing to put myself nor Mozilla in this position.
My experience at the US border has never been a positive one. Whether I was traveling for business and had the proper clearance, or if I was just heading down to the US for vacation, US Border guards have made me feel progressively less welcome in their country (clearly I’m not alone). Under Trump this has reached a point where I can’t be bothered anymore. Life is short and there’s many more welcoming countries I’ve yet to explore.
What about the next all-hands?
I always look forward to these gatherings. It’s often a rollercoaster of emotions and I always leave more tired but more re-energized than when I arrived. I often rediscover my passion to fight for the open web. It pains me to be missing out but I know it’d pain me more if I’d gone knowing some of my peers were being deprived of this opportunity due to an untenable political situation. I hope this is a mere blip and that I can one day join my friends in the US but for now it is terre sans homme, for business and for pleasure.
A few months ago I reported on an experiment I had run in Firefox Nightly 53 where I compared stability (ie. crashes) between users with and without GPU Process. Since then we’ve fixed some bugs and are now days away from releasing GPU Process in Firefox 53. In anticipation of this moment I wanted to make sure we were ready so I organized a repeat of the previous experiment but this time on our Beta users.
As anticipated the results were different than we saw on Nightly but still favourable. I’d like to highlight some of those results in this post.
Anyone with Windows 7 Platform Update or later with a graphics card that has not previously been blacklisted, and who have multi-process enabled are supported. As it stands this represents approximately 25% of users. Since GPU Process is enabled by default for these users, the experiment randomly selects half of these users and turns off GPU Process by flipping a pref. There is a small percentage of noise in the data (+/- ~0.5%) since flipping the pref doesn’t actually turn off GPU Process until the first restart following the pref flip.
17% fewer driver related crashes
In the grand scheme of things graphics driver crashes represented about 3.17% of all crashes reported. For those with GPU Process enabled the percentage was much lower (2.81%) compared to those with GPU Process disabled (3.54%). While these numbers are low it’s worth noting that all users are not created equal. Some users may experience more driver-related crashes than others based on a multitude of factors (modernity of OS and driver updates, hardware, and mixture of third-party software, etc). It is conceivable, although not provable by this experiment, that the impact of this change would be more noticeable to users more prone to driver related issues. It’s also worth noting that we managed to make this change without introducing any new driver related crashes in the UI process which means Firefox should be much less prone to crashing entirely because of an interaction with the driver, although content may still be affected.
22% fewer Direct3D related crashes
Another category of crashes that sees improvements from GPU Process is D3D related crashes. This category of crashes typically involves hardware accelerated content on the web. In the past we’d see these occurring in the UI process which resulted in Firefox crashing completely. Now, with GPU Process we see about a 1/5th reduction in these crashes and those that remain tend not to happen in the UI process anymore. The end-user impact is that you might have to reload a page but Firefox dies less often.
11% fewer Direct3D accelerated video crashes
More stable hardware accelerated video another interesting benefit of GPU Process. We see about 11% fewer DXVA (DirectX Video Accelerator) related crashes in the test group with GPU Process enabled than the test group with it disabled. The end result of this should be slightly fewer crashes which take down the browser when viewing hardware accelerated video, think sites like YouTube.
Looking at the topcrash charts from Socorro shows expected movement in overall crash volumes. Browser topcrashes are down approximately 10% overall when GPU Process is enabled. Meanwhile, Content topcrashes are up approximately 8% and Plugin topcrashes are down 18%. Topcrashes in the GPU process only account for 0.13% of the overall topcrash volume. These numbers are in line with what we expected based on prior testing.
All of the previous metrics are based on anonymous data we receive from Socorro, ie. crash reports submitted by users. This data is extremely useful in digging into very specific details about crashes but it biases towards users who submit crash reports. Telemetry gives us less detailed information but is better at determining the broader impact of features since it represents a broader user population.
For the purposes of the experiment I compared the overall crash rate for each test group, where crash rate is defined as the number of crashes per 1,000 hours of aggregate browser usage across the entire test group population. The findings from the experiment are interesting in that we saw a slight increase in the Browser process crash rate (up 5.9% in the Enabled cohort), a smaller increase in the Content process crash rate (up 2.5% in the Enabled cohort), and a large decrease in the Plugin crash rate (down 20.6% in the Enabled cohort).
Overall, however, comparing the Enabled cohort to Firefox 52.0.2 does show more expected results with 0.51% lower browser crash rate, 18.1% higher content crash rate, and 18% lower plugin crash rate. It’s also worth noting that the crash rate for GPU Process crashes is very low, relatively, with just 0.07 crashes per 1,000 usage hours. Put in other terms that’s one GPU Process crash every 14,285 hours compared to one Browser process crash every 353 hours.
In the end I think we have accomplished our goal: introducing a GPU process (a foundational piece of the Quantum project) without regressing overall product stability. We’ve reduced the overall volume of some categories of graphics-related crashes while making others less prone to taking down the entire browser. One of the primary fears was that in doing so we’d introduce new ways to take down the browser but I’ve not yet found evidence that this has happened.
Of course the numbers themselves don’t tell the whole story. Now begins a deeper investigation of which crashes (ie. signatures) have changed significantly so that we can improve GPU Process further, but I think we have an excellent foundation to build from.
And of course, watching what happens as we roll out to users in Release with Firefox 53 in the coming days.
Update: It was pointed out that it was hard to know what the charts measure specifically due to unlabeled axes. For all charts measuring crashes (ie. not the percentage charts) the Y-Axis represents crash rate, where crash rate for Telemetry data is defined as “crashes per 1000 hours of usage” and crash rate for Socorro data is defined as “crashes per number of unique installations”. The latter really only applies to the Breakdown by Vendor chart and the vendor bars in the Breakdown of GPU Process Crashes chart. The x-axis for all charts is the date.
GPU Process has landed with 4% fewer crash reports overall!
Several months ago Mozilla’s head of Platform Engineering, David Bryant, wrote a post on Medium detailing a project named Quantum. Built on the foundation of quality and stability we’d built up over the previous year, and using some components of what we learned through Servo, this project seeks to enable developers to “tap into the full power of the underlying device”. As David points out, “it’s now commonplace for devices to incorporate one or more high-performance GPUs”.
It may surprise you to learn that one of these components has already landed in Nightly and has been there for several weeks: GPU Process. Without going in to too much technical detail this basically adds a separate Firefox process set aside exclusively for things we want the GPU (graphics processing unit) to handle.
I started doing quality assurance with Mozilla in 2007 and have seen a lot of pretty bad bugs over the years. From subtle rendering issues to more aggressive issues such as the screen going completely black, forcing the user to restart their computer. Even something as innocuous as playing a game with Firefox running in the background was enough to create a less than desirable situation.
Unfortunately many of these issues stem from interactions with drivers which are out of our control. Especially in cases where users are stuck with older, unmaintained drivers. A lot of the time our only option is to blacklist the device and/or driver. This forces users down the software rendering path which often results in a sluggish experience for some content, or missing out altogether on higher-end user experiences, all in an effort to at least stop them from crashing.
While the GPU Process won’t in and of itself prevent these types of bugs, it should enable Firefox to handle these situations much more gracefully. If you’re on Nightly today and you’re using a system that qualifies (currently Windows 7 SP1 or later, a D3D9 capable graphics card, and whitelisted for using multi-process Firefox aka e10s), you’ve probably had GPU Process running for several weeks and didn’t even notice. If you want to check for yourself it can be found in the Graphics section of the about:support page. To try it out do something that normally requires the video card (high quality video, WebGL game, etc) and click the Terminate GPU Process button — you may experience a slight delay but Firefox should recover and continue without crashing.
Before I go any further I would like to thank David Anderson, Ryan Hunt, and George Wright for doing the development work to get GPU Process working. I also want to thank Felipe Gomes and Jukka Jylänki for helping me work through some bugs in the experimentation mechanism so that I could get this experiment up and running in time.
As a first milestone for GPU Process we wanted to make sure it did not introduce a serious regression in stability and so I unleashed an experiment on the Nightly channel. For two weeks following Christmas, half of the users who had GPU Process enabled on Nightly were reverted to the old, one browser + one content process model. The purpose of this experiment was to measure the anonymous stability data we receive through Telemetry and Socorro, comparing this data between the two user groups. The expectation was that the stability numbers would be similar between the two groups and the hope was that GPU Process actually netted some stability improvements.
Now that the experiment has wrapped up I’d like to share the findings. Before we dig in I would like to explain a key distinction between Telemetry and Socorro data. While we get more detailed data through Socorro (crash signatures, graphics card information, etc), the data relies heavily on users clicking the Report button when a crash occurs; no reports = no data. As a result Socorro is not always a true representation of the entire user population. On the other hand, Telemetry gives us a much more accurate representation of the user population since data is submitted automatically (Nightly uses an opt-out model for Telemetry). However we don’t get as much detail, for example we know how many crashes users are experiencing but not necessarily which crashes they happen to be hitting.
I refer to both data sets in this report as they are each valuable on their own but also for checking assumptions based on a single source of data. I’ve included links to the data I used at the end of this post.
As a note, I will the terminology “control” to refer to those in the experiment who were part of the control group (ie. users with GPU Process enabled) and “disabled” to refer to those in the experiment who were part of the test group (ie. users with GPU Process disabled). Each group represents a few thousand Nightly users.
To start with I’d like to present the daily trend data. This data comes from Socorro and is graphed on my own server using the MetricsGraphics.js framework. As you can see, day to day data from Socorro can be quite noisy. However when we look at the trend over time we can see that overall the control group reported roughly 4% fewer crashes than those with GPU Process disabled.
Breakdown of GPU Process Crashes
Crashes in the GPU Process itself compare favourably across the board, well below 1.0 crashes per 1,000 hours of usage, and much less than crash rates we see from other Firefox processes (I’ll get into this more below). The following chart is very helpful in illustrating where our challenges might lie and may well inform roll-out plans in the future. It’s clear to see that Windows 8.1 and AMD hardware are the worst of the bunch while Windows 7 and Intel is the best.
Breakdown by Process Type
Of course, the point of GPU Process is not just to see how stable the process is itself but also to see what impact it has on crashes in other processes. Here we can see that stability in other processes is improved almost universally by 5%, except for plugin crashes which are up by 5%.
GPU Driver Crashes
One of the areas we expected to see the biggest wins was in GPU driver crashes. The theory is that driver crashes would move to the GPU Process and no longer take down the entire browser. The user experience of drivers crashing in the GPU Process still needs to be vetted but there does appear to be a noticeable impact with driver crash reports being reduced overall by 45%.
Breakdown by Platform
Now we dig in deeper to see how GPU Process impacts stability on a platform level. Windows 8.0 with GPU Process disabled is the worst especially when Plugin crashes are factored in while Windows 10 with GPU Process enabled seems to be quite favourable overall.
Breakdown by Architecture
Breaking the data down by architecture we can see that overall 64-bit seems to be much more stable overall than 32-bit. 64-bit sees improvement across the board except for plugin process crashes which regress significantly. 32-bit sees an inverse effect albeit at a smaller scale.
Breakdown by Compositor
Taking things down to the compositor level we can see that D3D11 performs best overall, with or without the GPU Process but does seem to benefit from having it enabled. There is a significant 36% regression in Plugin process crashes though which needs to be looked at — we don’t see this with D3D9 nor Basic compositing. Looking at D3D9 itself seems to carry a regression as well in Content and Shutdown crashes. These are challenges we need to address and keep in mind as we determine what to support as we get closer to release.
Breakdown by Graphics Card Vendor
Looking at the data broken down by graphics card vendors there is significant improvement across the board from GPU Process with the only exception being a 6% regression in Browser crashes on AMD hardware and a 12% regression in Shutdown crashes on Intel hardware. However, considering this data comes from Socorro we cannot say that these results are universal. In other words, these are regressions in the number of crashes reported which do not necessarily map one-to-one to the number of crashes occurred.
As a final comparison, since some of the numbers above varied quite a lot between the Control and Test groups, I wanted to look at the top signatures reported by these separate groups of users to see where they did and did not overlap.
This first table shows the signatures that saw the greatest improvement. In other words these crashes were much less likely to be reported in the Control group. Of note there are multiple signatures related to Adobe Flash and some related to graphics drivers in this list.
This next table shows the inverse, the crashes which were more likely to be reported with GPU Process enabled. Here we see a bunch of JS and DOM related signatures appearing more frequently.
These final tables break down the signatures that didn’t show up at all in either of the cohorts. The top table represents crashes which were only reported when GPU Process was enabled, while the second table are those which were only reported when GPU Process was disabled. Of note there are more signatures related to driver DLLs in the Disabled group and more Flash related signatures in the Enabled group.
In this first attempt at a GPU process things are looking good — I wouldn’t say we’re release ready but it’s probably good enough to ship to a wider test audience. We were hoping that stability would be on-par overall with most GPU related crashes moving to the GPU Process and hopefully being much more recoverable (ie. not taking down the browser). The data seems to indicate this has happened with a 5% win overall. However this has come at a cost of a 5% regression in plugin stability and seems to perform worse under certain system configurations once you dig deeper into the data. These are concerns that will need to be evaluated.
It’s worth pointing out that this data comes from Nightly users, users who trend towards more modern hardware and more up to date software. We might see swings in either direction once this feature reaches a wider, more diverse population. In addition this experiment only hit half of those users who qualified to use GPU Process which, as it turns out, is only a few thousand users. Finally, this experiment only measured crash occurrences and not how gracefully GPU Process crashes — a critical factor that will need to be vetted before we release this feature to users who are less regression tolerant.
As I close out this post I want to take another moment to thank David Anderson, Ryan Hunt, and George Wright for all the work they’ve put in to making this first attempt at GPU Process. In the long run I think this has the potential to make Firefox a lot more stable and faster than previous Firefox versions but potentially the competition as well. It is a stepping stone to what David Bryant calls the “next-generation web platform”. I also want to thank Felipe Gomes and Jukka Jylänki for their help getting this experiment live. Both of them helped me work through some bugs, despite the All-hands in Hawaii and Christmas holidays that followed — without them this experiment might not have happened in time for the Firefox 53 merge to Aurora.
If you made it this far, thank you for reading. Feel free to leave me feedback or questions in the comments. If you think there’s something I’ve reported here that needs to be investigated further please let me know and I’ll file a bug report.
Since joining the Platform Graphics team as a QA engineer several months ago I’ve dabbled in visualizing Graphics crash data using the Socorro supersearch API and the MetricsGraphics.js visualization library.
After I gained a better understanding of the API and MG.js I set up a github repo as sandbox to play around with visualizing different types of crash data. Some of these experiments include a tool to visualize top crash signatures for vendor/device/driver combinations, a tool to compare crash rates between Firefox and Fennec, a tool to track crashes from the graphics startup test, and a tool to track crashes with a WebGL context.
Fast forward to June, I had the opportunity to present some of this work at the Mozilla All-hands in London. As a result of this presentation I had a fruitful conversation with Benoit Girard, fellow engineer on the Graphics team. We talked about integrating my visualization tool with Bugzilla by way of a Bugzilla Tweaks add-on; this would both improve the functionality of Bugzilla and improve awareness of my tool. To my surprise this was actually pretty easy and I had a working prototype within 24 hours.
Since then I’ve iterated a few times, fixing some bugs based on reviews for the AMO Editors team. With version 0.3 I am satisfied enough to publicize it as an experimental add-on.
Bugzilla Socorro Lens (working title) appends a small snippet into the Crash Signatures field of bug reports, visualizing 365 days worth of aggregate crash data for the signatures in the bug. With BSL installed it becomes more immediately evident when a crash started being reported, if/when it was fixed, how the crash is trending, or if the crash is spiking; all without having to manually search Socorro.
Of course if you want to see the data in Socorro you can. Simply click a data-point on the visualization and a new tab will be opened to Socorro showing the crash reports for that date. This is particularly useful when you want to see what may be driving a spike.
At the moment BSL is an experimental add-on. I share it with you today to see if it’s useful and collect feedback. If you encounter a bug or have a feature request I invite you to submit an issue on my github repo. Since this project is a learning experience for me, as much as it is a productivity exercise, I am not accepting pull requests at this time. I welcome your feedback and look forward to improving my coding skills by resolving your issues.
[Update] Nicholas Nethercote informed me of an issue where the chart won’t display if you have the “Experimental Interface” enabled in Bugzilla. I have filed an issue in my github repo and will take a look at this soon. In the meantime, you’ll have to use the default Bugzilla interface to make use of this add-on. Sorry for the inconvenience.
We recently relaxed our graphics blocklist for users with NVIDIA hardware using certain older graphics drivers. The original blocklist entry blocked all versions less than v188.8.131.5265 due to stability issues with NVIDIA 182.65 and earlier. We recently learned however that the first two numbers in the version string indicate platform version and the latter numbers refer to the actual driver version. As a result we were inadvertently blocking newer drivers on older platforms.
We have since opened up the blacklist for versions newer than 184.108.40.20665 (Win XP) and 220.127.116.1165 (Vista/Win7) via bug 1284322, effectively drivers released beyond mid-2009.This change only exists on Nightly currently but we expect it to ride the trains unless some critical regression is discovered.
If you are triaging bugs and user feedback, or are engaging with users on social media, please keep an eye out for users with NVIDIA hardware. If the user does have NVIDIA hardware please have them check the Graphics section of about:support to confirm if they are using a driver version that was previously blocked. If they are try to help them get updated to the most recent driver version. If the issue persists, have them disable hardware acceleration to see if the issue goes away.
The same goes if you are a user experiencing quality issues (crashes, hangs, black screening, checkerboarding, etc) on NVIDIA hardware with these drivers. Please make sure your drivers are up to date.
In either case, if the issue persists please file a bug so we can investigate what is happening.
I’m happy to share that I will be hosting my first ever All-hands session, Graphics Stability – Tools and Measurements (12:30pm on Tuesday in the Hilton Metropole), at the upcoming Mozilla All-hands in London. The session is intended to be a conversation about taking a more collaborative approach to data-driven decision-making as it pertains to improving Graphics stability.
I will begin by presenting how the Graphics team is using data to tackle the graphics stability problem, reflecting on the problem at hand and the various approaches we’ve taken to date. My hope is this serves as a catalyst to lively discussion for the remainder of the session, resulting in a plan for more effective data-driven decision-making in the future through collaboration.
I am extending an invitation to those outside the Graphics team, to draw on a diverse range of backgrounds and expertise. As someone with a background in QA, data analysis is an interesting diversion (some may call it an evolution) in my career — it’s something I just fell in to after a fairly lengthy and difficult transitional period. While I’ve learned a lot recently, I am an amateur data scientist at best and could certainly benefit from more developed expertise.
I hope you’ll consider being a part of this conversation with the Graphics team. It should prove to be both educational and insightful. If you cannot make it, not to worry, I will be blogging more on this subject after I return from London.