Switching Normandy to use OIDC

Posted on 2016-05-03 in mozilla and normandy

Recently Normandy switched from authenticating users ourselves with boring usernames and passwords to using Mozilla's OIDC SSO to authenticate users more securely.

Normandy is a web service that holds a lot of influence over Firefox. Because of this, we have had a list of security features we've been working through. One of the big items on this list was to stop storing passwords and doing authentication of users ourselves.

We chose to use OIDC for this, primarily because it is the new hotness as far as authenticating Mozillians. It can use many sources of authentication, including Mozilla's LDAP servers, the canonical source of employee user data. This is exactly what we want to use for Normandy.


Normandy is a Django app, so we initially explored doing the integration with OIDC directly in the app. The idea would be to use an existing OIDC library to authenticate users with the Mozilla OIDC SSO, and correlate that to the existing users of the system via email address.

Unfortunately, we weren't able to get any of the libraries to work for us. The major problems we ran into were incompatibilities with something in our stack (Python 3.6 or OIDC specifically) or the implementation being too complex.

Instead we chose an easier process. Normandy is fronted by Nginx, which does some work with caching and logging. Our operations team has an Nginx access-proxy integration that works with our Nginx frontend. It passes authentication details to our app via the HTTP header Remote-User. This solution was much easier to implement: essentially we flipped a flag in Puppet, and we started getting authentication headers.

Changes to Django

Of course, sending the headers isn't enough. We also have to configure the app to read those headers and act accordingly. We did this with Django's RemoteUserBackend. This works by adding a middleware that annotates all requests with information about the authentication header, and an authentication backend that reads that information to sign a user in or out. If a user is authenticated via the Remote-User header, but does not exist in the database, the backend automatically creates the user and signs them in.

The default settings worked well for us. The only modification we needed was to tie it into our logging and settings systems. A simplified version of the changes is adding RemoteUserMiddleware to the MIDDLEWARE setting and RemoteUserBackend to AUTHENTICATION_BACKENDS. You can see the full changes in this pull request.
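
In settings terms, that boils down to something like this sketch (the surrounding middleware entries are illustrative, not Normandy's exact settings):

```python
# settings.py (sketch) -- hook Django's remote-user support into the stack.

MIDDLEWARE = [
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    # Reads the Remote-User header set by Nginx and annotates the request.
    'django.contrib.auth.middleware.RemoteUserMiddleware',
]

AUTHENTICATION_BACKENDS = [
    # Signs users in based on the annotated header, creating them if needed.
    'django.contrib.auth.backends.RemoteUserBackend',
]
```
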

Changes to Nginx

To implement the Nginx part of this, we modified the configuration to perform authentication via OIDC with Mozilla's SSO. That was implemented with lua-resty-openidc.

When an HTTP request comes in to Nginx for a URL covered by OIDC authentication, Nginx checks if the request has cookies that already authenticate it via OIDC. If it does not, Nginx redirects the request to Mozilla's SSO to perform authentication, which then redirects the user back to Nginx with authentication tokens to log the user in. Nginx validates these tokens, and then proxies the request to Normandy with the Remote-User header set.

Importantly, Nginx also strips any value of Remote-User that external users try to send. This way we don't allow users to sign in as anyone simply by passing an HTTP header. That would be bad.
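
In Nginx config terms, the shape of it is roughly this. It's a sketch, not our production config: the opts table (client ID, secret, discovery URL) is omitted, and reading the identity off res.id_token.email is an assumption about the claims.

```nginx
location / {
    access_by_lua_block {
        -- Drop any Remote-User header the client sent; only we may set it.
        ngx.req.clear_header("Remote-User")

        -- lua-resty-openidc handles the SSO redirect dance and token
        -- validation. `opts` is configured elsewhere.
        local res, err = require("resty.openidc").authenticate(opts)
        if err then
            ngx.exit(ngx.HTTP_FORBIDDEN)
        end

        -- Pass the authenticated identity along to Normandy.
        ngx.req.set_header("Remote-User", res.id_token.email)
    }

    proxy_pass http://normandy_app;
}
```
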


The OIDC claim information identifies a user by email address, and that's what gets passed to Normandy in the Remote-User header. The RemoteUserBackend authenticates users by matching that header to the username field of Django User models. Normandy has very few users, and all of them are Mozilla employees, so we know they all have LDAP emails. We wrote a migration to copy our users' email addresses from User.email to User.username to accommodate this.

Here is a slightly abbreviated version of the migration:

from django.db import migrations


def email_to_username(apps, schema_editor):
    """Copy emails to usernames for all users."""
    User = apps.get_model('auth', 'User')
    for user in User.objects.all():
        user.username = user.email
        user.save()


def remove_email_from_username(apps, schema_editor):
    """Strip the domain from usernames, reversing email_to_username."""
    User = apps.get_model('auth', 'User')
    for user in User.objects.all():
        if '@' in user.username:
            user.username = user.username.split('@')[0]
            user.save()


class Migration(migrations.Migration):

    dependencies = [
        # (the app's previous migration goes here)
    ]

    operations = [
        migrations.RunPython(email_to_username, remove_email_from_username),
    ]


All or Nothing

One of the major challenges in this system is that the rules for whether a user needs to be authenticated are not necessarily very simple. Nginx can't really implement application-level logic to decide if a user needs to be authenticated or not.

Before this system, we would allow certain views to be accessed by both authenticated users and anonymous users. We then used Django's permission models to decide if a user was allowed to do what they were trying to do. For example, the Normandy recipe listing page would allow an anonymous user to see the list of recipes, and an authenticated user to create a new recipe if they were in the correct group.

This isn't something we could do with Nginx. We could protect certain parts of the site by URL, but it had to be all or nothing: either a user was authenticated on that portion of the site, or the authentication header would never be passed and all users would be anonymous. This turned out to be a minor annoyance for us, but I could imagine it being a huge problem for other sites.

We have two kinds of servers. One is read-only, and the other is read-write. The read-write version is only accessible over VPN, and only by Mozilla employees. It was easy to simply make the entire read-write server require authentication. Mixing authentication on one server would be challenging, because you'd have to carefully design your URL structure to separate authenticated and unauthenticated parts of the site.

Non-Browser Usage

The authentication flow outlined above relies heavily on having a web browser and a human around. We haven't figured out how to authenticate non-human users, such as shell scripts that use curl to automate requests to the API to make repetitive changes.

This is a minor use case that for now we've simply dropped. Some day in the future we may revisit it and try to figure out a better workflow for these kinds of changes.

Overall, the migration to using Auth0 has gone well, and we didn't have any major problems deploying it. We had to give up some control over authentication of users, but in exchange we have very easy user management and better security.

Using https:// for GitHub with Multi Factor Authentication

Posted on 2015-10-12 in github, config, and mfa

One way to authenticate with GitHub is using SSH. Your remote URLs look like git@github.com:user/repo.git. This means you get to re-use all your SSH tools to deal with authentication. Agents, key passphrases, and forwarding are all great tools.

The downside to all this is that you have to use all your SSH tools to deal with authentication. That's actually kind of annoying for automation and day-to-day sanity.

Another option to authenticate with GitHub is to use HTTPS URLs and HTTP basic authentication. For this, you use remote URLs like https://github.com/user/repo.git. The main advantage of this is that the URL is useful even without authentication. You can pull from this URL without authorizing (assuming you are working with public repos). This means automation and tooling need fewer secrets, which is great.

HTTPS URLs are also a great way to teach people, since they have a lower barrier of entry. Anyone can type in a user name and password. Explaining SSH so that a newbie can get started using GitHub is no fun.

Tricks to make HTTPS easier

Typing in usernames and passwords every time is really annoying though. There are also some other issues to work out. Here is what I've discovered to make things even better:

Saving passwords

Most desktop environments have a way to store secrets. I use Gnome Keyring.1 Git knows how to tie into secret-storage systems like this, but it needs some help first. In /usr/share/git/credentials there are several directories with tools to hook up Git to several secret-storage systems. For me there is gnome-keyring, netrc, osxkeychain, and wincred. To set them up, invoke make in the appropriate directory. If all goes well, the credential helper will be built, and git will start remembering passwords for you. Yay!

This assumes that you are using one of these secret storage mechanisms. If you aren't, I assume you're the kind of person that could either set one up, or read through git-credential-gnome-keyring (it's only 400 lines), and adapt it to use whatever secret store you want.

Hub defaults

Hub sets up SSH remotes by default. This makes sense, since it is preferred in the community, but it rubs me the wrong way. This can be easily remedied with

git config --global hub.protocol https

Multi Factor Authentication

The biggest problem with this scheme is dealing with MFA (aka 2FA, two-factor authentication). The HTTP basic authentication system used here means your URLs are actually transformed into something like https://username:password@github.com/user/repo.git. The username and password are included right in the URL.

There is no place in this scheme for an MFA code. So we cheat. GitHub supports using "personal access tokens" for authentication. These are long hexadecimal strings. They are easily revocable, and can be scoped to only certain permissions. Because of this, GitHub will treat them as username, password, and MFA code all in one.

To get one, follow these steps:

  1. Go to the Tokens page of your GitHub settings
  2. Click "Generate a new token" in the upper right
  3. Give it a description. I use "for git+https".
  4. Choose permissions to give it. You probably want repo and gist.2
  5. Click "Generate Token"
  6. Immediately use the token.

After you leave the page, GitHub will never show you that token again. You shouldn't need it either. You should put it in a secret store somewhere (like Gnome Keyring). Don't try and memorize it or write it down somewhere. These tokens are for robots.

For the purposes of this guide, paste it into the username field when you try to git push and get prompted to log in. If you have set up the credential helpers above, Git will remember the URL and never bug you again. Git will store a URL like https://<token>@github.com/user/repo.git.

Be careful when copy/pasting that token. The webpage tends to put extra spaces on either side of the token for me, so make sure to only get the hex characters.


The tokens from above are also great for automation. If you need to, say, automatically push to GitHub from Travis-CI, you can (in a secure way) give Travis-CI a remote URL that includes a personal access token. Then the automation can push, and you don't have to deal with SSH keys. You can also easily revoke the token later in case something goes wrong.
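
For instance, a deploy script might build the push URL from a token kept in a secure environment variable (the token and repository here are made up):

```shell
# A personal access token works as HTTP basic auth all by itself.
GH_TOKEN="deadbeefcafe0123"   # normally injected as a secure env var
REMOTE_URL="https://${GH_TOKEN}@github.com/user/repo.git"
echo "$REMOTE_URL"
# in the deploy job: git remote add deploy "$REMOTE_URL" && git push deploy master
```
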

Don't re-use these tokens. Use one for pushing via HTTPS on your laptop. Use a different one for each automation task you have. Use a different one on different computers. They are easy to generate, so go wild.

Also remember to occasionally clean out old unused tokens.

  1. Yes, yes, Gnome is sort of a dirty word to some people. But really, Gnome Keyring is quite lightweight. Its dependencies (on Arch) are gcr, and libcap-ng. It doesn't pull in the entire Gnome ecosystem (at least, it shouldn't).

    If you use KDE, I can't help you. 

  2. Did you know that Gists are actually backed by git? They all have URLs on the right side that you can use to clone them, edit offline and push back to. Much nicer than using the online editor for heavy tasks. 

Node.js static file build steps in Python Heroku apps

Posted on 2015-09-02 in heroku, python, and node

I write a lot of webapps. I like to use Python for the backend, but most frontend tools are written in Node.js. LESS gives me nicer style sheets, Babel lets me write next-generation JavaScript, and NPM helps manage dependencies nicely. As a result, most of my projects are polyglots that can be difficult to deploy.

Modern workflows have already figured this out: Run all the tools. Most READMEs I've written lately tend to look like this:

$ git clone <repository-url>
$ cd <repository>
$ pip install -r requirements.txt
$ npm install
$ gulp static-assets
$ python ./manage.py runserver

I like to deploy my projects using Heroku. They take care of the messy details about deployment, but they don't seem to support multi-language projects easily. There are Python and Node buildpacks, but no clear way of combining the two.

Multi Buildpack

GitHub is littered with attempts to fix this by building new buildpacks. The problem is they invariably fall out of compatibility with Heroku. I could probably fix them, but then I'd have to maintain them. I use Heroku to avoid maintaining infrastructure; custom buildpacks are one step forward, but two steps back.

Enter Multi Buildpack, which runs multiple buildpacks at once.

It is simple enough that it is unlikely to fall out of compatibility. Heroku has a fork of the project on their GitHub account, which implies that it will be maintained in the future.

To configure the buildpack, first tell Heroku you want to use it:

$ heroku buildpacks:set https://github.com/heroku/heroku-buildpack-multi

Next, add a .buildpacks file to your project that lists the buildpacks to run:
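
For a setup like mine it's just the two standard Heroku buildpack URLs, Node first so its tools are already installed by the time the Python buildpack (and any post_compile script) runs:

```
https://github.com/heroku/heroku-buildpack-nodejs
https://github.com/heroku/heroku-buildpack-python
```
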


Buildpacks are executed in the order they're listed in, allowing later buildpacks to use the tools and scripts installed by earlier buildpacks.

The Problem With Python

There's one problem: The Python buildpack moves files around, which makes it incompatible with the way the Node buildpack installs commands. This means that any asset compilation or minification done as a step of the Python buildpack that depends on Node will fail.

The Python buildpack automatically detects a Django project and runs ./manage.py collectstatic. But the Node environment isn't available, so this fails. No static files get built.

There is a solution: bin/post_compile! If present in your repository, this script will be run at the end of the build process. Because it runs outside of the Python buildpack, commands installed by the Node buildpack are available and will work correctly.

This trick works with any Python webapp, but let's use a Django project as an example. I often use Django Pipeline for static asset compilation. Assets are compiled using the command ./manage.py collectstatic, which, when properly configured, will call all the Node commands.


#!/usr/bin/env bash
export PATH=/app/.heroku/node/bin:$PATH
./manage.py collectstatic --noinput

Alternatively, you could call Node tools like Gulp or Webpack directly.

In the case of Django Pipeline, it is also useful to disable the Python buildpack from running collectstatic, since it will fail anyways. This is done using an environment variable:

$ heroku config:set DISABLE_COLLECTSTATIC=1

Okay, so there is a little hack here. We still had to append the Node binary folder to PATH. Pretend you didn't see that! Or don't, because you'll need to do it in your script too.

That's it

To recap, this approach:

  1. Only uses buildpacks available from Heroku
  2. Supports any sort of Python and/or Node build steps
  3. Doesn't require vendoring or pre-compiling any static assets


Changes to my key

Posted on 2015-06-29 in gpg and email

My gpg key changed today. Previously, there were 3 UIDs on the key:

1. Mike Cooper <>
2. Michael Cooper <>
3. <>

I have changed the UIDs on the key to:

1. Mike Cooper <>
2. Mike Cooper <>
3. <>

The email is my primary email now.

--Mike Cooper


An IRC Server in Rust, part 1

Posted on 2015-06-08 in rust, irc, code, and networking

Rust is pretty cool. I don't hate writing it, like I would hate writing C++. I still get the performance benefits of a low level language. Plus it gives me the chance to work in a modern strongly typed language, without having to wrap my head around Haskell.

Of course the best way to learn a new language is to go off and write something in it[citation needed]. A project I've been wanting to work on lately involves a custom made IRC server (I doubt any of the existing IRCds could do what I want out of the box). There will be another blog post with all the details about this, but for now I'd like to talk about my experience with building an IRC server in Rust.

How does IRC even work?

Before I can implement IRC in anything though, I need to know how it works. In theory, RFC 2812 should specify the line-level protocol, and should be enough to write a full client or server. However, for a few reasons, it isn't enough to just read the RFC.

  1. The RFC was written a long time ago, and isn't exactly a modern presentation of a protocol. It focuses on some things that I wouldn't expect.

  2. I'm lazy, and I'm just skimming the RFC and reading the parts that seem relevant. This makes it hard to get an over-arching picture of the protocol.

  3. No one actually implements the spec. For example, irssi doesn't send the right parameters for the USER command (according to this RFC). The RFC defines the parameters as the system username, the mode of the user (as a number), an unused parameter, and the user's real name. Irssi sends USER mythmon mythmon localhost :Unknown. You may notice that "mythmon" is not a valid mode, which should be a number like 0 or 8. Yay, irssi.

So just reading the RFC isn't going to work. I need better ways. I could do something like bust out Wireshark and analyze bytes on the wire and blah blah blah. There is an easier way though. IRC is a really simple protocol, and I can just type it out by hand, with a little discipline.

Enter nc. nc is netcat, and is approximately the simplest possible networking tool there is. We'll be using 2 of its modes, listening mode and client mode.

There are two major kinds of netcat, and they aren't really compatible. There is one that is a part of the GNU utils, and one that is part of the BSD utils. For most utilities, I prefer GNU utils, but in this case I think the BSD version is more useful. In the examples below, I'm using the version of nc from the Arch package openbsd-netcat.

First, let's see what a client does when it first connects to a server. To do that, I'll use nc -l 6667 to act as a server, and connect irssi to it. Here is what irssi said:

NICK mythmon
USER mythmon mythmon localhost :Unknown

I like Weechat more than irssi normally, but irssi is a bit simpler to just connect to random servers, so I'm using it here.

Hmm. I'm not sure what to do with that. Let's see what a real IRC server says when I say those things to it. I'll use nc chat.freenode.net 6667 as a client to do this. After firing it up I'll type what irssi sent to me. The lines that start with > are what I type, and the lines that start with < are what Freenode sends back.

I've cut off a lot of output here. Freenode is actually quite noisy, and sent back about 50 lines, which aren't interesting to the point I'm making here.

> NICK mythmon
> USER mythmon mythmon localhost :Unknown
< 001 mythmon :Welcome to the freenode Internet Relay Chat Network mythmon
< 375 mythmon :- Message of the Day -
< 372 mythmon :- Welcome to in Stockholm, SE.
< 372 mythmon :- Thanks to for sponsoring
< 376 mythmon :End of /MOTD command.
< :mythmon MODE mythmon :+i

A combination of reading this and reading the RFC teaches me that there are about 4 parts of an IRC message:

  1. The prefix is the first space-separated word, if it starts with a colon. It represents the source of the message, and is optional.

  2. The command is the first word after the prefix (or the first word if there is no prefix). Some commands are easy, like NICK (which changes a user's nick). Others are just numbers, so I'll have to consult the RFC for those. Though reading through the output, I bet 375 is "start of MOTD", 372 is "MOTD continues", and 376 is "end of MOTD".

  3. The parameters are every word between the command and the trail (or the end of the message). The number of these varies based on the command.

  4. The trail is a parameter that starts with a colon and continues to the end of the line. If it exists, it is the last parameter.
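
Those four parts can be teased apart with a few string splits. Here's a rough sketch in Python (not part of the Rust project, and it skips plenty of edge cases):

```python
def parse_irc_message(line):
    """Split a raw IRC line into (prefix, command, params).

    The trail, if present, becomes the last parameter.
    """
    prefix = None
    trail = None
    line = line.rstrip("\r\n")
    if line.startswith(":"):
        # Prefix: the first word, marked by a leading colon.
        prefix, _, line = line.partition(" ")
        prefix = prefix[1:]
    if " :" in line:
        # Trail: everything after the first " :".
        line, _, trail = line.partition(" :")
    parts = line.split()
    command, params = parts[0], parts[1:]
    if trail is not None:
        params.append(trail)
    return prefix, command, params
```

For example, parse_irc_message(":localhost 001 mythmon :Welcome!") gives ("localhost", "001", ["mythmon", "Welcome!"]).
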

I suspect the important part of this output was the 001 command, which I'm going to guess means something like "start of connection". So I'll start up a server again, connect irssi, and send a welcome message. This is nc -l 6667 again. Then I connected irssi, typed /j #test, sent a message in that channel, switched back to netcat and sent a response from an imaginary user, and finally typed /disconnect in irssi.

< NICK mythmon
< USER mythmon mythmon localhost :Unknown
> :localhost 001 mythmon :Welcome!
< MODE mythmon +i
< WHOIS mythmon
< PING localhost
> PONG :localhost
< JOIN #test
< PRIVMSG #test :hello?
> :localhost PRIVMSG #test :it works!
< QUIT :leaving

Cool. Now I can speak IRC. On a lark I loaded up nc in client mode again. I connected to Freenode, joined a channel, responded to ping, and sent a message. I can totally IRC from netcat. Wooh!

Now In Rust

I fumbled around for a while trying to use mio, because async IO is good, right? Well mio isn't really ready (the docs leave a lot to be desired) and is much lower level than I want to write for this project. So for now I'll stick with the easy blocking IO model.

At first I had a single threaded program that could listen on a port, accept a connection, and talk to one client at a time. That won't do: for an IRC server I'll need lots of incoming connections. I looked around for something like select to pull data from multiple TcpStreams at once. No luck. So that means one thread per client. Sad, but it will work.

Luckily, Rust makes threading really really easy, and surprisingly safe. Seriously, this is the best parallelism I've ever seen from a standard library. Nice job Rust.

The relevant bit of code looked something like this at one point:

let listener = TcpListener::bind("127.0.0.1:6667").unwrap_or_else(|e| { panic!(e) });

for stream in listener.incoming() {
    let stream = stream.unwrap();
    thread::spawn(move || {
        handle_client(stream).unwrap_or_else(|e| { panic!(e) });
    });
}
That handily spawns a new thread for every connection, passing the connection object between the threads in a thread safe way.

The .unwrap_or_else(|e| { panic!(e) }) bits are the best way I've found of dealing with errors that should terminate. Just calling .unwrap() will close the program, but leaves a lot to be desired in terms of debugging. It doesn't really tell you what happened, just that someone called panic inside unwrap of something. .unwrap_or_else(|e| { panic!(e) }) still panics, tearing down the thread, but it also puts the error in the right function so I can see what is failing, and sometimes e is even something useful.

Now handle_client can spin around with something like this, collecting lines from the client and doing things with them:

fn handle_client(mut stream: TcpStream) -> Result<()> {
    let reader = BufReader::new(try!(stream.try_clone()));
    for line in reader.lines() {
        let line = line.unwrap_or_else(|e| { panic!(e) });
        // do something with the line
    }
    Ok(())
}

Awesome. That wasn't too bad. BufReader is a nice wrapper around things that implement Read. It buffers them, providing the ability to do things like read until the next line, or get an iterator over the lines. Great.

I really like Rust's traits system. Being able to use the same tool (BufReader) on anything that implements Read (files, network streams, more things I haven't found yet) is really neat.

After this I wrote some IRC parsing stuff. I tried to use rusty-peg, a crate that lets me make a real parser from a PEG grammar, but it didn't work very well. Actually at all. I couldn't get it to work. So I wrote a not-very good not-really-a-parser that just splits on spaces and does some other stuff. It works on valid data. I haven't tested invalid data yet.

I just noticed that rusty-peg just released a version 0.4. I'll have to go try it out again, if anything interesting changed.

Then I discovered that I needed even more threads. About twice as many. I'll get back to that in another blog post. The spoiler version is that there is a select-like thing available, but only over channels. And TcpStreams aren't channels. So I had to make a TcpStream-to-channel converter, and that had to run in a separate thread. sigh.

Wrap Up

I learned that networking in Rust is an immature field, and that I'll have to do a lot more work than I would in something like Python or Node. Still, it isn't as bad as it would be in C or C++, and I'm getting better at Rust.

In the future, I'll write some blog posts about easily converting objects to and from strings, parsing IRC messages, and communicating between threads with channels.

systemd, tmux, and WeeChat

Posted on 2015-02-15 in arch, config, systemd, weechat, and tmux

Today, edunham posted a recipe for starting screen+irssi at boot using rc.local. That's pretty cool (and useful!), but it doesn't fit into my setup very well. I know that some time this week, my VPS provider will be doing maintenance and rebooting my shell server, so it seemed like a good time to set up a more automatic persistent IRC.

My setup is different in 3 key ways: I don't use screen, I don't use irssi, and I don't (want to) use rc.local. Instead I've got tmux, WeeChat, and systemd. I figure these three things are roughly equivalent, so I set off to try and apply the same idea to my setup.

Step 1: Systemd user sessions

I could run this as root and hard code my user into the init files, or be clever and try and make the user configurable, and somehow allow for multiple user sessions (I'm not the only one to use this box) and blah blah blah. This all sounds not quite right. This is a service just for me, right? It should run as me! Luckily, systemd has a user mode that works well for this. On my VPS, which is running an up-to-date Arch installation, this Just Worked™. Systemd had already helpfully created user sessions where needed. Awesome.

To see if this was true, I used this command:

$ pgrep -fau $(whoami) systemd
261 /usr/lib/systemd/systemd --user

Step 2: Making a service file

After some Googling, I figured out that I should "just" be able to start a tmux session with WeeChat from a systemd service file. This is straightforward once you know how to do it, but the docs are a bit hard to wade through.

In the end, I learned that I need a .service unit file in my systemd user configuration directory (~/.config/systemd/user/), which looks like this:


[Unit]
Description=Weechat IRC Client (in tmux)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/tmux -2 new-session -d -s irc /usr/bin/weechat
ExecStop=/usr/bin/tmux kill-session -t irc

[Install]
WantedBy=default.target

Some of this is self explanatory. The syntax is INI style, so the things in square brackets are section titles. I'm not going to explain Description, ExecStart, or ExecStop. I hope they are obvious.

The others are a bit more subtle. From top to bottom:

  • Type=oneshot: This tells systemd that this service file will run a command that will do something and then exit. This fits because tmux daemonizes: the WeeChat process ends up "owned" by the tmux server, outside the systemd-controlled cgroup. At least, as I understand it.
  • RemainAfterExit=yes: This tells systemd that the above is sort of a lie. In particular, it indicates that the service should be kept in the "running" state even though systemd can't tell that something is running. This lets us stop the service nicely using the ExecStop definition.
  • WantedBy=default.target: This tells systemd what to do when we "enable" this unit. In systemd parlance, enable usually means "start at boot", but it doesn't have to. In this case, it means start at boot.

The ExecStart and ExecStop lines start and stop a decently customized tmux session to run WeeChat in, respectively.

Step 3: Headdesk

The above unit file took a while to figure out. For one, the -d flag to tmux isn't entirely obvious. That tells tmux to start the session detached, which is important since systemd won't give it a TTY to put anything on.

The Type=oneshot and RemainAfterExit=yes was particularly hard to find. I eventually found someone else's systemd user unit files that started a program in tmux, and they used it. Yay for Googling.

The final piece in the puzzle was the most frustrating. I started the service and systemd claimed it was running, and pgrep agreed. But tmux refused to connect with the error "no sessions".

I'm not sure why this is the case, but this post on a systemd mailing list led me to try running $ sudo loginctl enable-linger $(whoami). That worked! Now when I start the session, it creates a session I can actually attach to. Win!

All together now

After putting all the above in place, I can now do:

# Enable the service to start at boot time.
$ systemctl --user enable weechat.service
Created symlink from /home/mythmon/.config/systemd/user/
    to /home/mythmon/.config/systemd/user/weechat.service.
# Start the service.
$ systemctl --user start weechat.service
# Attach to the new tmux session.
$ tmux attach -t irc

I was presented with a brand new instance of WeeChat. Success!

I still have some automation to do within WeeChat, like connecting to some servers that require passwords and joining all the right channels, but this seemed like the harder part to me, so I'm glad to have it out of the way.

Pulse Audio, Spotify, and flat volume

Posted on 2015-02-11 in arch, config, spotify, and pulseaudio

Spotify is a pretty cool music service, and I really enjoy Pulse Audio, especially when combined with tools like pasystray and pavucontrol. Together they have a bit of an annoying "feature" though.

Spotify is smart enough to link its internal volume meter with Pulse's stream volume for Spotify. However, for some reason, Spotify also links the stream's volume to the master volume for the sound device Spotify is on. Blech! That means that when you grab the volume slider in Spotify, everything on the system gets louder.

This is, I am told, a "feature" of Pulse Audio called "flat volumes", and it is supposed to do better things. In practice though, I find it doesn't work and I'd rather just control volume myself. The fix for this, luckily, is simple.

To turn this off, in Arch Linux at least, edit /etc/pulseaudio/daemon.conf, which has a line like

; flat-volumes = yes

The ; is a comment marker in this file. Uncomment that line and change the value to no. Now there is no silly linking of volume meters, and I can control volumes independently.

This has been bugging me for a while, but not enough to do anything. The straw that broke the camel's back is my new USB sound card. For reasons I don't understand, the internal sound card on my ThinkPad dock doesn't work in Linux. This is pretty easy to work around: get a USB sound card, plug that into the dock, and plug my headphones into that.

This works nicely, except the particular USB sound card I got is apparently a bit wonky. If I grab the volume meter in Pulse Audio and lower it below about 36%, the sound cuts off. Apparently that is as quiet as it will go.

Unfortunately, having Spotify and the sound card both set to 36% volume is too loud. The solution is to slide Spotify down to around 20%, which sounds good. Except sliding Spotify down with flat-volumes = yes also lowers the sound card's master volume, making it cut to mute.

The fix above lets me set Spotify's volume independently, so now I can get a comfortable listening volume on my headphones through my dock. Yay!

TiVo Slide in Arch Linux

Posted on 2015-01-07 in arch and config

I've got a media computer in my living room hooked up to a TV. It runs Kodi (formerly XBMC). It's pretty nice, except it's hard to find a good controller for it. Wireless keyboards work, but are awfully awkward on the couch.

After some searching, I decided my weapon of choice would be a TiVo Slide remote. It's (technically) Bluetooth, so no pesky line of sight issues, has a secondary IR mode for controlling other devices (TV volume and power), has a slide out keyboard, and all of this in something roughly the size of your average TV remote.

Trouble is, this is designed to work with TiVos. It can work on normal computers, but it's a bit of a pain. Here I'm going to write down what I learned getting it to work on my Arch Linux media computer.

I'm using an older version of the remote. I'm not sure if these instructions would work with the "TiVo Slide Pro Remote" that you can buy today. The main physical difference is that the Pro puts a circle of buttons in the middle of the slide-out keyboard, whereas mine puts them on the left side of the slide-out keyboard.

A lot of this information has been adapted from this page on the Kodi wiki.

Step 1: Connecting it

Earlier I said that the Slide was technically Bluetooth. That's because although you can put it in pairing mode and try to connect to it directly, it's going to be a losing battle. The included dongle is also technically Bluetooth, but trying to use it as such has only ended in sadness for me.

Fortunately, the dongle has a fallback mode that makes it act as a plain ordinary HID device. In this mode, it handles the Bluetooth pairing and connection with the remote all on its own, and it even comes pre-paired out of the box. This is a much nicer way of using the device. Unfortunately, the only way I can find to trigger this fallback mode is to totally disable Bluetooth on the host. (Lucky for me, I have no need of Bluetooth on the box in question.)

So, the first order of business is a modprobe blacklist (a file in /etc/modprobe.d/):


blacklist btusb
blacklist bluetooth

You'll either need to reboot or rmmod these modules if they are loaded. If you plan on rebooting, save it for the end; that will be easiest.

After this plugging in the Slide's dongle will make it show up as a plain keyboard. Testing with xev reveals that a bunch of the buttons work, generating the right key codes. Good keys to test are any alphabet key on the slide out section, and the arrow keys. You may notice however that many keys don't work, like Select, or the A B C D buttons. This is because these buttons generate scan codes that are outside the normal range that X11 understands.

Step 2: Remapping weird scancodes

To bring the scancodes the Slide generates down into the range X11 understands, we'll use udev's hwdb. This is the tool that remaps scan codes to keys for lots of weird keyboards. It compiles its info from files in various places, including /etc/udev/hwdb.d/*.hwdb. That's where we are going to put things to configure it:


# Tivo Slide
 KEYBOARD_KEY_000C0041=enter     # select
 KEYBOARD_KEY_000C006C=f2        # A (Yellow)
 KEYBOARD_KEY_000C006B=f3        # B (Blue)
 KEYBOARD_KEY_000C0069=f4        # C (Red)
 KEYBOARD_KEY_000C006A=f5        # D (Green)
 KEYBOARD_KEY_000C006D=f6        # Zoom
 KEYBOARD_KEY_000C0082=f7        # Input
 KEYBOARD_KEY_000C0083=f8        # Enter
 KEYBOARD_KEY_000C008D=f9        # Guide
 KEYBOARD_KEY_000C009C=f10       # Chup
 KEYBOARD_KEY_000C009D=f11       # Chdn
 KEYBOARD_KEY_000C00B1=playpause # Pause
 KEYBOARD_KEY_000C00B2=record    # Record
 KEYBOARD_KEY_000C00F5=stop      # Slow

I'm not sure if the spaces before the KEYBOARD_KEY lines are needed. All the other hwdb files had them, so I kept them. Feel free to change the result keys (the ones to the right of the equals sign) to better match what you want to use the remote for.

After this file is in place, you need to regenerate hwdb.bin, which is the file udev actually reads. To do that, run udevadm hwdb --update. Then to reload these rules, run udevadm trigger. Now the keys listed above should work when tested in xev. Yay!

Step 3: Configuring Kodi

This part is a normal keymap configuration for Kodi. Anyone who has mapped keys in Kodi before should be familiar with this. In your .xbmc directory there should be a userdata/keymaps/ directory. Put this file there:


<!-- Tivo Slide -->
<keymap>
  <global>
    <keyboard>
      <button id="f200">ActivateWindow(home)</button> <!-- Tivo key -->
      <f1>Select</f1> <!-- Select key -->
      <f2></f2> <!-- A / Yellow -->
      <f3></f3> <!-- B / Blue -->
      <f4>ContextMenu</f4> <!-- C / Red -->
      <f5></f5> <!-- D / Green -->
      <f6>AspectRatio</f6> <!-- Zoom key -->
      <f8>Select</f8> <!-- Lower enter key -->
      <f9>FullScreen</f9> <!-- Guide Key -->
      <f10>PageUp</f10> <!-- channel up key -->
      <f11>PageDown</f11> <!-- channel down key -->
      <f12>Info</f12> <!-- Guide key -->
      <prev_track>Back</prev_track> <!-- "instant replay" key -->
      <home></home> <!-- Live TV key -->
      <delete>System.LogOff</delete> <!-- Clear key -->
    </keyboard>
  </global>
  <FullscreenVideo>
    <keyboard>
      <prev_track>SmallStepBack</prev_track> <!-- "instant replay" key -->
      <next_track>StepForward</next_track> <!-- advance key -->
      <f2>ActivateWindow(osdaudiosettings)</f2> <!-- A (Yellow) -->
      <f3>ActivateWindow(videobookmarks)</f3> <!-- B (Blue) -->
      <f4>ActivateWindow(SubtitleSearch)</f4> <!-- C (Red) -->
      <f5>ActivateWindow(osdvideosettings)</f5> <!-- D (Green) -->
    </keyboard>
  </FullscreenVideo>
  <MyVideoLibrary>
    <keyboard>
      <f5>ToggleWatched</f5> <!-- D / Green -->
    </keyboard>
  </MyVideoLibrary>
</keymap>

This is what I use at home, but I'm still tweaking it. Feel free to customize it however you like. You won't break anything.

For more information about keymaps in Kodi, you can see the keymap wiki page.

Github Pages, Travis, and Static Sites

Posted on 2014-09-01 in
  • github
  • ,
  • travis
  • ,
  • static
  • , and
  • meta

I recently switched my blog to being hosted on GitHub Pages instead of hosting the static site myself. Along with this change, I was able to automate the rendering and updating of the site, thanks to GitHub webhooks, and Travis-CI. As always, I'm using wok for the rendering and management of the site.

My work flow now looks like this:

  1. Write a post, edit something, change a template, etc.
  2. git commit.
  3. git push.
  4. Wait for the robots to do my bidding.

It is ideal.


For this to work, there are a few things that are needed. First, and most fundamental, the site needs to be fully static. A pile of HTML, CSS, JS, images, etc. Nothing server side at all. Otherwise it can't be hosted on GitHub Pages.

Next, the site has to be stored on GitHub, and it can't be the account's main GitHub Pages repository (the one named after the account). This is because GitHub Pages treats the branches in that repository differently. It should be possible to set up this workflow on a repository like that, but I won't go into it here.

The master branch will be where the source of the site is, the parts a human edits. The gh-pages branch will hold the rendered output, and be generated automatically.

Finally, the site needs to be easy to render on Travis. This usually means that all the tools are easy to install with pip or npm or another package manager, and the process of rendering the output from a checkout of the site can be scripted. Any wok sites should fit these requirements.

Part 1 - Automation

Before I can ask the robots to do my bidding, I have to automate the process they are going to be doing. Two commands are needed, one to build the site, and one to commit the new version and push it to GitHub.

My site uses wok, which is a Python library. Because of this, I wanted a Python task runner to automate the build process. It may have been overkill, but I used Invoke. Here is my invoke script, with explanation interspersed.

If you're unfamiliar with Invoke, it gives a nice way to define tasks, run shell commands, and run Python code. Make, shell scripts, Gulp, or any other task runner would work just as well.

Here's the code. First, some imports.

import os
from contextlib import contextmanager
from datetime import datetime

from invoke import task, run

These two constants are used to clone and push the repository. GH_REF is the repository's remote URL, without any protocol, and GH_TOKEN will be a GitHub authorization token from the environment. More on this in a bit.

GH_REF = ''
GH_TOKEN = os.environ.get('GH_TOKEN')

This is a simple context manager that lets me change into a directory, run some commands, then safely change out of it. I'm likely reinventing the wheel here.

@contextmanager
def cd(newdir):
    print 'Entering {}/'.format(newdir)
    prevdir = os.getcwd()
    os.chdir(newdir)
    try:
        yield
    finally:
        os.chdir(prevdir)
        print 'Leaving {}/'.format(newdir)

Here is the first of the three tasks defined. This one makes sure that the output directory is in the right state. It should work whether the directory already exists, isn't a git repo yet, has stray files lying around, or is on the wrong commit or branch. This is generally useful outside of Travis as well.

@task
def make_output():
    if os.path.isdir('output/.git'):
        with cd('output'):
            run('git reset --hard')
            run('git clean -fxd')
            run('git checkout gh-pages')
            run('git fetch origin')
            run('git reset --hard origin/gh-pages')
    else:
        run('rm -rf output')
        run('git clone https://{} output'.format(GH_REF))

This next task simply renders the site. It sets up the output directory by calling the above task, and then triggers wok, the site renderer. Nice and simple.

@task
def build():
    make_output()
    run('wok')
This last task is the bit that actually publishes to GitHub in a safe, secure, and automated way.

@task
def publish():
    if not GH_TOKEN:
        raise Exception("Probably can't push because GH_TOKEN is blank.")

    with cd('output'):

The first thing it does is configure a user for git. GitHub won't accept pushes without user information, so I put some fake information here.

        run('git config ""')
        run('git config "Travis Build"')

Next it git adds all the files in the output directory. The --all flag will deal with new files being added, old files being changed, and old files being deleted. It won't commit anything in the .gitignore, if you have one.

        run('git add --all .')

Now, make the commit. I thought for a while about what to put in the git commit message. At first I was going to put a timestamp, but I realized that git will do that for me already. Future improvements might note what commits this version of the site was built from.

        run('git commit -am "Travis Build"')

Finally, the script needs to push the resulting commit up to the gh-pages branch of GitHub, so it will be served. The first problem I faced was how to authenticate with GitHub to do this. The second was how to do that without revealing any secrets.

The solution to the first problem was GitHub token auth. By using the HTTPS protocol, and putting the token in the authentication section of the URL, I can push to any GitHub repo that token has access to.

The problem with this is that git prints out the remote when you push. Since my token is in the URL, which is the remote name in this case, it was printing secrets out in Travis logs! The solution is to hide git's output. It seems obvious in retrospect, but I revealed two tokens this way in Travis logs (they were immediately revoked, of course).

        # Hide output, since there will be secrets.
        cmd = 'git push https://{}@{} gh-pages:gh-pages'
        run(cmd.format(GH_TOKEN, GH_REF), hide='both')

To run these tasks, I use invoke build and invoke publish, to build and publish the site, respectively.

Part 2 - Travis

As you can tell, the bulk of the work is in the automation. A lot of thought went into the 60 or so lines of code above. Now that it is automated, it is easy to make the robots do the rest. I chose Travis for my automation.

I went to the Travis site, set up the repository for the site, and tweaked a few settings. In particular, I turned off the "Build pull requests" option, because it isn't useful to me. There isn't any risk of revealing secrets in PRs, because Travis doesn't decrypt the secrets in PR builds. The other setting I tweaked is to turn on "Build only if .travis.yml is present". Since I was doing all this work on a branch, I didn't want my master branch to be making builds happen, and I think this is a generally good setting to set on Travis.

So that Travis knows what tools it needs, I added a requirements.txt file to my repository, which Travis understands how to use, if you set the language to Python. Then I added a .travis.yml to tell Travis how to build my site.

I've wrapped the secure token here for line length. And yes, it is fine to publish. It is encrypted, as I'll explain below.

language: python
python:
- '2.7'
script:
- invoke build
after_success:
- invoke publish
env:
  global:
  - secure: LXYt0XENsCV58GD2g2jB27Hil9O80DXdnyM6palKLNcYa7z/hqvqtkCwW9Wmj5jqLXj

Basic, simple stuff. The script command (invoke build) builds the site, using the big script above. Similarly, the after_success command (invoke publish) uploads the site to GitHub (but only if the site actually builds).

Woah, hold on. What's all this junk at the end? That "junk" is the magic to safely pass secrets to Travis builds in a public repository. It is a line that looks like GH_TOKEN=abcdef1234567890, encrypted using a public key for which Travis holds the private key. In "safe" builds (builds on my repo that are not from PRs) Travis will decrypt that token and provide it to the build. The invoke script then picks up the environment variable and uses it when it pushes to GitHub. Pretty slick.

To generate this encrypted line, I used the Travis CLI tool like this:

$ travis encrypt --add
Reading from stdin, press Ctrl+D when done

That is, I ran the command, typed the name of the environment variable, followed by an equals sign, and then the value; then I pressed enter, and then Ctrl+D. This is a normal interactive read from stdin. After doing that, my .travis.yml file contained the encrypted string, and I was ready to commit it.

I got the value for that environment variable from GitHub's personal API token generator.

Part 3 - DNS

I could be done now. At this point, when I push a new version of my site to GitHub, it fires a webhook, Travis builds my site, pushes it back to GitHub, and then GitHub serves it with GitHub Pages.

This didn't work for me for a couple reasons. First, I like my URL, and didn't really want to change it. Second, my site assumes it is at the root of the server, and can't deal with GitHub's insistence on serving project sites from a subpath of the user's domain. The content of this site is there, but it's all unstyled, because of broken links to the CSS, and none of the links work. Maybe someday I'll fix this.

So I have to do some DNS tricks and tell GH pages to expect another domain name for my site.

Telling GitHub

So that GitHub knows what site to serve when someone visits, I had to add a CNAME file to the gh-pages branch of the site. Luckily with wok that was pretty easy. I made the file media/CNAME, which wok put in the root of the gh-pages branch, and gave it the contents It takes a few minutes for GitHub to recognize this change, but after that it works well.

Setting up DNS

You may have noticed that I say there, instead of the nicer bare I would prefer the latter, but it isn't to be with GitHub Pages.

The recommended way to use custom DNS with GitHub Pages is to make whatever domain name that should serve the site use a CNAME to So for me, I have IN CNAME in a Bind config. The problem is, according to RFC 1912, "A CNAME record is not allowed to coexist with any other data." Since the root of a domain has to have some other records (NS, SOA, possibly MX or TXT), you can't have a CNAME to GitHub at the root. :(


The problem with this is that I have used to reference my blog in the past, and cool URIs don't change. So I needed to find a way to make the old address work.

First I tried giving an A record pointing at the IPs of the GitHub Pages servers. This isn't recommended, but it does work if that is the primary DNS name of the site. However, since I wrote down in the CNAME file above (with the recommended DNS CNAME setup), this didn't work. It gave "No such domain" errors. Bummer.

The solution I ended up going with is less nice. I pointed the root record at the server I used to host the site on, which is still running Nginx. I put this in my Nginx config:

server {
    listen       80;
    return 301 $scheme://$request_uri;
}
This causes Nginx to serve permanent redirects to the correct url, preserving any path information. Not the best experience, but it works.

That's it.

Now the site works. It gets served from a fast CDN, I don't have to worry about re-rendering the site, and I get to make blog posts with git. The robots do the tedious work for me. It is ideal.

If you have any comments or questions, I'm @mythmon on Twitter.

Tracking Deploys in Git Log

Posted on 2013-11-18 in
  • tools
  • ,
  • sumo
  • , and
  • git

Knowing what is going on with git across many environments can be hard. In particular, it can be hard to know where the server environments are in the git history, and how the rest of the world relates to that. I've set up a couple of interlocking gears of tooling that help me know what's going on.


One thing that I love about GitHub is its network view, which gives a nice high-level overview of branches and forks in a project. One thing I don't like about it is that it only shows what is on GitHub, and is a bit light on details. So I did some hunting, and I found a set of git flags that does a pretty good job of replicating GitHub's network view.

$ git log --graph --all --decorate

I have this aliased to git net. Let's break it down:

  • git log - This shows the history of commits.
  • --graph - This adds lines between commits showing merging, branching, and all the rest of the non-linearity git allows in history.
  • --all - This shows all refs in your repo, instead of only your current branch.
  • --decorate - This shows the name of each ref next to its commit, like "origin/master" or "upstream/master".

This isn't that novel, but it is really nice. I often get asked what tool I'm using for this when I pull this up where other people can see it.
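If you want the same shorthand, one way to set it up (the alias name is of course arbitrary) is with git config:

```shell
# Define "git net" as an alias for the long invocation above.
git config --global alias.net "log --graph --all --decorate"

# Confirm it took.
git config --global --get alias.net   # prints: log --graph --all --decorate
```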

Cron Jobs

Having all the extra detail in my view of git's history is nice, but it doesn't help if I can only see what is on my laptop. I generally know what I've committed (on a good day), so the real goal here is to see what is in all of my remotes.

In practice, I only have this done for my main day-job project, so the update script is specific to that project. It could be expanded to all my git repos, but I haven't done that. To pull this off, I have this line in my crontab:

*/10 * * * * python2 /home/mythmon/src/kitsune/scripts/

I'll get to the details of this script in the next section, but the important part is that it runs git fetch --all for the repo in question. To run this from a cron job, I had to switch all my remotes to the https protocol for git instead of ssh, since my SSH keys aren't unlocked. Git knows the passwords to my http remotes thanks to its gnome-keychain integration, so this all works without user interaction.

This has the result of keeping git up to date on what refs exist in the world. I have my teammate's repos as remotes, as well as our central master. This makes it easier for me to see what is going on in the world.

Deployment Refs

The last bit of information I wanted to see in my local network is the state of deployment on our servers. We have three environments that run our code, and knowing what I'm about to deploy is really useful. If you look in the screenshot above, you'll notice a couple refs that are likely unfamiliar: deployed/stage and deployed/prod, in green. This is the second part of the script I mentioned above.

As a part of the SUMO deploy process, we put a file on the server that contains the current git sha. This script reads those files, and makes local references in my git repo that correspond to them.

Wait, create git refs from thin air? Yeah. This is a cool trick my friend Jordan Evans taught me about git. Since git's references are just files on the file system, you can make new ones easily. For example, in any git repo, the file .git/refs/heads/master contains a commit sha, which is how git knows where your master branch is. You could make new refs by creating and overwriting these files manually, but that's a little messy. Instead we should use git's tools to do this.

Git provides git update-ref to manipulate refs. For example, to make my deployment refs, I run something like git update-ref refs/heads/deployed/prod 895e1e5ae. The last argument can be any sort of commit reference, including HEAD or branch names. If the ref doesn't exist, it will be created, and if you want to delete a ref, you can add -d. Cool stuff.
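As a quick sketch, here is the whole trick in a throwaway repository (the repo name and user identity here are just for illustration):

```shell
# Make a scratch repo with one empty commit.
git init -q refs-demo
git -C refs-demo -c -c demo \
    commit -q --allow-empty -m 'initial'

# Create a deployment ref by hand, read it back, then delete it.
git -C refs-demo update-ref refs/heads/deployed/prod HEAD
git -C refs-demo rev-parse --verify deployed/prod
git -C refs-demo update-ref -d refs/heads/deployed/prod
```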

All Together Now

Now, finally, the entire script. Here I am using a git helper that I wrote, which I have omitted for space. It works how you would expect, translating git.log('some-branch', all=True) to git log --all some-branch. I made a gist of it for the curious.

The basic strategy is to git fetch all remotes, then add/update the refs for the various server environments using git update-ref. This runs on a cron every few minutes, and makes knowing what is going on a little easier, and git in a distributed team a little nicer.

#!/usr/bin/env python

import os
import re
import subprocess

import requests

repo_dir = "{HOME}/src/kitsune".format(**os.environ)
environments = {
    'dev': '',
    'stage': '',
    'prod': '',
}
def main():
    cdpath = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
    os.chdir(cdpath)

    git = Git()
    git.fetch(all=True)

    for env_name, revision_url in environments.items():
        try:
            cur_rev = git.rev_parse('deployed/' + env_name).strip()
        except subprocess.CalledProcessError:
            cur_rev = None
        new_rev = requests.get(revision_url).text.strip()

        if cur_rev != new_rev:
            print 'updating', env_name, cur_rev[:8] if cur_rev else '(new)', new_rev[:8]
            git.update_ref('refs/heads/deployed/' + env_name, new_rev)


if __name__ == '__main__':
    main()
That's It

The general idea is really easy:

  1. Fetch remotes often.
  2. Write down deployment shas.
  3. Actually look at it all.

The fact that it requires a little bit of cleverness, and a bit of git magic along the way, means it took some time to figure out. I think it was well worth it though.

Localized search on SUMO

Posted on 2013-08-07 in
  • projects
  • ,
  • sumo
  • ,
  • mozilla
  • , and
  • elasticsearch

My primary project at work is SUMO, the Firefox support site. It consists of a few parts, including a wiki, a question/answer forum, and a customized Twitter client for helping people with Firefox. It is also a highly localized site, with support in the code for over 80 languages. We don't have a community in all of those languages, but should one emerge, we are ready to embrace it. In other words, we take localization seriously.

Until recently, however, this embrace of multilingual coding didn't extend to our search engine. Our search engine (based on ElasticSearch) assumed that all of the wiki documents, question and answers, and forum posts were in English, and applied English based tricks to improve search. No more! On Monday, I flipped the switch to enable locale-specific analyzer and query support in the search engine, and now many languages have improved search. Here, I will explain just what happened, and how we did it.


Currently, we use two major tricks to improve search: stemming and stop words. These help the search engine behave in a way that is more consistent with how we understand language, generally.


Stemming

Stemming is recognizing that words like "jump", "jumping", "jumped", and "jumper" are all related. They all stem from the common word "jump". In our search engine, this is done by enabling the ElasticSearch Snowball analyzer, which uses the Porter stemming algorithm.

Unfortunately, Porter is English specific, because it stems algorithmically based on patterns in English, such as removing trailing "ing", "ed" or "er". The algorithm is much more complicated, but the point is, it really only works for English.

Stop Words

Stop words are words like "a", "the", "I", or "we" that generally carry little information in regards to a search engine's behavior. ES includes a list of these words, and removes them intelligently from search queries.
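To make these two tricks concrete, here is a toy sketch in Python. This is not the Porter/Snowball algorithm (which is far more sophisticated), just a crude illustration of how stop words and suffix stripping turn text into tokens:

```python
# Toy English "analyzer": drop stop words, then crudely strip common
# suffixes. Real stemmers handle far more cases than this.
STOP_WORDS = {"a", "the", "i", "we", "to"}
SUFFIXES = ("ing", "ed", "er", "es", "e", "s")

def naive_analyze(text):
    tokens = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue  # stop words carry little information
        for suffix in SUFFIXES:
            # Only strip when a reasonable stem would remain.
            if word.endswith(suffix) and len(word) - len(suffix) > 2:
                word = word[:-len(suffix)]
                break
        tokens.append(word)
    return tokens

naive_analyze("jumping jumped jumper")  # -> ['jump', 'jump', 'jump']
```

With this sketch, "the dog jumped" comes out as ['dog', 'jump'], which is roughly what a real analyzer produces.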


Analyzers

ES is actually a very powerful system that can be used for many different kinds of search tasks (as well as other data slicing and dicing). One of the more interesting features that makes it more than just full text search is its analyzers. There are many built-in analyzers, and there are ways to recombine analyzers and parts of analyzers to build custom behavior. If you really need something special, you could even write a plugin to add a new behavior, but that requires writing Java, so let's not go there.

The goal of analysis is to take a stream of characters and create a stream of tokens out of them. Stemming and stop words are things that play into this process. These modifications to analysis actually change the document that gets inserted into the ES index, so we will have to take that into account later. If we insert a document containing "the dog jumped" into the index, it would get indexed as something like

[
    {"token": "dog", "start": 4, "end": 7},
    {"token": "jump", "start": 8, "end": 14}
]

This isn't really what ES would return, but it is close enough. Note how the tokens inserted are the post-analysis versions, which include the changes made by the stop words and stemming token filters. That means the analysis process is language specific, so we need to change the analyzer depending on the language. Easy, right? Actually, yes. This consists of a few parts.

Choosing a language

SUMO is a Django app, so in, we define a map of languages to ES analyzers, like this (except with a lot more languages):

ES_LOCALE_ANALYZERS = {
    'en-US': 'snowball',
    'es': 'snowball-spanish',
}

Note: snowball-spanish is simply the normal Snowball analyzer with an option of {"language": "Spanish"}.

Then we use this helper function to pick the right language based on a locale, with a fallback. This also takes into account the possibility that some ES analyzers are located in plugins which may not be available.

def es_analyzer_for_locale(locale, fallback="standard"):
    """Pick an appropriate analyzer for a given locale.

    If no analyzer is defined for `locale`, return fallback instead,
    which defaults to the ES analyzer named "standard".
    """
    analyzer = settings.ES_LOCALE_ANALYZERS.get(locale, fallback)

    if (not settings.ES_USE_PLUGINS and
            analyzer in settings.ES_PLUGIN_ANALYZERS):
        analyzer = fallback

    return analyzer


Next, the mapping needs to be modified. Prior to this change, we explicitly listed the analyzer for all analyzed fields, such as the document content or document title. Now, we leave off the analyzer, which causes it to use the default analyzer.

Finally, we can set the default analyzer on a per document basis, by setting the _analyzer field when indexing it into ES. This ends up looking something like this (this isn't the real code, because the real code is much longer for uninteresting reasons):

def extract_document(obj):
    return {
        'document_title': obj.title,
        'document_content': obj.content,
        'locale': obj.locale,
        '_analyzer': es_analyzer_for_locale(obj.locale),
    }


Querying

This is all well and good, but what use is an index of documents if you can't query it correctly? Let's consider an example. If there is a wiki document with a title "Deleting Cookies", and a user searches for "how to delete cookies", here is what happens:

First, the document would have been indexed and analyzed, producing this:

[
    {"token": "delet", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

So now, if we try to query "how to delete cookies", nothing will match! That is because we need to analyze the search query as well (ES does this by default). Analyzing the search query results in:

[
    {"token": "how", "start": 0, "end": 3},
    {"token": "delet", "start": 7, "end": 13},
    {"token": "cooki", "start": 14, "end": 21}
]

Excellent! This will match the document's title pretty well. Remember that ElasticSearch doesn't enforce that 100% of the query matches. It simply finds the best match available, which can be confusing in edge cases, but in the normal case it works out quite well.

There is an issue though. Let's try this example in Spanish. Here is the document title "Borrando Cookies", as analyzed by our analysis process from above.

[
    {"token": "borr", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

and the search "como borrar las cookies":

[
    {"token": "como", "start": 0, "end": 4},
    {"token": "borrar", "start": 6, "end": 11},
    {"token": "las", "start": 12, "end": 15},
    {"token": "cooki", "start": 16, "end": 23}
]

... Not so good. In particular, 'borrar', which is another form of the verb 'Borrando' in the title, got analyzed as English, and so didn't get stemmed correctly. It won't match the token borr that was generated in the analysis of the document. So clearly, searches need to be analyzed in the same way as documents.

Luckily in SUMO we know what language the user (probably) wants, because the interface language will match. So if the user has a Spanish interface, we assume that the search is written in Spanish.

The original query that we use to do searches looks something like this much abbreviated sample:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies"
            }
        }
    }
}
The new query includes an analyzer field on the text match:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies",
                "analyzer": "snowball-spanish"
            }
        }
    }
}
This will result in the correct analyzer being used at search time.


This took me about three weeks, off and on, to develop, plus some chats with ES developers on the subject. Most of that time was spent researching and thinking about the best way to do localized search. Alternatives include having lots of fields, like document_title_es, document_title_de, etc., which seems icky to me, or using multifields to achieve a similar result. Another proposed idea was to use a different ES index for each language. Ultimately I decided on the approach outlined above.

For the implementation, modifying the indexing method to insert the right data into ES was the easy part, and I knocked it out in an afternoon. The difficult part was modifying our search code, working with the library we use to interact with ES to get it to support search analyzers, testing everything, and debugging the code that broke when this change was made. Overall, I think that the task was easier than we had expected when we wrote it down in our quarterly goals, and I think it went well.

For more nitty-gritty details, you can check out the two commits to the mozilla/kitsune repo that I made these changes in: 1212c97 and 0040e6b.

The Crimson Twins

Posted on 2013-06-13 in
  • projects
  • ,
  • crimson twins
  • , and
  • mozilla

Crimson Twins is a project that I started at Mozilla to power two ambient displays that we mounted on the wall in the Mountain View Mozilla office. We use it to show dashboards, such as the current status of the sprint or the state of GitHub. We also use it to share useful sites, such as posting a YouTube video that is related to the current conversation. Most of the time that I pay attention to the Twins is when my coworkers and I post amusing animated GIFs to distract, er, amuse the rest of the office.

Technical Details

Crimson Twins is a Node.js app that is currently running on Mozilla's internal PaaS, an instance of Stackato. It uses for real-time, two-way communication between the server and the client. uses WebSockets, long polling, Flash sockets, or a myriad of other techniques, depending on what the browser supports.

The architecture of the system is that there are a small number of virtual screens, which represent targets to send content to. Each client connected to the server can choose one or more of these virtual screens to display. A client can be any modern web browser, and I have used Firefox for Desktop and Android, Safari on iOS, and Chrome on desktop without any trouble.

Because of this setup, remote Mozillians can connect to the server and load up the same things that are shown on the TVs in the office. Put another way, Crimson Twins is remotie-friendly, and people can play along at home.

Content Handling

Content is displayed with one of two mechanisms. For images, the content is loaded as the background of a div. Originally img tags were used, but they were difficult to style; the switch to divs made it much easier to zoom the image to full screen without using JavaScript.

For content that is not images, a sandboxed iframe is used. This allows most sites to be shown with Crimson Twins, and the sandboxing prevents malicious sites from hijacking the Crimson Twins container[1]. This means that sites that disallow framing cannot be used with the system, but after much brainstorming we have yet to find a satisfactory way to get around this. Luckily most sites don't worry about iframes, so this isn't normally a huge annoyance.

For every URL that is sent to the screens, the server first makes a HEAD request to the requested resource. A few things happen with this information. First, it is used to determine whether the URL is an image, by examining the content type. Second, the server examines the headers for problems, such as X-Frame-Options: DENY, server errors, or malformed URLs, and provides a useful error message if something like that happens.
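
The dispatch that falls out of that HEAD request can be sketched in shell. This is a rough approximation of the logic, not the actual Node code; the canned headers stand in for a real `curl -sI "$url"` response, and the verdict strings are made up:

```shell
# Canned response headers standing in for: headers=$(curl -sI "$url")
headers='HTTP/1.1 200 OK
Content-Type: text/html
X-Frame-Options: DENY'

if printf '%s\n' "$headers" | grep -qi '^content-type: *image/'; then
    verdict="image: show as a div background"
elif printf '%s\n' "$headers" | grep -qi '^x-frame-options:'; then
    verdict="refuses framing: report a useful error"
else
    verdict="show in a sandboxed iframe"
fi
echo "$verdict"
```

With these headers the page refuses framing, so the server would report an error instead of pointing an iframe at it.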

Additionally, the requested URLs can go through various transformations. For example, if a link to an Imgur page is posted, the server will transform the URL into the URL for the image on the page. A link to an XKCD comic page will query the XKCD API for the URL of the image for that comic. This mechanism also allows for blacklisting of various content.
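
A sketch of that kind of rewrite, in shell rather than the app's actual Node code (the page URL and the `.png` extension here are illustrative assumptions, not what the server actually does):

```shell
url='https://imgur.com/abc123'   # hypothetical Imgur page URL
case "$url" in
  *://imgur.com/*)
    id="${url##*/}"                      # page slug, e.g. abc123
    url="https://i.imgur.com/${id}.png"  # assumed direct-image form
    ;;
esac
echo "$url"
```

The same pattern-match step is a natural place to hang a blacklist: match the unwanted host and replace the URL with an error instead of a rewrite.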

What's with the name?

The name of the project is a little silly, and is (I'm told) a reference to the old G.I. Joe cartoons. Among the enemies in the show were the twins Tomax and Xamot, collectively known as the Crimson Twins. The name was proposed humorously, but I decided to keep it, and now I rarely think about the cartoon series anymore.

The name has spawned enough related names that the projects which have grown up around it have been easy to name, such as the Crimson Guard Commanders, the IRC name for one of the bots that interfaces with the API; and Extensive Enterprises, a web-based camera-to-Imgur-to-IRC-to-CrimsonTwins roundabout way to post photos to the screens.

The Future

Crimson Twins has been proposed as the ambient display system for the public areas of various Mozilla offices, and as a general-purpose manager for driving screens. To this end it is probably going to grow features such as remote control of clients, a more powerful API, and tools to make it easy to manage remotely.

CrimsonTwins is open source, and can be found on GitHub. Pull requests are welcome, and if you want to chat about it, you can find me as mythmon in #bots on IRC.

  1. Due to bug 785310, Firefox allows sandboxed iframes with scripts enabled to directly access the parent document, which is a violation of the spec. Hopefully this bug will be fixed in the near future. 

Malicious Git Server

Posted on 2013-06-03 in
  • coding
  • ,
  • git
  • ,
  • tools
  • ,
  • security
  • , and
  • deployment

Git is hard

Some time ago, I found myself in a debate on IRC regarding the security of git. On one side, my opponent argued that you could not trust git to reliably give you a particular version of your code in light of a malicious remote or a man-in-the-middle attack, even if you checked it out by a particular hash. I argued that because of the way git stores revisions, even a change in the history of the repository would require breaking the security of the SHA1 hashes git uses, an unlikely event. We eventually came to agreement that if you get code via revision hashes, and not via branches or unsigned tags, you are not vulnerable to the kind of attack he was proposing.

This got me thinking about the security of git. About how it stores objects and builds a working directory. What if the contents of one of the object files changed? Git makes these files read only on the file system to prevent this kind of problem, but that is a weak protection. If the other end of your git clone is malicious, how much damage could they do? If there really is a security problem here, it means that a lot of deployment tools that rely on git telling the truth are vulnerable.

The malicious git

So I did an experiment. I created a repository in ~/tmp/malice/a. I checked in a file, good.txt, and put the word "good" in it. I committed, and it was good. For a sanity check, I cloned that repository to ~/tmp/malice/b. Everything looked as I expected. I deleted the clone, and started fiddling with git's internals.

My first thought was to modify the object file that represented the tree, to try and replace the file with another one. Unfortunately, git's objects files aren't packed in a human readable way, so this didn't work out. After some more thought, I decided I could just modify the object file representing good.txt directly. Surely those are stored in a human readable way.

Nope. File blobs are equally unreadable. It seemed the only thing within my reach that could deal with them was git itself. Hmm. Do file blobs depend only on the file's contents? I created another throwaway repository, made another good.txt with the same contents, and committed it. The hash was the same. This was what I needed to test out my theory of malice! So I made a second file, evil.txt, and checked it in to the throwaway repository.
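
That property is easy to check from the command line. A blob's id is a pure function of its contents, so two files with the same bytes get the same object id, in any repository (the scratch directory here is just for illustration):

```shell
cd "$(mktemp -d)"
git init -q .
printf 'good\n' > one.txt
printf 'good\n' > two.txt
# Same bytes, same blob id, regardless of filename.
id_one=$(git hash-object one.txt)
id_two=$(git hash-object two.txt)
echo "$id_one"
[ "$id_one" = "$id_two" ] && echo 'identical blob ids'
```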

I took the contents of the object file for evil.txt and replaced the object file for good.txt with them. The original repository was still unaware of my treachery: git status said all was well. Mischief managed!

What does the git think?

Next I cloned the modified repository. Alarmingly, no red flags were raised, and the exit code was 0. Opening good.txt revealed that the treachery had worked: it contained the text "evil", just like evil.txt. Uh oh. Surely git knows something is wrong, right? I ran git status in the cloned repository. modified: good.txt. Well, that is a start. But the return code was still 0, which means that git status can't help in our deploy scripts. Out of curiosity, I ran git diff to see what git thinks was modified in the file. Nothing. Which makes sense: git knows something is up because the hash of good.txt doesn't match its object id, but the file's contents match the swapped object, so there is no textual difference to show.

This is worrisome. To protect against a malicious server or MITM attacks, there needs to be an automated way to detect this treachery. I looked around in git help. Nothing obvious. I delved deeper. Would git gc notice something? Nope. Status and clone are already out. Repack? No dice. I was getting very worried by this point.

The solution

Then I found the command I needed: git fsck. It does just what you would expect it to. The name comes from the system utility of the same name, which stands for "file system check". After finding this command, I had hope. I ran it. It didn't light up in big flashing lights, but reading its output revealed "error: sha1 mismatch 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7". More importantly, the exit code of the command was 5. I don't know what 5 means, but I do know it isn't 0, so it is an error. Yes!

So the solution is to always check git fsck after cloning if you really must know that your code is what you intended to run. If you do not, you run the risk of getting code that could be entirely different from what you thought.
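
The whole experiment, plus the fsck check, can be reproduced in a scratch repository. This is a sketch of the same object-swap described above, condensed into one repo for brevity (the hashes, paths, and identity settings are whatever git generates locally, not the ones from my original run):

```shell
set -e
cd "$(mktemp -d)"
git init -q victim && cd victim
git config user.email you@example.com && git config user.name tester
printf 'good\n' > good.txt
printf 'evil\n' > evil.txt
git add . && git commit -qm 'initial'

good_id=$(git hash-object good.txt)
evil_id=$(git hash-object evil.txt)
good_obj=".git/objects/$(echo "$good_id" | cut -c1-2)/$(echo "$good_id" | cut -c3-)"
evil_obj=".git/objects/$(echo "$evil_id" | cut -c1-2)/$(echo "$evil_id" | cut -c3-)"

# Swap evil.txt's loose object in over good.txt's.
chmod u+w "$good_obj"
cp "$evil_obj" "$good_obj"

git status --porcelain               # prints nothing: status is fooled
git fsck --full || echo "fsck exit code: $?"
```

status reports a clean tree, while fsck spots the mismatched object and exits non-zero, which is the signal a deploy script can act on.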

A small comfort

Someone pointed out that the various remote protocols git uses would probably be a little pickier about what they got, in case of network transmission errors. Luckily, this was true. I tested the http, git, and ssh protocols, and each of them raised an error on clone:

Cloning into './c'...
error: File 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7 has bad hash
fatal: missing blob object '12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7'
fatal: remote did not send all necessary objects
Unexpected end of command stream

The particular output varied a little with each protocol, but the result was the same. An error in the output, return code 128, and no repository cloned. This is good.

I think this is something that improved recently: when I originally did this experiment, I remember the remote protocols printing an error but exiting zero and still creating the repository. Unfortunately I did not document this, so I'm not sure. Yay for continuous improvement and poor memories, I guess.


If you are cloning from an untrusted git server, and especially if you are cloning from an untrusted repository via the file protocol, run git fsck afterwards and check the error code, to make sure everything is as it should be.

Sublime, urxvt, and nose-progressive

Posted on 2013-04-24 in
  • coding
  • ,
  • dotfiles
  • ,
  • sublime
  • ,
  • urxvt
  • ,
  • nose-progressive
  • , and
  • python

For many of my projects I use the excellent nose-progressive for running tests. Among other features, it prints out lines that are intended to be helpful to jump straight to the code that caused the error. This works well for some workflows, but not mine. Here is an example of nose-progressive's output:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.
Generating sample data from wiki...

  vim +44 apps/wiki/tests/  # test_document_is_template
    assert 0

1438 tests, 1 failure, 0 errors, 5 skips in 115.0s

In particular, note the line that begins vim +44 apps/wiki.... It is indicating the file and line number where the error occurred, and if I were to copy that line and execute it in my shell, it would launch vim with the right file and location. Not bad! It chose vim because that is what I have $EDITOR set to.

Unfortunately, even though my $EDITOR is set to vim, I use Sublime for my day-to-day editing tasks. I like to keep $EDITOR set to vim, because it tends to be used in places where I don't want to escalate to Sublime, but in this case I really do want the GUI editor. So this feature of nose-progressive doesn't help me much.

So how can I get nose-progressive to be helpful? In the recent 1.5 release of nose-progressive, a feature to customize this line was added. Promising. Additionally, I use urxvt as my terminal, and with some configuring, it can open links when they are clicked on. A plan is beginning to form.

Configuring nose-progressive

First, I made nose-progressive output a line indicating that Sublime, not vim, should be used to open the file. A quick trip to the documentation taught me that I can set the environment variable $NOSE_PROGRESSIVE_EDITOR_SHORTCUT_TEMPLATE to a template string of my choosing.

Quite a mouthful, but it gets the job done. This format string will print something visually resembling the old line, but with a custom format. In action, it looks like this:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.

  subl:///home/mythmon/src/kitsune/apps/wiki/tests/  # test_document_is_template
    assert 0

1 test, 1 failure, 0 errors in 0.0s

Awesome! Now to get the terminal to respond.

Configuring urxvt

I use a package called urxvt-perls to add features like clickable links to my terminal. I tweaked its config to recognize my custom Sublime links from above, as well as normal web links. This is the relevant snippet of my ~/.Xdefaults file:

URxvt.perl-ext-common: default,url-select
URxvt.url-select.launcher: urxvt_launcher
URxvt.url-select.underline: true
URxvt.url-select.button: 3
URxvt.matcher.pattern.1: \\b(subl://[^ ]+)\\b

Line by line:
  • The url-select add-on is loaded.
  • Set the launcher script to urxvt_launcher. More on this in a second.
  • Underline links when they are detected.
  • Use the right mouse button to open links.
  • Add an additional pattern to search for to make clickable.

Now when normal web links are found, or my custom subl:// links are clicked, urxvt_launcher will be executed with the underlined text as $1.

The launcher

Bash is not my native language, but it seemed the appropriate tool for this job. I hacked together this script:


#!/bin/bash
if [[ $1 == subl://* ]]; then
    # Strip the scheme; the rest is already path:line, which subl accepts.
    exec subl "${1#subl://}"
else
    exec browser "$1"
fi

This seems to do the trick. If the "url" starts with the string subl://, it extracts the path and line number from the argument and then execs Sublime with that information. Otherwise, it runs another script, browser, which is simply a symlink to whatever browser I'm using at the moment.

All of this combined together make nice, clickable links to exactly what line of code is breaking my tests. Time will tell if this is useful but if nothing else, it is quite neat.