
Github Pages, Travis, and Static Sites

Posted on 2014-09-01 in github, travis, static, and meta.

I recently switched my blog to being hosted on GitHub Pages instead of hosting the static site myself. Along with this change, I was able to automate the rendering and updating of the site, thanks to GitHub webhooks and Travis CI. As always, I'm using wok for the rendering and management of the site.

My workflow now looks like this:

  1. Write a post, edit something, change a template, etc.
  2. git commit.
  3. git push.
  4. Wait for the robots to do my bidding.

It is ideal.

Prerequisites

For this to work, there are a few things that are needed. First, and most fundamental, the site needs to be fully static. A pile of HTML, CSS, JS, images, etc. Nothing server side at all. Otherwise it can't be hosted on GitHub Pages.

Next, the site has to be stored on GitHub, and it can't be the account's main GitHub Pages repository. For example, I cannot use the repository mythmon/mythmon.github.io, because GitHub Pages treats the branches in that repository differently. It should be possible to set up this workflow on a repository like that, but I won't go into it here.

The master branch will be where the source of the site is, the parts a human edits. The gh-pages branch will hold the rendered output, and be generated automatically.

Finally, the site needs to be easy to render on Travis. This usually means that all the tools are easy to install with pip or npm or another package manager, and the process of rendering the output from a checkout of the site can be scripted. Any wok sites should fit these requirements.

Part 1 - Automation

Before I can ask the robots to do my bidding, I have to automate the process they are going to be doing. Two commands are needed, one to build the site, and one to commit the new version and push it to GitHub.

My site uses wok, which is a Python library. Because of this, I wanted a Python task runner to automate the build process. It may have been overkill, but I used Invoke. Here is my invoke script, task.py, with explanation interspersed.

If you're unfamiliar with Invoke, it gives a nice way to define tasks, run shell commands, and run Python code. Make, shell scripts, Gulp, or any other task runner would work just as well.

Here's the code. First, some imports.

import os
from contextlib import contextmanager
from datetime import datetime

from invoke import task, run

These two constants are used to clone and push the repository. GH_REF is the repository's remote URL, without any protocol, and GH_TOKEN will be a GitHub authorization token from the environment. More on this in a bit.

GH_REF = 'github.com/mythmon/mythmon.com.git'
GH_TOKEN = os.environ.get('GH_TOKEN')

This is a simple context manager that lets me change into a directory, run some commands, then safely change out of it. I'm likely reinventing the wheel here.

@contextmanager
def cd(newdir):
    print 'Entering {}/'.format(newdir)
    prevdir = os.getcwd()
    os.chdir(newdir)
    try:
        yield
    finally:
        print 'Leaving {}/'.format(newdir)
        os.chdir(prevdir)

Here is the first of the three tasks defined. This one makes sure that the output directory is in the right state. It should work even if the directory already exists, if it isn't a git repo, if it has stray files lying around, or even if it is on the wrong commit or branch. This is generally useful outside of Travis as well.

def make_output():
    if os.path.isdir('output/.git'):
        with cd('output'):
            run('git reset --hard')
            run('git clean -fxd')
            run('git checkout gh-pages')
            run('git fetch origin')
            run('git reset --hard origin/gh-pages')
    else:
        run('rm -rf output')
        run('git clone https://{} output'.format(GH_REF))

This next task simply renders the site. It sets up the output directory by calling the above task, and then triggers wok, the site renderer. Nice and simple.

@task(default=True)
def build():
    make_output()
    run('wok')

This last task is the bit that actually publishes to GitHub in a safe, secure, and automated way.

@task
def publish():
    if not GH_TOKEN:
        raise Exception("Probably can't push because GH_TOKEN is blank.")
    build()

    with cd('output'):

The first thing it does is configure a user for git. GitHub won't accept pushes without user information, so I put some fake information here.

        run('git config user.email "travis@mythmon.com"')
        run('git config user.name "Travis Build"')

Next it git adds all the files in the output directory. The --all flag will deal with new files being added, old files being changed, and old files being deleted. It won't commit anything in the .gitignore, if you have one.

        run('git add --all .')

Now, make the commit. I thought for a while about what to put in the git commit message. At first I was going to put a timestamp, but I realized that git will do that for me already. Future improvements might note what commits this version of the site was built from.

        run('git commit -am "Travis Build"')
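
If I do add that, it could look something like the following sketch. This is not in the current script; it assumes git 1.8.5+ for the -C flag, and that the source checkout is the parent directory of output/.

        # Sketch: record which source commit this output was built from.
        # We are inside output/ here, so ask git about the parent checkout.
        import subprocess
        source = subprocess.check_output(
            ['git', '-C', '..', 'rev-parse', '--short', 'HEAD']).strip()
        run('git commit -am "Travis Build of {}"'.format(source))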

Finally, the script needs to push the resulting commit up to the gh-pages branch of GitHub, so it will be served. The first problem I faced was how to authenticate with GitHub to do this. The second was how to do that without revealing any secrets.

The solution to the first problem was GitHub token auth. By using the HTTPS protocol, and putting the token in the authentication section of the URL, I can push to any GitHub repo that token has access to.

The problem with this is that git prints out the remote when you push. Since my token is in the URL, which is the remote name in this case, it was printing secrets out in Travis logs! The solution is to hide git's output. It seems obvious in retrospect, but I revealed two tokens this way in Travis logs (they were immediately revoked, of course).

        # Hide output, since there will be secrets.
        cmd = 'git push https://{}@{} gh-pages:gh-pages'
        run(cmd.format(GH_TOKEN, GH_REF), hide='both')

To run these tasks, I use invoke build and invoke publish, to build and publish the site, respectively.

Part 2 - Travis

As you can tell, the bulk of the work is in the automation. A lot of thought went into the 60 or so lines of code above. Now that it is automated, it is easy to make the robots do the rest. I chose Travis for my automation.

I went to the Travis site, set up the repository for the site, and tweaked a few settings. In particular, I turned off the "Build pull requests" option, because PR builds aren't useful to me; there isn't any risk of revealing secrets in them anyway, because Travis doesn't decrypt the secrets in PR builds. The other setting I tweaked was to turn on "Build only if .travis.yml is present". Since I was doing all this work on a branch, I didn't want my master branch to be making builds happen, and I think this is a generally good setting to turn on in Travis.

So that Travis knows what tools it needs, I added a requirements.txt file to my repository, which Travis understands how to use, if you set the language to Python. Then I added a .travis.yml to tell Travis how to build my site.

I've wrapped the secure token here to fit on the page. And yes, it is fine to publish: it is encrypted, as I'll explain below.

language: python
python:
- '2.7'
script:
- invoke build
after_success:
- invoke publish
env:
  global:
    secure: LXYt0XENsCV58GD2g2jB27Hil9O80DXdnyM6palKLNcYa7z/hqvqtkCwW9Wmj5jqLXj
            UjiTAk0BUqinvL6ZPrqGiluWQ5hY2e9YNG/eYRd1Qv1TdaDu2+iCfIK8VDehGZl9G8L
            y09RL6gfHWxofnSamMztcFWqDbh/2iDp3GmUU=

Basic, simple stuff. The script command (invoke build) builds the site, using the big script above. Similarly, the after_success command (invoke publish) uploads the site to GitHub (but only if the site actually builds).

Woah, hold on. What's all this junk at the end? That "junk" is the magic to safely pass secrets to Travis builds in a public repository. It is a line that looks like GH_TOKEN=abcdef1234567890, encrypted using a public key for which Travis holds the private key. In "safe" builds (builds on my repo that are not from PRs) Travis will decrypt that token and provide it to the build. The invoke script then picks up the environment variable and uses it when it pushes to GitHub. Pretty slick.

To generate this encrypted line, I used the Travis CLI tool like this:

$ travis encrypt --add
Reading from stdin, press Ctrl+D when done
GH_TOKEN=abcdef123457890
^D

That is, I ran the command, typed the name of the environment variable, followed by an equals sign, and then the value, then I pressed enter, and then Ctrl+D. This is a normal interactive read from stdin. After doing that, my .travis.yml file contained the encrypted string, and I was ready to commit it.

I got the value for that environment variable from GitHub's personal API token generator.

Part 3 - DNS

I could be done now. At this point, when I push a new version of my site to GitHub, it fires a webhook, Travis builds my site, pushes it back to GitHub, and then GitHub serves it with GitHub Pages.

This didn't work for me for a couple of reasons. First, I like my URL, and didn't really want to change it. Second, my site assumes it is at the root of the server, and can't deal with GitHub's insistence on putting this site at mythmon.github.io/mythmon.com. The content of the site is there, but it's all unstyled because of broken links to the CSS, and none of the links work. Maybe someday I'll fix this.

So I have to do some DNS tricks and tell GitHub Pages to expect another domain name for my site.

Telling GitHub

So that GitHub knows what site to serve when someone visits www.mythmon.com, I had to add a CNAME file to the gh-pages branch of the site. Luckily with wok that was pretty easy. I made the file media/CNAME which wok put in the root of the gh-pages branch and gave it the contents www.mythmon.com. It takes some minutes for GitHub to recognize this change, but after that it works well.

Setting up DNS

You may have noticed that I say www.mythmon.com there, instead of the nicer, cleaner mythmon.com. I would prefer the latter, but it isn't to be with GitHub Pages.

The recommended way to use custom DNS with GitHub Pages is to make whatever domain name should serve the site a CNAME to username.github.io. So for me, I have www.mythmon.com IN CNAME mythmon.github.io in a BIND config. The problem is, according to RFC 1912, "A CNAME record is not allowed to coexist with any other data." Since the root of a domain has to have some other records (NS, SOA, possibly MX or TXT), you can't have a CNAME to GitHub at the root. :(
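
For illustration, the relevant BIND records look something like this (the nameserver name is a placeholder):

www.mythmon.com.  IN  CNAME  mythmon.github.io.
mythmon.com.      IN  NS     ns1.example.net.
; A "mythmon.com. IN CNAME mythmon.github.io." line is not allowed here,
; because it would coexist with the NS and SOA records at the zone apex.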

Redirects

The problem with this is that I have used http://mythmon.com to reference my blog in the past, and cool URIs don't change. So I needed to find a way to make the old address work.

First I tried making mythmon.com have an A record pointing to the IPs of the GitHub Pages servers. This isn't recommended, but it does work if that is the primary DNS name of the site. However, since in the CNAME file above I wrote down www.mythmon.com (following the recommended DNS CNAME setup), this didn't work. It gave "No such domain" errors. Bummer.

The solution I ended up going with is less nice. I pointed the root record at the server I used to host the site on, which is still running nginx. I put this in my Nginx config:

server {
    listen       80;
    server_name  mythmon.com;
    return 301 $scheme://www.mythmon.com$request_uri;
}

This causes Nginx to serve permanent redirects to the correct URL, preserving any path information. Not the best experience, but it works.

That's it.

Now the site works. It gets served from a fast CDN, I don't have to worry about re-rendering the site, and I get to make blog posts with git. The robots do the tedious work for me. It is ideal.

If you have any comments or questions, I'm @mythmon on Twitter.


Tracking Deploys in Git Log

Posted on 2013-11-18 in tools, sumo, and git.

Knowing what is going on with git across many environments can be hard. In particular, it can be hard to easily know where the server environments are in the git history, and how the rest of the world relates to that. I've set up a couple of interlocking gears of tooling that help me know what's going on.

Network

One thing that I love about GitHub is its network view, which gives a nice high-level overview of branches and forks in a project. One thing I don't like about it is that it only shows what is on GitHub, and it is a bit light on details. So I did some hunting, and I found a set of git commands that does a pretty good job of replicating GitHub's network view.

$ git log --graph --all --decorate

I have this aliased to git net. Let's break it down:

  • git log - This shows the history of commits.
  • --graph - This adds lines between commits showing merging, branching, and all the rest of the non-linearity git allows in history.
  • --all - This shows all refs in your repo, instead of only your current branch.
  • --decorate - This shows the name of each ref next to each commit, like "origin/master" or "upstream/master".
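
Setting up the alias is a single command:

$ git config --global alias.net 'log --graph --all --decorate'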

This isn't that novel, but it is really nice. I often get asked what tool I'm using for this when I pull this up where other people can see it.

Cron Jobs

Having all the extra detail in my view of git's history is nice, but it doesn't help if I can only see what is on my laptop. I generally know what I've committed (on a good day), so the real goal here is to see what is in all of my remotes.

In practice, I only have this done for my main day-job project, so the update script is specific to that project. It could be expanded to all my git repos, but I haven't done that. To pull this off, I have this line in my crontab:

*/10 * * * * python2 /home/mythmon/src/kitsune/scripts/update-git.py

I'll get to the details of this script in the next section, but the important part is that it runs git fetch --all for the repo in question. To run this from a cron job, I had to switch all my remotes to using the https protocol for git instead of ssh, since my SSH keys aren't unlocked there. Git knows the passwords to my https remotes thanks to its gnome-keychain integration, so this all works without user interaction.
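
Switching a remote from ssh to https is likewise a single command (using my day-job repository as the example):

$ git remote set-url origin https://github.com/mozilla/kitsune.git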

This has the result of keeping git up to date on what refs exist in the world. I have my teammate's repos as remotes, as well as our central master. This makes it easier for me to see what is going on in the world.

Deployment Refs

The last bit of information I wanted to see in my local network view is the state of deployment on our servers. We have three environments that run our code, and knowing what I'm about to deploy is really useful. In the git net output, you'll notice a couple of refs that are likely unfamiliar: deployed/stage and deployed/prod, in green. This is the second part of the update-git.py script I mentioned above.

As a part of the SUMO deploy process, we put a file on each server that contains the currently deployed git sha. This script reads those files and makes local references in my git repo that correspond to them.

Wait, creates git refs from thin air? Yeah. This is a cool trick my friend Jordan Evans taught me about git. Since git's references are just files on the file system, you can make new ones easily. For example, in any git repo, the file .git/refs/heads/master contains a commit sha, which is how git knows where your master branch is. You could make new refs by editing these files manually, creating files and overwriting them to manipulate git. That's a little messy though. Instead we should use git's tools to do this.

Git provides git update-ref to manipulate refs. For example, to make my deployment refs, I run something like git update-ref refs/heads/deployed/prod 895e1e5ae. The last argument can be any sort of commit reference, including HEAD or branch names. If the ref doesn't exist, it will be created, and if you want to delete a ref, you can add -d. Cool stuff.
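
Spelled out as commands:

$ git update-ref refs/heads/deployed/prod 895e1e5ae    # create or move the ref
$ git update-ref -d refs/heads/deployed/prod           # delete it again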

All Together Now

Now, finally, the entire script. Here I am using a git helper that I wrote, which I have omitted for space. It works how you would expect, translating git.log('some-branch', all=True) to git log --all some-branch. I made a gist of it for the curious.
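
If you don't want to chase down the gist, here is a minimal sketch of what such a wrapper can look like (my approximation, not the gist itself): attribute access picks the git subcommand, keyword arguments become --flags, and positional arguments pass through.

import subprocess

class Git(object):
    def __getattr__(self, name):
        def call(*args, **kwargs):
            # "rev_parse" -> "rev-parse", "update_ref" -> "update-ref", etc.
            cmd = ['git', name.replace('_', '-')]
            for flag, value in kwargs.items():
                if value is True:
                    cmd.append('--' + flag.replace('_', '-'))
                else:
                    cmd.append('--{}={}'.format(flag.replace('_', '-'), value))
            cmd.extend(args)
            # Raises CalledProcessError on failure, which the script below catches.
            return subprocess.check_output(cmd)
        return call

# Git().log('some-branch', all=True)  ->  git log --all some-branch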

The basic strategy is to fetch all remotes, then add or update the refs for the various server environments using git update-ref. This runs from cron every few minutes, and makes knowing what is going on a little easier, and git in a distributed team a little nicer.

#!/usr/bin/env python

import os
import re
import subprocess

import requests

repo_dir = "{HOME}/src/kitsune".format(**os.environ)
environments = {
    'dev': 'http://support-dev.allizom.org/media/revision.txt',
    'stage': 'http://support.allizom.org/media/revision.txt',
    'prod': 'http://support.mozilla.org/media/revision.txt',
}

def main():
    cdpath = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
    os.chdir(cdpath)

    git = Git()

    print(git.fetch(all=True))
    for env_name, revision_url in environments.items():
        try:
            cur_rev = git.rev_parse('deployed/' + env_name).strip()
        except subprocess.CalledProcessError:
            cur_rev = None
        new_rev = requests.get(revision_url).text.strip()

        if cur_rev != new_rev:
            # cur_rev is None the first time a ref is created for an environment.
            cur_display = cur_rev[:8] if cur_rev else '(none)'
            print 'updating {} {} -> {}'.format(env_name, cur_display, new_rev[:8])
            git.update_ref('refs/heads/deployed/' + env_name, new_rev)

if __name__ == '__main__':
    main()

That's It

The general idea is really easy:

  1. Fetch remotes often.
  2. Write down deployment shas.
  3. Actually look at it all.

The fact that it requires a little bit of cleverness, and a bit of git magic along the way, means it took some time to figure out. I think it was well worth it though.


Localized search on SUMO

Posted on 2013-08-07 in projects, sumo, mozilla, and elasticsearch.

My primary project at work is SUMO, the Firefox support site. It consists of a few parts, including a wiki, a question/answer forum, and a customized Twitter client for helping people with Firefox. It is also a highly localized site, with support in the code for over 80 languages. We don't have a community in all of those languages, but should one emerge, we are ready to embrace it. In other words, we take localization seriously.

Until recently, however, this embrace of multilingual coding didn't extend to our search engine. Our search engine (based on ElasticSearch) assumed that all of the wiki documents, questions and answers, and forum posts were in English, and applied English-based tricks to improve search. No more! On Monday, I flipped the switch to enable locale-specific analyzer and query support in the search engine, and now many languages have improved search. Here, I will explain just what happened, and how we did it.

Background

Currently, we use two major tricks to improve search: stemming and stop words. These help the search engine behave in a way that is more consistent with how we understand language, generally.

Stemming

Stemming is recognizing that words like "jump", "jumping", "jumped", and "jumper" are all related. They all stem from the common word "jump". In our search engine, this is done by enabling the ElasticSearch Snowball analyzer, which uses the Porter stemming algorithm.

Unfortunately, Porter is English specific, because it stems algorithmically based on patterns in English, such as removing trailing "ing", "ed" or "er". The algorithm is much more complicated, but the point is, it really only works for English.

Stop Words

Stop words are words like "a", "the", "I", or "we" that generally carry little information in regards to a search engine's behavior. ES includes a list of these words, and removes them intelligently from search queries.

Analysis

ES is actually a very powerful system that can be used for many different kinds of search tasks (as well as other data slicing and dicing). One of the more interesting features that makes it more than just full-text search is its analyzers. There are many built-in analyzers, and there are ways to recombine analyzers and parts of analyzers to build custom behavior. If you really need something special, you could even write a plugin to add a new behavior, but that requires writing Java, so let's not go there.

The goal of analysis is to take a stream of characters and create a stream of tokens out of them. Stemming and stop words are things that can play into this process. These modifications to analysis actually change the document that gets inserted into the ES index, so we will have to take that into account later. If we insert a document containing "the dog jumped" into the index, it would get indexed as something like

[
    {"token": "dog", "start": 4, "end": 7},
    {"token": "jump", "start": 8, "end": 14}
]

This isn't really what ES would return, but it is close enough. Note how the tokens inserted are the post-analysis versions, which include the changes made by the stop word and stemming token filters. That means the analysis process is language specific, so we need to change the analyzer depending on the language. Easy, right? Actually, yes. This consists of a few parts.

Choosing a language

SUMO is a Django app, so in settings.py, we define a map of languages to ES analyzers, like this (except with a lot more languages):

ES_LOCALE_ANALYZERS = {
    'en-US': 'snowball',
    'es': 'snowball-spanish',
}

Note: snowball-spanish is simply the normal Snowball analyzer with an option of {"language": "Spanish"}.
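
Defining such an analyzer in the index settings looks roughly like this (a sketch of just the relevant fragment):

{
    "settings": {
        "analysis": {
            "analyzer": {
                "snowball-spanish": {
                    "type": "snowball",
                    "language": "Spanish"
                }
            }
        }
    }
}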

Then we use this helper function to pick the right language based on a locale, with a fallback. This also takes into account the possibility that some ES analyzers are located in plugins which may not be available.

def es_analyzer_for_locale(locale, fallback="standard"):
    """Pick an appropriate analyzer for a given locale.

    If no analyzer is defined for `locale`, return fallback instead,
    which defaults to ES analyzer named "standard".
    """
    analyzer = settings.ES_LOCALE_ANALYZERS.get(locale, fallback)

    if (not settings.ES_USE_PLUGINS and
            analyzer in settings.ES_PLUGIN_ANALYZERS):
        analyzer = fallback

    return analyzer

Indexing

Next, the mapping needs to be modified. Prior to this change, we explicitly listed the analyzer for all analyzed fields, such as the document content or document title. Now, we leave off the analyzer, which causes it to use the default analyzer.
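
In mapping terms, a field like document_title went from the first form to something like the second (a sketch, not the real kitsune mapping):

"document_title": {"type": "string", "analyzer": "snowball"}

"document_title": {"type": "string"}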

Finally, we can set the default analyzer on a per document basis, by setting the _analyzer field when indexing it into ES. This ends up looking something like this (this isn't the real code, because the real code is much longer for uninteresting reasons):

def extract_document(obj):
    return {
        'document_title': obj.title,
        'document_content': obj.content,
        'locale': obj.locale,
        '_analyzer': es_analyzer_for_locale(obj.locale),
    }

Searching

This is all well and good, but what use is an index of documents if you can't query it correctly? Let's consider an example. If there is a wiki document with a title "Deleting Cookies", and a user searches for "how to delete cookies", here is what happens:

First, the document would have been indexed and analyzed, producing this:

[
    {"token": "delet", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

So now, if we try to query for "how to delete cookies" without any analysis, nothing will match! That is because we need to analyze the search query as well (ES does this by default). Analyzing the search query results in:

[
    {"token": "how", "start": 0, "end": 3},
    {"token": "delet", "start": 7, "end": 13},
    {"token": "cooki", "start": 14, "end": 21}
]

Excellent! This will match the document's title pretty well. Remember that ElasticSearch doesn't require 100% of the query to match: it simply finds the best matches available, which can be confusing in edge cases, but in the normal case it works out quite well.

There is an issue though. Let's try this example in Spanish. Here is the document title "Borrando Cookies", as analyzed by our analysis process from above.

[
    {"token": "borr", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

and the search "como borrar las cookies":

[
    {"token": "como", "start": 0, "end": 4},
    {"token": "borrar", "start": 5, "end": 11},
    {"token": "las", "start": 12, "end": 15},
    {"token": "cooki", "start": 16, "end": 23}
]

... Not so good. In particular, 'borrar', which is another verb form of 'Borrando' in the title, got analyzed as English, and so didn't get stemmed correctly. It won't match the token borr that was generated in the analysis of the document. So clearly, searches need to be analyzed in the same way as documents.

Luckily in SUMO we know what language the user (probably) wants, because the interface language will match. So if the user has a Spanish interface, we assume that the search is written in Spanish.

The original query that we use to do searches looks something like this much abbreviated sample:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies"
            }
        }
    }
}

The new query includes an analyzer field on the text match:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies",
                "analyzer": "snowball-spanish"
            }
        }
    }
}

This will result in the correct analyzer being used at search time.

Conclusion

This took me about three weeks, off and on, to develop, plus some chats with ES developers on the subject. Most of that time was spent researching and thinking about the best way to do localized search. Alternatives include having lots of fields, like document_title_es, document_title_de, etc., which seems icky to me, or using multifields to achieve a similar result. Another proposed idea was to use a different ES index for each language. Ultimately I decided on the approach outlined above.

For the implementation, modifying the indexing method to insert the right data into ES was the easy part, and I knocked it out in an afternoon. The difficult part was modifying our search code, working with the library we use to interact with ES to get it to support search analyzers, testing everything, and debugging the code that broke when this change was made. Overall, I think that the task was easier than we had expected when we wrote it down in our quarterly goals, and I think it went well.

For more nitty-gritty details, you can check out the two commits to the mozilla/kitsune repo that I made these changes in: 1212c97 and 0040e6b.


The Crimson Twins

Posted on 2013-06-13 in projects, crimson twins, and mozilla.

Crimson Twins is a project that I started at Mozilla to power two ambient displays that we mounted on the wall in the Mountain View Mozilla office. We use it to show dashboards, such as the current status of the sprint or the state of GitHub. We also use it to share useful sites, such as posting a YouTube video that is related to the current conversation. Most of the time that I pay attention to the Twins is when my coworkers and I post amusing animated GIFs to distract and amuse the rest of the office.

Technical Details

Crimson Twins is a Node.js app that is currently running on Mozilla's internal PaaS, an instance of Stackato. It uses Socket.io for real-time, two-way communication between the server and the client. Socket.io uses WebSockets, long polling, Flash sockets, or a myriad of other techniques.

The architecture of the system is that there are a small number of virtual screens, which represent targets to send content to. Each client connected to the server can choose one or more of these virtual screens to display. A client can be any modern web browser, and I have used Firefox for Desktop and Android, Safari on iOS, and Chrome on desktop without any trouble.

Because of this setup, remote Mozillians can connect to the server and load up the same things that are shown on the TVs in the office. Put another way, Crimson Twins is remotie friendly, and people can play along at home.

Content Handling

Content is displayed with one of two mechanisms. For images, the content is loaded as a background of a div. Originally img tags were used, but it was difficult to style them this way. The switch to divs made it much easier to zoom the image to full screen without using JavaScript.

For content that is not images, a sandboxed iframe is used. This allows most sites to be shown with Crimson Twins, and the sandboxing prevents malicious sites from hijacking the Crimson Twins container [1]. This means that sites that disallow framing cannot be used with the system, but after much brainstorming we have yet to find a satisfactory way to get around this. Luckily most sites don't worry about iframes, so this isn't normally a huge annoyance.

For every URL that is sent to the screens, the server first makes a HEAD request to the requested resource. A few things happen with this information. First, it is used to determine whether the URL is an image or not, by examining the content type. Second, the server examines the headers to find things such as X-Frame-Options: deny, server errors, or malformed URLs, and it provides useful error messages if something like this happens.
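
The server is written in Node, but the idea is simple enough to sketch in Python (illustrative only; the requests library and the classify function are my own stand-ins, not part of Crimson Twins):

import requests

def classify(url):
    """Decide how a screen should treat a URL, based only on a HEAD request."""
    resp = requests.head(url, allow_redirects=True)
    if resp.status_code >= 400:
        return 'error: the server returned {}'.format(resp.status_code)
    if resp.headers.get('Content-Type', '').startswith('image/'):
        return 'image'
    if resp.headers.get('X-Frame-Options', '').lower() in ('deny', 'sameorigin'):
        return 'error: the site disallows framing'
    return 'iframe'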

Additionally, the requested URLs can go through various transformations. For example, if a link to an Imgur page is posted, the server will transform the URL into the URL for the image on the page. A link to an XKCD comic page will query the XKCD API for the URL of the image for that comic. This mechanism also allows for blacklisting various content.

What's with the name?

The name of the project is a little silly, and is (I'm told) a reference to the old G.I. Joe cartoons. Among the enemies in the show were the twins Tomax and Xamot, collectively known as the Crimson Twins. The name was proposed humorously, but I decided to keep it, and now I rarely think about the cartoon series anymore.

The franchise has enough related names that the projects which have sprung up around Crimson Twins have been easy to name, such as the Crimson Guard Commanders, the IRC name for one of the bots that interfaces with the API; and Extensive Enterprises, a web-based camera-to-Imgur-to-IRC-to-CrimsonTwins roundabout way to post photos to the screens.

The Future

Crimson Twins has been proposed to be used as ambient displays in the public areas of various Mozilla offices, and as a general purpose manager for driving screens. To this end it is probably going to grow features such as remote control of clients, a more powerful API, and features to make it easy to manage remotely.

CrimsonTwins is open source, and can be found on GitHub. Pull requests are welcome, and if you want to chat about it, you can find me as mythmon in #bots on irc.mozilla.org.


  1. Due to bug 785310, Firefox allows sandboxed iframes with scripts enabled to directly access the parent document, which is a violation of the spec. Hopefully this bug will be fixed in the near future. 


Malicious Git Server

Posted on 2013-06-03 in coding, git, tools, security, and deployment.

Git is hard

Some time ago, I found myself in a debate on IRC regarding the security of git. On one side, my opponent argued that you could not trust git to reliably give you a particular version of your code in light of a malicious remote or a man-in-the-middle attack, even if you checked it out by a particular hash. I argued that because of the nature of how git stores revisions, even a change in the history of the repository would require breaking the security of the SHA1 hashes git uses, an unlikely event. We eventually came to agreement that if you get code via revision hashes, and not via branches or unsigned tags, you are not vulnerable to the kind of attack he was proposing.

This got me thinking about the security of git. About how it stores objects and builds a working directory. What if the contents of one of the object files changed? Git makes these files read only on the file system to prevent this kind of problem, but that is a weak protection. If the other end of your git clone is malicious, how much damage could they do? If there really is a security problem here, it means that a lot of deployment tools that rely on git telling the truth are vulnerable.

The malicious git

So I did an experiment. I created a repository in ~/tmp/malice/a. I checked in a file, good.txt, and put the word "good" in it. I committed, and it was good. For a sanity check, I cloned that repository to ~/tmp/malice/b. Everything looked as I expected. I deleted the clone and started fiddling with git's internals.

My first thought was to modify the object file that represented the tree, to try to replace the file with another one. Unfortunately, git's object files aren't packed in a human-readable way, so this didn't work out. After some more thought, I decided I could just modify the object file representing good.txt directly. Surely those are stored in a human-readable way.

Nope. File blobs are equally unreadable. It seems the only thing within my reach that could deal with them was git itself. Hmm. Do file blobs depend only on the file's content? I checked, in another throwaway repository: I made another good.txt with the same contents, and committed it. The hash was the same. This was what I needed to test out my theory of malice! So I made a second file, evil.txt, and checked it in to the throwaway repository.

I took the contents of the object file for evil.txt and replaced the object file for good.txt with them. The original repository still was unaware of my treachery: git status said all was well. Mischief managed!
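
A rough reconstruction of the swap, with made-up hashes (loose objects live at .git/objects/<first two hex digits>/<remaining 38 digits>):

$ cd ~/tmp/throwaway
$ git hash-object evil.txt
aaaa1111...
$ cd ~/tmp/malice/a
$ git hash-object good.txt
bbbb2222...
$ chmod u+w .git/objects/bb/bb2222...
$ cp ~/tmp/throwaway/.git/objects/aa/aa1111... .git/objects/bb/bb2222...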

What does the git think?

Next I cloned the modified repository. Alarmingly, no red flags were raised, and the exit code was 0. Opening good.txt revealed that the treachery had worked. It contained the text "evil", just like evil.txt. Uh oh. Surely git knows something is wrong, right? I ran git status in the cloned repository. modified: good.txt. Well, that is a start. But the return code was still 0. That means that git status can't help in our deploy scripts. Out of curiosity, I ran git diff, to see what git thinks was modified in the file. Nothing. Which makes sense. Git knows something is up because the hash of good.txt doesn't match its object id, but the contents match up, so it can't tell any more.

This is worrisome. To protect against malicious server or MITM attacks, there needs to be an automated way to detect this treachery. I looked around in git help. Nothing obvious. I delved deeper. I wondered if git gc would notice something? Nope. Status and Clone are already out. Repack? No dice. I started getting very worried by this point.

The solution

Then I found the command I needed: git fsck. It does just what you would expect it to. The name comes from the system utility by the same name, and it originally stood for "File System Check". After finding this command, I had hope. I ran it. It didn't light up in big flashing lights, but reading its output revealed "error: sha1 mismatch 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7". More importantly, the exit code of the command was 5. I don't know what 5 means, but I do know it isn't 0, so it is an error. Yes!

So the solution is to always check git fsck after cloning if you really must know that your code is what you intended to run. If you do not, you run the risk of getting code that could be entirely different from what you thought.
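
In a deploy script, that check might look like this (a sketch; the clone URL and directory are placeholders):

git clone https://example.com/some/repo.git app
cd app
git fsck || { echo "git fsck failed; refusing to deploy" >&2; exit 1; }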

A small comfort

Someone pointed out that the various remote protocols git uses would probably be a little pickier about what they accept, in case of network transmission errors. Luckily, this was true. I tested the http, git, and ssh protocols, and each of them raised an error on clone:

Cloning into './c'...
error: File 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7 has bad hash
fatal: missing blob object '12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7'
fatal: remote did not send all necessary objects
Unexpected end of command stream

The particular output varied a little with each protocol, but the result was the same. An error in the output, return code 128, and no repository cloned. This is good.

I feel that this is something that was improved recently, because when I originally did this experiment I remember the remote protocols printed an error, but did not have a non-zero exit code and still created the repository. Unfortunately I did not document this, so I'm not sure. Yay for continuous improvement and poor memories, I guess.

Conclusion

If you are cloning from an untrusted git server, and especially if you are cloning from an untrusted repository via the file protocol, run git fsck afterwards and check the exit code, to make sure everything is as it should be.


Sublime, urxvt, and nose-progressive

Posted on 2013-04-24 in coding, dotfiles, sublime, urxvt, nose-progressive, and python.

For many of my projects I use the excellent nose-progressive for running tests. Among other features, it prints out lines that are intended to help you jump straight to the code that caused the error. This works well for some workflows, but not mine. Here is an example of nose-progressive's output:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.
Generating sample data from wiki...
Done!

FAIL: kitsune.apps.wiki.tests.test_models:DocumentTests.test_document_is_template
  vim +44 apps/wiki/tests/test_models.py  # test_document_is_template
    assert 0
AssertionError

1438 tests, 1 failure, 0 errors, 5 skips in 115.0s

In particular, note the line that begins vim +44 apps/wiki.... It is indicating the file and line number where the error occurred, and if I were to copy that line and execute it in my shell, it would launch vim with the right file and location. Not bad! It chose vim because that is what I have $EDITOR set to.

Unfortunately, even though my $EDITOR is set to vim, I use Sublime for my day-to-day editing tasks. I like to keep $EDITOR set to vim, because it tends to be used in places where I don't want to escalate to Sublime, but in this case I really do want the GUI editor. So this feature of nose-progressive doesn't help me much.

So how can I get nose-progressive to be helpful? In the recent 1.5 release of nose-progressive, a feature to customize this line was added. Promising. Additionally, I use urxvt as my terminal, and with some configuring, it can open links when they are clicked on. A plan is beginning to form.

Configuring nose-progressive

First, I made nose-progressive output a line that will indicate that Sublime should be used to open the file, not vim. A quick trip to the documentation taught me that I can set the environment variable $NOSE_PROGRESSIVE_EDITOR_SHORTCUT_TEMPLATE to a template string to control this line. I set mine to:

{dim_format}subl://{path}:{line_number}{normal}{function_format}{hash_if_function}{function}{normal}

Quite a mouthful, but it gets the job done. This format string will print something visually resembling the old line, but with a custom format. In action, it looks like this:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.

FAIL: kitsune.apps.wiki.tests.test_models:DocumentTests.test_document_is_template
  subl:///home/mythmon/src/kitsune/apps/wiki/tests/test_models.py:44  # test_document_is_template
    assert 0
AssertionError

1 test, 1 failure, 0 errors in 0.0s

Awesome! Now to get the terminal to respond.

Configuring urxvt

I use a package called urxvt-perls to add features like clickable links to my terminal. I tweaked its config to make it recognize my custom Sublime links from above, as well as normal web links. This is the relevant snippet of my ~/.Xdefaults file:

URxvt.perl-ext-common: default,url-select
URxvt.url-select.launcher: urxvt_launcher
URxvt.url-select.underline: true
URxvt.url-select.button: 3
URxvt.matcher.pattern.1: \\b(subl://[^ ]+)\\b

Line by line:

  • The url-select add-on is loaded.
  • Set the launcher script to urxvt_launcher. More on this in a second.
  • Underline links when they are detected.
  • Use the right mouse button to open links.
  • Add an additional pattern to search for and make clickable.

Now when normal web links (like http://www.grinchcentral.com/) or my custom subl:// links are clicked, urxvt_launcher will be executed with the underlined text as $1.

The launcher

Bash is not my native language, but it seemed the appropriate tool for this job. I hacked together this script:

#!/bin/bash

if [[ $1 == 'subl://'* ]]; then
    # Strip the subl:// prefix, leaving "/path/to/file.py:44", which
    # subl understands as a file plus line number.
    path=${1#subl://}
    exec subl "$path"
else
    exec browser "$1"
fi

This seems to do the trick. If the "url" starts with the string subl://, the script strips off that prefix, leaving a path and line number, and execs Sublime with that information. Otherwise, it runs another script, browser, which is simply a symlink to whatever browser I'm using at the moment.

All of this combined makes nice, clickable links to exactly the line of code that is breaking my tests. Time will tell if this is useful, but if nothing else, it is quite neat.