mythmon

Tracking Deploys in Git Log.

Posted on 2013-11-18 in tools, sumo, and git.

Keeping track of git across many environments can be hard. In particular, it can be hard to know where the server environments are in the git history, and how the rest of the world relates to that. I've set up a couple of interlocking gears of tooling that help me know what's going on.

Network

One thing that I love about GitHub is its network view, which gives a nice high-level overview of branches and forks in a project. One thing I don't like about it is that it only shows what is on GitHub, and it is a bit light on details. So I did some hunting, and I found a set of git options that does a pretty good job of replicating GitHub's network view.

$ git log --graph --all --decorate

I have this aliased to git net. Let's break it down:

  • git log - This shows the history of commits.
  • --graph - This adds lines between commits showing merging, branching, and all the rest of the non-linearity git allows in history.
  • --all - This shows all refs in your repo, instead of only your current branch.
  • --decorate - This shows the name of each ref next to its commit, like "origin/master" or "upstream/master".
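Setting the alias itself is one git config call. A self-contained sketch on a throwaway repo (use --global to apply it everywhere; the paths here are scratch directories):

```shell
# Create a scratch repo so the demo is self-contained
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo

# Alias "git net" to the full log invocation (local config for the demo)
git config alias.net "log --graph --all --decorate"

git config alias.net  # prints the alias body back
```

After this, `git net` behaves exactly like the long invocation above.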

This isn't that novel, but it is really nice. I often get asked what tool I'm using for this when I pull this up where other people can see it.

Cron Jobs

Having all the extra detail in my view of git's history is nice, but it doesn't help if I can only see what is on my laptop. I generally know what I've committed (on a good day), so the real goal here is to see what is in all of my remotes.

In practice, I only have this done for my main day-job project, so the update script is specific to that project. It could be expanded to all my git repos, but I haven't done that. To pull this off, I have this line in my crontab:

*/10 * * * * python2 /home/mythmon/src/kitsune/scripts/update-git.py

I'll get to the details of this script in the next section, but the important part is that it runs git fetch --all for the repo in question. To run this from a cron job, I had to switch all my remotes from the ssh protocol to https, since my SSH keys aren't unlocked when cron runs. Git knows the passwords to my https remotes thanks to its gnome-keychain integration, so this all works without user interaction.
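Switching a remote over is one git remote set-url call per remote. A sketch on a scratch repo, with hypothetical URLs standing in for the real remotes:

```shell
# Scratch repo so the demo is self-contained
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo

# Hypothetical ssh-style remote, as it might have been originally
git remote add origin git@github.com:example/kitsune.git

# Point it at the https equivalent so cron can fetch without unlocked keys
git remote set-url origin https://github.com/example/kitsune.git

git remote get-url origin
```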

This has the result of keeping git up to date on which refs exist in the world. I have my teammates' repos as remotes, as well as our central master. This makes it easier for me to see what is going on in the world.

Deployment Refs

The last bit of information I wanted to see in my local network is the state of deployment on our servers. We have three environments that run our code, and knowing what I'm about to deploy is really useful. If you look in the screenshot above, you'll notice a couple of refs that are likely unfamiliar: deployed/stage and deployed/prod, in green. This is the second part of the update-git.py script I mentioned above.

As a part of the SUMO deploy process, we put a file on each server that contains the currently deployed git sha. This script reads those files, and makes local refs in my git repo that correspond to them.

Wait, creates git refs from thin air? Yeah. This is a cool trick my friend Jordan Evans taught me about git. Since git's references are just files on the file system, you can make new ones easily. For example, in any git repo, the file .git/refs/heads/master contains a commit sha, which is how git knows where your master branch is. You could make new refs by editing these files manually, creating files and overwriting them to manipulate git. That's a little messy though. Instead we should use git's tools to do this.

Git provides git update-ref to manipulate refs. For example, to make my deployment refs, I run something like git update-ref refs/heads/deployed/prod 895e1e5ae. The last argument can be any sort of commit reference, including HEAD or branch names. If the ref doesn't exist, it will be created, and if you want to delete a ref, you can add -d. Cool stuff.
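The whole lifecycle can be seen on a throwaway repo; HEAD stands in here for whatever sha you want to record:

```shell
# Scratch repo with one commit to point refs at
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Create a ref named deployed/prod pointing at HEAD
git update-ref refs/heads/deployed/prod HEAD
sha=$(git rev-parse deployed/prod)   # same sha as HEAD

# Delete it again with -d
git update-ref -d refs/heads/deployed/prod
```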

All Together Now

Now, finally, the entire script. It uses a small git helper that I wrote, which I have omitted for space. It works how you would expect, translating git.log('some-branch', all=True) into git log --all some-branch. I made a gist of it for the curious.

The basic strategy is to fetch all remotes, then add or update the refs for the various server environments using git update-ref. This runs from cron every few minutes, and it makes knowing what is going on a little easier, and git in a distributed team a little nicer.

#!/usr/bin/env python

import os
import re
import subprocess

import requests

repo_dir = "{HOME}/src/kitsune".format(**os.environ)
environments = {
    'dev': 'http://support-dev.allizom.org/media/revision.txt',
    'stage': 'http://support.allizom.org/media/revision.txt',
    'prod': 'http://support.mozilla.org/media/revision.txt',
}

def main():
    cdpath = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
    os.chdir(cdpath)

    git = Git()  # thin wrapper around the git CLI, omitted for space (see gist)

    print(git.fetch(all=True))
    for env_name, revision_url in environments.items():
        try:
            cur_rev = git.rev_parse('deployed/' + env_name).strip()
        except subprocess.CalledProcessError:
            cur_rev = None
        new_rev = requests.get(revision_url).text.strip()

        if cur_rev != new_rev:
            old_rev = cur_rev[:8] if cur_rev else '(none)'
            print('updating %s %s -> %s' % (env_name, old_rev, new_rev[:8]))
            git.update_ref('refs/heads/deployed/' + env_name, new_rev)

if __name__ == '__main__':
    main()

That's It

The general idea is really easy:

  1. Fetch remotes often.
  2. Write down deployment shas.
  3. Actually look at it all.

The fact that it requires a little bit of cleverness, and a bit of git magic along the way, means it took some time to figure out. I think it was well worth it, though.


Localized search on SUMO

Posted on 2013-08-07 in projects, sumo, mozilla, and elasticsearch.

My primary project at work is SUMO, the Firefox support site. It consists of a few parts, including a wiki, a question/answer forum, and a customized Twitter client for helping people with Firefox. It is also a highly localized site, with support in the code for over 80 languages. We don't have a community in all of those languages, but should one emerge, we are ready to embrace it. In other words, we take localization seriously.

Until recently, however, this embrace of multilingual coding didn't extend to our search engine. Our search engine (based on ElasticSearch) assumed that all of the wiki documents, questions and answers, and forum posts were in English, and applied English-based tricks to improve search. No more! On Monday, I flipped the switch to enable locale-specific analyzers and query support in the search engine, and now many languages have improved search. Here, I will explain just what happened, and how we did it.

Background

Currently, we use two major tricks to improve search: stemming and stop words. These help the search engine behave in a way that is more consistent with how we understand language, generally.

Stemming

Stemming is recognizing that words like "jump", "jumping", "jumped", and "jumper" are all related. They all stem from the common word "jump". In our search engine, this is done by enabling the ElasticSearch Snowball analyzer, which uses the Porter stemming algorithm.

Unfortunately, Porter is English-specific, because it stems algorithmically based on patterns in English, such as removing trailing "ing", "ed", or "er". The algorithm is much more complicated than that, but the point is, it really only works for English.

Stop Words

Stop words are words like "a", "the", "I", or "we" that generally carry little information in regards to a search engine's behavior. ES includes a list of these words, and removes them intelligently from search queries.

Analysis

ES is actually a very powerful system that can be used for many different kinds of search tasks (as well as other data slicing and dicing). One of the more interesting features that makes it more than just full-text search is its analyzers. There are many built-in analyzers, and there are ways to recombine analyzers and parts of analyzers to build custom behavior. If you really need something special, you could even write a plugin to add new behavior, but that requires writing Java, so let's not go there.

The goal of analysis is to take a stream of characters and create a stream of tokens out of them. Stemming and stop words both play into this process. These modifications actually change the document that gets inserted into the ES index, so we will have to take that into account later. If we insert a document containing "the dog jumped" into the index, it would get indexed as something like

[
    {"token": "dog", "start": 4, "end": 7},
    {"token": "jump", "start": 8, "end": 14}
]

This isn't exactly what ES would return, but it is close enough. Note how the tokens inserted are the post-analysis versions, which include the changes made by the stop word and stemming token filters. That means the analysis process is language-specific, so we need to change the analyzer depending on the language. Easy, right? Actually, yes. It consists of a few parts.

Choosing a language

SUMO is a Django app, so in settings.py, we define a map of languages to ES analyzers, like this (except with a lot more languages):

ES_LOCALE_ANALYZERS = {
    'en-US': 'snowball',
    'es': 'snowball-spanish',
}

Note: snowball-spanish is simply the normal Snowball analyzer with an option of {"language": "Spanish"}.
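That analyzer is not built in; it has to be declared in the index settings. Roughly, the declaration might look like the following; the name snowball-spanish comes from the post, but the exact JSON layout depends on your ElasticSearch version, so treat this as a sketch:

```json
{
    "settings": {
        "analysis": {
            "analyzer": {
                "snowball-spanish": {
                    "type": "snowball",
                    "language": "Spanish"
                }
            }
        }
    }
}
```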

Then we use this helper function to pick the right analyzer based on a locale, with a fallback. This also takes into account the possibility that some ES analyzers are located in plugins which may not be available.

def es_analyzer_for_locale(locale, fallback="standard"):
    """Pick an appropriate analyzer for a given locale.

    If no analyzer is defined for `locale`, return `fallback` instead,
    which defaults to the ES analyzer named "standard".
    """
    analyzer = settings.ES_LOCALE_ANALYZERS.get(locale, fallback)

    if (not settings.ES_USE_PLUGINS and
            analyzer in settings.ES_PLUGIN_ANALYZERS):
        analyzer = fallback

    return analyzer

Indexing

Next, the mapping needs to be modified. Prior to this change, we explicitly listed the analyzer for all analyzed fields, such as the document content or document title. Now, we leave off the analyzer, which causes it to use the default analyzer.

Finally, we can set the default analyzer on a per document basis, by setting the _analyzer field when indexing it into ES. This ends up looking something like this (this isn't the real code, because the real code is much longer for uninteresting reasons):

def extract_document(obj):
    return {
        'document_title': obj.title,
        'document_content': obj.content,
        'locale': obj.locale,
        '_analyzer': es_analyzer_for_locale(obj.locale),
    }

Searching

This is all well and good, but what use is an index of documents if you can't query it correctly? Let's consider an example. If there is a wiki document with the title "Deleting Cookies", and a user searches for "how to delete cookies", here is what happens:

First, the document would have been indexed and analyzed, producing this:

[
    {"token": "delet", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

So now, if we try to query "how to delete cookies", nothing will match! That is because we need to analyze the search query as well (ES does this by default). Analyzing the search query results in:

[
    {"token": "how", "start": 0, "end": 3},
    {"token": "delet", "start": 7, "end": 13},
    {"token": "cooki", "start": 14, "end": 21}
]

Excellent! This will match the document's title pretty well. Remember that ElasticSearch doesn't require 100% of the query to match. It simply finds the best matches available, which can be confusing in edge cases, but in the normal case it works out quite well.

There is an issue though. Let's try this example in Spanish. Here is the document title "Borrando Cookies", as analyzed by our analysis process from above.

[
    {"token": "borr", "start": 0, "end": 8},
    {"token": "cooki", "start": 9, "end": 16}
]

and the search "como borrar las cookies":

[
    {"token": "como", "start": 0, "end": 4},
    {"token": "borrar", "start": 6, "end": 11},
    {"token": "las", "start": 12, "end": 15},
    {"token": "cooki", "start": 16 "end": 23}
]

... Not so good. In particular, "borrar", another form of the verb in "Borrando" from the title, got analyzed as English, and so didn't get stemmed correctly. It won't match the token borr that was generated in the analysis of the document. So clearly, searches need to be analyzed the same way as documents.

Luckily in SUMO we know what language the user (probably) wants, because the interface language will match. So if the user has a Spanish interface, we assume that the search is written in Spanish.

The original query that we use to do searches looks something like this much abbreviated sample:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies"
            }
        }
    }
}

The new query includes an analyzer field on the text match:

{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies",
                "analyzer": "snowball-spanish"
            }
        }
    }
}

This will result in the correct analyzer being used at search time.

Conclusion

This took me about three weeks off and on to develop, plus some chats with ES developers on the subject. Most of that time was spent researching and thinking about the best way to do localized search. Alternatives include having lots of fields, like document_title_es, document_title_de, etc., which seems icky to me, or using multifields to achieve a similar result. Another proposed idea was to use a different ES index for each language. Ultimately I settled on the approach outlined above.

For the implementation, modifying the indexing method to insert the right data into ES was the easy part, and I knocked it out in an afternoon. The difficult part was modifying our search code, working with the library we use to interact with ES to get it to support search analyzers, testing everything, and debugging the code that broke when this change was made. Overall, I think that the task was easier than we had expected when we wrote it down in our quarterly goals, and I think it went well.

For more nitty-gritty details, you can check out the two commits to the mozilla/kitsune repo that I made these changes in: 1212c97 and 0040e6b.


The Crimson Twins

Posted on 2013-06-13 in projects, crimson twins, and mozilla.

Crimson Twins is a project that I started at Mozilla to power two ambient displays that we mounted on the wall in the Mountain View Mozilla office. We use it to show dashboards, such as the current status of the sprint or the state of GitHub. We also use it to share useful sites, such as posting a YouTube video that is related to the current conversation. Most of the time, though, I pay attention to the Twins when my coworkers and I post amusing animated GIFs to distract and amuse the rest of the office.

Technical Details

Crimson Twins is a Node.JS app that is currently running on Mozilla's internal PaaS, an instance of Stackato. It uses Socket.io for real-time two-way communication between the server and the client. Socket.io uses WebSockets, long polling, Flash sockets, or one of a myriad of other techniques, depending on what the browser supports.

The architecture of the system is that there are a small number of virtual screens, which represent targets to send content to. Each client connected to the server can choose one or more of these virtual screens to display. A client can be any modern web browser, and I have used Firefox for Desktop and Android, Safari on iOS, and Chrome on desktop without any trouble.

Because of this setup, remote Mozillians can connect to the server and load up the same things that are shown on the TVs in the office. Put another way, Crimson Twins is remotie friendly, and people can play along at home.

Content Handling

Content is displayed with one of two mechanisms. For images, the content is loaded as a background of a div. Originally img tags were used, but it was difficult to style them this way. The switch to divs made it much easier to zoom the image to full screen without using JavaScript.

For content that is not images, a sandboxed iframe is used. This allows most sites to be shown with Crimson Twins, and the sandboxing prevents malicious sites from hijacking the Crimson Twins container [1]. This means that sites that disallow framing cannot be used with the system, but after much brainstorming we have yet to find a satisfactory way to get around this. Luckily most sites don't worry about iframes, so this isn't normally a huge annoyance.

For every URL that is sent to the screens, the server first makes a HEAD request to the requested resource. A few things happen with this information. First, it is used to determine whether the URL is an image, by examining the content type. Second, the server examines the headers to find problems, such as X-Frame-Options: deny, server errors, or malformed URLs, and it provides useful error messages if something like that happens.

Additionally, the requested URLs can go through various transformations. For example, if a link to an Imgur page is posted, the server will transform the URL into the URL for the image on the page. A link to an XKCD comic page will query the XKCD API for the URL of the image for that comic. This mechanism also allows for blacklisting of various content.

What's with the name?

The name of the project is a little silly, and is (I'm told) a reference to the old G.I. Joe cartoons. Among the enemies in the show were the twins Tomax and Xamot, collectively known as the Crimson Twins. The name was proposed humorously, but I decided to keep it, and now I rarely think about the cartoon series anymore.

The theme has provided enough related names that the projects that have sprung up around it have been easy to name, such as the Crimson Guard Commanders, the IRC name for one of the bots that interfaces with the API; and Extensive Enterprises, a web-based camera-to-Imgur-to-IRC-to-CrimsonTwins roundabout way to post photos to the screens.

The Future

Crimson Twins has been proposed to be used as ambient displays in the public areas of various Mozilla offices, and as a general purpose manager for driving screens. To this end it is probably going to grow features such as remote control of clients, a more powerful API, and features to make it easy to manage remotely.

CrimsonTwins is open source, and can be found on GitHub. Pull requests are welcome, and if you want to chat about it, you can find me as mythmon in #bots on irc.mozilla.org.


  1. Due to bug 785310, Firefox allows sandboxed iframes with scripts enabled to directly access the parent document, which is a violation of the spec. Hopefully this bug will be fixed in the near future. 


Malicious Git Server

Posted on 2013-06-03 in coding, git, tools, security, and deployment.

Git is hard

Some time ago, I found myself in a debate on IRC regarding the security of git. On one side, my opponent argued that you could not trust git to reliably give you a particular version of your code in light of a malicious remote or a man-in-the-middle attack, even if you checked it out by a particular hash. I argued that because of the way git stores revisions, even a change in the history of the repository would require breaking the security of the SHA1 hashes git uses, an unlikely event. We eventually came to agreement that if you get code via revision hashes, and not via branches or unsigned tags, you are not vulnerable to the kind of attack he was proposing.

This got me thinking about the security of git. About how it stores objects and builds a working directory. What if the contents of one of the object files changed? Git makes these files read only on the file system to prevent this kind of problem, but that is a weak protection. If the other end of your git clone is malicious, how much damage could they do? If there really is a security problem here, it means that a lot of deployment tools that rely on git telling the truth are vulnerable.

The malicious git

So I did an experiment. I created a repository in ~/tmp/malice/a. I checked in a file good.txt, and put the word "good" in it. I committed, and it was good. For a sanity check, I cloned that repository to ~/tmp/malice/b. Everything looked as I expected. I deleted the clone, and started fiddling with git's internals.

My first thought was to modify the object file that represented the tree, to try to swap the file out for another one. Unfortunately, git's object files aren't stored in a human readable way, so this didn't work out. After some more thought, I decided I could just modify the object file representing good.txt directly. Surely those are stored in a human readable way.

Nope. File blobs are equally unreadable. It seems the only thing within my reach that could deal with them was git itself. Hmm. Do file blobs depend only on the file's content? I checked in another throwaway repository: I made another good.txt with the same contents, and committed it. The hash was the same. This was what I needed to test my theory of malice! So I made a second file, evil.txt, and checked it in to the throwaway repository.

I took the contents of the object file for evil.txt and replaced the object file for good.txt with them. The original repository was still unaware of my treachery: git status said all was well. Mischief managed!
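The whole experiment can be reproduced with a short script. Everything below happens in a scratch repo, and the shas are whatever your git computes; git hash-object -w is used as a shortcut to produce the "evil" blob instead of a second checked-in file:

```shell
# Scratch repo with one good file committed
tmp=$(mktemp -d)
cd "$tmp"
git init -q a
cd a
echo good > good.txt
git add good.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "it was good"

# Find good.txt's blob, and write an "evil" blob into the object store
good_sha=$(git hash-object good.txt)
evil_sha=$(echo evil | git hash-object -w --stdin)

# Overwrite good.txt's object file with the evil blob's bytes
good_obj=".git/objects/${good_sha:0:2}/${good_sha:2}"
evil_obj=".git/objects/${evil_sha:0:2}/${evil_sha:2}"
chmod u+w "$good_obj"
cp "$evil_obj" "$good_obj"

git status --short                        # prints nothing: treachery unnoticed
git fsck || echo "fsck detected corruption"
```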

What does the git think?

Next I cloned the modified repository. Alarmingly, no red flags were raised, and the exit code was 0. Opening good.txt revealed that the treachery had worked: it contained the text "evil", just like evil.txt. Uh oh. Surely git knows something is wrong, right? I ran git status in the cloned repository. modified: good.txt. Well, that is a start. But the return code was still 0, which means git status can't help in our deploy scripts. Out of curiosity, I ran git diff to see what git thinks was modified in the file. Nothing. Which makes sense: git knows something is up because the hash of good.txt doesn't match its object id, but the contents match up, so it can't tell any more.

This is worrisome. To protect against a malicious server or MITM attacks, there needs to be an automated way to detect this treachery. I looked around in git help. Nothing obvious. I delved deeper. Would git gc notice something? Nope. Status and clone are already out. Repack? No dice. I was getting very worried by this point.

The solution

Then I found the command I needed: git fsck. It does just what you would expect it to. The name comes from the system utility of the same name, which originally stood for "file system check". After finding this command, I had hope. I ran it. It didn't light up in big flashing lights, but reading its output revealed "error: sha1 mismatch 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7". More importantly, the exit code of the command was 5. I don't know what 5 means, but I do know it isn't 0, so it is an error. Yes!

So the solution is to always check git fsck after cloning if you really must know that your code is what you intended to run. If you do not, you run the risk of getting code that could be entirely different from what you thought.
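A deploy script might encode that check like this; the local upstream repo below is a stand-in for whatever untrusted remote you are actually cloning:

```shell
# Scratch stand-in for the untrusted remote
tmp=$(mktemp -d)
cd "$tmp"
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "release"

# Clone, then refuse to proceed unless the objects check out
git clone -q upstream deploy
if ! git -C deploy fsck --strict >/dev/null 2>&1; then
    echo "refusing to deploy: repository failed git fsck" >&2
    exit 1
fi
echo "repository verified"
```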

A small comfort

Someone pointed out that the various remote protocols git uses would probably be a little pickier about what they got, to guard against network transmission errors. Luckily, this was true. I tested the http, git, and ssh protocols, and each of them raised an error on clone:

Cloning into './c'...
error: File 12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7 has bad hash
fatal: missing blob object '12799ccbe7ce445b11b7bd4833bcc2c2ce1b48b7'
fatal: remote did not send all necessary objects
Unexpected end of command stream

The particular output varied a little with each protocol, but the result was the same. An error in the output, return code 128, and no repository cloned. This is good.

I feel that this is something that was improved recently, because when I originally did this experiment I remember the remote protocols printed an error, but did not have a non-zero exit code and still created the repository. Unfortunately I did not document this, so I'm not sure. Yay for continuous improvement and poor memories, I guess.

Conclusion

If you are cloning from an untrusted git server, and especially if you are cloning from an untrusted repository via the file protocol, run git fsck afterwards and check the error code, to make sure everything is as it should be.


Sublime, urxvt, and nose-progressive

Posted on 2013-04-24 in coding, dotfiles, sublime, urxvt, nose-progressive, and python.

For many of my projects I use the excellent nose-progressive for running tests. Among other features, it prints out lines intended to help you jump straight to the code that caused the error. This works well for some workflows, but not mine. Here is an example of nose-progressive's output:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.
Generating sample data from wiki...
Done!

FAIL: kitsune.apps.wiki.tests.test_models:DocumentTests.test_document_is_template
  vim +44 apps/wiki/tests/test_models.py  # test_document_is_template
    assert 0
AssertionError

1438 tests, 1 failure, 0 errors, 5 skips in 115.0s

In particular, note the line that begins vim +44 apps/wiki.... It is indicating the file and line number where the error occurred, and if I were to copy that line and execute it in my shell, it would launch vim with the right file and location. Not bad! It chose vim because that is what I have $EDITOR set to.

Unfortunately, even though my $EDITOR is set to vim, I use Sublime for my day-to-day editing tasks. I like to keep $EDITOR set to vim, because it tends to be used in places where I don't want to escalate to Sublime, but in this case I really do want the GUI editor. So this feature of nose-progressive doesn't help me much.

So how can I get nose-progressive to be helpful? In the recent 1.5 release of nose-progressive, a feature to customize this line was added. Promising. Additionally, I use urxvt as my terminal, and with some configuring, it can open links when they are clicked on. A plan is beginning to form.

Configuring nose-progressive

First, I made nose-progressive output a line indicating that Sublime, not vim, should be used to open the file. A quick trip to the documentation taught me that I can set the environment variable $NOSE_PROGRESSIVE_EDITOR_SHORTCUT_TEMPLATE to a template string of my choosing:

{dim_format}subl://{path}:{line_number}{normal}{function_format}{hash_if_function}{function}{normal}
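I export the variable from my shell startup file so nose-progressive picks it up; the template string is the one shown above:

```shell
# Goes in ~/.bashrc (or similar) so test runs inherit it
export NOSE_PROGRESSIVE_EDITOR_SHORTCUT_TEMPLATE='{dim_format}subl://{path}:{line_number}{normal}{function_format}{hash_if_function}{function}{normal}'
```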

Quite a mouthful, but it gets the job done. This format string will print something visually resembling the old line, but with a custom format. In action, it looks like this:

Reusing old database "test_kitsune". Set env var FORCE_DB=1 if you need fresh DBs.

FAIL: kitsune.apps.wiki.tests.test_models:DocumentTests.test_document_is_template
  subl:///home/mythmon/src/kitsune/apps/wiki/tests/test_models.py:44  # test_document_is_template
    assert 0
AssertionError

1 test, 1 failure, 0 errors in 0.0s

Awesome! Now to get the terminal to respond.

Configuring urxvt

I use a package called urxvt-perls to add features like clickable links to my terminal. I tweaked its config to make it recognize my custom Sublime links from above, as well as normal web links. This is the relevant snippet of my ~/.Xdefaults file:

URxvt.perl-ext-common: default,url-select
URxvt.url-select.launcher: urxvt_launcher
URxvt.url-select.underline: true
URxvt.url-select.button: 3
URxvt.matcher.pattern.1: \\b(subl://[^ ]+)\\b

Line by line:

  • The url-select add-on is loaded.
  • Set the launcher script to urxvt_launcher. More on this in a second.
  • Underline links when they are detected.
  • Use the right mouse button to open links.
  • Add an additional pattern to search for and make clickable.

Now when a normal web link (like http://www.grinchcentral.com/) or one of my custom subl:// links is clicked, urxvt_launcher is executed with the underlined text as $1.

The launcher

Bash is not my native language, but it seemed the appropriate tool for this job. I hacked together this script:

#!/bin/bash

if [[ $1 == subl://* ]]; then
    # Strip the subl:// prefix, leaving a path:line target for subl
    target=${1#subl://}
    exec subl "$target"
else
    exec browser "$1"
fi

This seems to do the trick. If the "url" starts with subl://, the script strips off that scheme and hands the remaining path and line number to Sublime. Otherwise, it runs another script, browser, which is simply a symlink to whatever browser I'm using at the moment.

All of this combined makes nice, clickable links to exactly the line of code that is breaking my tests. Time will tell if this is useful, but if nothing else, it is quite neat.