Automasymbolic Differentiation

Automatic differentiation is one of those things that’s famous for not being as famous as it should be (uh..). It’s useful, it’s convenient, and yet fewer people know about it than one would think. This article (by one of the guys working on Venture) is the single best introduction you’ll probably find to AD, anywhere. It gives a wonderful introduction to the basics, the subtleties, and the gotchas of the subject. You should read it. I’m going to assume you have.

In particular, you should note this part:

[computing gradients] can be done — the method is called reverse mode — but it introduces both code complexity and runtime cost in the form of managing this storage (traditionally called the “tape”). In particular, the space requirements of raw reverse mode are proportional to the runtime of f.

In some applications this can be somewhat inconvenient - picture iterative gradient-based sampling algorithms like Hamiltonian Monte Carlo, its famous auto-tuning version NUTS, and Riemannian manifold variants. Tom noted in this reddit thread that symbolic differentiation - which doesn’t need to deal with tapes and infinitesimal-tracking - can often be orders of magnitude faster than AD for this kind of problem. When running these algos we calculate gradients many, many times, and the additional runtime cost of the reverse-mode AD dance can add up.

An interesting question is whether this can be mitigated at all, and to what degree. In particular: can we use automatic differentiation to implement efficient symbolic differentiation? Here’s a possible attack plan:

  • use an existing automatic differentiation implementation to calculate the gradient for some target function at a point
  • capture the symbolic expression for the gradient and optimize it by eliminating common subexpressions or whatnot
  • reify that expression as a function that we can evaluate for any input
  • voila, (more) efficient gradient

Ed Kmett’s ‘ad’ library is the best automatic differentiation library I know of, in terms of its balance between power and accessibility. It can be used on arbitrary Haskell types that are instances of the Num typeclass and can carry out automatic differentiation via a number of modes, so it’s very easy to get started with. Rather than rolling my own AD implementation to try to do this sort of thing, I’d much rather use his.

Start with a really basic language. In practice we’d be interested in working with more expressive things, but this is sort of the minimal interesting language capable of illustrating the problem:

import Numeric.AD

data Expr a =
    Lit a
  | Var String
  | Add (Expr a) (Expr a)
  | Sub (Expr a) (Expr a)
  | Mul (Expr a) (Expr a)
  deriving (Eq, Show)

instance Num a => Num (Expr a) where
  fromInteger = Lit . fromInteger
  e0 + e1 = Add e0 e1
  e0 - e1 = Sub e0 e1
  e0 * e1 = Mul e0 e1

We can already use Kmett’s ad library on these expressions to generate symbolic expressions. We just have to write functions we’re interested in generically (using +, -, and *) and then call diff or grad or whatnot on them with a concretely-typed argument. Some examples:

> :t diff (\x -> 2 * x ^ 2) 
diff (\x -> 2 * x ^ 2) :: Num a => a -> a

> diff (\x -> 2 * x ^ 2) (Lit 1)
Mul (Add (Mul (Lit 1) (Lit 1)) (Mul (Lit 1) (Lit 1))) (Lit 2)

> grad (\[x, y] -> 2 * x ^ 2 + 3 * y) [Lit 1, Lit 2]
[ Add (Lit 0) (Add (Add (Lit 0) (Mul (Lit 1) (Add (Lit 0) (Mul (Lit 2) (Add
  (Lit 0) (Mul (Lit 1) (Lit 1))))))) (Mul (Lit 1) (Add (Lit 0) (Mul (Lit 2) (Add
  (Lit 0) (Mul (Lit 1) (Lit 1)))))))
, Add (Lit 0) (Add (Lit 0) (Mul (Lit 3) (Add (Lit 0) (Mul (Lit 1) (Lit 1)))))
]

It’s really easy to extract a proper derivative/gradient ‘function’ by doing something like this:

> diff (\x -> 2 * x ^ 2) (Var "x")
Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2)

and then, given that expression, reifying a direct function for the derivative by substituting over the captured variable and evaluating everything. Here’s some initial machinery to handle that:

-- | Close an expression over some variable.
close :: Expr a -> String -> a -> Expr a
close (Add e0 e1) s x = Add (close e0 s x) (close e1 s x)
close (Sub e0 e1) s x = Sub (close e0 s x) (close e1 s x)
close (Mul e0 e1) s x = Mul (close e0 s x) (close e1 s x)

close (Var v) s x
  | v == s    = Lit x
  | otherwise = Var v

close e _ _ = e

-- | Evaluate a closed expression.
eval :: Num a => Expr a -> a
eval (Lit d) = d
eval (Var _) = error "expression not closed"
eval (Add e0 e1) = eval e0 + eval e1
eval (Sub e0 e1) = eval e0 - eval e1
eval (Mul e0 e1) = eval e0 * eval e1

So, using this on the example above yields

> let testExpr = diff (\x -> 2 * x ^ 2) (Var "x")
> eval . close testExpr "x" $ 1
4

and it looks like a basic toDerivative function for expressions could be implemented as follows:

-- | Do some roundabout AD.
toDerivative expr = eval . close diffExpr "x" where
  diffExpr = diff expr (Var "x")

But that’s a no go. ‘ad’ throws a type error, as presumably we’d be at risk of perturbation confusion:

Couldn't match expected type ‘AD
                                s (Numeric.AD.Internal.Forward.Forward (Expr c))
                              -> AD s (Numeric.AD.Internal.Forward.Forward (Expr c))’
            with actual type ‘t’
  because type variable ‘s’ would escape its scope
This (rigid, skolem) type variable is bound by
  a type expected by the context:
    AD s (Numeric.AD.Internal.Forward.Forward (Expr c))
    -> AD s (Numeric.AD.Internal.Forward.Forward (Expr c))
  at ParamBasic.hs:174:14-32
Relevant bindings include
  diffExpr :: Expr c (bound at ParamBasic.hs:174:3)
  expr :: t (bound at ParamBasic.hs:173:14)
  toDerivative :: t -> c -> c (bound at ParamBasic.hs:173:1)
In the first argument of ‘diff’, namely ‘expr’
In the expression: diff expr (Var "x")

Instead, we can use ad’s auto combinator to write an alternate eval function:

autoEval :: Mode a => String -> Expr (Scalar a) -> a -> a
autoEval x expr = (`go` expr) where
  go _ (Lit d) = auto d
  go v (Var s)
    | s == x    = v
    | otherwise = error "expression not closed"

  go v (Add e0 e1) = go v e0 + go v e1
  go v (Sub e0 e1) = go v e0 - go v e1
  go v (Mul e0 e1) = go v e0 * go v e1

and using that, implement a working toDerivative:

toDerivative :: Num a => String -> Expr (Expr a) -> Expr a
toDerivative v expr = diff (autoEval v expr) (Var v)

which, though it has a weird-looking type, typechecks and does the trick:

> let diffExpr = toDerivative "x" (Mul (Lit 2) (Mul (Var "x") (Var "x")))
Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2)

> eval . close diffExpr "x" $ 1
4

So now we have access to a reified AST for a derivative (or gradient), which can be tweaked and optimized as needed. Cool.
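
toDerivative above handles the single-variable case via diff, but the same trick should extend to grad if we thread an environment of variables through the evaluator. Here’s a rough sketch layered on the Expr type above (autoEvalEnv and toGradient are names of my own invention, not from the post or the ‘ad’ library):

import qualified Data.Map as Map

-- Evaluate an expression with several free variables against an
-- environment of Mode values (mirrors autoEval above).
autoEvalEnv :: Mode a => Map.Map String a -> Expr (Scalar a) -> a
autoEvalEnv env = go where
  go (Lit d)     = auto d
  go (Var s)     = Map.findWithDefault (error "unbound variable") s env
  go (Add e0 e1) = go e0 + go e1
  go (Sub e0 e1) = go e0 - go e1
  go (Mul e0 e1) = go e0 * go e1

-- Reify the gradient with respect to an ordered list of variable names.
toGradient :: Num a => [String] -> Expr (Expr a) -> [Expr a]
toGradient vs expr =
  grad (\xs -> autoEvalEnv (Map.fromList (zip vs xs)) expr) (map Var vs)

Something like toGradient ["x", "y"] (2 * Var "x" * Var "x" + 3 * Var "y") should then hand back the gradient as a list of expressions, one per variable, each of which can be closed and evaluated just like the single-variable case.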

The available optimizations depend heavily on the underlying language. For starters, there’s easy and universal stuff like this:

-- | Reduce superfluous expressions.
elimIdent :: (Num a, Eq a) => Expr a -> Expr a
elimIdent (Add (Lit 0) e) = elimIdent e
elimIdent (Add e (Lit 0)) = elimIdent e
elimIdent (Add e0 e1)     = Add (elimIdent e0) (elimIdent e1)

-- note that 0 - e is a negation rather than an identity, so only the
-- right zero can be dropped for subtraction
elimIdent (Sub e (Lit 0)) = elimIdent e
elimIdent (Sub e0 e1)     = Sub (elimIdent e0) (elimIdent e1)

elimIdent (Mul (Lit 1) e) = elimIdent e
elimIdent (Mul e (Lit 1)) = elimIdent e
elimIdent (Mul e0 e1)     = Mul (elimIdent e0) (elimIdent e1)

elimIdent e = e

Which lets us do some work up-front:

> let e = 2 * Var "x" ^ 2 + Var "x" ^ 4
Add (Mul (Lit 2) (Mul (Var "x") (Var "x"))) (Mul (Mul (Var "x") (Var "x"))
(Mul (Var "x") (Var "x")))

> let ge = toDerivative "x" e
Add (Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2))
(Add (Mul (Mul (Var "x") (Var "x")) (Add (Mul (Var "x") (Lit 1)) (Mul
(Lit 1) (Var "x")))) (Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1)
(Var "x"))) (Mul (Var "x") (Var "x"))))

> let geOptim = elimIdent ge
Add (Mul (Add (Var "x") (Var "x")) (Lit 2)) (Add (Mul (Mul (Var "x")
(Var "x")) (Add (Var "x") (Var "x"))) (Mul (Add (Var "x") (Var "x")) (Mul
(Var "x") (Var "x"))))

> eval . close geOptim "x" $ 1
8

But there are also some more involved optimizations that can be useful for some languages. The basic language I’ve been using above, for example, has no explicit support for sharing common subexpressions. You’ll recall from one of my previous posts that we have a variety of methods to do that in Haskell EDSLs, including some that allow sharing to be observed without modifying the underlying language. We can use data-reify, for example, to observe any implicit sharing in expressions:

> reifyGraph $ geOptim
let [(1,AddF 2 6),(6,AddF 7 10),(10,MulF 11 12),(12,MulF 4 4),(11,AddF 4 4),
(7,MulF 8 9),(9,AddF 4 4),(8,MulF 4 4),(2,MulF 3 5),(5,LitF 2),(3,AddF 4 4),
(4,VarF "x")] in 1
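
For completeness: the transcript above presupposes a pattern-functor rendition of Expr and a MuRef instance for it, neither of which appears in the post itself. The following is my own reconstruction; the ‘Sharing in Haskell EDSLs’ post below walks through the same machinery in more detail:

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Reify

-- Pattern functor for Expr: recursive positions become the parameter e.
data ExprF a e =
    LitF a
  | VarF String
  | AddF e e
  | SubF e e
  | MulF e e
  deriving (Eq, Show, Functor)

instance MuRef (Expr a) where
  type DeRef (Expr a)    = ExprF a
  mapDeRef f (Add e0 e1) = AddF <$> f e0 <*> f e1
  mapDeRef f (Sub e0 e1) = SubF <$> f e0 <*> f e1
  mapDeRef f (Mul e0 e1) = MulF <$> f e0 <*> f e1
  mapDeRef _ (Lit d)     = pure (LitF d)
  mapDeRef _ (Var s)     = pure (VarF s)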

And even make use of a handy library found on Hackage for performing common subexpression elimination on graphs returned by reifyGraph:

> cse <$> reifyGraph geOptim
let [(5,LitF 2),(1,AddF 2 6),(3,AddF 4 4),(6,AddF 7 10),(2,MulF 3 5),
(10,MulF 3 8),(8,MulF 4 4),(7,MulF 8 3),(4,VarF "x")] in 1

With an appropriate graph evaluator we can cut down the size of the syntax we have to traverse substantially.
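
For instance, building on the ExprF reconstruction above, a memoizing graph evaluator might look something like the following sketch (evalClosedGraph is my own name, and it assumes cse hands back a graph of the same shape, as the transcript suggests):

import qualified Data.Map as Map
import qualified Data.Reify as Reify

-- Evaluate a reified expression graph, closing the given variable over a
-- supplied value.  Results are shared through a lazily-built Map, so each
-- node is evaluated at most once.
evalClosedGraph :: Num a => String -> a -> Reify.Graph (ExprF a) -> a
evalClosedGraph x val (Reify.Graph env root) = results Map.! root where
  results = Map.fromList [ (k, node e) | (k, e) <- env ]
  node (LitF d)   = d
  node (VarF s)
    | s == x      = val
    | otherwise   = error "expression not closed"
  node (AddF l r) = results Map.! l + results Map.! r
  node (SubF l r) = results Map.! l - results Map.! r
  node (MulF l r) = results Map.! l * results Map.! r

If everything is wired up correctly, evalClosedGraph "x" 1 . cse <$> reifyGraph geOptim should come back with 8, matching the tree-based eval above.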

Happy automasymbolic differentiating!

Synchronous Remote Collaboration

I don’t really do a ton of pair programming. By the nature of my work and geographical location I tend to work i) largely independently ii) on small teams iii) consisting of people who live about a dozen time zones away from me.

I would like to do more, or at least try it out a bit more. In the past I’ve done it very occasionally while trying to pick up something completely new from an expert, but I think the experience would be more valuable while working on something familiar but tough, or even just for a change of pace now and then.

In the past I’ve used join.me for screen sharing plus voice. I found it to be lighter-weight and better quality than Skype. But that already seems to be a bit old-fashioned nowadays; I’ve since taken to using Hangouts for most of my calls, and I’d rather be able to work with people more interactively than screen sharing will allow.

Since servers/instances on your cloud provider of choice are so cheap and easy to manage these days, I decided to put together a quick pairing environment using Vagrant, AWS, Ansible, and tmux. The idea is that when you’d like to collaborate with someone or other, you just quickly fire up a cloud instance with all your favourite junk on it, ssh (or mosh) in, and get right to work in a familiar environment. Everyone connects to the same tmux session on an ephemeral cloud instance, which is then just halted or destroyed as soon as you get bored of each other’s company.

I’ve created a Github repo here that hopefully makes it quite easy to get this setup going. Feel free to use that as a base for your own setup if you’d like to try it for yourself. It’s even pretty much plug and play so long as you enjoy vim, tmux, and all of my configuration settings (ps, I re-bind the tmux prefix to C-s).

TODO: build a custom AMI to dodge the long provisioning waits.

Sharing in Haskell EDSLs

Lately I’ve been trying to do some magic by way of nonstandard interpretations of abstract syntax. One of the things that I’ve managed to grok along the way has been the problem of sharing in deeply-embedded languages.

Here’s a simple illustration of the ‘vanilla’ sharing problem by way of plain Haskell; a function that computes 2^n:

naiveTree :: (Eq a, Num a, Num b) => a -> b
naiveTree 0 = 1
naiveTree n = naiveTree (n - 1) + naiveTree (n - 1)

This naive implementation is a poor way to roll as it is exponentially complex in n. Look at how evaluation proceeds for something like naiveTree 4:

> naiveTree 4
> naiveTree 3 + naiveTree 3
> naiveTree 2 + naiveTree 2 + naiveTree 2 + naiveTree 2
> naiveTree 1 + naiveTree 1 + naiveTree 1 + naiveTree 1
  + naiveTree 1 + naiveTree 1 + naiveTree 1 + naiveTree 1
> naiveTree 0 + naiveTree 0 + naiveTree 0 + naiveTree 0
  + naiveTree 0 + naiveTree 0 + naiveTree 0 + naiveTree 0
  + naiveTree 0 + naiveTree 0 + naiveTree 0 + naiveTree 0
  + naiveTree 0 + naiveTree 0 + naiveTree 0 + naiveTree 0
> 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
> 16

Each recursive call doubles the number of function evaluations we need to make. Don’t wait up for naiveTree 50 to return a value.

A better way to write this function would be:

tree :: (Eq a, Num a, Num b) => a -> b
tree 0 = 1
tree n =
  let shared = tree (n - 1)
  in  shared + shared

Here we store solutions to subproblems, and thus avoid having to recompute things over and over. Look at how tree 4 proceeds now:

> tree 4
> let shared0 =
      let shared1 = 
          let shared2 =
              let shared3 = 1 
              in  shared3 + shared3
          in  shared2 + shared2
      in  shared1 + shared1
  in  shared0 + shared0
> let shared0 =
      let shared1 = 
          let shared2 = 2
          in  shared2 + shared2
      in  shared1 + shared1
  in  shared0 + shared0
> let shared0 =
      let shared1 = 4
      in  shared1 + shared1
  in  shared0 + shared0
> let shared0 = 8
  in  shared0 + shared0
> 16

You could say that Haskell’s let syntax enables sharing between computations; using it reduces the complexity of our tree implementation from $O(2^n)$ to $O(n)$. tree 50 now returns instantly:

> tree 50
1125899906842624

So let’s move everything over to an abstract syntax setting and see how the results translate there.

Let’s start with a minimalist language, known in some circles as Hutton’s Razor. While puny, it is sufficiently expressive to illustrate the subtleties of this whole sharing business:

data Expr =
    Lit Int
  | Add Expr Expr
  deriving (Eq, Ord, Show)

instance Num Expr where
  fromInteger = Lit . fromInteger
  (+)         = Add

eval :: Expr -> Int
eval (Lit d)     = d
eval (Add e0 e1) = eval e0 + eval e1

I’ve provided a Num instance so that we can conveniently write expressions in this language. We can use conventional notation and extract abstract syntax for free by specifying a particular type signature:

> 1 + 1 :: Expr
Add (Lit 1) (Lit 1)

And of course we can use eval to evaluate things:

> eval (1 + 1 :: Expr)
2

Due to the Num instance and the polymorphic definitions of naiveTree and tree, these functions will automatically work on our expression type. Check them out:

> naiveTree 2 :: Expr
Add (Add (Lit 1) (Lit 1)) (Add (Lit 1) (Lit 1))

> tree 2 :: Expr
Add (Add (Lit 1) (Lit 1)) (Add (Lit 1) (Lit 1))

Notice there’s a quirk here: each of these functions - having wildly different complexities - yields the same abstract syntax, implying that tree is no more efficient than naiveTree when it comes to dealing with this expression type. That means..

> eval (tree 50 :: Expr)
-- ain't happening

So there is a big problem here: Haskell’s let syntax doesn’t carry its sharing over to our embedded language. Equivalently, the embedded language is not expressive enough to represent sharing in its own abstract syntax.

There are a few ways to get around this.

Memoizing Evaluation

For some interpretations (like evaluation) we can use a memoization library. Here we can use Data.StableMemo to define a clean and simple evaluator:

import Data.StableMemo

memoEval :: Expr -> Int
memoEval = go where
  go = memo eval
  eval (Lit i)     = i
  eval (Add e0 e1) = go e0 + go e1

This will very conveniently handle any grimy details of caching intermediate computations. It passes the tree 50 test just fine:

> memoEval (tree 50 :: Expr)
1125899906842624

Some other interpretations are still inefficient; a similar memoPrint function will still dump out a massive syntax tree due to the limited expressiveness of the embedded language. The memoizer doesn’t really allow us to observe sharing, if we’re interested in doing that for some reason.
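
For the record, a memoPrint along those lines (my own sketch, not from the post) might look like this; the memoizer caches each sub-expression’s string, but String has no way to express sharing, so the output still explodes:

memoPrint :: Expr -> String
memoPrint = go where
  go = memo pr
  pr (Lit i)     = show i
  pr (Add e0 e1) = "(" ++ go e0 ++ " + " ++ go e1 ++ ")"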

Observing Implicit Sharing

We can actually use GHC’s internal sharing analysis to recover any implicit sharing present in an embedded expression. This is the technique introduced by Andy Gill’s Type Safe Observable Sharing In Haskell and implemented in the data-reify library on Hackage. It’s as technically unsafe as it sounds, but in practice has the benefits of being both relatively benign and minimally intrusive on the existing language.

Here is the extra machinery required to observe implicit sharing in our Expr type:

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE TypeFamilies #-}

import Control.Applicative
import Data.Reify hiding (Graph)
import qualified Data.Reify as Reify
import System.IO.Unsafe

data ExprF e =
    LitF Int
  | AddF e e
  deriving (Eq, Ord, Show, Functor)

instance MuRef Expr where
  type DeRef Expr        = ExprF
  mapDeRef f (Add e0 e1) = AddF <$> f e0 <*> f e1
  mapDeRef _ (Lit v)     = pure (LitF v)

We need to make Expr an instance of the MuRef class, which loosely provides a mapping between the Expr and ExprF types. ExprF itself is a so-called ‘pattern functor’, which is a parameterized type in which recursive points are indicated by the parameter. We need the TypeFamilies pragma for instantiating the MuRef class, and DeriveFunctor eliminates the need to manually instantiate a Functor instance for ExprF.

Writing MuRef instances is pretty easy. For more complicated types you can often use Data.Traversable.traverse in order to provide the required implementation for mapDeRef (example).
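
As a small illustration of the traverse trick (Rose and RoseF are made-up types, not from the post): for a node with an arbitrary number of children, all of the recursive positions can be handled by a single traversal. This assumes the same extensions and imports as the instance above:

import Data.Traversable (traverse)

-- a node with a list of children
data Rose = Rose Int [Rose]

-- its pattern functor
data RoseF e = RoseF Int [e]
  deriving (Show, Functor)

instance MuRef Rose where
  type DeRef Rose        = RoseF
  mapDeRef f (Rose v cs) = RoseF v <$> traverse f cs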

With this in place we can use reifyGraph from data-reify in order to observe the implicit sharing. Let’s try this on a bite-sized tree 2 and note that it is an IO action:

> reifyGraph (tree 2 :: Expr)
let [(1,AddF 2 2),(2,AddF 3 3),(3,LitF 1)] in 1

Here we get an abstract syntax graph - rather than a tree - and sharing has been made explicit.

We can write an interpreter for expressions by internally reifying them as graphs and then working on those. reifyGraph is an IO action, but since its effects are pretty tame I don’t feel too bad about wrapping calls to it in unsafePerformIO. Interpreting these graphs must be handled a little differently from interpreting a tree; a naive ‘tree-like’ evaluator will eliminate sharing undesirably:

naiveEval :: Expr -> Int
naiveEval expr = gEval reified where
  reified = unsafePerformIO $ reifyGraph expr
  gEval (Reify.Graph env r) = go r where
    go j = case lookup j env of
      Just (AddF a b) -> go a + go b
      Just (LitF d)   -> d
      Nothing         -> 0

This evaluator fails the tree 50 test:

> naiveEval (tree 50)
-- hang

We need to use a more appropriately graph-y method to traverse and interpret this (directed, acyclic) graph. Here’s an idea:

  • topologically sort the graph, yielding a linear ordering of vertices such that for every edge $u \to v$, $v$ is ordered before $u$.
  • iterate through the sorted vertices, interpreting them as desired and storing the interpretation
  • look up the previously-interpreted vertices as needed

We can use the Data.Graph module from the containers library to deal with the topological sorting and vertex lookups. The following graph-based evaluator gets the job done:

import Data.Graph
import Data.Maybe

graphEval :: Expr -> Int
graphEval expr = consume reified where
  reified = unsafePerformIO (toGraph <$> reifyGraph expr)
  toGraph (Reify.Graph env _) = graphFromEdges . map toNode $ env
  toNode (j, AddF a b) = (AddF a b, j, [a, b])
  toNode (j, LitF d)   = (LitF d, j, [])

consume :: Eq a => (Graph, Vertex -> (ExprF a, a, b), c) -> Int
consume (g, vmap, _) = go (reverse . topSort $ g) [] where
  go [] acc = snd $ head acc
  go (v:vs) acc =
    let nacc = evalNode (vmap v) acc : acc
    in  go vs nacc

evalNode :: Eq a => (ExprF a, b, c) -> [(a, Int)] -> (b, Int)
evalNode (LitF d, k, _)   _ = (k, d)
evalNode (AddF a b, k, _) l =
  let v = fromJust ((+) <$> lookup a l <*> lookup b l)
  in  (k, v)

In a serious implementation I’d want to use a more appropriate caching structure and avoid the partial functions like fromJust and head, but you get the point. In any case, this evaluator passes the tree 50 test without issue:

> graphEval (tree 50)
1125899906842624

Making Sharing Explicit

Instead of working around the lack of sharing in our language, we can augment it by adding the necessary sharing constructs. In particular, we can add our own let-binding that piggybacks on Haskell’s let. Here’s an enhanced language (using the same Num instance as before):

data Expr =
    Lit Int
  | Add Expr Expr
  | Let Expr (Expr -> Expr)

The new Let constructor implements higher-order abstract syntax, or HOAS. There are some immediate consequences of this: we can’t derive instances of our language for typeclasses like Eq, Ord, and Show, and interpreting everything becomes a bit more painful. But, we don’t need to make any use of data-reify in order to share expressions, since the language now handles that à la carte. Here’s an efficient evaluator:

eval :: Expr -> Int
eval (Lit d)     = d
eval (Add e0 e1) = eval e0 + eval e1
eval (Let e0 e1) =
  let shared = Lit (eval e0)
  in  eval (e1 shared)

In particular, note that we need a sort of back-interpreter to re-embed shared expressions into our language while interpreting them. Here we use Lit to do that, but this is more painful if we want to implement, say, a pretty printer; in that case we need a parser such that print (parse x) == x (see here).

We also can’t use the existing tree function. Here’s the HOAS equivalent, which is no longer polymorphic in its return type:

tree :: (Num a, Eq a) => a -> Expr
tree 0 = 1
tree n = Let (tree (n - 1)) (\shared -> shared + shared)

Using that, we can see that sharing is preserved just fine:

> eval (tree 50)
1125899906842624

Another way to make sharing explicit is to use a parameterized HOAS, known as PHOAS. This requires the greatest augmentation of the original language (recycling the same Num instance):

data Expr a =
    Lit Int
  | Add (Expr a) (Expr a)
  | Let (Expr a) (a -> Expr a)
  | Var a

eval :: Expr Int -> Int
eval (Lit d)     = d
eval (Var v)     = v
eval (Add e0 e1) = eval e0 + eval e1
eval (Let e f)   = eval (f (eval e))

Here we parameterize the expression type and add both Let and Var constructors to the language. Sharing expressions explicitly now takes a slightly different form than in the HOAS version:

tree :: (Num a, Eq a) => a -> Expr b
tree 0 = 1
tree n = Let (tree (n - 1)) ((\shared -> shared + shared) . Var)

The Var term wraps the intermediate computation, which is then passed to the semantics-defining lambda. Sharing is again preserved in the language:

> eval $ tree 50
1125899906842624

Here, however, we don’t need the same kind of back-interpreter that we did when using HOAS. It’s easy to write a pretty-printer that observes sharing, for example (from here):

text e = go e 0 where
  go (Lit j)     _ = show j
  go (Add e0 e1) c = "(Add " ++ go e0 c ++ " " ++ go e1 c ++ ")"
  go (Var x) _     = x
  go (Let e0 e1) c = "(Let " ++ v ++ " " ++ go e0 (c + 1) ++
                     " in " ++ go (e1 v) (c + 1) ++ ")"
    where v = "v" ++ show c

Which yields the following string representation of our syntax:

> putStrLn . text $ tree 2
(Let v0 (Let v1 1 in (Add v1 v1)) in (Add v0 v0))

Cluing up

I’ve gone over several methods of handling sharing in embedded languages: an external memoizer, observable (implicit) sharing, and explicit sharing by way of a HOAS or PHOAS let-binding added to the original language. Some may be more convenient than others, depending on what you’re trying to do.

I’ve dumped code for the minimal, HOAS, and PHOAS examples in some gists.

Sampling Functions and Generative Models

The term ‘probability distribution’ is actually pretty vague. A probability distribution is an abstract object that can be distinguished uniquely (i.e. from other probability distributions) by way of any of a bunch of concrete representations. Probability density or mass functions, moment generating functions, characteristic functions, cumulative distribution functions, random variables, and measures all reify a probability distribution as something tangible that can be worked with. Image measures in particular are sometimes called ‘distributions’, though they still just form a single possible reification of the underlying concept. In formal probability theory the term ‘law’ is often used to refer to the abstract object being characterized.

Different characterizations are useful when doing different kinds of work. Measures are useful when doing proofs. Probability densities and mass functions are useful for a whole whack of applications. Moment generating functions are useful for exercises in introductory mathematical statistics courses.

Sampling functions - procedures that produce samples from some target distribution - also characterize probability distributions uniquely. They are an excellent basis for introducing and transforming uncertainty in probabilistic programming languages. A sampling function takes as its sole input a stream of randomness, which is consumed and transformed to produce a sample from the distribution that it characterizes. The stream can be a lazy list or, similarly, a pseudo-random number generator.

Sampling functions are precursors to generative models: models that take both randomness and causal inputs as arguments and produce a possible effect. Generative models are the main currency of probabilistic programming; they specify a mapping between hypothesized causes and a probability distribution over effects. Sampling functions are the basis for handling all uncertainty in several PP implementations.

It’s useful to look at sampling functions and generative models to emphasize the distinction between the two. In Haskelly pseudocode, they have the following types:

samplingFunction :: Randomness -> Effect

generativeModel  :: Causes -> Randomness -> Effect

It’s easy to see that a generative model, when provided with some causes, is itself a sampling function.

We can use a monad to abstract out the provisioning of randomness and make everything a little cleaner. Imagine ‘Observable’ to be a monad that handles the propagation of randomness in our functions; any type tagged with ‘Observable’ is a probability distribution over some other type (and maybe we would run it with a function called observe or sample). Using that, we can write the above as follows:

samplingFunction :: Observable Effect

generativeModel  :: Causes -> Observable Effect

Very clean. Here it’s immediately clear that the only difference between a sampling function and a model is the introduction of causes to the latter. A generative model contains some internal logic that manipulates external causes in some way; a sampling function does not.
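
To make the distinction concrete outside of pseudocode, here’s a tiny sketch using System.Random as the stream of randomness (uniform01, linearModel, and someSampler are illustrative names of my own, not from any of the linked projects):

import System.Random

-- A sampling function: consume some randomness, produce a draw plus the
-- unconsumed remainder of the stream.
uniform01 :: StdGen -> (Double, StdGen)
uniform01 = randomR (0, 1)

-- A generative model: causes in, distribution over effects out.
linearModel :: (Double, Double) -> Double -> StdGen -> (Double, StdGen)
linearModel (a, b) x g =
  let (noise, g') = uniform01 g
  in  (a * x + b + noise, g')

-- Supplying causes to the model yields a plain sampling function again.
someSampler :: StdGen -> (Double, StdGen)
someSampler = linearModel (2, 1) 0.5

Running fst (someSampler (mkStdGen 42)) draws a single effect from the resulting distribution.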

You can see some in-progress development of this idea here and here.

Property Testing in Ruby

Testing properties of Haskell functions with QuickCheck is easy and pretty enjoyable. It turns out Ruby has a QuickCheck-like property testing library called rantly.

Let’s test the ‘greed game’ Ruby koan. In the greed game, one rolls up to five dice and calculates a score according to the following rules:

  • three ones is 1000 points
  • three of the same non-one number is worth 100 times that number
  • a one that is not a part of a set of three is worth 100 points
  • a five that is not a part of a set of three is worth 50 points
  • everything else is worth zero points

So, for example, a roll of 1, 1, 1, 5, 1 would yield 1150 points. A roll of 3, 4, 5, 3, 3 would yield 350 points.

The basic scoring mechanism can be implemented like so:

require 'rantly/property'

def score(ds)
  (ds.group_by { |i| i }.map { |k, v| score_group k, v.count }.reduce :+) || 0
end

def score_group(k, v)
  value_score  = { 1 => 100, 5 => 50}[k] || 0
  triple_score = k == 1 ? 1000 : k * 100

  v >= 3 ? triple_score + value_score * (v - 3) : value_score * v
end

Property tests require generators to cook up requisite input data. Some generators we might be interested in, to start, are those to create valid dice rolls:

valid_roll   = -> { Rantly { range 1, 6 } }
valid_number = -> { Rantly { range 0, 5 } }

valid_roll describes the result of an individual dice roll, while valid_number describes the number of dice that can be rolled for an input score. range generates a value between its provided arguments, inclusive. Rantly comes equipped with a bunch of primitive generators and combinators: choose, array, sized, etc.

We can use those primitive generators to create other generators. In particular, a valid input to the score function should be a 0-to-5 length array in which each element is between 1 and 6 inclusive; that is, an array with length generated from valid_number and elements generated from valid_roll.

Below, I’ll create that composite generator in an RSpec describe block and then test a couple of properties of the score function:

describe 'score' do

  let (:rolls) do
    ->(r) { sized(valid_number.call) { array { valid_roll.call } } }
  end

  it 'should be non-negative' do
    property_of(&rolls).check(750) do |r|
      expect(score r).to be >= 0
    end
  end

  it 'should be less than or equal to 1200' do
    property_of(&rolls).check(750) do |r|
      expect(score r).to be <= 1200
    end
  end

end

Running this code will test each of those properties on 750 random inputs generated by the rolls generator.

One might also want to test how functions behave on invalid input. Let’s augment the original scoring functions with some exception handling:

def score(ds)
  fail ArgumentError, 'invalid argument' unless ds.count.between?(0, 5)

  begin
    groups = ds.group_by { |i| i }
    (groups.map { |k, v| score_group k, v.count }.reduce :+) || 0
  rescue ArgumentError
    raise ArgumentError, 'invalid argument'
  end
end

def score_group(k, v)
  fail ArgumentError unless k.between?(1, 6)
  fail ArgumentError unless v.between?(1, 5)

  value_score  = { 1 => 100, 5 => 50 }[k] || 0
  triple_score = k == 1 ? 1000 : k * 100

  v >= 3 ? triple_score + value_score * (v - 3) : value_score * v
end

and then add generators for invalid rolls:

invalid_roll   = -> { Rantly { choose 0, 7 } }
invalid_number = -> { Rantly { choose 6, 10, 50 } }

then, with the addition of two new composite generators, we can test that the exception handling behaves as we expect:

describe 'score' do

  let(:rolls) do
    ->(r) { sized(valid_number.call) { array { valid_roll.call } } }
  end

  let(:rolls_invalid_number) do
    ->(r) { sized(invalid_number.call) { array { valid_roll.call } } }
  end

  let(:rolls_invalid_roll) do
    ->(r) { sized(valid_number.call) { array { invalid_roll.call } } }
  end

  it 'should be non-negative' do
    property_of(&rolls).check(750) do |r|
      expect(score r).to be >= 0
    end
  end

  it 'should be less than or equal to 1200' do
    property_of(&rolls).check(750) do |r|
      expect(score r).to be <= 1200
    end
  end

  it 'should raise an error for an invalid number of rolls' do
    property_of(&rolls_invalid_number).check(750) do |r|
      expect { score r }.to raise_error(ArgumentError)
    end
  end

  it 'should raise an error for an invalid roll value' do
    property_of(&rolls_invalid_roll).check(750) do |r|
      expect { score r }.to raise_error(ArgumentError) if r.count > 0
    end
  end

end

As often happens, property testing tends to suss out weird corner cases in your code and help you understand it better. Just while doing this example I realized that ArgumentError wouldn’t necessarily be thrown for an invalid roll value if the generated array of rolls happened to be empty. Hence, the addition of if r.count > 0 to the last test.

Property testing also subsumes unit tests. If you use a static or relatively static generator, you’re effectively doing unit testing. You can see this in the cases of the invalid_roll and invalid_number generators, in which each is generating inputs from only a very small domain.

IMO familiarity with a QuickCheck-like property testing library is good to have. Rantly is not quite QuickCheck, but it’s still a joy to use.

Notes on Another Guy’s Notes

I recently bought Ilya Grigorik’s High Performance Browser Networking, which is an excellent book. Also excellent is that Ilya wrote a great retrospective on his book-writing process.

He made a few noteworthy points about the whole endeavour:

  • A ‘shitty first draft’ is the initial goal of most any writing. Just get to the keyboard and start mashing the keys.

  • Consistency is key. Show up and get to work.

  • Early feedback is invaluable.

  • Incremental milestones and deadlines are helpful.

  • Writing is an excellent way to expose the initial sloppiness of one’s thinking.

I think these are all excellent insights, but the second and third ones really stand out to me.

Early and constant feedback is just really important. This is something that I’ve had to constantly remind myself of when working on largely-solo projects. Having others examine your work immediately gives you an idea of its promise. Are the other parties excited by it? Indifferent? Confused? Can they point out an area that you haven’t really understood all that well, or something important that you missed?

And above all else, consistency is sacrosanct. This idea is only getting reinforced with time.

The Unreasonable Effectiveness of Habit

Exactly 41 days ago I started a project called 750 Words, which is simply a habit of writing a meagre 750 words every day. They can be written on anything; just pick a topic (or several topics) in your head and get to writing about it.

When I originally started, I thought that this would be a great way to work on blog posts, research papers, my dissertation, and so on. Not so much the case, I’ve found. After much internal debate as to its merits (the subject of at least one entry), the best use I have found for 750 Words thus far has been a complete mind dump every morning over breakfast.

Initially I tried rigorously picking a topic and writing essays and technical entries or what have you, but this seemed to actually go against the spirit of the exercise. Nowadays I just open the browser and crank out whatever’s on my mind. It generally takes me about 15 minutes.

Why do I deem this to be a good use of my time? More than anything, I think, it has been by observing the results over time: I’ve sustained a streak now for 37 days straight, and quite enjoy waking up every day and putting another X on the calendar by virtue of writing another entry. I don’t want to stop, and indeed, don’t intend to. The main reward to me has been seeing a basic goal manifest as a string of X’s on a calendar; the little 750-word-minimum mind dumps (which constitute over 32,000 words now) are a bonus.

Measures and Continuations

I’ve always been interested in measure theory. I’m not even sure why, exactly. Probably because it initially seemed so mysterious to me. As an undergrad, measure theory was this unknown, exotic key to truly understanding probability.

Well, sort of. It’s certainly the key to understanding formal probability, but it no longer seems all that exotic, nor really necessary for grokking what I’d call the true essence of probability. It’s pretty much real analysis with specialized attention paid to notions of factoring (independence) and ratios (conditioning). Nowadays I relate it more to accounting; not the most inspiring of subjects, but necessary if you want to make sure everything adds up the way you think it should.

Basic EC2 Management With Ansible

EC2 is cool. The ability to dynamically spin up a whack of free-to-cheap server instances anywhere in the world at any time is.. well, pretty mean. Need to run a long computation job? Scale up a distributed system? Reduce latency to clients in a particular geographical region? YEAH WE CAN DO THAT.

The EC2 Management Console is a pretty great tool in and of itself. Well laid-out and very responsive. But for a hacker’s hacker, a great tool to manage EC2 instances (amongst other things) is Ansible, which provides a way to automate tasks over an arbitrary number of servers, concurrently.

With EC2 and Ansible you can rapidly find yourself controlling an amorphous, globally-distributed network of servers that are eager to do your bidding.

Again.. pretty mean.

Managing Learning as a Side Project

I regularly find myself wanting to learn too many things at once. This might be justifiable in some sense; there’s an awful lot out there to learn. Often, skimming over some topic or other feels like enough to develop a sufficiently high-level model of what it is or how it works. Armed with that (dangerously small amount of) knowledge, however, the urge to pick up some other topic tends to arise.. and so the process repeats.

This is all well and good in order to survey what’s out there, but left unchecked, a survey of topics is all one might get. Le dilettantisme can be understandable, but never desirable.

Some time ago, I decided to try restricting myself to learning only one particular topic for two weeks at a time, as a bit of a side project. Think of it as a ‘learning sprint’, if you will. The idea is that the time between iterations is sufficiently short to ensure that one can’t hunger too badly to switch to some other topic mid-sprint. At the same time, each iteration is lengthy enough to ensure a reasonable amount of immersion and depth.

I managed a single iteration, but due to travel and a lack of conviction about the whole thing, never started another. I think I’m going to start again, but with a little more intent this time.

Two weeks can be a long-as time for some topics, though, so I believe I’ll work in one-to-two week commitments, depending on the subject.

To start, I’m going to choose 0MQ, a framework that I sort-of know and definitely love. We’ve used it in production on a previous app I worked on, and I’ve even contributed to the official Ruby bindings. But, I still have a lot to learn about it.

So let’s see how it goes.