<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Martin Modrák</title>
    <description>The latest articles on Forem by Martin Modrák (@martinmodrak).</description>
    <link>https://forem.com/martinmodrak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2517%2F9483603.jpeg</url>
      <title>Forem: Martin Modrák</title>
      <link>https://forem.com/martinmodrak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/martinmodrak"/>
    <language>en</language>
    <item>
      <title>Optional Parameters/Data in Stan</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Tue, 24 Apr 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/martinmodrak/optional-parametersdata-in-stan-4o33</link>
      <guid>https://forem.com/martinmodrak/optional-parametersdata-in-stan-4o33</guid>
      <description>

&lt;p&gt;Sometimes you are developing a &lt;a href="http://mc-stan.org"&gt;Stan&lt;/a&gt; statistical model that has multiple variants: maybe you want to consider several different link functions somewhere deep in your model, or you want to switch between estimating a quantity and getting it as data or something completely different. In these cases, you might have wanted to use optional parameters and/or data that apply only to some variants of your model. Sadly, Stan does not support this feature directly, but you can implement it yourself with just a bit of additional code. In this post I will show how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Base Model
&lt;/h2&gt;

&lt;p&gt;Let’s start with a very simple model: just estimating the mean and standard deviation of a normal distribution:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight mosel"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rstan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knitr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;detectCores&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;rstan_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auto_write&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3145678&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model_fixed_code&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="s2"&gt;"
data {
  int N;
  vector[N] X;
}

parameters {
  real mu;
  real&amp;lt;lower=0&amp;gt; sigma; 
}

model {
  X ~ normal(mu, sigma);

  //And some priors
  mu ~ normal(0, 10);
  sigma ~ student_t(3, 0, 1);
}

"&lt;/span&gt;

&lt;span class="n"&gt;model_fixed&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;stan_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_code&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_fixed_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And let’s simulate some data and see that it fits:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mu_true = 8
sigma_true = 2
N = 10
X &amp;lt;- rnorm(N, mean = mu_true, sd = sigma_true)

data_fixed &amp;lt;- list(N = N, X = X)
fit_fixed &amp;lt;- sampling(model_fixed, data = data_fixed, iter = 500)
summary(fit_fixed, probs = c(0.1, 0.9))$summary %&amp;gt;% kable()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;se_mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;th&gt;10%&lt;/th&gt;
&lt;th&gt;90%&lt;/th&gt;
&lt;th&gt;n_eff&lt;/th&gt;
&lt;th&gt;Rhat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mu&lt;/td&gt;
&lt;td&gt;7.855031&lt;/td&gt;
&lt;td&gt;0.0256139&lt;/td&gt;
&lt;td&gt;0.5632183&lt;/td&gt;
&lt;td&gt;7.162485&lt;/td&gt;
&lt;td&gt;8.548415&lt;/td&gt;
&lt;td&gt;483.5059&lt;/td&gt;
&lt;td&gt;1.007501&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sigma&lt;/td&gt;
&lt;td&gt;1.774158&lt;/td&gt;
&lt;td&gt;0.0206974&lt;/td&gt;
&lt;td&gt;0.4400573&lt;/td&gt;
&lt;td&gt;1.302616&lt;/td&gt;
&lt;td&gt;2.350727&lt;/td&gt;
&lt;td&gt;452.0508&lt;/td&gt;
&lt;td&gt;1.003409&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lp__&lt;/td&gt;
&lt;td&gt;-12.103350&lt;/td&gt;
&lt;td&gt;0.0555738&lt;/td&gt;
&lt;td&gt;1.1132479&lt;/td&gt;
&lt;td&gt;-13.664610&lt;/td&gt;
&lt;td&gt;-11.091775&lt;/td&gt;
&lt;td&gt;401.2768&lt;/td&gt;
&lt;td&gt;1.004955&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Now With Optional Parameters
&lt;/h2&gt;

&lt;p&gt;Let’s say we now want to handle the case where the standard deviation is known. Obviously we could write a new model. But what if the full model has several hundred lines and the only thing we want to change is to let the user specify the known standard deviation? The simplest solution is to keep all parameters/data needed by any of the variants lying around and use &lt;code&gt;if&lt;/code&gt; conditions in the model block to ignore some of them, but that is a bit unsatisfactory (and the unused parameters may in some cases hinder sampling).&lt;/p&gt;
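&lt;p&gt;For illustration, the naive “keep everything lying around” approach would look roughly like this - just a sketch, not the recommended solution:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data {
  int N;
  vector[N] X;
  int&amp;lt;lower = 0, upper = 1&amp;gt; sigma_known;
  real&amp;lt;lower=0&amp;gt; sigma_data; //always present, ignored when sigma_known is FALSE
}

parameters {
  real mu;
  real&amp;lt;lower=0&amp;gt; sigma_param; //always present, unused when sigma_known is TRUE
}

model {
  if (sigma_known) {
    X ~ normal(mu, sigma_data);
  } else {
    X ~ normal(mu, sigma_param);
    sigma_param ~ student_t(3, 0, 1);
  }
  mu ~ normal(0, 10);

  //when sigma_known is TRUE, sigma_param is still sampled but has
  //no prior and a flat unbounded posterior - one way unused
  //parameters hinder sampling
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;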

&lt;p&gt;For a better solution, we can take advantage of the fact that Stan allows zero-sized arrays/vectors and features the &lt;em&gt;ternary operator&lt;/em&gt; &lt;code&gt;?&lt;/code&gt;. The ternary operator has the syntax &lt;code&gt;(condition) ? (true value) : (false value)&lt;/code&gt; and works like an &lt;code&gt;if - else&lt;/code&gt; statement, but within an expression. The last piece of the puzzle is that Stan allows the sizes of data and parameter arrays to depend on arbitrary expressions computed from data. The model that can handle both known and unknown standard deviation follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight mosel"&gt;&lt;code&gt;&lt;span class="n"&gt;model_optional_code&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="s2"&gt;"
data {
  int N;
  vector[N] X;

  //Just a verbose way to specify a boolean variable
  int&amp;lt;lower = 0, upper = 1&amp;gt; sigma_known; 

  //sigma_data is size 0 if sigma_known is FALSE
  real&amp;lt;lower=0&amp;gt; sigma_data[sigma_known ? 1 : 0]; 
}

parameters {
  real mu;

  //sigma is size 0 if sigma_known is TRUE
  real&amp;lt;lower=0&amp;gt; sigma_param[sigma_known ? 0 : 1]; 
}

transformed parameters {
  real&amp;lt;lower=0&amp;gt; sigma;
  if (sigma_known) {
    sigma = sigma_data[1];
  } else {
    sigma = sigma_param[1];
  }
}

model {
  X ~ normal(mu, sigma);

  //And some priors
  mu ~ normal(0, 10);
  if (!sigma_known) {
    sigma_param ~ student_t(3, 0, 1);
  }
}

"&lt;/span&gt;

&lt;span class="n"&gt;model_optional&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;stan_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_code&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_optional_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We had to add some boilerplate code, but now we don’t have to maintain two separate models. This trick is also useful when testing multiple variants during development: the model compiles only once, so you can switch between the variants while modifying other parts of your code and spend less time waiting for compilation.&lt;/p&gt;

&lt;p&gt;Just to make sure the model works and see how to correctly specify the data, let’s fit it assuming the standard deviation is to be estimated:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_optional &amp;lt;- list(
  N = N,
  X = X,
  sigma_known = 0,
  sigma_data = numeric(0) #This produces an array of size 0
)

fit_optional &amp;lt;- sampling(model_optional, 
                         data = data_optional, 
                         iter = 500, pars = c("mu","sigma"))
summary(fit_optional, probs = c(0.1, 0.9))$summary %&amp;gt;% kable()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;se_mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;th&gt;10%&lt;/th&gt;
&lt;th&gt;90%&lt;/th&gt;
&lt;th&gt;n_eff&lt;/th&gt;
&lt;th&gt;Rhat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mu&lt;/td&gt;
&lt;td&gt;7.854036&lt;/td&gt;
&lt;td&gt;0.0198265&lt;/td&gt;
&lt;td&gt;0.5440900&lt;/td&gt;
&lt;td&gt;7.181837&lt;/td&gt;
&lt;td&gt;8.531780&lt;/td&gt;
&lt;td&gt;753.0924&lt;/td&gt;
&lt;td&gt;0.9981102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sigma&lt;/td&gt;
&lt;td&gt;1.730077&lt;/td&gt;
&lt;td&gt;0.0152808&lt;/td&gt;
&lt;td&gt;0.3918781&lt;/td&gt;
&lt;td&gt;1.308565&lt;/td&gt;
&lt;td&gt;2.270505&lt;/td&gt;
&lt;td&gt;657.6701&lt;/td&gt;
&lt;td&gt;0.9989029&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lp__&lt;/td&gt;
&lt;td&gt;-11.992770&lt;/td&gt;
&lt;td&gt;0.0503044&lt;/td&gt;
&lt;td&gt;0.9811551&lt;/td&gt;
&lt;td&gt;-13.383729&lt;/td&gt;
&lt;td&gt;-11.089657&lt;/td&gt;
&lt;td&gt;380.4199&lt;/td&gt;
&lt;td&gt;1.0016842&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And now let’s run the model and give it the correct standard deviation:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_optional_sigma_known &amp;lt;- list(
  N = N,
  X = X,
  sigma_known = 1,
  sigma_data = array(sigma_true, 1) 
  #The array conversion is necessary, otherwise Stan complains about dimensions
)

fit_optional_sigma_known &amp;lt;- sampling(model_optional, 
                                     data = data_optional_sigma_known, 
                                     iter = 500, pars = c("mu","sigma"))
summary(fit_optional_sigma_known, probs = c(0.1, 0.9))$summary %&amp;gt;% kable()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;se_mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;th&gt;10%&lt;/th&gt;
&lt;th&gt;90%&lt;/th&gt;
&lt;th&gt;n_eff&lt;/th&gt;
&lt;th&gt;Rhat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mu&lt;/td&gt;
&lt;td&gt;7.808058&lt;/td&gt;
&lt;td&gt;0.0292710&lt;/td&gt;
&lt;td&gt;0.6273565&lt;/td&gt;
&lt;td&gt;7.017766&lt;/td&gt;
&lt;td&gt;8.622762&lt;/td&gt;
&lt;td&gt;459.3600&lt;/td&gt;
&lt;td&gt;1.006164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sigma&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;0.0000000&lt;/td&gt;
&lt;td&gt;0.0000000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;1000.0000&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lp__&lt;/td&gt;
&lt;td&gt;-11.072234&lt;/td&gt;
&lt;td&gt;0.0321233&lt;/td&gt;
&lt;td&gt;0.6750295&lt;/td&gt;
&lt;td&gt;-11.917321&lt;/td&gt;
&lt;td&gt;-10.585280&lt;/td&gt;
&lt;td&gt;441.5753&lt;/td&gt;
&lt;td&gt;1.002187&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Extending
&lt;/h2&gt;

&lt;p&gt;Obviously this method lets you do all sorts of more complicated things, in particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the optional parameter is a vector you can have something like&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;vector[sigma_known ? 0 : n_sigma] sigma;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can have more than two variants to choose from and then use something akin to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;real param[variant == 5 ? 0 : 1];&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your conditions become more complex, you can always put them into a user-defined function (for optional data) or the &lt;code&gt;transformed data&lt;/code&gt; block (for optional parameters), as in:&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;functions {
  int compute_whatever_size(int X, int Y, int Z) {
        //do stuff
  }
}

data {
  ...
  real whatever[compute_whatever_size(X,Y,Z)];
  real&amp;lt;lower = 0&amp;gt; whatever_sigma[compute_whatever_size(X,Y,Z)];
}

transformed data {
  int carebear_size;

  //do stuff
  carebear_size = magic_result;
}

parameters {
  vector[carebear_size] carebear;
  matrix[carebear_size,carebear_size] spatial_carebear;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;




</description>
      <category>r</category>
      <category>stan</category>
      <category>bayesianstatistics</category>
    </item>
    <item>
      <title>Taming Divergences in Stan Models</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Mon, 19 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/martinmodrak/taming-divergences-in-stan-models-5762</link>
      <guid>https://forem.com/martinmodrak/taming-divergences-in-stan-models-5762</guid>
      <description>

&lt;p&gt;Although my time with the &lt;a href="http://mc-stan.org"&gt;Stan language&lt;/a&gt; for statistical computing has been enjoyable, there is one thing that is not fun when modelling with Stan. And it is the dreaded warning message:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;There were X divergent transitions after warmup. 
Increasing adapt_delta above 0.8 may help.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now once you have increased &lt;code&gt;adapt_delta&lt;/code&gt; to no avail, what should you do? Divergences (and max-treedepth and low E-BFMI warnings alike) tell you there is something wrong with your model, but do not exactly tell you what. There are numerous tricks and strategies to diagnose convergence problems, but currently, those are scattered across &lt;a href="http://mc-stan.org/users/documentation/"&gt;Stan documentation&lt;/a&gt;, &lt;a href="http://discourse.mc-stan.org/"&gt;Discourse&lt;/a&gt; and the &lt;a href="https://groups.google.com/forum/#!forum/stan-users"&gt;old mailing list&lt;/a&gt;. Here, I will try to bring all the tricks that helped me at some point to one place for the reference of future desperate modellers.&lt;/p&gt;
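&lt;p&gt;For reference, in &lt;code&gt;rstan&lt;/code&gt; you pass &lt;code&gt;adapt_delta&lt;/code&gt; (and related settings) via the &lt;code&gt;control&lt;/code&gt; argument of &lt;code&gt;sampling&lt;/code&gt; - assuming a compiled model and a data list as in the usual workflow:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fit &amp;lt;- sampling(model, data = data,
                control = list(adapt_delta = 0.99, max_treedepth = 15))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;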

&lt;h1&gt;
  
  
  The strategies
&lt;/h1&gt;

&lt;p&gt;I don’t want to keep you waiting, so below is a list of all strategies I have ever used to diagnose and/or remedy divergences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check your code. Twice. Divergences are almost as likely a result of a programming error as they are a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data, your model may be a bad fit for the actual observed data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check your priors. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualisations: use &lt;code&gt;mcmc_parcoord&lt;/code&gt; from the &lt;a href="https://cran.r-project.org/web/packages/bayesplot/index.html"&gt;&lt;code&gt;bayesplot&lt;/code&gt;&lt;/a&gt; package, &lt;a href="https://cran.r-project.org/web/packages/shinystan/index.html"&gt;Shinystan&lt;/a&gt; and &lt;code&gt;pairs&lt;/code&gt; from &lt;code&gt;rstan&lt;/code&gt;. &lt;a href="http://mc-stan.org/misc/warnings.html#runtime-warnings"&gt;Documentation for Stan Warnings&lt;/a&gt; (contains a few hints), &lt;a href="http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html"&gt;Case study - diagnosing a multilevel model&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1709.01449.pdf"&gt;Gabry et al. 2017 - Visualization in Bayesian workflow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure your model is &lt;em&gt;identifiable&lt;/em&gt; - non-identifiability and/or multimodality (multiple local maxima of the posterior distributions) is a problem. &lt;a href="http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html"&gt;Case study - mixture models&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run Stan with the &lt;code&gt;test_grad&lt;/code&gt; option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Reparametrize&lt;/em&gt; your model to make your parameters independent (uncorrelated) and close to N(0,1) (a.k.a change the actual parameters and compute your parameters of interest in the &lt;code&gt;transformed parameters&lt;/code&gt; block).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Try &lt;em&gt;non-centered parametrization&lt;/em&gt; - this is a special case of reparametrization that is so frequently useful that it deserves its own bullet. &lt;a href="http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html"&gt;Case study - diagnosing a multilevel model&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1312.0906"&gt;Betancourt &amp;amp; Girolami 2015&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Move parameters to the &lt;code&gt;data&lt;/code&gt; block and set them to their true values (from simulated data). Then return them one by one to the &lt;code&gt;parameters&lt;/code&gt; block. Which parameter introduces the problems?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Introduce tight priors centered at the true parameter values. How tight do the priors need to be for the model to fit? Useful for identifying multimodality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Play a bit more with &lt;code&gt;adapt_delta&lt;/code&gt;, &lt;code&gt;stepsize&lt;/code&gt; and &lt;code&gt;max_treedepth&lt;/code&gt;. &lt;a href="http://singmann.org/hierarchical-mpt-in-stan-i-dealing-with-convergent-transitions-via-co%20ntrol-arguments/"&gt;Example&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
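&lt;p&gt;To make bullet 8 concrete, here is the standard centered vs. non-centered parametrization of a hierarchical mean - a sketch with hypothetical names (&lt;code&gt;theta&lt;/code&gt;, &lt;code&gt;tau&lt;/code&gt;, &lt;code&gt;K&lt;/code&gt;), not tied to any model in this post:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//centered - often diverges when the data constrain theta only weakly
parameters {
  real mu;
  real&amp;lt;lower=0&amp;gt; tau;
  vector[K] theta;
}
model {
  theta ~ normal(mu, tau);
}

//non-centered - sample a standardized quantity, then rescale
parameters {
  real mu;
  real&amp;lt;lower=0&amp;gt; tau;
  vector[K] theta_raw;
}
transformed parameters {
  vector[K] theta = mu + tau * theta_raw;
}
model {
  theta_raw ~ normal(0, 1); //implies theta ~ normal(mu, tau)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;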

&lt;p&gt;In the coming weeks I hope to be able to provide separate posts on some of the bullets above with a worked-out example. In this introductory post I will try to provide you with some geometric intuition behind what divergences are.&lt;/p&gt;

&lt;h1&gt;
  
  
  Before We Delve In
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; &lt;em&gt;I am not a statistician and my understanding of Stan, the NUTS sampler and other technicalities is limited, so I might be wrong in some of my assertions. Please correct me, if you find mistakes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Make sure to follow &lt;a href="https://github.com/stan-dev/stan/wiki/Stan-Best-Practices"&gt;Stan Best practices&lt;/a&gt;. Especially, &lt;strong&gt;start with a simple model&lt;/strong&gt;, make sure it works and add complexity step by step. I really cannot repeat this enough. To be honest, I often don’t follow this advice myself, because just writing the full model down is so much fun. To be more honest, this has always resulted in me being sad and a lot of wasted time.&lt;/p&gt;

&lt;p&gt;Also note that directly translating models from JAGS/BUGS often fails as Stan requires different modelling approaches. Stan developers have experienced first hand that some JAGS models produce wrong results and do not converge even in JAGS, but no one noticed before they compared their output to results from Stan.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is a Divergence?
&lt;/h1&gt;

&lt;p&gt;Following the Stan manual:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A divergence arises when the simulated Hamiltonian trajectory departs from the true trajectory as measured by departure of the Hamiltonian value from its initial value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does that actually mean? The Hamiltonian is a function of the posterior density and auxiliary momentum parameters. The auxiliary parameters are well-behaved by construction, so the problem is almost invariably in the posterior density. Keep in mind that for numerical reasons Stan works with the logarithm of the posterior density (also known as &lt;code&gt;log_prob&lt;/code&gt;, &lt;code&gt;lp__&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt;). The NUTS sampler performs several discrete steps per iteration and is guided by the gradient of the density. With some simplification, the sampler assumes that the log density is approximately linear at the current point, i.e. that a small change in parameters will result in a small change in log-density. This assumption is approximately correct if the step size is small enough. Let’s look at two different step sizes in a one-dimensional example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P5HkP7eR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P5HkP7eR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-1-1.png" alt=""&gt;&lt;/a&gt; The sampler starts at the red dot, the black line is the log-density, magenta line is the gradient. When moving 0.1 to the right, the sampler expects the log-density to decrease linearly (green triangle) and although the actual log-density decreases more (the green square), the difference is small. But when moving 0.4 to the right the difference between expected (blue cross) and actual (pink crossed square) becomes much larger. It is a large discrepancy of a similar kind that is signalled as a divergence. During warmup Stan will try to adjust the step size to be small enough for divergences to not occur, but large enough for the sampling to be efficient. But if the parameter space is not well behaved, this might not be possible. Why? Keep on reading, fearless reader.&lt;/p&gt;
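&lt;p&gt;A side note: in &lt;code&gt;rstan&lt;/code&gt; you can count the post-warmup divergences of a fit yourself, for example:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sampler_params &amp;lt;- get_sampler_params(fit, inc_warmup = FALSE)
sum(sapply(sampler_params, function(chain) sum(chain[, "divergent__"])))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;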

&lt;h2&gt;
  
  
  2D Examples
&lt;/h2&gt;

&lt;p&gt;Let’s try to build some geometric intuition in a 2D parameter space. Keep in mind that sampling is about exploring the parameter space proportionally to the associated posterior density - or, in other words - exploring uniformly across the volume between the zero plane and the surface defined by the density (probability mass). For simplicity, we will ignore the log transform Stan actually does and talk directly about density in the rest of this post. Imagine the posterior density is a smooth wide hill:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_sezOPLk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_sezOPLk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-2-1.png" alt=""&gt;&lt;/a&gt; Stan starts each iteration by moving across the posterior in a random direction and then lets the density gradient steer the movement preferentially to areas with high density. To explore the hill efficiently, we need to take quite large steps in this process - the chain of samples will represent the posterior well if it can move across the whole posterior in a small-ish number of steps (actually at most &lt;code&gt;2^max_treedepth&lt;/code&gt; steps). So an average step size of something like 0.1 might be reasonable here, as the posterior is approximately linear at this scale. We need to spend a bit more time around the center, but not that much, as there is a lot of volume also close to the edges - it has lower density, but it is a larger area.&lt;/p&gt;

&lt;p&gt;Now imagine that the posterior is much sharper:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gq4gyoEA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gq4gyoEA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt; Now we need much smaller steps to explore safely. A step size of 0.1 won’t work, as the posterior is non-linear on this scale, which will result in divergences. The sampler is, however, able to adapt and chooses a smaller step size accordingly. Another thing Stan will do is to rescale dimensions where the posterior is narrow. In the example above, the posterior is narrower in &lt;code&gt;y&lt;/code&gt; and thus this dimension will be inflated to roughly match the spread in &lt;code&gt;x&lt;/code&gt;. Keep in mind that Stan rescales each dimension separately (the posterior is transformed by a diagonal matrix).&lt;/p&gt;

&lt;p&gt;Now what if the posterior is a combination of both a “smooth hill” and a “sharp mountain”?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gy_pkyHg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gy_pkyHg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt; The sampler should spend about half the time in the “sharp mountain” and the other half in the “smooth hill”, but those regions need different step sizes and the sampler only takes one step size. There is also no way to rescale the dimensions to compensate. A chain that adapted to the “smooth hill” region will experience divergences in the “sharp mountain” region, a chain that adapted to the “sharp mountain” will not move efficiently in the “smooth hill” region (which will be signalled as transitions exceeding maximum treedepth). The latter case is however less likely, as the “smooth hill” is larger and chains are more likely to start there. I &lt;em&gt;think&lt;/em&gt; that this is why problems of this kind mostly manifest as divergences and less likely as exceeding maximum treedepth.&lt;/p&gt;

&lt;p&gt;This is only one of many reasons why a multimodal posterior hurts sampling. Multimodality is problematic even if all modes are similar - one of the other problems is that traversing between modes might require a much larger step size than exploration within each mode, as in this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hSFb0CaF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hSFb0CaF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-5-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I bet Stan devs would add tons of other reasons why multimodality is bad for you (it really is), but I’ll stop here and move to other possible sources of divergences.&lt;/p&gt;

&lt;p&gt;The posterior geometry may be problematic, even if it is unimodal. A typical example is a funnel, which often arises in multi-level models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RL64jcTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-6-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RL64jcTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-6-1.png" alt=""&gt;&lt;/a&gt; Here, the sampler should spend a lot of time near the peak (where it needs small steps), but a non-negligible volume is also found in the relatively low-density but large area on the right where a larger step size is required. Once again, there is no way to rescale each dimension independently to selectively “stretch” the area around the peak. Similar problems also arise with large regions of constant or almost constant density combined with a single mode.&lt;/p&gt;

&lt;p&gt;Last, but not least, let’s look at tight correlation between variables, which is a different but frequent problem:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iHMJWUgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-7-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iHMJWUgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-7-1.png" alt=""&gt;&lt;/a&gt; The problem is that if we are moving in the direction of the ridge, we need a large step size, but when we move tangentially to that direction, we need a small step size. Once again, Stan is unable to rescale the posterior to compensate, as scaling &lt;code&gt;x&lt;/code&gt; or &lt;code&gt;y&lt;/code&gt; on its own will increase both the width and the length of the ridge.&lt;/p&gt;

&lt;p&gt;Things get even more insidious when the relationship between the two variables is not linear:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ZkzLjgK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ZkzLjgK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.martinmodrak.cz/post/2018-03-01-strategies-for-diverging-stan-models_files/figure-html/unnamed-chunk-8-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, a good step size is a function of both location (smaller near the peak) and direction (larger when following the spiral) making this kind of posterior hard to sample.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bigger Picture
&lt;/h2&gt;

&lt;p&gt;This has all been pretty vague and folksy. Remember, these examples are here just to provide intuition. To be 100% correct, you need to go to the &lt;a href="http://www.jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf"&gt;NUTS paper&lt;/a&gt; and/or the &lt;a href="https://arxiv.org/abs/1701.02434"&gt;Conceptual Introduction to HMC paper&lt;/a&gt; and delve into the math. The math is always correct.&lt;/p&gt;

&lt;p&gt;In particular, all of the above geometries &lt;strong&gt;may&lt;/strong&gt; be difficult for NUTS, and seeing them in visualisations hints at possible issues, but they &lt;strong&gt;may&lt;/strong&gt; also be handled just fine. In fact, I wouldn’t be surprised if Stan worked with almost anything in two dimensions. Weak linear correlations that form wide ridges are also - in my experience - quite likely to be sampled well, even in higher dimensions. The issues arise when regions of non-negligible density are very narrow in some directions and much wider in others, and rescaling each dimension individually won’t help. And finally, keep in mind that the posteriors we discussed are even more difficult for Gibbs and other older samplers - and Gibbs will not even let you know there was a problem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Love Thy Divergences
&lt;/h1&gt;

&lt;p&gt;The amazing thing about divergences is that what is essentially a numerical problem actually signals a wide array of possibly severe modelling problems. Be glad - few algorithms (in any area) have such a clear signal that things went wrong. This is also the reason why you should be suspicious of your results even when only a single divergence has been reported - you don’t know what is hiding in the parts of your posterior that are inaccessible with the current step size.&lt;/p&gt;

&lt;p&gt;That’s all for now. Hope to see you in the future with examples of actual diverging Stan models.&lt;/p&gt;


</description>
      <category>stan</category>
      <category>bayesianstatistics</category>
      <category>mcmc</category>
    </item>
    <item>
      <title>Launch Shiny App Without Blocking the Session</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Tue, 13 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/martinmodrak/launch-shiny-app-without-blocking-the-session-3c1p</link>
      <guid>https://forem.com/martinmodrak/launch-shiny-app-without-blocking-the-session-3c1p</guid>
      <description>&lt;p&gt;This is a neat trick I found on &lt;a href="https://twitter.com/tylermorganwall/status/962074911949840387"&gt;Tyler Morgan-Wall’s Twitter&lt;/a&gt; and is originally attributed to &lt;a href="https://twitter.com/jcheng"&gt;Joe Cheng&lt;/a&gt;. You can run any Shiny app without blocking the session. My helper function to run ShinyStan without blocking is below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;launch_shinystan_nonblocking &amp;lt;- function(fit) {
  library(future)
  plan(multisession)
  future(
    launch_shinystan(fit) #You can replace this with any other Shiny app
  )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope that helps!&lt;/p&gt;

</description>
      <category>r</category>
      <category>shiny</category>
      <category>stan</category>
    </item>
    <item>
      <title>A Gentle Stan vs. INLA Comparison</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Fri, 02 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/martinmodrak/a-gentle-stan-vs-inla-comparison-2nl5</link>
      <guid>https://forem.com/martinmodrak/a-gentle-stan-vs-inla-comparison-2nl5</guid>
      <description>

&lt;p&gt;&lt;em&gt;I have recently become a huge fan of Bayesian statistics. It makes so much sense and if you are doing any kind of inferences from data, you should check it out, especially the &lt;a href="http://mc-stan.org/"&gt;Stan language&lt;/a&gt;. Without much further intro, this is my first blog on statistical topics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not long ago, I came across a nice blog post by Kathryn Morrison called &lt;a href="https://www.precision-analytics.ca/blog-1/inla"&gt;A gentle INLA tutorial&lt;/a&gt;. The blog was nice and helped me better appreciate INLA. But as a fan of the Stan probabilistic language, I felt that comparing INLA to JAGS is not really that relevant, as Stan should - at least in theory - be way faster and better than JAGS. Here, I ran a comparison of INLA to Stan on the second example called “Poisson GLM with an iid random effect”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The TLDR is:&lt;/strong&gt; For this model, Stan scales considerably better than JAGS, but still cannot scale to very large models. Also, for this model Stan and INLA give almost the same results. It seems that Stan becomes useful only when your model cannot be coded in INLA.&lt;/p&gt;

&lt;p&gt;Please let me know (via an &lt;a href="https://github.com/martinmodrak/blog/issues"&gt;issue on GitHub&lt;/a&gt;) should you find any errors or anything else that should be included in this post. Also, if you run the experiment on a different machine and/or with a different seed, let me know the results.&lt;/p&gt;

&lt;p&gt;Here are the original numbers from Kathryn’s blog:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th&gt;kathryn_rjags&lt;/th&gt;
&lt;th&gt;kathryn_rinla&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;30.394&lt;/td&gt;
&lt;td&gt;0.383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;142.532&lt;/td&gt;
&lt;td&gt;1.243&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;td&gt;1714.468&lt;/td&gt;
&lt;td&gt;5.768&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25000&lt;/td&gt;
&lt;td&gt;8610.32&lt;/td&gt;
&lt;td&gt;30.077&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;got bored after 6 hours&lt;/td&gt;
&lt;td&gt;166.819&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Full source of this post is available at &lt;a href="https://github.com/martinmodrak/blog/blob/master/content/post/2018-02-02-stan-vs-inla.Rmd"&gt;this blog’s Github repo&lt;/a&gt;. Keep in mind that installing RStan is unfortunately not as straightforward as running install.packages. Please consult &lt;a href="https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started"&gt;https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started&lt;/a&gt; if you don’t have RStan already installed.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;The model we are interested in is a simple GLM with partial pooling of a random effect:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_i ~ poisson(mu_i)
log(mu_i) = alpha + beta * x_i + nu_i
nu_i ~ normal(0, tau_nu)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  The comparison
&lt;/h2&gt;

&lt;p&gt;Let’s set up our libraries.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(rstan)
library(brms)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(INLA)
library(tidyverse)
set.seed(6619414)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The results are stored in files within the repository to let me rebuild the site with blogdown easily. Delete the cache directory to force a complete rerun.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_dir = "_stan_vs_inla_cache/"
if(!dir.exists(cache_dir)){
  dir.create(cache_dir)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let’s start by simulating the data:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The sizes of datasets to work with
N_values = c(100, 500, 5000, 25000)
data = list()
for(N in N_values) {
  x = rnorm(N, mean=5,sd=1) 
  nu = rnorm(N,0,0.1)
  mu = exp(1 + 0.5*x + nu) 
  y = rpois(N,mu) 


  data[[N]] = list(
    N = N,
    x = x,
    y = y
  )  
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Measuring Stan
&lt;/h3&gt;

&lt;p&gt;Here is the model code in Stan (it is good practice to put it into a file, but I wanted to make this post self-contained). It is an almost 1-to-1 rewrite of the original JAGS code, with a few important changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JAGS parametrizes the normal distribution via precision, Stan via standard deviation. The model converts the precision to an sd.&lt;/li&gt;
&lt;li&gt;I added the ability to explicitly set the parameters of the prior distributions as data, which is useful later in this post.&lt;/li&gt;
&lt;li&gt;With multilevel models, Stan works waaaaaay better with so-called non-centered parametrization. This means that instead of having &lt;code&gt;nu ~ N(0, nu_sigma), mu = alpha + beta * x + nu&lt;/code&gt; we have &lt;code&gt;nu_normalized ~ N(0,1), mu = alpha + beta * x + nu_normalized * nu_sigma&lt;/code&gt;. This gives exactly the same inferences, but results in a geometry that Stan can explore efficiently.&lt;/li&gt;
&lt;/ul&gt;
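The non-centered parametrization relies on the simple identity <code>nu = nu_normalized * nu_sigma</code>. A quick Python check (my own sketch, not part of the original post) confirms that the two parametrizations describe exactly the same distribution; only the geometry the sampler has to explore differs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
nu_sigma = 0.1

# Centered: draw nu directly on its target scale.
nu_centered = rng.normal(0.0, nu_sigma, n)

# Non-centered: draw a standard normal and rescale afterwards.
# The sampler then works with an isotropic N(0, 1) geometry,
# while the model still sees nu distributed as N(0, nu_sigma).
nu_noncentered = rng.normal(0.0, 1.0, n) * nu_sigma

# Both samples have the same mean and standard deviation.
print(round(nu_centered.std(), 3), round(nu_noncentered.std(), 3))
```

This is why the Stan model below samples `nu_normalized ~ normal(0, 1)` and multiplies by `nu_sigma` in the model block.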

&lt;p&gt;There are also packages to let you specify common models (including this one) without writing Stan code, using syntax similar to R-INLA - check out &lt;a href="http://mc-stan.org/users/interfaces/rstanarm"&gt;rstanarm&lt;/a&gt; and &lt;a href="https://cran.r-project.org/web/packages/brms/index.html"&gt;brms&lt;/a&gt;. The latter is more flexible, while the former is easier to install, as it ships with precompiled models and can be installed simply with &lt;code&gt;install.packages&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Note also that Stan developers would advise against a Gamma(0.01,0.01) prior on the precision in favor of a normal or Cauchy distribution on the sd, see &lt;a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations"&gt;https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations&lt;/a&gt;.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight mosel"&gt;&lt;code&gt;&lt;span class="n"&gt;model_code&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"
  data {
    int N;
    vector[N] x;
    int y[N];

    //Allowing to parametrize the priors (useful later)
    real alpha_prior_mean;
    real beta_prior_mean;
    real&amp;lt;lower=0&amp;gt; alpha_beta_prior_precision;
    real&amp;lt;lower=0&amp;gt; tau_nu_prior_shape;
    real&amp;lt;lower=0&amp;gt; tau_nu_prior_rate; 
  }

  transformed data {
    //Stan parametrizes normal with sd not precision
    real alpha_beta_prior_sigma = sqrt(1 / alpha_beta_prior_precision);
  }

  parameters {
    real alpha;
    real beta;
    vector[N] nu_normalized;
    real&amp;lt;lower=0&amp;gt; tau_nu;
  }

  model {
    real nu_sigma = sqrt(1 / tau_nu);
    vector[N] nu = nu_normalized * nu_sigma;

    //taking advantage of Stan's implicit vectorization here
    nu_normalized ~ normal(0,1);
    //The built-in poisson_log(x) === poisson(exp(x))
    y ~ poisson_log(alpha + beta*x + nu); 

    alpha ~ normal(alpha_prior_mean, alpha_beta_prior_sigma);
    beta ~ normal(beta_prior_mean, alpha_beta_prior_sigma); 
    tau_nu ~ gamma(tau_nu_prior_shape,tau_nu_prior_rate);
  }

//Uncomment this to have the model generate mu values as well
//Currently commented out as storing the samples of mu consumes 
//a lot of memory for the big models
/*  
  generated quantities {
    vector[N] mu = exp(alpha + beta*x + nu_normalized * nu_sigma);
  }
*/
"&lt;/span&gt;

&lt;span class="k"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stan_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_code&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Below is the code to make the actual measurements. Some caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The compilation of the Stan model is not counted (you can avoid it with rstanarm, and otherwise you need to do it only once).&lt;/li&gt;
&lt;li&gt;There is some overhead in transferring the posterior samples from Stan to R. This overhead is non-negligible for the larger models, but you can get rid of it by storing the samples in a file and reading them separately. The overhead is not measured here.&lt;/li&gt;
&lt;li&gt;Stan took &amp;gt; 16 hours to converge for the largest data size (1e5), and then I had issues fitting the posterior samples into memory on my computer. Notably, R-INLA also crashed on my computer for this size. The largest size is thus excluded here, but I have to conclude that if you get bored after 6 hours, Stan is not practical for such a big model.&lt;/li&gt;
&lt;li&gt;I was not able to get rjags running in a reasonable amount of time, so I did not rerun the JAGS version of the model.&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stan_times_file = paste0(cache_dir, "stan_times.csv")
stan_summary_file = paste0(cache_dir, "stan_summary.csv")
run_stan = TRUE
if(file.exists(stan_times_file) &amp;amp;&amp;amp; file.exists(stan_summary_file)) {
  stan_times = read.csv(stan_times_file)
  stan_summary = read.csv(stan_summary_file) 
  if(setequal(stan_times$N, N_values) &amp;amp;&amp;amp; setequal(stan_summary$N, N_values)) {
    run_stan = FALSE
  }
} 

if(run_stan) {
  stan_times_values = numeric(length(N_values))
  stan_summary_list = list()
  step = 1
  for(N in N_values) {
    data_stan = data[[N]]
    data_stan$alpha_prior_mean = 0
    data_stan$beta_prior_mean = 0
    data_stan$alpha_beta_prior_precision = 0.001
    data_stan$tau_nu_prior_shape = 0.01
    data_stan$tau_nu_prior_rate = 0.01


    fit = sampling(model, data = data_stan);
    stan_summary_list[[step]] = 
      as.data.frame(
        rstan::summary(fit, pars = c("alpha","beta","tau_nu"))$summary
      ) %&amp;gt;% rownames_to_column("parameter")
    stan_summary_list[[step]]$N = N

    all_times = get_elapsed_time(fit)
    stan_times_values[step] = max(all_times[,"warmup"] + all_times[,"sample"])

    step = step + 1
  }
  stan_times = data.frame(N = N_values, stan_time = stan_times_values)
  stan_summary = do.call(rbind, stan_summary_list)

  write.csv(stan_times, stan_times_file,row.names = FALSE)
  write.csv(stan_summary, stan_summary_file,row.names = FALSE)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Measuring INLA
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inla_times_file = paste0(cache_dir,"inla_times.csv")
inla_summary_file = paste0(cache_dir,"inla_summary.csv")
run_inla = TRUE
if(file.exists(inla_times_file) &amp;amp;&amp;amp; file.exists(inla_summary_file)) {
  inla_times = read.csv(inla_times_file)
  inla_summary = read.csv(inla_summary_file) 
  if(setequal(inla_times$N, N_values) &amp;amp;&amp;amp; setequal(inla_summary$N, N_values)) {
    run_inla = FALSE
  }
} 

if(run_inla) {
  inla_times_values = numeric(length(N_values))
  inla_summary_list = list()
  step = 1
  for(N in N_values) {
    nu = 1:N 
    fit_inla = inla(y ~ x + f(nu,model="iid"), family = c("poisson"), 
               data = data[[N]], control.predictor=list(link=1)) 

    inla_times_values[step] = fit_inla$cpu.used["Total"]
    inla_summary_list[[step]] = 
      rbind(fit_inla$summary.fixed %&amp;gt;% select(-kld),
            fit_inla$summary.hyperpar) %&amp;gt;% 
      rownames_to_column("parameter")
    inla_summary_list[[step]]$N = N

    step = step + 1
  }
  inla_times = data.frame(N = N_values, inla_time = inla_times_values)
  inla_summary = do.call(rbind, inla_summary_list)

  write.csv(inla_times, inla_times_file,row.names = FALSE)
  write.csv(inla_summary, inla_summary_file,row.names = FALSE)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Checking inferences
&lt;/h3&gt;

&lt;p&gt;Here we see side-by-side comparisons of the inferences, and they seem pretty comparable between Stan and INLA:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for(N_to_show in N_values) {
  print(kable(stan_summary %&amp;gt;% filter(N == N_to_show) %&amp;gt;% 
                select(c("parameter","mean","sd")), 
              caption = paste0("Stan results for N = ", N_to_show)))
  print(kable(inla_summary %&amp;gt;% filter(N == N_to_show) %&amp;gt;% 
                select(c("parameter","mean","sd")), 
              caption = paste0("INLA results for N = ", N_to_show)))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;




&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;Stan results for N = 100&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alpha&lt;/td&gt;
&lt;td&gt;1.013559&lt;/td&gt;
&lt;td&gt;0.0989778&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta&lt;/td&gt;
&lt;td&gt;0.495539&lt;/td&gt;
&lt;td&gt;0.0176988&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu&lt;/td&gt;
&lt;td&gt;162.001608&lt;/td&gt;
&lt;td&gt;82.7700473&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;INLA results for N = 100&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Intercept)&lt;/td&gt;
&lt;td&gt;1.009037e+00&lt;/td&gt;
&lt;td&gt;9.15248e-02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;4.971302e-01&lt;/td&gt;
&lt;td&gt;1.61486e-02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision for nu&lt;/td&gt;
&lt;td&gt;1.819654e+04&lt;/td&gt;
&lt;td&gt;1.71676e+04&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;Stan results for N = 500&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alpha&lt;/td&gt;
&lt;td&gt;1.0046284&lt;/td&gt;
&lt;td&gt;0.0555134&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta&lt;/td&gt;
&lt;td&gt;0.4977522&lt;/td&gt;
&lt;td&gt;0.0102697&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu&lt;/td&gt;
&lt;td&gt;71.6301530&lt;/td&gt;
&lt;td&gt;13.8264812&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;INLA results for N = 500&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Intercept)&lt;/td&gt;
&lt;td&gt;1.0053202&lt;/td&gt;
&lt;td&gt;0.0538456&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;0.4977124&lt;/td&gt;
&lt;td&gt;0.0099593&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision for nu&lt;/td&gt;
&lt;td&gt;77.3311793&lt;/td&gt;
&lt;td&gt;16.0255430&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;Stan results for N = 5000&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alpha&lt;/td&gt;
&lt;td&gt;1.009930&lt;/td&gt;
&lt;td&gt;0.0159586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta&lt;/td&gt;
&lt;td&gt;0.496859&lt;/td&gt;
&lt;td&gt;0.0029250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu&lt;/td&gt;
&lt;td&gt;101.548580&lt;/td&gt;
&lt;td&gt;7.4655716&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;INLA results for N = 5000&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Intercept)&lt;/td&gt;
&lt;td&gt;1.0099282&lt;/td&gt;
&lt;td&gt;0.0155388&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;0.4968718&lt;/td&gt;
&lt;td&gt;0.0028618&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision for nu&lt;/td&gt;
&lt;td&gt;103.1508773&lt;/td&gt;
&lt;td&gt;7.6811740&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;Stan results for N = 25000&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alpha&lt;/td&gt;
&lt;td&gt;0.9874707&lt;/td&gt;
&lt;td&gt;0.0066864&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta&lt;/td&gt;
&lt;td&gt;0.5019566&lt;/td&gt;
&lt;td&gt;0.0012195&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu&lt;/td&gt;
&lt;td&gt;104.3599424&lt;/td&gt;
&lt;td&gt;3.5391938&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span id="tab:unnamed-chunk-6"&gt;Table 1: &lt;/span&gt;INLA results for N = 25000&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;parameter&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Intercept)&lt;/td&gt;
&lt;td&gt;0.9876218&lt;/td&gt;
&lt;td&gt;0.0067978&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;0.5019341&lt;/td&gt;
&lt;td&gt;0.0012452&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision for nu&lt;/td&gt;
&lt;td&gt;104.8948949&lt;/td&gt;
&lt;td&gt;3.4415929&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Summary of the timing
&lt;/h3&gt;

&lt;p&gt;You can see that Stan keeps reasonable runtimes for longer than JAGS did in the original blog post, but INLA is still way faster. Also, Kathryn probably got very lucky with her seed for N = 25 000, as her INLA run completed very quickly. In my (few) tests, INLA always took at least several minutes for N = 25 000. This may mean that Kathryn’s JAGS time is also too short.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_results = merge.data.frame(inla_times, stan_times, by = "N")
kable(merge.data.frame(my_results, kathryn_results, by = "N"))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th&gt;inla_time&lt;/th&gt;
&lt;th&gt;stan_time&lt;/th&gt;
&lt;th&gt;kathryn_rjags&lt;/th&gt;
&lt;th&gt;kathryn_rinla&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1.061742&lt;/td&gt;
&lt;td&gt;1.885&lt;/td&gt;
&lt;td&gt;30.394&lt;/td&gt;
&lt;td&gt;0.383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;1.401597&lt;/td&gt;
&lt;td&gt;11.120&lt;/td&gt;
&lt;td&gt;142.532&lt;/td&gt;
&lt;td&gt;1.243&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;td&gt;10.608704&lt;/td&gt;
&lt;td&gt;388.514&lt;/td&gt;
&lt;td&gt;1714.468&lt;/td&gt;
&lt;td&gt;5.768&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25000&lt;/td&gt;
&lt;td&gt;611.505543&lt;/td&gt;
&lt;td&gt;5807.670&lt;/td&gt;
&lt;td&gt;8610.32&lt;/td&gt;
&lt;td&gt;30.077&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You could obviously do multiple runs to reduce uncertainty etc., but this post has already taken too much of my time, so this will be left to others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing quality of the results
&lt;/h2&gt;

&lt;p&gt;I also had a hunch that maybe INLA is less precise than Stan, but that turned out to be based on an error. Thus, without much commentary, I put here my code to test this. Basically, I modify the random data generator to actually draw from the priors (those priors are quite constrained to provide similar values of alpha, beta and tau_nu as in the original). I then give both algorithms knowledge of these priors. I compute both the difference between the true parameters and a point estimate (the posterior mean) and the quantiles of the posterior distribution at which the true parameters are found. If the algorithms give the best possible estimates, the distribution of such quantiles should be uniform over (0,1). It turns out that INLA and Stan give almost exactly the same results for almost all runs and the differences in quality are (for this particular model) negligible.&lt;/p&gt;
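The logic of this check can be illustrated on a toy conjugate model where the exact posterior is known in closed form (a Python sketch of the idea, not the actual R code used below; the model and numbers here are mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy conjugate model with a known posterior:
# theta ~ N(0, 1), y | theta ~ N(theta, 1), giving theta | y ~ N(y / 2, 1 / 2).
n_sims = 2000
theta = rng.normal(0.0, 1.0, n_sims)  # true parameters drawn from the prior
y = rng.normal(theta, 1.0)            # one observation per simulation

# Posterior quantile at which the true parameter is found.
q = norm.cdf(theta, loc=y / 2.0, scale=np.sqrt(0.5))

# For an exact posterior these quantiles are Uniform(0, 1):
# mean close to 0.5 and roughly 200 of the 2000 draws per decile.
counts, _ = np.histogram(q, bins=10, range=(0.0, 1.0))
print(round(float(q.mean()), 2), counts.tolist())
```

If an algorithm systematically over- or under-estimated uncertainty, the histogram of these quantiles would instead pile up in the middle or at the edges.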



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_precision = function(N) {
  rejects &amp;lt;- 0
  repeat {
    #Set the priors so that they generate similar parameters as in the example above

    alpha_beta_prior_precision = 5
    prior_sigma = sqrt(1/alpha_beta_prior_precision)
    alpha_prior_mean = 1
    beta_prior_mean = 0.5
    alpha = rnorm(1, alpha_prior_mean, prior_sigma)
    beta = rnorm(1, beta_prior_mean, prior_sigma)

    tau_nu_prior_shape = 2
    tau_nu_prior_rate = 0.01
    tau_nu = rgamma(1,tau_nu_prior_shape,tau_nu_prior_rate)
    sigma_nu = sqrt(1 / tau_nu)

    x = rnorm(N, mean=5,sd=1) 


    nu = rnorm(N,0,sigma_nu)
    linear = alpha + beta*x + nu

    #Rejection sampling to avoid NAs and ill-posed problems
    if(max(linear) &amp;lt; 15) {
      mu = exp(linear) 
      y = rpois(N,mu) 
      if(mean(y == 0) &amp;lt; 0.7) {
        break;
      }
    } 
    rejects = rejects + 1
  }

  #cat(rejects, "rejects\n")


  data = list(
    N = N,
    x = x,
    y = y
  )
  #cat("A:",alpha,"B:", beta, "T:", tau_nu,"\n")
  #print(linear)
  #print(data)

  #=============== Fit INLA
  nu = 1:N 
  fit_inla = inla(y ~ x + f(nu,model="iid",
                  hyper=list(theta=list(prior="loggamma",
                                        param=c(tau_nu_prior_shape,tau_nu_prior_rate)))), 
                  family = c("poisson"), 
                  control.fixed = list(mean = beta_prior_mean, 
                                       mean.intercept = alpha_prior_mean,
                                       prec = alpha_beta_prior_precision,
                                       prec.intercept = alpha_beta_prior_precision
                                       ),
             data = data, control.predictor=list(link=1)
             ) 

  time_inla = fit_inla$cpu.used["Total"]

  alpha_mean_diff_inla = fit_inla$summary.fixed["(Intercept)","mean"] - alpha
  beta_mean_diff_inla = fit_inla$summary.fixed["x","mean"] - beta
  tau_nu_mean_diff_inla = fit_inla$summary.hyperpar[,"mean"] - tau_nu

  alpha_q_inla = inla.pmarginal(alpha, fit_inla$marginals.fixed$`(Intercept)`)
  beta_q_inla = inla.pmarginal(beta, fit_inla$marginals.fixed$x)
  tau_nu_q_inla = inla.pmarginal(tau_nu, fit_inla$marginals.hyperpar$`Precision for nu`)



  #================ Fit Stan
  data_stan = data
  data_stan$alpha_prior_mean = alpha_prior_mean
  data_stan$beta_prior_mean = beta_prior_mean
  data_stan$alpha_beta_prior_precision = alpha_beta_prior_precision
  data_stan$tau_nu_prior_shape = tau_nu_prior_shape
  data_stan$tau_nu_prior_rate = tau_nu_prior_rate

  fit = sampling(model, data = data_stan, control = list(adapt_delta = 0.95)); 
  all_times = get_elapsed_time(fit)
  max_total_time_stan = max(all_times[,"warmup"] + all_times[,"sample"])

  samples = rstan::extract(fit, pars = c("alpha","beta","tau_nu"))
  alpha_mean_diff_stan = mean(samples$alpha) - alpha
  beta_mean_diff_stan = mean(samples$beta) - beta
  tau_nu_mean_diff_stan = mean(samples$tau_nu) - tau_nu

  alpha_q_stan = ecdf(samples$alpha)(alpha)
  beta_q_stan = ecdf(samples$beta)(beta)
  tau_nu_q_stan = ecdf(samples$tau_nu)(tau_nu)

  return(data.frame(time_rstan = max_total_time_stan,
                    time_rinla = time_inla,
                    alpha_mean_diff_stan = alpha_mean_diff_stan,
                    beta_mean_diff_stan = beta_mean_diff_stan,
                    tau_nu_mean_diff_stan = tau_nu_mean_diff_stan,
                    alpha_q_stan = alpha_q_stan,
                    beta_q_stan = beta_q_stan,
                    tau_nu_q_stan = tau_nu_q_stan,
                    alpha_mean_diff_inla = alpha_mean_diff_inla,
                    beta_mean_diff_inla = beta_mean_diff_inla,
                    tau_nu_mean_diff_inla = tau_nu_mean_diff_inla,
                    alpha_q_inla= alpha_q_inla,
                    beta_q_inla = beta_q_inla,
                    tau_nu_q_inla = tau_nu_q_inla
                    ))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now let’s actually run the comparison. On some occasions, Stan does not converge; my best guess is that the data are somehow pathological, but I didn’t investigate thoroughly. You can see that the results for Stan and INLA are very similar, both in the point estimates and in the distribution of posterior quantiles. The accuracy of the INLA approximation is also, AFAIK, going to improve with more data.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(skimr) #Uses skimr to summarize results easily
precision_results_file = paste0(cache_dir,"precision_results.csv")
if(file.exists(precision_results_file)) {
  results_precision_df = read.csv(precision_results_file)
} else {
  results_precision = list()
  for(i in 1:100) {
    results_precision[[i]] = test_precision(50)
  }

  results_precision_df = do.call(rbind, results_precision)
  write.csv(results_precision_df,precision_results_file,row.names = FALSE)
}

#Remove uninteresting skim statistics
skim_with(numeric = list(missing = NULL, complete = NULL, n = NULL))

skimmed = results_precision_df %&amp;gt;% select(-X) %&amp;gt;% skim() 
#Now a hack to display skim histograms properly in the output:
skimmed_better = skimmed %&amp;gt;% rowwise() %&amp;gt;% mutate(formatted = 
     if_else(stat == "hist", 
         utf8ToInt(formatted) %&amp;gt;% as.character() %&amp;gt;% paste0("&amp;amp;#", . ,";", collapse = ""), 
         formatted))  
mostattributes(skimmed_better) = attributes(skimmed)

skimmed_better %&amp;gt;% kable(escape = FALSE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Skim summary statistics&lt;br&gt;&lt;br&gt;
n obs: 100&lt;br&gt;&lt;br&gt;
n variables: 14&lt;/p&gt;

&lt;p&gt;Variable type: numeric&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;variable&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;sd&lt;/th&gt;
&lt;th&gt;p0&lt;/th&gt;
&lt;th&gt;p25&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p75&lt;/th&gt;
&lt;th&gt;p100&lt;/th&gt;
&lt;th&gt;hist&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alpha_mean_diff_inla&lt;/td&gt;
&lt;td&gt;-0.0021&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;-0.85&lt;/td&gt;
&lt;td&gt;-0.094&lt;/td&gt;
&lt;td&gt;0.0023&lt;/td&gt;
&lt;td&gt;0.095&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;▁▁▁▂▇▇▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alpha_mean_diff_stan&lt;/td&gt;
&lt;td&gt;-0.0033&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;-0.84&lt;/td&gt;
&lt;td&gt;-0.097&lt;/td&gt;
&lt;td&gt;-0.00012&lt;/td&gt;
&lt;td&gt;0.093&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;▁▁▁▂▇▇▁▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alpha_q_inla&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.29&lt;/td&gt;
&lt;td&gt;0.00084&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;▅▇▇▆▇▆▆▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alpha_q_stan&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;0.26&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;▅▇▇▆▇▆▆▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta_mean_diff_inla&lt;/td&gt;
&lt;td&gt;-0.00088&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;-0.12&lt;/td&gt;
&lt;td&gt;-0.016&lt;/td&gt;
&lt;td&gt;-0.001&lt;/td&gt;
&lt;td&gt;0.014&lt;/td&gt;
&lt;td&gt;0.17&lt;/td&gt;
&lt;td&gt;▁▁▃▇▂▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta_mean_diff_stan&lt;/td&gt;
&lt;td&gt;-0.001&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;-0.12&lt;/td&gt;
&lt;td&gt;-0.016&lt;/td&gt;
&lt;td&gt;-5e-04&lt;/td&gt;
&lt;td&gt;0.014&lt;/td&gt;
&lt;td&gt;0.16&lt;/td&gt;
&lt;td&gt;▁▁▂▇▂▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta_q_inla&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;0.0068&lt;/td&gt;
&lt;td&gt;0.26&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;▆▆▅▆▇▅▆▆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beta_q_stan&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;0.0065&lt;/td&gt;
&lt;td&gt;0.27&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;▆▆▅▇▆▅▆▆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu_mean_diff_inla&lt;/td&gt;
&lt;td&gt;4.45&lt;/td&gt;
&lt;td&gt;90.17&lt;/td&gt;
&lt;td&gt;-338.58&lt;/td&gt;
&lt;td&gt;-26.74&lt;/td&gt;
&lt;td&gt;4.49&lt;/td&gt;
&lt;td&gt;53.38&lt;/td&gt;
&lt;td&gt;193&lt;/td&gt;
&lt;td&gt;▁▁▁▂▅▇▃▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu_mean_diff_stan&lt;/td&gt;
&lt;td&gt;5.21&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;-339.89&lt;/td&gt;
&lt;td&gt;-24.62&lt;/td&gt;
&lt;td&gt;4.29&lt;/td&gt;
&lt;td&gt;54.48&lt;/td&gt;
&lt;td&gt;191.94&lt;/td&gt;
&lt;td&gt;▁▁▁▂▅▇▃▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu_q_inla&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;0.26&lt;/td&gt;
&lt;td&gt;0.023&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;▃▅▆▆▇▆▅▅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tau_nu_q_stan&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;0.26&lt;/td&gt;
&lt;td&gt;0.021&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;▃▅▅▆▇▃▅▅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time_rinla&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;0.093&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;td&gt;0.91&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;1.32&lt;/td&gt;
&lt;td&gt;▇▇▂▁▁▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time_rstan&lt;/td&gt;
&lt;td&gt;1.79&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;1.45&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;▇▂▁▁▁▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


</description>
      <category>r</category>
      <category>bayesianstatistics</category>
      <category>stan</category>
      <category>inla</category>
    </item>
    <item>
      <title>What Elm and Rust Teach us About the Future</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Wed, 08 Feb 2017 08:07:49 +0000</pubDate>
      <link>https://forem.com/martinmodrak/what-elm-and-rust-teach-us-about-the-future</link>
      <guid>https://forem.com/martinmodrak/what-elm-and-rust-teach-us-about-the-future</guid>
      <description>

&lt;p&gt;So I recently started programming in Elm and also did some stuff in Rust. Honestly, it was mostly a hype-driven decision, but the journey was definitely worth it. I also noticed that although those two languages differ in their target audience and use cases, they made many similar design decisions. I think this is no coincidence. It is entirely possible that ten years from now, both Elm and Rust will be forgotten, but I am quite sure that the ideas they are built upon will be present in the languages we will use by then. This is a post about the ideas I find charming in Elm and Rust.&lt;/p&gt;

&lt;p&gt;A quick disclaimer first: I am no expert in either language and while I am starting to feel comfortable in Elm, I am undoubtedly a Rust beginner, so please correct me if I am doing injustice to any of the languages.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting the Scene
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.rust-lang.org"&gt;Rust&lt;/a&gt; is a systems language which aims to compete with C++. Rust values performance, concurrency, and memory safety, but is not garbage-collected. Rust compiles to native binaries, not only for the major x86/x64 platforms but also on ARM and even certain ARM-based microcontrollers.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://elm-lang.org/"&gt;Elm&lt;/a&gt; is a language for web apps competing with Javascript in general and the virtual DOM frameworks in particular (e.g. ReactJS). Elm compiles to Javascript, is garbage collected and purely functional. Elm values simplicity and reliability.&lt;/p&gt;

&lt;p&gt;Both languages are already usable for actual projects, but the ecosystems are still immature and the languages themselves are still evolving.&lt;/p&gt;

&lt;p&gt;While I like both of the languages, I do not intend to limit this post to the positive sides and will also mention what (to me) are the pain points.&lt;br&gt;
I will start with the ideas the languages have in common, and will give more details about each language later.&lt;/p&gt;

&lt;h1&gt;
  
  
  Common Themes
&lt;/h1&gt;

&lt;p&gt;The features described here are mostly not brand new and can be found in languages like OCaml, Haskell and F#. The interesting part is that Elm and Rust prove they are useful for quite diverse use-cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tagged unions
&lt;/h2&gt;

&lt;p&gt;This is a small but very practical feature - I would say tagged unions are enums on steroids. Consider how often you have written something like:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enum AccountType {Savings, CreditCard};     

//In real code please use Decimal types to represent money. Please.
class CreditParams {
    int creditLimit; 
    ...
}

class Account {
    AccountType accountType;
    int balance;
    CreditParams creditParams; //only present for CreditCard, always null for Savings       
}       
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This makes room for some sweet bugs, as your data model can represent a state that should be impossible (savings account with non-null credit parameters, or a credit card account with null credit parameters). The programmer needs to take care that no manipulation of the &lt;code&gt;Account&lt;/code&gt; object can lead to such a state which may be non-trivial and error-prone. It also creates ambiguity - for example, there are multiple ways to get the credit limit of an account:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//Yes I know, this should be a class method
int getCreditLimit1(Account account) {
    if (account.creditParams != null) { 
        //wrong if account.accountType == Savings
        return account.creditParams.creditLimit;
    } else {
        return 0;
    }
}

int getCreditLimit2(Account account) {
    if (account.accountType == CreditCard) { 
        //possibly accessing a null pointer
        return account.creditParams.creditLimit; 
    } else {
        return 0;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A more desirable option is to &lt;a href="https://medium.com/elm-shorts/how-to-make-impossible-states-impossible-c12a07e907b5#.rbysmekjt"&gt;make impossible states impossible&lt;/a&gt;. Tagged unions let you do this by attaching heterogeneous data to each variant. This lets us rewrite the data model as (Rust syntax, &lt;a href="https://is.gd/GSoO8n"&gt;try it online&lt;/a&gt;):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

struct CreditParams {
    credit_limit: i32, //i32 is a 32bit signed int
    ...
} 

enum AccountDetails {
    Savings, //Savings has no attached data
    CreditCard(CreditParams), //CreditCard has a single CreditParams instance
}

struct Account { 
    balance: i32, 
    details: AccountDetails,
  }

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With tagged unions, you cannot access the attached data without explicitly checking the type - so there is only one way to get the credit limit and it is always correct (Rust syntax, &lt;a href="https://is.gd/GSoO8n"&gt;try it online&lt;/a&gt;):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fn get_credit_limit(account: Account) -&amp;gt; i32 {
    match account.details { //match is like case
        AccountDetails::CreditCard(params) =&amp;gt;  //bind local variable params to the attached data
            params.credit_limit,    //in Rust, return is implicit
        AccountDetails::Savings =&amp;gt; 
            0
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Since neither Elm nor Rust has null values, you have to specify &lt;code&gt;CreditParams&lt;/code&gt; when building an &lt;code&gt;AccountDetails&lt;/code&gt; instance, and so the code above is safe in all situations.&lt;/p&gt;

&lt;p&gt;A further bonus is that in both Elm and Rust, you have to handle all possible cases of a tagged union (or provide a default branch). Failing to handle all cases is a compile-time error. In this way, the compiler makes sure that you update all your code when you extend &lt;code&gt;AccountDetails&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Type Inference
&lt;/h2&gt;

&lt;p&gt;Some people are fond of static typing as it is harder to write erroneous code in statically-typed languages. Some people like dynamic typing, because it avoids the bureaucracy of adding type annotations to everything. Type inference tries to get the best of both worlds: the language is statically typed, but you rarely need to provide type annotations. Type inference in Rust and Elm works a bit like &lt;code&gt;auto&lt;/code&gt; in C++, but it is much more powerful - it looks at a broader context and also takes downstream code into consideration. So for example (Rust syntax, &lt;a href="https://is.gd/izOGxZ"&gt;try it online&lt;/a&gt;)&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// The compiler infers that elem is a float.
let elem = 1.36;

//Explicit type annotation - f64 is a double precision float
let elem2: f64 = 3.141592;

// Create an empty vector (a growable array).
let mut vec = Vec::new();
// At this point the compiler doesn't know the exact type of `vec`, it
// just knows that it's a vector of something (`Vec&amp;lt;_&amp;gt;`).

// Insert `elem` and `elem2` in the vector.
vec.push(elem);
vec.push(elem2);
// Aha! Now the compiler knows that `vec` is a vector of doubles (`Vec&amp;lt;f64&amp;gt;`)

//The compiler infers that s is a &amp;amp;str (reference to string)
let s = "Hello";

//Compile-time error: expected floating-point variable, found &amp;amp;str
vec.push(s); 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Type inference in Rust has certain limitations and so explicit type annotations are still needed now and then. But Elm goes further, implementing a variant of the &lt;a href="https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system"&gt;Hindley-Milner type system&lt;/a&gt;. In practice this means that type annotations in Elm are basically just comments (except for some &lt;a href="https://github.com/elm-lang/elm-compiler/issues/1353"&gt;weird corner cases&lt;/a&gt;). While the Elm compiler enforces that type annotations match the code, they can be omitted and the compiler will still statically typecheck everything. Nevertheless, you should still annotate your functions with types, as type annotations let the compiler give you better error messages and force you to articulate your intent clearly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Immutability
&lt;/h2&gt;

&lt;p&gt;Immutability means that variables/data cannot be modified after initial assignment/creation. Another way to state it is that operations on immutable data can have no observable effect except for returning a value. This implies that functions on immutable data will always return the same value for the same arguments. Code working with immutable data is  easier to understand and reason about and is inherently thread-safe. Consider this code with mutable data:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;address = new Address();
address.street = "Mullholland Drive";
...
person = new Person();
person.primaryAddress = address;
print(person.primaryAddress.street) //Mullholland Drive
...
address.street = "Park Avenue"
...
print(person.primaryAddress.street) //Park Avenue
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let's say we want to figure out why &lt;code&gt;person.primaryAddress.street&lt;/code&gt; changed. Since the data is mutable, it is not sufficient to find all usages of &lt;code&gt;person.primaryAddress&lt;/code&gt; - we also need to check the whole tree of all variables/fields that were assigned to/from &lt;code&gt;person.primaryAddress&lt;/code&gt;. With immutable data structures this problem is prevented as the programmer is forced to write something like:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;address = new Address("Mullholland Drive", 1035, "California");
//Elm and Rust also support syntax of the form:
//address = { street = "Mullholland Drive", number = 1035, state = "California" }
...
person = new Person(address);
...
address.street = "Park Avenue" //not allowed, the object is immutable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For a more detailed discussion of why immutability is good, see for example &lt;a href="https://dev.to/0x13a/3-benefits-of-using-immutable-objects"&gt;3 benefits of using Immutable Objects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Elm goes all-in on immutability - everything is immutable and no function can have a side effect. Rust is a bit more relaxed: in Rust, you have to opt-in for mutability and the compiler ensures that as long as a piece of data can be changed within a code segment (there is a mutable reference to the data), no other code path can read or modify the same data. &lt;/p&gt;
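&lt;p&gt;To make the contrast concrete, here is a minimal Rust sketch of opt-in mutability (the variable names are made up for illustration):&lt;/p&gt;

```rust
fn main() {
    let balance = 100; // immutable by default
    // balance = 150; // would NOT compile: cannot assign twice to an immutable variable

    let mut mutable_balance = 100; // mutability has to be requested explicitly
    mutable_balance += 50; // fine
    assert_eq!(mutable_balance, 150);
    assert_eq!(balance, 100);
}
```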

&lt;h3&gt;
  
  
  The Problem with Immutability
&lt;/h3&gt;

&lt;p&gt;Making sure that the data you are referencing cannot change without your cooperation  generally makes your life easier. Unless this is EXACTLY what you want to achieve. Let's say you are writing a traffic monitoring tool. You might want to model your data like this (Elm syntax):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- In Elm, double dash marks a comment
type alias City =         --Curly braces declare a record type, a bit like an object
  { name: String
  , routes: List Route    --list of Route instances
  }

type alias Route =
  { from: City
  , to: City
  , trafficLevel: Float
  }

type alias World =
  { cities: List City
  , routes: List Route
  }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may expect that when you receive new traffic information, you simply work with &lt;code&gt;World.routes&lt;/code&gt; and the changes will be seen when accessing through &lt;code&gt;City.routes&lt;/code&gt;. But you would be mistaken. In Elm this will not even compile (fields in record types are fully expanded at compile time, and thus cannot have circular references). And if you use tagged unions to make the model compile, the &lt;code&gt;trafficLevel&lt;/code&gt; accessed via &lt;code&gt;World.routes&lt;/code&gt; may not be the same as the one accessed via &lt;code&gt;City.routes&lt;/code&gt;, as those always behave as different instances.&lt;/p&gt;

&lt;p&gt;A similar data model in Rust will compile but it will be difficult to actually instantiate the structure and you won't be able to ever modify the &lt;code&gt;trafficLevel&lt;/code&gt; of any &lt;code&gt;Route&lt;/code&gt; instance, because the compiler won't let you create a mutable reference to it (every &lt;code&gt;Route&lt;/code&gt; is referenced at least twice).&lt;/p&gt;

&lt;p&gt;This brings us to a less talked-about implication of immutability: &lt;strong&gt;immutable data structures are inherently tree-like&lt;/strong&gt;. In both Elm and Rust, it is a pain to work with graph-like structures and you have to give up some guarantees the languages give you.&lt;/p&gt;

&lt;p&gt;In Elm, the only way to represent a graph is to use indices into a dictionary (map) instead of direct references. For the above example, a practical data model could look like:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type alias RouteId = Int    -- New types just for clarity
type alias CityId = Int 

type alias City =
  { id: CityId 
  , name: String
  , routes: List RouteId 
  }

type alias Route =
  { id: RouteId
  , from: CityId
  , to: CityId
  , trafficLevel: Float
  }

type alias World =
  { cities: Dict CityId City     --dictionary (map) with CityId as keys and City as values
  , routes: Dict RouteId Route
  }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that nothing prevents us from having an invalid &lt;code&gt;RouteId&lt;/code&gt; stored in &lt;code&gt;City.routes&lt;/code&gt;. While Elm gives you good tools to work with such a model (e.g., it forces you to always handle the case where a given &lt;code&gt;RouteId&lt;/code&gt; is not present in &lt;code&gt;World.routes&lt;/code&gt;), and the advantages for every other use case make this an acceptable cost, it is still a bit annoying.&lt;/p&gt;

&lt;p&gt;Rust has a bit more options to work with graph-like data, but they all have downsides of their own (&lt;a href="http://smallcultfollowing.com/babysteps/blog/2015/04/06/modeling-graphs-in-rust-using-vector-indices/"&gt;using indices&lt;/a&gt;, &lt;a href="http://stackoverflow.com/questions/34747464/implement-graph-like-datastructure-in-rust"&gt;StackOverflow discussion&lt;/a&gt;, &lt;a href="https://github.com/nrc/r4cppp/blob/master/graphs/README.md"&gt;graphs using ref counting or arena allocation&lt;/a&gt;).&lt;/p&gt;
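&lt;p&gt;For comparison, the index-based approach translates quite directly to Rust. The following is a rough sketch mirroring the Elm model above (type and field names are illustrative):&lt;/p&gt;

```rust
use std::collections::HashMap;

type CityId = u32;
type RouteId = u32;

struct City {
    name: String,
    routes: Vec<RouteId>, // indices instead of direct references
}

struct Route {
    from: CityId,
    to: CityId,
    traffic_level: f64,
}

struct World {
    cities: HashMap<CityId, City>,
    routes: HashMap<RouteId, Route>,
}

fn main() {
    let mut world = World { cities: HashMap::new(), routes: HashMap::new() };
    world.cities.insert(0, City { name: "Springfield".into(), routes: vec![0] });
    world.cities.insert(1, City { name: "Shelbyville".into(), routes: vec![0] });
    world.routes.insert(0, Route { from: 0, to: 1, traffic_level: 0.3 });

    // There is now a single Route instance, so updating it is easy.
    // Lookups can fail, hence the explicit handling of the Option.
    if let Some(route) = world.routes.get_mut(&0) {
        route.traffic_level = 0.9;
    }
    assert_eq!(world.routes[&0].traffic_level, 0.9);
}
```

&lt;p&gt;Just like in the Elm version, nothing prevents an invalid index from being stored, so some of the compiler's guarantees are traded away.&lt;/p&gt;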

&lt;h2&gt;
  
  
  Smart but Restrictive Compilers
&lt;/h2&gt;

&lt;p&gt;This is basically a generalization of the previous specific features. The compilers for Elm and Rust are powerful and they do a lot of stuff for you. They not only parse the code line-by-line, but they reason about your code in the context of the whole program. However, the most interesting thing about compilers for Rust and Elm is not what they let you do. It is what they DO NOT let you do (e.g., you cannot mix floats and ints without explicit conversion, you cannot get to the data stored in a tagged union without handling all possible cases, you cannot modify certain data, etc.). At the same time, the compilers are smart enough to make conforming to these restrictions less of a chore. If you think that programmers will produce better code when given fewer limitations, think of the time people complained that &lt;a href="http://web.archive.org/web/20090320002214/http://www.ecn.purdue.edu/ParaMount/papers/rubin87goto.pdf"&gt;restricting the use of GOTO hinders productivity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another way to formulate this stance is that languages should not strive to make best practices easy as much as they should &lt;a href="http://www.haskellforall.com/2016/04/worst-practices-should-be-hard.html"&gt;make writing bad code hard&lt;/a&gt;. I think both languages achieve this to a good degree - writing &lt;em&gt;any&lt;/em&gt; code is a bit harder than in their less restrictive relatives, but there is much less incentive to take shortcuts.&lt;/p&gt;

&lt;p&gt;In practice, smart but restrictive compilers mean more time spent coding and less time spent debugging. Since debugging and reading messy code can be very time-consuming, this usually results in a net productivity gain. Personally, I love writing code, while debugging is often frustrating, so to me, this is a sweet deal.&lt;/p&gt;

&lt;p&gt;Needless to say, all those restrictions make hacking one-off dirty solutions in Rust or Elm slightly annoying. But what code is truly one-off?&lt;/p&gt;

&lt;h2&gt;
  
  
  Style matters
&lt;/h2&gt;

&lt;p&gt;The communities of both Elm and Rust make a big push for consistent presentation of source code. At the very least, this reduces the need for lengthy project-specific style guidelines for every team using the language. To be specific, the Elm compiler enforces indentation for certain language constructs, &lt;a href="https://github.com/elm-lang/elm-compiler/issues/1370"&gt;does not allow Tabs for indentation&lt;/a&gt;(!) and enforces that types begin with an upper-case letter while functions begin with a lower-case one. Further, there is &lt;a href="https://github.com/avh4/elm-format"&gt;elm-format&lt;/a&gt;, a community-endorsed source formatter. &lt;/p&gt;

&lt;p&gt;In a similar vein, the Rust compiler gives warnings if you do not stick to the official naming conventions and also provides a community-endorsed formatter, &lt;a href="https://github.com/rust-lang-nursery/rustfmt"&gt;rustfmt&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  More About Elm
&lt;/h1&gt;

&lt;p&gt;Now is the time to talk about the languages individually, if you are still interested. We will take &lt;a href="http://elm-lang.org"&gt;Elm&lt;/a&gt; first. Elm is a simple, small language. The complete syntax is &lt;a href="http://elm-lang.org/docs/syntax"&gt;documented on a single page&lt;/a&gt;. Elm aims at people already using Javascript and strives for a low barrier to entry. Elm is currently at version 0.18 and new releases regularly bring backwards-incompatible changes (although official conversion tools are available). An interesting thing is that over the last few versions more syntax elements were removed than added, testifying to the focus on language simplicity.&lt;/p&gt;

&lt;p&gt;Elm is purely functional. This means there are no variables in the classical sense, everything is a function. How does an application evolve over time if there are no variables? This is handled by &lt;a href="https://guide.elm-lang.org/architecture/"&gt;The Elm Architecture&lt;/a&gt; (TEA). On the most simplistic level, an Elm application consists primarily of an &lt;code&gt;update&lt;/code&gt; function and a &lt;code&gt;view&lt;/code&gt; function. The &lt;code&gt;update&lt;/code&gt; function takes a previous state of the application and input from the user/environment and returns a new state of the application. The &lt;code&gt;view&lt;/code&gt; function then takes the state of the application and returns an HTML representation. All changes to the application state thus happen outside of Elm code, within the native code in TEA. The architecture also provides the necessary magic to correctly and efficiently update the DOM to match the latest &lt;code&gt;view&lt;/code&gt; result.&lt;/p&gt;
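&lt;p&gt;To give a feel for the shape of TEA, here is a hypothetical sketch of the idea transliterated into Rust syntax (Elm itself looks quite different; a plain &lt;code&gt;String&lt;/code&gt; stands in for the HTML representation):&lt;/p&gt;

```rust
// Messages describe all possible inputs from the user/environment.
enum Msg {
    Increment,
    Decrement,
}

// update: previous state + input -> new state (a pure function)
fn update(model: i32, msg: Msg) -> i32 {
    match msg {
        Msg::Increment => model + 1,
        Msg::Decrement => model - 1,
    }
}

// view: state -> representation of the UI (another pure function)
fn view(model: i32) -> String {
    format!("<div>Count: {}</div>", model)
}

fn main() {
    // The TEA runtime would drive this loop; here we simulate two clicks.
    let model = 0;
    let model = update(model, Msg::Increment);
    let model = update(model, Msg::Increment);
    assert_eq!(view(model), "<div>Count: 2</div>");
}
```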

&lt;p&gt;TEA forces you to explicitly say what constitutes the state of your application and its inputs. This lets Elm provide its killer feature: &lt;a href="http://elm-lang.org/blog/the-perfect-bug-report"&gt;the time-travel debugger&lt;/a&gt;. In essence, when the debugger is turned on, you can replay the whole history of the application and inspect the application state at any point in the past. And due to the way the language is designed, it works 100% of the time.&lt;/p&gt;

&lt;p&gt;Another big plus of TEA is that you never have to worry about forgetting to hide an element when the user clicks a checkbox. If your &lt;code&gt;view&lt;/code&gt; function correctly displays the element based on the current application state, the element will also be automatically hidden once the application state changes again. &lt;/p&gt;

&lt;p&gt;Further sweet things about Elm include the effort to have &lt;a href="http://elm-lang.org/blog/compiler-errors-for-humans"&gt;nice and helpful error messages&lt;/a&gt;, with a dedicated GitHub repository for &lt;a href="https://github.com/elm-lang/error-message-catalog/issues"&gt;suggesting error message improvements&lt;/a&gt;, and the &lt;a href="http://elm-lang.org/docs/records"&gt;record system&lt;/a&gt;, which gives you a lot of freedom in using structured types (e.g., you do not have to declare them before use) while still being statically checked for correctness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pain Points in Elm
&lt;/h2&gt;

&lt;p&gt;A big downside of TEA is that it assumes that all state of the application can be made explicit. This makes working with HTML elements that have a state of their own tricky in certain contexts (e.g., &lt;a href="https://groups.google.com/d/topic/elm-discuss/ALKjx3bsCgc/discussion"&gt;text area contents&lt;/a&gt;, &lt;a href="https://github.com/elm-lang/html/issues/55"&gt;caret position in text areas&lt;/a&gt;, Web Components). You need to take care to prevent TEA from messing with such components destructively. Further, TEA can be resource intensive, albeit &lt;a href="http://elm-lang.org/blog/blazing-fast-html-round-two"&gt;less than comparable JS frameworks&lt;/a&gt;. Last but not least, creating large apps in Elm involves writing a significant amount of boilerplate code. The Elm community is &lt;a href="https://groups.google.com/d/topic/elm-discuss/FHmv9hBdSA0/discussion"&gt;still discussing&lt;/a&gt; how to develop &lt;a href="https://groups.google.com/d/topic/elm-discuss/I1OBptGOU_A/discussion"&gt;large projects&lt;/a&gt; more &lt;a href="https://groups.google.com/forum/#!searchin/elm-discuss/large%7Csort:relevance/elm-discuss/2RTddO_4rLw/xOmzeg6wAgAJ"&gt;easily&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  More About Rust
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Whoa, that's a lot of new syntax!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Rust book, &lt;a href="https://doc.rust-lang.org/beta/book/macros.html"&gt;section 4.34 on Macros&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;In comparison with Elm, Rust is quite the beast. There is a lot of syntax and a lot of things to learn. This is however not unexpected: if you want to write fast code, you really need a lot of control. Also, C and especially C++ have loads of syntax, so Rust is definitely not at a big disadvantage here. Rust is currently at version 1.15 and has &lt;a href="https://blog.rust-lang.org/2014/10/30/Stability.html"&gt;forward compatibility guarantees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While Rust is imperative, it took in a lot of useful functional programming concepts and boasts &lt;a href="https://blog.rust-lang.org/2015/05/11/traits.html"&gt;zero cost abstractions&lt;/a&gt; - i.e., all the fancy syntactic tricks that let you develop code easily incur no actual performance penalty in comparison with a hand-tuned but dirty solution. &lt;/p&gt;

&lt;p&gt;Rust also has no OOP of the usual kind, instead it has &lt;a href="https://doc.rust-lang.org/book/traits.html"&gt;traits&lt;/a&gt; (a bit like interfaces) and deliberately avoids inheritance (you should compose instead).&lt;/p&gt;
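&lt;p&gt;A minimal sketch of what a trait looks like (the types are made up for illustration):&lt;/p&gt;

```rust
trait HasArea {
    fn area(&self) -> f64;
}

struct Circle { radius: f64 }
struct Square { side: f64 }

impl HasArea for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.radius * self.radius }
}

impl HasArea for Square {
    fn area(&self) -> f64 { self.side * self.side }
}

// A generic function bounded by the trait. It is resolved at compile time,
// so there is no virtual-call overhead.
fn print_area<T: HasArea>(shape: &T) {
    println!("area = {}", shape.area());
}

fn main() {
    print_area(&Circle { radius: 1.0 });
    print_area(&Square { side: 2.0 });
}
```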

&lt;p&gt;The weirdest and most interesting part of Rust is the &lt;a href="https://doc.rust-lang.org/book/ownership.html"&gt;borrow checker&lt;/a&gt;. While Rust does not have managed memory (garbage collection), it can still guarantee that you cannot access uninitialized memory, dereference a null pointer or otherwise corrupt your memory. This has big implications not only for reliability but also for security, as Rust automatically prevents whole classes of severe attacks such as &lt;a href="https://en.wikipedia.org/wiki/Buffer_overflow"&gt;buffer overflow&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Heartbleed"&gt;Heartbleed&lt;/a&gt; (&lt;a href="https://tonyarcieri.com/would-rust-have-prevented-heartbleed-another-look"&gt;blog post&lt;/a&gt;). Rust also prevents most (but not all) memory leaks. The borrow checker is what enables a big portion of those guarantees by validating that your program accesses memory correctly at &lt;em&gt;compile time&lt;/em&gt;, i.e. without the runtime penalty of managed memory. The borrow checker ensures that a mutable reference to a piece of data cannot coexist with any other reference (and thus that you cannot free memory while holding a reference to it). For some intuition, mutable references in Rust behave a bit like &lt;code&gt;std::unique_ptr&lt;/code&gt; in C++ (&lt;a href="http://en.cppreference.com/w/cpp/memory/unique_ptr"&gt;specs&lt;/a&gt;), but with the uniqueness enforced at compile-time. A more detailed description would not fit here, so check &lt;a href="http://rustbyexample.com/scope/borrow.html"&gt;Rust by Example&lt;/a&gt; or just Google away :-).&lt;/p&gt;
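&lt;p&gt;The core rule - a mutable reference cannot coexist with any other reference to the same data - can be seen in a small sketch:&lt;/p&gt;

```rust
fn main() {
    let mut data = vec![1, 2, 3];

    {
        let first = &data[0]; // an immutable borrow of `data`
        // data.push(4); // would NOT compile: cannot borrow `data` as mutable
        //               // while the immutable borrow `first` is alive
        assert_eq!(*first, 1);
    } // the immutable borrow ends here

    data.push(4); // fine now: no other references to `data` exist
    assert_eq!(data.len(), 4);
}
```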

&lt;h2&gt;
  
  
  Pain Points in Rust
&lt;/h2&gt;

&lt;p&gt;The borrow checker is both the biggest strength and the biggest weakness of Rust. Although the Rust community has put a lot of effort into making most code just work, you inevitably end up &lt;a href="https://m-decoster.github.io//2017/01/16/fighting-borrowchk/"&gt;fighting&lt;/a&gt; the &lt;a href="https://ayende.com/blog/176801/the-struggle-with-rust"&gt;borrow checker&lt;/a&gt;. There are some promising &lt;a href="http://smallcultfollowing.com/babysteps/blog/2016/04/27/non-lexical-lifetimes-introduction/"&gt;updates to the borrow checker&lt;/a&gt; in the pipeline that could make the life of a Rust programmer easier, but it will not be a cakewalk anytime soon - making the compiler understand your program is hard (both for the programmer and for the compiler).&lt;/p&gt;

&lt;p&gt;While Rust takes performance seriously and the compiler should &lt;em&gt;in theory&lt;/em&gt; be able to do a lot more optimizations than C/C++ compilers, Rust is not quite there yet. Benchmarks I've seen put it on par with or slightly behind C/C++ compiled with gcc (e.g. &lt;a href="http://benchmarksgame.alioth.debian.org/u64q/which-programs-are-fastest.html"&gt;Benchmarks game&lt;/a&gt;). From memory, gcc also used to produce slower code than MSVC or the Intel compiler, which would be bad news for Rust. The Internet however suggests that recent gcc is on par with MSVC/Intel, although I was unable to find a good benchmark to link.&lt;/p&gt;

&lt;p&gt;Development in Rust also still has some rough edges: &lt;a href="https://areweideyet.com/"&gt;IDE support is incomplete&lt;/a&gt;, setting up a decent debug environment may be as much as a &lt;a href="https://sherryummen.in/2016/09/02/debugging-rust-on-windows-using-visual-studio-code/"&gt;14-step process&lt;/a&gt;, and even then the features are limited.&lt;/p&gt;

&lt;h1&gt;
  
  
  Concluding
&lt;/h1&gt;

&lt;p&gt;The same way functional programming has made its way from the fringes into mainstream languages, I believe the features that make both Elm and Rust interesting will show up in the mainstream.&lt;br&gt;
Some of the ideas can also be immediately transferred to current languages (e.g. &lt;a href="https://facebook.github.io/immutable-js/"&gt;ImmutableJS&lt;/a&gt;). I think the take-home message of this post is that you should consider learning a new language - preferably one that is very different from what you have been working with so far. Not only is it fun, it will also make you a better programmer in your language of choice.&lt;/p&gt;

&lt;p&gt;I'll be very happy if you provide your feedback on this post either here, on &lt;a href="https://twitter.com/martin_cerny_ai"&gt;my Twitter&lt;/a&gt; or &lt;a href="https://www.reddit.com/r/elm/comments/5srgwx/post_what_elm_and_rust_teach_us_about_the_future/"&gt;on Reddit&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>elm</category>
      <category>rust</category>
      <category>immutable</category>
      <category>typeinference</category>
    </item>
    <item>
      <title>Profiling Rust code on Windows using CodeXL</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Tue, 31 Jan 2017 11:08:26 +0000</pubDate>
      <link>https://forem.com/martinmodrak/profiling-rust-code-on-windows-using-codexl</link>
      <guid>https://forem.com/martinmodrak/profiling-rust-code-on-windows-using-codexl</guid>
      <description>&lt;p&gt;I like the &lt;a href="https://www.rust-lang.org" rel="noopener noreferrer"&gt;Rust language&lt;/a&gt;. However, developing with Rust on Windows currently has a lot of rough edges. One thing I was struggling with was a working solution for profiling Rust code on Windows. &lt;br&gt;
In the end, the best tool for the job I found was CodeXL (formerly known as AMD CodeXL). Here is a step-by-step tutorial on how it is done.&lt;br&gt;
The tutorial was tested on Windows 10 with Rust 1.14.0.&lt;/p&gt;

&lt;h1&gt;
  
  
  Note on Debugging and Targets
&lt;/h1&gt;

&lt;p&gt;To be able to debug the code I use the &lt;code&gt;x86_64-pc-windows-gnu&lt;/code&gt; target (as opposed to  &lt;code&gt;x86_64-pc-windows-msvc&lt;/code&gt; which is the default for Windows). &lt;br&gt;
However, the profiling experience is better with the MSVC target. So maybe the best approach is to switch between the targets based on your current needs using rustup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; rustup default stable-x86_64-pc-windows-msvc
&amp;gt; rustup default stable-x86_64-pc-windows-gnu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An aside on debugging on Windows: if you want to set up a Rust debug environment, I recommend &lt;a href="https://sherryummen.in/2016/09/02/debugging-rust-on-windows-using-visual-studio-code/" rel="noopener noreferrer"&gt;Sherry Ummen's tutorial&lt;/a&gt; for&lt;br&gt;
setting up Rust + Visual Studio Code (requires the GNU target).&lt;br&gt;
I've heard good rumours about Rust support in Sublime and Atom, but have yet to test it myself.&lt;br&gt;
&lt;del&gt;AFAIK the debugging support in the &lt;a href="https://marketplace.visualstudio.com/items?itemName=vosen.VisualRust" rel="noopener noreferrer"&gt;Visual Studio 2015 extension&lt;/a&gt; is fairly limited and also requires the GNU target.&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UPDATE:&lt;/strong&gt; In theory, you should be able to use standard debugger and profiler of Visual Studio (with the C++ toolset) to analyze code compiled with the MSVC target. I tried attaching Visual Studio debugger to a running Rust process (compiled with debug information) and it worked nicely! I was however unable to make the profiler work with the exact same program. I'll write another tutorial if I figure it out.&lt;/p&gt;

&lt;h1&gt;
  
  
  Profiling
&lt;/h1&gt;

&lt;p&gt;The first thing we need to do is compile the program with optimizations turned on (otherwise profiling makes no sense), but with debugging information:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; rustc -g -O -o rna.exe ..\src\main.rs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you compile through Cargo, you should be able to set the necessary options by modifying the  &lt;code&gt;[profile.release]&lt;/code&gt; section of  &lt;code&gt;Cargo.toml&lt;/code&gt; file. See the &lt;a href="http://doc.crates.io/manifest.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details.&lt;/p&gt;
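&lt;p&gt;To the best of my knowledge, the relevant option is &lt;code&gt;debug&lt;/code&gt;, so getting an optimized build with debug information would amount to something like:&lt;/p&gt;

```toml
[profile.release]
debug = true
```

&lt;p&gt;After that, &lt;code&gt;cargo build --release&lt;/code&gt; should produce an optimized binary with debug symbols.&lt;/p&gt;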

&lt;p&gt;Now open &lt;a href="https://github.com/GPUOpen-Tools/CodeXL" rel="noopener noreferrer"&gt;CodeXL&lt;/a&gt;. I have tested with version 1.9 and 2.2, but I believe other versions would work as well.&lt;/p&gt;

&lt;p&gt;First you switch to profile mode ( &lt;code&gt;Profile -&amp;gt; Switch to Profile Mode&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphf7jglhmd4xrtsesq86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphf7jglhmd4xrtsesq86.png" alt="Switch to profile mode" width="524" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My app was running pretty long, so I chose the easy path, started the app from command line and used  &lt;code&gt;Profile -&amp;gt; Attach to process&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoyhw14xclwndsxhafpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoyhw14xclwndsxhafpw.png" alt="Attach to process" width="468" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It should be possible to have CodeXL start the app for you, but I didn't bother.&lt;/p&gt;

&lt;p&gt;Then you end the profiling by clicking the stop button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm61f5q85d9ie7ms7ocw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm61f5q85d9ie7ms7ocw.png" alt="Stop profiling" width="424" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you get the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcbcq622gloy7dkj9dlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcbcq622gloy7dkj9dlh.png" alt="Profile output" width="572" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there are only two functions in the profiling output. That is not because my app has no other functions, but because Rust inlined all the rest.&lt;/p&gt;

&lt;p&gt;This is the moment when the GNU target bites you. If you compiled for the GNU target, double-clicking on a function shows the actual time spent in individual processor instructions, but I found no simple way to match those instructions to the Rust code (and the inlined functions).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folea8214ss30fe0ggxp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folea8214ss30fe0ggxp3.png" alt="Function detail - GNU target" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, if you use the MSVC target, you see the actual source code! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8txqiauluuxra28i3rm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8txqiauluuxra28i3rm0.png" alt="Function detail - MSVC target" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, even with MSVC, all samples attributed to an inlined function are associated with the line where the function is called, so you still see very little detail. &lt;br&gt;
To take the example above: is the  &lt;code&gt;next_cut()&lt;/code&gt; function really consuming so many samples or is the  &lt;code&gt;match&lt;/code&gt; statement responsible?&lt;/p&gt;

&lt;p&gt;So I used a little trick that gives more information, but may add noise to the timings: I forced the compiler not to inline anything:&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; rustc -g -O -C inline-threshold=0 -o rna.exe ..\src\main.rs
&lt;/code&gt;&lt;/pre&gt;
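&lt;p&gt;If disabling inlining globally distorts the timings too much, a finer-grained alternative is to mark just the functions you care about with Rust's &lt;code&gt;#[inline(never)]&lt;/code&gt; attribute, so they stay visible in the profiler output while everything else is optimized as usual. A minimal sketch (&lt;code&gt;find_next_cut&lt;/code&gt; here is a hypothetical stand-in for the real function):&lt;/p&gt;

```rust
// #[inline(never)] keeps this function from being inlined, so profiler
// samples are attributed to it instead of to its caller.
#[inline(never)]
fn find_next_cut(data: &[u8]) -> Option<usize> {
    // Hypothetical body: find the first '|' separator in the input.
    data.iter().position(|&b| b == b'|')
}

fn main() {
    // The call below survives as a real call even under -O.
    let input = b"acgu|acg";
    println!("{:?}", find_next_cut(input)); // prints Some(4)
}
```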

&lt;p&gt;Which gives a more helpful result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u6f60vck6p809gub6ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u6f60vck6p809gub6ie.png" alt="Results - without inlining" width="538" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can now be quite confident that most of the time is spent in the actual body of  &lt;code&gt;main&lt;/code&gt; and  &lt;code&gt;find_next_cut&lt;/code&gt; and not in the inlined functions.&lt;/p&gt;

&lt;p&gt;I also tried the &lt;a href="https://github.com/VerySleepy/verysleepy" rel="noopener noreferrer"&gt;Very Sleepy profiler&lt;/a&gt;, which seemed to sort of work, &lt;br&gt;
but I could not get it to display any debug information (especially the function names).&lt;/p&gt;

&lt;p&gt;Hope this helps you!&lt;/p&gt;

</description>
      <category>rust</category>
      <category>profiling</category>
      <category>tutorial</category>
      <category>windows</category>
    </item>
    <item>
      <title>Hi, I'm Martin Modrák</title>
      <dc:creator>Martin Modrák</dc:creator>
      <pubDate>Mon, 30 Jan 2017 14:43:33 +0000</pubDate>
      <link>https://forem.com/martinmodrak/hi-im-martin-ern</link>
      <guid>https://forem.com/martinmodrak/hi-im-martin-ern</guid>
      <description>&lt;p&gt;I wrote my first program about 22 years ago and have been coding regularly since then.&lt;/p&gt;

&lt;p&gt;You can find me on GitHub as &lt;a href="https://github.com/martinmodrak" rel="noopener noreferrer"&gt;martinmodrak&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I live in Prague, Czech Republic (that's in Europe ;-) ).&lt;/p&gt;

&lt;p&gt;I work at the Czech Academy of Sciences as researcher/programmer.&lt;/p&gt;

&lt;p&gt;I have the most experience in Java, but I have switched projects and languages frequently, so I also feel comfortable in C#, Python, C++ and, more recently, Elm.&lt;/p&gt;

&lt;p&gt;I am currently learning more about Elm, Rust and OpenCL.&lt;/p&gt;

&lt;p&gt;Nice to meet you.&lt;/p&gt;

</description>
      <category>introduction</category>
    </item>
  </channel>
</rss>
