What you're reading is a blog post. Where is it from? It's from . What's it doing? It's cataloging the thoughts of the people who run Haskell.org.
That's right. This is our new adventure in communicating with you. We wanted some place to put more long-form posts, invite guests, and generally keep people up to date about general improvements (and faults) to our infrastructure, now and in the future. Twitter and a short-form status site aren't so great, and a non-collective mind of people posting scattered things on various sites/lists isn't super helpful for cataloging things.
So for an introduction post, we've really got a lot to talk about...
Haskell.org has had some rough times lately.
About a month and a half ago, we had an extended period of outage, roughly around the weekend of ICFP 2014. This was due to a really large amount of I/O getting backed up on our host machine, a single-tenant, bare-metal machine from Hetzner that we used to host several VMs that comprise the old server set, including the main website, the GHC Trac and git repositories, and Hackage. We alleviated a lot of the load by turning off the hackage server, and migrating one of the VMs to a new hosting provider.
Then, about a week and a half ago, we had another hackage outage that was a result of more meager concerns: disk space usage. Much to my chagrin, this was due in part to an absence of log rotation over the past year, which resulted in a hefty 15GB of text sitting around (in a single file, no less). Oops.
This caused a small bump in the road: the hackage server hit a slight error while committing some transactions to its database when it ran out of disk. We recovered from this (thanks to @duncan for the analysis), and restarted it. (We also had point-in-time backups, but in this case it was easier to fix the problem than to roll back the whole database.)
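For the curious, the cure for a runaway log file is conceptually small. Here's a minimal sketch in Python of size-based rotation using the standard library's `RotatingFileHandler`. To be clear, this is not what the hackage server actually runs; the file name, the 1 KB cap and the backup count are made-up values, picked only so the effect is easy to see.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, max_bytes=1024, backups=3):
    """Return a logger whose file is rotated before it exceeds max_bytes.

    At most `backups` old files (path.1, path.2, ...) are kept, so total
    disk usage stays bounded at roughly (backups + 1) * max_bytes.
    """
    logger = logging.getLogger("rotation-demo-" + path)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the size of every file in `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


# Hypothetical demo: write far more than the cap and let rotation trim it.
log_dir = tempfile.mkdtemp()
log = make_rotating_logger(os.path.join(log_dir, "server.log"))
for i in range(1000):
    log.info("request %d handled", i)
```

However many lines get written, the directory never holds more than the live file plus three backups. Compare that with one file growing quietly for a year.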
But we've had several other availability issues beforehand too, including faulty RAM and inconsistent performance. So we're setting out to fix it. And in the process we figured, hey, they'd probably like to hear us babble about a lot of other stuff, too, because why not?
OK, so enough sad news about what happened. Now you're wondering what's going to happen. Most of these happening-things will be good, I hope.
There are a bunch of new things we've done over the past year or so for Haskell.org, so it's best to summarize them a bit. These aren't in any particular order; most of the things written here are pretty new and some are a bit older since the servers have started churning a bit. But I imagine many things will be new to y'all.
And it's a fancy one at that (powered by ). Like I said, we'll be posting news updates here that we think are applicable for the Haskell.org community at large - but most of the content will focus on the administrative side.
As I mentioned earlier this year pending the GHC 7.8 release, Rackspace has graciously donated resources towards Haskell.org for GHC, particularly for buildbots. At that time we had begun using Rackspace resources for hosting Haskell.org services. Over the past year, we've done so more and more, to the point where we've decided to move all of Haskell.org. It became clear that with these resources we could offer much higher reliability and greatly improved services for users.
was my contact point at Rackspace, and has set up Haskell.org for its 2nd year running with free Rackspace-powered machines, storage, and services. That's right: free (to a point, the USD value of which I won't disclose here). With this, we can provide more redundant services both technically and geographically, we can offer better performance, better features and management, etc. And we have their awesome Fanatical Support.
So far, things have been going pretty well. We've migrated several machines to Rackspace, including:
We're still moving more servers, including:
Many thanks to Rackspace. We owe them greatly.
We've done several overhauls of the way Haskell.org is managed, including security, our underlying service organization, and more.
While we're on the subject, here's an example of what the new Hackage Server will be sporting:
So, Hackage should hopefully be OK for a long time. And, the doc builder is now working again, and should hopefully stay that way too.
Like many other sites, Haskell.org is big, complicated, intimidating, and there are occasionally points where you find a Grue, and it eats you mercilessly.
As a result, automation is an important part of our setup, since it means if one of us is hit by a bus, people can conceivably still understand, maintain and continue to improve Haskell.org in the future. We don't want knowledge of the servers locked up in anyone's head.
In The Past, Long ago in a Galaxy unsurprisingly similar to this one at this very moment, Haskell.org did not really have any automation. At all, not even to create users. Some of Haskell.org still does not have automation. And even still, in fact, some parts of it are still a mystery to all, waiting to be discovered. That's obviously not a good thing.
Today, Haskell.org has two projects dedicated to automation purposes. These are:
We eventually hope to phase out Ansible in favor of Auron. While Auron is still very preliminary, several services have been ported over, and the setup does work on existing providers. Auron also is much more philosophically aligned with our desires for automation, including reproducibility, binary determinism, security features, and more.
In our quest to automate our tiny part of the internet, we've begun naturally writing a bit of code. What's the best thing to do with code? Open source it!
The new organization on GitHub hosts our code, including:
Most of our repositories are hosted on GitHub, and we use our Phabricator for code review and changes between ourselves. (We still accept GitHub pull requests though!) So it's pretty easy to contribute in whatever way you want.
We've very recently begun using CloudFlare for Haskell.org for DNS management, DDoS mitigation, and analytics. After a bit of deliberation, we decided that after moving off Hetzner we'd think about a 3rd party provider, as opposed to running our own servers.
We chose CloudFlare mostly because aside from a nice DNS management interface, and great features like AnyCast, we also get analytics and security features, including immediate SSL delivery. And, of course, we get a nice CDN on top for all HTTP content. The primary benefits from CloudFlare are the security and caching features (in that order, IMO). The DNS interface is still particularly useful however; the nameservers should be redundant, and CloudFlare acts more like a reverse proxy as changes are quick and instant.
But unfortunately while CloudFlare is great, it's only a web content proxy. That means certain endpoints which need things like SSH access can not (yet) be reliably proxied, which is one of the major downfalls. As a result, not all of Haskell.org will be magically DDoS/spam resistant, but a much bigger amount of it will be. But the bigger problem is: we have a lot of non-web content!
In particular, none of our Hackage server downloads, for example, can be proxied: Hackage, like most package repositories, merely uses HTTP as a transport layer for packages. In theory you could use a binary protocol, but HTTP has a number of advantages (like firewalls being nice to it). Using a service like CloudFlare for such content is - at the least - a complete violation of the spirit of their service, and just a step beyond that a total terms-of-service violation (Section 10). But hackage pushes a few TB a month in traffic - so we have to pay line-rate for that, by the bits. And also, Hackage can't have data usefully mirrored to CDN edges - all traffic has to hop through to the Rackspace DCs, meaning users suffer at the hands of latency and slower downloads.
But that's where Fastly came to the rescue. Fastly also recently stepped up to provide Haskell.org with an Open Source Software discount - meaning we get their awesome CDN for free, for custom services! Hooray!
Since Fastly is a dedicated CDN service, you can realistically proxy whatever you want with it, including our package downloads. With the help of a new friend of ours (@davean), we'll be moving Fastly in front of Hackage soon. Hopefully this just means your downloads and responsiveness will get faster, and we'll use less bandwidth. Everyone wins.
Finally, we're rolling out CloudFlare gradually to new servers to test them and make sure they're ready. In particular, we hope to not disturb any automation as a result of the switch (particularly to new SSL certificates), and also, we want to make sure we don't unfairly impact other people, such as Tor users (Tor/CloudFlare have a contentious relationship - lots of nasty traffic comes from Tor endpoints, but so does a ton of legitimate traffic). Let us know if anything goes wrong.
Server monitoring is a crucial part of managing a set of servers, and Haskell.org unfortunately was quite bad at it before. But not anymore! We've done a lot to try and improve things. Before my time, as far as I know, we pretty much only had some lame graphs of server metrics. But we really needed something more than that, because it's impossible to have modern infrastructure on that alone.
Enter Datadog. I played with their product last year, and I casually approached them and asked if they would provide an account for Haskell.org - and they did!
DD provides real-time analytics for servers, along with a lot of custom integrations for services like MySQL, nginx, etc. We can monitor load and networking, and correlate this with things like database or webserver connection count. Events come in from all over Haskell.org. On top of that, DD serves as a real-time dashboard for us to organize and comment on events as they happen.
But metrics aren't all we need. There are two real things we need: metrics (point-in-time data), and resource monitoring (logging, daemon watchdogs, resource checks, etc etc).
This is where Nagios comes in - we have it running and monitoring all our servers for daemons, health checks, endpoint checks for connectivity, and more. Datadog helpfully plugs into Nagios, and reports events (including errors), as well as sending us weekly summaries of Nagios reports. This means we can use the Datadog dashboard as a consolidated piece of infrastructure for metrics and events.
As a result, Haskell.org is being monitored much more closely from here on out, we hope.
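To give a flavor of what an "endpoint check for connectivity" boils down to, here's a rough sketch in Python. It follows the usual Nagios plugin convention of reporting 0 for OK, 1 for WARNING and 2 for CRITICAL, but it's only an illustration: `check_tcp`, its thresholds and its messages are made up for this post, not one of the checks we actually run.

```python
import socket
import time

# Status codes that Nagios-style plugins conventionally report.
OK, WARNING, CRITICAL = 0, 1, 2


def check_tcp(host, port, warn_seconds=0.5, timeout=2.0):
    """Try to open a TCP connection and classify the result.

    Returns (status, message): CRITICAL if the connection fails,
    WARNING if it succeeds but takes longer than warn_seconds, else OK.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as err:
        # Covers refused connections, timeouts and DNS failures alike.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({err})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} answered in {elapsed:.3f}s"
    return OK, f"OK - {host}:{port} answered in {elapsed:.3f}s"
```

Wrapped up as a real plugin, you'd print the message and exit with the status code; the monitoring daemon takes it from there and forwards the event along.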
We've (very recently) also begun rolling out another part of the equation: log management. Log management is essential to tracking down big issues over time, and in the past several years, has become incredibly popular. We have a new ElasticSearch instance, running along with LogStash, which several of our servers now report to (via a forwarding service that is lightweight even on smaller servers). Kibana sits in front on a separate server for query management so we can watch the systems live.
Furthermore, our ElasticSearch deployment is, like the rest of our infrastructure, 100% encrypted - Kibana proxies backend ElasticSearch queries through HTTPS and over . Servers dump messages into LogStash over SSL. I would have liked to use for the LogStash connection as well, but SSL is unfortunately mandatory at this time (perhaps for the best).
We're slowly rolling out over our new machines, and tweaking our LogStash filters so they can get juicy information. Hopefully our log index will become a core tool in the future.
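If you're wondering what a filter getting "juicy information" means in practice: it's mostly turning a flat line of text into named fields you can query. Here's a small Python sketch of that idea for a web access log in the common "combined" layout. It's an analogy for what a log filter does, not our actual LogStash configuration, and the sample line (address, path, user agent) is invented.

```python
import re

# Pattern for the start of a "combined"-style access log line.
ACCESS_LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_line(line):
    """Turn one access-log line into a dict of fields, or None if it doesn't match."""
    match = ACCESS_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size column means no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


# Hypothetical log line, for illustration only.
sample = ('203.0.113.7 - - [10/Nov/2014:13:55:36 +0000] '
          '"GET /packages/index.tar.gz HTTP/1.1" 200 5120 "-" "cabal-install"')
event = parse_access_line(sample)
```

Once every line is a record like `event`, questions such as "which paths returned errors last night?" become index queries instead of grep sessions.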
As I'm sure some of you might be aware, we now have a fancy new site, , that we'll be using to post updates about the infrastructure, maintenance windows, and expected (or unexpected!) downtimes. And again, someone came to help us - gave us this for free!
Rackspace also fully supports their backup agents which provide compressed, deduplicated backups for your servers. Our previous situation on Hetzner was a lot more limited in terms of storage and cost. Our backups are stored privately on Cloud Files - the same infrastructure that hosts our static content.
Of course, backup on Rackspace is only one level of redundancy. That's why we're thinking about trying to roll out soon too. But either way, our setup is far more reliable and robust and a lot of us are sleeping easier (our previous backups were space hungry and becoming difficult to maintain by hand.)
GHC has for a long time had an open infrastructure request: the ability to build patches users submit, and even patches we write, in order to ensure they do not cause machines to regress. Developers don't necessarily have access to every platform (cross compilers, Windows, some obscurely old Linux machine), so having infrastructure here is crucial.
We also needed more stringent code review. I (Austin) review most of the patches, but ideally we want more people reviewing lots of patches, submitting patches, and testing patches. And we really need ways to test all that - I can't be the bottleneck to test a foreign patch on every machine.
At the same time, we've also had a nightly build infrastructure, but it has often hobbled along with custom code running it (bad for maintenance), and the bots are undirected and off to the side - so it's easy to miss build reports from them.
Enter Harbormaster, our Phabricator-powered buildbot for continuous integration and patch submissions!
Harbormaster is a part of Phabricator, and it runs builds on all incoming patches and commits to GHC. How?
This has already led to a rather large change in development for most GHC developers, and Phabricator is building our patches regularly now - yes, even committers use it!
Harbormaster will get more powerful in the future: our build plans will lease more resources, including Windows, Mac, and different varieties of Linux machines, and it will run more general build plans for cross compilers and other things. It's solved a real problem for us, and the latest infrastructure has been relatively reliable. In fact I just get lazy and submit diffs to GHC without testing them - I let the machines do it. Viva la code review!
(See for more on our Phabricator process there's a lot written there for GHC developers.)
That's right, there's now documentation about the Haskell.org infrastructure, hosted on our new wiki. And now you can report bugs to us through our bug tracker. Both of these applications are powered by Phabricator, just like our blog.
In a previous life, Haskell.org used Request Tracker (RT) to do support management. Our old RT instance is still running, but it's filled with garbage old tickets and some spam, it has a PostgreSQL instance all to itself (quite wasteful), and it generally has not seen active use in years. We've decided to phase it out soon, and instead use our Phabricator instance to manage problems, tickets, and discussions. We've already started importing and rewriting content into our wiki and modernizing things.
Hopefully these docs will help keep people up to date about the happenings here.
But also, our Phabricator installation has become an umbrella installation for several Haskell.org projects (even the Haskell.org Committee may try to use it for book-keeping). In addition, we've been taking the time to extend and contribute to Phab where possible to improve the experience for users.
In addition to that, we've also authored several Phab extensions:
Of course, we're not done. That would be silly. Maintaining Haskell.org and providing better services to the community is a real necessity for anything to work at all (and remember: computers are the worst).
We've got a lot further to go. Some sneak peeks...
And, of course, we'd appreciate all the help we can get!
This post was long. This is the ending. You probably won't read it. But we're done now! And I think that's all the time we have for today.