Hackage GSoC: Beginnings of an Infrastructure for Modular Features

Hi everyone. The process of transforming the Hackage codebase is ongoing, and I’ve made some sweeping changes and introduced a few regressions. (I’d more optimistically call them “todo list items”.) I’ve spent 3/4 of the last week working on the code base, and 1/4 reading books and websites and working on the site design. Site design here refers to the interface that everyone who uses Hackage sees, and I’ve posted a preliminary draft of the proposed URIs.

The goal is to make a REST API which can be read and manipulated by automated clients, and of course perused by web browsers just like the current Hackage. One of the purposes of REST (Representational State Transfer) is to simplify the state shared between client and server by manipulating representations of resources using plain old HTTP.

Those Haskellers who are familiar with REST might point out that documenting an API and setting up URI conventions (like I did on the Hackage wiki) are partly antithetical to the goals of REST, which eschew servers that only understand highly specific remote procedure calls and clients which construct URIs based on hard-coded conventions (coupling). Roy Fielding, the inventor of REST, stresses that REST APIs must be hypertext-driven. Don’t worry: my intention is to make all of the URIs fully discoverable from the server root, whether browsing the HTML representation or, say, a JSON version. The URI page is an aid in design, not documentation. Since there tends to be a one-to-one mapping between each URI/method pair I’ve listed and each feature I’d like to implement, it tells me what I have left to do.

Fine-tuning data structures

To this end, I’ve made and committed some changes to hackage-server. Some of the time was spent adjusting a few important types, and the rest dealing with the subsequent code breakages. It is still better and safer than doing similar things in a dynamically typed programming language, where I’d end up either sweeping the entire code base or analyzing call graphs manually to determine what broke. Here’s an example of a type I altered, the PkgInfo type, which holds information about a specific package version:

data PkgInfo = PkgInfo {
    -- | The name and version represented here
    pkgInfoId :: !PackageIdentifier,
    -- | Parsed information from the cabal file.
    pkgDesc   :: !GenericPackageDescription,
    -- | The current .cabal file text.
    pkgData   :: !ByteString,
    -- | The actual package .tar.gz file, where BlobId
    -- is a filename in the state/blobs/ folder.
    -- The head of the list is the current package tarball.
    pkgTarball :: ![(BlobId, UploadInfo)],
    -- | Previous .cabal file texts, and when they were uploaded.
    pkgDataOld :: ![(ByteString, UploadInfo)],
    -- | When the package was created with the .cabal file.
    pkgUploadData :: !UploadInfo
} deriving (Typeable, Show)
type UploadInfo = (UTCTime, UserId)

The global Hackage state defines a mapping from PackageName to [PkgInfo]. Subtle differences in which types of values are allowed to inhabit PkgInfo have important consequences for package uploading policy. There are a few notable results of this definition.

  1. A package can exist without a tarball. This is more significant for importing data to create secondary Hackages than the normal upload process. The more incrementally importing can happen, the simpler it will be. Alternatively, this would allow for a metadata-only Hackage mirror.
  2. Cabal files can be updated, with a complete history, without having to change the version number. This would allow maintainers to expand version brackets or compiler flags, so long as the changes don’t break anything (constricting version brackets is more dangerous).
  3. Tarballs can be updated, also with a complete history, without having to change the version number. This probably won’t be enabled on the main Hackage, but exceptions can be granted by admins. If an ultra-unstable Hackage mirror came about, as opposed to the somewhat-unstable model we have now, this might be allowed.
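
To make these three results concrete, here is a minimal sketch using simplified stand-ins for the real types. The names PkgSketch, currentTarball, reviseCabal and addTarball are hypothetical and not part of hackage-server; timestamps, user ids and blob ids are plain Ints and Strings so the example needs nothing beyond base.

```haskell
import Data.Maybe (listToMaybe)

-- Stand-in for (UTCTime, UserId); an assumption made to keep the sketch small.
type UploadInfo = (Int, Int)

-- A cut-down, hypothetical version of PkgInfo.
data PkgSketch = PkgSketch
  { cabalText    :: String                  -- current .cabal file text
  , cabalTextOld :: [(String, UploadInfo)]  -- previous texts, newest first
  , tarballs     :: [(String, UploadInfo)]  -- blob ids, newest first
  , uploadInfo   :: UploadInfo              -- upload info for the current text
  } deriving (Show, Eq)

-- Result 1: an empty list simply means there is no tarball yet.
currentTarball :: PkgSketch -> Maybe String
currentTarball = fmap fst . listToMaybe . tarballs

-- Result 2: a new .cabal text pushes the old one onto the history.
reviseCabal :: String -> UploadInfo -> PkgSketch -> PkgSketch
reviseCabal newText info pkg = pkg
  { cabalText    = newText
  , cabalTextOld = (cabalText pkg, uploadInfo pkg) : cabalTextOld pkg
  , uploadInfo   = info
  }

-- Result 3: a new tarball becomes the head; older ones are kept.
addTarball :: String -> UploadInfo -> PkgSketch -> PkgSketch
addTarball blob info pkg = pkg { tarballs = (blob, info) : tarballs pkg }

main :: IO ()
main = do
  let pkg0 = PkgSketch "name: foo" [] [] (0, 1)
      pkg1 = reviseCabal "name: foo (wider bounds)" (10, 1) pkg0
      pkg2 = addTarball "blob-abc" (20, 1) pkg1
  print (currentTarball pkg0)
  print (cabalTextOld pkg1)
  print (currentTarball pkg2)
```

Whether a particular server lets ordinary users call something like reviseCabal or addTarball is then purely a policy decision, since the type permits both.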

Modular RESTful features

The HackageFeature data structure is intended to encapsulate the behavior of a feature and its state. Features include the core feature set—the minimal functionality that a server must have to be considered a Hackage server, which is serving package tarballs and cabal files—supplemented by user accounts, package pages, reverse dependencies, Linux distro integration, and so on.

The most important field of a feature is the locations :: [(BranchPath, ServerResponse)]. The BranchPath is the generic form of a URI, a list of BranchComponents. Taking inspiration from Ruby on Rails routing, you can construct one with the syntax "/package/:package/reports/:id", where visiting http://hackage.haskell.org/package/HDBC/reports/4/ will pass [("package", "HDBC"), ("id", "4")] to the code serving build reports. You can define arbitrary ServerPart Responses at a path, or you can use a Resource abstraction which lets you specify different HTTP methods (GET, POST, PUT, and DELETE). This system is still in development.
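
The path matching described above could look roughly like the following. This is only a sketch of the idea, not the hackage-server implementation: the constructors StaticBranch and DynamicBranch and the functions parseBranchPath and matchPath are names I am assuming for illustration.

```haskell
-- Hypothetical constructors; the real BranchComponent may look different.
data BranchComponent
  = StaticBranch String   -- literal segment, e.g. "package"
  | DynamicBranch String  -- named segment, written ":package" in the syntax
  deriving (Show, Eq)

type BranchPath = [BranchComponent]

-- Split on '/', dropping the empty segments that leading, trailing
-- or doubled slashes produce.
splitPath :: String -> [String]
splitPath s = case break (== '/') s of
  (seg, rest) ->
    let others = case rest of
          []         -> []
          (_ : more) -> splitPath more
    in if null seg then others else seg : others

-- Turn "/package/:package/reports/:id" into a BranchPath.
parseBranchPath :: String -> BranchPath
parseBranchPath = map toComponent . splitPath
  where
    toComponent (':' : name) = DynamicBranch name
    toComponent literal      = StaticBranch literal

-- Match a request path against a BranchPath, collecting the bindings
-- for the dynamic segments in order.
matchPath :: BranchPath -> String -> Maybe [(String, String)]
matchPath branch uriPath = go branch (splitPath uriPath)
  where
    go [] [] = Just []
    go (StaticBranch lit : bs) (seg : segs)
      | lit == seg = go bs segs
    go (DynamicBranch name : bs) (seg : segs) =
      ((name, seg) :) <$> go bs segs
    go _ _ = Nothing

main :: IO ()
main = do
  let reports = parseBranchPath "/package/:package/reports/:id"
  print (matchPath reports "/package/HDBC/reports/4/")
  print (matchPath reports "/users/HDBC/reports/4/")
```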

HTTP goodies

Because each resource defines its method set upfront, it’s possible to make an HTTP OPTIONS method for each one. This is an example of something you get “for free” by structuring resources in certain ways. As I’ve discovered, there can be an unfortunate trade-off: requiring too much structure makes it unpleasant to extend Hackage with new functionality (having to deal with all of the guts of the server). Too little structure means that those implementing new features can accidentally break the site’s design principles and generally cause havoc. A reasonable middle ground is the convention over configuration approach: I’d have plenty of configurable structure internally, and combinators which build on that structure by filling in recommended conventions. This applies particularly to getting the most out of HTTP.
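
Here is a small sketch of the "for free" OPTIONS idea. It assumes a Resource is nothing more than a list of methods with handlers, and a response is a (status, headers, body) triple; both are simplifications I made up for the example rather than the real types.

```haskell
import Data.List (intercalate)

data Method = GET | POST | PUT | DELETE | OPTIONS
  deriving (Show, Eq)

-- Hypothetical stand-in: a handler is just the response body it returns.
type Handler = String

-- A resource declares its whole method set up front.
newtype Resource = Resource { resourceMethods :: [(Method, Handler)] }

-- Simplified response: (status code, headers, body).
type Response = (Int, [(String, String)], String)

-- OPTIONS is always available because it is derived, not hand-written.
allowedMethods :: Resource -> [Method]
allowedMethods res = map fst (resourceMethods res) ++ [OPTIONS]

allowHeader :: Resource -> String
allowHeader = intercalate ", " . map show . allowedMethods

serve :: Resource -> Method -> Response
serve res OPTIONS = (200, [("Allow", allowHeader res)], "")
serve res method = case lookup method (resourceMethods res) of
  Just body -> (200, [], body)
  -- Unsupported methods get 405 together with the same Allow header.
  Nothing   -> (405, [("Allow", allowHeader res)], "")

main :: IO ()
main = do
  let reports = Resource [(GET, "list of reports"), (POST, "report created")]
  print (serve reports OPTIONS)
  print (serve reports DELETE)
```

The same declared method list also yields a correct 405 Method Not Allowed response without any extra work from whoever writes the feature.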

The idea of content negotiation in HTTP is simple, although there’s no clear path ahead for implementing it yet. For Hackage, content negotiation consists of responding to preferences in the client’s Accept header, which contains MIME types with various priorities. (Other sorts of negotiation include those for languages and encoding.) A web browser like Firefox might send
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, and a hypothetical newer cabal-install client would send application/json for miscellaneous data it needs to read.
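
Since there is no implementation yet, the following is only one possible sketch of the negotiation step, written with base alone. The function names (parseAccept, negotiate and friends) are hypothetical. It handles q-values and the wildcards */* and type/*, prefers the most specific matching range, and ignores other media type parameters. An empty header yields no match here, which a real server would probably treat as "anything goes" instead.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf, maximumBy)
import Data.Maybe (mapMaybe)
import Data.Ord (comparing)

trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest

-- Read "q=0.9"; any other parameter is ignored.
qValue :: String -> Maybe Double
qValue ('q' : '=' : v) = case reads v of
  [(q, "")] -> Just q
  _         -> Nothing
qValue _ = Nothing

-- One Accept item, e.g. "application/xml;q=0.9"; q defaults to 1.
parseItem :: String -> Maybe (String, Double)
parseItem item = case map trim (splitOn ';' item) of
  (mediaType : params)
    | not (null mediaType) ->
        Just (mediaType, case mapMaybe qValue params of
                           (q : _) -> q
                           []      -> 1.0)
  _ -> Nothing

parseAccept :: String -> [(String, Double)]
parseAccept = mapMaybe parseItem . splitOn ','

-- How specifically a media range matches an offered type, if at all.
specificity :: String -> String -> Maybe Int
specificity range offered
  | range == offered = Just 2
  | range == "*/*"   = Just 0
  | otherwise = case break (== '/') range of
      (ty, "/*") | (ty ++ "/") `isPrefixOf` offered -> Just 1
      _ -> Nothing

-- Quality of an offered type: taken from the most specific matching range.
qualityFor :: [(String, Double)] -> String -> Double
qualityFor prefs offered =
  case [ (s, q) | (range, q) <- prefs, Just s <- [specificity range offered] ] of
    []    -> 0
    found -> snd (maximumBy (comparing fst) found)

-- Pick the offered type the client likes best; earlier offers win ties.
negotiate :: [String] -> String -> Maybe String
negotiate offered header =
  case [ (q, o) | o <- offered, let q = qualityFor prefs o, q > 0 ] of
    []         -> Nothing
    candidates -> Just (snd (foldr1 pick candidates))
  where
    prefs = parseAccept header
    pick a b = if fst a >= fst b then a else b

main :: IO ()
main = do
  let offered = ["application/json", "text/html"]
  print (negotiate offered "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
  print (negotiate offered "application/json")
```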

Authentication functionality is essentially done, although I had originally planned to work on it later this month. The types might still need some tweaking, of course. There is a system of access control lists called UserLists, each of which is a Data.IntSet of UserIds. With this, we can have an extensible hierarchy of user groups, such as site administrators, per-package maintainers, and trustees (allowed to manipulate all packages without possessing other admin functionality). The type signature for the main authentication function is:

requireHackageAuth :: MonadIO m => Users -> Maybe UserList
                        -> Maybe AuthType -> ServerPartT m UserId

Users is the type containing the site’s entire user database. AuthType, either BasicAuth or DigestAuth, can be passed in to force the type of authentication: either basic or digest. Since all passwords are currently hashed in crypt form, and digest authentication cannot be checked against a crypt hash, only basic authentication is usable in practice for now. This function either returns the UserId, assuming authentication succeeded, or forces a 401 Unauthorized or 403 Forbidden. With this, we can easily extend it to handle specific tasks:

requirePackageAuth :: (MonadIO m, Package pkg) => pkg
                                       -> ServerPartT m UserId
requirePackageAuth pkg = do
    userDb <- query $ GetUserDb
    pkgm   <- query $ GetPackageMaintainers
                            (packageName pkg)
    trust  <- query $ GetHackageTrustees
    let groupSum = Groups.unions [trust, fromMaybe Groups.empty pkgm]
    requireHackageAuth userDb (Just groupSum) Nothing
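
The decision between 401 and 403 can be shown as a pure function. The sketch below is not the real requireHackageAuth: checkAuth, AuthResult and the association-list user database are hypothetical, and passwords are compared in plain text only to keep the example short (the real server stores crypt hashes). The group handling does use Data.IntSet, as UserLists do.

```haskell
import qualified Data.IntSet as IntSet

type UserId   = Int
type UserList = IntSet.IntSet

data AuthResult
  = Authorized UserId
  | Unauthorized  -- 401: credentials missing or wrong
  | Forbidden     -- 403: credentials fine, but not in the required group
  deriving (Show, Eq)

-- Hypothetical user database: name -> (id, password).
type Users = [(String, (UserId, String))]

checkAuth :: Users -> Maybe UserList -> Maybe (String, String) -> AuthResult
checkAuth _ _ Nothing = Unauthorized
checkAuth users group (Just (name, password)) =
  case lookup name users of
    Just (uid, stored)
      | stored == password -> case group of
          Nothing -> Authorized uid  -- any valid user will do
          Just members
            | uid `IntSet.member` members -> Authorized uid
            | otherwise                   -> Forbidden
    _ -> Unauthorized

main :: IO ()
main = do
  let users       = [("alice", (1, "secret")), ("bob", (2, "hunter2"))]
      trustees    = IntSet.fromList [1]
      maintainers = IntSet.fromList [3]
      -- Same idea as groupSum above: anyone in either group may act.
      allowed     = IntSet.unions [trustees, maintainers]
  print (checkAuth users (Just allowed) (Just ("alice", "secret")))
  print (checkAuth users (Just allowed) (Just ("bob", "hunter2")))
  print (checkAuth users (Just allowed) (Just ("bob", "wrong")))
```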

Import/export

To paraphrase Don’s comment on my previous post, we absolutely can’t afford to lose any data. Although the state/db/ directory contains all of happstack-state’s MACID data, and can be periodically backed up off-site, binary data is by nature easy to mess up and hard to recover. A bit of redundancy in storage is a reasonable safeguard, and there’s little more redundant than English, at least compared to bit-packing.

Antoine had implemented an extensive Hackage-to-CSV export/import system, where instead of e.g. having a single bit represent whether an account is enabled or disabled, we use the words “enabled” and “disabled”, and put the resulting export tarball in a safe place. Instead of having one centralized system, each HackageFeature should take care of its own data, and so I’d like to work on decentralizing the system in the days ahead. The type signatures, suggested by Duncan, are:

data HackageFeature = HackageFeature {
    ...
    dumpBackup     :: IO [BackupEntry],
    restoreBackup  :: [BackupEntry] -> IO (),
    ...
}
type BackupEntry = ([FilePath], ByteString)
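
As an illustration of a feature looking after its own data, here is a sketch of what a user-accounts feature might dump and restore. Everything in it is an assumption for the example: the Account type, the users/accounts.csv path, and the use of String where the real BackupEntry carries a ByteString. The functions are pure so the round trip is easy to check; the real fields live in IO. Names containing commas would need proper CSV quoting, which is left out.

```haskell
-- String stands in for ByteString to keep the sketch base-only.
type BackupEntry = ([FilePath], String)

-- Hypothetical account record.
data Account = Account
  { accountName    :: String
  , accountEnabled :: Bool
  } deriving (Show, Eq)

accountsPath :: [FilePath]
accountsPath = ["users", "accounts.csv"]

-- Spell the status out as a word rather than packing it into a bit.
dumpAccounts :: [Account] -> [BackupEntry]
dumpAccounts accounts = [(accountsPath, unlines (map row accounts))]
  where
    row a = accountName a ++ ","
         ++ (if accountEnabled a then "enabled" else "disabled")

-- Restoring can fail, so report which row was not understood.
restoreAccounts :: [BackupEntry] -> Either String [Account]
restoreAccounts entries = case lookup accountsPath entries of
  Nothing   -> Left "missing users/accounts.csv"
  Just text -> mapM parseRow (lines text)
  where
    parseRow line = case break (== ',') line of
      (name, ",enabled")  -> Right (Account name True)
      (name, ",disabled") -> Right (Account name False)
      _                   -> Left ("unrecognized row: " ++ line)

main :: IO ()
main = do
  let accounts = [Account "alice" True, Account "bob" False]
      backup   = dumpAccounts accounts
  mapM_ (putStr . snd) backup
  print (restoreAccounts backup == Right accounts)
```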

Bringing features together

There are notable tasks remaining for the basic infrastructure, such as implementing this import/export system. Another major one is creating a hook system with the usual dual nature of one part that responds to actions (like uploading pages) and another which edits documents on the fly (like adding sections to a package page). If you have experience with website plugin systems, what are your thoughts on getting this done in a strongly and safely typed manner?

Having taken a brief tour of the internal server proto-design and the types of functionality that can be implemented with it, I’d like to show how we can leverage these to implement some useful features, some this summer if we as a community approve of them:

  • Build reports, to see if a given package builds on your OS (might save time for unfortunately oft-neglected Windows users), on your architecture, with your set of dependencies. I would strongly encourage all of you to at least submit anonymized build reports once the feature goes live (check out the client-side implementation), if not the full build log, although I promise we won’t stoop to a “Do you want to submit a build report?” query every single time a build fails: maybe only just the first time :) Submitting or not is more of a configuration option. Build reports will probably be anonymized for the public interface, but available in full to package maintainers through the requireHackageAuth authentication mechanism.
  • Reverse dependencies. This is a HackageFeature that doesn’t need to define any of its own persistent data, just its own compact index of dependencies that subscribes to a package uploading hook. You can peruse Roel’s revdeps Hackage, and if you feel like setting up hackage-scripts with Apache, you can apply his patch to run your own.
  • Hackage mirrors. It should be simple to write a mirroring client that polls hackage.haskell.org’s recent changes, retrieves the new tarballs, and HTTP PUTs them to a mirror with proper authentication.
  • Candidate package uploads: improved package checking. This would allow you to create a temporary package resource, perhaps available at /package/:package/candidate/, to augment the current checking system. Currently, checking gives you a preview of the package page with any Cabal-generated warnings. Here, you could set up a package on the Hackage website that’s not included on the package list or index tarball. It would employ its own mapping from PackageName to PkgInfo. You can make a candidate package go live at any time, even allowing others to install your candidate package before then. This is a slightly different idea from the ultra-unstable zoo of packages I mentioned with PkgInfo, but has similar quality assurance goals.
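
To give a flavor of the reverse dependencies item, here is a sketch of an index that is updated once per upload. The hook itself does not exist yet, so onUpload is a hypothetical callback, and the index ignores version ranges and never removes entries when a later version drops a dependency.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type PackageName = String

-- For each package, the set of packages that depend on it.
type RevDeps = Map.Map PackageName (Set.Set PackageName)

-- What a (hypothetical) upload hook would call: record the uploaded
-- package under every one of its direct dependencies.
onUpload :: PackageName -> [PackageName] -> RevDeps -> RevDeps
onUpload pkg deps index =
  foldr (\dep -> Map.insertWith Set.union dep (Set.singleton pkg)) index deps

-- Direct reverse dependencies, in ascending order.
reverseDeps :: PackageName -> RevDeps -> [PackageName]
reverseDeps pkg = maybe [] Set.toAscList . Map.lookup pkg

main :: IO ()
main = do
  let index = onUpload "hdbc-sqlite3" ["base", "HDBC"]
            $ onUpload "HDBC" ["base"] Map.empty
  print (reverseDeps "base" index)
  print (reverseDeps "HDBC" index)
```

Because the index can be rebuilt from the package list at startup, the feature needs no persistent state of its own.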

Thanks for reading what I’ve been up to. Critique is welcomed.

June 7, 2010. Uncategorized.

4 Comments

  1. Yitz replied:

    While there are obviously many details here that need to be thought through carefully, and I admittedly haven’t done that yet, I would like to (try to) be the first to congratulate you on some fantastic ideas. I hope you will be successful in carrying most or all of them to completion. Even just this design work is already a massive contribution to the Haskell community. Thank you for doing this!

  2. Tom Lokhorst replied:

    This looks great! I’m a big fan of REST, so I’m happy.

    Have you thought about a push interface on the package uploads (and maybe other meta data)? For example using pubsubhubbub. This would allow other services to subscribe to a hub and get updates when a new package is uploaded, instead of having to pull the Hackage server.

    It would be useful for things like the @Hackage bot on twitter, mirrors and other extensions to Hackage.

  3. Robert Massaioli replied:

    I read through everything and, surprisingly, I did not find any plan of action in there that I disagreed with, it was well reasoned. So I will just suffice to say keep up the good work and that I second Yitz’s response, I think Haskell will be that little bit *understatement* better for your efforts.

  4. Bas van Dijk replied:

    I’m looking forward to seeing this implemented!

    Some ideas:

    – It might be nice to use web-routes to handle your URL routing. See:
    http://hackage.haskell.org/package/web-routes
    http://hackage.haskell.org/package/web-routes-happstack

    – What I miss in the current hackage is the ability to quickly browse the files of a package without downloading and manually uncompressing the .tar.gz.
