Hackage GSoC status for the 3/7ths mark

A week and a half ago I talked about how the new Hackage server is internally structured. What I’d like to communicate now, to commemorate the Google Summer of Code coding period being exactly 3/7ths of the way through (36 days down, 48 to go), is more of a status report than anything else. (Granted, 3/7ths marks would be more significant if we used base 7 numbers.) I spent last week on a feature-implementing spree. With an emphasis on “spree”, and none on doing-anything-else-or-even-leaving-the-house. So I spent the weekend regenerating, and I think it’s time I reflected on how I’ve been doing.

The original schedule

When I applied to work on Hackage for the Summer of Code, I included a tentative schedule. I have not strictly followed it so far, though I didn’t quite expect to. Here’s why.

1. 2 weeks. Become familiar with codebase and add documentation to declarations as I understand them. Find functionality not in the old server and not covered by the coming weeks and fully port it. Do the same for items in the hackage-server TODO list.

I didn’t anticipate all of the restructuring that needed to be done, thinking I could mostly append rather than modify. Well, I have substantially altered the already-great codebase into a modular form I’m pretty happy with, but doing so takes a long time when you’re starting with a 10,000-line codebase developed over 2 years (now it’s around 12,000 lines). The old server is mostly fully ported, although it wasn’t done within the space of these two weeks.

2. 1.5 weeks. Get build reports to display and gather useful information: already partially implemented. Use this feature as an opportunity to become even more comfortable refactoring and enhancing the hackage-server source.

Non-anonymous build reports are essentially complete. Anonymous ones are a bundle of privacy pitfalls, so we’ll have them as a separate feature, using a variant on the data structure currently used to house per-package reports. The idea is to publish them to everyone but do so in a way that mostly eliminates identification or cross-referencing. More on this below.

3. 1.5 weeks. Get user accounts and settings working, writing a system for web forms, both the dynamic JavaScript kind and static kind. Use this system to get package configurating [sic] settings editable by both package maintainers and Hackage administrators.

I’ve written precious little HTML and no JavaScript, instead using curl to prod the server and setting up an Arch VM to ensure compliance with the current (soon to be old?) cabal-install. User accounts, digest authentication, and user groups — essentially access control lists — are all here. Most of this information is served in text/plain at the moment. Given that the new server will probably require a redesign by more design-minded Haskellers, I’d rather keep everything minimalistic for the time being. As I mentioned last post, I think the server architecture has a good separation of model and view.

4. 1 week. If a viable solution for changelogs comes up by this point, I’ll implement it here. This might be as simple as a ./changelog file with a simple prescribed format.

That’s this week! At least 100 packages on Hackage already have changelogs. Of those, about two dozen are named changelog.md (they use markdown fix/feature structure, which git uses). The rest use whatever format the author chose, and these formats are all over the place. Some use darcs changes output, which is too fine-grained for Hackage. All this is too non-uniform for an automatic uniform interface. One approach that I can probably code up in a day or two is to have a changelog editable on Hackage. It could be entered on upload and possibly edited afterwards by maintainers. Otherwise, I’ll leave this one until “a viable solution for changelogs comes up”.

What’s been done

All of the features I listed in the last blog post have been implemented, although not all of them are exposed through HTML. Brief descriptions of them are there. The most interesting one, which is also proving to be the most challenging, is the candidate packages feature, an enhanced version of the check-pkg CGI script. Here’s what you can do with it.

  • /packages/candidates/: see the list of candidate packages. POST here to upload a candidate package; candidates for existing packages can only be added by maintainers.
  • /package/{package}/candidate: see a preview package page for a candidate tarball, with any warnings or errors that would prevent putting it in the main index
  • /package/{package}/candidate/publish: POST here to put the package in the main index. It has to be a later version than the latest one currently existing under the name, and only maintainers can do this. If no package exists under the name, these restrictions don’t apply.
  • /package/{package}/candidate/{cabal}: get the cabal file for this package
  • /package/{package}/candidate/{tarball}: get the tarball

In the immediate future

I’d like to get the newer server ready for running on sparky by the end of the week. It doesn’t yet look very different from the current Hackage in terms of what web browsers can access.

Currently there are four ways to start up the server. The first is to initialize it on a blank slate and go from there with hackage-server --initialise. Second, you can start it normally with an existing dataset stored by happstack-state, just hackage-server. Otherwise, you can import from an existing source. You can import mostly everything from the old Hackage server, as I described in my first post. Alternatively, you can initialize it from a single backup tarball produced by the server.

I’d like to revamp the interface to make it easier to deploy. Instead of importing directly from old sources, there’s going to be an auxiliary mode to convert legacy data into a newer backup tarball. Then, the new tarball can be imported directly. I haven’t had any backup tarballs on hand to test the newer import/export system, though it compiles. This is next on the todo list.

Some features that I’d like to get done soon are uploading documentation and implementing deprecation. Deprecated packages might still be needed as dependencies, so they’re kept around and will probably go in the index tarball, but they won’t be highly visible on any of the HTML pages. Currently, documentation will be implemented by uploading tarballs. This is compatible with the current solution, which is to have a dedicated build client. It would be easier to have users upload their own docs, and not have to deal with the build client not being able to do so. This would be simple if .haddock files provided everything necessary for generating HTML docs and linking them with hscolour pages, but I’m not sure if this is the case. Holding onto .haddock files also makes documentation statistics a lot easier. For now, documentation tarball upload is the route I’m taking.

Another nice feature would be serving directly from package tarballs, preferably without having to store them in memory or unpack them on the server filesystem. Like the documentation feature, it would use a data structure defined in the hackage-server source: a TarIndexMap. Given a file path, it can efficiently give you the byte offset of the tar entry where that file is stored, and from that retrieve the file directly. There are some downsides here. First, package tarballs are stored not as .tar but as .tar.gz, so serving from uncompressed copies might more-than-double the amount of storage required (which unpacking them would do anyway). Second, the TarIndexMap of every single package tarball would be kept in memory, although this uses an efficient trie structure, so it’s not so bad.
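To make the idea concrete, here is a rough, simplified sketch of such an index, with a plain Map standing in for the trie the real TarIndexMap uses. Each tar entry is a 512-byte header followed by its data padded up to a 512-byte boundary, so data offsets can be computed from entry sizes alone:

```haskell
import qualified Data.Map as Map

-- Simplified sketch of a TarIndexMap: path -> byte offset of the
-- entry's data within an (uncompressed) tarball. The real
-- hackage-server structure uses an efficient trie, not a Map.
type TarIndexMap = Map.Map FilePath Integer

-- Each tar entry is a 512-byte header followed by the file data,
-- padded up to the next 512-byte boundary.
buildIndex :: [(FilePath, Integer)] -> TarIndexMap
buildIndex = Map.fromList . go 0
  where
    go _ [] = []
    go off ((path, size) : rest) =
      (path, off + 512) : go (off + 512 + pad size) rest
    pad n = ((n + 511) `div` 512) * 512

fileOffset :: FilePath -> TarIndexMap -> Maybe Integer
fileOffset = Map.lookup
```

With the offset in hand, serving a file is a seek and a bounded read on the tarball.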

There are also some internal server design challenges, which I’ll describe in the next two paragraphs; skip them if you like. One of them is making URI generation less clunky. Every resource provides enough information to generate its canonical URI given an association list of string pairs. However, this requires passing around the resource itself, which also contains the server function and other things. I’m considering making a global map that, given the string name of a resource, gives a URI-generating function, which means either passing this mapping to every single server function or setting up a ReaderT monad around Happstack’s ServerPartT. The other issue is that a URI is not guaranteed; it’s wrapped in a Maybe, since this system doesn’t provide the type safety guarantees of libraries like web-routes: it’s ‘stringly typed’.
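To illustrate the ‘stringly typed’ part, here’s a minimal pure sketch of what such URI generation could look like (the function names are hypothetical, not the actual hackage-server API). The Maybe accounts for a variable that’s missing from the association list:

```haskell
import Data.List (intercalate)

-- Hypothetical sketch of Maybe-wrapped, 'stringly typed' URI
-- generation: fill the :var components of a URI pattern from an
-- association list of string pairs.
renderURI :: String -> [(String, String)] -> Maybe String
renderURI pat vars =
    fmap (('/' :) . intercalate "/") (mapM fill comps)
  where
    comps = filter (not . null) (splitOn '/' pat)
    fill (':' : name) = lookup name vars  -- Nothing if the var is absent
    fill static       = Just static

splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest
```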

In addition, user groups are currently totally decentralized, but perhaps they could use some more coordination. The MediaWiki system of having a global mapping of which groups are allowed to perform which actions is pretty good, though in typical PHP manner, it uses strings to do this. It might be better for each type of group to list the permissions it grants, rather than having this check live in code itself, but again this might require passing the mapping to every single server function.

Memory and performance

I’ve done some rudimentary statistics-gathering, but much more will need to be done soon.

For instance, importing from the old Hackage server causes the memory used by the server to reach around 700MB and stay there (any memory allocated by GHC always stays there), and this is only for the current tarball versions. However, this is only needed for initialization; as I mentioned, I plan on making a separate mode for legacy import.

By contrast, starting up the server with the current set of package versions occupies 390 MB of memory, although only 148 MB is used by the RTS at any given time. When initializing the server in this mode, 40% of the CPU time is used on garbage collection, but things seem reasonably stable afterwards. The directory storage with the current tarball versions occupies 130 MB disk space, and the happstack-state database is just 17 MB. This database is pretty small comparatively, likely because it doesn’t include the parsed PackageDescription data structure, which contains lots of fields and lots of strings.

In general, I suspect I’ll need some modifications to ensure that GHC isn’t too heap-hungry. Heap profiling has proven suspect thus far, since apparently the server has a special affinity for ghc-prim:GHC.Types.:, and if I’m reading it right, over 90% of the server’s memory is used on cons cells, which I find somewhat hard to believe. On the other hand, maybe there are that many Strings and [Dependency]s. I think later on I’ll be asking the advice of some more senior Haskell hackers on keeping memory usage down, even if one of the selling points of Happstack is that all data’s in memory. (Not entirely true here: the blob storage is used for package tarballs.)

In the eventual future

Build reports are a must-do, and at present authenticated clients can submit build reports and build logs. Anonymous reports are tricky (but still immensely useful), and I know many of you guys wouldn’t submit reports without them. Statistics need to be done as well: how do we take a large number of reports like this one:

package: cabal-install-0.6.2
os: linux
arch: i386
compiler: ghc-6.10.4
client: cabal-install-0.6.0
flags: -bytestring-in-base -old-base
dependencies: Cabal-1.6.0.3 HTTP-4000.0.8 [...]
install-outcome: DependencyFailed zlib-0.5.2.0
docs-outcome: NotTried
tests-outcome: NotTried

and tell you something useful about them. Perhaps it could tell you that the above report is not recent.
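Pulling the reports apart is at least the easy half. A minimal sketch of a parser for the key/value format above (the real parser, shared with cabal-install, is more thorough about types and validation):

```haskell
-- Sketch of parsing the key/value build-report format shown above
-- into an association list; malformed lines are skipped.
parseReport :: String -> [(String, String)]
parseReport = concatMap parseLine . lines
  where
    parseLine l = case break (== ':') l of
      (key, ':' : value) -> [(trim key, trim value)]
      _                  -> []   -- no colon: skip the line
    trim = dropWhile (== ' ') . reverse . dropWhile (== ' ') . reverse
```

From there, aggregation is a matter of grouping on fields like os, compiler, and install-outcome.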

Also, a solution for systematic client-side and server-side caching of HTML hasn’t come up yet, if this is in the cards at all. Making an ETag-generating function is not a simple matter, particularly when multiple representations of the same resource are served in multiple formats at multiple URIs (sadly, I can’t rely solely on the Accept header, because browser implementers seemingly read RFCs highlighted with black markers).

Finally, there’s no clear procedure for migrating data, and I’m still not fully familiar with Happstack state’s data versioning system. Apparently both data types need to exist at the same time, and then the old one can be discarded. I could probably write a startup mode for this.

The most eventual of future elements is more shiny features. This future will extend beyond this summer, so while some individual features might deserve Summer of Code projects in their own right, I’ll try to knock out as many of the others as possible. Let the other 4/7ths begin!

June 29, 2010. Uncategorized. 1 comment.

Hackage GSoC: Feature graphs and URI trees

Hello Haskellers! I’ve made lengthy strides in the internal structure of the new Hackage server. Amidst implementing features, I’ve also implemented a reasonably solid top-level organization for them. I’ll describe some of the technical details of the structure here.

Feature graphs

Each feature has a listing of the URIs it provides, the user groups it needs to authenticate, and the data it needs to store with methods to back up and restore that data. A feature might also define caches for its pages, IO hooks to execute on certain events (like uploading a package), and pretty much anything else: features are arbitrary datatypes that implement a HackageFeature typeclass. If feature A depends on feature B, then feature A can extend B’s URIs with new formats and HTTP methods, use B’s data and user groups, and register for any of B’s hooks.

The barebones features are:

  • core: the central functionality set for something to reasonably be called a Hackage server. This serves tarballs, cabal files, and basic listings. The data it maintains are the user database and a map from PackageName to [PkgInfo] (see previous post). It is possible to create a core-only server with an archive.tar, but it’s effectively immutable after initialization.
  • mirror: this allows tarballs to be uploaded directly by special clients, and it is intended for use by secondary Hackages (if any) which need to stay up to date without having to support a userbase. This doesn’t use its own data, instead manipulating the core’s.

Now, take a look at the packages, upload, check, users, distros, and build features. Some of them depend on each other. They all depend on core. html depends on all of them. One way to look at the organization is that they provide the model and controller for data, and html provides a view. They are interfaces which provide their own data in a way that html/json/xml/yaml/whichever other features can render in their particular format with a minimal amount of effort.

For example, the packages feature doesn’t define any of its own URIs, but has a function, PackageId -> IO (Maybe PackageRender), which the HTML package page calls. The PackageRender type is essentially the One True Resource Representation of a package, and it looks like this:

data PackageRender = PackageRender {
    -- using the most recently uploaded package as of now
    rendPkgId :: PackageIdentifier,
      -- Vec-0.9.8
    rendAllVersions :: [Version],
      -- [0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4,
      --  0.9.5, 0.9.6, 0.9.7, 0.9.8]
    rendDepends :: [[Dependency]],
      -- [[array, base (≤5), ghc-prim, QuickCheck (2.*)]]
    rendExecNames :: [String],
      -- [] (no executables)
    rendLicenseName :: String,
      -- BSD 3
    rendMaintainer :: Maybe String,
      -- Just "Scott"
    rendCategory :: [String],
      -- ["Data", "Math"]
    rendRepoHeads :: [(RepoType, String, SourceRepo)],
      -- [] (no repository)
    rendModules :: Maybe ModuleForest,
      -- Just a tree containing Data.Vec.*
    rendHasTarball :: Bool,
      -- True
    rendUploadInfo :: (UTCTime, String),
      -- (Jun 17 2010, "Scott")
    rendOther :: PackageDescription
      -- the package description
}

From this, the html feature can make a package page that looks like the current one, where 95% of its work is HTML formatting via Text.XHtml.Strict. A json feature could use the same information to make a data-rich nest of curly braces.
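As a toy illustration of that division of labor, such a json feature could be little more than a pure function over the render type. Here’s a sketch over a simplified stand-in for a few of the PackageRender fields above (no string escaping, and a real feature would of course use a JSON library):

```haskell
import Data.List (intercalate)

-- Toy stand-in for a few PackageRender fields; the real record
-- is the one shown in the post.
data PackageRenderMini = PackageRenderMini
  { rendPkgIdStr    :: String
  , rendMaintainerM :: Maybe String
  , rendCategoryL   :: [String]
  }

-- Sketch of a json view over the shared render type.
renderJSON :: PackageRenderMini -> String
renderJSON r = concat
  [ "{\"package\":", str (rendPkgIdStr r)
  , ",\"maintainer\":", maybe "null" str (rendMaintainerM r)
  , ",\"category\":[", intercalate "," (map str (rendCategoryL r)), "]}"
  ]
  where str s = "\"" ++ s ++ "\""   -- no escaping in this sketch
```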

Now, a paragraph or two about a failed approach. I also considered having each feature provide its own HTML. This is perhaps the simplest approach on the face of it. However, it gets tricky to, say, define a package page and then later append to it for newer features. I considered HTML hooks where a feature could provide an interface to anyone who wants to inject Html blocks into its pages. For example, a build reports feature would have to register for a hook so that the main package page can link to the reports page.

This has several disadvantages, the most prominent of which is that it makes it cumbersome to switch to a different HTML-generating library or add new formats. I just accepted that HTML was an exceptionally unmodular thing. Instead, what the HTML feature now does is depend on both the build reports feature and the packages feature, and this also allows free-form HTML instead of copy+paste amalgamations, which I’ve heard can be rather ugly. The metric to go by here is “out of all modifications one could imagine making to the server, how can I make them implementable modifying the minimum number of modules?” (I haven’t considered using partial derivatives to optimize the minimum… yet.)

Here is a brief description of the middle features:

  • packages: just package pages
  • upload: authenticated users can upload new packages, with some checking in place: can’t overwrite packages, can only upload a new version if a maintainer, and so on. Adds a maintainer/author group for each package. By contrast, the mirror feature overwrites packages without question.
  • check: checking packages before indexing them and providing candidate packages (see previous post)
  • users: user pages, password-changing, currently using core and not storing any data of its own
  • distros: linking Hackage with Arch, Debian, and any other distribution with package repositories with Haskell binaries. These distributions can PUT and DELETE to Hackage to indicate the addition and removal of these packages.
  • build: submission of build reports, both anonymous and with full compilation logs

And finally, an ad hoc but nonetheless important feature:

  • legacy: a pile of 301 redirects so that old URIs can mostly work (in particular, links to /cgi-bin/hackage-scripts/package/foobar posted on mailing lists 4 years ago will still work)

Features each have their own particular init functions. For instance, the function to initialize the HTML module is currently:

initHtmlFeature :: CoreFeature -> PackagesFeature -> IO HtmlFeature

URI trees

I would have written this up yesterday but I’ve spent the last 24 hours implementing a new and improved routing system. All of the magic happens in impl, the ServerPart Response which is given to Happstack’s simpleHTTP.

impl :: Server -> ServerPart Response
impl server =
    renderServerTree (serverConfig server) []
  -- ServerTree ServerResponse
  . fmap serveResource
  -- ServerTree Resource
  . foldl' (\acc res -> addServerNode (resourceLocation res) res acc)
           serverTreeEmpty
  -- [Resource]
  $ concatMap resources (serverFeatures server)

This seems pretty terse for what’s effectively the server’s main method, but complexity lurks just beneath the surface. It all starts with lists of resources, each server feature providing its own list, which are concatenated into a [Resource]. A Resource contains a URI, and how to respond when that URI is visited for certain combinations of HTTP methods and content-types. Although I’ve never coded a line of Ruby in my life, I stole some of Rails’ routing syntax for this task (also stolen by the Pylons web framework, apparently). Here’s how it works:

  • A resource at “/users/login” will be run only when /users/login is visited, assuming it’s a GET request.
  • A resource at “/package/:package” will be run when /package/HDBC is visited, but also when /package/nonexistent-1.0 is entered. It’s passed [("package", "HDBC")] in the former case, and there are combinators to turn assoc lists into data values (type DynamicPath = [(String, String)] and a combinator withPackagePath :: DynamicPath -> (PackageId -> PkgInfo -> ServerPart Response) -> ServerPart Response). It’s up to the resource to return a 404 if it can’t abide by the URI.
  • A resource at “/package/:package/doc/…” will be run when /package/uvector/doc/ or any subdirectory is visited, and it’s likewise passed an appropriate assoc list.
  • I can specify “/package/:package/:cabal.cabal”, and when /package/parsec-3.1.0/parsec.cabal is visited, the resource is given [("package", "parsec-3.1.0"), ("cabal", "parsec")] (the extension is stripped off).
  • And the most complicated one: “/package/:package.:format”. This works for /package/QuickCheck ([("package", "QuickCheck"), ("format", "")]), or /package/llvm-0.8.0.2.json ([("package", "llvm-0.8.0.2"), ("format", "json")]). An empty format means to go for the default, in this case HTML.
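The matching rules above (minus the extension-splitting case, omitted for brevity) boil down to a small pure function. A sketch, with hypothetical names:

```haskell
import Data.List (intercalate)

type DynamicPath = [(String, String)]

-- Pure sketch of the route matching described above: static
-- components must match exactly, ":var" components capture, and
-- "..." swallows the rest of the path.
matchPath :: [String] -> [String] -> Maybe DynamicPath
matchPath ["..."] rest = Just [("...", intercalate "/" rest)]
matchPath []      []   = Just []
matchPath (p : ps) (u : us) = case p of
    ':' : var     -> fmap ((var, u) :) (matchPath ps us)
    _ | p == u    -> matchPath ps us
      | otherwise -> Nothing
matchPath _ _ = Nothing
```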

Server trees provide a way for efficiently serving an entire tree of URIs. Starting from an empty server tree, resources are incrementally added, and when two share the same URI format they are combined. For example, the simplified Hackage URI tree is: [tree diagram in the original post]

The relevant types are:

data ServerTree a = ServerTree {
    nodeResponse :: Maybe a,
    nodeForest :: Map BranchComponent (ServerTree a)
}
data BranchComponent = StaticBranch String -- /foo
                     | DynamicBranch String -- /:bar
                     | TrailingBranch -- /...
addServerNode :: Monoid a => [BranchComponent] -> a
              -> ServerTree a -> ServerTree a
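A simplified sketch of how addServerNode can work, combining resources with mappend when two land on the same node (this is an illustration of the idea, not the actual hackage-server implementation):

```haskell
import qualified Data.Map as Map

data BranchComponent = StaticBranch String   -- /foo
                     | DynamicBranch String  -- /:bar
                     | TrailingBranch        -- /...
  deriving (Eq, Ord, Show)

data ServerTree a = ServerTree
  { nodeResponse :: Maybe a
  , nodeForest   :: Map.Map BranchComponent (ServerTree a)
  }

serverTreeEmpty :: ServerTree a
serverTreeEmpty = ServerTree Nothing Map.empty

-- Walk down the branch components, creating nodes as needed; when
-- two resources land on the same node, combine them with mappend.
addServerNode :: Monoid a => [BranchComponent] -> a
              -> ServerTree a -> ServerTree a
addServerNode [] res (ServerTree mres forest) =
    ServerTree (Just (maybe res (`mappend` res) mres)) forest
addServerNode (b : bs) res (ServerTree mres forest) =
    ServerTree mres (Map.alter insert b forest)
  where
    insert = Just . addServerNode bs res . maybe serverTreeEmpty id
```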

Finally, I have a 100-line function for converting resources into something Happstack can read (to be broken up shortly, I hope). It’s called serveResource, and it’s how I convert a ServerTree Resource into a ServerTree ServerResponse via ServerTree’s Functor instance. Then the tree is converted to its final flat form, using Happstack’s path-munching combinators to traverse each node’s forest.

serveResource :: Resource -> ServerResponse
renderServerTree :: Config -> DynamicPath
                 -> ServerTree ServerResponse
                 -> ServerPart Response

If this effort is a success, I won’t have to deal with the ServerTree type in any great detail for the rest of the summer. I’ve pushed all of the above code to the hackage-server repository.

Thanks for perusing my run-through of some of the internal server design and my exploration of the problem domain. I’ve also been reading the other GSoC blogs, including Marco’s progress on Immix. I had idly considered applying for that, but given my near-total unfamiliarity with the GHC RTS, I think it would’ve been more than a bit difficult for me. I can see he’s doing a great job, too. Still, there are some things I appreciate about using Haskell and not C in my project. Not only does the type system prevent a host of runtime errors, it also forces me to consider all possible sorts of values which can inhabit a given type and write a case for each one. This is something that’s come in handy a lot in the past few days. Well, now on to actually implementing features in detail. I’ll keep you all posted.

June 17, 2010. Uncategorized. 1 comment.

Hackage GSoC: Beginnings of an Infrastructure for Modular Features

Hi everyone. The process of transforming the Hackage codebase is ongoing, and I’ve made some sweeping changes and introduced a few regressions. (I’d more optimistically call them “todo list items”.) I’ve spent 3/4 of the last week working on the code base, and 1/4 reading books and websites and working on the site design. Site design here refers to the interface that everyone who uses Hackage sees, and I’ve posted a preliminary draft of the proposed URIs.

The goal is to make a REST API which can be read and manipulated by automated clients, and of course perused by web browsers just like the current Hackage. One of the purposes of REST (Representational State Transfer) is simplifying state between the client and server by manipulating representation of resources using plain old HTTP.

Those Haskellers who are familiar with REST might point out that documenting an API and setting up URI conventions (like I did on the Hackage wiki) are partly antithetical to the goals of REST, which eschews servers that only understand highly specific remote procedure calls and clients that construct URIs based on hard-coded conventions (coupling). Roy Fielding, the inventor of REST, stresses that REST APIs must be hypertext-driven. Don’t worry: my intention is to make all of the URIs fully discoverable from the server root, whether browsing the HTML representation or, say, a JSON version. The URI page is an aid in design, not documentation. Since there tends to be a one-to-one mapping between each URI/method pair I’ve listed and each feature I’d like to implement, it tells me what I have left to do.

Fine-tuning data structures

To this end, I’ve made and committed some changes to hackage-server. Some of the time was spent adjusting a few important types, and the rest dealing with the subsequent code breakages. It is still better and safer than doing similar things in a dynamically typed programming language, where I’d end up either sweeping the entire code base or analyzing call graphs manually to determine what broke. Here’s an example of a type I altered, the PkgInfo type, which holds information about a specific package version:

data PkgInfo = PkgInfo {
    -- | The name and version represented here
    pkgInfoId :: !PackageIdentifier,
    -- | Parsed information from the cabal file.
    pkgDesc   :: !GenericPackageDescription,
    -- | The current .cabal file text.
    pkgData   :: !ByteString,
    -- | The actual package .tar.gz file, where BlobId
    -- is a filename in the state/blobs/ folder.
    -- The head of the list is the current package tarball.
    pkgTarball :: ![(BlobId, UploadInfo)],
    -- | Previous .cabal file texts, and when they were uploaded.
    pkgDataOld :: ![(ByteString, UploadInfo)],
    -- | When the package was created with the .cabal file.
    pkgUploadData :: !UploadInfo
} deriving (Typeable, Show)
type UploadInfo = (UTCTime, UserId)

The global Hackage state defines a mapping from PackageName to [PkgInfo]. Subtle differences in which types of values are allowed to inhabit PkgInfo have important consequences for package uploading policy. There are a few notable results of this definition.

  1. A package can exist without a tarball. This is more significant for importing data to create secondary Hackages than the normal upload process. The more incrementally importing can happen, the simpler it will be. Alternatively, this would allow for a metadata-only Hackage mirror.
  2. Cabal files can be updated, with a complete history, without having to change the version number. This would allow maintainers to expand version brackets or compiler flags, so long as the changes don't break anything (constricting version brackets is more dangerous).
  3. Tarballs can be updated, also with a complete history, without having to change the version number. This probably won't be enabled on the main Hackage, but exceptions can be granted by admins. If an ultra-unstable Hackage mirror came about, as opposed to the somewhat-unstable model we have now, this might be allowed.
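Point 2, for instance, amounts to a small pure update. Here’s a sketch against a trimmed-down stand-in for PkgInfo (Strings instead of ByteStrings, and a simplified UploadInfo; the real record is the one shown above):

```haskell
-- Simplified sketch of point 2: revising a package's cabal file
-- while keeping the full history, without changing the version.
type UploadInfo = (String, Int)   -- (timestamp, user id), simplified

data PkgInfoMini = PkgInfoMini
  { pkgData       :: String                 -- current cabal file text
  , pkgUploadData :: UploadInfo             -- who uploaded it, and when
  , pkgDataOld    :: [(String, UploadInfo)] -- previous texts, newest first
  }

reviseCabalFile :: String -> UploadInfo -> PkgInfoMini -> PkgInfoMini
reviseCabalFile newText newInfo pkg = pkg
  { pkgData       = newText
  , pkgUploadData = newInfo
  , pkgDataOld    = (pkgData pkg, pkgUploadData pkg) : pkgDataOld pkg
  }
```

Point 3 is the same shape of update applied to the pkgTarball list instead.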

Modular RESTful features

The HackageFeature data structure is intended to encapsulate the behavior of a feature and its state. Features include the core feature set—the minimal functionality that a server must have to be considered a Hackage server, which is serving package tarballs and cabal files—supplemented by user accounts, package pages, reverse dependencies, Linux distro integration, and so on.

The most important field of a feature is locations :: [(BranchPath, ServerResponse)]. The BranchPath is the generic form of a URI, a list of BranchComponents. Taking inspiration from Ruby on Rails routing, you can construct one with the syntax "/package/:package/reports/:id", where visiting http://hackage.haskell.org/package/HDBC/reports/4/ will pass [("package", "HDBC"), ("id", "4")] to the code serving build reports. You can define arbitrary ServerPart Responses at a path, or you can use a Resource abstraction which lets you specify different HTTP methods (GET, POST, PUT, and DELETE). This system is still in development.

HTTP goodies

Because each resource defines its method set upfront, it's possible to make an HTTP OPTIONS method for each one. This is an example of something you get "for free" by structuring resources in certain ways. As I've discovered, there can be an unfortunate trade-off: requiring too much structure makes it unpleasant to extend Hackage with new functionality (having to deal with all of the guts of the server). Too little structure means that those implementing new features can accidentally break the site's design principles and generally cause havoc. A reasonable middle ground is the convention over configuration approach: I'd have plenty of configurable structure internally, and combinators which build on that structure by filling in recommended conventions. This applies particularly to getting the most out of HTTP.

The idea of content negotiation in HTTP is simple, although there's no clear path ahead for implementing it yet. For Hackage, content negotiation consists of responding to preferences in the client's Accept header, which contains MIME types with various priorities. (Other sorts of negotiation include those for languages and encoding.) A web browser like Firefox might send
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, and a hypothetical newer cabal-install client would send application/json for miscellaneous data it needs to read.

Authentication functionality is essentially done, although I had originally planned to work on it later this month. The types might still need some tweaking, of course. There is a system of access control lists called UserLists, each of which is a Data.IntSet of UserIds. With this, we can have an extensible hierarchy of user groups, such as site administrators, per-package maintainers, and trustees (allowed to manipulate all packages without possessing other admin functionality). The type signature for the main authentication function is:

requireHackageAuth :: MonadIO m => Users -> Maybe UserList
                        -> Maybe AuthType -> ServerPartT m UserId

Users is the type containing the site's entire user database. AuthType, either BasicAuth or DigestAuth, can be passed in to force the type of authentication: either basic or digest. (All passwords are currently hashed in crypt form.) This method either returns the UserId, assuming authentication succeeded, or forces a 401 Unauthorized or 403 Forbidden. With this, we can easily extend it to handle specific tasks:

requirePackageAuth :: (MonadIO m, Package pkg) => pkg
                                       -> ServerPartT m UserId
requirePackageAuth pkg = do
    userDb <- query $ GetUserDb
    pkgm   <- query $ GetPackageMaintainers
                            (packageName pkg)
    trust  <- query $ GetHackageTrustees
    let groupSum = Groups.unions [trust, fromMaybe Groups.empty pkgm]
    requireHackageAuth userDb (Just groupSum) Nothing
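Stripped of the HTTP side, the access-control core here is just IntSet operations. A pure sketch (mayMaintain is a hypothetical name, mirroring what requirePackageAuth checks; the real code also performs the authentication step):

```haskell
import qualified Data.IntSet as IntSet

-- Pure sketch of the access-control core: UserLists are IntSets of
-- UserIds, groups combine by union, and authorization is membership.
type UserId   = Int
type UserList = IntSet.IntSet

unions :: [UserList] -> UserList
unions = foldr IntSet.union IntSet.empty

authorized :: UserId -> UserList -> Bool
authorized = IntSet.member

-- The maintainers-or-trustees check, as in requirePackageAuth above.
mayMaintain :: UserId -> UserList -> UserList -> Bool
mayMaintain uid maintainers trustees =
    authorized uid (unions [maintainers, trustees])
```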

Import/export

To paraphrase Don's comment on my previous post, we absolutely can't afford to lose any data. Although the state/db/ directory contains all of happstack-state's MACID data, and can be periodically backed-up off-site, binary data is by nature easy to mess up and hard to recover. A bit of redundancy in storage is a reasonable safeguard, and there's little more redundant than English, at least compared to bit-packing.

Antoine had implemented an extensive Hackage-to-CSV export/import system, where instead of e.g. having a single bit represent whether an account is enabled or disabled, we use the words "enabled" and "disabled", and put the resulting export tarball in a safe place. Instead of having one centralized system, each HackageFeature should take care of its own data, and so I'd like to work on decentralizing the system in the days ahead. The type signatures, suggested by Duncan, are:

data HackageFeature = HackageFeature {
    ...
    dumpBackup     :: IO [BackupEntry],
    restoreBackup  :: [BackupEntry] -> IO (),
    ...
}
type BackupEntry = ([FilePath], ByteString)
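A feature's dumpBackup could then emit entries like the following sketch, spelling out "enabled"/"disabled" as words. The account fields and file name here are invented for illustration, with String standing in for ByteString:

```haskell
-- Hypothetical sketch of a feature dumping its user data as
-- human-readable CSV rather than bit-packed binary.
type BackupEntry = ([FilePath], String)  -- String stands in for ByteString

dumpUsers :: [(Int, String, Bool)] -> [BackupEntry]
dumpUsers users =
    [ (["users", "accounts.csv"], unlines (map row users)) ]
  where
    row (uid, name, enabled) =
      show uid ++ "," ++ name ++ ","
               ++ (if enabled then "enabled" else "disabled")
```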

Bringing features together

There are notable tasks remaining for the basic infrastructure, such as implementing this import/export system. Another major one is creating a hook system with the usual dual nature of one part that responds to actions (like uploading packages) and another which edits documents on the fly (like adding sections to a package page). If you have experience with website plugin systems, what are your thoughts on getting this done in a strongly typed, safe manner?

Having taken a brief tour of the internal server proto-design and the types of functionality that can be implemented with it, I'd like to show how we can leverage these to implement some useful features, some this summer if we as a community approve of them:

  • Build reports, to see if a given package builds on your OS (might save time for unfortunately oft-neglected Windows-users), on your architecture, with your set of dependencies. I would strongly encourage all of you to at least submit anonymized build reports once the feature goes live (check out the client-side implementation), if not the full build log, although I promise we won't stoop to a "Do you want to submit a build report?" query every single time a build fails: maybe only just the first time :) Submitting or not is more of a configuration option. Build reports will probably be anonymized for the public interface, but available in full to package maintainers through the requireHackageAuth authentication mechanism.
  • Reverse dependencies. This is a HackageFeature that doesn't need to define any of its own persistent data, just its own compact index of dependencies that subscribes to a package uploading hook. You can peruse Roel's revdeps Hackage, and if you feel like setting up hackage-scripts with Apache, you can apply his patch to run your own.
  • Hackage mirrors. It should be simple to write a mirroring client that polls hackage.haskell.org's recent changes, retrieves the new tarballs, and HTTP PUTs them to a mirror with proper authentication.
  • Candidate package uploads: improved package checking. This would allow you to create a temporary package resource, perhaps available at /package/:package/candidate/, to augment the current checking system. Currently, checking gives you a preview of the package page with any Cabal-generated warnings. Here, you could set up a package on the Hackage website that's not included on the package list or index tarball. It would employ its own mapping from PackageName to PkgInfo. You can make a candidate package go live at any time, even allowing others to install your candidate package before then. This is a slightly different idea from the ultra-unstable zoo of packages I mentioned with PkgInfo, but has similar quality assurance goals.

Thanks for reading what I've been up to. Critique is welcomed.

June 7, 2010. Uncategorized. 4 comments.
