The world crab is no longer just a learning project; it is shaping up to be quite functional!

The crab is maturing thanks to Hack Week

Another year, another Hack Week project. This time the goal is to get the world crab as close as possible to the point where it can deal with real-world sources without tripping up too easily.

Adding much-needed features

One major goal was support for metadata. Different source formats specify the same type of metadata in incompatible ways. For instance, tags and categories are defined differently in Markdown, Atom and RSS, yet a collection of blogs should end up with one combined set of tags.
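To make the tag problem concrete, here is a minimal Python sketch, not the world crab's actual code. It assumes Atom entries carry tags in the term attribute of category elements, RSS items carry them as the text of category elements, and Markdown front matter has already been parsed into a dict with optional "tags" and "categories" lists. The function names and the lowercase normalization are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def normalize(tag: str) -> str:
    # Assumption: tags that differ only in case or padding are the same tag.
    return tag.strip().lower()


def tags_from_atom(entry_xml: str) -> set[str]:
    # Atom: <category term="rust"/>
    entry = ET.fromstring(entry_xml)
    return {
        normalize(c.get("term"))
        for c in entry.iter(ATOM_NS + "category")
        if c.get("term")
    }


def tags_from_rss(item_xml: str) -> set[str]:
    # RSS: <category>rust</category>
    item = ET.fromstring(item_xml)
    return {
        normalize(c.text)
        for c in item.iter("category")
        if c.text and c.text.strip()
    }


def tags_from_front_matter(meta: dict) -> set[str]:
    # Hypothetical front matter layout: {"tags": [...], "categories": [...]}
    values = list(meta.get("tags", [])) + list(meta.get("categories", []))
    return {normalize(v) for v in values if v and v.strip()}


def merge_tags(*tag_sets: set[str]) -> list[str]:
    # One combined, sorted list of tags across all sources.
    return sorted(set().union(*tag_sets))
```

Whether tags and categories should really be folded into one set is a design choice; keeping them apart would only need a second set per source.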

Authorship is another aspect that becomes crucial when it isn’t just one blog written by one person. An individual post can name several authors, name a single author, or rely on the blog configuration for authorship.
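The three authorship cases can be sketched as a simple fallback chain. This is only an illustration in Python under assumed names: the "authors" and "author" keys and the resolve_authors function are hypothetical, not the world crab's real data model.

```python
def resolve_authors(post_meta: dict, blog_config: dict) -> list[str]:
    """Return the authors of a post, falling back to the blog's author."""
    # Hypothetical keys: a post may carry "authors" (a list) or "author"
    # (a single name); the blog configuration may carry a default "author".
    raw = (
        post_meta.get("authors")
        or post_meta.get("author")
        or blog_config.get("author")
    )
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = [raw]
    # Drop empty entries and surrounding whitespace.
    return [name.strip() for name in raw if name and name.strip()]
```

Returning a list in every case means the rest of the pipeline never has to care which of the three situations it is dealing with.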

The cherry on the metadata cake is sources with incomplete metadata. RSS and Atom feeds often omit fields, which makes it necessary to provide a way to specify metadata alongside the sources. Unfortunately this feature did not make the cut this time.

Making the processing robust enough

From the start the selection of sources was a wild mix. The world crab was conceived to collect various blogs created by different tools. Somehow the end result should be consistent, but even within each source format there is variety. This includes different date formats, different markup for the content and different formats for the blog metadata.
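Dates are a good example of that variety. The following Python sketch assumes two families of input: ISO 8601 style timestamps (as typically seen in Atom feeds and front matter) and RFC 822 style timestamps (as typically seen in RSS pubDate). The parse_date name and the choice to treat dates without a time zone as UTC are assumptions made for this illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_date(text: str) -> datetime:
    """Parse an ISO 8601 or RFC 822 style date into an aware datetime."""
    text = text.strip()
    try:
        # Older Python versions do not accept a trailing "Z" here.
        iso = text[:-1] + "+00:00" if text.endswith("Z") else text
        parsed = datetime.fromisoformat(iso)
    except ValueError:
        try:
            parsed = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            raise ValueError(f"unrecognized date format: {text!r}") from None
    if parsed.tzinfo is None:
        # Assumption: dates without a time zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Normalizing everything to time zone aware values up front means posts from different sources can be sorted together without surprises.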

Another challenge is making sure updates are pulled in smoothly. Local files, git repositories and remote HTTP URLs have different requirements, and pulling in changes to existing files isn’t the same as starting from scratch.

The solution to both is of course test-driven development (TDD). To avoid making things too easy, it is also recommended to start with erroneous tests (or at least I tell myself I did it on purpose).

Bonus challenge: Native HTML rendering

This one may or may not turn out to be useful in the long run. Since theming in static site generators such as Hugo isn’t consistent, preparing something like a list of authors or tags is surprisingly non-trivial. The obvious alternative is to have the world crab generate the site itself! Collecting posts and metadata is easy, but theming still needs to be implemented somewhere. For now this is quite basic and writes barebones HTML without any styling.
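To give an idea of what "barebones" means, here is a small Python sketch of rendering an index page from collected posts. The render_index function and the post fields ("title", "url", "authors") are hypothetical and only stand in for whatever the world crab collects; the one part that really matters is escaping everything that comes from a source.

```python
from html import escape


def render_index(site_title: str, posts: list[dict]) -> str:
    """Render an unstyled HTML index page listing all posts."""
    items = []
    for post in posts:
        # Escape everything that originates from a source.
        url = escape(post["url"], quote=True)
        title = escape(post["title"])
        authors = escape(", ".join(post["authors"]))
        items.append(f'    <li><a href="{url}">{title}</a> by {authors}</li>')
    body = "\n".join(items)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{escape(site_title)}</title></head>\n'
        "<body>\n"
        f"  <h1>{escape(site_title)}</h1>\n"
        "  <ul>\n"
        f"{body}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )
```

Theming would then be a matter of swapping this hard-coded markup for templates, which is exactly the part that still needs a home.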