Tuesday, March 17, 2009

Creating Clean URLs With IBM WCM

One of the challenges with many content management systems, and IBM’s is no exception, is creating short, clean URLs. As part of structuring and managing your content, the URL segments tend to build up into very long URLs. While most systems have a way to provide shorter aliases, these need to be created manually and tend to just redirect to the real URL rather than being the one canonical URL for the content.

For example, while redeveloping LiveWorks! to be served from an IBM WCM server instead of a WordPress blog (there’s really only so far you can push WordPress before it breaks), we wound up with URLs like http://liveworks.ephox.com:10038/wps/wcm/connect/LiveWorks/lw/home/mailing-list/ instead of the desired http://liveworks.ephox.com/mailing-list/

Fortunately, Apache 2 has a few modules that can help out here, and they also make it easy to serve the non-IBM content (mailing list management, download files, etc.) from the same domain. The basic idea is that clients connect to the Apache server, which translates the URL for WCM and proxies the request through. When the content is returned, the Apache server modifies the URLs so that the links go to our nice URLs instead of the ugly long ones.

To get started we need to load the modules that we’re going to use:

LoadModule proxy_module /usr/lib/apache2-prefork/mod_proxy.so
LoadModule proxy_http_module /usr/lib/apache2-prefork/mod_proxy_http.so
LoadModule ext_filter_module /usr/lib/apache2-prefork/mod_ext_filter.so

Since we’re using the mod_proxy module, the next thing we need to do is make sure we don’t have an open proxy server as that has some rather bad security consequences for our server and the internet as a whole:

ProxyRequests Off

Now we enable a specific pass through to the WCM server, adding in all that extra URL cruft that it requires:

ProxyPass / http://localhost:10038/wps/wcm/connect/LiveWorks/lw/

So now the URL “http://liveworks.ephox.com/hints-tips/article-name/” will go to “http://liveworks.ephox.com:10038/wps/wcm/connect/LiveWorks/lw/hints-tips/article-name/” and display the right content. We’re not done yet though: our mailing-list page from the original example is meant to be at the root level, but all content in WCM has to be in a site area, so we’ve had to add a “home” site area to hold it. That means our pretty URL is currently http://liveworks.ephox.com/home/mailing-list/

We’ll add a specific ProxyPass directive for the mailing-list page so we get:

ProxyPass /mailing-list http://localhost:10038/wps/wcm/connect/LiveWorks/lw/home/mailing-list
ProxyPass / http://localhost:10038/wps/wcm/connect/LiveWorks/lw/
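
To make the matching order concrete, here is a rough Python model of how these two ProxyPass rules pick a backend URL. It is only a sketch: the names map_to_backend and PROXY_RULES are made up for illustration, and it assumes that rules are checked in configuration order with the first match winning. Real mod_proxy does more than a plain prefix comparison.

```python
# Hypothetical model of the two ProxyPass rules; this is not Apache code.
BACKEND = "http://localhost:10038/wps/wcm/connect/LiveWorks/lw"

# Rules in the same order as the configuration: most specific first.
PROXY_RULES = [
    ("/mailing-list", BACKEND + "/home/mailing-list"),
    ("/", BACKEND + "/"),
]


def map_to_backend(path):
    """Return the backend URL for a request path, or None if no rule matches."""
    for prefix, target in PROXY_RULES:
        if path.startswith(prefix):
            # Swap the matched prefix for the rule's target, keep the rest.
            return target + path[len(prefix):]
    return None


if __name__ == "__main__":
    print(map_to_backend("/mailing-list/"))
    print(map_to_backend("/hints-tips/article-name/"))
```

In this model, listing the "/" rule first would swallow every request, so /mailing-list/ would be sent to the lw/mailing-list/ path rather than lw/home/mailing-list/. That is why the more specific rule comes first.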

Now our mailing list URL is right, but as soon as we click a link we get the ugly URLs back! This is where we need Apache to rewrite the URLs for us:

ExtFilterDefine change-urls mode=output intype=text/html \
cmd="/usr/bin/sed -e s:/wps/wcm/connect/LiveWorks/lw::g"
ExtFilterDefine change-home-urls mode=output intype=text/html \
cmd="/usr/bin/sed -e s:/wps/wcm/connect/LiveWorks/lw/home::g"

SetOutputFilter change-home-urls;change-urls


Wow, that’s ugly - both in the way the configuration looks and the way it works. Hopefully some Apache gurus will be able to suggest a better way of going about this. It works by adding two output filters, both of which run ‘sed’ over the content before it’s returned to the client. The first filter, change-home-urls, rewrites all the URLs to documents in our home site area so they appear as if they weren’t in a site area, and the second, change-urls, strips the usual URL cruft from the start of any other links. Now we can happily click links and everything works nicely.
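
The order of the two filters matters, and a small Python stand-in for the sed commands shows why. This is a sketch only: the function names and the sample PAGE markup are invented for illustration, and it assumes the filters run in the order they are listed in SetOutputFilter, each one seeing the previous one's output.

```python
# Hypothetical stand-ins for the two sed-based output filters.
PREFIX = "/wps/wcm/connect/LiveWorks/lw"


def change_home_urls(html):
    """Mimic: sed -e s:/wps/wcm/connect/LiveWorks/lw/home::g"""
    return html.replace(PREFIX + "/home", "")


def change_urls(html):
    """Mimic: sed -e s:/wps/wcm/connect/LiveWorks/lw::g"""
    return html.replace(PREFIX, "")


def apply_filters(html, filters):
    """Run the filters in order, each one seeing the previous one's output."""
    for f in filters:
        html = f(html)
    return html


# Made-up sample markup with one link in the home site area and one outside it.
PAGE = (
    '<a href="/wps/wcm/connect/LiveWorks/lw/home/mailing-list/">Mailing list</a>\n'
    '<a href="/wps/wcm/connect/LiveWorks/lw/hints-tips/article-name/">Article</a>\n'
)

if __name__ == "__main__":
    # Order used in the configuration: home links are cleaned up first.
    print(apply_filters(PAGE, [change_home_urls, change_urls]))
    # Reversed order: the general filter removes the prefix, so the
    # home filter no longer finds anything to match.
    print(apply_filters(PAGE, [change_urls, change_home_urls]))
```

With the reversed order the mailing list link comes out as /home/mailing-list/, because by the time the home filter runs the long prefix it looks for is already gone.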

One slight oddity I’ve found in this setup is that it works correctly even though the URL rewriting doesn’t pick up every URL. For example, the URLs to components like stylesheets and images don’t include the site name (the lw part of the URL), so they don’t get rewritten. Somehow they still work though, and it would be reasonably simple to devise a filter if they ever start causing problems, so we’ll just ignore them for now.

Finally, what about those sections of the site that we want Apache to serve directly? The ProxyPass config has an option specifically to prevent URL patterns from being proxied:

ProxyPass /downloads !

Put that before the other ProxyPass configuration items and any URLs that start with /downloads will be served directly by Apache.
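
Why the exclusion has to come first can be seen in a toy model of the rule list. As before, this is a hedged sketch rather than how mod_proxy is implemented: route, the two rule lists, and the file name file.zip are all hypothetical, and the model assumes the first matching rule wins.

```python
# Hypothetical model of ProxyPass ordering with an exclusion; not Apache code.
BACKEND = "http://localhost:10038/wps/wcm/connect/LiveWorks/lw"
SERVE_LOCALLY = "!"  # the marker ProxyPass uses for "do not proxy"


def route(path, rules):
    """Return a backend URL, or None when Apache should serve the path itself."""
    for prefix, target in rules:
        if path.startswith(prefix):
            if target == SERVE_LOCALLY:
                return None
            return target + path[len(prefix):]
    return None


EXCLUSION_FIRST = [
    ("/downloads", SERVE_LOCALLY),
    ("/mailing-list", BACKEND + "/home/mailing-list"),
    ("/", BACKEND + "/"),
]

# Same rules, but with the exclusion moved to the end of the list.
EXCLUSION_LAST = EXCLUSION_FIRST[1:] + EXCLUSION_FIRST[:1]

if __name__ == "__main__":
    print(route("/downloads/file.zip", EXCLUSION_FIRST))
    print(route("/downloads/file.zip", EXCLUSION_LAST))
```

With the exclusion last, the catch-all "/" rule matches before the exclusion is ever reached, so the download request is proxied to WCM instead of being served by Apache.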

Some caveats:

1. This only handles unauthenticated users accessing the site. It breaks if you try to access the site through Apache from a browser that’s logged in to the Portal server: IBM Portal tries to redirect you to /wps/wcm/myconnect/ and our filters don’t handle that at all. You could switch it so that only authenticated users can access the site by changing /connect/ to /myconnect/, but there isn’t a clean way to make it available to both at once.
2. By adding two output filters that run sed, every time a request is made for an HTML page, two instances of sed are run to process the content. For small sites that probably won’t matter, but if your server is under load it’s very likely to be a major bottleneck.
3. Since the output filter doesn’t parse the HTML, it will rewrite the URLs anywhere it sees them. If you write /wps/wcm/connect/LiveWorks/lw/ in your content it will be converted to just /, but any variant like /wps/wcm/connect/LibraryName/SiteName/ doesn’t match the filter and would work just fine.

For the Apache people out there, I’d like to get some options on how to avoid the use of separate sed processes all the time. I tried the mod_proxy_html module but had trouble getting it to compile on SUSE 10. Plus, I rather like the simplicity of the regex instead of attempting to actually parse the HTML, given how unique the URL strings are.
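
One possibility that keeps the regex approach is to fold both substitutions into a single pass, which in sed terms would presumably mean one invocation carrying both expressions instead of two separate filters. The Python sketch below only checks that idea on sample text: the function names and the SAMPLE string are made up, and nothing here has been tested against the Apache configuration above.

```python
import re

PREFIX = "/wps/wcm/connect/LiveWorks/lw"

# One pattern covering both cases: the prefix, optionally followed by /home.
CLEAN_URL = re.compile(re.escape(PREFIX) + r"(?:/home)?")


def two_passes(html):
    """The current setup: strip .../lw/home first, then any remaining .../lw."""
    return html.replace(PREFIX + "/home", "").replace(PREFIX, "")


def one_pass(html):
    """Hypothetical alternative: do both substitutions in a single scan."""
    return CLEAN_URL.sub("", html)


# Made-up sample: a home link, a non-home link, and an unrelated variant.
SAMPLE = (
    '<a href="/wps/wcm/connect/LiveWorks/lw/home/mailing-list/">Mailing list</a>\n'
    '<a href="/wps/wcm/connect/LiveWorks/lw/hints-tips/article-name/">Article</a>\n'
    '<p>/wps/wcm/connect/LibraryName/SiteName/ is left alone.</p>\n'
)

if __name__ == "__main__":
    print(one_pass(SAMPLE))
    print(one_pass(SAMPLE) == two_passes(SAMPLE))
```

On ordinary pages like the sample, the single pass produces the same output as the two chained filters, so the number of processes per request could in principle be halved without changing the result.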