Importing Existing HTML Content into MOSS

Sunday, July 29, 2007

I had a requirement to import about 1800 pages from our existing intranet into a new MOSS-based site. Here are a few resources I've used which might be useful if you're doing something similar.

  1. Programmatically Adding Pages to a MOSS Publishing Site
    I decided to put together a tool that would walk an XML file exported from the existing site describing the structure of the content, creating MOSS sites (from custom site definitions) and importing pages as it went. At its core is code based on Andrew Connell's Programmatically adding pages to a MOSS Publishing site post.
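
    As a rough illustration of the walking step only, here is a minimal sketch. The export schema (nested site elements with a name attribute, containing page elements with a file attribute) and the ExportWalker class are assumptions made up for this example; the real export format isn't shown here. Instead of calling the SharePoint object model it just records what it would do, in order:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical export format: nested <site name="..."> elements containing
// <page file="..."> elements. The real schema is not described in the post.
public static class ExportWalker
{
    // Returns one entry per planned action, in the order it would be performed.
    public static List<string> Plan(XmlElement site, string parentUrl)
    {
        List<string> actions = new List<string>();
        string siteUrl = parentUrl.TrimEnd('/') + "/" + site.GetAttribute("name");
        actions.Add("create site " + siteUrl);

        foreach (XmlNode child in site.ChildNodes)
        {
            XmlElement element = child as XmlElement;
            if (element == null)
                continue; // skip whitespace, comments and other non-element nodes

            if (element.Name == "page")
                actions.Add("import " + element.GetAttribute("file") + " into " + siteUrl);
            else if (element.Name == "site")
                actions.AddRange(Plan(element, siteUrl)); // recurse into subsites
        }
        return actions;
    }
}
```

    Calling ExportWalker.Plan(doc.DocumentElement, "http://localhost") on a loaded XmlDocument returns the parent site first, then its pages, then each subsite in turn. In a real tool each entry would become a call into the site and page creation code rather than a string.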

  2. Html Agility Pack - A .NET Html Parser
    To ensure that links within our content continue to function, it was necessary to parse each HTML page and pull out the href attribute from each A tag so that the URL could be rewritten. Html Agility Pack is a wonderful .NET library that makes this simple; you load HTML from files, streams or strings and query for HTML nodes using XPath. A simple href rewriting sample would look something like this:

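       using HtmlAgilityPack;

       HtmlDocument doc = new HtmlDocument();
       doc.Load("file.htm");
       // SelectNodes returns null when nothing matches, so guard before iterating.
       HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
       if (links != null)
       {
         foreach (HtmlNode link in links)
         {
           HtmlAttribute href = link.Attributes["href"];
           // FixLink is your own URL-rewriting routine (not shown).
           href.Value = FixLink(href.Value);
         }
       }
       doc.Save("file.htm");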

    I've ended up using this for a lot more than rewriting links: the import proved the perfect opportunity to turn some ids into classes where a class is the better fit, set certain links to open in new windows, and replace tokens used by the existing CMS.
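
    FixLink itself isn't shown in the sample, and what it does depends entirely on how old URLs map to new ones. As a minimal sketch, assuming a simple lookup table from old intranet paths to new page URLs (the LinkRewriter class, the table and every URL in it are invented for illustration), it might look something like this:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: the mapping table and URLs are example data only.
public static class LinkRewriter
{
    // Old intranet path -> new page URL, matched case-insensitively.
    private static readonly Dictionary<string, string> UrlMap =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "/hr/policies.htm", "/HR/Pages/Policies.aspx" },
            { "/it/helpdesk.htm", "/IT/Pages/Helpdesk.aspx" }
        };

    public static string FixLink(string href)
    {
        if (string.IsNullOrEmpty(href) || href[0] == '#')
            return href; // nothing to do for empty values or in-page anchors

        // A colon before the first slash means a scheme (http:, mailto:, ...),
        // so the link points outside the imported content and is left alone.
        int colon = href.IndexOf(':');
        int slash = href.IndexOf('/');
        if (colon > 0 && (slash < 0 || colon < slash))
            return href;

        // Split off any query string or fragment so it survives the rewrite.
        int cut = href.IndexOfAny(new char[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;
        string suffix = cut >= 0 ? href.Substring(cut) : string.Empty;

        string mapped;
        return UrlMap.TryGetValue(path, out mapped) ? mapped + suffix : href;
    }
}
```

    Unknown paths are returned unchanged rather than guessed at, which makes it easy to log them and extend the table as the import progresses.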

  3. Automatically Publishing All Items in a Publishing Site
    If you're doing anything like this you'll invariably find yourself with a few hundred pages to check in, publish or approve at some point. Mr. Connell has saved us some work again. His extensions to stsadm.exe make publishing all of your pages as simple as:

    stsadm.exe -o publishallitems -url http://localhost -list Pages -includesubsites

    Note: There's a bug in the version I downloaded which results in an infinite loop when using the 'includesubsites' option; the solution is documented in the comments of Andrew's post.

  4. 3rd Party Tools
    Depending on your specific situation it might be appropriate to look at 3rd party tools. Although I've never used it, Metalogix Migration Manager looks to be a comprehensive solution. (Their site is down at the time of posting. In the meantime, you can read more in this post by Stefan Gossner.) If there are any other tools I should mention, leave a comment and I'll add them here.

Comments

Vikram said... Publishing the page can also be done soon after its creation, using the following code:

SPFile pageFile = page.ListItem.File;

if (pageFile.CheckOutStatus != SPFile.SPCheckOutStatus.None)
{
    pageFile.CheckIn("");
}
pageFile.Publish("");

Tuesday, April 22, 2008
