Generating Sitemaps

Posted by JeremyGThomas on Apr 10, 2009 11:07:38 AM

active.com aggregates data from disparate sites within the Active Network.  Because the content comes from many distributed sources, and because we have so much of it, creating a sitemap.xml file for SEO purposes has become challenging.  We're working on a new search solution (which you can access in beta at http://labs.active.com/search), and the engine behind it can generate a sitemap.xml file.

But there's only one problem: the file it generates is huge.

According to https://www.google.com/webmasters/tools/docs/en/protocol.html, a sitemap file can be at most 10MB in size and can contain at most 50,000 URLs, so a file has to be split as soon as it reaches either limit. The sitemap for active.com has over 220,000 URLs and is about 34MB in size.

We needed an application that would split the sitemap.xml file according to the constraints above.  I searched all over the internet for something that would do this, and found only commercial applications that wanted to crawl my site before generating and splitting the sitemap.xml files.

So, my team developed a simple .NET-based application that splits a large sitemap.xml file into smaller ones and also creates the sitemap index file that references them.
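
For readers who want to see the idea without the download, here is a minimal sketch of the same splitting approach. It is not the .NET tool described above; it is written in Python purely for illustration, and the function name split_sitemap, the output names sitemap-N.xml and sitemap-index.xml, and the WRAPPER_OVERHEAD allowance are all assumptions of this sketch. The per-file limits are passed in as parameters rather than hard-coded.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # write plain <url> tags instead of <ns0:url>

# Assumed allowance (in bytes) for the XML declaration and <urlset> wrapper.
WRAPPER_OVERHEAD = 200


def _write(root, path):
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)


def split_sitemap(source, out_dir, base_url, max_urls, max_bytes):
    """Split `source` into sitemap-N.xml parts plus a sitemap-index.xml.

    Returns the part file names in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    chunk, size = [], WRAPPER_OVERHEAD

    def flush():
        # Write the current chunk (if any) as the next numbered part.
        nonlocal chunk, size
        if not chunk:
            return
        root = ET.Element(f"{{{NS}}}urlset")
        root.extend(chunk)
        name = f"sitemap-{len(names) + 1}.xml"
        _write(root, out_dir / name)
        names.append(name)
        chunk, size = [], WRAPPER_OVERHEAD

    for url in ET.parse(str(source)).getroot().iter(f"{{{NS}}}url"):
        # Serializing an entry on its own repeats the xmlns attribute, so this
        # slightly overestimates its size; the estimate errs on the safe side.
        entry_size = len(ET.tostring(url, encoding="utf-8"))
        if chunk and (len(chunk) >= max_urls or size + entry_size > max_bytes):
            flush()
        chunk.append(url)
        size += entry_size
    flush()

    # The index file lists the public URL of every part.
    index = ET.Element(f"{{{NS}}}sitemapindex")
    for name in names:
        entry = ET.SubElement(index, f"{{{NS}}}sitemap")
        ET.SubElement(entry, f"{{{NS}}}loc").text = f"{base_url.rstrip('/')}/{name}"
    _write(index, out_dir / "sitemap-index.xml")
    return names
```

You would call it with the large file, an output folder, the public URL the parts will be served from, and whatever per-file limits the search engine currently documents. The sketch loads the whole source file into memory, which should be workable for a file in the tens of megabytes; a streaming parser would be the safer choice for much larger inputs.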

 

Because I'm feeling philanthropic, I decided to give you access to this tool, free of charge.  Download it here.

Tags: active, seo, sitemap


Apr 10, 2009 6:34 PM Jason Aloia says:

Ching!  I'm sharing this with all my friends silly enough to work on .NET

Jun 26, 2009 7:56 AM Janis32512 says:

Jeremy, I accidentally deleted the 5k Walk/Run schedule that I got as a weight watcher member. I was trying to move the dates and deleted all the work out information. I tried the support page, but it said that I was Unauthorized. Do you have any helpful ideas? Should I try to download the whole thing again? Thanks, Janis

Sep 15, 2009 3:24 PM sweetpea80 says:

Wow, 220,000, that's a lot of URLs.  Are those just all the pages on active?  What does Google suggest? Can you just add the URLs which are main pages? Or have a mix of main pages and hidden pages? Do you really need to have all pages listed?

Sep 17, 2009 3:34 PM JeremyGThomas says in response to sweetpea80:

Yeah, we want all of those pages listed, as the majority are for specific events, which we want to be discoverable through google.com.  So, in this case, it pays (literally) to be comprehensive.