
Sitemap Builder

Languages used: Python (for web scraping and HTML generation)


My site is slowly (quickly) turning into a spaghetti-fied metropolis and I need some order. Or at least a sitemap.

I want something visual - none of this boring texty XML conversion crap. Give me some sexy site screenshots.

Something that just now dawned on me is that Neocities generates screenshots of each page of your site. Like, I knew this. But I’m just now realising this is super useful. Let’s inspect:

[Screenshot: sitemap_terms_update_screenshot]

That gives me this URL. You can access it too!

https://neocities.org/site_screenshots/86/51/carcercitymall/terms-of-service/index.html.540x405.jpg

Hugo (the static site generator I use to manage my site) generates a sitemap for me, but it’s not extensive. I have other sites hosted up here, such as archived versions of my old site, as well as recreated versions of existing sites (see Sprunk Soda and Two Guys and a Girl on the right). Hugo excludes these. I just need a list of every single page on my site.

I used xml-sitemaps.com, a free site I found, to generate a list of links on my site. A nice bonus is that it spat out a full XML sitemap for me too.

So now I have the format to search for screenshots. The link above works for them all - just replace

/terms-of-service/index.html

…with the relative permalink for any other page. Now I have thumbnails without needing to screencap each page myself!
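Just to make that substitution concrete, here's a quick Python sketch of the same idea (the function name is mine; the base path and size suffix come straight from the example URL above):

# The prefix below comes from the example screenshot URL above.
SCREENSHOT_BASE = "https://neocities.org/site_screenshots/86/51/carcercitymall"

def screenshot_url(permalink):
    # permalink is the relative path to a page, e.g. "/terms-of-service/index.html"
    return SCREENSHOT_BASE + permalink + ".540x405.jpg"

print(screenshot_url("/terms-of-service/index.html"))
# https://neocities.org/site_screenshots/86/51/carcercitymall/terms-of-service/index.html.540x405.jpg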

Generate all the page screencaps

A super simple find-and-replace in Notepad++ to uglify my URLs gives me 118 total screenshot links I can download.

[Screenshots: sitemap_uglify_urls_01, sitemap_uglify_urls_02, sitemap_uglify_urls_03 - the Notepad++ find-and-replace in action]

Stunning!
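If you'd rather have Python fetch them than click through 118 links, something like this would do it - a rough sketch, assuming the screenshot links sit one per line in a text file (the filenames here are placeholders):

import requests

# Read the screenshot URLs produced by the find-and-replace above.
with open("screenshot_links.txt") as f:
    links = [line.strip() for line in f if line.strip()]

for link in links:
    # Flatten everything after /site_screenshots/ into a local filename.
    filename = link.split("/site_screenshots/")[-1].replace("/", "_")
    response = requests.get(link)
    response.raise_for_status()
    with open(filename, "wb") as out:
        out.write(response.content)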

Generate image titles/captions

I want to caption each image with the title of the page it leads to. I’m using Beautiful Soup to scrape each page and nab its <title> tag. Here’s a snip of Python showing how to retrieve a webpage from a URL and return its title.

from bs4 import BeautifulSoup
import requests

def scrape_url(url):
    # Fetch the page and pull the text out of its <title> tag
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.find("title")
    return title.get_text()

I know that this data is always gonna be sound, thanks to my Hugo setup:

{{ block "title" . }}
    {{ with .Params.Title }}{{ . }} :: {{ end }}
    {{ .Site.Title }}
{{ end }}

The tab you’re viewing this page in should show Sitemap Builder :: Carcer City Mall, so the code above should be simple enough to make sense of.

Anywho, I can get that whole list of <title> tags and dump them in a master CSV I can use later.
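Roughly, that loop looks like this - a sketch reusing scrape_url() from above, assuming the full page URLs from xml-sitemaps.com are sitting one per line in a text file (both filenames here are placeholders, and the real master CSV picks up a few more columns later):

import csv

# Full page URLs, one per line, straight from xml-sitemaps.com.
with open("page_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("page_titles.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for url in urls:
        # One row per page: the full URL and its scraped <title>
        writer.writerow([url, scrape_url(url)])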

Generating the actual page

Now I’m in a bit of a conundrum. I have a bunch of images and links to where each one needs to go. I need to find a way to automate as much of this as I can, because I don’t want to manually create a shit ton of HTML to house everything. The other question is organisation: I want to separate each ‘realm’ so the sitemap is easy to navigate.

I’ve decided to go with figures, since they’re built for photos with captions, and the only things on this page will be photos with captions. This sitemap page is going to be stripped to the bone in terms of design, so it’s as simple as possible. I have my complete CSV; now I just need to generate some content.

A figure will need to fill out this format:

<figure>
    <img src="[image url]" alt="[image name]">
    <figcaption>
        <a title="[page title]" href="[rel link]">
            [page title]
        </a>
    </figcaption>
</figure>

I spent a bit of time cleaning up links and shoving them into different formats so my code will be really simple in the end.
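For context, the loop below expects two things to exist already: data, the rows of that master CSV, and fw, the open output file the HTML gets written to. A minimal setup sketch (both filenames are placeholders for whatever I actually called them):

import csv

# Load the cleaned-up master CSV; each row follows the layout described below.
with open("sitemap_master.csv", newline="") as f:
    data = list(csv.reader(f))

# The HTML fragment that gets pasted into the sitemap page.
fw = open("sitemap_figures.html", "w", encoding="utf-8")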

'''
row[0] = image name (for alt + name tags)
row[1] = image url (for linking in html)
row[2] = rel link (for linking in html)
row[3] = full url (for beautifulsoup)
row[4] = page title (for image description)
'''

for row in data:
    fw.write(
        f'<figure>'
        f'<img src="{row[1]}" alt="{row[0]}">'
        f'<figcaption><a title="{row[4]}" href="{row[2]}">{row[4]}</a></figcaption>'
        f'</figure>'
    )

Clean up

The rest is piss easy, really. No more automation on my end. I have the simplest of CSS that I’ll just embed into the page, and links that are nested really deep (such as a magazine scan in an archived version of my site) are lopped out because they’re unnecessary.

At the end I have a pretty bare-bones visual sitemap! It’s linked in the footer if you’d like to peruse.