Using Python to Clean, Query, and Catch SEO Junk Data
Sorting Out Screwed-Up CSVs from Site Crawlers
I’ve got a folder called seo-dumps with a dozen or so versioned CSVs from various crawl tools: Screaming Frog, Sitebulb, and some one-off Python scripts pulling from old sitemaps. They all claim to export UTF-8 — they lie. Running head -n 5 some-export.csv is usually enough to catch one in the act. This actually broke a parquet transform in Pandas for me because the BOM sneaks in silently. The fix? Force the read with encoding='utf-8-sig' and strip the garbage later. It’s a mess, but you’ve got to keep going.
import pandas as pd

df = pd.read_csv('site-data.csv', encoding='utf-8-sig')  # utf-8-sig eats the BOM if present, harmless if not
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')  # 'Status Code' -> 'status_code'
Also: don’t trust ‘Indexability’ columns from export tools. Sitebulb lists something as “Blocked by Robots.txt” even if the bot eventually hit the page via another path. Use Python to validate that against actual HTTP status codes (more on that later).
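A cheap way to do that spot-check, sketched under the assumption that the normalized columns above are called address and indexability (Screaming Frog-style names; swap in whatever your tool exports) and only hitting a sample so you don’t re-crawl the whole site:

import requests

def live_status(url):
    # HEAD keeps it cheap; fall back to GET if the server dislikes HEAD
    try:
        return requests.head(url, allow_redirects=False, timeout=10).status_code
    except requests.RequestException:
        return None

# Anything the tool claims is blocked or non-indexable gets re-checked against reality
flagged = df[df['indexability'].str.contains('Non-Indexable|Blocked', case=False, na=False)]
for url in flagged['address'].head(200):
    print(url, '->', live_status(url))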
Finding Sneaky Redirect Chains
Some redirected URLs look fine until you realize they loop. This took hours one night: the old blog URLs redirected to vanity URLs that 301’d to UTM-decorated links that 301’d back to the originals, minus UTM garbage. Our redirect rules weren’t broken — they were too clever.
import requests
from urllib.parse import urljoin

def trace_redirects(url, max_hops=20):
    """Follow 3xx hops manually; returns the ordered chain plus whether it looped."""
    chain = []
    looped = False
    for _ in range(max_hops):
        if url in chain:
            looped = True  # we've already visited this URL: redirect cycle
            break
        chain.append(url)
        try:
            r = requests.head(url, allow_redirects=False, timeout=10)
            location = r.headers.get('Location')
            if 300 <= r.status_code < 400 and location:
                # Location can be relative, so resolve it against the current URL
                url = urljoin(url, location)
            else:
                break
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            break
    return chain, looped
Eventually I ran this over ~8000 URLs. Found 33 that circled back to themselves via UTM-laced marketing nonsense. Not great for crawl budget and can tank canonical weight.
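The batch pass itself was just a loop; something like this, with a placeholder urls list standing in for the ~8000 pulled from the exports:

urls = ['https://example.com/old-post', 'https://example.com/pricing']  # placeholder list

for u in urls:
    chain, looped = trace_redirects(u)
    if looped:
        print('LOOP:', ' -> '.join(chain))
    elif len(chain) > 3:
        print('LONG CHAIN:', ' -> '.join(chain))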
Page Title Collisions and Hashing Tricks
Titles were being duplicated across big chunks of our site, but no one saw it because they were in different categories — like /guides and /reference. The titles were technically different (“Guide to Widgets” vs “General Guide To Widgets”), but close enough that they were screwing with performance in Google Search Console.
I tossed the titles into a column, normalized them with a quick lowercasing and non-alphanum strip, and hashed ’em to see collisions:
import hashlib

seen = {}
for idx, row in df.iterrows():
    # Lowercase and strip non-alphanumerics so near-identical titles hash the same
    norm = ''.join(e.lower() for e in str(row['title']) if e.isalnum())
    h = hashlib.md5(norm.encode()).hexdigest()
    seen.setdefault(h, []).append(row['url'])

for k, v in seen.items():
    if len(v) > 1:
        print(v)  # every URL sharing this normalized title
This doesn’t catch nuance like synonyms or word order swaps, but it cleared enough garbage to justify rewriting some category logic for titles. Pro tip: check <title>, og:title, and H1 — they’re sometimes out of sync and cause misalignment with what Google indexes.
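A quick way to eyeball that sync; a rough sketch that pulls all three with requests and BeautifulSoup (bs4 is an extra dependency, and the selectors here are just the obvious ones, nothing site-specific):

import requests
from bs4 import BeautifulSoup

def title_sources(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    og = soup.find('meta', property='og:title')
    return {
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'og_title': og.get('content', '').strip() if og else '',
        'h1': soup.h1.get_text(strip=True) if soup.h1 else '',
    }

info = title_sources('https://example.com/guides/widgets')
if len(set(info.values())) > 1:
    print('Out of sync:', info)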
When Canonical Tags Stop Making Sense
We had a canonical tag pointing to a page that was 404’ing. It had been removed in a redesign sprint no one told SEO about. The Python crawler I built was returning HTTP 200 for the page (thanks to a custom Cloudflare page rule intercept), so everything looked fine — until I did a full browser emulation using Playwright.
Playwright reveals fake 200s
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    response = page.goto('https://example.com/404-page')
    print(response.status)   # what the network layer reports
    print(page.title())
    print(page.content())    # the rendered DOM, not the raw response body
    browser.close()
The page loaded, looked fine visually, and returned 200 on static crawl — but the actual DOM had text like “This page doesn’t exist.” We were canonicalizing to a lie.
Ended up adding a middleware layer in the app to throw a legit 410 when the underlying database reported deleted content. Google eventually respected the change, but only after resubmitting the sitemap and API-pinging GSC a couple times.
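Whatever the framework, the shape of it is the same; a minimal sketch, assuming a Flask-style before_request hook and a hypothetical is_deleted() lookup against whatever marks content as removed:

from flask import Flask, request, abort

app = Flask(__name__)

def is_deleted(path):
    # Hypothetical: ask the database whether this path's content has been removed
    return False

@app.before_request
def gone_for_deleted_content():
    # A real 410 tells Google the removal is deliberate, instead of a pretty 200 lying about it
    if is_deleted(request.path):
        abort(410)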
Keyword Collision in Meta Descriptions
I thought we’d escaped 2010s style keyword stuffing, but apparently one of our older CMS plugins was dynamically appending category names to meta descs via tags — and not deduping. So we had metas like:
“Widgets and gizmos to help with widgets, gadgets, gizmo optimization.”
I wrote a stupid little compression test — not normal compression, just character entropy density. Basically if the entropy fell below some threshold, I flagged it. Low entropy = repeated language.
Not precise, but anything with a density under .6 usually had garbage repeated words. Probably a terrible method in NLP terms, but good enough for SEO triage.
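Reconstructing it roughly: the sketch below works at the word level rather than the character level because the separation between stuffed and clean metas is sharper there, so treat the .6 cutoff as something to calibrate on your own data, not a constant:

import math
import re
from collections import Counter

def entropy_density(text):
    """Word-level Shannon entropy over log2(word count); near 1.0 is varied, low is repetitive."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < 2:
        return 0.0
    total = len(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in Counter(words).values())
    return entropy / math.log2(total)

stuffed = "widgets widgets gadget widgets gizmo widgets widgets gadget widgets"
clean = "Compare widget and gizmo specs, pricing, and integration options."
print(round(entropy_density(stuffed), 2), round(entropy_density(clean), 2))  # stuffed scores far lower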
Dealing With GSC’s CSV Export Weirdness
Still not sure why this happens, but if you export data from Google Search Console using the interface (not API), the CSV will double-quote anything with commas or percent signs inconsistently. You get rows where the CTR column is quoted, sometimes not. Parsing this in Python silently miscasts the float if you’re not watching.
df = pd.read_csv("gsc_export.csv", quotechar='"', skipinitialspace=True)
df['ctr'] = df['ctr'].astype(str).str.replace('%','').astype(float) / 100
If you’re getting 0s for click-through rate, that line above is your lifeline. Also, fun bug: GSC sometimes truncates your export filename mid-sentence. I had a file called Performance-Report-(Page--All-time--expor once. Not even mad, just amazed.
Regexing Sitemap URLs That Shouldn’t Be There
One of those classic moments where the sitemap.xml is autogenerated and someone forgot to exclude /wp-json routes and preview drafts. In theory, Google says it ignores non-canonical routes if marked correctly. In practice? They’re indexed anyway, especially if they get linked accidentally from dev environments.
Useful patterns that saved a few headaches:
- Exclude anything with “preview=true” or “revision=…”
- Remove URLs containing “/wp-json/” or “/feed/”
- Check for mixed protocols (http & https point to same page)
- Detect sitemaps linking to localhost or dev domains (yes…)
- Log busted canonical chains where sitemap URL ≠ canonical
I just piped the sitemap through ElementTree, threw down a few re.search catchers on the <loc> values, and dumped the gross ones into a blocklist. Helped with a triplicate indexing issue we had where both www and non-www were included at one point — even though Cloudflare forces one.
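Roughly what that looked like; a trimmed-down sketch that covers most of the patterns above (the canonical comparison needs an extra fetch per URL, so it’s left out), with a local sitemap.xml standing in for however you pull yours:

import re
import xml.etree.ElementTree as ET

BAD_PATTERNS = [
    r'preview=true', r'revision=', r'/wp-json/', r'/feed/',
    r'^http://',                      # mixed protocol: everything should be https here
    r'localhost|staging\.',           # dev hosts have no business in a public sitemap
]

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
tree = ET.parse('sitemap.xml')        # download step omitted

blocklist = []
for loc in tree.getroot().findall('.//sm:loc', NS):
    url = (loc.text or '').strip()
    if any(re.search(p, url) for p in BAD_PATTERNS):
        blocklist.append(url)

print('\n'.join(blocklist))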
Detecting Soft 404s with Headless Audits
Here’s one I stumbled on way too late: thin pages — like ones showing only nav/footers after content deletion — still returned 200 and a page title. GSC labeled them soft 404s a month later, but during crawl they looked fine. To catch more of these early, I wrote a headless scan using Playwright to analyze character count inside the main content wrapper — whatever had the role “main” or id “content”.
If the inner content didn’t meet a threshold (I used ~250 characters), it flagged the page for review. Cheap, fast, and caught around 120 zombie pages that hadn’t been fully removed after deprecation.
# Inside the same Playwright loop as before, after page.goto(url):
main = page.query_selector("main") or page.query_selector("#content")
text = main.inner_text() if main else ""
if len(text.strip()) < 250:
    print("Thin page:", url)
This also caught a weird behavior where localized versions of pages accidentally served completely empty divs due to a translation plugin bug. Same markup, zero text — and nothing in logs.
When Crawl Frequency Lies About Importance
This is a small point, but worth stabbing into the thread: just because a page gets hit often by Google’s crawler doesn’t mean it ranks, or is even valued. We had a pricing comparison page that got crawled daily. Looked important. Turns out it was getting linked from an old XML sitemap buried in a robots.txt-disallowed path — but Googlebot disrespected the robots.txt in this case. It still went in and crawled it, because we linked to the sitemap TSV from a blog post. Madness.
Once I removed the rogue sitemap link from the blog post and made that whole directory return a hard 404 (on top of the robots.txt disallow), crawl rate dipped in two days. Pages outside that loop saw an increase in crawl frequency a week later. Nothing documented about this behavior, but logs don’t lie.
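If you want to check this on your own site, the server logs are enough; a rough sketch, assuming combined-format access logs and taking the Googlebot user-agent at face value (verify the IP ranges if it actually matters):

import re
from collections import Counter

hits = Counter()
with open('access.log') as f:          # path and log format are assumptions, adjust for your stack
    for line in f:
        if 'Googlebot' not in line:
            continue
        m = re.search(r'"(?:GET|HEAD) (\S+) HTTP', line)
        if m:
            hits[m.group(1)] += 1

for path, count in hits.most_common(20):
    print(count, path)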