Midnight Pub

Retrieving and Cleaning the Limyaael Rants

~starbreaker

One of my projects this month has been retrieving Limyaael's old LiveJournal rants on writing fantasy fiction and archiving them in Geminispace. It's an unauthorized archive sourced from my former publisher's equally unauthorized archive, but Limyaael apparently dropped off the net in 2010 and hasn't been heard from since.

There's a lot of advice in these rants that I found useful and I don't want these articles to end up lost if LiveJournal goes offline. And since my publisher has evidently gone out of business and the domain registration for curiosityquills.com expires this month, I suspect their website will soon disappear as well.

original post on starbreaker.org

First post about Limyaael's Rants

Grabbing 424 blog posts and making them halfway presentable for Gemini readers is not an easy task, especially when the source is a crufty and unmaintained WordPress site full of promotional clutter and social media buttons.

If I were only grabbing a dozen posts, I might have simply opened each in Firefox's reader mode, manually copied and pasted the text, and then reformatted as needed. That's not a viable approach for 400+ posts; I'd be at it for weeks, giving myself RSI in the bargain.

Besides, I'm a programmer with a big old computer running a Unix-like OS. There are ways to automate retrieval and the worst of the text cleanup.

Automated Retrieval and Text Massage

Here's a bash shell function I created for the task:

The rant() function takes a URL and a string containing a number as arguments. I included it inside a batch file that repeatedly called it with a different URL and number. I probably could have parallelized this, but it was less risky to let it run serially; parallel retrieval might have seemed like a denial-of-service attempt.
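A serial driver along these lines would fit that description. This is only a minimal sketch: the name fetch_all, the list file format (one URL and one number per line), and the pause between requests are assumptions made for illustration, not the original batch file.

```bash
# Hypothetical driver: call a retrieval function once per line of a list file.
# Each line of the list holds a URL and a number separated by whitespace.
# Usage: fetch_all LIST_FILE [COMMAND] [PAUSE_SECONDS]
fetch_all() {
    list=$1
    cmd=${2:-rant}   # function or command to call; "rant" is the name used in the post
    pause=${3:-2}    # seconds to wait between requests (assumed value)

    # Read the list on file descriptor 3 so the called command
    # cannot accidentally consume it through standard input.
    while read -r url number <&3; do
        # Skip blank lines and comment lines in the list.
        case $url in
            ''|'#'*) continue ;;
        esac
        "$cmd" "$url" "$number" || echo "failed: $url" >&2
        sleep "$pause"
    done 3< "$list"
}
```

Running one request at a time with a short pause keeps the load on the remote server low, which is the same reasoning behind not parallelizing.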

The "lynx -dump -nolist" does most of the work for me. Given a URL, it fetches the page, renders it, and dumps the text to standard output without the usual list of links at the end. I pipe that output into a series of sed commands.

The "sed -e 's/   //g' -e 's/  /> /g'" command is how I deal with the way lynx indents text when called with "-dump". Paragraphs are indented three spaces, and blockquotes five. The first expression strips the three-space indentation, which leaves blockquotes with two leading spaces; the second turns those two spaces into gemtext blockquote markup.
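To illustrate the effect on lynx-style output, here is a small sketch. It uses patterns anchored with "^" so that only leading indentation is touched, which is a variation on the expression described above rather than the original, and unindent is a hypothetical name.

```bash
# Hypothetical filter: turn lynx-style indentation into gemtext.
# Five leading spaces (blockquote) become "> "; three leading spaces
# (ordinary paragraph) are removed. The blockquote rule runs first,
# because a five-space line also starts with three spaces.
unindent() {
    sed -e 's/^     /> /' -e 's/^   //'
}

printf '   A paragraph line.\n     A quoted line.\n' | unindent
# prints:
# A paragraph line.
# > A quoted line.
```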

The "sed -e '1,40d' -e '/»/d' -e '/«/d' -e '/\[\_/d' -e '/> \*/d'" command eliminates the first 40 lines, which are always irrelevant, and also removes some leftover UI cruft.

The "sed -e '/\[limyaael_cover_500.png\]/d'" and "sed -e '/\[limyaael\]/d'" operations remove image placeholders. I could probably have combined these, but I didn't think of it at the time.
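One way the two operations could be combined is to pass both expressions to a single sed call. This is a sketch, not the original code; strip_images is a hypothetical name, and the dot in ".png" is escaped here so that it only matches a literal dot.

```bash
# Hypothetical filter: drop both image placeholder lines in one sed call.
# \[ and \] match literal square brackets.
strip_images() {
    sed -e '/\[limyaael_cover_500\.png\]/d' -e '/\[limyaael\]/d'
}
```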

The "sed -e '/[0-9] Comment/,$d'" and "sed -e '/Leave a comment/,$d'" expressions are how I get rid of the vast majority of the publisher's promotional fluff at the end of each article.
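The address form "/pattern/,$" selects everything from the first matching line through the last line of input, so the "d" command truncates the article at that marker. A minimal sketch of the idea, with both expressions in one call and a hypothetical function name:

```bash
# Hypothetical filter: cut everything from the first trailer marker to the end.
# Single quotes keep the shell from expanding "$d" as a variable.
strip_trailer() {
    sed -e '/[0-9] Comment/,$d' -e '/Leave a comment/,$d'
}

printf 'Body text\n3 Comments\nPromotional fluff\n' | strip_trailer
# prints:
# Body text
```

Whichever marker appears first wins, so an article with no comment count is still cut at "Leave a comment".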

The last expression just makes the rant title a first-level header, and then we write the output to a file. You can see the results already.

Limyaael's Rants Archive on starbreaker.org

Generating the Index

Of course, I didn't write the index by hand. I had another shell script for that.

The downside to this script is that it includes "index.gmi" if it's already present, but it was easier for me to manually delete one line than to test for and exclude the index file.
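For comparison, testing for the index file takes only a few lines. The following is a sketch of one possible index generator, not the script used here: the make_index name, the heading argument, and the choice to use each file's first "# " heading as the link label are all assumptions.

```bash
# Hypothetical index generator: print one gemtext link line per .gmi file
# in a directory, skipping index.gmi itself.
# Usage: make_index DIRECTORY [HEADING] > DIRECTORY/index.gmi
make_index() {
    dir=${1:-.}
    heading=${2:-Index}

    echo "# $heading"
    echo
    for path in "$dir"/*.gmi; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$path" ] || continue
        file=${path##*/}
        if [ "$file" = "index.gmi" ]; then
            continue
        fi
        # Use the first "# " heading as the link label, if there is one.
        title=$(sed -n '/^# /{s/^# //p;q;}' "$path")
        echo "=> $file ${title:-$file}"
    done
}
```

Because the shell creates the output file before the loop runs, redirecting into index.gmi in the same directory means the file always exists by the time the glob is expanded, which is why the explicit check matters.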

Work to Do

As of 30 April, the texts are still quite rough and need further editing, but the remaining edits seem too chancy to trust to sed expressions run against an entire directory.

Replies

~maya wrote (thread):

I will be bookmarking this! Her stuff was really really good, and I hadn't paid attention to linkrot over the years. Thanks for this work!

Proxied content from gemini://midnight.pub/posts/438.