Grab: Rip textual content from webpages with ease!

One of the most important skills a professional can hone is recording and retrieving information efficiently. With that said, I tend to take notes a little too fast, which typically results in messy, obscure, and difficult-to-decipher scribbles. This leaves me with two options: slow down my note-taking and struggle to pay attention to the actual presentation, or copy content directly. I take the latter option, which has led to a pretty large collection of PDFs, eBooks, and other reference materials on my digital bookshelf. One of the best sources of interesting information is the Internet. Bloggers like Ken Shirriff, Julia Evans, and Bruce Schneier produce mind-blowing content that leads me to interesting and exciting topics. I like to archive this content locally so that I can refer to it when I can't get a stable Internet connection. On top of that, contrary to popular belief, content can be lost at the literal flip of a switch. Sometimes it's an honest mistake by an overworked sysadmin; other times content is lost to sabotage. The aforementioned blogs rarely see link rot, but the same can't be said for the little guys.

Many tools exist to archive content on the web. Both online and offline repositories can be built, full of amazing features and capable of rendering an archived site as it would have looked when it was live. However, these tools are fairly complex. Most are locally hosted web applications that require you to run a [LMW]AMP stack on your system. Then you have to deal with how the sites are stored: the files are usually compressed and surrounded by metadata files that make simple text parsing rather annoying. On top of that, options for exporting the data to other formats are rather limited.

I wrote a simple tool in Python called grab, which takes a few parameters and a URL and returns formatted text from the target. It uses the requests library to fetch the HTML document, then feeds it to a port of arc90's readability library, which strips out unnecessary data and returns the main body of the webpage.
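To give a rough idea of that fetch-then-extract flow, here is a minimal sketch. It is not grab's actual source. I'm assuming the readability port is the readability-lxml package (imported as readability), and the names html_to_text and grab_text are made up for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, breaking lines at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Reduce an HTML fragment to plain text, one block per line."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    raw = "".join(collector.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def grab_text(url, timeout=10):
    """Fetch url and return (title, main body text). Hypothetical helper."""
    # Third-party imports are kept local so html_to_text works without them.
    import requests
    from readability import Document  # assumed: the readability-lxml package

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    doc = Document(response.text)
    # summary() returns the cleaned-up main content as an HTML fragment.
    return doc.title(), html_to_text(doc.summary())
```

Calling grab_text("https://example.com/some-post") should hand back the page title and the article body as plain text, ready to be written to a file or piped elsewhere.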

I was inspired by the reading-mode feature in web browsers like Firefox, which uses a library called readability. Perhaps I'll do a write-up on how it works later down the line, but essentially it's a library that employs heuristics to score DOM objects based on observable properties like the number of periods in a <p> tag, the number of links in a <div>, and other features that can be easily parsed. From there, content can be separated from the structural elements of the webpage.
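To make the scoring idea a bit more concrete, here is a toy sketch using only the standard library. It is not readability's real algorithm: the formula, the BlockScorer class, and the score and best_block functions are all my own inventions for illustration. Each <p> or <div> earns points for periods and commas in its own text and loses them in proportion to how much of that text sits inside links.

```python
from html.parser import HTMLParser


class BlockScorer(HTMLParser):
    """Record the text of each <p> and <div>, plus how much of it is link text."""

    CANDIDATES = {"p", "div"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of blocks currently being read
        self.link_depth = 0    # greater than 0 while inside an <a> element
        self.blocks = []       # finished blocks, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.CANDIDATES:
            self.open_blocks.append({"tag": tag, "text": "", "link_chars": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in self.CANDIDATES and self.open_blocks:
            # Toy assumption: the markup is well nested.
            self.blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        # Text is credited to the innermost open block only.
        if self.open_blocks:
            block = self.open_blocks[-1]
            block["text"] += data
            if self.link_depth:
                block["link_chars"] += len(data)


def score(block):
    """More punctuation raises the score; a high share of link text lowers it."""
    if not block["text"].strip():
        return 0.0
    punctuation = block["text"].count(".") + block["text"].count(",")
    link_density = block["link_chars"] / len(block["text"])
    return (1 + punctuation) * (1 - link_density)


def best_block(html):
    """Return the highest-scoring block, or None if there are no candidates."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.blocks, key=score, default=None)
```

Fed a page with a navigation <div> full of links and a <p> of ordinary prose, best_block picks the paragraph, since the navigation block is almost entirely link text. The real library is considerably smarter about it, for instance by letting a paragraph's score count towards its parent elements.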

Currently, grab is an experimental project with a rather simple implementation. If you'd like to test it and give me feedback, I'd love to hear from you. Reach out via the normal channels (email, GitLab issues or pull requests).