2021-05-19

Looking Into BeautifulSoup4

I've mentioned that I have an idea about writing my own search engine.

There are a lot of questions to answer for a project like that, like "what should be indexed?", "how should search result relevance be measured?" and many more. The most basic one for me, however, is "how do I get (only relevant) info out of an HTML file?"

(Another relevant question is "do I have to write this myself or can I use existing software?" to which the answer of course is that there is existing software, but I'm still going to write my own rather than trim, customize, and butcher some existing code base until it's unrecognizable anyway.)

A week or so ago I joined a Zoom meeting with complete strangers to discuss the awesomeness of having your own homepage. This event was courtesy of the IndieWeb community, and is a bi-weekly recurring thing. It turns out that people are awesome, ingenious, and helpful. I left the meeting energized and having learned the word "BeautifulSoup". That was the final puzzle piece I needed to make sense of the SearchMySite codebase, which is written in python3. I'm gravitating more and more towards python3 as a default home project language. The availability of packages for everything, the ubiquity of python3 on all my platforms, easy installation through OS package managers, and the fact that I can run code without having to compile it first make it a great choice.

I fiddled a little with BeautifulSoup last night, and it's incredibly easy to get started with. It'll definitely help me for all kinds of projects going forward: Webmention, search engine, and probably more that I haven't thought of yet.
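
To give an idea of what "easy to get started with" can look like, here is a minimal sketch, assuming the bs4 package is installed and using Python's built-in "html.parser" backend. The sample HTML, the function name extract_text_and_links, and the choice of which tags count as irrelevant (script, style, nav) are all made up for illustration; they are not taken from SearchMySite or any other project mentioned here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example.
SAMPLE_HTML = """
<html>
  <head>
    <title>My homepage</title>
    <style>body { color: black; }</style>
    <script>console.log("tracking");</script>
  </head>
  <body>
    <nav><a href="/about.html">About</a></nav>
    <h1>Hello</h1>
    <p>This is the <a href="https://example.com/post">interesting</a> part.</p>
  </body>
</html>
"""


def extract_text_and_links(html):
    """Return (title, visible body text, link targets) for an HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: these tags carry nothing worth indexing, so drop them.
    for tag in soup.find_all(["script", "style", "nav"]):
        tag.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Fall back to the whole document if there is no <body> element.
    body = soup.body or soup
    text = body.get_text(separator=" ", strip=True)

    # Collect the href of every remaining <a> tag that has one.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, text, links


if __name__ == "__main__":
    title, text, links = extract_text_and_links(SAMPLE_HTML)
    print(title)
    print(text)
    print(links)
```

What counts as "only relevant" info is the real design question; a search engine would probably want a more careful rule than a fixed list of tags to throw away.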

Relevant links to this post (is this more accessible than footnotes in brackets?):

IndieWeb Homebrew Website Club Europe London.

BeautifulSoup4 python lib.

SearchMySite search engine.

Wiby.me search engine.

Lieu Community Search Engine.

-- CC0 Björn Andersson