CivicBand
CivicBand is a really cool project. It uses publicly available APIs to grab PDFs of civic meeting minutes, OCRs them, and publishes the text via Datasette for exploration. Open data!!
- We fetch PDFs of civic minutes anywhere we can get easy API access. We don’t “scrape” the listing sites, at least not right now.
- We break those PDFs up into images, one per page.
- We use Tesseract to OCR those images into text.
- We put the text of each page into a SQLite database.
- Each site is a Datasette instance. We have a generation script that creates the Caddyfile for the whole collection and the metadata.json for each instance.
- The whole thing is deployed to one VPS in Oregon.
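
To make the steps above concrete, here is a rough sketch of how such a pipeline could look in Python. This is not CivicBand's actual code. It assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed, and the `pages` table layout, the function names, the hostnames, and the ports are all hypothetical choices made for illustration.

```python
import json
import sqlite3
import subprocess
from pathlib import Path


def pdf_to_page_images(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Render each PDF page to a PNG with Poppler's pdftoppm (assumed installed)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf_path.stem
    subprocess.run(
        ["pdftoppm", "-png", "-r", "300", str(pdf_path), str(prefix)],
        check=True,
    )
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    return sorted(out_dir.glob(f"{pdf_path.stem}-*.png"))


def ocr_image(image_path: Path) -> str:
    """OCR one page image with the tesseract CLI, reading the text from stdout."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def store_pages(conn: sqlite3.Connection, meeting: str, date: str, pages: list[str]) -> None:
    """Insert one row per page; the schema here is a hypothetical one."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "meeting TEXT, date TEXT, page INTEGER, text TEXT, "
            "PRIMARY KEY (meeting, date, page))"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            [(meeting, date, n, text) for n, text in enumerate(pages, start=1)],
        )


def process_pdf(conn: sqlite3.Connection, pdf_path: Path, work_dir: Path,
                meeting: str, date: str) -> None:
    """Full path for one document: PDF -> page images -> OCR text -> SQLite rows."""
    images = pdf_to_page_images(pdf_path, work_dir)
    store_pages(conn, meeting, date, [ocr_image(image) for image in images])


def render_caddyfile(sites: dict[str, int]) -> str:
    """Build one Caddy site block per hostname, proxying to a local Datasette port."""
    blocks = [
        f"{host} {{\n\treverse_proxy localhost:{port}\n}}\n"
        for host, port in sorted(sites.items())
    ]
    return "\n".join(blocks)


def render_metadata(place_name: str) -> str:
    """Build a minimal Datasette metadata.json body for one instance."""
    metadata = {
        "title": f"{place_name} meeting minutes",
        "description": "OCR text of public meeting minutes, one row per page.",
    }
    return json.dumps(metadata, indent=2)
```

With something like this, each municipality would get its own SQLite file served by its own Datasette process, and a single Caddy server on the VPS would route each hostname to the matching local port. Storing one row per page keeps search results pointed at a specific page of the original document.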