Local Announcements Pipeline

(Image Text to Structured Data)

This project processes screenshots of local government and utility service announcements and converts them into structured data. The pipeline currently focuses on OCR text extraction from images to prepare the data for further analysis and visualization.

PROBLEM AND MOTIVATION

Local government and utility announcements are often shared as images on social media platforms, which makes the information difficult to search, filter, or integrate into digital systems. Important details such as schedules, locations, and service interruptions remain unstructured and require manual effort to interpret. This project is motivated by the need to make these announcements more accessible and usable by transforming them into structured, machine-readable data that can support better information tracking and decision-making.

SOLUTION OVERVIEW

The Local Announcements Pipeline simulates a real-world ETL workflow for unstructured data. It processes announcement images by extracting text using OCR, then cleans and structures the data into organized formats.


The pipeline prepares outputs such as tables and calendar-ready datasets, enabling easier visualization and potential integration into applications. This approach demonstrates how raw, image-based information can be converted into structured data for practical use.

DATASET

The dataset consists of publicly available announcement images sourced from official social media pages of local government units and utility service providers. These posts typically include schedules, service interruptions, and community notices shared as image-based content.

The collected data reflects real-world variability in format, layout, and text quality, providing a practical basis for testing OCR extraction and preprocessing techniques in an unstructured data pipeline.

METHODOLOGY

The system is built as a staged data processing pipeline that extracts text from unstructured, image-based announcements and transforms it into usable data formats.

Pipeline:


  • Image collection: sourcing announcement images from public social media pages

  • Preprocessing: image cleaning and preparation for OCR (resizing, contrast adjustment if needed)

  • Text extraction: OCR-based extraction of textual content from images

  • Data cleaning: removing noise, correcting formatting, and standardizing extracted text

  • Structuring: parsing key details into organized formats (e.g., tables, date-time fields)

  • Output generation: preparing structured data for visualization (tables, calendar-ready datasets)
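The data cleaning and structuring steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. It assumes the OCR step has already produced a plain string, that dates appear in a form like "March 5, 2024", and that locations follow a label such as "Location:" or "Affected areas:". The Announcement class, the function names, the label patterns, and the keyword table are all hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Month names are mapped by hand so parsing does not depend on the system locale.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Assumed date style: "March 5, 2024" (the comma is optional).
DATE_PATTERN = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\b")

# Assumed location labels; real announcements may use other wording.
LOCATION_PATTERN = re.compile(
    r"^(?:location|affected areas?)\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE
)

# Hypothetical keyword table for guessing the service type.
SERVICE_KEYWORDS = {"water": ("water",), "power": ("power", "electric", "brownout")}


@dataclass
class Announcement:
    service: Optional[str]
    day: Optional[date]
    location: Optional[str]
    raw_text: str


def clean_ocr_text(text: str) -> str:
    """Collapse runs of whitespace on each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def parse_date(text: str) -> Optional[date]:
    """Return the first valid 'Month D, YYYY' date found, or None."""
    for match in DATE_PATTERN.finditer(text):
        month = MONTHS.get(match.group(1).lower())
        if month is None:
            continue
        try:
            return date(int(match.group(3)), month, int(match.group(2)))
        except ValueError:  # e.g. "February 31, 2024" from a misread digit
            continue
    return None


def structure(ocr_text: str) -> Announcement:
    """Clean raw OCR text and pull out service, date, and location fields."""
    text = clean_ocr_text(ocr_text)
    lowered = text.lower()
    service = next(
        (name for name, words in SERVICE_KEYWORDS.items()
         if any(word in lowered for word in words)),
        None,
    )
    location = LOCATION_PATTERN.search(text)
    return Announcement(
        service=service,
        day=parse_date(text),
        location=location.group(1) if location else None,
        raw_text=text,
    )
```

Fields that cannot be found are left as None rather than guessed, so incomplete records remain visible for manual review.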

RESULTS

The pipeline successfully extracted text from image-based announcements and transformed it into structured formats. Key information such as dates, locations, and service details was organized into clean, tabular outputs.


The results demonstrate that unstructured social media announcements can be converted into usable datasets, enabling easier visualization and potential integration into applications such as calendars or monitoring dashboards.
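As an illustration of what a tabular and a calendar-ready output could look like, the sketch below writes structured rows to CSV and to a basic iCalendar (.ics) text with all-day events. It is an assumption about one possible output format, not the project's actual exporter. The row keys (date, service, location), the function names, the PRODID, and the UID scheme are hypothetical, and long-line folding from the iCalendar specification is not handled.

```python
import csv
import io
from datetime import date, datetime, timezone


def to_csv(rows):
    """Render rows (dicts with 'date', 'service', 'location') as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", "service", "location"])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "date": row["date"].isoformat(),
            "service": row["service"],
            "location": row["location"],
        })
    return buffer.getvalue()


def _escape(text):
    """Escape characters that are special inside iCalendar TEXT values."""
    return (text.replace("\\", "\\\\")
                .replace(";", "\\;")
                .replace(",", "\\,")
                .replace("\n", "\\n"))


def to_ics(rows, stamp=None):
    """Render rows as all-day events in a minimal iCalendar document."""
    stamp = stamp or datetime.now(timezone.utc)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//announcements-sketch//EN",  # placeholder identifier
    ]
    for index, row in enumerate(rows):
        lines += [
            "BEGIN:VEVENT",
            f"UID:announcement-{index}@example.invalid",  # hypothetical UID scheme
            f"DTSTAMP:{stamp:%Y%m%dT%H%M%SZ}",
            f"DTSTART;VALUE=DATE:{row['date']:%Y%m%d}",
            f"SUMMARY:{_escape(row['service'] + ' service interruption')}",
            f"LOCATION:{_escape(row['location'])}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"
```

The CSV form suits tables and dashboards, while the .ics text can be imported by common calendar applications.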


Limitations were observed in cases of low image quality, inconsistent layouts, or dense text, which affected OCR accuracy and required additional cleaning.
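One way to catch such cases early is to flag OCR output that looks noisy before it reaches the structuring step. The sketch below is a simple heuristic offered as an assumption, not something the project is known to use: it measures the share of visible characters that are neither alphanumeric nor common punctuation. The function names, the punctuation set, and the 0.2 threshold are hypothetical and would need tuning on real announcements.

```python
# Punctuation that is expected in ordinary announcement text (assumed set).
EXPECTED_PUNCTUATION = set(".,:;-/()&'")


def noise_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like OCR noise."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # empty output is treated as fully unreliable
    noisy = sum(
        1 for ch in visible
        if not (ch.isalnum() or ch in EXPECTED_PUNCTUATION)
    )
    return noisy / len(visible)


def needs_review(text: str, threshold: float = 0.2) -> bool:
    """Flag text whose noise ratio exceeds the (hypothetical) threshold."""
    return noise_ratio(text) > threshold
```

Flagged items can then be routed to manual checking or a second preprocessing pass instead of being parsed as-is.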

GITHUB
