
Project

Automated Market Intelligence Pipeline for Sri Lankan Food Prices

A production-focused ETL workflow that gathers daily Central Bank of Sri Lanka (CBSL) commodity reports, parses difficult PDF layouts, and publishes clean, structured food-price data for analysis and downstream use.

Python · ETL · GitHub Actions · Hugging Face · PDF Parsing

Timeline

March 2026 – April 2026

Outcome

Recovered 1,200+ days of historical data and shipped an automated dataset pipeline with 75+ downloads in its first two weeks.

Problem

Commodity price reports were published only as PDFs whose inconsistent layouts made them unsuitable for direct analysis. Scraping the listing page was unreliable, and recovering the historical archive called for a more deterministic ingestion strategy.

Outcome achieved

  • Collected and structured daily data for 24 products across five markets.
  • Recovered more than 1,200 days of historical data using URL-based extraction instead of fragile scraping.
  • Published a reusable dataset pipeline with schema validation and automated updates.

Challenges faced

  • The CBSL PDF uses a split-glyph font encoding in which every number is split across two separate text tokens (e.g. '200.00' renders as '2' + '00.00'), requiring custom token-reassembly logic before any values can be extracted.
  • The CBSL listing page uses Drupal's year filter, which silently scopes pagination to the current year only, causing ~50% of 2020–2023 records to be missed entirely with no error.
  • PDF table structure changed over time: early 2020 reports had physically shorter rows with absent market columns for some commodities, causing strict parsers to crash rather than degrade gracefully.
  • CBSL occasionally publishes a temporary 1-page summary PDF before replacing it with the full 2-page report, causing the pipeline to crash mid-run when scheduled too early.
  • The GitHub Actions cache and Hugging Face could silently desync: a failed upload left the CSV stuck in cache, so every subsequent run saw today's date already present and exited without retrying the push.

How I solved them

  • Built a custom pdfplumber-based parser with regex token reassembly that handles every split-glyph number pattern, including two-digit prices ('8' + '0.00' → 80), thousands-comma splits ('1' + ',000.00' → 1000), and n.a. markers; validated against 24 items × 5 fields = 120/120 correct values on real PDFs (see the reassembly sketch below).
  • Replaced the HTML scraper entirely with deterministic URL generation (.../price_report_YYYYMMDD_e.pdf) plus lightweight HEAD requests to confirm each PDF exists, achieving complete historical coverage with no dependency on the listing page (see the probing sketch below).
  • Made the parser fully tolerant of structural variation: truncated rows (fewer than 10 values) and entirely absent commodities both resolve to NaN rather than errors, so the pipeline always produces a row for every date regardless of report completeness (see the padding sketch below).
  • Handled 1-page PDFs by exiting cleanly with code 0 in daily mode so GitHub Actions does not mark the run as failed; the next scheduled run retries automatically once the full report is available (see the guard sketch below).
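
A minimal sketch of the token-reassembly idea, assuming the numeric fragments have already been pulled from pdfplumber's word stream in reading order; reassemble_tokens and the two patterns are illustrative names, not the production parser:

```python
import re

# A lead fragment is a lone digit; the tail carries the rest of the number
# ('00.00', '0.00', or ',000.00' for thousands-comma splits).
LEAD = re.compile(r"^\d$")
TAIL = re.compile(r"^,?\d{1,3}(?:,\d{3})*\.\d{2}$")

def reassemble_tokens(tokens):
    """Merge split-glyph fragments into floats; 'n.a.' markers become None."""
    values, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower().rstrip(".") == "n.a":
            values.append(None)
            i += 1
        elif LEAD.match(tok) and i + 1 < len(tokens) and TAIL.match(tokens[i + 1]):
            # In these reports every number is split, so a lone digit is
            # always the lead half of the value that follows it.
            values.append(float((tok + tokens[i + 1]).replace(",", "")))
            i += 2
        else:
            # Fallback for tokens that arrive unsplit.
            values.append(float(tok.replace(",", "")))
            i += 1
    return values
```

On the fragments from the examples above, reassemble_tokens(['2', '00.00', '8', '0.00', '1', ',000.00', 'n.a.']) returns [200.0, 80.0, 1000.0, None].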
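
The URL-probing step reduces to formatting one URL per calendar day and issuing a HEAD request. A sketch assuming the requests library; the base path stays elided exactly as in the write-up and must be filled in:

```python
from datetime import date, timedelta

import requests

# Base path elided here as in the write-up ('...'); supply the real directory.
PDF_URL = "https://www.cbsl.gov.lk/.../price_report_{:%Y%m%d}_e.pdf"

def existing_reports(start: date, end: date) -> list[str]:
    """Walk the date range and keep every URL whose PDF actually exists."""
    found = []
    day = start
    while day <= end:
        url = PDF_URL.format(day)
        # HEAD confirms existence without downloading the file.
        if requests.head(url, timeout=10, allow_redirects=True).status_code == 200:
            found.append(url)
        day += timedelta(days=1)
    return found
```

Because the URL is a pure function of the date, a missing day is a definitive 404 rather than a scraping failure, which is what makes historical recovery deterministic.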
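
Graceful degradation then becomes a one-step normalization: map n.a. placeholders to NaN and right-pad short or missing rows to the expected width (normalize_row and EXPECTED_VALUES are illustrative; the real row width follows the report layout):

```python
import math

EXPECTED_VALUES = 10  # width of a complete commodity row in the full report

def normalize_row(values):
    """Degrade to NaN instead of crashing: 'n.a.' placeholders (None) become
    NaN, truncated rows are right-padded, and a missing commodity
    (values=None) yields an all-NaN row so every date still gets a record."""
    row = [math.nan if v is None else v for v in (values or [])]
    row += [math.nan] * (EXPECTED_VALUES - len(row))
    return row[:EXPECTED_VALUES]
```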
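
Finally, the 1-page guard is a page-count check before parsing; a sketch assuming pdfplumber and a daily-mode entry point (ensure_full_report is a hypothetical name):

```python
import sys

import pdfplumber

def ensure_full_report(path: str) -> None:
    """Skip cleanly while only the temporary 1-page summary is published."""
    with pdfplumber.open(path) as pdf:
        if len(pdf.pages) < 2:
            # Exit 0 so GitHub Actions records the scheduled run as a success;
            # the next cron invocation retries once the 2-page report is up.
            print("1-page summary detected; deferring to the next run.")
            sys.exit(0)
```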

Technical details

Project links