Local PDF Parsing with AWS Textract & Python (Part 1)
# Introduction
Throughout my experience working with clients in domains like `healthcare`, `insurance`, and `legal`, I often found myself curious about how certain backend document workflows functioned, especially in healthcare. While supporting these systems, I'd often get paged for incidents related to PDF pipelines: upload failures, script errors, or extraction gaps. At that stage, like many in support roles, we're limited to handling outcomes rather than building or understanding the full solution.
Over time, as we gain experience, build trust, and earn people's confidence in our abilities, we gradually get the opportunity to join architecture discussions and solution design conversations. But that curiosity about how these pipelines actually work, from PDF upload to raw text extraction, always stayed with me.
So I decided to finally explore this from scratch, hands-on, and document it as a small weekend project. This repository reflects that journey: one that started with a question and ended with deeper insights, hands-on practice, and a working prototype. My hope is that others who share this curiosity will find it just as helpful.
## What This Project Is
This project focuses on extracting structured data from scanned or uploaded PDFs using `AWS Textract`, starting with a local Python-based flow.
It simulates real-world use cases commonly seen in the **healthcare**, **legal**, and **insurance** sectors, where physical documents like visit summaries or forms need to be digitized and stored in structured formats like databases.
The goal?
To break down what typically happens behind the scenes, from raw scanned input to clean, queryable output, using AWS-native services.
## Why Document Parsing Matters
In many industries, large volumes of information are still locked inside unstructured files, like PDFs or images.
For example:
* A **hospital** stores patient visit summaries scanned from handwritten or printed forms.
* An **insurance company** receives thousands of claim forms uploaded as PDFs every week.
* A **legal team** scans documents, contracts, and evidence that need to be searchable.
Without parsing, this data remains buried and unusable.
Document parsing, especially automated parsing, allows organizations to:
* Extract critical fields (like patient name, ID, diagnosis)
* Store them in structured systems (like `DynamoDB`)
* Enable downstream use (dashboards, alerts, summaries, etc.)
This project is a hands-on way to explore how that all comes together.
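As a tiny illustration of the "extract critical fields" step, here is a sketch that pulls `Label: value` pairs out of extracted text lines. The field names and label format are hypothetical; real forms vary widely and need sturdier parsing than a single regex:

```python
import re

# Labels we care about; hypothetical examples for a patient visit summary.
FIELD_PATTERN = re.compile(
    r"^(Patient Name|Patient ID|Diagnosis)\s*:\s*(.+)$", re.IGNORECASE
)


def parse_fields(lines):
    """Map 'Label: value' lines to a dict, ignoring everything else."""
    fields = {}
    for line in lines:
        match = FIELD_PATTERN.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields
```

A dict like this is exactly the shape you would later write to a structured store such as `DynamoDB`.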
## Local First: Why I Didn't Start with Automation
While itβs tempting to jump straight into Lambda functions and triggers, I deliberately started with a **local-first mindset**.
Why?
* It helps build intuition: you understand exactly what Textract returns and how the parsing logic works.
* It's easier to test and debug before handing things off to automation.
* You stay in control, tweaking and improving the logic before putting it behind an event trigger.
This mirrors how real-world teams prototype internally before scaling.
In my case:
* Took a sample patient visit summary in PDF format.
* Wrote a simple `Python` script to call `AWS Textract`.
* Parsed the returned lines into structured fields.
* Saved the extracted text as a `.txt` file inside `output-texts/`, then opened it to manually check whether Textract returned the expected content.
That local foundation made automation smoother and more predictable.
# Prerequisites
To follow along or replicate this project, ensure the following are in place:
* An `AWS` account (root access used only for visual verification in the console)
* A dedicated `IAM` user with the following permission policy attached:
* `AmazonTextractFullAccess`
* `AWS CLI` installed and configured with the IAM user credentials
* `Python` 3.9+ installed
* `virtualenv` installed
* `VS Code` (or any preferred IDE)
You should also:
* Create a virtual environment (`python -m venv venv`) and activate it.
* Install `boto3` (`pip install boto3`) and freeze dependencies into `requirements.txt`.
## Project Structure
```
pdf-to-text-extractor/
├── input-pdfs/              # Local test PDFs
├── output-texts/            # Extracted raw text output
├── scripts/                 # Python scripts
│   └── extract_textract.py
├── venv/                    # Virtual environment (ignored in git)
├── requirements.txt
├── .gitignore
└── README.md                # Project documentation
```
Why `venv` and `requirements.txt` matter:
* Using a `venv/` keeps dependencies isolated; it's a clean, repeatable habit in Python workflows.
* The `requirements.txt` file lists all the packages I used, so anyone can recreate the same environment instantly.
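For this project the file stays tiny. Something like the following would be enough (the version floor is illustrative; `pip freeze` will pin the exact versions you installed):

```
boto3>=1.28
```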
## What the Local Script Does
In this phase, I wrote a simple Python script to:
* Load a PDF from the `input-pdfs/` folder
* Send it to `Textract` for text extraction
* Save the output to `output-texts/` as a `.txt` file
This helped validate that Textract could read and extract meaningful content before jumping into parsing or automation.
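The steps above can be sketched roughly as follows. The function name and structure are my assumptions, not the repository's exact code, and note one caveat from the Textract docs: the synchronous `detect_document_text` call accepts raw bytes only for images and single-page PDFs/TIFFs; multi-page PDFs require the asynchronous S3-based API (`start_document_text_detection`):

```python
import os


def lines_from_blocks(blocks):
    """Keep only LINE blocks from a Textract response, in reading order."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]


def extract_text(pdf_path, output_dir="output-texts"):
    """Send a local single-page PDF to Textract and save the raw text."""
    import boto3  # imported here so the parsing helper stays usable offline

    client = boto3.client("textract")
    with open(pdf_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})

    text = "\n".join(lines_from_blocks(response["Blocks"]))
    os.makedirs(output_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(pdf_path))[0] + ".txt"
    out_path = os.path.join(output_dir, name)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return out_path
```

Keeping the Block-filtering logic in its own small function makes it easy to test locally with a canned response, without ever calling AWS.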
## Outcome and What's Next
By the end of this phase, I had a working local prototype that:
* Pulled a PDF from `input-pdfs/`
* Extracted raw text using `AWS Textract`
* Saved it to `output-texts/`
* Gave me a chance to test and fine-tune the logic manually
This local-first phase gave me the space to deeply understand what each piece does before scaling up.
## References
* Python `boto3` Documentation
* Amazon Textract Documentation
* Setting up a Python Virtual Environment
* What is `requirements.txt`
**Explore the Full Codebase**
All the files used in this local setup are available here:
GitHub Repo: pdf-to-text-extractor/local
### Coming Up in Part 2:
Weβll build on this by:
* Triggering Textract via **AWS Lambda**
* Parsing and storing results in **DynamoDB**
* Automating everything after the upload, using **AWS services** to handle extraction, parsing, and storage, just like a real-world backend system would
_Stay tuned for Part 2: "Building a Serverless PDF Ingestion Flow"_