Medical Image ETL Pipeline

Modular, config-driven ETL pipeline for medical imaging datasets. It reads raw data from Kaggle, transforms it into a standardized format, and writes a consistent output structure. New datasets are added through YAML config alone, with no changes to the core code.

GitHub: https://github.com/ritanaums/oop_final_project_public

Datasets

Dataset                 Task
Chest X-ray Lungs       Pathology classification with segmentation masks
Skin Cancer (HAM10000)  Skin lesion classification
RSNA Pneumonia          Pneumonia detection with bounding boxes

Architecture

Reader -> Transformer -> Writer chain, driven by YAML config:

reader:
  type: "pipeline.readers.chest_xray_reader.ChestXRayReader"
transformers:
  - type: "pipeline.transformers.chest_xray_transformer.ChestXRayTransformer"
  - type: "pipeline.transformers.label_field_mapper.LabelFieldMapper"
    config:
      field_mappings:
        sex: { M: "male", F: "female" }
writer:
  type: "pipeline.writers.standard_dataset_writer.StandardDatasetWriter"

Classes are resolved by dotted import path at runtime. The orchestrator has no direct imports of dataset-specific code.
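As an illustration of dotted-path resolution, here is a minimal sketch built on the standard library's importlib. The names resolve_class and build_plugin are hypothetical, not the project's own _resolve_class(), and passing the "config" mapping as keyword arguments is an assumption about how plugins are constructed. Standard-library classes stand in for the pipeline plugins so the example runs on its own.

```python
import importlib
from typing import Any


def resolve_class(dotted_path: str) -> type:
    """Import and return the class named by 'package.module.ClassName'."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    try:
        obj = getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(
            f"module {module_path!r} has no attribute {class_name!r}"
        ) from exc
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not name a class")
    return obj


def build_plugin(spec: dict[str, Any]) -> Any:
    """Instantiate a plugin from a config entry shaped like the YAML above.

    Assumption: the optional 'config' mapping is passed as keyword arguments.
    """
    cls = resolve_class(spec["type"])
    return cls(**(spec.get("config") or {}))


# Standard-library classes stand in for pipeline plugins here.
plugin = build_plugin(
    {"type": "fractions.Fraction", "config": {"numerator": 3, "denominator": 4}}
)
print(type(plugin).__name__, plugin)
```

Because the caller only ever handles a string and the resulting class, it needs no import statement for any dataset-specific module, which is the decoupling described above.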

OOP Design

Pattern               Implementation
Abstract Base Class   BaseReader, BaseTransformer, BaseWriter enforce the plugin interface
Strategy              MetadataGuidedSubsetStrategy / ImageListingSubsetStrategy for dataset download
Protocol              KaggleDatasetApi - structural typing for loose coupling and testability
Frozen dataclass      DatasetDefinition - immutable, validated dataset configuration
Pydantic models       RawSample, StandardizedSample - runtime-validated data contracts
Factory               _resolve_class() - instantiates plugins from dotted class path
Generator chaining    Transformers compose as lazy iterators for memory-efficient processing
Finalize hook         Cleanup runs in reverse order in a finally block, guaranteed on error
Dependency injection  DatasetImportService receives API and config provider at construction
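The generator chaining and finalize hook rows can be sketched together. The Stage class and run function below are hypothetical stand-ins, not the project's BaseTransformer or orchestrator, and the transform/finalize method names are assumptions. The sketch shows stages composed as lazy iterators, with cleanup running in reverse order inside a finally block.

```python
from collections.abc import Iterable, Iterator


class Stage:
    """Hypothetical transformer stand-in that processes samples lazily."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self.log = log

    def transform(self, samples: Iterable[dict]) -> Iterator[dict]:
        # Generator: one sample is pulled, tagged, and passed on at a time,
        # so the full dataset is never held in memory by this stage.
        for sample in samples:
            yield {**sample, "seen_by": sample.get("seen_by", []) + [self.name]}

    def finalize(self) -> None:
        # Cleanup hook; here it only records that it ran.
        self.log.append(f"finalize:{self.name}")


def run(source: Iterable[dict], stages: list[Stage]) -> list[dict]:
    """Chain the stages over the source, then finalize them in reverse order."""
    stream: Iterable[dict] = source
    started: list[Stage] = []
    try:
        for stage in stages:
            stream = stage.transform(stream)  # lazy: nothing is processed yet
            started.append(stage)
        return list(stream)  # consuming the chain pulls samples through it
    finally:
        # Runs on success and on error alike, last stage first.
        for stage in reversed(started):
            stage.finalize()


log: list[str] = []
result = run([{"id": 1}, {"id": 2}], [Stage("a", log), Stage("b", log)])
print(result)
print(log)
```

Collecting the output with list() is only for demonstration; a writer would normally consume the stream sample by sample to keep the memory benefit.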

Tech Stack

Python 3.11, Pydantic v2, Typer, PyYAML, Pillow, uv, pytest, Ruff