Medical Image ETL Pipeline
Modular, config-driven ETL pipeline for medical imaging datasets. Reads raw data from Kaggle, transforms it into a standardized format, and writes a consistent output structure. New datasets are added through YAML config alone; no core code changes are required.
GitHub: https://github.com/ritanaums/oop_final_project_public
Datasets
| Dataset | Task |
|---|---|
| Chest X-ray Lungs | Pathology classification with segmentation masks |
| Skin Cancer (HAM10000) | Skin lesion classification |
| RSNA Pneumonia | Pneumonia detection with bounding boxes |
Architecture
Reader -> Transformer -> Writer chain, driven by YAML config:

    reader:
      type: "pipeline.readers.chest_xray_reader.ChestXRayReader"
    transformers:
      - type: "pipeline.transformers.chest_xray_transformer.ChestXRayTransformer"
      - type: "pipeline.transformers.label_field_mapper.LabelFieldMapper"
        config:
          field_mappings:
            sex: { M: "male", F: "female" }
    writer:
      type: "pipeline.writers.standard_dataset_writer.StandardDatasetWriter"

Classes are resolved by dotted import path at runtime. The orchestrator has no direct imports of dataset-specific code.
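
Resolving a class from a dotted path can be sketched with the standard library's `importlib`. This is a minimal illustration, not the project's actual `_resolve_class()`: the names `resolve_class` and `build_plugin` are hypothetical, and passing the `config` mapping as keyword arguments is an assumption about how plugins are constructed.

```python
import importlib
from typing import Any


def resolve_class(dotted_path: str) -> type:
    """Import and return the class named by 'package.module.ClassName'."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    try:
        resolved = getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(
            f"Module {module_path!r} has no attribute {class_name!r}"
        ) from exc
    if not isinstance(resolved, type):
        raise TypeError(f"{dotted_path!r} does not name a class")
    return resolved


def build_plugin(spec: dict[str, Any]) -> Any:
    """Instantiate a plugin from a {'type': ..., 'config': ...} mapping.

    Assumption: the optional 'config' mapping is passed as keyword arguments.
    """
    cls = resolve_class(spec["type"])
    return cls(**spec.get("config", {}))


if __name__ == "__main__":
    # A standard-library class stands in for a pipeline plugin here.
    half = build_plugin(
        {"type": "fractions.Fraction", "config": {"numerator": 1, "denominator": 2}}
    )
    print(half)
```

Because the lookup happens by string, the calling code never needs an `import` statement for a specific reader, transformer, or writer; a typo in the YAML surfaces as an `ImportError` or `ValueError` when the plugin is built.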
OOP Design
| Pattern | Implementation |
|---|---|
| Abstract Base Class | BaseReader, BaseTransformer, BaseWriter enforce the plugin interface |
| Strategy | MetadataGuidedSubsetStrategy / ImageListingSubsetStrategy for dataset download |
| Protocol | KaggleDatasetApi - structural typing for loose coupling and testability |
| Frozen dataclass | DatasetDefinition - immutable, validated dataset configuration |
| Pydantic models | RawSample, StandardizedSample - runtime-validated data contracts |
| Factory | _resolve_class() - instantiates plugins from dotted class path |
| Generator chaining | Transformers compose as lazy iterators for memory-efficient processing |
| Finalize hook | Cleanup runs in reverse order in a finally block, guaranteed on error |
| Dependency injection | DatasetImportService receives API and config provider at construction |
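
The abstract base class, generator chaining, and finalize hook rows above can be illustrated together. The following is a rough sketch under stated assumptions: the method names `transform` and `finalize`, the dict-based samples, and the classes `FieldMapper`, `RequireField`, and `run_chain` are hypothetical and may not match the repository's real signatures.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator


class BaseTransformer(ABC):
    """Assumed plugin interface; the project's real base class may differ."""

    @abstractmethod
    def transform(self, samples: Iterable[dict]) -> Iterator[dict]:
        """Lazily yield transformed samples."""

    def finalize(self) -> None:
        """Optional cleanup hook; does nothing by default."""


class FieldMapper(BaseTransformer):
    """Hypothetical transformer that remaps label values per field."""

    def __init__(self, field_mappings: dict[str, dict[str, str]], log: list[str]):
        self.field_mappings = field_mappings
        self.log = log

    def transform(self, samples):
        for sample in samples:
            mapped = dict(sample)
            for field, mapping in self.field_mappings.items():
                if field in mapped:
                    mapped[field] = mapping.get(mapped[field], mapped[field])
            yield mapped

    def finalize(self):
        self.log.append("FieldMapper")


class RequireField(BaseTransformer):
    """Hypothetical transformer that fails on samples missing a field."""

    def __init__(self, field: str, log: list[str]):
        self.field = field
        self.log = log

    def transform(self, samples):
        for sample in samples:
            if self.field not in sample:
                raise ValueError(f"sample is missing {self.field!r}")
            yield sample

    def finalize(self):
        self.log.append("RequireField")


def run_chain(samples: Iterable[dict], transformers: list[BaseTransformer]) -> list[dict]:
    """Compose transformers lazily, then finalize them in reverse order."""
    stream: Iterator[dict] = iter(samples)
    try:
        for transformer in transformers:
            # Wrapping a generator does no work yet; samples flow on consumption.
            stream = transformer.transform(stream)
        # A real writer would consume the stream here instead of list().
        return list(stream)
    finally:
        # Runs on success and on error alike.
        for transformer in reversed(transformers):
            transformer.finalize()
```

Each sample passes through the whole chain before the next one is read, so memory use is bounded by a single sample rather than the full dataset. Finalizing in reverse order mirrors nested context managers: the last stage opened is the first one closed.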
Tech Stack
Python 3.11, Pydantic v2, Typer, PyYAML, Pillow, uv, pytest, Ruff