pyCTAKES
Python-native clinical NLP framework that mirrors and extends Apache cTAKES
What is pyCTAKES?
pyCTAKES is a clinical natural language processing (NLP) framework written entirely in Python. It provides end-to-end clinical text processing that mirrors and extends Apache cTAKES functionality while being easier to install, use, and extend.
Key Features
Complete Clinical NLP Pipeline
- Sentence Segmentation: Clinical-aware boundary detection with abbreviation handling
- Tokenization: Advanced tokenization with POS tagging and lemmatization
- Section Detection: Automatic identification of clinical sections (History, Medications, Assessment, etc.)
- Named Entity Recognition: Medical entity extraction (disorders, medications, procedures, anatomy)
- Negation & Assertion: pyConText-style negation and assertion detection
- Concept Mapping: UMLS integration with CUI normalization
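To make the negation step concrete, here is a minimal sketch of pyConText-style scope checking in plain Python. It is not pyCTAKES code: the trigger list, the terminator list, and the `is_negated` helper are all hypothetical, and real rule sets are much larger and also cover backward scope and other assertion types.

```python
import re

# Hypothetical, deliberately tiny rule lists for illustration only.
NEGATION_TRIGGERS = ("denies", "no", "without", "negative for")
# Words that close a negation scope, e.g. "but" in "denies fever but reports cough".
SCOPE_TERMINATORS = ("but", "however", "although")


def _word_pattern(terms):
    """Build a whole-word regex that matches any of the given terms."""
    return r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"


def is_negated(sentence: str, entity: str) -> bool:
    """Return True if `entity` lies in the forward scope of a negation trigger."""
    lowered = sentence.lower()
    entity_start = lowered.find(entity.lower())
    if entity_start == -1:
        return False  # entity not present in this sentence

    before = lowered[:entity_start]
    triggers = list(re.finditer(_word_pattern(NEGATION_TRIGGERS), before))
    if not triggers:
        return False

    # The scope runs from the last trigger up to the entity,
    # unless a terminator appears in between.
    between = before[triggers[-1].end():]
    return re.search(_word_pattern(SCOPE_TERMINATORS), between) is None


sentence = "He presents with chest pain but denies shortness of breath."
chest_pain_negated = is_negated(sentence, "chest pain")
dyspnea_negated = is_negated(sentence, "shortness of breath")
```

In this sketch, "shortness of breath" is flagged as negated because it follows "denies", while "chest pain" is not because no trigger precedes it.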
Multiple Pipeline Configurations
- Default Pipeline: Full clinical NLP with all features
- Fast Pipeline: Speed-optimized with rule-based components
- Basic Pipeline: Minimal set for simple use cases
- Custom Pipeline: Build your own with configurable annotators
Production Ready
- Pure Python implementation (no Java dependencies)
- Command-line interface and comprehensive Python API
- Extensive error handling and fallback mechanisms
- Comprehensive testing and documentation
Quick Start
Installation
pip install pyctakes
Basic Usage
import pyctakes
# Create a clinical NLP pipeline
pipeline = pyctakes.create_default_pipeline()
# Process clinical text
clinical_text = """
Patient is a 65-year-old male with diabetes and hypertension.
He presents with chest pain but denies shortness of breath.
Current medications include metformin and lisinopril.
"""
result = pipeline.process_text(clinical_text)
# Access different types of annotations
entities = result.document.get_annotations("NAMED_ENTITY")
for entity in entities:
    print(f"{entity.text} -> {entity.entity_type.value}")
Output:
diabetes -> disorder
hypertension -> disorder
chest pain -> disorder
shortness of breath -> sign_symptom
metformin -> medication
lisinopril -> medication
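Once entities are extracted, a common next step is to summarize them by type. The sketch below works on plain `(text, type)` pairs copied from the output above so that it runs without pyCTAKES installed; with a real result you would presumably build the same pairs from `entity.text` and `entity.entity_type.value`.

```python
from collections import Counter

# (text, entity type) pairs copied from the example output above.
entities = [
    ("diabetes", "disorder"),
    ("hypertension", "disorder"),
    ("chest pain", "disorder"),
    ("shortness of breath", "sign_symptom"),
    ("metformin", "medication"),
    ("lisinopril", "medication"),
]

# Count how many entities of each type were found.
type_counts = Counter(entity_type for _, entity_type in entities)

# Pull out a single type, for example to build a medication list.
medications = [text for text, entity_type in entities if entity_type == "medication"]

for entity_type, count in type_counts.most_common():
    print(f"{entity_type}: {count}")
```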
Command Line Usage
# Annotate a clinical note
pyctakes annotate clinical_note.txt --output annotations.json
# Use different pipeline types
pyctakes annotate clinical_note.txt --pipeline fast --format text
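If you need to drive the command-line interface from a script, for example to annotate a folder of notes, one option is to assemble the command in Python and hand it to `subprocess`. This is a sketch that uses only the `annotate` subcommand and the `--output`, `--pipeline`, and `--format` flags shown above; the helper names `build_annotate_command` and `run_annotate` are hypothetical and not part of pyCTAKES.

```python
import subprocess
from typing import Optional


def build_annotate_command(
    note_path: str,
    output_path: Optional[str] = None,
    pipeline: Optional[str] = None,
    output_format: Optional[str] = None,
) -> list[str]:
    """Assemble a `pyctakes annotate` argument list from the flags shown above."""
    command = ["pyctakes", "annotate", str(note_path)]
    if output_path is not None:
        command += ["--output", str(output_path)]
    if pipeline is not None:
        command += ["--pipeline", pipeline]
    if output_format is not None:
        command += ["--format", output_format]
    return command


def run_annotate(note_path: str, **options) -> None:
    """Run the CLI for one note; requires pyctakes to be installed and on PATH."""
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_annotate_command(note_path, **options), check=True)


fast_command = build_annotate_command(
    "clinical_note.txt", pipeline="fast", output_format="text"
)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.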
Architecture
pyCTAKES follows a modular, pipeline-based architecture:
graph LR
A[Clinical Text] --> B[Sentence Segmentation]
B --> C[Tokenization]
C --> D[Section Detection]
D --> E[Named Entity Recognition]
E --> F[Assertion Detection]
F --> G[Concept Mapping]
G --> H[Structured Output]
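The staged flow in the diagram can be illustrated with a small, generic sketch in which each stage is a function that reads a shared document and adds its own annotations. This is not the pyCTAKES implementation: the `Document` class, the `run_pipeline` helper, the `SENTENCE` and `TOKEN` keys, and both stage functions are hypothetical, and only the first two stages are shown.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Hypothetical container: raw text plus annotations keyed by type."""
    text: str
    annotations: dict[str, list] = field(default_factory=dict)


Stage = Callable[[Document], Document]


def segment_sentences(doc: Document) -> Document:
    # Naive split after sentence-final punctuation; no abbreviation handling.
    parts = re.split(r"(?<=[.!?])\s+", doc.text)
    doc.annotations["SENTENCE"] = [part.strip() for part in parts if part.strip()]
    return doc


def tokenize(doc: Document) -> Document:
    # Depends on the SENTENCE annotations written by the previous stage.
    doc.annotations["TOKEN"] = [
        re.findall(r"\w+|[^\w\s]", sentence)
        for sentence in doc.annotations["SENTENCE"]
    ]
    return doc


def run_pipeline(text: str, stages: list[Stage]) -> Document:
    """Pass one document through each stage in order."""
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    "Patient has diabetes. No known allergies.",
    [segment_sentences, tokenize],
)
```

Because every stage has the same signature, later steps such as section detection or entity recognition could be appended to the list without changing `run_pipeline`, which is the main appeal of a pipeline design like the one pictured.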
Performance
| Pipeline Type | Speed | Features | Use Case |
|---|---|---|---|
| Basic | ⚡⚡⚡ | Essential NLP | Development, Testing |
| Fast | ⚡⚡ | Rule-based | High-throughput Processing |
| Default | ⚡ | Complete Clinical NLP | Production, Research |
Use Cases
Clinical Research
- Electronic Health Record (EHR) Analysis: Extract structured data from clinical notes
- Phenotyping: Identify patient cohorts based on clinical criteria
- Clinical Trial Recruitment: Automated patient screening
Healthcare Analytics
- Quality Metrics: Extract quality indicators from clinical documentation
- Population Health: Analyze health trends across patient populations
- Clinical Decision Support: Real-time analysis of clinical text
NLP Research
- Baseline Framework: Comprehensive clinical NLP baseline
- Custom Development: Extensible platform for clinical NLP research
- Benchmark Comparisons: Standardized evaluation framework
Documentation
- Installation Guide - Detailed setup instructions
- Quick Start Tutorial - Get up and running in minutes
- User Guide - Comprehensive usage documentation
- API Reference - Complete API documentation
- Examples - Real-world usage examples
Community & Support
- GitHub: https://github.com/sonish777/pyctakes
- Documentation: https://sonish777.github.io/pyctakes
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Roadmap
- v1.0: Core clinical NLP pipeline
- v1.1: Relation extraction, Docker support
- v2.0: LLM integration, advanced features
- Future: Real-time processing, EHR integration
License
pyCTAKES is released under the Apache-2.0 License. See LICENSE for details.
Acknowledgments
pyCTAKES is inspired by Apache cTAKES and builds upon the excellent work of the clinical NLP community. Special thanks to the developers of spaCy, scispaCy, and other open-source libraries that make this project possible.
Ready to get started? Check out our Quick Start Guide or explore the Examples!
pyCTAKES is a pure-Python clinical NLP framework that mirrors and extends Apache cTAKES functionality and ships as an easy-to-install, easy-to-use package on PyPI.
Key Features
- Full UIMA-style Pipeline: Sentence segmentation, tokenization, POS tagging, chunking
- Named Entity Recognition: Disorders, medications, procedures, anatomy
- Concept Mapping: Automatic mapping to UMLS (SNOMED CT, RxNorm, LOINC)
- Assertion & Negation: Integrated pyConText-style rule engine
- Relation Extraction: Rule-based and transformer-based approaches
- Agentic LLM Layer: LangChain-powered intelligent processing
- Easy Installation:
pip install pyctakes
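As an illustration of what concept mapping means in the feature list above, the sketch below normalizes a mention and looks it up in a tiny hand-written lexicon. It does not use pyCTAKES or a UMLS installation: `CUI_LEXICON`, `normalize`, and `map_to_cui` are hypothetical names, and the two CUIs should be checked against your own licensed UMLS release.

```python
from typing import Optional

# Illustrative lexicon only; a real mapper would query a UMLS-derived index.
CUI_LEXICON = {
    "diabetes": "C0011849",             # Diabetes Mellitus
    "diabetes mellitus": "C0011849",
    "hypertension": "C0020538",         # Hypertensive disease
    "high blood pressure": "C0020538",
}


def normalize(mention: str) -> str:
    """Lower-case the mention and collapse runs of whitespace."""
    return " ".join(mention.lower().split())


def map_to_cui(mention: str) -> Optional[str]:
    """Return the CUI for a mention, or None if it is not in the lexicon."""
    return CUI_LEXICON.get(normalize(mention))


for mention in ("Diabetes", "High  Blood Pressure", "asthma"):
    print(f"{mention!r} -> {map_to_cui(mention)}")
```

Mapping several surface forms to one identifier is what lets downstream code treat "hypertension" and "high blood pressure" as the same concept.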
Quick Example
from pyctakes import Pipeline
# Initialize pipeline
pipeline = Pipeline()
# Process clinical text
text = "Patient has diabetes and hypertension. No known allergies."
result = pipeline.process_text(text)
# Access annotations
for annotation in result.document.annotations:
    print(f"{annotation.text} -> {annotation.annotation_type}")