pyCTAKES

Python-native clinical NLP framework that mirrors and extends Apache cTAKES

🏥 What is pyCTAKES?

pyCTAKES is a clinical Natural Language Processing (NLP) framework written entirely in Python. It provides end-to-end clinical text processing that matches and extends Apache cTAKES functionality while being easier to install, use, and extend.

✨ Key Features

🔬 Complete Clinical NLP Pipeline

Sentence Segmentation: Clinical-aware boundary detection with abbreviation handling
Tokenization: Advanced tokenization with POS tagging and lemmatization
Section Detection: Automatic identification of clinical sections (History, Medications, Assessment, etc.)
Named Entity Recognition: Medical entity extraction (disorders, medications, procedures, anatomy)
Negation & Assertion: pyConText-style negation and assertion detection
Concept Mapping: UMLS integration with CUI normalization
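
To make "clinical-aware boundary detection with abbreviation handling" concrete, here is a minimal sketch of the general idea. It is not pyCTAKES's implementation: the `split_sentences` function and the small `ABBREVIATIONS` set are hypothetical names chosen for illustration, and a real clinical segmenter would use a much larger lexicon and more rules.

```python
import re
from typing import List

# Hypothetical abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "pt", "vs"}


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace and an
    uppercase letter or digit, unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z0-9])", text):
        end = match.end()
        # Word immediately before the candidate boundary.
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # "Dr. Smith" is not a sentence boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


note = "Pt. seen by Dr. Smith for chest pain. Denies fever."
for sentence in split_sentences(note):
    print(sentence)
```

One known limitation of this sketch: an abbreviation that genuinely ends a sentence is not treated as a boundary.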

⚡ Multiple Pipeline Configurations

Default Pipeline: Full clinical NLP with all features
Fast Pipeline: Speed-optimized with rule-based components
Basic Pipeline: Minimal set for simple use cases
Custom Pipeline: Build your own with configurable annotators

🚀 Production Ready

Pure Python implementation (no Java dependencies)
Command-line interface and comprehensive Python API
Extensive error handling and fallback mechanisms
Comprehensive testing and documentation

📦 Quick Start

Installation

pip install pyctakes

Basic Usage

import pyctakes

# Create a clinical NLP pipeline
pipeline = pyctakes.create_default_pipeline()

# Process clinical text
clinical_text = """
Patient is a 65-year-old male with diabetes and hypertension.
He presents with chest pain but denies shortness of breath.
Current medications include metformin and lisinopril.
"""

result = pipeline.process_text(clinical_text)

# Access different types of annotations
entities = result.document.get_annotations("NAMED_ENTITY")
for entity in entities:
    print(f"{entity.text} -> {entity.entity_type.value}")

Output:

diabetes -> disorder
hypertension -> disorder
chest pain -> disorder
shortness of breath -> sign_symptom
metformin -> medication
lisinopril -> medication
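
A common next step is to group the extracted entities by type. The helper below is a small sketch that relies only on the two attributes used in the example above (`entity.text` and `entity.entity_type.value`); the name `group_entities_by_type` and the stand-in objects built with `SimpleNamespace` are assumptions for demonstration and are not part of pyCTAKES.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities_by_type(entities):
    """Map each entity type string to the list of entity texts of that type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.entity_type.value].append(entity.text)
    return dict(grouped)


def _fake_entity(text, type_value):
    # Stand-in for a pipeline entity; mimics only .text and .entity_type.value.
    return SimpleNamespace(text=text, entity_type=SimpleNamespace(value=type_value))


demo = [
    _fake_entity("diabetes", "disorder"),
    _fake_entity("metformin", "medication"),
    _fake_entity("hypertension", "disorder"),
]
print(group_entities_by_type(demo))
# {'disorder': ['diabetes', 'hypertension'], 'medication': ['metformin']}
```

With a real pipeline result, the list returned by `result.document.get_annotations("NAMED_ENTITY")` could be passed in place of `demo`, assuming those objects expose the same two attributes.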

Command Line Usage

# Annotate a clinical note
pyctakes annotate clinical_note.txt --output annotations.json

# Use different pipeline types
pyctakes annotate clinical_note.txt --pipeline fast --format text

🏗️ Architecture

pyCTAKES follows a modular, pipeline-based architecture:

graph LR
    A[Clinical Text] --> B[Sentence Segmentation]
    B --> C[Tokenization]
    C --> D[Section Detection]
    D --> E[Named Entity Recognition]
    E --> F[Assertion Detection]
    F --> G[Concept Mapping]
    G --> H[Structured Output]
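
The staged design in the diagram can be sketched as a chain of annotators that each read and extend a shared document. This is an illustrative sketch only, not the pyCTAKES API: `Annotation`, `Document`, `sentence_annotator`, `token_annotator`, and `run_pipeline` are hypothetical names, and only the first two stages are shown.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # e.g. "SENTENCE" or "TOKEN"
    start: int  # character offset into Document.text
    end: int
    text: str


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def get(self, kind: str) -> List[Annotation]:
        return [a for a in self.annotations if a.kind == kind]


Annotator = Callable[[Document], None]


def sentence_annotator(doc: Document) -> None:
    """Naive stage 1: a sentence runs up to the next '.', '!' or '?'."""
    for m in re.finditer(r"\S[^.!?]*[.!?]?", doc.text):
        doc.annotations.append(Annotation("SENTENCE", m.start(), m.end(), m.group()))


def token_annotator(doc: Document) -> None:
    """Stage 2: depends on stage 1 and tokenizes inside each sentence span."""
    for sent in doc.get("SENTENCE"):
        for m in re.finditer(r"\w+", sent.text):
            doc.annotations.append(
                Annotation("TOKEN", sent.start + m.start(), sent.start + m.end(), m.group())
            )


def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    """Apply each annotator in order; later stages see earlier annotations."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    "Patient has diabetes. No known allergies.",
    [sentence_annotator, token_annotator],
)
print(len(doc.get("SENTENCE")), len(doc.get("TOKEN")))
```

Because every stage only appends to the document, swapping or dropping a stage changes which annotations are available downstream without touching the other stages, which is the property that makes faster or more minimal pipeline variants possible.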

📊 Performance

Pipeline Type	Speed	Features	Use Case
Basic	⚡⚡⚡	Essential NLP	Development, Testing
Fast	⚡⚡	Rule-based	High-throughput Processing
Default	⚡	Complete Clinical NLP	Production, Research

🌟 Use Cases

Clinical Research

Electronic Health Record (EHR) Analysis: Extract structured data from clinical notes
Phenotyping: Identify patient cohorts based on clinical criteria
Clinical Trial Recruitment: Automated patient screening

Healthcare Analytics

Quality Metrics: Extract quality indicators from clinical documentation
Population Health: Analyze health trends across patient populations
Clinical Decision Support: Real-time analysis of clinical text

NLP Research

Baseline Framework: Comprehensive clinical NLP baseline
Custom Development: Extensible platform for clinical NLP research
Benchmark Comparisons: Standardized evaluation framework

📚 Documentation

Installation Guide - Detailed setup instructions
Quick Start Tutorial - Get up and running in minutes
User Guide - Comprehensive usage documentation
API Reference - Complete API documentation
Examples - Real-world usage examples

🤝 Community & Support

📂 GitHub: https://github.com/sonish777/pyctakes
📖 Documentation: https://sonish777.github.io/pyctakes
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

🛣️ Roadmap

v1.0 ✅ Core clinical NLP pipeline
v1.1 🔄 Relation extraction, Docker support
v2.0 📋 LLM integration, advanced features
Future 🔮 Real-time processing, EHR integration

📄 License

pyCTAKES is released under the Apache-2.0 License. See LICENSE for details.

🙏 Acknowledgments

pyCTAKES is inspired by Apache cTAKES and builds upon the excellent work of the clinical NLP community. Special thanks to the developers of spaCy, scispaCy, and other open-source libraries that make this project possible.

Ready to get started? Check out our Quick Start Guide or explore the Examples!

pyCTAKES is an entirely Python-based clinical NLP framework that mirrors and extends Apache cTAKES' rich functionality, while delivering an easy-to-install, easy-to-use package on PyPI.

Key Features

Full UIMA-style Pipeline: Sentence segmentation, tokenization, POS tagging, chunking
Named Entity Recognition: Disorders, medications, procedures, anatomy
Concept Mapping: Automatic mapping to UMLS (SNOMED CT, RxNorm, LOINC)
Assertion & Negation: Integrated pyConText-style rule engine
Relation Extraction: Rule-based and transformer-based approaches (listed on the Roadmap for v1.1)
Agentic LLM Layer: LangChain-powered intelligent processing (listed on the Roadmap for v2.0)
Easy Installation: pip install pyctakes
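
To illustrate what a pyConText-style rule engine does, the sketch below checks whether a negation trigger precedes an entity within a short token window, stopping at scope-ending conjunctions. It is a simplified assumption-laden example, not the pyCTAKES rule engine: `NEGATION_TRIGGERS`, `SCOPE_TERMINATORS`, `is_negated`, and the default window of five tokens are all hypothetical.

```python
import re

# Hypothetical trigger phrases; real rule sets are far larger.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
# Hypothetical conjunctions that end a negation scope.
SCOPE_TERMINATORS = ("but", "however", "although")


def is_negated(sentence: str, entity: str, window: int = 5) -> bool:
    """Return True if a trigger appears in the last `window` tokens before
    `entity`, with no scope terminator between the trigger and the entity."""
    lowered = sentence.lower()
    start = lowered.find(entity.lower())
    if start == -1:
        return False
    preceding = re.findall(r"[a-z]+", lowered[:start])
    # Keep only tokens after the last scope terminator, if any.
    for i in range(len(preceding) - 1, -1, -1):
        if preceding[i] in SCOPE_TERMINATORS:
            preceding = preceding[i + 1:]
            break
    context = " ".join(preceding[-window:])
    return any(
        re.search(rf"\b{re.escape(trigger)}\b", context)
        for trigger in NEGATION_TRIGGERS
    )


sentence = "He presents with chest pain but denies shortness of breath."
print(is_negated(sentence, "chest pain"))           # False
print(is_negated(sentence, "shortness of breath"))  # True
print(is_negated("No known allergies.", "allergies"))  # True
```

This sketch only handles triggers that precede the entity; pyConText-style engines also cover triggers that follow it and distinguish further assertion categories.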

Quick Example

from pyctakes import Pipeline

# Initialize pipeline
pipeline = Pipeline()

# Process clinical text
text = "Patient has diabetes and hypertension. No known allergies."
result = pipeline.process_text(text)

# Access annotations
for annotation in result.document.annotations:
    print(f"{annotation.text} -> {annotation.annotation_type}")
