pyCTAKES

Python-native clinical NLP framework that mirrors and extends Apache cTAKES

🏥 What is pyCTAKES?

pyCTAKES is a comprehensive, modern clinical Natural Language Processing framework built entirely in Python. It provides end-to-end clinical text processing capabilities that match and extend Apache cTAKES functionality while being easier to install, use, and extend.

✨ Key Features

🔬 Complete Clinical NLP Pipeline

  • Sentence Segmentation: Clinical-aware boundary detection with abbreviation handling
  • Tokenization: Advanced tokenization with POS tagging and lemmatization
  • Section Detection: Automatic identification of clinical sections (History, Medications, Assessment, etc.)
  • Named Entity Recognition: Medical entity extraction (disorders, medications, procedures, anatomy)
  • Negation & Assertion: pyConText-style negation and assertion detection
  • Concept Mapping: UMLS integration with CUI normalization
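
To make the negation and assertion step concrete, here is a minimal sketch of trigger-based negation detection in the NegEx/pyConText tradition. It is not pyCTAKES's implementation: the trigger lists, the `is_negated` helper, and the scope rule are simplified assumptions chosen for illustration.

```python
import re

# Illustrative trigger phrases; real pyConText/NegEx rule sets are much larger.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
# Words that end a negation scope (a simplified "termination" rule).
SCOPE_TERMINATORS = ["but", "however", "although"]


def is_negated(sentence: str, entity: str) -> bool:
    """Return True if a negation trigger precedes `entity` within its scope."""
    text = sentence.lower()
    entity_start = text.find(entity.lower())
    if entity_start == -1:
        return False  # entity not present in this sentence

    preceding = text[:entity_start]
    # Only text after the last scope terminator can negate the entity.
    cut = 0
    for terminator in SCOPE_TERMINATORS:
        for match in re.finditer(rf"\b{re.escape(terminator)}\b", preceding):
            cut = max(cut, match.end())
    scope = preceding[cut:]

    return any(
        re.search(rf"\b{re.escape(trigger)}\b", scope)
        for trigger in NEGATION_TRIGGERS
    )


sentence = "He presents with chest pain but denies shortness of breath."
print(is_negated(sentence, "chest pain"))           # False
print(is_negated(sentence, "shortness of breath"))  # True
```

The scope terminator is what keeps "chest pain" affirmed in a sentence that also contains "denies"; production rule sets add forward and backward scopes, pseudo-negations, and other assertion categories.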

⚡ Multiple Pipeline Configurations

  • Default Pipeline: Full clinical NLP with all features
  • Fast Pipeline: Speed-optimized with rule-based components
  • Basic Pipeline: Minimal set for simple use cases
  • Custom Pipeline: Build your own with configurable annotators

🚀 Production Ready

  • Pure Python implementation (no Java dependencies)
  • Command-line interface and comprehensive Python API
  • Extensive error handling and fallback mechanisms
  • Comprehensive testing and documentation

📦 Quick Start

Installation

pip install pyctakes

Basic Usage

import pyctakes

# Create a clinical NLP pipeline
pipeline = pyctakes.create_default_pipeline()

# Process clinical text
clinical_text = """
Patient is a 65-year-old male with diabetes and hypertension.
He presents with chest pain but denies shortness of breath.
Current medications include metformin and lisinopril.
"""

result = pipeline.process_text(clinical_text)

# Access different types of annotations
entities = result.document.get_annotations("NAMED_ENTITY")
for entity in entities:
    print(f"{entity.text} -> {entity.entity_type.value}")

Output:

diabetes -> disorder
hypertension -> disorder
chest pain -> disorder
shortness of breath -> sign_symptom
metformin -> medication
lisinopril -> medication
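
A common next step is to group the extracted entities by type. The sketch below relies only on the two attributes used in the example above (`entity.text` and `entity.entity_type.value`); the `group_by_type` helper and the stand-in objects are hypothetical and exist so the snippet runs without pyCTAKES installed.

```python
from collections import defaultdict
from enum import Enum
from types import SimpleNamespace


def group_by_type(entities):
    """Map each entity type (e.g. "disorder") to the entity texts of that type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.entity_type.value].append(entity.text)
    return dict(grouped)


class _Type(Enum):
    """Hypothetical stand-in for the library's entity type enum."""
    DISORDER = "disorder"
    MEDICATION = "medication"


# Stand-in annotations; with pyCTAKES, pass
# result.document.get_annotations("NAMED_ENTITY") instead.
sample = [
    SimpleNamespace(text="diabetes", entity_type=_Type.DISORDER),
    SimpleNamespace(text="hypertension", entity_type=_Type.DISORDER),
    SimpleNamespace(text="metformin", entity_type=_Type.MEDICATION),
]

for entity_type, texts in group_by_type(sample).items():
    print(f"{entity_type}: {', '.join(texts)}")
```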

Command Line Usage

# Annotate a clinical note
pyctakes annotate clinical_note.txt --output annotations.json

# Use different pipeline types
pyctakes annotate clinical_note.txt --pipeline fast --format text
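
To annotate a whole directory of notes, the CLI can be driven from a short Python script. This is a sketch under assumptions: it uses only the `annotate` subcommand and the `--pipeline` and `--output` flags shown above, assumes they can be combined in one call, and the `build_command` and `annotate_directory` helpers are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_command(note: Path, out_dir: Path, pipeline: str = "fast") -> list[str]:
    """Build the CLI call for one note, writing <note stem>.json into out_dir."""
    output = out_dir / f"{note.stem}.json"
    return [
        "pyctakes", "annotate", str(note),
        "--pipeline", pipeline,
        "--output", str(output),
    ]


def annotate_directory(notes_dir: Path, out_dir: Path, pipeline: str = "fast") -> None:
    """Run the CLI once per .txt file in notes_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for note in sorted(notes_dir.glob("*.txt")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(note, out_dir, pipeline), check=True)


if __name__ == "__main__":
    annotate_directory(Path("notes"), Path("annotations"))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.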

🏗️ Architecture

pyCTAKES follows a modular, pipeline-based architecture:

graph LR
    A[Clinical Text] --> B[Sentence Segmentation]
    B --> C[Tokenization]
    C --> D[Section Detection]
    D --> E[Named Entity Recognition]
    E --> F[Assertion Detection]
    F --> G[Concept Mapping]
    G --> H[Structured Output]
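
The staged design in the diagram can be sketched as a chain of annotators that each read and extend a shared document. The `Document` class, the annotator functions, and `run_pipeline` below are hypothetical and deliberately naive; they illustrate the pattern, not pyCTAKES's actual classes.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Shared state passed from one pipeline stage to the next."""
    text: str
    annotations: dict[str, list] = field(default_factory=dict)


Annotator = Callable[[Document], Document]


def segment_sentences(doc: Document) -> Document:
    # Naive split on sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.annotations["SENTENCE"] = [p for p in parts if p]
    return doc


def tokenize(doc: Document) -> Document:
    # Depends on the sentences produced by the previous stage.
    doc.annotations["TOKEN"] = [
        re.findall(r"\w+", sentence) for sentence in doc.annotations["SENTENCE"]
    ]
    return doc


def run_pipeline(text: str, annotators: list[Annotator]) -> Document:
    """Apply each annotator in order, threading the document through."""
    doc = Document(text)
    for annotate in annotators:
        doc = annotate(doc)
    return doc


doc = run_pipeline(
    "Patient has diabetes. No known allergies.",
    [segment_sentences, tokenize],
)
print(doc.annotations["SENTENCE"])
print(doc.annotations["TOKEN"])
```

Because each stage only depends on annotations written by earlier stages, stages can be swapped or dropped, which is the idea behind offering several pipeline configurations.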

📊 Performance

Pipeline Type | Speed | Features | Use Case
Basic | ⚡⚡⚡ | Essential NLP | Development, Testing
Fast | ⚡⚡ | Rule-based | High-throughput Processing
Default | ⚡ | Complete Clinical NLP | Production, Research

🌟 Use Cases

Clinical Research

  • Electronic Health Record (EHR) Analysis: Extract structured data from clinical notes
  • Phenotyping: Identify patient cohorts based on clinical criteria
  • Clinical Trial Recruitment: Automated patient screening

Healthcare Analytics

  • Quality Metrics: Extract quality indicators from clinical documentation
  • Population Health: Analyze health trends across patient populations
  • Clinical Decision Support: Real-time analysis of clinical text

NLP Research

  • Baseline Framework: Comprehensive clinical NLP baseline
  • Custom Development: Extensible platform for clinical NLP research
  • Benchmark Comparisons: Standardized evaluation framework

🛣️ Roadmap

  • v1.0 ✅ Core clinical NLP pipeline
  • v1.1 🔄 Relation extraction, Docker support
  • v2.0 📋 LLM integration, advanced features
  • Future 🔮 Real-time processing, EHR integration

📄 License

pyCTAKES is released under the Apache-2.0 License. See LICENSE for details.

🙏 Acknowledgments

pyCTAKES is inspired by Apache cTAKES and builds upon the excellent work of the clinical NLP community. Special thanks to the developers of spaCy, scispaCy, and other open-source libraries that make this project possible.


Ready to get started? Check out our Quick Start Guide or explore the Examples!

pyCTAKES is an entirely Python-based clinical NLP framework that mirrors and extends Apache cTAKES' rich functionality, while delivering an easy-to-install, easy-to-use package on PyPI.

Key Features

  • Full UIMA-style Pipeline: Sentence segmentation, tokenization, POS tagging, chunking
  • Named Entity Recognition: Disorders, medications, procedures, anatomy
  • Concept Mapping: Automatic mapping to UMLS (SNOMED CT, RxNorm, LOINC)
  • Assertion & Negation: Integrated pyConText-style rule engine
  • Relation Extraction: Rule-based and transformer-based approaches (listed on the roadmap for v1.1)
  • Agentic LLM Layer: LangChain-powered intelligent processing (LLM integration is listed on the roadmap for v2.0)
  • Easy Installation: pip install pyctakes
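
As an illustration of what concept mapping involves, the sketch below normalizes a mention and looks it up in a tiny in-memory lexicon. It is an assumption-laden toy, not the library's UMLS integration: `LEXICON` and `map_to_cui` are hypothetical, and the two CUIs are quoted from memory of the UMLS Metathesaurus and should be verified against a licensed release.

```python
from typing import Optional

# Toy lexicon: normalized surface form -> (CUI, preferred name).
# A real system would load these from UMLS rather than hard-code them.
LEXICON = {
    "diabetes": ("C0011849", "Diabetes Mellitus"),
    "diabetes mellitus": ("C0011849", "Diabetes Mellitus"),
    "hypertension": ("C0020538", "Hypertensive disease"),
    "high blood pressure": ("C0020538", "Hypertensive disease"),
}


def map_to_cui(mention: str) -> Optional[tuple[str, str]]:
    """Return (CUI, preferred name) for a mention, or None if it is unknown."""
    # Lowercase and collapse whitespace so "High  Blood Pressure" still matches.
    key = " ".join(mention.lower().split())
    return LEXICON.get(key)


for mention in ["Hypertension", "high  blood pressure", "metformin"]:
    print(mention, "->", map_to_cui(mention))
```

Mapping synonyms to one CUI is what lets downstream analysis treat "hypertension" and "high blood pressure" as the same concept.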

Quick Example

from pyctakes import Pipeline

# Initialize pipeline
pipeline = Pipeline()

# Process clinical text
text = "Patient has diabetes and hypertension. No known allergies."
result = pipeline.process_text(text)

# Access annotations
for annotation in result.document.annotations:
    print(f"{annotation.text} -> {annotation.annotation_type}")
