# Pipeline Overview
pyCTAKES uses a pipeline-based architecture where clinical text flows through a series of annotators, each adding specific types of annotations.
## Pipeline Architecture
```mermaid
graph TB
    A[Clinical Text] --> B[Sentence Segmentation]
    B --> C[Tokenization]
    C --> D[Section Detection]
    D --> E[Named Entity Recognition]
    E --> F[Assertion Detection]
    F --> G[Concept Mapping]
    G --> H[Structured Output]

    style A fill:#e1f5fe
    style H fill:#e8f5e8
    classDef stage fill:#fff3e0
    class B,C,D,E,F,G stage
```
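The same flow, end to end, as a minimal sketch; it uses only calls documented later on this page (`pyctakes.create_default_pipeline`, `process_text`, and the `result.document` accessors):

```python
import pyctakes

# Build the full clinical pipeline and run a note through every stage
pipeline = pyctakes.create_default_pipeline()
result = pipeline.process_text(
    "Patient denies chest pain. History of type 2 diabetes.",
    doc_id="note_001",
)

# Every stage above contributed annotations to the document
print(f"{len(result.document.annotations)} annotations")
```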
## Pipeline Types
pyCTAKES provides three pre-configured pipeline types to suit different needs:
### Default Pipeline
The complete clinical NLP pipeline with all features enabled:
**Components:**

- Clinical Sentence Segmentation (spaCy/Stanza backend)
- Clinical Tokenization with POS/lemma
- Section Detection
- Clinical NER (model + rule-based)
- Negation/Assertion Detection
- UMLS Concept Mapping

**Best for:** Production use, research, comprehensive analysis
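Creating it is a single call to the factory function used throughout the rest of this page:

```python
import pyctakes

# Full-featured clinical pipeline with every annotator enabled
pipeline = pyctakes.create_default_pipeline()
```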
### Fast Pipeline
Speed-optimized pipeline using rule-based components:
**Components:**

- Rule-based Sentence Segmentation
- Rule-based Tokenization (no POS/lemma)
- Simple Clinical NER
- Dictionary-based Concept Mapping

**Best for:** High-throughput processing, real-time applications
### Basic Pipeline
Minimal pipeline for simple use cases:
**Components:**

- Rule-based Sentence Segmentation
- Rule-based Tokenization
- Rule-based Clinical NER

**Best for:** Development, testing, simple applications
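Assuming the fast and basic variants follow the same factory naming as `create_default_pipeline` (an assumption worth verifying against the API reference), they would be created the same way:

```python
import pyctakes

# Hypothetical factory names, mirroring create_default_pipeline;
# verify against the pyctakes API reference
fast_pipeline = pyctakes.create_fast_pipeline()
basic_pipeline = pyctakes.create_basic_pipeline()
```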
## Custom Pipelines
Build your own pipeline with specific annotators:
```python
from pyctakes import Pipeline
from pyctakes.annotators import (
    ClinicalSentenceSegmenter,
    ClinicalTokenizer,
    ClinicalNERAnnotator,
    NegationAssertionAnnotator,
)

# Create an empty pipeline
pipeline = Pipeline()

# Add annotators in processing order
pipeline.add_annotator(ClinicalSentenceSegmenter())
pipeline.add_annotator(ClinicalTokenizer({
    "include_pos": True,
    "include_lemma": True
}))
pipeline.add_annotator(ClinicalNERAnnotator({
    "use_model": True,
    "use_rules": True
}))
pipeline.add_annotator(NegationAssertionAnnotator())
```
## Processing Flow

### 1. Input Processing
```python
# Process text with a document ID
result = pipeline.process_text(
    text="Patient has diabetes and hypertension.",
    doc_id="example_001"
)
```
### 2. Document Processing
The pipeline processes the text through each annotator:
```python
# Internally, each annotator adds its annotations to the document
for annotator in pipeline.annotators:
    annotations = annotator.process(document)
    document.add_annotations(annotations)
```
### 3. Output Generation
```python
# Access processing results
print(f"Processing time: {result.processing_time}")
print(f"Document ID: {result.document.doc_id}")
print(f"Annotations: {len(result.document.annotations)}")

# Handle any errors
if result.errors:
    for error in result.errors:
        print(f"Error: {error}")
```
## Pipeline Configuration

Configure pipeline behavior with a configuration dictionary, one key per component:
```python
import pyctakes

config = {
    "sentence_segmenter": {
        "backend": "spacy"  # or "stanza" or "rule"
    },
    "tokenizer": {
        "backend": "spacy",
        "include_pos": True,
        "include_lemma": True
    },
    "ner": {
        "use_model": True,
        "use_rules": True,
        "model_name": "en_core_sci_sm"
    },
    "assertion": {
        "max_scope": 10,
        "custom_cues": []
    }
}

pipeline = pyctakes.create_default_pipeline(config)
```
## Pipeline Management

### Adding Annotators

```python
from pyctakes.annotators import ClinicalSectionAnnotator

# Add an annotator to an existing pipeline
pipeline.add_annotator(ClinicalSectionAnnotator())
```
### Removing Annotators
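This page does not document the removal call; below is a minimal sketch, assuming the pipeline exposes a `remove_annotator(name)` method that mirrors `get_annotator(name)` (a hypothetical name, so check the API reference):

```python
# Hypothetical: assumes Pipeline.remove_annotator(name) exists,
# mirroring get_annotator(name); verify against the API reference
pipeline.remove_annotator("ClinicalSectionAnnotator")
print(pipeline.list_annotators())
```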
### Listing Annotators

```python
# Get the list of annotator names
annotator_names = pipeline.list_annotators()
print(f"Pipeline annotators: {annotator_names}")
```
### Getting Specific Annotators

```python
# Get an annotator by name
ner_annotator = pipeline.get_annotator("ClinicalNERAnnotator")
if ner_annotator:
    # Modify the annotator's configuration
    ner_annotator.set_config("use_model", False)
```
## Performance Considerations

### Backend Selection

Choose the backend that matches your speed, feature, and memory requirements:
| Backend | Speed | Features | Memory |
|---|---|---|---|
| Rule-based | ⚡⚡⚡ | Basic | Low |
| spaCy | ⚡⚡ | Advanced | Medium |
| Stanza | ⚡ | Advanced | High |
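For example, staying entirely rule-based avoids loading any statistical models; this reuses the configuration schema shown above:

```python
# Rule-based backends skip model loading: lowest memory, highest speed
config = {
    "sentence_segmenter": {"backend": "rule"},
    "tokenizer": {"backend": "rule"}
}
```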
### Pipeline Optimization

- Use the Fast Pipeline for high-throughput scenarios
- Disable unused features (e.g., POS tagging and lemmatization) in custom configurations
- Batch-process multiple documents with a single pipeline instance
- Cache models when pipelines are created repeatedly
```python
# Example: configuration optimized for speed
speed_config = {
    "tokenizer": {
        "backend": "rule",
        "include_pos": False,
        "include_lemma": False
    },
    "ner": {
        "use_model": False,
        "use_rules": True
    }
}

fast_pipeline = pyctakes.create_default_pipeline(speed_config)
```
## Error Handling
Pipelines include comprehensive error handling:
```python
try:
    result = pipeline.process_text(clinical_text)

    # Check for processing errors
    if result.errors:
        print(f"Processing completed with {len(result.errors)} errors")
        for error in result.errors:
            print(f"  - {error}")

    # Continue with the successful annotations
    successful_annotations = [
        ann for ann in result.document.annotations
        if ann.confidence > 0.5
    ]
except Exception as e:
    print(f"Pipeline failed completely: {e}")
```
## Batch Processing
Process multiple documents efficiently:
```python
from pathlib import Path

# Process every note in a directory
documents = []
for file_path in Path("clinical_notes").glob("*.txt"):
    text = file_path.read_text()
    result = pipeline.process_text(text, doc_id=file_path.stem)
    documents.append(result.document)

print(f"Processed {len(documents)} documents")
```
## Pipeline Serialization
Save and load pipeline configurations:
```python
import json

# Save a pipeline configuration
config = {
    "pipeline_type": "default",
    "annotators": [
        {"name": "ClinicalSentenceSegmenter", "config": {}},
        {"name": "ClinicalTokenizer", "config": {"include_pos": True}},
        # ... other annotators
    ]
}

with open("pipeline_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Load the configuration
with open("pipeline_config.json", "r") as f:
    loaded_config = json.load(f)

# Recreate the pipeline from the configuration
pipeline = pyctakes.create_default_pipeline(loaded_config)
```
## Next Steps

- **Annotators Guide**: Deep dive into individual annotators
- **Configuration**: Advanced configuration options
- **Command Line Interface**: Using pyCTAKES from the command line