Annotators

pyCTAKES provides a set of annotators for clinical natural language processing. Each annotator extracts one specific type of information from clinical text.

Available Annotators

TokenizationAnnotator

Handles sentence segmentation and tokenization of clinical text.

Features:
- Clinical sentence segmentation (respects medical abbreviations)
- Word tokenization optimized for clinical text
- Multiple backend support (spaCy, Stanza, rule-based)

Configuration (backend may be "spacy", "stanza", or "rule_based"):

{
  "tokenization": {
    "backend": "spacy",
    "model": "en_core_web_sm"
  }
}

Usage:

from pyctakes.annotators.tokenization import TokenizationAnnotator

# doc is a pyCTAKES Document that already holds the raw clinical text
annotator = TokenizationAnnotator(backend="spacy")
doc = annotator.process(doc)

# Access tokens and sentences
for sentence in doc.sentences:
    print(f"Sentence: {sentence.text}")
    for token in sentence.tokens:
        print(f"  Token: {token.text}")
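
To illustrate what "respects medical abbreviations" can mean in practice, here is a minimal rule-based sketch. It does not reproduce pyCTAKES internals: the abbreviation list and the split_sentences function are hypothetical names chosen for this example, and a real clinical segmenter would need a far larger abbreviation lexicon.

```python
import re

# Hypothetical abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "pt.", "p.o.", "q.d.", "b.i.d.", "vs."}


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? followed by whitespace, unless the token
    that ends at the candidate boundary is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


note = "Pt. denies chest pain. Takes aspirin 81 mg p.o. daily. Follow up with Dr. Smith."
for sentence in split_sentences(note):
    print(sentence)
```

A naive split on every period would break this note after "Pt.", "p.o." and "Dr."; the abbreviation check keeps it at three sentences.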

SectionAnnotator

Identifies and classifies clinical document sections.

Detected Sections:
- Chief Complaint
- History of Present Illness
- Past Medical History
- Medications
- Allergies
- Social History
- Family History
- Review of Systems
- Physical Examination
- Assessment and Plan
- Discharge Instructions

Configuration:

{
  "sections": {
    "custom_patterns": {
      "CUSTOM_SECTION": ["custom header:", "another pattern"]
    }
  }
}

Usage:

from pyctakes.annotators.sections import SectionAnnotator

annotator = SectionAnnotator()
doc = annotator.process(doc)

for section in doc.sections:
    print(f"Section: {section.section_type} ({section.start}-{section.end})")
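
The sketch below shows one way header patterns, including custom ones, could be turned into section spans. It is an illustration under assumptions, not the SectionAnnotator implementation: DEFAULT_PATTERNS, Section, and find_sections are hypothetical names, and the rule that a section runs until the next header is a simplification.

```python
import re
from dataclasses import dataclass

# Illustrative defaults; the annotator's real pattern set is not shown here.
DEFAULT_PATTERNS = {
    "CHIEF_COMPLAINT": ["chief complaint:"],
    "MEDICATIONS": ["medications:"],
    "ALLERGIES": ["allergies:"],
}


@dataclass
class Section:
    section_type: str
    start: int
    end: int


def find_sections(text, custom_patterns=None):
    """Find headers at the start of a line; each section ends where the next begins."""
    patterns = {**DEFAULT_PATTERNS, **(custom_patterns or {})}
    hits = []
    for section_type, headers in patterns.items():
        for header in headers:
            regex = r"^[ \t]*" + re.escape(header)
            for m in re.finditer(regex, text, flags=re.IGNORECASE | re.MULTILINE):
                hits.append((m.start(), section_type))
    hits.sort()
    sections = []
    for i, (start, section_type) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections.append(Section(section_type, start, end))
    return sections


note = "Chief Complaint: chest pain\nMedications: aspirin\nCustom Header: follow-up\n"
custom = {"CUSTOM_SECTION": ["custom header:"]}
for section in find_sections(note, custom):
    print(section.section_type, section.start, section.end)
```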

NERAnnotator

Performs Named Entity Recognition for clinical entities.

Entity Types:
- MEDICATION
- DOSAGE
- FREQUENCY
- CONDITION
- SYMPTOM
- ANATOMY
- PROCEDURE
- TEST_RESULT

Approaches:
- Rule-based: Pattern matching with clinical dictionaries
- Model-based: Using pre-trained clinical NLP models

Configuration (approach may be "rule_based" or "model_based"):

{
  "ner": {
    "approach": "rule_based",
    "model_name": "clinical_ner_model",
    "custom_patterns": {
      "MEDICATION": ["aspirin", "ibuprofen", "metformin"]
    }
  }
}

Usage:

from pyctakes.annotators.ner import NERAnnotator

annotator = NERAnnotator(approach="rule_based")
doc = annotator.process(doc)

for entity in doc.entities:
    print(f"Entity: {entity.text} ({entity.label}) at {entity.start}-{entity.end}")
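
As a rough picture of the rule-based approach, the following sketch matches dictionary terms against text and records character offsets. It assumes case-insensitive whole-word matching; Entity and match_entities are hypothetical names for this example and are not the NERAnnotator API.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    label: str
    start: int
    end: int


def match_entities(text, patterns):
    """Return dictionary matches as entities, ordered by position in the text."""
    entities = []
    for label, terms in patterns.items():
        if not terms:
            continue  # an empty alternation would match at every word boundary
        # Longest terms first so multi-word terms win over their prefixes.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        for m in re.finditer(rf"\b(?:{alternation})\b", text, flags=re.IGNORECASE):
            entities.append(Entity(m.group(0), label, m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


patterns = {
    "MEDICATION": ["aspirin", "ibuprofen", "metformin"],
    "CONDITION": ["diabetes"],
}
for entity in match_entities("Patient takes Aspirin and metformin for diabetes.", patterns):
    print(entity)
```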

AssertionAnnotator

Detects negation, uncertainty, and other assertion attributes.

Assertion Types:
- Polarity: POSITIVE, NEGATIVE
- Uncertainty: CERTAIN, UNCERTAIN
- Temporality: PRESENT, PAST, FUTURE
- Experiencer: PATIENT, FAMILY, OTHER

Algorithm: Based on pyConText with clinical-specific rules.

Configuration:

{
  "assertion": {
    "window_size": 10,
    "custom_negation_terms": ["denies", "negative for"],
    "custom_uncertainty_terms": ["possible", "maybe"]
  }
}

Usage:

from pyctakes.annotators.assertion import AssertionAnnotator

# Run NERAnnotator first: assertions are attached to the entities in doc.entities
annotator = AssertionAnnotator()
doc = annotator.process(doc)

for entity in doc.entities:
    assertion = entity.assertion
    print(f"{entity.text}: polarity={assertion.polarity}, "
          f"uncertainty={assertion.uncertainty}")
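
The window_size setting suggests that trigger terms are searched within a fixed number of tokens around an entity. The sketch below is a much-simplified illustration of that idea, not pyConText and not the AssertionAnnotator: it only looks at tokens before the entity within the same sentence, and classify and contains_term are hypothetical names.

```python
import re

NEGATION_TERMS = ["denies", "negative for"]
UNCERTAINTY_TERMS = ["possible", "maybe"]


def contains_term(window_text, terms):
    """True if any trigger term occurs as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(term)}\b", window_text) for term in terms)


def classify(text, entity_start, window_size=10):
    """Classify the entity that begins at character offset entity_start."""
    # Restrict the search to the current sentence, then to the last
    # window_size tokens before the entity.
    preceding = re.split(r"[.!?]", text[:entity_start])[-1]
    tokens = preceding.lower().split()
    window = tokens[-window_size:] if window_size > 0 else []
    window_text = " ".join(window)
    negated = contains_term(window_text, NEGATION_TERMS)
    uncertain = contains_term(window_text, UNCERTAINTY_TERMS)
    return {
        "polarity": "NEGATIVE" if negated else "POSITIVE",
        "uncertainty": "UNCERTAIN" if uncertain else "CERTAIN",
    }


note = "Patient denies chest pain. Possible pneumonia on imaging. Reports cough."
for mention in ("chest pain", "pneumonia", "cough"):
    print(mention, classify(note, note.index(mention)))
```

Full ConText-style algorithms also handle triggers that follow the entity and terms that end a trigger's scope, which this sketch leaves out.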

UMLSAnnotator

Maps clinical concepts to UMLS (Unified Medical Language System) codes.

Features:
- Concept normalization
- CUI (Concept Unique Identifier) assignment
- Semantic type mapping
- Approximate string matching

Configuration (T047 is the semantic type Disease or Syndrome, T184 is Sign or Symptom):

{
  "umls": {
    "similarity_threshold": 0.8,
    "max_candidates": 5,
    "semantic_types": ["T047", "T184"]
  }
}

Usage:

from pyctakes.annotators.umls import UMLSAnnotator

# Run NERAnnotator first so that doc.entities is populated
annotator = UMLSAnnotator()
doc = annotator.process(doc)

for entity in doc.entities:
    if entity.umls_concept:
        print(f"{entity.text} -> {entity.umls_concept.cui} ({entity.umls_concept.preferred_term})")
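
To show how similarity_threshold and max_candidates might interact during approximate string matching, here is a small sketch built on the standard library's difflib. The similarity measure pyCTAKES actually uses is not stated in this document, so difflib is a stand-in. LEXICON and candidate_concepts are hypothetical, and the identifiers in the lexicon are placeholders, not real CUIs.

```python
from difflib import SequenceMatcher

# Placeholder lexicon: the values are NOT real UMLS CUIs.
LEXICON = {
    "hypertension": "CUI_HYPERTENSION",
    "diabetes mellitus": "CUI_DIABETES",
    "myocardial infarction": "CUI_MI",
}


def candidate_concepts(mention, similarity_threshold=0.8, max_candidates=5):
    """Return up to max_candidates (score, term, cui) tuples, best match first."""
    scored = []
    for term, cui in LEXICON.items():
        score = SequenceMatcher(None, mention.lower(), term).ratio()
        if score >= similarity_threshold:
            scored.append((score, term, cui))
    scored.sort(reverse=True)
    return scored[:max_candidates]


# A misspelled mention still maps to the intended term.
print(candidate_concepts("hypertention"))
```

Raising the threshold trades recall for precision: fewer misspellings are recovered, but fewer unrelated terms slip through.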

Annotation Lifecycle

Text Input: Raw clinical text
Tokenization: Sentence segmentation and tokenization
Section Detection: Identify document structure
Named Entity Recognition: Extract clinical entities
Assertion Detection: Determine entity attributes
Concept Mapping: Map to standard vocabularies
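
The lifecycle above amounts to passing one document through a list of annotators in a fixed order. The following sketch uses stand-in classes rather than real pyCTAKES annotators so that it runs on its own. Doc, Stage, run_pipeline, and the requires declarations are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    history: list = field(default_factory=list)  # stages applied so far


class Stage:
    """Stand-in annotator: records its name instead of doing real NLP."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = tuple(requires)

    def process(self, doc):
        missing = [r for r in self.requires if r not in doc.history]
        if missing:
            raise ValueError(f"{self.name} needs {missing} to run first")
        doc.history.append(self.name)
        return doc


def run_pipeline(doc, annotators):
    """Apply each annotator in order, passing the document along."""
    for annotator in annotators:
        doc = annotator.process(doc)
    return doc


PIPELINE = [
    Stage("tokenization"),
    Stage("sections", requires=["tokenization"]),
    Stage("ner", requires=["tokenization"]),
    Stage("assertion", requires=["ner"]),
    Stage("umls", requires=["ner"]),
]

result = run_pipeline(Doc("Patient denies chest pain."), PIPELINE)
print(result.history)
```

Declaring prerequisites explicitly makes an out-of-order pipeline fail immediately instead of silently producing empty annotations.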

Custom Annotators

You can create custom annotators by extending the base annotator class:

from pyctakes.annotators.base import BaseAnnotator
from pyctakes.types import Document, Annotation

class CustomAnnotator(BaseAnnotator):
    def __init__(self, **kwargs):
        super().__init__()
        # Initialize your annotator

    def process(self, doc: Document) -> Document:
        # Your processing logic goes here; the values below are placeholders
        annotation = Annotation(
            start=0,
            end=7,  # end - start matches len("example")
            text="example",
            label="CUSTOM"
        )
        doc.annotations.append(annotation)
        return doc

Performance Considerations

Tokenization: spaCy backend is fastest, Stanza most accurate
NER: Rule-based is faster, model-based more accurate
UMLS: Enable caching for repeated concept lookups
Pipeline Order: Tokenization → Sections → NER → Assertion → UMLS
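
How caching is enabled in pyCTAKES is not described here. As a general illustration of why it helps, the sketch below wraps a slow lookup with functools.lru_cache so that repeated mentions of the same term trigger only one real lookup. slow_lookup, cached_lookup, and normalize are hypothetical stand-ins.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive lookup really runs


def slow_lookup(term):
    """Stand-in for an expensive concept lookup (database, index, or API)."""
    CALLS["count"] += 1
    return f"concept-for-{term}"


def normalize(term):
    """Lowercase and collapse whitespace so equivalent mentions share a cache key."""
    return " ".join(term.lower().split())


@lru_cache(maxsize=10_000)
def cached_lookup(term):
    return slow_lookup(term)


for mention in ["Aspirin", "aspirin ", "metformin", "ASPIRIN"]:
    cached_lookup(normalize(mention))

print(CALLS["count"], cached_lookup.cache_info())
```

Normalizing before the cache lookup matters: without it, "Aspirin" and "ASPIRIN" would be cached as different keys.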

Error Handling

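Annotators signal processing failures by raising AnnotationError. Wrap each process call so that one failed document does not stop the run:

# AnnotationError must be imported from pyCTAKES; its module path is not shown here
try:
    doc = annotator.process(doc)
except AnnotationError as e:
    print(f"Annotation failed: {e}")
    # Handle gracefully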
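
When processing many documents, one option is to collect failures and continue rather than abort the whole batch. The sketch below defines its own stand-in AnnotationError and a FlakyAnnotator so it can run without pyCTAKES. Both are hypothetical, as is process_batch.

```python
class AnnotationError(Exception):
    """Stand-in for the library's exception, defined here so the sketch runs alone."""


class FlakyAnnotator:
    """Stand-in annotator that fails on empty documents."""

    def process(self, doc):
        if not doc.strip():
            raise AnnotationError("empty document")
        return doc.upper()


def process_batch(annotator, docs):
    """Process every document; return successes and (index, message) failures."""
    results, failures = [], []
    for index, doc in enumerate(docs):
        try:
            results.append(annotator.process(doc))
        except AnnotationError as exc:
            failures.append((index, str(exc)))
    return results, failures


results, failures = process_batch(FlakyAnnotator(), ["note one", "   ", "note two"])
print(results)
print(failures)
```

Catching only AnnotationError, rather than a bare Exception, lets unexpected bugs surface instead of being recorded as routine failures.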

Best Practices

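Configure appropriately: Tune parameters such as window_size and similarity_threshold for your use case
Process in order: Follow the recommended pipeline sequence (Tokenization → Sections → NER → Assertion → UMLS)
Validate outputs: Check annotation quality on sample data
Monitor performance: Profile for bottlenecks in production
Handle errors: Catch AnnotationError around process calls and decide how failed documents are handled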
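
For the "validate outputs" step, one common check is to compare predicted spans against a small hand-labeled sample. The sketch below computes exact-match precision, recall, and F1 over (start, end, label) tuples. span_scores is a hypothetical helper, and the spans are made-up sample data.

```python
def span_scores(predicted, gold):
    """Exact-match precision, recall, and F1 over (start, end, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up sample: two of three predictions match the gold annotations.
gold = [(0, 7, "MEDICATION"), (12, 21, "MEDICATION"), (26, 34, "CONDITION")]
predicted = [(0, 7, "MEDICATION"), (26, 34, "CONDITION"), (40, 45, "SYMPTOM")]

precision, recall, f1 = span_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a span that is off by one character counts as both a false positive and a false negative, so relaxed overlap matching may suit some evaluations better.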