Annotators API Reference

pyCTAKES provides a comprehensive set of annotators for clinical natural language processing.

Base Annotator

`class BaseAnnotator`

Abstract base class for all pyCTAKES annotators.

Methods:
- `__init__()`: Initialize the annotator
- `process(doc)`: Process a document and return the annotated document
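
The contract above can be sketched with Python's `abc` module. This is a minimal stand-in, not the pyCTAKES source: the `Document` dataclass and `ShoutingAnnotator` are hypothetical, and only the `__init__()` / `process(doc)` shape comes from this reference.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the pyCTAKES document type."""
    text: str
    annotations: list = field(default_factory=list)


class BaseAnnotator(ABC):
    """Stand-in base class: subclasses must implement process()."""

    def __init__(self):
        self.name = type(self).__name__

    @abstractmethod
    def process(self, doc: Document) -> Document:
        """Annotate the document and return it."""


class ShoutingAnnotator(BaseAnnotator):
    """Toy subclass that records every all-uppercase word it finds."""

    def process(self, doc: Document) -> Document:
        doc.annotations.extend(w for w in doc.text.split() if w.isupper())
        return doc


doc = ShoutingAnnotator().process(Document("Patient denies SOB and CP today"))
print(doc.annotations)  # ['SOB', 'CP']
```

Because `process()` is abstract in this sketch, instantiating `BaseAnnotator` directly raises `TypeError`, which catches subclasses that forget to implement it.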

Tokenization

`class TokenizationAnnotator(BaseAnnotator)`

Handles sentence segmentation and tokenization of clinical text.

Parameters:
- `backend` (str): Backend to use (`"spacy"`, `"stanza"`, or `"rule_based"`)
- `model` (str): Model name for the spaCy and Stanza backends

Methods:
- `process(doc)`: Add sentences and tokens to the document

`class ClinicalSentenceSegmenter`

Clinical-aware sentence segmentation.

Features:
- Respects medical abbreviations
- Handles clinical note formatting
- Configurable sentence boundary detection
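
To illustrate what "respects medical abbreviations" can mean in practice, here is a minimal sketch of abbreviation-aware splitting. It is not the `ClinicalSentenceSegmenter` implementation; the `ABBREVIATIONS` set and the `split_sentences` helper are assumptions made for this example.

```python
import re

# Illustrative abbreviation list; a real segmenter would need a much larger one.
ABBREVIATIONS = {"dr.", "pt.", "q.d.", "b.i.d.", "p.o.", "vs."}


def split_sentences(text: str) -> list[str]:
    """Split after . ! or ? unless the token ending there is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Text from the current sentence start up to and including the punctuation.
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # not a real sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


note = "Pt. seen by Dr. Smith. Takes aspirin 81 mg p.o. daily. No chest pain."
for sentence in split_sentences(note):
    print(sentence)
```

A naive split on every period would break this note after "Pt.", "Dr." and "p.o."; the abbreviation check keeps it at three sentences.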

`class ClinicalTokenizer`

Advanced tokenization for clinical text.

Features:
- POS tagging and lemmatization
- Clinical pattern recognition
- Multiple backend support
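
As a rough idea of what clinical pattern recognition involves at the token level, the sketch below keeps blood-pressure ratios and dotted abbreviations together as single tokens. The regular expression and the `tokenize` helper are assumptions for illustration, not the `ClinicalTokenizer` rules, and POS tagging and lemmatization are out of scope here.

```python
import re

# Alternatives are tried in order; the clinical patterns are illustrative assumptions.
TOKEN_RE = re.compile(
    r"""
      \d+/\d+                          # ratios such as blood pressure 120/80
    | \d+(?:\.\d+)?                    # integers and decimals
    | [A-Za-z]+(?:\.[A-Za-z]+)+\.?     # dotted abbreviations such as p.o. or b.i.d.
    | [A-Za-z]+                        # plain words
    | [^\sA-Za-z\d]                    # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Return tokens, keeping patterns like 120/80 and p.o. intact."""
    return TOKEN_RE.findall(text)


print(tokenize("BP 120/80, aspirin 81mg p.o. daily."))
# ['BP', '120/80', ',', 'aspirin', '81', 'mg', 'p.o.', 'daily', '.']
```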

Section Detection

`class SectionAnnotator(BaseAnnotator)`

Base class for section detection.

`class ClinicalSectionAnnotator(SectionAnnotator)`

Identifies clinical document sections.

Detected Sections:
- Chief Complaint
- History of Present Illness
- Past Medical History
- Medications
- Physical Examination
- Assessment and Plan
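
One common way to detect sections like these is to match header lines and let each section run until the next header. The sketch below assumes that approach; the header spellings, the uppercase section labels, and the `find_sections` helper are hypothetical and may not match what `ClinicalSectionAnnotator` does internally.

```python
import re

# Header text -> section label. Both sides are illustrative assumptions.
SECTION_HEADERS = {
    "chief complaint": "CHIEF_COMPLAINT",
    "history of present illness": "HISTORY_OF_PRESENT_ILLNESS",
    "past medical history": "PAST_MEDICAL_HISTORY",
    "medications": "MEDICATIONS",
    "physical examination": "PHYSICAL_EXAMINATION",
    "assessment and plan": "ASSESSMENT_AND_PLAN",
}

# A header is one of the known names at the start of a line, followed by a colon.
HEADER_RE = re.compile(
    r"^(" + "|".join(re.escape(h) for h in SECTION_HEADERS) + r")\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def find_sections(text: str) -> list[tuple[str, int, int]]:
    """Return (section_type, start, end); each section ends where the next begins."""
    matches = list(HEADER_RE.finditer(text))
    spans = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        spans.append((SECTION_HEADERS[m.group(1).lower()], m.start(), end))
    return spans


note = "Chief Complaint: chest pain\nMedications: aspirin\nAssessment and Plan: admit"
for section_type, start, end in find_sections(note):
    print(f"{section_type}: {note[start:end].strip()!r}")
```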

Named Entity Recognition

`class NERAnnotator(BaseAnnotator)`

Base class for named entity recognition.

`class ClinicalNERAnnotator(NERAnnotator)`

Clinical entity recognition that supports both rule-based and model-based approaches.

Parameters:
- `approach` (str): `"rule_based"` or `"model_based"`
- `model_name` (str): Model to load when `approach` is `"model_based"`
- `custom_patterns` (dict): Custom entity patterns, keyed by entity type

Entity Types:
- MEDICATION
- CONDITION
- SYMPTOM
- ANATOMY
- PROCEDURE
- TEST_RESULT

`class SimpleClinicalNER(NERAnnotator)`

Fast pattern-based entity recognition.

Features:
- High-speed processing
- Pattern matching
- Optimized for speed over accuracy
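
Pattern-based recognition of this kind can be sketched as a case-insensitive, whole-word term match. The example reuses the label-to-terms shape shown for `custom_patterns` in this reference; the `match_entities` helper and its dictionary output are assumptions, not the `SimpleClinicalNER` implementation.

```python
import re

# Same shape as the custom_patterns examples in this reference: label -> terms.
PATTERNS = {
    "MEDICATION": ["aspirin", "ibuprofen"],
    "CONDITION": ["diabetes", "hypertension"],
}


def match_entities(text: str, patterns: dict[str, list[str]]) -> list[dict]:
    """Find whole-word, case-insensitive term matches and return them in text order."""
    entities = []
    for label, terms in patterns.items():
        # Longest terms first so a multi-word term wins over its own prefix.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        for m in re.finditer(rf"\b(?:{alternation})\b", text, re.IGNORECASE):
            entities.append(
                {"text": m.group(0), "label": label, "start": m.start(), "end": m.end()}
            )
    return sorted(entities, key=lambda e: e["start"])


for ent in match_entities("History of diabetes. Takes Aspirin daily.", PATTERNS):
    print(ent)
```

The trade-off named above shows up directly: this runs in a single regex pass per label, but it has no notion of context, misspellings, or ambiguity.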

Assertion Detection

`class AssertionAnnotator(BaseAnnotator)`

Base class for assertion detection.

`class NegationAssertionAnnotator(AssertionAnnotator)`

pyConText-style assertion and negation detection.

Parameters:
- `window_size` (int): Context window size
- `custom_negation_terms` (list): Additional negation terms
- `custom_uncertainty_terms` (list): Additional uncertainty terms

Assertion Types:
- Polarity: POSITIVE, NEGATIVE
- Uncertainty: CERTAIN, UNCERTAIN
- Temporality: PRESENT, PAST, FUTURE
- Experiencer: PATIENT, FAMILY, OTHER
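
The role of a context window in polarity detection can be shown with a small sketch. It assumes the window is counted in tokens before the entity, which this reference does not specify, and it omits the scope terminators (such as "but") that pyConText-style systems use. The `NEGATION_TERMS` set and the `polarity` helper are hypothetical.

```python
# Illustrative trigger terms, not the library's built-in list.
NEGATION_TERMS = {"no", "denies", "without", "negative"}


def polarity(tokens: list[str], entity_index: int, window_size: int = 5) -> str:
    """Return NEGATIVE if a negation term occurs within window_size tokens before the entity."""
    start = max(0, entity_index - window_size)
    preceding = [t.lower().strip(".,;:") for t in tokens[start:entity_index]]
    return "NEGATIVE" if any(t in NEGATION_TERMS for t in preceding) else "POSITIVE"


tokens = "Patient denies chest pain but reports severe persistent daily nausea".split()
print(polarity(tokens, tokens.index("pain")))                     # NEGATIVE
print(polarity(tokens, tokens.index("nausea"), window_size=5))    # POSITIVE
print(polarity(tokens, tokens.index("nausea"), window_size=10))   # NEGATIVE
```

The last call shows why window size matters for accuracy as well as speed: with a window of 10, "denies" wrongly reaches "nausea".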

UMLS Concept Mapping

`class UMLSAnnotator(BaseAnnotator)`

Base class for UMLS concept mapping.

`class UMLSConceptMapper(UMLSAnnotator)`

Maps entities to UMLS concepts.

Parameters:
- `umls_path` (str): Path to UMLS data
- `similarity_threshold` (float): Minimum similarity score
- `max_candidates` (int): Maximum candidate concepts

`class SimpleDictionaryMapper(UMLSAnnotator)`

Fast dictionary-based concept mapping.

Parameters:
- `dictionary_path` (str): Path to concept dictionary
- `similarity_threshold` (float): Minimum similarity score
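
How a similarity threshold and a candidate limit interact can be sketched with the standard library's `difflib.SequenceMatcher`. This is an assumption-laden illustration: the mappers may use a different similarity measure, the `CONCEPTS` dictionary is a toy, and its identifiers are placeholders rather than real UMLS CUIs.

```python
from difflib import SequenceMatcher

# Toy dictionary: term -> placeholder identifier (not real UMLS CUIs).
CONCEPTS = {
    "hypertension": "CUI_0001",
    "hypotension": "CUI_0002",
    "diabetes mellitus": "CUI_0003",
}


def map_concept(mention, similarity_threshold=0.8, max_candidates=3):
    """Return up to max_candidates (term, id, score) tuples at or above the threshold."""
    scored = []
    for term, cui in CONCEPTS.items():
        score = SequenceMatcher(None, mention.lower(), term).ratio()
        if score >= similarity_threshold:
            scored.append((term, cui, round(score, 3)))
    scored.sort(key=lambda c: c[2], reverse=True)  # best match first
    return scored[:max_candidates]


print(map_concept("hypertention"))                             # misspelling still maps
print(map_concept("hypertention", similarity_threshold=0.95))  # stricter: no candidates
```

A lower threshold tolerates misspellings but lets near neighbours such as "hypotension" through, so the value is worth tuning against real notes.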

Usage Examples

Tokenization

from pyctakes.annotators.tokenization import TokenizationAnnotator

# Create annotator
annotator = TokenizationAnnotator(backend="spacy")

# Process document (`doc` is an existing pyCTAKES document created elsewhere)
doc = annotator.process(doc)

# Access results
for sentence in doc.sentences:
    print(f"Sentence: {sentence.text}")
    for token in sentence.tokens:
        print(f"  Token: {token.text} ({token.pos})")

Named Entity Recognition

from pyctakes.annotators.ner import ClinicalNERAnnotator

# Rule-based NER
ner = ClinicalNERAnnotator(approach="rule_based")
doc = ner.process(doc)

# Model-based NER
ner = ClinicalNERAnnotator(
    approach="model_based",
    model_name="en_ner_bc5cdr_md"
)
doc = ner.process(doc)

# Access entities
for entity in doc.entities:
    print(f"Entity: {entity.text} ({entity.label}) at {entity.start}-{entity.end}")

Assertion Detection

from pyctakes.annotators.assertion import NegationAssertionAnnotator

# Create assertion annotator
assertion = NegationAssertionAnnotator()
doc = assertion.process(doc)

# Check assertions
for entity in doc.entities:
    if hasattr(entity, 'assertion'):
        print(f"{entity.text}: {entity.assertion.polarity}")

Section Detection

from pyctakes.annotators.sections import ClinicalSectionAnnotator

# Create section annotator
sections = ClinicalSectionAnnotator()
doc = sections.process(doc)

# Access sections
for section in doc.sections:
    print(f"Section: {section.section_type} ({section.start}-{section.end})")

UMLS Concept Mapping

from pyctakes.annotators.umls import UMLSConceptMapper

# Create UMLS mapper
umls = UMLSConceptMapper()
doc = umls.process(doc)

# Access concepts
for entity in doc.entities:
    if hasattr(entity, 'umls_concept'):
        concept = entity.umls_concept
        print(f"{entity.text} -> {concept.cui} ({concept.preferred_term})")

Custom Annotators

Creating Custom Annotators

from pyctakes.annotators.base import BaseAnnotator
from pyctakes.types import Document, Annotation

class CustomAnnotator(BaseAnnotator):
    def __init__(self, **kwargs):
        super().__init__()
        self.config = kwargs

    def process(self, doc: Document) -> Document:
        # Your custom processing logic

        # Example: Add custom annotation
        annotation = Annotation(
            start=0,
            end=len(doc.text),
            text=doc.text,
            label="CUSTOM",
            confidence=1.0
        )
        doc.annotations.append(annotation)

        return doc

Using Custom Annotators

from pyctakes.pipeline import Pipeline

# Create pipeline with custom annotator
pipeline = Pipeline()
pipeline.add_annotator(CustomAnnotator(param1="value1"))

# Process text
result = pipeline.process_text("Some clinical text")
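
Conceptually, a pipeline of this kind passes one document through each annotator in the order it was added. The sketch below is a hypothetical stand-in, not the `pyctakes.pipeline.Pipeline` implementation: `MiniPipeline`, the two toy annotators, and the plain dict used as a document are all assumptions. Only the `add_annotator` and `process_text` names are taken from the example above.

```python
class MiniPipeline:
    """Hypothetical stand-in: runs annotators in the order they were added."""

    def __init__(self):
        self.annotators = []

    def add_annotator(self, annotator):
        self.annotators.append(annotator)
        return self  # returning self allows chained calls

    def process_text(self, text):
        doc = {"text": text, "annotations": []}  # a plain dict stands in for Document
        for annotator in self.annotators:
            doc = annotator.process(doc)
        return doc


class LengthAnnotator:
    """Toy annotator: records the character length of the text."""

    def process(self, doc):
        doc["annotations"].append(("LENGTH", len(doc["text"])))
        return doc


class WordCountAnnotator:
    """Toy annotator: records the number of whitespace-separated words."""

    def process(self, doc):
        doc["annotations"].append(("WORDS", len(doc["text"].split())))
        return doc


pipeline = MiniPipeline().add_annotator(LengthAnnotator()).add_annotator(WordCountAnnotator())
result = pipeline.process_text("Some clinical text")
print(result["annotations"])  # [('LENGTH', 18), ('WORDS', 3)]
```

Order matters in such a design: an annotator that reads entities, for example, has to be added after the one that creates them.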

Annotator Configuration

Tokenization Configuration

from pyctakes.annotators.tokenization import TokenizationAnnotator

# spaCy backend
tokenizer = TokenizationAnnotator(
    backend="spacy",
    model="en_core_web_sm"
)

# Stanza backend
tokenizer = TokenizationAnnotator(
    backend="stanza",
    model="en"
)

# Rule-based backend
tokenizer = TokenizationAnnotator(
    backend="rule_based"
)

NER Configuration

from pyctakes.annotators.ner import ClinicalNERAnnotator

# Rule-based with custom patterns
ner = ClinicalNERAnnotator(
    approach="rule_based",
    custom_patterns={
        "MEDICATION": ["aspirin", "ibuprofen"],
        "CONDITION": ["diabetes", "hypertension"]
    }
)

# Model-based with specific model
ner = ClinicalNERAnnotator(
    approach="model_based",
    model_name="en_ner_bc5cdr_md"
)

Assertion Configuration

from pyctakes.annotators.assertion import NegationAssertionAnnotator

# Custom assertion rules
assertion = NegationAssertionAnnotator(
    window_size=10,
    custom_negation_terms=["denies", "negative for"],
    custom_uncertainty_terms=["possible", "likely"]
)

Error Handling

from pyctakes.annotators.base import AnnotationError

try:
    doc = annotator.process(doc)
except AnnotationError as e:
    print(f"Annotation failed: {e}")
    # Handle gracefully
except Exception as e:
    print(f"Unexpected error: {e}")

Performance Tips

- Choose appropriate backends: spaCy for speed, Stanza for accuracy
- Use rule-based for simple tasks: Faster than model-based approaches
- Configure entity types: Limit NER to needed entity types
- Adjust window sizes: Smaller windows for assertion detection are faster
- Enable caching: For UMLS lookups in production