Pipeline API Reference

The pyCTAKES pipeline module provides the core functionality for processing clinical text through configurable annotator chains.

Pipeline Class

class Pipeline

Main pipeline class for clinical text processing.

Methods:

  • __init__(): Initialize empty pipeline
  • add_annotator(annotator): Add annotator to pipeline
  • process_text(text): Process single text string
  • process_batch(texts): Process multiple texts
  • from_config(config): Create a pipeline from a configuration file path or dictionary (classmethod)
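
Taken together, a minimal sketch of how these methods fit (illustrative only; the annotator and config file are stand-ins, and actual signatures may carry more options):

from pyctakes.pipeline import Pipeline
from pyctakes.annotators import TokenizationAnnotator

pipeline = Pipeline()                                 # __init__: start empty
pipeline.add_annotator(TokenizationAnnotator())       # append an annotator to the chain
doc = pipeline.process_text("Patient has diabetes.")  # single text -> Document
docs = pipeline.process_batch(["Note 1", "Note 2"])   # list of texts -> list of Documents
configured = Pipeline.from_config("config.json")      # classmethod: build from config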

Pipeline Factory Functions

create_default_pipeline(config=None)

Create default clinical NLP pipeline with all annotators.

Parameters:
  • config (dict, optional): Configuration dictionary

Returns:
  • Pipeline: Configured pipeline instance

Includes:
  • Tokenization (spaCy backend)
  • Section detection
  • Named entity recognition
  • Assertion detection
  • UMLS concept mapping
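
The optional config lets you override component settings; a hedged sketch, assuming create_default_pipeline accepts the same dictionary format shown under Pipeline Configuration below:

from pyctakes.pipeline import create_default_pipeline

# Keys mirror the dictionary example in the Pipeline Configuration section;
# any other keys would be assumptions beyond what this reference documents.
config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"},
}
pipeline = create_default_pipeline(config)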

create_fast_pipeline(config=None)

Create speed-optimized pipeline with rule-based components.

Parameters:
  • config (dict, optional): Configuration dictionary

Returns:
  • Pipeline: Configured pipeline instance

Includes:
  • Tokenization (rule-based)
  • Named entity recognition (rule-based)
  • Basic assertion detection

create_basic_pipeline(config=None)

Create minimal pipeline for simple entity extraction.

Parameters:
  • config (dict, optional): Configuration dictionary

Returns:
  • Pipeline: Configured pipeline instance

Includes:
  • Tokenization (rule-based)
  • Named entity recognition (rule-based)
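
For comparison, a quick sketch constructing all three factory pipelines, assuming the fast and basic factories are importable from pyctakes.pipeline alongside create_default_pipeline:

from pyctakes.pipeline import (
    create_default_pipeline,
    create_fast_pipeline,
    create_basic_pipeline,
)

full = create_default_pipeline()   # all annotators, including UMLS mapping
fast = create_fast_pipeline()      # rule-based components, speed-optimized
basic = create_basic_pipeline()    # tokenization and NER only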

Usage Examples

Basic Pipeline Usage

from pyctakes.pipeline import create_default_pipeline

# Create pipeline
pipeline = create_default_pipeline()

# Process text
result = pipeline.process_text("Patient has diabetes and hypertension.")

# Access results
print(f"Found {len(result.entities)} entities")
for entity in result.entities:
    print(f"- {entity.text} ({entity.label})")

Custom Pipeline

from pyctakes.pipeline import Pipeline
from pyctakes.annotators import TokenizationAnnotator, NERAnnotator

# Create custom pipeline
pipeline = Pipeline()
pipeline.add_annotator(TokenizationAnnotator(backend="spacy"))
pipeline.add_annotator(NERAnnotator(approach="rule_based"))

# Process document
doc = pipeline.process_text("Patient takes aspirin 81mg daily.")

Pipeline Configuration

from pyctakes.pipeline import Pipeline

# Load from configuration file
pipeline = Pipeline.from_config("config.json")

# Load from dictionary
config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"}
}
pipeline = Pipeline.from_config(config)
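
For the file-based variant, the JSON on disk would mirror the same dictionary; a sketch that writes the configuration out and loads it back (key names taken from the example above):

import json
from pyctakes.pipeline import Pipeline

config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"}
}

# Write the configuration to disk, then build the pipeline from the file path.
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

pipeline = Pipeline.from_config("config.json")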

Batch Processing

from pyctakes.pipeline import create_default_pipeline

pipeline = create_default_pipeline()

# Process multiple texts
texts = [
    "Patient has diabetes.",
    "No history of hypertension.",
    "Takes metformin 500mg twice daily."
]

results = pipeline.process_batch(texts)
for i, result in enumerate(results):
    print(f"Text {i+1}: {len(result.entities)} entities")

Pipeline Methods

add_annotator(annotator)

Add an annotator to the pipeline.

Parameters:
  • annotator: An instance of a pyCTAKES annotator

Example:

from pyctakes.annotators import TokenizationAnnotator

pipeline = Pipeline()
pipeline.add_annotator(TokenizationAnnotator())

process_text(text)

Process a single text string.

Parameters:
  • text (str): The clinical text to process

Returns:
  • Document: Processed document with annotations

Example:

doc = pipeline.process_text("Patient has diabetes.")

process_batch(texts)

Process multiple texts in batch.

Parameters:
  • texts (List[str]): List of clinical texts to process

Returns:
  • List[Document]: List of processed documents

Example:

docs = pipeline.process_batch(["Text 1", "Text 2"])

from_config(config)

Create pipeline from configuration.

Parameters:
  • config (str or dict): Configuration file path or dictionary

Returns:
  • Pipeline: Configured pipeline instance

Example:

pipeline = Pipeline.from_config("my_config.json")

Pipeline Types

Default Pipeline

Full-featured clinical NLP pipeline with all annotators:

pipeline = create_default_pipeline()

Includes:
  • Tokenization (spaCy backend)
  • Section detection
  • Named entity recognition
  • Assertion detection
  • UMLS concept mapping

Fast Pipeline

Speed-optimized pipeline with rule-based components:

pipeline = create_fast_pipeline()

Includes:
  • Tokenization (rule-based)
  • Named entity recognition (rule-based)
  • Basic assertion detection

Basic Pipeline

Minimal pipeline for simple entity extraction:

pipeline = create_basic_pipeline()

Includes:
  • Tokenization (rule-based)
  • Named entity recognition (rule-based)

Error Handling

from pyctakes.pipeline import Pipeline, PipelineError

try:
    pipeline = Pipeline.from_config("invalid_config.json")
    result = pipeline.process_text("Some text")
except PipelineError as e:
    print(f"Pipeline error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Performance Considerations

  • Tokenization Backend: among the model backends, spaCy is the fastest and Stanza the most accurate; the rule-based tokenizer used by the fast and basic pipelines is faster still
  • NER Approach: rule-based is faster; model-based is more accurate
  • Batch Processing: process_batch is more efficient than repeated process_text calls when handling multiple documents
  • Configuration: disable unused annotators for better performance (see the sketch below)
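
A hedged sketch of the last point; the per-annotator "enabled" flag is an assumption for illustration, not a documented option, so check the configuration reference for the actual key names:

from pyctakes.pipeline import create_default_pipeline

# Hypothetical keys: the "umls"/"assertion" sections and the "enabled" flag
# are assumptions; consult your configuration schema before relying on them.
config = {
    "umls": {"enabled": False},       # skip UMLS concept mapping
    "assertion": {"enabled": False},  # skip assertion detection
}
pipeline = create_default_pipeline(config)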