Pipeline API Reference
The pyCTAKES pipeline module provides the core functionality for processing clinical text through configurable annotator chains.
Pipeline Class
class Pipeline
Main pipeline class for clinical text processing.
Methods:
- __init__(): Initialize an empty pipeline
- add_annotator(annotator): Add an annotator to the pipeline
- process_text(text): Process a single text string
- process_batch(texts): Process multiple texts
- from_config(config): Create a pipeline from configuration
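To make the "annotator chain" idea concrete, here is a minimal, self-contained sketch of how such a chain could work. It does not import pyCTAKES and is not its implementation: SketchDocument, WhitespaceTokenizer, KeywordNER, SketchPipeline, and the annotate() method name are all hypothetical stand-ins. Only the method names add_annotator, process_text, and process_batch mirror the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the document a pipeline returns."""
    text: str
    tokens: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)


class WhitespaceTokenizer:
    """Toy annotator: splits the text on whitespace."""

    def annotate(self, doc):
        doc.tokens = doc.text.split()


class KeywordNER:
    """Toy annotator: keeps tokens that appear in a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = {term.lower() for term in vocabulary}

    def annotate(self, doc):
        # Strip trailing punctuation before the vocabulary lookup.
        cleaned = (token.strip(".,;:") for token in doc.tokens)
        doc.entities = [word for word in cleaned if word.lower() in self.vocabulary]


class SketchPipeline:
    """Runs each annotator in the order it was added (assumed behavior)."""

    def __init__(self):
        self.annotators = []

    def add_annotator(self, annotator):
        self.annotators.append(annotator)

    def process_text(self, text):
        doc = SketchDocument(text=text)
        for annotator in self.annotators:
            annotator.annotate(doc)
        return doc

    def process_batch(self, texts):
        return [self.process_text(text) for text in texts]


sketch = SketchPipeline()
sketch.add_annotator(WhitespaceTokenizer())
sketch.add_annotator(KeywordNER(["diabetes", "hypertension"]))
sketch_doc = sketch.process_text("Patient has diabetes and hypertension.")
print(sketch_doc.entities)
```

Order matters in this sketch: KeywordNER reads the tokens that WhitespaceTokenizer wrote, so adding them in reverse would yield no entities.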
Pipeline Factory Functions
create_default_pipeline(config=None)
Create default clinical NLP pipeline with all annotators.
Parameters:
- config (dict, optional): Configuration dictionary
Returns:
- Pipeline: Configured pipeline instance
Includes:
- Tokenization (spaCy backend)
- Section detection
- Named entity recognition
- Assertion detection
- UMLS concept mapping
create_fast_pipeline(config=None)
Create speed-optimized pipeline with rule-based components.
Parameters:
- config (dict, optional): Configuration dictionary
Returns:
- Pipeline: Configured pipeline instance
Includes:
- Tokenization (rule-based)
- Named entity recognition (rule-based)
- Basic assertion detection
create_basic_pipeline(config=None)
Create minimal pipeline for simple entity extraction.
Parameters:
- config (dict, optional): Configuration dictionary
Returns:
- Pipeline: Configured pipeline instance
Includes:
- Tokenization (rule-based)
- Named entity recognition (rule-based)
Usage Examples
Basic Pipeline Usage
from pyctakes.pipeline import Pipeline, create_default_pipeline
# Create pipeline
pipeline = create_default_pipeline()
# Process text
result = pipeline.process_text("Patient has diabetes and hypertension.")
# Access results
print(f"Found {len(result.entities)} entities")
for entity in result.entities:
    print(f"- {entity.text} ({entity.label})")
Custom Pipeline
from pyctakes.pipeline import Pipeline
from pyctakes.annotators import TokenizationAnnotator, NERAnnotator
# Create custom pipeline
pipeline = Pipeline()
pipeline.add_annotator(TokenizationAnnotator(backend="spacy"))
pipeline.add_annotator(NERAnnotator(approach="rule_based"))
# Process document
doc = pipeline.process_text("Patient takes aspirin 81mg daily.")
Pipeline Configuration
from pyctakes.pipeline import Pipeline
# Load from configuration file
pipeline = Pipeline.from_config("config.json")
# Load from dictionary
config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"},
}
pipeline = Pipeline.from_config(config)
Batch Processing
from pyctakes.pipeline import create_default_pipeline
pipeline = create_default_pipeline()
# Process multiple texts
texts = [
    "Patient has diabetes.",
    "No history of hypertension.",
    "Takes metformin 500mg twice daily.",
]
results = pipeline.process_batch(texts)
# Report the entity count for each input text
for i, result in enumerate(results):
    print(f"Text {i+1}: {len(result.entities)} entities")
Pipeline Methods
add_annotator(annotator)
Add an annotator to the pipeline.
Parameters:
- annotator: An instance of a pyCTAKES annotator
Example:
from pyctakes.pipeline import Pipeline
from pyctakes.annotators import TokenizationAnnotator
pipeline = Pipeline()
pipeline.add_annotator(TokenizationAnnotator())
process_text(text)
Process a single text string.
Parameters:
- text (str): The clinical text to process
Returns:
- Document: Processed document with annotations
Example:
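As a hedged illustration of working with the returned document, the sketch below groups entities by label. It runs without pyCTAKES: StubEntity and StubDocument are hypothetical stand-ins that expose only the entities, text, and label attributes used elsewhere in this reference, and the label values "DISORDER" and "MEDICATION" are made up for the example. The helper group_entities_by_label is likewise not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StubEntity:
    """Hypothetical stand-in exposing the text and label attributes."""
    text: str
    label: str


@dataclass
class StubDocument:
    """Hypothetical stand-in for a processed document."""
    entities: List[StubEntity] = field(default_factory=list)


def group_entities_by_label(doc) -> Dict[str, List[str]]:
    """Collect entity texts under their labels, preserving document order."""
    grouped = defaultdict(list)
    for entity in doc.entities:
        grouped[entity.label].append(entity.text)
    return dict(grouped)


# Stand-in for the value a call such as pipeline.process_text(...) would return.
stub_doc = StubDocument(entities=[
    StubEntity("diabetes", "DISORDER"),
    StubEntity("aspirin", "MEDICATION"),
    StubEntity("hypertension", "DISORDER"),
])
grouped = group_entities_by_label(stub_doc)
print(grouped)
```

Because the helper touches only doc.entities, entity.text, and entity.label, it should apply to any object with that shape.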
process_batch(texts)
Process multiple texts in batch.
Parameters:
- texts (List[str]): List of clinical texts to process
Returns:
- List[Document]: List of processed documents
Example:
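For large collections it can help to feed process_batch in bounded chunks rather than one very large list. The following is a minimal sketch of that pattern, not a pyCTAKES feature: iter_chunks and process_in_chunks are hypothetical helpers, and EchoPipeline is a stand-in that only implements a process_batch(texts) method so the example runs on its own.

```python
from typing import Iterator, List, Sequence


def iter_chunks(texts: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size texts."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(texts), chunk_size):
        yield list(texts[start:start + chunk_size])


def process_in_chunks(pipeline, texts, chunk_size=2):
    """Call pipeline.process_batch once per chunk and concatenate the results."""
    results = []
    for chunk in iter_chunks(texts, chunk_size):
        results.extend(pipeline.process_batch(chunk))
    return results


class EchoPipeline:
    """Stand-in pipeline that records batch sizes and upper-cases each text."""

    def __init__(self):
        self.batch_sizes = []

    def process_batch(self, texts):
        self.batch_sizes.append(len(texts))
        return [text.upper() for text in texts]


echo = EchoPipeline()
chunked = process_in_chunks(echo, ["a", "b", "c", "d", "e"], chunk_size=2)
print(chunked, echo.batch_sizes)
```

The results keep the input order, so the i-th result still corresponds to the i-th text.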
from_config(config)
Create pipeline from configuration.
Parameters:
- config (str or dict): Configuration file path or dictionary
Returns:
- Pipeline: Configured pipeline instance
Example:
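Since config may be either a file path or a dictionary, a loader has to normalize the two forms before building anything. The sketch below shows one way that could be done. It is an assumption, not the library's code: resolve_config is a hypothetical helper, and JSON is assumed as the file format only because this reference uses a "config.json" file name.

```python
import json
import os
import tempfile
from typing import Any, Dict, Union


def resolve_config(config: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a configuration dictionary from a dict or a JSON file path."""
    if isinstance(config, dict):
        return dict(config)  # shallow copy so the caller's dict is not mutated
    if isinstance(config, str):
        with open(config, "r", encoding="utf-8") as handle:
            loaded = json.load(handle)
        if not isinstance(loaded, dict):
            raise ValueError("configuration file must contain a JSON object")
        return loaded
    raise TypeError(f"config must be a str path or a dict, got {type(config).__name__}")


example_config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"},
}

# Write the dictionary to a temporary file to exercise the path branch.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(example_config, handle)
    from_file = resolve_config(config_path)

from_dict = resolve_config(example_config)
print(from_file == from_dict)
```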
Pipeline Types
Default Pipeline
Full-featured clinical NLP pipeline with all annotators:
Includes:
- Tokenization (spaCy backend)
- Section detection
- Named entity recognition
- Assertion detection
- UMLS concept mapping
Fast Pipeline
Speed-optimized pipeline with rule-based components:
Includes:
- Tokenization (rule-based)
- Named entity recognition (rule-based)
- Basic assertion detection
Basic Pipeline
Minimal pipeline for simple entity extraction:
Includes:
- Tokenization (rule-based)
- Named entity recognition (rule-based)
Error Handling
from pyctakes.pipeline import Pipeline, PipelineError
try:
    pipeline = Pipeline.from_config("invalid_config.json")
    result = pipeline.process_text("Some text")
except PipelineError as e:
    print(f"Pipeline error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
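When many texts are processed one at a time, a single failure need not abort the whole run. The sketch below isolates errors per document. It is self-contained and makes assumptions throughout: StubPipelineError stands in for PipelineError, FlakyPipeline is a toy object with a process_text method, and process_all is a hypothetical helper rather than a pyCTAKES function.

```python
class StubPipelineError(Exception):
    """Stand-in for the library's PipelineError."""


class FlakyPipeline:
    """Toy pipeline that rejects blank input and otherwise splits on whitespace."""

    def process_text(self, text):
        if not text.strip():
            raise StubPipelineError("empty text")
        return text.split()


def process_all(pipeline, texts, error_type=Exception):
    """Process each text; record failures instead of stopping at the first one."""
    results, failures = [], []
    for index, text in enumerate(texts):
        try:
            results.append(pipeline.process_text(text))
        except error_type as exc:
            results.append(None)  # keep positions aligned with the inputs
            failures.append((index, str(exc)))
    return results, failures


results, failures = process_all(
    FlakyPipeline(),
    ["Patient has diabetes.", "   ", "No fever."],
    error_type=StubPipelineError,
)
print(results, failures)
```

Passing a narrow error_type keeps unexpected exceptions visible instead of silently recording them.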
Performance Considerations
- Tokenization Backend: spaCy is the fastest backend; Stanza is the most accurate.
- NER Approach: Rule-based NER is faster; model-based NER is more accurate.
- Batch Processing: Processing texts in a batch is more efficient than processing them one at a time.
- Configuration: Disable unused annotators for better performance.