Configuration

pyCTAKES provides flexible configuration options to customize the behavior of annotators and pipelines.

Configuration File Format

pyCTAKES uses JSON configuration files:

{
  "tokenization": {
    "backend": "spacy",
    "model": "en_core_web_sm"
  },
  "sections": {
    "enabled": true,
    "custom_patterns": {}
  },
  "ner": {
    "approach": "rule_based",
    "custom_patterns": {}
  },
  "assertion": {
    "window_size": 10
  },
  "umls": {
    "similarity_threshold": 0.8
  }
}

Annotator Configuration

Tokenization Configuration

{
  "tokenization": {
    "backend": "spacy",           // Backend: "spacy", "stanza", "rule_based"
    "model": "en_core_web_sm",    // Model name (for spacy/stanza)
    "sentence_split": true,       // Enable sentence segmentation
    "tokenize": true,             // Enable tokenization
    "preserve_whitespace": false  // Preserve original whitespace
  }
}

Backend Options: - spacy: Fast, requires spaCy model installation - stanza: Most accurate, requires Stanza models
- rule_based: No dependencies, basic splitting

Section Configuration

{
  "sections": {
    "enabled": true,
    "case_sensitive": false,
    "custom_patterns": {
      "CHIEF_COMPLAINT": [
        "chief complaint:",
        "cc:",
        "presenting complaint"
      ],
      "CUSTOM_SECTION": [
        "my custom section:",
        "special notes:"
      ]
    },
    "disabled_sections": ["FAMILY_HISTORY"]
  }
}

NER Configuration

{
  "ner": {
    "approach": "rule_based",     // "rule_based" or "model_based"
    "model_name": null,           // Model for model_based approach
    "case_sensitive": false,
    "custom_patterns": {
      "MEDICATION": [
        "aspirin", "ibuprofen", "metformin",
        "lisinopril", "atorvastatin"
      ],
      "CONDITION": [
        "diabetes", "hypertension", "pneumonia"
      ]
    },
    "entity_types": [             // Limit to specific entity types
      "MEDICATION", "CONDITION", "SYMPTOM"
    ]
  }
}

Assertion Configuration

{
  "assertion": {
    "window_size": 10,            // Context window for assertion detection
    "custom_negation_terms": [
      "denies", "negative for", "ruled out",
      "no evidence of", "absent"
    ],
    "custom_uncertainty_terms": [
      "possible", "probable", "likely",
      "suspect", "consider"
    ],
    "custom_temporal_terms": {
      "PAST": ["history of", "previous", "prior"],
      "FUTURE": ["will", "plan to", "scheduled"]
    }
  }
}

UMLS Configuration

{
  "umls": {
    "similarity_threshold": 0.8,  // Minimum similarity for concept matching
    "max_candidates": 5,          // Maximum candidate concepts
    "semantic_types": [           // Filter by semantic types
      "T047",  // Disease or Syndrome
      "T184",  // Sign or Symptom
      "T121"   // Pharmacologic Substance
    ],
    "sources": [                  // Limit to specific vocabularies
      "SNOMEDCT_US", "RXNORM", "ICD10CM"
    ],
    "enable_caching": true        // Cache concept lookups
  }
}

Pipeline Configuration

Default Pipeline

{
  "pipeline": {
    "name": "default",
    "annotators": [
      {
        "name": "tokenization",
        "class": "TokenizationAnnotator",
        "config": {
          "backend": "spacy"
        }
      },
      {
        "name": "sections", 
        "class": "SectionAnnotator",
        "config": {}
      },
      {
        "name": "ner",
        "class": "NERAnnotator", 
        "config": {
          "approach": "rule_based"
        }
      },
      {
        "name": "assertion",
        "class": "AssertionAnnotator",
        "config": {}
      },
      {
        "name": "umls",
        "class": "UMLSAnnotator",
        "config": {}
      }
    ]
  }
}

Custom Pipeline

{
  "pipeline": {
    "name": "medication_only",
    "annotators": [
      {
        "name": "tokenization",
        "class": "TokenizationAnnotator",
        "config": {
          "backend": "rule_based"
        }
      },
      {
        "name": "ner",
        "class": "NERAnnotator",
        "config": {
          "approach": "rule_based",
          "entity_types": ["MEDICATION", "DOSAGE"]
        }
      }
    ]
  }
}

Environment Variables

pyCTAKES recognizes these environment variables:

# UMLS API Key (required for full UMLS functionality)
export UMLS_API_KEY="your-api-key"

# spaCy model path
export SPACY_MODEL_PATH="/path/to/models"

# Cache directory
export PYTAKES_CACHE_DIR="/path/to/cache"

# Log level
export PYTAKES_LOG_LEVEL="INFO"

Configuration Loading

From File

from pyctakes.pipeline import Pipeline

# Load from file
pipeline = Pipeline.from_config("config.json")

# Load with overrides
pipeline = Pipeline.from_config(
    "config.json",
    overrides={"ner.approach": "model_based"}
)

From Dictionary

config = {
    "tokenization": {"backend": "spacy"},
    "ner": {"approach": "rule_based"}
}

pipeline = Pipeline.from_config(config)

Programmatic Configuration

from pyctakes.pipeline import Pipeline
from pyctakes.annotators import TokenizationAnnotator, NERAnnotator

pipeline = Pipeline()
pipeline.add_annotator(TokenizationAnnotator(backend="spacy"))
pipeline.add_annotator(NERAnnotator(approach="rule_based"))

Configuration Validation

pyCTAKES validates configuration on load:

try:
    pipeline = Pipeline.from_config("config.json")
except ConfigurationError as e:
    print(f"Invalid configuration: {e}")

Common Configuration Patterns

High Performance Setup

{
  "tokenization": {"backend": "rule_based"},
  "ner": {"approach": "rule_based"},
  "assertion": {"window_size": 5},
  "umls": {"similarity_threshold": 0.9}
}

High Accuracy Setup

{
  "tokenization": {"backend": "stanza"},
  "ner": {"approach": "model_based"},
  "assertion": {"window_size": 15},
  "umls": {"similarity_threshold": 0.7}
}

Medication-Focused Pipeline

{
  "ner": {
    "entity_types": ["MEDICATION", "DOSAGE", "FREQUENCY"],
    "custom_patterns": {
      "MEDICATION": ["list", "of", "medications"]
    }
  },
  "umls": {
    "semantic_types": ["T121"],  // Pharmacologic Substance
    "sources": ["RXNORM"]
  }
}

Best Practices

Start Simple: Begin with default configuration
Validate Early: Test configuration on sample data
Monitor Performance: Profile with different settings
Version Control: Store configurations in version control
Document Changes: Keep notes on configuration rationale
Test Thoroughly: Validate changes don't break existing functionality