NLP · Fact Extraction · RAG

VagaCore
Fact Extraction Engine

A production-ready NLP pipeline that converts messy, multi-sentence text into structured, time-aware facts for RAG, analytics, and automation.

version: 1.0.1
python: ≥ 3.8
license: open source
released: Mar 30, 2026
$ pip install vagacore
NER Entity Recognition
3 Output Modes
0.9+ Confidence Score
100% Offline

Built for Production

Every feature is designed for financial-grade accuracy and enterprise reliability.

💰
Financial-Grade Extraction
Detects MONEY, PERCENT, and DATE entities and preserves their units; handles M/B/k suffixes without loss of precision.
🧹
Entity Hygiene
Handles possessives and compounds intelligently — "Nvidia's revenue" resolves cleanly to "Nvidia".
🧠
Context Memory
Propagates time and subject context across sentences — "the same period" resolves correctly.
🚫
Safe Outputs
Automatically skips explicitly negated and hypothetical statements, so they never enter your pipeline as facts.
🔗
Smart Pairing
Handles "respectively" constructs, parallel values, and key:value formats with semantic precision.
📊
Confidence Scoring
Every extracted fact ships with a calibrated confidence score between 0.0 and 1.0 for downstream filtering.
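
To make the unit-preservation idea from Financial-Grade Extraction concrete, here is a minimal sketch of normalizing amounts with M/B/k suffixes while keeping the original surface form. It is not VagaCore's own code: the regex, the multiplier table, and the function name `parse_money` are assumptions made for illustration. It uses `Decimal` rather than `float` so values like 81.8 billion are not subject to binary rounding.

```python
import re
from decimal import Decimal

# Hypothetical multiplier table covering the k / M / B suffixes and their
# spelled-out forms. This is an assumption, not VagaCore's internal table.
_MULTIPLIERS = {
    "k": Decimal(10**3), "thousand": Decimal(10**3),
    "m": Decimal(10**6), "million": Decimal(10**6),
    "b": Decimal(10**9), "billion": Decimal(10**9),
}

# Dollar sign, a number with optional thousands separators and decimals,
# then an optional unit word or single-letter suffix.
_MONEY_RE = re.compile(
    r"\$\s*(?P<number>\d+(?:,\d{3})*(?:\.\d+)?)"
    r"\s*(?P<unit>thousand|million|billion|[kmb])?\b",
    re.IGNORECASE,
)


def parse_money(text):
    """Return (surface_form, normalized_amount) pairs for each amount found."""
    results = []
    for match in _MONEY_RE.finditer(text):
        number = Decimal(match.group("number").replace(",", ""))
        unit = (match.group("unit") or "").lower()
        multiplier = _MULTIPLIERS.get(unit, Decimal(1))
        # Keep the text exactly as written next to the normalized value.
        results.append((match.group(0).strip(), number * multiplier))
    return results


for surface, amount in parse_money("Apple reported $81.8 billion; target is $900M."):
    print(surface, "->", amount)
```

Returning the surface form together with the normalized value lets downstream code display the amount as the source wrote it while still comparing amounts numerically.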

How the Pipeline Works

Five sequential modules transform raw text into clean, structured facts.

01
parser.py — spaCy Loading
Loads en_core_web_sm, tokenizes input, builds the dependency parse tree and POS tags.
02
utils.py — Noise Removal
Strips hypothetical phrases, applies negation guards, and deduplicates redundant surface forms.
03
extractor.py — SVO Extraction
Extracts Subject-Verb-Object triples, runs entity recognition, and binds temporal values.
04
compressor.py — Orchestration
Chains modules, maintains cross-sentence memory state, resolves coreferences.
05
__init__.py — Output
Formats results in json, text, or llm mode and attaches version metadata.
vagacore/
├── parser.py # spaCy loading & parsing
├── utils.py # cleaning & noise removal
├── extractor.py # SVO + entity extraction
├── compressor.py # pipeline + memory
└── __init__.py # exports & version

examples/
├── test_script.py
└── test_stress.py
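
As a rough illustration of the noise-removal stage (module 02 above), the sketch below drops sentences that contain explicit hypothetical or negation cues before any extraction runs. It is a regex stand-in written for this page, not the contents of utils.py: the cue lists, the naive sentence splitter, and the names `split_sentences` and `keep_factual` are all assumptions, and the real pipeline works on a spaCy parse rather than raw strings.

```python
import re

# Illustrative cue lists (assumptions, not VagaCore's actual rules).
_HYPOTHETICAL = re.compile(r"\b(?:if|unless|would|could|might|assuming)\b", re.IGNORECASE)
_NEGATION = re.compile(r"\b(?:not|no|never|without)\b|n't\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keep_factual(text):
    """Return only the sentences with no hypothetical or negation cue."""
    return [
        sentence
        for sentence in split_sentences(text)
        if not _HYPOTHETICAL.search(sentence) and not _NEGATION.search(sentence)
    ]


sample = (
    "Apple reported $81.8 billion in revenue for Q3 2024. "
    "If Apple sells 1M units, it will reach $900M. "
    "Netflix did not report revenue."
)
print(keep_factual(sample))
```

A keyword filter like this is deliberately conservative: it can discard a factual sentence that merely contains a cue word, and it cannot catch sarcasm or implicit negation.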

See It In Action

Multi-sentence input → structured, time-aware facts with confidence scores.

input.py
from vagacore import compress

text = """
Apple reported $81.8 billion in revenue for Q3 2024.
If Apple sells 1M units, it will reach $900M.
Netflix and Disney reported $1B and $2B respectively.
"""

result = compress(text, mode="json")
print(result)
output.json
{
  "facts": [
    {
      "entity": "Apple",
      "event": "reported",
      "value": "$81.8 billion",
      "time": "Q3 2024",
      "confidence": 0.92
    },
    {
      "entity": "Netflix",
      "event": "reported",
      "value": "$1B",
      "time": "Q3 2024",
      "confidence": 0.88
    }
    // hypothetical "If Apple..." → skipped
  ],
  "version": "1.0.1"
}
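
The third input sentence uses a "respectively" construct. One way to think about pairing it is positional alignment: the first subject goes with the first value, the second with the second. The sketch below shows that idea with a plain regex. It is not VagaCore's implementation; the pattern, the list splitter, and the name `pair_respectively` are assumptions, and it only handles the simple "A and B verb X and Y respectively" shape.

```python
import re

# subjects, a single-word verb, dollar values, then the word "respectively".
_RESPECTIVELY_RE = re.compile(
    r"^(?P<subjects>.+?)\s+(?P<verb>\w+)\s+(?P<values>\$.+?)\s*,?\s*respectively\b",
    re.IGNORECASE,
)

# Split "A, B, and C" or "A and B". A comma must be followed by whitespace,
# so an amount written as "$1,000" is not broken apart.
_LIST_SPLIT_RE = re.compile(r",\s+(?:and\s+)?|\s+and\s+")


def _split_list(fragment):
    return [item.strip() for item in _LIST_SPLIT_RE.split(fragment) if item.strip()]


def pair_respectively(sentence):
    """Pair subjects with values by position; return [] when counts differ."""
    match = _RESPECTIVELY_RE.search(sentence.strip())
    if not match:
        return []
    subjects = _split_list(match.group("subjects"))
    values = _split_list(match.group("values"))
    if len(subjects) != len(values):
        return []  # ambiguous alignment: refuse to guess
    return [
        {"entity": subject, "event": match.group("verb"), "value": value}
        for subject, value in zip(subjects, values)
    ]


print(pair_respectively("Netflix and Disney reported $1B and $2B respectively."))
```

Returning an empty list when the counts differ is a design choice in this sketch: a wrong pairing is harder to detect downstream than a missing one.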

Three Ways to Consume

mode="json"
Structured Facts
Full JSON with entity, event, value, time, and confidence. Ideal for RAG and databases.
mode="text"
Readable Summary
Human-readable bullet summary. Great for reports, dashboards, and quick reviews.
mode="llm"
LLM-Compact
Compressed token-efficient format for feeding into LLM prompts. Reduces context size.
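
For the json mode, downstream filtering by confidence might look like the sketch below. It works on a hand-written payload that mirrors the sample output shown earlier rather than calling the library, because this page does not state whether `compress` returns a string or a dict; the helper `filter_facts` and the 0.9 threshold are assumptions.

```python
import json

# Hand-written payload in the shape of the sample output (valid JSON, so the
# explanatory comment from the sample is omitted).
SAMPLE_OUTPUT = """
{
  "facts": [
    {"entity": "Apple", "event": "reported", "value": "$81.8 billion",
     "time": "Q3 2024", "confidence": 0.92},
    {"entity": "Netflix", "event": "reported", "value": "$1B",
     "time": "Q3 2024", "confidence": 0.88}
  ],
  "version": "1.0.1"
}
"""


def filter_facts(payload, min_confidence=0.9):
    """Keep facts whose confidence is at least min_confidence.

    Accepts either a JSON string or an already-parsed dict.
    """
    if isinstance(payload, str):
        payload = json.loads(payload)
    return [
        fact
        for fact in payload.get("facts", [])
        if fact.get("confidence", 0.0) >= min_confidence
    ]


for fact in filter_facts(SAMPLE_OUTPUT):
    print(fact["entity"], fact["value"], fact["time"], fact["confidence"])
```

The right threshold depends on how costly a wrong fact is in your application; a RAG index may tolerate a lower cutoff than an automated report.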

Get Running in 60 Seconds

Python ≥ 3.8 required. Works fully offline after setup — no API keys needed.

1
Create & activate virtual environment
python -m venv venv
venv\Scripts\activate     # Windows
source venv/bin/activate  # macOS / Linux
2
Install spaCy and download language model
pip install -U pip spacy
python -m spacy download en_core_web_sm
3
Install VagaCore from PyPI
pip install vagacore
4
Verify installation
python -m pytest
python examples/test_script.py
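
If you want a quick check that does not depend on the test files, the following sketch reports whether the interpreter and the packages from the steps above are visible. It relies only on the standard library; the function name `check_environment` is hypothetical, and the module names are taken from the install commands and the import shown on this page.

```python
import importlib.util
import sys


def check_environment(modules=("spacy", "en_core_web_sm", "vagacore")):
    """Report the Python version requirement and whether each module is importable."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in modules:
        # find_spec returns None for a top-level module that is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```

A `MISSING` entry for `en_core_web_sm` usually means step 2 was run in a different environment from the one that is currently active.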

Where It Shines & Where It Doesn't

Honest assessment of current capabilities and known limitations.

✅ Works Well For
Multi-sentence financial documents
Earnings reports and KPI extraction
Temporal references ("same period", "Q3")
Entity-rich content with named orgs
Parallel value structures ("respectively")
Negation and hypothetical filtering
⚠️ Known Limitations
Complex multi-action sentences
Passive voice (partial support only)
English-only (multi-language planned)
Very long documents may need chunking
Sarcasm / implicit negation
Custom entity types (planned)
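
For the chunking limitation, a simple approach is to pack whole sentences into chunks under a size budget and process each chunk separately. The sketch below shows one way to do that; the function name `chunk_text`, the naive sentence splitter, and the 2000-character default are assumptions, not library settings.

```python
import re


def chunk_text(text, max_chars=2000):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("A one. B two. C three.", max_chars=13))
```

Chunks are processed independently, so a reference such as "the same period" in a later chunk may no longer have the sentence it points back to. Overlapping chunks by a sentence or two is one possible mitigation.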
📄
RAG Systems
Feed clean structured facts instead of raw chunks into your retrieval pipeline.
📈
Financial Analysis
Parse earnings reports, market analysis, and KPI summaries at scale.
🗂️
Knowledge Bases
Index structured facts instead of raw text for better semantic retrieval.
📰
News Analysis
Extract entities and metrics from news feeds and build event timelines.

Release History

v1.0.1 Mar 30, 2026
Improved money regex — handles edge cases with multi-unit values
Better subject resolution across complex clause chains
Context memory enhancements for longer documents
v1.0.0 Initial release
Negation and hypothetical filtering
List and "respectively" construct handling
Fact deduplication pipeline
JSON / text / llm output modes
Confidence scoring for all extracted facts
🔮
Planned in v1.1.0
Multi-action sentence support · Passive voice improvements · Multi-language support · Custom entity types · Better confidence scoring

Ready to Extract Facts?

One pip install. Fully offline. Financial-grade accuracy.
