NLP · Fact Extraction · RAG

VagaCore
Fact Extraction Engine

A production-ready NLP pipeline that converts messy, multi-sentence text into structured, time-aware facts for RAG, analytics, and automation.

version 1.0.1

python ≥ 3.8

license open source

released Mar 30, 2026


$ pip install vagacore

NER Entity Recognition

3 Output Modes

0.9+ Confidence Score

100% Offline

features

Built for Production

Every feature is designed for financial-grade accuracy and enterprise reliability.

💰

Financial-Grade Extraction

Detects MONEY, PERCENT, DATE with unit preservation — handles M/B/k suffixes without loss of precision.
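One way to keep precision when expanding M/B/k suffixes is to use decimal arithmetic instead of floats. The snippet below is a minimal sketch of that idea only; parse_money and its regex are hypothetical helpers written for illustration, not part of the VagaCore API.

```python
import re
from decimal import Decimal

# Hypothetical helper for illustration; not VagaCore's actual implementation.
_SCALE = {
    "k": Decimal("1e3"), "thousand": Decimal("1e3"),
    "m": Decimal("1e6"), "million": Decimal("1e6"),
    "b": Decimal("1e9"), "billion": Decimal("1e9"),
}
_MONEY = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)\s*(k|m|b|thousand|million|billion)?\b",
    re.IGNORECASE,
)

def parse_money(text):
    """Return the first dollar amount in text as an exact Decimal, or None."""
    match = _MONEY.search(text)
    if match is None:
        return None
    number, suffix = match.groups()
    scale = _SCALE[suffix.lower()] if suffix else Decimal(1)
    # Decimal multiplication is exact here, so no float rounding is introduced.
    return Decimal(number) * scale

print(parse_money("Apple reported $81.8 billion in revenue"))
print(parse_money("it will reach $900M"))
```

Keeping the original surface string (for example "$81.8 billion") alongside the parsed number preserves the unit as written.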

🧹

Entity Hygiene

Handles possessives and compounds intelligently — "Nvidia's revenue" resolves cleanly to "Nvidia".
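As a rough illustration of possessive handling, the sketch below strips a trailing possessive marker and whatever follows it. strip_possessive is a hypothetical name, and a regex is used here as a stand-in; the real pipeline may rely on the dependency parse instead.

```python
import re

# Hypothetical helper for illustration; not part of the VagaCore API.
# Matches "X's ..." and "Xs' ..." with straight or curly apostrophes.
_POSSESSIVE = re.compile(r"^(.+?)(?:['’]s|(?<=s)['’])(?:\s|$)")

def strip_possessive(phrase):
    """Return the owner of a possessive phrase, or the phrase itself."""
    phrase = phrase.strip()
    match = _POSSESSIVE.match(phrase)
    return match.group(1) if match else phrase

print(strip_possessive("Nvidia's revenue"))
```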

🧠

Context Memory

Propagates time and subject context across sentences — "the same period" resolves correctly.
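Time propagation can be pictured as carrying the last explicit time forward to facts that lack one or that refer back to it. This is a simplified sketch under that assumption; propagate_time and the list of back-references are hypothetical and do not describe VagaCore's internal memory state.

```python
# Hypothetical sketch of cross-sentence time propagation.
_BACK_REFERENCES = {"the same period", "same period"}

def propagate_time(facts):
    """Fill missing or back-referencing 'time' fields with the last explicit time."""
    last_time = None
    resolved = []
    for fact in facts:
        time = fact.get("time")
        if time is None or time.lower() in _BACK_REFERENCES:
            time = last_time          # reuse the most recent explicit time
        else:
            last_time = time          # remember this explicit time
        resolved.append({**fact, "time": time})
    return resolved

facts = [
    {"entity": "Apple", "value": "$81.8 billion", "time": "Q3 2024"},
    {"entity": "Netflix", "value": "$1B", "time": None},
    {"entity": "Disney", "value": "$2B", "time": "the same period"},
]
for fact in propagate_time(facts):
    print(fact["entity"], fact["time"])
```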

🚫

Safe Outputs

Automatically skips negated and hypothetical statements — no hallucinated facts in your pipeline.
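A simple way to approximate this filter is a cue-word check per sentence. The sketch below assumes a small, hand-picked cue list and a hypothetical is_assertive function; it will misfire on some inputs and is only meant to show the general shape of the guard, not VagaCore's actual rules.

```python
import re

# Hypothetical cue lists for illustration; a real filter would be broader
# and would likely use the dependency parse rather than keywords alone.
_HYPOTHETICAL = re.compile(r"\b(if|would|could|might|unless|assuming)\b", re.IGNORECASE)
_NEGATION = re.compile(r"\b(not|never|no)\b|n['’]t\b", re.IGNORECASE)

def is_assertive(sentence):
    """True when the sentence contains no hypothetical or negation cue."""
    return not (_HYPOTHETICAL.search(sentence) or _NEGATION.search(sentence))

sentences = [
    "Apple reported $81.8 billion in revenue for Q3 2024.",
    "If Apple sells 1M units, it will reach $900M.",
    "Apple did not report a loss.",
]
kept = [s for s in sentences if is_assertive(s)]
print(kept)
```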

🔗

Smart Pairing

Handles "respectively" constructs, parallel values, and key:value formats with semantic precision.
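The "respectively" case comes down to pairing two parallel lists in order. The following sketch assumes the entities and values have already been extracted; pair_respectively is a hypothetical helper, and refusing to pair mismatched lists is a design choice of this sketch rather than documented VagaCore behavior.

```python
# Hypothetical helper for illustration; not part of the VagaCore API.
def pair_respectively(entities, values):
    """Pair entities with values positionally; return [] if the lists do not align."""
    if len(entities) != len(values):
        return []  # ambiguous pairing, so emit nothing rather than guess
    return list(zip(entities, values))

pairs = pair_respectively(["Netflix", "Disney"], ["$1B", "$2B"])
print(pairs)
```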

📊

Confidence Scoring

Every extracted fact ships with a calibrated confidence score between 0.0 and 1.0 for downstream filtering.
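Downstream filtering on the score can be a one-line threshold check. The sketch below assumes facts shaped like the JSON output shown on this page (a "confidence" key per fact); filter_confident and the 0.9 threshold are illustrative choices.

```python
# Assumes facts are dicts with a "confidence" key, as in the JSON output example.
def filter_confident(facts, threshold=0.9):
    """Keep only facts whose confidence meets the threshold."""
    return [fact for fact in facts if fact.get("confidence", 0.0) >= threshold]

facts = [
    {"entity": "Apple", "event": "reported", "value": "$81.8 billion",
     "time": "Q3 2024", "confidence": 0.92},
    {"entity": "Netflix", "event": "reported", "value": "$1B",
     "time": "Q3 2024", "confidence": 0.88},
]
kept = filter_confident(facts, threshold=0.9)
print([fact["entity"] for fact in kept])
```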

architecture

How the Pipeline Works

Five sequential modules transform raw text into clean, structured facts.

parser.py — spaCy Loading

Loads en_core_web_sm, tokenizes the input, and produces the dependency parse and POS tags.

utils.py — Noise Removal

Strips hypothetical phrases, applies negation guards, and deduplicates redundant surface forms.

extractor.py — SVO Extraction

Extracts Subject-Verb-Object triples, runs entity recognition, and binds values to their time expressions.

compressor.py — Orchestration

Chains modules, maintains cross-sentence memory state, resolves coreferences.

__init__.py — Output

Formats results in json, text, or llm mode and attaches version metadata.

vagacore/
│
├── parser.py      # spaCy loading & parsing
├── utils.py       # cleaning & noise removal
├── extractor.py   # SVO + entity extraction
├── compressor.py  # pipeline + memory
└── __init__.py    # exports & version

examples/
├── test_script.py
└── test_stress.py

live example

See It In Action

Multi-sentence input → structured, time-aware facts with confidence scores.

input.py

from vagacore import compress


text = """
Apple reported $81.8 billion in revenue
for Q3 2024.
If Apple sells 1M units, it will reach $900M.
Netflix and Disney reported $1B and $2B
respectively.
"""


result = compress(text, mode="json")
print(result)
        

output.json

{
  "facts": [
    {
      "entity": "Apple",
      "event":  "reported",
      "value":  "$81.8 billion",
      "time":   "Q3 2024",
      "confidence": 0.92
    },
    {
      "entity": "Netflix",
      "event":  "reported",
      "value":  "$1B",
      "time":   "Q3 2024",
      "confidence": 0.88
    }
  ],
  "version": "1.0.1"
}

The hypothetical sentence "If Apple sells 1M units, ..." is skipped, so it produces no fact.
        

output modes

Three Ways to Consume

mode="json"

Structured Facts

Full JSON with entity, event, value, time, and confidence. Ideal for RAG and databases.

mode="text"

Readable Summary

Human-readable bullet summary. Great for reports, dashboards, and quick reviews.

mode="llm"

LLM-Compact

Compressed, token-efficient format for feeding into LLM prompts. Reduces context size.
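To make the difference between the three modes concrete, here is a sketch that renders the same facts three ways. Only the JSON field names come from the example on this page; the text and llm layouts below are assumptions made for illustration, and render is a hypothetical function, not VagaCore's formatter.

```python
import json

# Hypothetical renderer; the exact "text" and "llm" layouts are assumptions.
def render(facts, mode="json"):
    if mode == "json":
        return json.dumps({"facts": facts}, indent=2)
    if mode == "text":
        # One human-readable bullet per fact.
        return "\n".join(
            f"- {f['entity']} {f['event']} {f['value']} ({f['time']})" for f in facts
        )
    if mode == "llm":
        # Compact pipe-delimited records to keep prompt size down.
        return "; ".join(
            "|".join([f["entity"], f["event"], f["value"], f["time"]]) for f in facts
        )
    raise ValueError(f"unknown mode: {mode!r}")

facts = [{"entity": "Apple", "event": "reported",
          "value": "$81.8 billion", "time": "Q3 2024"}]
print(render(facts, mode="text"))
print(render(facts, mode="llm"))
```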

installation

Get Running in 60 Seconds

Python ≥ 3.8 required. Works fully offline after setup — no API keys needed.

Create & activate virtual environment

python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # macOS / Linux

Install spaCy and download language model

pip install -U pip spacy
python -m spacy download en_core_web_sm

Install VagaCore from PyPI

pip install vagacore

Verify installation

python -m pytest # assumes pytest is installed and you are in a source checkout with the tests
python examples/test_script.py

performance

Where It Shines & Where It Doesn't

Honest assessment of current capabilities and known limitations.

✅ Works Well For

Multi-sentence financial documents

Earnings reports and KPI extraction

Temporal references ("same period", "Q3")

Entity-rich content with named orgs

Parallel value structures ("respectively")

Negation and hypothetical filtering

⚠️ Known Limitations

Complex multi-action sentences

Passive voice (partial support only)

English-only (multi-language planned)

Very long documents may need chunking

Sarcasm / implicit negation

Custom entity types (planned)
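For the long-document limitation, one workable approach is to split the text on sentence boundaries into chunks under a size budget and process each chunk separately. The sketch below assumes a character budget and a naive sentence splitter; chunk_text and max_chars are hypothetical names, and context memory would not carry across chunk boundaries with this approach.

```python
import re

# Hypothetical pre-processing helper; not part of the VagaCore API.
def chunk_text(text, max_chars=2000):
    """Group whole sentences into chunks of at most roughly max_chars characters."""
    # Split after ., ! or ? followed by whitespace ("$81.8" is left intact).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("A one. B two. C three.", max_chars=13))
```

A single sentence longer than max_chars is kept whole in its own chunk rather than cut mid-sentence.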

use cases

📄

RAG Systems

Feed clean structured facts instead of raw chunks into your retrieval pipeline.

📈

Financial Analysis

Parse earnings reports, market analysis, and KPI summaries at scale.

🗂️

Knowledge Bases

Index structured facts instead of raw text for better semantic retrieval.
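A small in-memory index shows what indexing structured facts can look like. This sketch groups facts by a lowercased entity key; index_by_entity is a hypothetical helper, and a real knowledge base would likely use a database or vector store instead.

```python
from collections import defaultdict

# Hypothetical helper; assumes facts shaped like the JSON output example.
def index_by_entity(facts):
    """Map each lowercased entity name to the list of facts about it."""
    index = defaultdict(list)
    for fact in facts:
        index[fact["entity"].lower()].append(fact)
    return dict(index)

facts = [
    {"entity": "Apple", "event": "reported", "value": "$81.8 billion", "time": "Q3 2024"},
    {"entity": "Netflix", "event": "reported", "value": "$1B", "time": "Q3 2024"},
]
index = index_by_entity(facts)
print(index["apple"][0]["value"])
```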

📰

News Analysis

Extract entities and metrics from news feeds and build event timelines.
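An event timeline can be built by sorting facts on a key derived from the time field. The sketch below assumes times in the "Q3 2024" form used in the example on this page and places anything else at the end; build_timeline and time_key are hypothetical helpers.

```python
import re

# Hypothetical helpers; only the "Q<n> <year>" time format is handled.
_QUARTER = re.compile(r"Q([1-4])\s+(\d{4})")

def time_key(fact):
    """Return a sortable (year, quarter) key; unknown times sort last."""
    match = _QUARTER.fullmatch(fact.get("time") or "")
    if match:
        return (int(match.group(2)), int(match.group(1)))
    return (9999, 9)

def build_timeline(facts):
    """Return facts ordered chronologically by quarter."""
    return sorted(facts, key=time_key)

facts = [
    {"entity": "Apple", "value": "$81.8 billion", "time": "Q3 2024"},
    {"entity": "Disney", "value": "$2B", "time": None},
    {"entity": "Netflix", "value": "$1B", "time": "Q4 2023"},
]
for fact in build_timeline(facts):
    print(fact["time"], fact["entity"])
```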

changelog

Release History

v1.0.1 Mar 30, 2026

Improved money regex — handles edge cases with multi-unit values

Better subject resolution across complex clause chains

Context memory enhancements for longer documents

v1.0.0 Initial release

Negation and hypothetical filtering

List and "respectively" construct handling

Fact deduplication pipeline

JSON / text / llm output modes

Confidence scoring for all extracted facts

🔮

Planned in v1.1.0

Multi-action sentence support · Passive voice improvements · Multi-language support · Custom entity types · Better confidence scoring
