kotaemon/tests/test_post_processing.py
Nguyen Trung Duc (john) f8b8d86d4e Move LLM-related components into LLM module (#74)
* Move splitter into indexing module
* Rename post_processing module to parsers
* Migrate LLM-specific composite pipelines into llms module

This change moves the `splitters` module into the `indexing` module. The `indexing` module will be created soon to house `indexing`-related components.

This change renames the `post_processing` module to `parsers`. "Post-processing" is a generic term that conveys very little information. In the future, we will add other extractors to the `parsers` module, such as a metadata extractor.

This change migrates the composite elements into the `llms` module. These elements heavily assume that their internal nodes are LLM-specific. As a result, migrating them into the `llms` module makes them more discoverable and simplifies the code base structure.
2023-11-15 16:26:53 +07:00


import pytest

from kotaemon.base import Document
from kotaemon.parsers import RegexExtractor


@pytest.fixture
def regex_extractor():
    return RegexExtractor(
        pattern=r"\d+", output_map={"1": "One", "2": "Two", "3": "Three"}
    )


def test_run_document(regex_extractor):
    document = Document(text="This is a test. 1 2 3")
    extracted_document = regex_extractor(document)[0]
    assert extracted_document.text == "One"
    assert extracted_document.matches == ["One", "Two", "Three"]


def test_run_raw(regex_extractor):
    output = regex_extractor("This is a test. 123")[0]
    assert output.text == "123"
    assert output.matches == ["123"]


def test_run_batch_raw(regex_extractor):
    output = regex_extractor(["This is a test. 123", "456"])
    extracted_text = [each.text for each in output]
    extracted_matches = [each.matches for each in output]
    assert extracted_text == ["123", "456"]
    assert extracted_matches == [["123"], ["456"]]
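
The assertions in these tests are consistent with a "find every match, then map it" behaviour. The sketch below illustrates that idea with the standard library only. It is not kotaemon's implementation: `extract_with_map` is a hypothetical helper, and leaving unmapped matches unchanged is an assumption inferred from the `"123"` case in the tests.

```python
import re


def extract_with_map(text, pattern, output_map):
    """Hypothetical helper: find all regex matches, then map each one."""
    matches = re.findall(pattern, text)
    # Assumption: a match with no entry in output_map is kept unchanged.
    return [output_map.get(match, match) for match in matches]


output_map = {"1": "One", "2": "Two", "3": "Three"}

# Separate digits each have an entry in the map.
mapped_separate = extract_with_map("This is a test. 1 2 3", r"\d+", output_map)

# "123" is a single match with no entry in the map, so it passes through.
mapped_joined = extract_with_map("This is a test. 123", r"\d+", output_map)
```

Because `\d+` is greedy, `"123"` is matched as one token rather than three, which is why the raw-string tests expect `["123"]` and not `["One", "Two", "Three"]`.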