Since the only usage of prompts is within LLMs, it is reasonable to keep them within the LLM module. This way, the module is easier to discover, and the code base is less complicated.
Changes:
* Move prompt components into llms
* Bump version 0.3.1
* Make pip install dependencies in eager mode
---------
Co-authored-by: ian <ian@cinnamon.is>
This change speeds up OCR extraction by allowing OCR to be bypassed for text that is irrelevant (i.e. not in a table).
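A minimal sketch of the idea, assuming hypothetical `text_blocks` with bounding boxes and detected `table_regions` (all names here are illustrative, not the actual API):

```python
def intersects(box_a, box_b):
    """Check whether two (x0, y0, x1, y1) boxes overlap."""
    return not (
        box_a[2] < box_b[0]
        or box_b[2] < box_a[0]
        or box_a[3] < box_b[1]
        or box_b[3] < box_a[1]
    )

def extract_table_text(text_blocks, table_regions, run_ocr):
    """Run the expensive OCR call only on blocks that fall inside a table."""
    results = []
    for block in text_blocks:
        if any(intersects(block["bbox"], table) for table in table_regions):
            results.append(run_ocr(block))
        # blocks outside every table region are skipped entirely
    return results
```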
---------
Co-authored-by: Nguyen Trung Duc (john) <trungduc1992@gmail.com>
This change removes `BaseComponent`'s:
- run_raw
- run_batch_raw
- run_document
- run_batch_document
- is_document
- is_batch
Each component is expected to support multiple types of inputs and a single type of output. Since we want components to work out-of-the-box with both standardized and customized use cases, supporting multiple input types is expected. At the same time, to reduce the complexity of understanding how to use a component, we restrict each component to a single output type.
To accommodate these changes, we also refactor some components to remove their run_raw, run_batch_raw... methods and to decide on a common output interface for each of those components.
Tests are updated accordingly.
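As a hedged sketch of this convention (the class and the `run` signature below are illustrative, not the actual `BaseComponent` API): a component accepts several input types through one entry point but commits to a single output type:

```python
from typing import Union

class Document:
    """Stand-in for the library's Document type in this sketch."""
    def __init__(self, text: str):
        self.text = text

class TextEcho:
    """Hypothetical component: accepts multiple input types (str or Document)
    but always returns a single output type (str)."""
    def run(self, text: Union[str, Document]) -> str:
        # normalize every supported input type to the single output type
        if isinstance(text, Document):
            return text.text
        return text
```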
Commit changes:
* Add kwargs to vector store's query
* Simplify the BaseComponent
* Update tests
* Remove support for Python 3.8 and 3.9
* Bump version 0.3.0
* Fix GitHub PR caching still using the old environment after bumping the version
---------
Co-authored-by: ian <ian@cinnamon.is>
By allowing the UI outputs to be specified in the code, any time the user runs `kh export ...`, the outputs specified in the code will be included in the UI YAML file. Otherwise, any time the user runs `kh export ...`, the output section in the UI YAML file would be reset to the default output.
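A hedged sketch of the intent (the `outputs` attribute name and its entry format are assumptions, not the actual promptui schema):

```python
class QAPipeline:
    # Hypothetical: outputs declared on the pipeline class are picked up by
    # `kh export ...` and written into the UI YAML file, instead of the
    # output section being reset to the default.
    outputs = [
        {"step": "answer", "component": "text"},
        {"step": "sources", "component": "json"},
    ]
```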
* add base Tool
* minor update test_tool
* update test dependency
* update test dependency
* Fix namespace conflict
* update test
* add base Agent Interface, add ReWoo Agent
* minor update
* update test
* fix typo
* remove unneeded print
* update rewoo agent
* add LLMTool
* update BaseAgent type
* add ReAct agent
* add ReAct agent
* minor update
* minor update
* minor update
* minor update
* update base reader with BaseComponent
* add splitter
* update agent and tool
* update vectorstores
* update load/save for indexing and retrieving pipeline
* update test_agent for more use-cases
* add missing dependency for test
* update test case for in memory vectorstore
* add TextSplitter to BaseComponent
* update type hint basetool
---------
Co-authored-by: trducng <trungduc1992@gmail.com>
* add test case for Chroma save/load
* minor name change
* add delete_collection support for chroma
* move save load to chroma
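A rough usage sketch of the resulting interface (the import path, constructor, and `add` signature are assumptions; `save`, `load`, and `delete_collection` are the methods the commits above describe):

```python
from kotaemon.vectorstores import ChromaVectorStore  # import path is an assumption

store = ChromaVectorStore(path="./chroma_db")
store.add(embeddings=[[0.1, 0.2, 0.3]], ids=["doc-1"])
store.save("./chroma_db")      # persist the collection to disk
store.load("./chroma_db")      # restore it from disk
store.delete_collection()      # drop the whole collection
```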
---------
Co-authored-by: Nguyen Trung Duc (john) <john@cinnamon.is>
This CL implements:
- The logic to export logs to Excel.
- Routing of the export logic in the UI.
- A demonstration of this functionality in the `./examples/promptui` project.
From pipeline > config > UI. Provides an example project for promptui.
- Pipeline to config: `kotaemon.contribs.promptui.config.export_pipeline_to_config`. The config follows the schema specified in this document: https://cinnamon-ai.atlassian.net/wiki/spaces/ATM/pages/2748711193/Technical+Detail. Note: this implementation excludes the logs, which will be handled in AUR-408.
- Config to UI: `kotaemon.contribs.promptui.build_from_yaml`
- Example project is located at `examples/promptui/`
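Putting the two entry points together, the intended flow is roughly as follows (passing the pipeline class and the YAML path this way is an assumption; see `examples/promptui/` for the real usage):

```python
from kotaemon.contribs.promptui.config import export_pipeline_to_config
from kotaemon.contribs.promptui import build_from_yaml

from my_project.pipelines import MyPipeline  # your own pipeline (illustrative)

# Pipeline -> config: dump the pipeline into a UI YAML file
export_pipeline_to_config(MyPipeline, "promptui.yml")

# Config -> UI: build the promptui interface from that YAML file
app = build_from_yaml("promptui.yml")
```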
* [AUR-362] Add In-memory vector store
* [AUR-362] fix the delete function's input format
* [AUR-362] revise `persist` and `from_persist_path` to `save` and `load`
* [AUR-362] rename simple.py to in_memory.py
Document store handles storing and indexing Documents. It supports the following interfaces:
- add: add one or more documents to the document store
- get: get a list of documents
- get_all: get all documents in the document store
- delete: delete one or more documents
- save: persist a document store to disk
- load: load a document store from disk
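A minimal sketch of a conforming implementation (the class name and internals are illustrative, not the package's actual code):

```python
import json
from typing import Optional

class InMemoryDocumentStore:
    """Illustrative document store implementing the interface above."""

    def __init__(self):
        self._docs: dict = {}

    def add(self, docs: list, ids: Optional[list] = None):
        """Add one or more documents, keyed by id."""
        ids = ids or [str(len(self._docs) + i) for i in range(len(docs))]
        for id_, doc in zip(ids, docs):
            self._docs[id_] = doc

    def get(self, ids: list) -> list:
        """Get a list of documents by their ids."""
        return [self._docs[id_] for id_ in ids]

    def get_all(self) -> list:
        """Get all documents in the store."""
        return list(self._docs.values())

    def delete(self, ids: list):
        """Delete one or more documents by id."""
        for id_ in ids:
            del self._docs[id_]

    def save(self, path: str):
        """Persist the store to disk as JSON."""
        with open(path, "w") as f:
            json.dump(self._docs, f)

    def load(self, path: str):
        """Load a previously saved store from disk."""
        with open(path) as f:
            self._docs = json.load(f)
```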
Design the base interface of the vector store and apply it to the Chroma Vector Store (wrapped around llama_index's implementation). Provide pipelines to populate and retrieve from the vector store.
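As a hedged sketch, the base interface might look like this (the method names follow the description above and the `query` kwargs commit earlier; the exact signatures are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Optional

class BaseVectorStore(ABC):
    """Illustrative base interface; the concrete Chroma store wraps
    llama_index's implementation."""

    @abstractmethod
    def add(
        self,
        embeddings: list[list[float]],
        ids: Optional[list[str]] = None,
    ) -> list[str]:
        """Store embeddings and return their ids."""

    @abstractmethod
    def query(
        self, embedding: list[float], top_k: int = 1, **kwargs
    ) -> tuple[list[list[float]], list[float], list[str]]:
        """Return the (embeddings, similarity scores, ids) of the top_k nearest vectors."""

    @abstractmethod
    def delete(self, ids: list[str]):
        """Remove embeddings by id."""
```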
This change provides the base interface of an embedding and wraps Langchain's OpenAI embedding. Usage is as follows:
```python
from kotaemon.embeddings import AzureOpenAIEmbeddings

model = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    deployment="embedding-deployment",
    openai_api_base="https://test.openai.azure.com/",
    openai_api_key="some-key",
)
output = model("Hello world")
```
- Use cases related to LLM call: https://cinnamon-ai.atlassian.net/browse/AUR-388?focusedCommentId=34873
- Sample usages: `test_llms_chat_models.py` and `test_llms_completion_models.py`:
```python
from kotaemon.llms.chats.openai import AzureChatOpenAI

model = AzureChatOpenAI(
    openai_api_base="https://test.openai.azure.com/",
    openai_api_key="some-key",
    openai_api_version="2023-03-15-preview",
    deployment_name="gpt35turbo",
    temperature=0,
    request_timeout=60,
)
output = model("hello world")
```
For the LLM-call component, I decided to wrap around Langchain's LLM models and Langchain's Chat models, and set the interface as follows:
- Completion LLM component:
```python
class CompletionLLM:
    def run_raw(self, text: str) -> LLMInterface:
        """Run text completion: str in -> LLMInterface out"""

    def run_batch_raw(self, text: list[str]) -> list[LLMInterface]:
        """Run text completion in batch: list[str] in -> list[LLMInterface] out"""

    # run_document and run_batch_document just reuse run_raw and run_batch_raw,
    # due to unclear use cases
```
- Chat LLM component:
```python
class ChatLLM:
    def run_raw(self, text: str) -> LLMInterface:
        """Run chat completion (no chat history): str in -> LLMInterface out"""

    def run_batch_raw(self, text: list[str]) -> list[LLMInterface]:
        """Run chat completion in batch mode (no chat history):
        list[str] in -> list[LLMInterface] out"""

    def run_document(self, text: list[BaseMessage]) -> LLMInterface:
        """Run chat completion (with chat history):
        list of langchain's BaseMessage in -> LLMInterface out"""

    def run_batch_document(self, text: list[list[BaseMessage]]) -> list[LLMInterface]:
        """Run chat completion in batch mode (with chat history):
        list[list[BaseMessage]] in -> list[LLMInterface] out"""
```
- The LLMInterface is as follows:
```python
from dataclasses import dataclass, field

@dataclass
class LLMInterface:
    text: list[str]
    completion_tokens: int = -1
    total_tokens: int = -1
    prompt_tokens: int = -1
    logits: list[list[float]] = field(default_factory=list)
```