Improve kotaemon based on insights from projects (#147)

- Include static files in the package.
- More reliable information panel. Faster & not breaking randomly.
- Add directory upload.
- Enable zip file to upload.
- Allow setting endpoint for the OCR reader using environment variable.
2024-02-28 22:18:29 +07:00
parent e1cf970a3d
commit 033e7e05cc
18 changed files with 618 additions and 56 deletions
--- a/docs/pages/app/customize-flows.md
+++ b/docs/pages/app/customize-flows.md
@@ -36,15 +36,6 @@ class SoSimple(BaseComponent):
        return self.arg1 * self.arg2 + arg3
 ```

-This pipeline is named `SoSimple`. It takes `arg1` and `arg2` as init argument.
-It takes `arg3` as run argument.
-
-```python
->> pipeline = SoSimple(arg1=2, arg2="ha")
->> pipeline("x")
-hahax
-```
-
 This pipeline is simple for demonstration purpose, but we can imagine pipelines
 with much more arguments, that can take other pipelines as arguments, and have
 more complicated logic in the `run` method.
@@ -52,6 +43,9 @@ more complicated logic in the `run` method.
 **_An indexing or reasoning pipeline is just a class subclass from
 `BaseComponent` like above._**

+For more detail on this topic, please refer to [Creating a
+Component](/create-a-component/)
+
 ## Run signatures

 **Note**: this section is tentative at the moment. We will finalize `def run`
@@ -97,7 +91,7 @@ file. This file locates at the current working directory where you start the
 ktem. In most use cases, it is this
 [one](https://github.com/Cinnamon/kotaemon/blob/main/libs/ktem/flowsettings.py).

-```
+```python
 KH_REASONING = ["<python.module.path.to.the.reasoning.class>"]

 KH_INDEX = "<python.module.path.to.the.indexing.class>"
@@ -121,7 +115,7 @@ In your pipeline class, add a classmethod `get_user_settings` that returns a
 setting dictionary, add a classmethod `get_info` that returns an info
 dictionary. Example:

-```
+```python
 class SoSimple(BaseComponent):

    ... # as above
--- a/docs/pages/app/customize-ui.md
+++ b/docs/pages/app/customize-ui.md
--- a/docs/pages/app/features.md
+++ b/docs/pages/app/features.md
@@ -0,0 +1,8 @@
+## Chat
+
+Kotaemon focuses on question answering over a corpus of data. Below
+is a gentle introduction to the chat functionality.
+
+- Users can upload a corpus of files.
+- Users can converse with the chatbot to ask questions about the corpus of files.
+- Users can view the references in the files.
--- a/docs/pages/app/functional-description.md
+++ b/docs/pages/app/functional-description.md
@@ -0,0 +1,366 @@
+## User group / tenant management
+
+### Create new user group
+
+(6 man-days)
+
+**Description**: each client has a dedicated user group. Each user group has an
+admin user who can do administrative tasks (e.g. creating user accounts in that
+user group). The workflow for creating a new user group is as follows:
+
+1. Cinnamon accesses the user group management UI.
+2. On "Create user group" panel, we supply:
+   a. Client name: e.g. Apple.
+   b. Sub-domain name: e.g. apple.
+   c. Admin email, username & password.
+3. The system will:
+   a. Create an Aurora Platform deployment with the specified sub-domain.
+   b. Send an email to the admin, with the username & password.
+
+**Expectation**:
+
+- The admin can go to the deployed Aurora Platform.
+- The admin can login with the specified username & password.
+
+**Condition**:
+
+- When the sub-domain name already exists, raise an error.
+- If sending the email to the client fails, raise the error and delete the
+  newly-created user group.
+- Password rule:
+  - Have at least 8 characters.
+  - Must contain uppercase, lowercase, number and symbols.
+
+---
+
+### Delete user group
+
+(2 man-days)
+
+**Description**: in the tenant management page, we can delete the selected user
+group. The user flow is as follows:
+
+1. Cinnamon accesses the user group management UI.
+2. View list of user groups.
+3. Next to target user group, click delete.
+4. Confirm whether to delete.
+5. If Yes, delete the user group. If No, cancel the operation.
+
+**Expectation**: when a user group is deleted, we expect to delete everything
+related to that user group: domain, files, databases, caches, deployments.
+
+## User management
+
+---
+
+### Create user account (for admin user)
+
+(1 man-day)
+
+**Description**: the admin user in the client's account can create user accounts
+for that user group. To create a new user, the client admin does the following:
+
+1. Navigate to "Admin" > "Users"
+2. In the "Create user" panel, supply:
+   - Username
+   - Password
+   - Confirm password
+3. Click "Create"
+
+**Expectation**:
+
+- The user can create the account.
+- The username:
+  - Is case-insensitive (e.g. Moon and moon will be the same)
+  - Can only contain these characters: a-z A-Z 0-9 \_ + - .
+  - Has a maximum length of 32 characters
+- The password is subject to the following rules:
+  - 8-character minimum length
+  - Contains at least 1 number
+  - Contains at least 1 lowercase letter
+  - Contains at least 1 uppercase letter
+  - Contains at least 1 special character from the following set, or a
+    non-leading, non-trailing space character: `` ^ $ * . [ ] { } ( ) ? - " ! @ # % & / \ , > < ' : ; | _ ~ ` + = ``
+
+---
+
+### Delete user account (for admin user)
+
+**Description**: the admin user in the client's account can delete user accounts.
+Once a user account is deleted, the user can no longer log in to Aurora Platform.
+
+1. The admin user navigates to "Admin" > "Users".
+2. In the user list panel, next to the username, the admin clicks on the "Delete"
+   button. The Confirmation dialog appears.
+3. If "Delete", the user account is deleted. If "Cancel", do nothing. The
+   Confirmation dialog disappears.
+
+**Expectation**:
+
+- Once the user is deleted, the following information relating to the user will
+  be deleted:
+  - His/her personal setting.
+  - His/her conversations.
+- The following information relating to the user will still be retained:
+  - His/her uploaded files.
+
+---
+
+### Edit user account (for admin user)
+
+**Description**: the admin user can change any information about the user
+account, including password. To change user information:
+
+1. The admin user navigates to "Admin" > "Users".
+2. In the user list panel, next to the username, the admin clicks on the "Edit"
+   button.
+3. The user list disappears and the user detail appears, showing the following
+   information:
+   - Username: (prefilled the username)
+   - Password: (blank)
+   - Confirm password: (blank)
+4. The admin can edit any of the information, and click "Save" or "Cancel".
+   - If "Save": the information will be updated to the database, or show
+     error per Expectation below.
+   - If "Cancel": skip.
+5. If Save succeeds or on Cancel, return to the user list UI, where the user
+   information is updated accordingly.
+
+**Expectation**:
+
+- If the "Password" & "Confirm password" are different from each other, show
+  error: "Password mismatch".
+- If both "Password" & "Confirm password" are blank, don't change the user
+  password.
+- If changing the password, the new password is subject to the same rules as
+  when creating a user.
+- It's possible to change username. If changing username, the target user has to
+  use the new username.
+
+---
+
+### Sign-in
+
+(3 man-days)
+
+**Description**: users can sign in to Aurora Platform as follows:
+
+1. User navigates to the URL.
+2. If the user is not logged in, the UI just shows the login screen.
+3. User types username & password.
+4. If correct, the user will proceed to normal working UI.
+5. If incorrect, the login screen shows text error.
+
+---
+
+### Sign-out
+
+(1 man-day)
+
+**Description**: the user can sign out of Aurora Platform as follows:
+
+1. User navigates to the Settings > User page.
+2. User clicks on logout.
+3. The user is signed out to the UI login screen.
+
+**Expectation**: the user is completely signed out. Next time he/she uses the
+Aurora Platform, he/she has to log in again.
+
+---
+
+### Change password
+
+**Description**: the user can change their password as follows:
+
+1. User navigates to the Settings > User page.
+2. In the change password section, the user provides the following information
+   and clicks Change:
+   - Current password
+   - New password
+   - Confirm new password
+3. If changing successfully, then the password is changed. Otherwise, show the
+   error on the UI.
+
+**Expectation**:
+
+- If changing password succeeds, next time they logout/login to the system, they
+  can use the new password.
+- Password rule (Same as normal password rule when creating user)
+- Errors:
+  - Password does not match.
+  - Violated password rules.
+
+---
+
+## Chat
+
+### Chat to the bot
+
+**Description**: the Aurora Platform focuses on question answering over the
+uploaded data. Each chat has the following components:
+
+- Chat message: show the exchange between bots and humans.
+- Text input + send button: for the user to input the message.
+- Data source panel: for selecting the files that will scope the context for the
+  bot.
+- Information panel: showing evidence as the bot answers user's questions.
+
+The chat workflow is as follows:
+
+1. [Optional] User selects files that they want to scope the context for the bot.
+   If the user doesn't select any files, then all files on Aurora Platform will
+   be the context for the bot.
+   - The user can type multi-line messages, using "Shift + Enter" for
+     line-break.
+2. User sends the message (either clicking the Send button or hitting the Enter
+   key).
+3. The bot in the chat conversation will return "Thinking..." while it
+   processes.
+4. The information panel on the right begins to show data related to the user
+   message.
+5. The bot begins to generate the answer. The "Thinking..." placeholder disappears.
+
+**Expectation**:
+
+- Messages:
+  - User can send multi-line messages, using "Shift + Enter" for line-break.
+  - User can thumbs up, thumbs down the AI response. This information is
+    recorded in the database.
+  - User can click on a copy button on the chat message to copy the content to
+    clipboard.
+- Information panel:
+  - The information panel shows the latest evidence.
+  - The user can click on the message, and the reference for that message will
+    show up on the "Reference panel" (feature in-planning).
+  - The user can click on the title to show/hide the content.
+  - The whole information panel can be collapsed.
+- Chatbot quality:
+  - The user can converse with the bot. The bot answers the user's requests in a
+    natural manner.
+  - The bot message should be streamed to the UI. The bot doesn't wait to gather
+    the whole text response and then dump it all at once.
+
+### Conversation - switch
+
+**Description**: users can jump around between different conversations. They can
+see the list of all conversations, select an old conversation, and continue
+the chat under the context of the old conversation. The switching workflow is
+as follows:
+
+1. Users click on the conversation dropdown. It will show a list of
+   conversations.
+2. Within that dropdown, the user selects one conversation.
+3. The chat messages, information panel, and selected data will show the content
+   in that old chat.
+4. The user can continue chatting as normal under the context of this old chat.
+
+**Expectation**:
+
+- In the conversation drop down list, the conversations are ordered by creation
+  date.
+- When there is no conversation, the conversation list is empty.
+- When there is no conversation, the user can still converse with the chat bot.
+  When doing so, it automatically creates a new conversation.
+
+### Conversation - create
+
+**Description**: the user can explicitly start a new conversation with the
+chatbot:
+
+1. User clicks on the "New" button.
+2. The new conversation is automatically created.
+
+**Expectation**:
+
+- The default conversation name is the current datetime.
+- It becomes selected.
+- It is added to the conversation list.
+
+### Conversation - rename
+
+**Description**: user can rename the conversation by typing the name and clicking
+the Rename button next to it.
+
+- If rename succeeds: the name shown in the 1st dropdown will change accordingly.
+- If rename doesn't succeed: show an error message in red below the rename section.
+
+**Condition**:
+
+- Name constraint:
+  - Min characters: 1
+  - Max characters: 40
+  - Cannot have the same name as an existing conversation of the same
+    user.
+
+### Conversation - delete
+
+**Description**: user can delete the existing conversation as follows:
+
+1. Click on Delete button.
+2. The UI shows a confirmation with 2 buttons:
+   - Delete
+   - Cancel.
+3. If Delete, delete the conversation, switch to the next oldest conversation,
+   close the confirmation panel.
+4. If cancel, just close the confirmation panel.
+
+## File management
+
+The file management allows users to upload, list and delete files that they
+upload to the Aurora Platform.
+
+### Upload file
+
+**Description**: the user can upload files to the Aurora Platform. The uploaded
+files will serve as context for our chatbot to refer to when it converses
+with the user. To upload files, the user:
+
+1. Navigate to the File tab.
+2. Within the File tab, there is an Upload section.
+3. User can add files to the Upload section through drag & drop, and/or by clicking
+   on the file browser.
+4. User can select some options relating to uploading and indexing. Depending on
+   the project, these options can be different. Nevertheless, they are discussed
+   below.
+5. User clicks on the "Upload and Index" button.
+6. The app shows notifications in the top right corner when indexing starts and
+   finishes, and when errors happen.
+
+**Options**:
+
+- Force re-index file. When the user tries to upload files that already exist on
+  the system:
+  - If this option is True: will re-index those files.
+  - If this option is False: will skip indexing those files.
+
+**Condition**:
+
+- Max number of files: 100 files.
+- Max number of pages per file: 500 pages
+- Max file size: 10 MB
+
+### List all files
+
+**Description**: the user can know which files are on the system by:
+
+1. Navigate to the File tab.
+2. By default, it will show all the uploaded files, each with the following
+   information: file name, file size, number of pages, uploaded date
+3. The UI also shows the total number of pages and the total size in MB.
+
+### Delete file
+
+**Description**: users can delete files from this UI to free up the space, or to
+remove outdated information. To remove the files:
+
+1. User navigates to the File tab.
+2. In the list of files, next to each file, there is a Delete button.
+3. The user clicks on the Delete button. A confirmation dialog appears.
+4. If Delete, delete the file. If Cancel, close the confirmation dialog.
+
+**Expectation**: once the file is deleted:
+
+- The database entry of that file is deleted.
+- The file is removed from "Chat - Data source".
+- The total number of pages and MB sizes are reduced accordingly.
+- The reference to the file in the information panel is still retained.
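Note on the password rule in functional-description.md above: the rule listed under "Create user account" could be checked roughly as in the sketch below. This is only an illustration and is not part of the commit; the names `SPECIAL_CHARS` and `password_violations` are hypothetical, and the interpretation of "number" as any character for which `str.isdigit()` is true is an assumption.

```python
# Hypothetical helper, not part of this commit: checks a password against the
# rule listed in functional-description.md.
SPECIAL_CHARS = set("^$*.[]{}()?-\"!@#%&/\\,><':;|_~`+=")


def password_violations(password: str) -> list:
    """Return a list of human-readable rule violations (empty if valid)."""
    problems = []
    if len(password) < 8:
        problems.append("shorter than 8 characters")
    if not any(c.isdigit() for c in password):
        problems.append("no number")
    if not any(c.islower() for c in password):
        problems.append("no lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    # A space counts as a special character only when it is neither the
    # first nor the last character of the password.
    has_special = any(c in SPECIAL_CHARS for c in password)
    has_inner_space = " " in password.strip(" ")
    if not (has_special or has_inner_space):
        problems.append("no special character")
    return problems


if __name__ == "__main__":
    for candidate in ["Passw0rd!", "password", " Abcdefg1"]:
        print(repr(candidate), password_violations(candidate))
```

Returning the list of violations rather than a boolean makes it straightforward to surface the "Violated password rules" error described in the "Change password" section.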
--- a/libs/kotaemon/kotaemon/indices/ingests/files.py
+++ b/libs/kotaemon/kotaemon/indices/ingests/files.py
@@ -67,7 +67,7 @@ class DocumentIngestor(BaseComponent):

        main_reader = DirectoryReader(
            input_files=input_files,
-            file_extractor=file_extractors,  # type: ignore
+            file_extractor=file_extractors,
        )

        return main_reader
@@ -85,7 +85,9 @@ class DocumentIngestor(BaseComponent):
            file_paths = [file_paths]

        documents = self._get_reader(input_files=file_paths)()
+        print(f"Read {len(file_paths)} files into {len(documents)} documents.")
        nodes = self.text_splitter(documents)
+        print(f"Transform {len(documents)} documents into {len(nodes)} nodes.")
        self.log_progress(".num_docs", num_docs=len(nodes))

        # document parsers call
--- a/libs/kotaemon/kotaemon/indices/vectorindex.py
+++ b/libs/kotaemon/kotaemon/indices/vectorindex.py
@@ -59,12 +59,15 @@ class VectorIndexing(BaseIndexing):
                    f"Invalid input type {type(item)}, should be str or Document"
                )

+        print(f"Getting embeddings for {len(input_)} nodes")
        embeddings = self.embedding(input_)
+        print("Adding embeddings to vector store")
        self.vector_store.add(
            embeddings=embeddings,
            ids=[t.doc_id for t in input_],
        )
        if self.doc_store:
+            print("Adding documents to doc store")
            self.doc_store.add(input_)


--- a/libs/kotaemon/kotaemon/loaders/ocr_loader.py
+++ b/libs/kotaemon/kotaemon/loaders/ocr_loader.py
@@ -1,18 +1,34 @@
+import logging
+import os
 from pathlib import Path
 from typing import List, Optional
 from uuid import uuid4

 import requests
 from llama_index.readers.base import BaseReader
+from tenacity import after_log, retry, stop_after_attempt, wait_fixed, wait_random

 from kotaemon.base import Document

 from .utils.pdf_ocr import parse_ocr_output, read_pdf_unstructured
 from .utils.table import strip_special_chars_markdown

+logger = logging.getLogger(__name__)
+
 DEFAULT_OCR_ENDPOINT = "http://127.0.0.1:8000/v2/ai/infer/"


+@retry(
+    stop=stop_after_attempt(3),
+    wait=wait_fixed(5) + wait_random(0, 2),
+    after=after_log(logger, logging.DEBUG),
+)
+def tenacious_api_post(url, **kwargs):
+    resp = requests.post(url=url, **kwargs)
+    resp.raise_for_status()
+    return resp
+
+
 class OCRReader(BaseReader):
    """Read PDF using OCR, with high focus on table extraction

@@ -24,17 +40,20 @@ class OCRReader(BaseReader):
        ```

    Args:
-        endpoint: URL to FullOCR endpoint. Defaults to
+        endpoint: URL to FullOCR endpoint. If not provided, will look for
+            environment variable `OCR_READER_ENDPOINT` or use the default
            `kotaemon.loaders.ocr_loader.DEFAULT_OCR_ENDPOINT`
            (http://127.0.0.1:8000/v2/ai/infer/)
        use_ocr: whether to use OCR to read text (e.g: from images, tables) in the PDF
            If False, only the table and text within table cells will be extracted.
    """

-    def __init__(self, endpoint: str = DEFAULT_OCR_ENDPOINT, use_ocr=True):
+    def __init__(self, endpoint: Optional[str] = None, use_ocr=True):
        """Init the OCR reader with OCR endpoint (FullOCR pipeline)"""
        super().__init__()
-        self.ocr_endpoint = endpoint
+        self.ocr_endpoint = endpoint or os.getenv(
+            "OCR_READER_ENDPOINT", DEFAULT_OCR_ENDPOINT
+        )
        self.use_ocr = use_ocr

    def load_data(
@@ -62,7 +81,7 @@ class OCRReader(BaseReader):
                ocr_results = kwargs["response_content"]
            else:
                # call original API
-                resp = requests.post(url=self.ocr_endpoint, files=files, data=data)
+                resp = tenacious_api_post(url=self.ocr_endpoint, files=files, data=data)
                ocr_results = resp.json()["result"]

        debug_path = kwargs.pop("debug_path", None)
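Note on the endpoint lookup in ocr_loader.py above: the precedence introduced in `OCRReader.__init__` (explicit argument, then the `OCR_READER_ENDPOINT` environment variable, then `DEFAULT_OCR_ENDPOINT`) can be isolated as in the sketch below. The standalone function `resolve_ocr_endpoint` is hypothetical and exists only to illustrate the expression used in the diff.

```python
import os
from typing import Optional

DEFAULT_OCR_ENDPOINT = "http://127.0.0.1:8000/v2/ai/infer/"


def resolve_ocr_endpoint(endpoint: Optional[str] = None) -> str:
    """Hypothetical standalone version of the lookup in OCRReader.__init__."""
    # An explicit, non-empty argument wins; otherwise fall back to the
    # environment variable, and finally to the module-level default.
    return endpoint or os.getenv("OCR_READER_ENDPOINT", DEFAULT_OCR_ENDPOINT)


if __name__ == "__main__":
    os.environ.pop("OCR_READER_ENDPOINT", None)
    print(resolve_ocr_endpoint())  # falls back to DEFAULT_OCR_ENDPOINT
    os.environ["OCR_READER_ENDPOINT"] = "http://ocr.example:9000/infer/"
    print(resolve_ocr_endpoint())  # picks up the environment variable
    print(resolve_ocr_endpoint("http://explicit.example/"))  # argument wins
```

One consequence of this expression: if `OCR_READER_ENDPOINT` is set to an empty string, `os.getenv` returns that empty string rather than the default, so the resolved endpoint is empty.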
--- a/libs/kotaemon/pyproject.toml
+++ b/libs/kotaemon/pyproject.toml
@@ -26,6 +26,7 @@ dependencies = [
    "click",
    "pandas",
    "trogon",
+    "tenacity",
 ]
 readme = "README.md"
 license = { text = "MIT License" }
--- a/libs/ktem/MANIFEST.in
+++ b/libs/ktem/MANIFEST.in
@@ -0,0 +1,4 @@
+include ktem/assets/css/*.css
+include ktem/assets/img/*.svg
+include ktem/assets/js/*.js
+include ktem/assets/md/*.md
--- a/libs/ktem/ktem/assets/css/main.css
+++ b/libs/ktem/ktem/assets/css/main.css
@@ -44,3 +44,16 @@ footer {
 mark {
  background-color: #1496bb;
 }
+
+
+/* clpse */
+.clpse {
+  background-color: var(--background-fill-secondary);
+  font-weight: bold;
+  cursor: pointer;
+  padding: 3px;
+  width: 100%;
+  border: none;
+  text-align: left;
+  outline: none;
+}
--- a/libs/ktem/ktem/assets/js/main.js
+++ b/libs/ktem/ktem/assets/js/main.js
@@ -4,3 +4,16 @@ main_parent.childNodes[0].classList.add("header-bar");
 main_parent.style = "padding: 0; margin: 0";
 main_parent.parentNode.style = "gap: 0";
 main_parent.parentNode.parentNode.style = "padding: 0";
+
+
+// clpse
+globalThis.clpseFn = (id) => {
+  var obj = document.getElementById('clpse-btn-' + id);
+  obj.classList.toggle("clpse-active");
+  var content = obj.nextElementSibling;
+  if (content.style.display === "none") {
+    content.style.display = "block";
+  } else {
+    content.style.display = "none";
+  }
+}
--- a/libs/ktem/ktem/indexing/file.py
+++ b/libs/ktem/ktem/indexing/file.py
@@ -16,7 +16,6 @@ from ktem.components import (
 )
 from ktem.db.models import Index, Source, SourceTargetRelation, engine
 from ktem.indexing.base import BaseIndexing, BaseRetriever
-from ktem.indexing.exceptions import FileExistsError
 from llama_index.vector_stores import (
    FilterCondition,
    FilterOperator,
@@ -241,7 +240,7 @@ class IndexDocumentPipeline(BaseIndexing):
            to_index.append(abs_path)

        if errors:
-            raise FileExistsError(
+            print(
                "Files already exist. Please rename/remove them or enable reindex.\n"
                f"{errors}"
            )
@@ -258,14 +257,18 @@ class IndexDocumentPipeline(BaseIndexing):

        # extract the files
        nodes = self.file_ingestor(to_index)
+        print("Extracted", len(to_index), "files into", len(nodes), "nodes")
        for node in nodes:
            file_path = str(node.metadata["file_path"])
            node.source = file_to_source[file_path].id

        # index the files
+        print("Indexing the files into vector store")
        self.indexing_vector_pipeline(nodes)
+        print("Finishing indexing the files into vector store")

        # persist to the index
+        print("Persisting the vector and the document into index")
        file_ids = []
        with Session(engine) as session:
            for source in file_to_source.values():
@@ -291,6 +294,8 @@ class IndexDocumentPipeline(BaseIndexing):
                session.add(index)
            session.commit()

+        print("Finishing persisting the vector and the document into index")
+        print(f"{len(nodes)} nodes are indexed")
        return nodes, file_ids

    def get_user_settings(self) -> dict:
--- a/libs/ktem/ktem/pages/chat/init.py
+++ b/libs/ktem/ktem/pages/chat/init.py
@@ -4,9 +4,16 @@ from ktem.app import BasePage
 from .chat_panel import ChatPanel
 from .control import ConversationControl
 from .data_source import DataSource
-from .events import chat_fn, index_fn, is_liked, load_files, update_data_source
+from .events import (
+    chat_fn,
+    index_files_from_dir,
+    index_fn,
+    is_liked,
+    load_files,
+    update_data_source,
+)
 from .report import ReportIssue
-from .upload import FileUpload
+from .upload import DirectoryUpload, FileUpload


 class ChatPage(BasePage):
@@ -20,12 +27,13 @@ class ChatPage(BasePage):
                self.chat_control = ConversationControl(self._app)
                self.data_source = DataSource(self._app)
                self.file_upload = FileUpload(self._app)
+                self.dir_upload = DirectoryUpload(self._app)
                self.report_issue = ReportIssue(self._app)
            with gr.Column(scale=6):
                self.chat_panel = ChatPanel(self._app)
            with gr.Column(scale=3):
                with gr.Accordion(label="Information panel", open=True):
-                    self.info_panel = gr.Markdown(elem_id="chat-info-panel")
+                    self.info_panel = gr.HTML(elem_id="chat-info-panel")

    def on_register_events(self):
        self.chat_panel.submit_btn.click(
@@ -141,6 +149,17 @@ class ChatPage(BasePage):
            outputs=[self.file_upload.file_output, self.data_source.files],
        )

+        self.dir_upload.upload_button.click(
+            fn=index_files_from_dir,
+            inputs=[
+                self.dir_upload.path,
+                self.dir_upload.reindex,
+                self.data_source.files,
+                self._app.settings_state,
+            ],
+            outputs=[self.dir_upload.file_output, self.data_source.files],
+        )
+
        self._app.app.load(
            lambda: gr.update(choices=load_files()),
            inputs=None,
--- a/libs/ktem/ktem/pages/chat/events.py
+++ b/libs/ktem/ktem/pages/chat/events.py
@@ -2,7 +2,7 @@ import asyncio
 import os
 import tempfile
 from copy import deepcopy
-from typing import Optional
+from typing import Optional, Type

 import gradio as gr
 from ktem.components import llms, reasonings
@@ -127,14 +127,18 @@ async def chat_fn(conversation_id, chat_history, files, settings):
    asyncio.create_task(pipeline(chat_input, conversation_id, chat_history))
    text, refs = "", ""

+    len_ref = -1  # for logging purpose
+
    while True:
        try:
            response = queue.get_nowait()
        except Exception:
-            await asyncio.sleep(0)
+            yield "", chat_history + [(chat_input, text or "Thinking ...")], refs
            continue

        if response is None:
+            queue.task_done()
+            print("Chat completed")
            break

        if "output" in response:
@@ -142,6 +146,10 @@ async def chat_fn(conversation_id, chat_history, files, settings):
        if "evidence" in response:
            refs += response["evidence"]

+        if len(refs) > len_ref:
+            print(f"Len refs: {len(refs)}")
+            len_ref = len(refs)
+
    yield "", chat_history + [(chat_input, text)], refs


@@ -203,7 +211,9 @@ def index_fn(files, reindex: bool, selected_files, settings):
    gr.Info(f"Start indexing {len(files)} files...")

    # get the pipeline
-    indexing_cls: BaseIndexing = import_dotted_string(app_settings.KH_INDEX, safe=False)
+    indexing_cls: Type[BaseIndexing] = import_dotted_string(
+        app_settings.KH_INDEX, safe=False
+    )
    indexing_pipeline = indexing_cls.get_pipeline(settings)

    output_nodes, file_ids = indexing_pipeline(files, reindex=reindex)
@@ -225,5 +235,71 @@ def index_fn(files, reindex: bool, selected_files, settings):

    return (
        gr.update(value=file_path, visible=True),
-        gr.update(value=output, choices=file_list),
+        gr.update(value=output, choices=file_list),  # unnecessary
    )
+
+
+def index_files_from_dir(folder_path, reindex, selected_files, settings):
+    """This should be constructable by users
+
+    It means that the users can build their own index.
+    Build your own index:
+        - Input:
+            - Type: based on the type, then there are ranges of. Use can select multiple
+            panels:
+                - Panels
+                - Data sources
+                - Include patterns
+                - Exclude patterns
+            - Indexing functions. Can be a list of indexing functions. Each declared
+            function is:
+                - Condition (the source that will go through this indexing function)
+                - Function (the pipeline that run this)
+        - Output: artifacts that can be used to -> this is the artifacts that we wish
+            - Build the UI
+                - Upload page: fixed standard, based on the type
+                - Read page: fixed standard, based on the type
+                - Delete page: fixed standard, based on the type
+            - Build the index function
+            - Build the chat function
+
+    Step:
+        1. Decide on the artifacts
+        2. Implement the transformation from artifacts to UI
+    """
+    if not folder_path:
+        return
+
+    import fnmatch
+    from pathlib import Path
+
+    include_patterns: list[str] = []
+    exclude_patterns: list[str] = ["*.png", "*.gif", "*/.*"]
+    if include_patterns and exclude_patterns:
+        raise ValueError("Cannot have both include and exclude patterns")
+
+    # clean up the include patterns
+    for idx in range(len(include_patterns)):
+        if include_patterns[idx].startswith("*"):
+            include_patterns[idx] = str(Path.cwd() / "**" / include_patterns[idx])
+        else:
+            include_patterns[idx] = str(Path.cwd() / include_patterns[idx].strip("/"))
+
+    # clean up the exclude patterns
+    for idx in range(len(exclude_patterns)):
+        if exclude_patterns[idx].startswith("*"):
+            exclude_patterns[idx] = str(Path.cwd() / "**" / exclude_patterns[idx])
+        else:
+            exclude_patterns[idx] = str(Path.cwd() / exclude_patterns[idx].strip("/"))
+
+    # get the files
+    files: list[str] = [str(p) for p in Path(folder_path).glob("**/*.*")]
+    if include_patterns:
+        for p in include_patterns:
+            files = fnmatch.filter(names=files, pat=p)
+
+    if exclude_patterns:
+        for p in exclude_patterns:
+            files = [f for f in files if not fnmatch.fnmatch(name=f, pat=p)]
+
+    return index_fn(files, reindex, selected_files, settings)
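Note on the exclude filtering in `index_files_from_dir` above: the sketch below shows how `fnmatch` patterns such as `*.png`, `*.gif` and `*/.*` drop files from a list of paths. It is a simplified illustration, not the committed code. In the diff the patterns are first prefixed with `Path.cwd()`; this sketch skips that step and matches the raw patterns, which is an assumption about the intended behavior. The function name `filter_excluded` and the sample paths are hypothetical.

```python
import fnmatch
from typing import List


def filter_excluded(files: List[str], exclude_patterns: List[str]) -> List[str]:
    """Drop every path that matches at least one exclude pattern."""
    kept = files
    for pattern in exclude_patterns:
        # fnmatchcase avoids platform-dependent case normalization.
        # In fnmatch, "*" also matches "/" so "*.png" applies at any depth.
        kept = [f for f in kept if not fnmatch.fnmatchcase(f, pattern)]
    return kept


if __name__ == "__main__":
    sample = [
        "/data/a.pdf",
        "/data/img/b.png",
        "/data/.git/config.txt",
        "/data/c.gif",
    ]
    print(filter_excluded(sample, ["*.png", "*.gif", "*/.*"]))
```

Because `*` in `fnmatch` also matches path separators, the pattern `*/.*` excludes any path that contains a component starting with a dot, including files nested inside hidden directories.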
--- a/libs/ktem/ktem/pages/chat/upload.py
+++ b/libs/ktem/ktem/pages/chat/upload.py
@@ -32,22 +32,46 @@ class FileUpload(BasePage):
            )
            with gr.Accordion("Advanced indexing options", open=False):
                with gr.Row():
-                    with gr.Column():
                    self.reindex = gr.Checkbox(
                        value=False, label="Force reindex file", container=False
                    )
-                    with gr.Column():
-                        self.parser = gr.Dropdown(
-                            choices=[
-                                ("PDF text parser", "normal"),
-                                ("lib-table", "table"),
-                                ("lib-table + OCR", "ocr"),
-                                ("MathPix", "mathpix"),
-                            ],
-                            value="normal",
-                            label="Use advance PDF parser (table+layout preserving)",
-                            container=True,
-                        )
+
+            self.upload_button = gr.Button("Upload and Index")
+            self.file_output = gr.File(
+                visible=False, label="Output files (debug purpose)"
+            )
+
+
+class DirectoryUpload(BasePage):
+    def __init__(self, app):
+        self._app = app
+        self._supported_file_types = [
+            "image",
+            ".pdf",
+            ".txt",
+            ".csv",
+            ".xlsx",
+            ".doc",
+            ".docx",
+            ".pptx",
+            ".html",
+            ".zip",
+        ]
+        self.on_building_ui()
+
+    def on_building_ui(self):
+        with gr.Accordion(label="Directory upload", open=False):
+            gr.Markdown(
+                f"Supported file types: {', '.join(self._supported_file_types)}",
+            )
+            self.path = gr.Textbox(
+                placeholder="Directory path...", lines=1, max_lines=1, container=False
+            )
+            with gr.Accordion("Advanced indexing options", open=False):
+                with gr.Row():
+                    self.reindex = gr.Checkbox(
+                        value=False, label="Force reindex file", container=False
+                    )

            self.upload_button = gr.Button("Upload and Index")
            self.file_output = gr.File(
--- a/libs/ktem/ktem/reasoning/simple.py
+++ b/libs/ktem/ktem/reasoning/simple.py
@@ -106,7 +106,7 @@ DEFAULT_QA_TEXT_PROMPT = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, don't try to "
    "make up an answer. Keep the answer as concise as possible. Give answer in "
-    "{lang}. {system}\n\n"
+    "{lang}.\n\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
@@ -116,7 +116,7 @@ DEFAULT_QA_TABLE_PROMPT = (
    "List all rows (row number) from the table context that related to the question, "
    "then provide detail answer with clear explanation and citations. "
    "If you don't know the answer, just say that you don't know, "
-    "don't try to make up an answer. Give answer in {lang}. {system}\n\n"
+    "don't try to make up an answer. Give answer in {lang}.\n\n"
    "Context:\n"
    "{context}\n"
    "Question: {question}\n"
@@ -127,7 +127,7 @@ DEFAULT_QA_CHATBOT_PROMPT = (
    "Pick the most suitable chatbot scenarios to answer the question at the end, "
    "output the provided answer text. If you don't know the answer, "
    "just say that you don't know. Keep the answer as concise as possible. "
-    "Give answer in {lang}. {system}\n\n"
+    "Give answer in {lang}.\n\n"
    "Context:\n"
    "{context}\n"
    "Question: {question}\n"
@@ -198,13 +198,12 @@ class AnswerWithContextPipeline(BaseComponent):
            context=evidence,
            question=question,
            lang=self.lang,
-            system=self.system_prompt,
        )

-        messages = [
-            SystemMessage(content="You are a helpful assistant"),
-            HumanMessage(content=prompt),
-        ]
+        messages = []
+        if self.system_prompt:
+            messages.append(SystemMessage(content=self.system_prompt))
+        messages.append(HumanMessage(content=prompt))
        output = ""
        for text in self.llm(messages):
            output += text.text
@@ -316,11 +315,19 @@ class FullQAPipeline(BaseComponent):
            settings: the settings for the pipeline
            retrievers: the retrievers to use
        """
+        _id = cls.get_info()["id"]
+
        pipeline = FullQAPipeline(retrievers=retrievers)
        pipeline.answering_pipeline.llm = llms.get_highest_accuracy()
        pipeline.answering_pipeline.lang = {"en": "English", "ja": "Japanese"}.get(
            settings["reasoning.lang"], "English"
        )
+        pipeline.answering_pipeline.system_prompt = settings[
+            f"reasoning.options.{_id}.system_prompt"
+        ]
+        pipeline.answering_pipeline.qa_template = settings[
+            f"reasoning.options.{_id}.qa_prompt"
+        ]
        return pipeline

    @classmethod
@@ -345,10 +352,6 @@ class FullQAPipeline(BaseComponent):
                "value": True,
                "component": "checkbox",
            },
-            "system_prompt": {
-                "name": "System Prompt",
-                "value": "This is a question answering system",
-            },
            "citation_llm": {
                "name": "LLM for citation",
                "value": citation_llm,
@@ -361,6 +364,14 @@ class FullQAPipeline(BaseComponent):
                "component": "dropdown",
                "choices": main_llm_choices,
            },
+            "system_prompt": {
+                "name": "System Prompt",
+                "value": "This is a question answering system",
+            },
+            "qa_prompt": {
+                "name": "QA Prompt (contains {context}, {question}, {lang})",
+                "value": DEFAULT_QA_TEXT_PROMPT,
+            },
        }

    @classmethod
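Review note: the hunks above move the system prompt out of the QA templates and into an optional system message, and expose the QA template as a user setting that should contain `{context}`, `{question}` and `{lang}`. The sketch below illustrates both ideas in isolation. It uses plain tuples in place of `SystemMessage` and `HumanMessage`, and `missing_placeholders` is a hypothetical check that does not exist in this commit.

```python
from string import Formatter

# Placeholders named in the "QA Prompt" setting label.
REQUIRED_FIELDS = {"context", "question", "lang"}


def missing_placeholders(template):
    """Hypothetical check: required placeholders that the template never uses."""
    used = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED_FIELDS - used


def build_messages(system_prompt, prompt):
    """Mirror the new message logic, with (role, content) tuples as stand-ins."""
    messages = []
    if system_prompt:  # an empty system prompt adds no system message
        messages.append(("system", system_prompt))
    messages.append(("human", prompt))
    return messages


template = "Give answer in {lang}.\n\n{context}\nQuestion: {question}\nHelpful Answer:"
prompt = template.format(context="...", question="What is ktem?", lang="English")
messages = build_messages("This is a question answering system", prompt)
```

If the template is filled with `str.format`-style substitution, a saved template that still contains `{system}` would raise `KeyError` now that `system=` is no longer passed, so a check like this may be worth running when settings are loaded.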
--- a/libs/ktem/pyproject.toml
+++ b/libs/ktem/pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools >= 61.0"]
 build-backend = "setuptools.build_meta"

 [tool.setuptools]
-include-package-data = false
+include-package-data = true
 packages.find.include = ["ktem*"]
 packages.find.exclude = ["tests*", "env*"]

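Review note: `include-package-data = true` tells setuptools to ship non-Python files, but to my understanding only those it can discover, for example through `MANIFEST.in` or a VCS plugin. A quick way to confirm that the static files actually ended up in an installed package is sketched below. `has_packaged_file` is a hypothetical helper, and the `ktem` path in the final comment is a placeholder, not a path taken from this commit.

```python
from importlib.resources import files  # available in Python 3.9+


def has_packaged_file(package, relative_path):
    """Hypothetical helper: is `relative_path` present inside installed `package`?"""
    return files(package).joinpath(relative_path).is_file()


# Demonstrated on a stdlib package so the sketch runs anywhere.
print(has_packaged_file("json", "__init__.py"))

# After `pip install`, something like this could be checked (placeholder path):
# has_packaged_file("ktem", "path/to/static/file")
```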
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -8,11 +8,15 @@ nav:
      - Quick Start: index.md
      - Overview: overview.md
      - Contributing: contributing.md
+  - Application:
+      - Features: pages/app/features.md
+      - Customize flow logic: pages/app/customize-flows.md
+      - Customize UI: pages/app/customize-ui.md
+      - Functional description: pages/app/functional-description.md
  - Tutorial:
      - Data & Data Structure Components: data-components.md
      - Creating a Component: create-a-component.md
      - Utilities: ultilities.md
-      - Add new indexing and reasoning pipeline to the application: app.md
  # generated using gen-files + literate-nav
  - API Reference: reference/
  - Use Cases: examples/