# Web to Knowledge Base for Open WebUI

A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access
### Dependencies

Install the required packages:

```bash
pip install requests beautifulsoup4 markitdown
```
### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
  --base-url "https://your-openwebui-instance.com" \
  --website-url "https://website-to-crawl.com" \
  --kb-name "My Website Knowledge Base"
```
### Command Line Arguments

| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
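For orientation, here is a minimal sketch of how these flags could be wired up with `argparse`. The option names and defaults mirror the table above, but the parser itself is an assumption, not the script's actual code:

```python
# Hypothetical argparse wiring matching the argument table above.
import argparse

parser = argparse.ArgumentParser(
    description="Crawl a website into an Open WebUI knowledge base"
)
parser.add_argument("--token", "-t", required=True, help="Open WebUI API token")
parser.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description for the knowledge base")
parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum depth to crawl")
parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests in seconds")
parser.add_argument("--exclude", "-e", action="append", default=[], help="URL pattern to exclude (repeatable)")
parser.add_argument("--include-json", "-j", action="store_true", help="Include JSON files and API endpoints")
parser.add_argument("--update", action="store_true", help="Update existing files in the knowledge base")
parser.add_argument("--skip-existing", action="store_true", help="Skip existing files in the knowledge base")

args = parser.parse_args()
```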
## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Example Docs KB" \
  -d 3
```
### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://blog.example.com" \
  -n "Example Blog KB" \
  -e "/tags/" \
  -e "/author/" \
  -e "/search/"
```
### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://api-docs.example.com" \
  -n "Example API Documentation" \
  -j
```
### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://knowledge-center.example.com" \
  -n "Knowledge Center" \
  --update
```
### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Documentation KB" \
  --skip-existing
```
## How It Works

1. **Website Crawling**: The script starts from the specified website URL and follows links up to the configured depth while staying within the same domain (see the crawl sketch below).
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management** (see the API sketch below):
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Manages existing files based on the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base
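The crawl and conversion steps (1 and 2) could look roughly like the condensed sketch below. It is illustrative only: the `crawl` helper, its breadth-first queue, and the choice of letting MarkItDown fetch each URL itself are assumptions, not the script's actual code.

```python
# Sketch of a depth-limited, same-domain crawl (assumed structure).
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markitdown import MarkItDown


def crawl(start_url: str, depth: int = 2, delay: float = 1.0,
          exclude: tuple[str, ...] = ()) -> dict[str, str]:
    """Breadth-first crawl limited to start_url's domain; returns {url: markdown}."""
    domain = urlparse(start_url).netloc
    md = MarkItDown()
    pages: dict[str, str] = {}
    queue: list[tuple[str, int]] = [(start_url, 0)]
    seen = {start_url}

    while queue:
        url, level = queue.pop(0)
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue

        # MarkItDown accepts a URL directly; that refetches the page, which
        # keeps this sketch short at the cost of a second request.
        pages[url] = md.convert(url).text_content

        if level < depth:
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"]).split("#")[0]
                in_domain = urlparse(link).netloc == domain  # stay on one domain
                excluded = any(pattern in link for pattern in exclude)
                if in_domain and not excluded and link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))

        time.sleep(delay)  # be polite to the target server

    return pages
```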
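Steps 3 and 4 go through Open WebUI's knowledge and files endpoints. The sketch below shows the general shape of those calls; the exact paths (`/api/v1/knowledge/list`, `/api/v1/knowledge/create`, `/api/v1/files/`, `/api/v1/knowledge/{id}/file/add`) are assumptions based on recent Open WebUI versions and may need adjusting for your instance.

```python
# Sketch of knowledge-base lookup/creation and file upload.
# Endpoint paths are assumptions and may differ between Open WebUI versions.
import requests


def get_or_create_kb(base_url: str, token: str, name: str,
                     purpose: str | None = None) -> str:
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}

    resp = requests.get(f"{base_url}/api/v1/knowledge/list", headers=headers)
    resp.raise_for_status()
    for kb in resp.json():
        if kb["name"] == name:
            return kb["id"]  # reuse the existing knowledge base

    resp = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    )
    resp.raise_for_status()
    return resp.json()["id"]


def upload_to_kb(base_url: str, token: str, kb_id: str,
                 filename: str, content: str) -> None:
    """Upload one document, then attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}

    # Upload the file itself.
    resp = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"), "text/markdown")},
    )
    resp.raise_for_status()
    file_id = resp.json()["id"]

    # Attach the uploaded file to the knowledge base.
    resp = requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": file_id},
    )
    resp.raise_for_status()
```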
## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use the `--delay` option to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
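One plausible URL-to-filename scheme, as a sketch (the `url_to_filename` helper is hypothetical; the script's exact replacement rules may differ):

```python
# Hypothetical URL-to-filename mapping: host + path, unsafe characters replaced.
import re
from urllib.parse import urlparse


def url_to_filename(url: str, extension: str = ".md") -> str:
    parsed = urlparse(url)
    stem = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    # Replace anything outside [a-zA-Z0-9._-] with an underscore.
    return re.sub(r"[^a-zA-Z0-9._-]+", "_", stem) + extension
```

For example, `https://example.com/docs/getting-started` would map to `example.com_docs_getting-started.md`.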
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments

- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API