# Web to Knowledge Base for Open WebUI

A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access
### Dependencies

Install the required packages:

```bash
pip install requests beautifulsoup4 markitdown
```
### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
  --base-url "https://your-openwebui-instance.com" \
  --website-url "https://website-to-crawl.com" \
  --kb-name "My Website Knowledge Base"
```
### Command Line Arguments

| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
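For orientation, here is a minimal sketch of how these flags could be wired up with `argparse`. The option names and defaults mirror the table above, but the parser itself is an assumption, not the script's actual code:

```python
# Hypothetical argparse wiring matching the argument table above.
import argparse

parser = argparse.ArgumentParser(
    description="Crawl a website into an Open WebUI knowledge base"
)
parser.add_argument("--token", "-t", required=True, help="Open WebUI API token")
parser.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description for the knowledge base")
parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum depth to crawl")
parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests in seconds")
parser.add_argument("--exclude", "-e", action="append", default=[], help="URL pattern to exclude (repeatable)")
parser.add_argument("--include-json", "-j", action="store_true", help="Include JSON files and API endpoints")
parser.add_argument("--update", action="store_true", help="Update existing files in the knowledge base")
parser.add_argument("--skip-existing", action="store_true", help="Skip existing files in the knowledge base")

args = parser.parse_args()
```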
## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Example Docs KB" \
  -d 3
```
### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://blog.example.com" \
  -n "Example Blog KB" \
  -e "/tags/" \
  -e "/author/" \
  -e "/search/"
```
### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://api-docs.example.com" \
  -n "Example API Documentation" \
  -j
```
### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://knowledge-center.example.com" \
  -n "Knowledge Center" \
  --update
```
### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Documentation KB" \
  --skip-existing
```
## How It Works

1. **Website Crawling**: The script starts from the specified website URL and follows links up to the configured depth while staying within the same domain (see the crawl sketch below).
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management** (see the API sketch below):
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Manages existing files based on the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base
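The crawl and conversion steps (1 and 2) could look roughly like the condensed sketch below. It is illustrative only: the `crawl` helper, its breadth-first queue, and the choice of letting MarkItDown fetch each URL itself are assumptions, not the script's actual code.

```python
# Sketch of a depth-limited, same-domain crawl (assumed structure).
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markitdown import MarkItDown


def crawl(start_url: str, depth: int = 2, delay: float = 1.0,
          exclude: tuple[str, ...] = ()) -> dict[str, str]:
    """Breadth-first crawl limited to start_url's domain; returns {url: markdown}."""
    domain = urlparse(start_url).netloc
    md = MarkItDown()
    pages: dict[str, str] = {}
    queue: list[tuple[str, int]] = [(start_url, 0)]
    seen = {start_url}

    while queue:
        url, level = queue.pop(0)
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue

        # MarkItDown accepts a URL directly; that refetches the page, which
        # keeps this sketch short at the cost of a second request.
        pages[url] = md.convert(url).text_content

        if level < depth:
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"]).split("#")[0]
                in_domain = urlparse(link).netloc == domain  # stay on one domain
                excluded = any(pattern in link for pattern in exclude)
                if in_domain and not excluded and link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))

        time.sleep(delay)  # be polite to the target server

    return pages
```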
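Steps 3 and 4 go through Open WebUI's knowledge and files endpoints. The sketch below shows the general shape of those calls; the exact paths (`/api/v1/knowledge/list`, `/api/v1/knowledge/create`, `/api/v1/files/`, `/api/v1/knowledge/{id}/file/add`) are assumptions based on recent Open WebUI versions and may need adjusting for your instance.

```python
# Sketch of knowledge-base lookup/creation and file upload.
# Endpoint paths are assumptions and may differ between Open WebUI versions.
import requests


def get_or_create_kb(base_url: str, token: str, name: str,
                     purpose: str | None = None) -> str:
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}

    resp = requests.get(f"{base_url}/api/v1/knowledge/list", headers=headers)
    resp.raise_for_status()
    for kb in resp.json():
        if kb["name"] == name:
            return kb["id"]  # reuse the existing knowledge base

    resp = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    )
    resp.raise_for_status()
    return resp.json()["id"]


def upload_to_kb(base_url: str, token: str, kb_id: str,
                 filename: str, content: str) -> None:
    """Upload one document, then attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}

    # Upload the file itself.
    resp = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"), "text/markdown")},
    )
    resp.raise_for_status()
    file_id = resp.json()["id"]

    # Attach the uploaded file to the knowledge base.
    resp = requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": file_id},
    )
    resp.raise_for_status()
```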
## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use the `--delay` option to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
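One plausible URL-to-filename scheme, as a sketch (the `url_to_filename` helper is hypothetical; the script's exact replacement rules may differ):

```python
# Hypothetical URL-to-filename mapping: host + path, unsafe characters replaced.
import re
from urllib.parse import urlparse


def url_to_filename(url: str, extension: str = ".md") -> str:
    parsed = urlparse(url)
    stem = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    # Replace anything outside [a-zA-Z0-9._-] with an underscore.
    return re.sub(r"[^a-zA-Z0-9._-]+", "_", stem) + extension
```

For example, `https://example.com/docs/getting-started` would map to `example.com_docs_getting-started.md`.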
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments

- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API