# N8N Workflow Documentation - Troubleshooting Guide

## Overview

This document details the challenges encountered during the workflow documentation process and provides solutions to common issues. It serves as a guide for future documentation efforts and for troubleshooting similar problems.

## Approaches That Failed

### 1. Browser Automation with Playwright

#### What We Tried

```javascript
// Attempted approach
await page.goto('https://localhost:8000');
await page.selectOption('#categoryFilter', 'Business Process Automation');
await page.waitForLoadState('networkidle');
```

#### Why It Failed

- **Dynamic Loading Bottleneck**: The web application loads all 2,055 workflows before applying client-side filtering
- **Timeout Issues**: Browser automation timed out waiting for the filtering process to complete
- **Memory Constraints**: Loading all workflows simultaneously exceeded browser memory limits
- **JavaScript Complexity**: The client-side filtering logic was too complex for reliable automation

#### Symptoms

- Page loads but workflows never finish loading
- Browser automation hangs on category selection
- "Waiting for page to load" messages that never complete
- Network timeouts after 2+ minutes

#### Error Messages

```
TimeoutError: page.waitForLoadState: Timeout 30000ms exceeded
Waiting for load state to be NetworkIdle
```

### 2. Firecrawl with Dynamic Filtering

#### What We Tried

```javascript
// Attempted approach
firecrawl_scrape({
  url: "https://localhost:8000",
  actions: [
    {type: "wait", milliseconds: 5000},
    {type: "executeJavascript", script: "document.getElementById('categoryFilter').value = 'Business Process Automation'; document.getElementById('categoryFilter').dispatchEvent(new Event('change'));"},
    {type: "wait", milliseconds: 30000}
  ]
})
```

#### Why It Failed

- **60-Second Timeout Limit**: Firecrawl's maximum wait time was insufficient for complete data loading
- **JavaScript Execution Timing**: The filtering process required waiting for all workflows to load first
- **Response Size Limits**: Filtered results still exceeded token limits for processing
- **Inconsistent State**: Scraping occurred before filtering was complete

#### Symptoms

- Firecrawl returns incomplete data (1 workflow instead of 77)
- Timeout errors after 60 seconds
- "Request timed out" or "Internal server error" responses
- Inconsistent results between scraping attempts

#### Error Messages

```
Failed to scrape URL. Status code: 408. Error: Request timed out
Failed to scrape URL. Status code: 500. Error: (Internal server error) - timeout
Total wait time (waitFor + wait actions) cannot exceed 60 seconds
```

### 3. Single Large Web Scraping

#### What We Tried

Direct scraping of the entire page without category filtering:

```bash
curl -s "https://localhost:8000" | html2text
```

#### Why It Failed

- **Data Overload**: 2,055 workflows generated responses exceeding the 25,000-token limit
- **No Organization**: Results were unstructured and difficult to categorize
- **Missing Metadata**: HTML scraping didn't provide structured workflow details
- **Pagination Issues**: Workflows are loaded progressively, not all at once

#### Symptoms

- "Response exceeds maximum allowed tokens" errors
- Truncated or incomplete data
- Missing workflow details and metadata
- Unstructured output that is difficult to process

## What Worked: Direct API Strategy

### Why This Approach Succeeded

#### 1. Avoided JavaScript Complexity

- **Direct Data Access**: API endpoints provided structured data without client-side processing
- **No Dynamic Loading**: Each API call returned complete data immediately
- **Reliable State**: No dependency on browser state or JavaScript execution

#### 2. Manageable Response Sizes

- **Individual Requests**: Single workflow details fit within token limits
- **Structured Data**: JSON responses were predictable and parseable
- **Metadata Separation**: Workflow details were properly structured in API responses

#### 3. Rate Limiting Control

- **Controlled Pacing**: Small delays between requests prevented server overload
- **Batch Processing**: Category-based organization enabled logical processing
- **Error Recovery**: Individual failures didn't stop the entire process

### Technical Implementation That Worked

```bash
# Step 1: Get category mappings (single fast call)
curl -s "${API_BASE}/category-mappings" | jq '.mappings'

# Step 2: Group by category (applied to the mappings object from step 1)
jq 'to_entries | group_by(.value) | map({category: .[0].value, count: length, files: map(.key)})'

# Step 3: For each workflow, get details
for file in $workflow_files; do
  curl -s "${API_BASE}/workflows/${file}" | jq '.metadata'
  sleep 0.05 # Small delay for rate limiting
done
```
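
For reference, the three steps above can be wired into one small script. The following is a minimal sketch, assuming `API_BASE` is exported and points at the same server used throughout this guide, and that the endpoints behave exactly as shown above; it leaves out the URL encoding and validation covered under Common Issues and Prevention Strategies below.

```bash
#!/usr/bin/env bash
# Minimal end-to-end sketch of the direct API strategy (assumptions noted above).
set -euo pipefail

: "${API_BASE:?export API_BASE to the server's API base URL first}"

# Steps 1 + 2: fetch the category mappings and group them by category.
groups=$(curl -s "${API_BASE}/category-mappings" \
  | jq '.mappings | to_entries | group_by(.value)
        | map({category: .[0].value, count: length, files: map(.key)})')

# Step 3: walk every file in every category and collect its metadata.
echo "$groups" | jq -r '.[].files[]' | while read -r file; do
  curl -s "${API_BASE}/workflows/${file}" | jq '.metadata' \
    || echo "Failed to fetch ${file}" >&2
  sleep 0.05   # small delay so the server is not hammered
done
```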

## Common Issues and Solutions

### Issue 1: JSON Parsing Errors

#### Symptoms

```
jq: parse error: Invalid numeric literal at line 1, column 11
```

#### Cause

API returned non-JSON responses (HTML error pages, empty responses)

#### Solution

```bash
# Validate JSON before processing
response=$(curl -s "${API_BASE}/workflows/${filename}")
if echo "$response" | jq -e '.metadata' > /dev/null 2>&1; then
  echo "$response" | jq '.metadata'
else
  echo "{\"error\": \"Failed to fetch $filename\", \"filename\": \"$filename\"}"
fi
```

### Issue 2: URL Encoding Problems

#### Symptoms

- 404 errors for workflows with special characters in filenames
- API calls failing for certain workflow files

#### Cause

Workflow filenames contain special characters that need URL encoding

#### Solution

```bash
# Proper URL encoding: pass the filename as an argument so quotes or
# spaces in it cannot break the embedded Python snippet
encoded_filename=$(python3 -c "import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))" "$filename")
curl -s "${API_BASE}/workflows/${encoded_filename}"
```
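
If avoiding the Python dependency is preferable, jq (already used throughout these scripts) can produce the same percent-encoding with its `@uri` filter; a small sketch:

```bash
# Same encoding via jq's @uri filter (jq 1.5+), no Python call needed
encoded_filename=$(jq -rn --arg f "$filename" '$f | @uri')
curl -s "${API_BASE}/workflows/${encoded_filename}"
```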

### Issue 3: Missing Workflow Data

#### Symptoms

- Empty fields in generated documentation
- "Unknown" values for workflow properties

#### Cause

The API response nests workflow details under the `.metadata` key, while the script read them from the top level

#### Solution

```bash
# Before: read from the top level (wrong path, always falls back to "Unknown")
workflow_name=$(echo "$workflow_json" | jq -r '.name // "Unknown"')

# After: read from the nested .metadata object
workflow_name=$(echo "$response" | jq -r '.metadata.name // "Unknown"')
```

### Issue 4: Script Timeouts During Bulk Processing

#### Symptoms

- Scripts timing out after 10 minutes
- Incomplete documentation generation
- Process stops mid-category

#### Cause

Processing 2,055 API calls with delays takes significant time

#### Solution

```bash
# Process categories individually
for category in $categories; do
  generate_single_category "$category"
done

# Or use the timeout command
timeout 600 ./generate_all_categories.sh
```
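
A resume-friendly variant of the per-category loop is sketched below. It assumes one category name per line in a `categories.txt` file and a hypothetical `docs/<category>.md` output layout; it simply skips categories whose output already exists, so a timed-out run can be restarted without redoing finished work.

```bash
# Resume-friendly sketch: skip categories whose output file already exists.
# Assumes categories.txt (one category per line), a hypothetical docs/<category>.md
# layout, and the same generate_single_category helper used above.
while IFS= read -r category; do
  outfile="docs/${category}.md"
  if [ -s "$outfile" ]; then
    echo "Skipping ${category} (already generated)"
    continue
  fi
  generate_single_category "$category"
done < categories.txt
```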

### Issue 5: Inconsistent Markdown Formatting

#### Symptoms

- Trailing commas in integration lists
- Missing or malformed data fields
- Inconsistent status display

#### Cause

Variable data quality and missing fallback handling

#### Solution

```bash
# Clean integration lists: join with ", " so no trailing separator is left behind
workflow_integrations=$(echo "$workflow_json" | jq -r '[.integrations[]?] | join(", ")' 2>/dev/null)

# Handle boolean fields properly (.active is a JSON boolean, so compare against "true")
workflow_active=$(echo "$workflow_json" | jq -r '.active // false')
status=$([ "$workflow_active" = "true" ] && echo "Active" || echo "Inactive")
```

## Prevention Strategies

### 1. API Response Validation

Always validate API responses before processing:

```bash
# Skip to the next workflow if the response is not valid JSON
# (the `continue` assumes this runs inside the per-workflow loop)
if ! echo "$response" | jq -e . >/dev/null 2>&1; then
  echo "Invalid JSON response"
  continue
fi
```

### 2. Graceful Error Handling

Don't let individual failures stop the entire process:

```bash
workflow_data=$(fetch_workflow_details "$filename" || echo '{"error": "fetch_failed"}')
```
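
The `fetch_workflow_details` helper referenced above isn't shown elsewhere in this guide; a minimal sketch of such a helper, assuming the same `${API_BASE}/workflows/<file>` endpoint and adding a single retry, might look like this:

```bash
# Hypothetical helper: fetch one workflow's metadata with a single retry.
# Assumes API_BASE is set and the response nests details under .metadata.
fetch_workflow_details() {
  local filename="$1" response attempt
  for attempt in 1 2; do
    response=$(curl -s --max-time 15 "${API_BASE}/workflows/${filename}")
    if echo "$response" | jq -e '.metadata' >/dev/null 2>&1; then
      echo "$response" | jq '.metadata'
      return 0
    fi
    sleep 1   # brief pause before retrying
  done
  return 1    # caller substitutes the {"error": "fetch_failed"} fallback
}
```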

### 3. Progress Tracking

Include progress indicators for long-running processes:

```bash
echo "[$processed/$total] Processing $filename"
```
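
In context, the counters only need a couple of extra lines around the per-workflow loop; a small illustrative sketch (file and variable names are assumptions, not the actual script):

```bash
# Illustrative loop showing where the counters come from
total=$(wc -l < files.txt)
processed=0
while IFS= read -r filename; do
  processed=$((processed + 1))
  echo "[$processed/$total] Processing $filename"
  # ... fetch and document the workflow here ...
done < files.txt
```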

### 4. Rate Limiting

Always include delays to be respectful to APIs:

```bash
sleep 0.05 # Small delay between requests
```

### 5. Data Quality Checks

Verify counts and data integrity:

```bash
expected_count=77
actual_count=$(grep "^###" output.md | wc -l)
if [ "$actual_count" -ne "$expected_count" ]; then
  echo "Warning: Count mismatch"
fi
```

## Future Recommendations

### For Similar Projects

1. **Start with API exploration** before attempting web scraping
2. **Test with small datasets** before processing large volumes
3. **Implement resume capability** for long-running processes
4. **Use structured logging** for better debugging (see the sketch after this list)
5. **Build in validation** at every step
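
A minimal structured-logging sketch in the same bash style, assuming a hypothetical `docgen.log` file; the file and workflow names shown are illustrative. Each entry carries a timestamp, a level, and a message so failures can be located quickly:

```bash
# Hypothetical logging helper: timestamped, levelled entries in docgen.log
LOG_FILE="docgen.log"

log() {
  local level="$1"; shift
  printf '%s [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$level" "$*" | tee -a "$LOG_FILE"
}

log INFO  "Starting category: Business Process Automation"
log ERROR "Failed to fetch example_workflow.json"
```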

### For API Improvements

1. **Category filtering endpoints** would eliminate the need for client-side filtering
2. **Batch endpoints** could reduce the number of individual requests
3. **Response pagination** for large category results
4. **Rate limiting headers** to guide appropriate delays

### For Documentation Process

1. **Automated validation** against source API counts
2. **Incremental updates** rather than full regeneration
3. **Parallel processing** where appropriate (see the sketch after this list)
4. **Better error reporting** and recovery mechanisms
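
Where the server can tolerate it, bounded parallelism can shorten the metadata-collection step considerably. Below is a sketch using `xargs -P`, assuming `files.txt` holds one workflow filename per line, `API_BASE` is exported, and results land in a hypothetical `meta/<filename>.json` layout; the concurrency of 4 is an arbitrary starting point.

```bash
# Bounded parallelism: at most 4 concurrent metadata fetches.
# Assumes files.txt (one filename per line), an exported API_BASE, and a
# hypothetical meta/<filename>.json output layout.
mkdir -p meta
xargs -I{} -P 4 sh -c \
  'curl -s "${API_BASE}/workflows/$1" | jq ".metadata" > "meta/$1.json"' _ {} \
  < files.txt
```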

## Emergency Recovery Procedures

### If Process Fails Mid-Execution

1. **Identify completed categories**: Check which markdown files exist (see the check below)
2. **Resume from failure point**: Process only the missing categories
3. **Validate existing files**: Ensure completed files have correct counts
4. **Manual intervention**: Handle problematic workflows individually
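
A quick way to list what still needs to be generated, assuming the hypothetical `docs/<category>.md` layout from the Issue 4 sketch and a `categories.txt` file with one category per line:

```bash
# List categories that have no (or an empty) markdown file yet
while IFS= read -r category; do
  [ -s "docs/${category}.md" ] || echo "Missing: ${category}"
done < categories.txt
```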

### If API Access Is Lost

1. **Verify connectivity**: Check tunnel/proxy status
2. **Test API endpoints**: Confirm they're still accessible (see the check below)
3. **Switch to backup**: Use alternative access methods if available
4. **Document outage**: Note any missing data for later completion
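
A one-line health check against the lightweight category-mappings endpoint (the fastest call in this guide) can confirm whether the API is reachable; treating HTTP 200 as success is an assumption about the server:

```bash
# Expect HTTP 200 from the cheap category-mappings endpoint
status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${API_BASE}/category-mappings")
if [ "$status" != "200" ]; then
  echo "API unreachable or unhealthy (HTTP $status)"
fi
```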

This troubleshooting guide should help future documentation efforts avoid the pitfalls encountered here and build on the strategies that proved successful.