# N8N Workflow Documentation - Troubleshooting Guide

## Overview

This document details the challenges encountered during the workflow documentation process and provides solutions to common issues. It serves as a guide for future documentation efforts and for troubleshooting similar problems.

## Approaches That Failed

### 1. Browser Automation with Playwright

#### What We Tried

```javascript
// Attempted approach
await page.goto('https://localhost:8000');
await page.selectOption('#categoryFilter', 'Business Process Automation');
await page.waitForLoadState('networkidle');
```

#### Why It Failed

- **Dynamic Loading Bottleneck**: The web application loads all 2,055 workflows before applying client-side filtering
- **Timeout Issues**: Browser automation timed out waiting for the filtering process to complete
- **Memory Constraints**: Loading all workflows simultaneously exceeded browser memory limits
- **JavaScript Complexity**: The client-side filtering logic was too complex for reliable automation

#### Symptoms

- Page loads but workflows never finish loading
- Browser automation hangs on category selection
- "Waiting for page to load" messages that never complete
- Network timeouts after 2+ minutes

#### Error Messages

```
TimeoutError: page.waitForLoadState: Timeout 30000ms exceeded
Waiting for load state to be NetworkIdle
```

### 2. Firecrawl with Dynamic Filtering

#### What We Tried

```javascript
// Attempted approach
firecrawl_scrape({
  url: "https://localhost:8000",
  actions: [
    {type: "wait", milliseconds: 5000},
    {type: "executeJavascript", script: "document.getElementById('categoryFilter').value = 'Business Process Automation'; document.getElementById('categoryFilter').dispatchEvent(new Event('change'));"},
    {type: "wait", milliseconds: 30000}
  ]
})
```

#### Why It Failed

- **60-Second Timeout Limit**: Firecrawl's maximum wait time was insufficient for complete data loading
- **JavaScript Execution Timing**: The filtering process required waiting for all workflows to load first
- **Response Size Limits**: Filtered results still exceeded token limits for processing
- **Inconsistent State**: Scraping occurred before filtering was complete

#### Symptoms

- Firecrawl returns incomplete data (1 workflow instead of 77)
- Timeout errors after 60 seconds
- "Request timed out" or "Internal server error" responses
- Inconsistent results between scraping attempts

#### Error Messages

```
Failed to scrape URL. Status code: 408. Error: Request timed out
Failed to scrape URL. Status code: 500. Error: (Internal server error) - timeout
Total wait time (waitFor + wait actions) cannot exceed 60 seconds
```

### 3. Single Large Web Scraping

#### What We Tried

Direct scraping of the entire page without category filtering:

```bash
curl -s "https://localhost:8000" | html2text
```

#### Why It Failed

- **Data Overload**: 2,055 workflows generated responses exceeding the 25,000-token limit
- **No Organization**: Results were unstructured and difficult to categorize
- **Missing Metadata**: HTML scraping didn't provide structured workflow details
- **Pagination Issues**: Workflows are loaded progressively, not all at once

#### Symptoms

- "Response exceeds maximum allowed tokens" errors
- Truncated or incomplete data
- Missing workflow details and metadata
- Unstructured output that is difficult to process

## What Worked: Direct API Strategy

### Why This Approach Succeeded

#### 1. Avoided JavaScript Complexity

- **Direct Data Access**: API endpoints provided structured data without client-side processing
- **No Dynamic Loading**: Each API call returned complete data immediately
- **Reliable State**: No dependency on browser state or JavaScript execution

#### 2. Manageable Response Sizes

- **Individual Requests**: Single workflow details fit within token limits
- **Structured Data**: JSON responses were predictable and parseable
- **Metadata Separation**: Workflow details were properly structured in API responses

#### 3. Rate Limiting Control

- **Controlled Pacing**: Small delays between requests prevented server overload
- **Batch Processing**: Category-based organization enabled logical processing
- **Error Recovery**: Individual failures didn't stop the entire process

### Technical Implementation That Worked

```bash
# Step 1: Get category mappings (single fast call)
curl -s "${API_BASE}/category-mappings" | jq '.mappings'

# Step 2: Group by category (applied to the mappings object from step 1)
jq 'to_entries | group_by(.value) | map({category: .[0].value, count: length, files: map(.key)})'

# Step 3: For each workflow, get details
for file in $workflow_files; do
  curl -s "${API_BASE}/workflows/${file}" | jq '.metadata'
  sleep 0.05 # Small delay for rate limiting
done
```
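
For reference, the three steps above can be wired into one small script. The following is a minimal sketch, assuming `API_BASE` is exported and points at the same server used throughout this guide, and that the endpoints behave exactly as shown above; it leaves out the URL encoding and validation covered under Common Issues and Prevention Strategies below.

```bash
#!/usr/bin/env bash
# Minimal end-to-end sketch of the direct API strategy (assumptions noted above).
set -euo pipefail

: "${API_BASE:?export API_BASE to the server's API base URL first}"

# Steps 1 + 2: fetch the category mappings and group them by category.
groups=$(curl -s "${API_BASE}/category-mappings" \
  | jq '.mappings | to_entries | group_by(.value)
        | map({category: .[0].value, count: length, files: map(.key)})')

# Step 3: walk every file in every category and collect its metadata.
echo "$groups" | jq -r '.[].files[]' | while read -r file; do
  curl -s "${API_BASE}/workflows/${file}" | jq '.metadata' \
    || echo "Failed to fetch ${file}" >&2
  sleep 0.05   # small delay so the server is not hammered
done
```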

## Common Issues and Solutions

### Issue 1: JSON Parsing Errors

#### Symptoms

```
jq: parse error: Invalid numeric literal at line 1, column 11
```

#### Cause

API returned non-JSON responses (HTML error pages, empty responses)

#### Solution

```bash
# Validate JSON before processing
response=$(curl -s "${API_BASE}/workflows/${filename}")
if echo "$response" | jq -e '.metadata' > /dev/null 2>&1; then
  echo "$response" | jq '.metadata'
else
  echo "{\"error\": \"Failed to fetch $filename\", \"filename\": \"$filename\"}"
fi
```

### Issue 2: URL Encoding Problems

#### Symptoms

- 404 errors for workflows with special characters in filenames
- API calls failing for certain workflow files

#### Cause

Workflow filenames contain special characters that need URL encoding

#### Solution

```bash
# Proper URL encoding: pass the filename as an argument so quotes or
# spaces in it cannot break the embedded Python snippet
encoded_filename=$(python3 -c "import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))" "$filename")
curl -s "${API_BASE}/workflows/${encoded_filename}"
```
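
If avoiding the Python dependency is preferable, jq (already used throughout these scripts) can produce the same percent-encoding with its `@uri` filter; a small sketch:

```bash
# Same encoding via jq's @uri filter (jq 1.5+), no Python call needed
encoded_filename=$(jq -rn --arg f "$filename" '$f | @uri')
curl -s "${API_BASE}/workflows/${encoded_filename}"
```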

### Issue 3: Missing Workflow Data

#### Symptoms

- Empty fields in generated documentation
- "Unknown" values for workflow properties

#### Cause

The API response nests workflow details under the `.metadata` key, while the script read them from the top level

#### Solution

```bash
# Before: read from the top level (wrong path, always falls back to "Unknown")
workflow_name=$(echo "$workflow_json" | jq -r '.name // "Unknown"')

# After: read from the nested .metadata object
workflow_name=$(echo "$response" | jq -r '.metadata.name // "Unknown"')
```

### Issue 4: Script Timeouts During Bulk Processing

#### Symptoms

- Scripts timing out after 10 minutes
- Incomplete documentation generation
- Process stops mid-category

#### Cause

Processing 2,055 API calls with delays takes significant time

#### Solution

```bash
# Process categories individually
for category in $categories; do
  generate_single_category "$category"
done

# Or use the timeout command
timeout 600 ./generate_all_categories.sh
```
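
A resume-friendly variant of the per-category loop is sketched below. It assumes one category name per line in a `categories.txt` file and a hypothetical `docs/<category>.md` output layout; it simply skips categories whose output already exists, so a timed-out run can be restarted without redoing finished work.

```bash
# Resume-friendly sketch: skip categories whose output file already exists.
# Assumes categories.txt (one category per line), a hypothetical docs/<category>.md
# layout, and the same generate_single_category helper used above.
while IFS= read -r category; do
  outfile="docs/${category}.md"
  if [ -s "$outfile" ]; then
    echo "Skipping ${category} (already generated)"
    continue
  fi
  generate_single_category "$category"
done < categories.txt
```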

### Issue 5: Inconsistent Markdown Formatting

#### Symptoms

- Trailing commas in integration lists
- Missing or malformed data fields
- Inconsistent status display

#### Cause

Variable data quality and missing fallback handling

#### Solution

```bash
# Clean integration lists: join with ", " so no trailing separator is left behind
workflow_integrations=$(echo "$workflow_json" | jq -r '[.integrations[]?] | join(", ")' 2>/dev/null)

# Handle boolean fields properly (.active is a JSON boolean, so compare against "true")
workflow_active=$(echo "$workflow_json" | jq -r '.active // false')
status=$([ "$workflow_active" = "true" ] && echo "Active" || echo "Inactive")
```

## Prevention Strategies

### 1. API Response Validation

Always validate API responses before processing:

```bash
# Skip to the next workflow if the response is not valid JSON
# (the `continue` assumes this runs inside the per-workflow loop)
if ! echo "$response" | jq -e . >/dev/null 2>&1; then
  echo "Invalid JSON response"
  continue
fi
```

### 2. Graceful Error Handling

Don't let individual failures stop the entire process:

```bash
workflow_data=$(fetch_workflow_details "$filename" || echo '{"error": "fetch_failed"}')
```
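
The `fetch_workflow_details` helper referenced above isn't shown elsewhere in this guide; a minimal sketch of such a helper, assuming the same `${API_BASE}/workflows/<file>` endpoint and adding a single retry, might look like this:

```bash
# Hypothetical helper: fetch one workflow's metadata with a single retry.
# Assumes API_BASE is set and the response nests details under .metadata.
fetch_workflow_details() {
  local filename="$1" response attempt
  for attempt in 1 2; do
    response=$(curl -s --max-time 15 "${API_BASE}/workflows/${filename}")
    if echo "$response" | jq -e '.metadata' >/dev/null 2>&1; then
      echo "$response" | jq '.metadata'
      return 0
    fi
    sleep 1   # brief pause before retrying
  done
  return 1    # caller substitutes the {"error": "fetch_failed"} fallback
}
```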

### 3. Progress Tracking

Include progress indicators for long-running processes:

```bash
echo "[$processed/$total] Processing $filename"
```
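
In context, the counters only need a couple of extra lines around the per-workflow loop; a small illustrative sketch (file and variable names are assumptions, not the actual script):

```bash
# Illustrative loop showing where the counters come from
total=$(wc -l < files.txt)
processed=0
while IFS= read -r filename; do
  processed=$((processed + 1))
  echo "[$processed/$total] Processing $filename"
  # ... fetch and document the workflow here ...
done < files.txt
```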

### 4. Rate Limiting

Always include delays to be respectful to APIs:

```bash
sleep 0.05 # Small delay between requests
```

### 5. Data Quality Checks

Verify counts and data integrity:

```bash
expected_count=77
actual_count=$(grep "^###" output.md | wc -l)
if [ "$actual_count" -ne "$expected_count" ]; then
  echo "Warning: Count mismatch"
fi
```

## Future Recommendations

### For Similar Projects

1. **Start with API exploration** before attempting web scraping
2. **Test with small datasets** before processing large volumes
3. **Implement resume capability** for long-running processes
4. **Use structured logging** for better debugging (see the sketch after this list)
5. **Build in validation** at every step
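
A minimal structured-logging sketch in the same bash style, assuming a hypothetical `docgen.log` file; the file and workflow names shown are illustrative. Each entry carries a timestamp, a level, and a message so failures can be located quickly:

```bash
# Hypothetical logging helper: timestamped, levelled entries in docgen.log
LOG_FILE="docgen.log"

log() {
  local level="$1"; shift
  printf '%s [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$level" "$*" | tee -a "$LOG_FILE"
}

log INFO  "Starting category: Business Process Automation"
log ERROR "Failed to fetch example_workflow.json"
```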

### For API Improvements

1. **Category filtering endpoints** would eliminate the need for client-side filtering
2. **Batch endpoints** could reduce the number of individual requests
3. **Response pagination** for large category results
4. **Rate limiting headers** to guide appropriate delays

### For Documentation Process

1. **Automated validation** against source API counts
2. **Incremental updates** rather than full regeneration
3. **Parallel processing** where appropriate (see the sketch after this list)
4. **Better error reporting** and recovery mechanisms
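
Where the server can tolerate it, bounded parallelism can shorten the metadata-collection step considerably. Below is a sketch using `xargs -P`, assuming `files.txt` holds one workflow filename per line, `API_BASE` is exported, and results land in a hypothetical `meta/<filename>.json` layout; the concurrency of 4 is an arbitrary starting point.

```bash
# Bounded parallelism: at most 4 concurrent metadata fetches.
# Assumes files.txt (one filename per line), an exported API_BASE, and a
# hypothetical meta/<filename>.json output layout.
mkdir -p meta
xargs -I{} -P 4 sh -c \
  'curl -s "${API_BASE}/workflows/$1" | jq ".metadata" > "meta/$1.json"' _ {} \
  < files.txt
```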

## Emergency Recovery Procedures

### If Process Fails Mid-Execution

1. **Identify completed categories**: Check which markdown files exist (see the check below)
2. **Resume from failure point**: Process only the missing categories
3. **Validate existing files**: Ensure completed files have correct counts
4. **Manual intervention**: Handle problematic workflows individually
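
A quick way to list what still needs to be generated, assuming the hypothetical `docs/<category>.md` layout from the Issue 4 sketch and a `categories.txt` file with one category per line:

```bash
# List categories that have no (or an empty) markdown file yet
while IFS= read -r category; do
  [ -s "docs/${category}.md" ] || echo "Missing: ${category}"
done < categories.txt
```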

### If API Access Is Lost

1. **Verify connectivity**: Check tunnel/proxy status
2. **Test API endpoints**: Confirm they're still accessible (see the check below)
3. **Switch to backup**: Use alternative access methods if available
4. **Document outage**: Note any missing data for later completion
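
A one-line health check against the lightweight category-mappings endpoint (the fastest call in this guide) can confirm whether the API is reachable; treating HTTP 200 as success is an assumption about the server:

```bash
# Expect HTTP 200 from the cheap category-mappings endpoint
status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${API_BASE}/category-mappings")
if [ "$status" != "200" ]; then
  echo "API unreachable or unhealthy (HTTP $status)"
fi
```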

This troubleshooting guide should help future documentation efforts avoid the pitfalls encountered here and build on the strategies that proved successful.