# 20min.ch Comment Analysis System - Summary

## Overview

This system has been specifically adapted to scrape and analyze comments from political news articles on 20min.ch, with the following key capabilities:

1. **Smart Comment Detection**: The scraper now detects comment counts and skips articles with no comments
2. **Rate Limiting Protection**: Built-in delays and exponential backoff to respect the website's limits
3. **Data Extraction**: Extracts article metadata, comments, and the relationships between comments
4. **Comprehensive Analysis**: A full suite of analysis tools for sentiment, network, and activity patterns

## Key Changes Made

### 1. Comment Detection

We've implemented a robust comment detection system that:

- Looks for comment count elements using multiple selectors: `span[data-testid='comment-count'], .comment-count`
- Uses regex pattern matching to find and extract the comment count number in the format "X Kommentare"
- Skips articles with fewer comments than the `MIN_COMMENTS` threshold (configurable in `.env`)

### 2. CSS Selectors

We've updated all selectors based on inspecting the 20min.ch website structure:

- **Article Links**: `a[href*='/story/']` to find political article links
- **Article Metadata**: Updated selectors for titles and publication dates
- **Comments**: `.comment, [data-testid='comment']` to find comment elements
- **Comment Data**: Updated selectors for author names, content, timestamps, and parent/child relationships

### 3. Rate Limiting Handling

Enhanced rate limiting protection:

- Randomized delays between requests
- Detection of 429 status codes and "captcha" responses
- Exponential backoff when rate-limited (wait time increases with each retry)
- Configurable delay parameters in the `.env` file

### 4. Multilingual Considerations

Added support for multilingual content:

- Updated stopwords to include German, French, and English words
- Enhanced the sentiment analyzer to better handle non-English content
- Added language detection capabilities to the data processor

## Usage Notes

### Recommended Configuration

For optimal results with 20min.ch, we recommend:

- `MIN_DELAY=5`: Minimum delay between requests (seconds)
- `MAX_DELAY=10`: Maximum delay between requests (seconds)
- `ARTICLE_LIMIT=10`: Maximum number of articles to scrape in one run
- `MIN_COMMENTS=1`: Minimum number of comments required to process an article

### Analysis Considerations

When analyzing 20min.ch comments, consider:

1. **Language Variations**: Comments may be in German, French, Italian, or English
2. **Sentiment Analysis Limitations**: The sentiment analysis works best with English, but makes best efforts with other languages
3. **Cultural Context**: Swiss political discussions have unique regional and cultural contexts

## Example Usage

1. **Configure Settings**:

   ```
   cp .env.example .env
   # Edit settings as needed
   ```

2. **Run the Scraper**:

   ```
   python scraper.py
   ```

3. **Analyze the Data**:

   ```
   jupyter notebook notebooks/analysis.ipynb
   ```

## Future Improvements

1. Add language detection to handle sentiment analysis differently for each language
2. Implement more advanced scraping techniques to handle pagination in comment sections
3. Add support for tracking articles over time to capture comment evolution
4. Enhance the network analysis to better identify potential coordinated activities
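
## Sketch: Comment Count Parsing and Backoff

The comment detection and rate limiting behavior described under "Key Changes Made" can be illustrated with a minimal sketch. This is not the code in `scraper.py`: every function name here is hypothetical, the regex is only one plausible way to match the "X Kommentare" format, and the backoff base and cap are assumed values. The default delays mirror the recommended `MIN_DELAY` and `MAX_DELAY` settings. Counts written with thousands separators are not handled.

```python
import random
import re
from typing import Optional

# Assumed pattern for labels such as "12 Kommentare" or "1 Kommentar".
COMMENT_COUNT_RE = re.compile(r"(\d+)\s*Kommentar", re.IGNORECASE)


def parse_comment_count(label: str) -> Optional[int]:
    """Return the number in an 'X Kommentare' label, or None if absent."""
    match = COMMENT_COUNT_RE.search(label)
    return int(match.group(1)) if match else None


def should_scrape(label: str, min_comments: int = 1) -> bool:
    """Skip articles whose comment count is missing or below the threshold."""
    count = parse_comment_count(label)
    return count is not None and count >= min_comments


def is_rate_limited(status_code: int, body: str) -> bool:
    """Treat HTTP 429 or a page mentioning 'captcha' as a rate-limit signal."""
    return status_code == 429 or "captcha" in body.lower()


def request_delay(min_delay: float = 5.0, max_delay: float = 10.0) -> float:
    """Randomized pause (seconds) between ordinary requests."""
    return random.uniform(min_delay, max_delay)


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff: the wait doubles with each retry, up to a cap.

    `base` and `cap` are assumed values, not settings from the project.
    """
    return min(cap, base * (2 ** attempt))
```

A scraping loop would sleep for `request_delay()` seconds before each request and, whenever `is_rate_limited(...)` is true, sleep for `backoff_delay(attempt)` seconds before retrying with an incremented `attempt`.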