# Scraping
Author: Michael Schmidmaier
OpenEU uses multiple scraper scripts to scrape publicly available information. Most of the scraped information is about past and upcoming meetings and events, but there is also other data, such as legislative files. All scraper scripts are in the `app/data_sources/scrapers` directory. More information on the individual data sources can be found in the data sources documentation.
## 📦 Technologies

- 🕷️ Scrapy: Python-based web scraping framework for extracting structured data from websites
  - used for server-rendered pages and static content like RSS feeds
- 🥣 BeautifulSoup: Python library for parsing HTML and XML documents
  - used for parsing HTML content and extracting data (see the sketch after this list)
- 🧪 Playwright: Automated browser testing tool that enables headless web scraping of dynamic content
  - used for scraping dynamic content that requires JavaScript execution
- 🤖 Crawl4AI: Open-source, LLM-friendly web crawler and scraper with built-in anti-bot-protection measures
  - used for scraping websites with more advanced anti-bot measures
  - supports both static and dynamic content
  - offers many built-in features, but may be overkill for simpler scraping tasks
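For the static-content case, the parsing step usually boils down to a few BeautifulSoup calls. The snippet below is a minimal, self-contained sketch; the HTML markup and CSS classes are invented for illustration and do not correspond to a real OpenEU data source:

```python
from bs4 import BeautifulSoup

# Server-rendered HTML as a scraper might receive it (markup is purely illustrative).
html = """
<ul class="meetings">
  <li><span class="title">Committee on Budgets</span> <time datetime="2025-03-12">12 March 2025</time></li>
  <li><span class="title">Plenary sitting</span> <time datetime="2025-03-14">14 March 2025</time></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.select("ul.meetings li"):
    title = item.select_one(".title").get_text(strip=True)
    date = item.select_one("time")["datetime"]
    print(title, date)  # e.g. "Committee on Budgets 2025-03-12"
```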
## 📦 `ScraperBase` class

`ScraperBase` is an abstract base class that provides common functionality for all scrapers in the system. It handles data extraction, storage, embedding, and error handling in a consistent way.
### Key Features
- Retry Mechanism: Automatically retries failed scraping attempts
- Logging: Unified logging for starting, retrying, stopping and errors
- Error Handling: Comprehensive error handling with notifications
- Data Storage: Built-in methods for storing scraped data in Supabase
- Topic Extraction: Automatic topic assignment for scraped meetings
- Embedding Generation: Creates embeddings for semantic search
### Main Components

- `ScraperResult`: Data class for encapsulating the results of a scraping operation
- Store Methods: `store_entry` and `store_entry_returning_id` for database operations
- Embedding: `embedd_entries` method to generate vector embeddings for scraped content
- Error Notification: Alerts developers on scraper errors using email
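The exact definition of `ScraperResult` lives alongside `ScraperBase`; for orientation, the fields used in the examples on this page suggest a data class roughly like the following (an illustrative sketch, not the authoritative definition):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ScraperResult:
    """Illustrative sketch, inferred from how ScraperResult is used on this page."""

    success: bool                          # whether the scraping run completed successfully
    lines_added: int = 0                   # number of entries stored during the run
    error: Exception | str | None = None   # error information, if anything went wrong
    last_entry: Any = None                 # last processed entry, useful when resuming
```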
### Configuration Options

- `table_name`: The database table where scraped data will be stored
- `max_retries`: Maximum number of retry attempts for failed scraping
- `retry_delay`: Delay between retry attempts, in seconds
- `stop_event`: Event for graceful termination of scraping processes, which scrapers should check regularly so they can stop cleanly on request (more details below)
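To see how these options interact, here is a hypothetical, simplified sketch of a retry loop around `scrape_once`. It is not the actual `ScraperBase.scrape()` implementation; it only illustrates the roles of `max_retries`, `retry_delay`, and `stop_event` (default values and the class name are made up for the sketch):

```python
import logging
import time


class ScraperBaseSketch:
    """Hypothetical sketch only; not the real ScraperBase implementation."""

    def __init__(self, table_name, stop_event, max_retries=3, retry_delay=30.0):
        # Default values here are invented for the sketch.
        self.table_name = table_name
        self.stop_event = stop_event
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.logger = logging.getLogger(self.__class__.__name__)

    def scrape_once(self, last_entry, **args):
        raise NotImplementedError  # implemented by concrete scrapers

    def scrape(self):
        # Call scrape_once up to max_retries times, waiting retry_delay seconds
        # between attempts, and abort early when the stop_event is set.
        # ScraperResult is used as sketched in the previous section.
        last_entry = None
        result = ScraperResult(success=False, last_entry=last_entry)
        for attempt in range(1, self.max_retries + 1):
            if self.stop_event.is_set():
                self.logger.info("Stop requested, aborting before attempt %d", attempt)
                break
            try:
                result = self.scrape_once(last_entry)
                if result.success:
                    return result
                last_entry = result.last_entry
            except Exception as e:
                self.logger.error("Attempt %d failed: %s", attempt, e)
                result = ScraperResult(success=False, error=e, last_entry=last_entry)
            if attempt < self.max_retries:
                time.sleep(self.retry_delay)
        return result
```

Concrete scrapers only need to supply `scrape_once`; the base class owns the retry, logging, and stop handling.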
### Usage

Scraper implementations should inherit from `ScraperBase` and implement at least the `__init__` and `scrape_once` methods:
```python
import multiprocessing.synchronize
from datetime import date

# ScraperBase, ScraperResult, XYZ_TABLE_NAME and XYZ_ENTRY are imported
# from the appropriate modules of the project.


class CustomScraper(ScraperBase):
    # here you can pass additional needed properties, like start and end date
    def __init__(self, stop_event: multiprocessing.synchronize.Event, start_date: date, end_date: date):
        super().__init__(table_name=XYZ_TABLE_NAME, stop_event=stop_event)
        self.start_date = start_date
        self.end_date = end_date
        self.entries: list[XYZ_ENTRY] = []

    def scrape_once(self, last_entry, **args) -> ScraperResult:
        # Implementation of the specific scraping logic.
        # Example with pseudo-code:
        count = 0
        error = None
        # scraped_entries = self.scrape_data()
        # filtered_entries = self.remove_duplicates(scraped_entries)
        # for entry in filtered_entries.values():
        #     try:
        #         self.store_entry(entry.model_dump(), embedd_entries=True)
        #         self.entries.append(entry)
        #         count += 1
        #     except Exception as e:
        #         error = f"Error storing entry {entry.title}: {e}"
        #         self.logger.error(error)
        #         continue
        return ScraperResult(success=True, lines_added=count, error=error, last_entry=last_entry)

    # Additional methods for scraping, filtering, and processing entries can be implemented here
    # ...
```
### Graceful stopping (`stop_event`)

The scheduler starts scrapers in separate threads or processes. In the case of threads, there is no safe way to simply kill a thread without leaking resources. To support graceful termination of scraping processes, for example on a scheduler timeout, the scheduler provides a `stop_event`. All scrapers should check this event regularly. This allows long-running scraper tasks to stop cleanly when requested, without losing data, leaving operations incomplete, or leaking resources.
#### Example implementation
```python
if self.stop_event.is_set():
    browser.close()
    return ScraperResult(
        success=False,
        error=Exception("Scrape stopped by external stop event"),
        last_entry=self.last_entry,
    )
```
## 🚫 Handling Duplicates

To avoid duplicate entries in the database, scrapers should check for existing entries before storing new ones. If the scraped information contains a unique identifier, this can be done by using the upsert functionality of `store_entry`.
### Example: upserting using a custom unique identifier (here: `url`)

```python
self.store_entry(meeting.model_dump(), on_conflict="url", embedd_entries=False)
```
If there is no unique identifier, scrapers should implement custom logic to filter out duplicates before storing entries. This can be done by comparing the title, date, and other relevant fields of the scraped entries against existing entries in the database.
### Fuzzy matching example from `spanish_commission_scraper.py`

```python
# Query existing entries for the same date via the Supabase client;
# `fuzz` comes from the fuzzy string-matching library used by the project.
result = supabase.table("spanish_commission_meetings").select("id, title").eq("date", entry.date).execute()
existing_entries = result.data or []

for existing in existing_entries:
    existing_title = existing["title"]
    if fuzz.token_sort_ratio(existing_title, entry.title) > 90:
        self.logger.info(f"Duplicate found: {entry.title} matches {existing_title}")
        return True  # duplicate

return False  # not a duplicate
```
## 🌐 Languages

As OpenEU's platform language is English, scraped data sometimes has to be translated from other languages before it can be stored. This can be done using the `Translator` class from `app/data_sources/translator/translator.py`.
### Translation Example

```python
import logging

from app.data_sources.translator.translator import Translator

logger = logging.getLogger(__name__)
translator = Translator()

try:
    translated_text = translator.translate("Voici un exemple de texte en français")
    print(translated_text)  # "Here is an example of text in French"
except Exception as e:
    logger.warning(f"Translation failed: {e}")
```
## ⏰ Scheduling Scrapers

Scrapers are usually scheduled to run at specific intervals. The jobs are defined in `app/core/jobs.py`. More information on how the scheduler works can be found in the scheduler documentation.
### Example of a scheduled scraper job

```python
def scrape_ec_res_inno_meetings(stop_event: multiprocessing.synchronize.Event):
    today = datetime.now().date()
    end_date = today + timedelta(days=365)
    scraper = EcResInnoMeetingsScraper(start_date=today, end_date=end_date, stop_event=stop_event)
    return scraper.scrape()


# register the job to run daily at 03:30
scheduler.register(
    "scrape_ec_res_inno_meetings",
    scrape_ec_res_inno_meetings,
    schedule.every().day.at("03:30"),
    run_in_process=True,  # Crawl4AI scrapers need to run in a separate process, threads do not work
)
```