I work on core infrastructure for one of Europe’s leading search engines, where I specialize in web-scale crawling systems and content-processing pipelines written in Rust.

Key Contributions:

  • I implemented a high-performance HTML content extraction microservice that produces clean text from web documents at scale. Its extraction quality is comparable to industry-standard tools such as Trafilatura, while processing runs several orders of magnitude faster.

  • I increased our crawler’s throughput with a sharded, lock-free architecture that still enforces strict politeness policies (no more than one concurrent download per website). This redesign improved crawling performance by multiple orders of magnitude.

  • I implemented a microservice that downloads, caches, and parses robots.txt files and filters URLs according to the rules they define.

  • I worked on several of the crawler’s core features, including URL redirection management, URL filtering, end-to-end tests …