
A TechBlog by Torben Hansen

Freelance Full Stack Web Developer located in Germany.
I create web applications mainly using TYPO3, PHP, Python and JavaScript.

Optimizing t3versions for improved TYPO3 version analysis

My t3versions TYPO3 version analysis and statistics service has been running for over five years now. Over the years, I have learned that crawling and analyzing practically the whole WWW for TYPO3 websites can be challenging.

In order to find new websites using TYPO3, I regularly run a crawling process that checks over 260 million domains for possible TYPO3 usage. With my current infrastructure (3 servers with 6 CPUs each), it takes about 14 days to analyze all domains. The crawling process checks the content of a website for TYPO3 fingerprints. When a TYPO3 website has been identified, the domain is queued for a detailed analysis using the t3versions API. The crawling process runs multicore and multithreaded and consumes ~4 TB of traffic. It is not uncommon for web servers to block the t3versions crawling requests if a large number of GET requests come from the same IP address in a short amount of time.
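To illustrate what a content check for TYPO3 fingerprints can look like, here is a minimal sketch in Python. It is an illustration only and not the actual t3versions crawler code: the function name `looks_like_typo3` and the chosen markers (the `generator` meta tag, the `typo3conf/` and `typo3temp/` path prefixes, and the "powered by TYPO3" HTML comment) are assumptions based on markup that TYPO3 sites commonly emit.

```python
import re

# Assumed markers for illustration; the real t3versions fingerprints may differ.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]*content=["\'][^"\']*TYPO3',
    re.IGNORECASE,
)
TEXT_MARKERS = (
    "typo3conf/",                         # extension and site asset paths
    "typo3temp/",                         # generated assets (merged CSS/JS)
    "this website is powered by typo3",   # default HTML header comment
)


def looks_like_typo3(html: str) -> bool:
    """Return True if the HTML contains at least one assumed TYPO3 marker."""
    if GENERATOR_RE.search(html):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in TEXT_MARKERS)
```

A check like this only needs the HTML of a single page, which keeps the crawl at one GET request per domain. Any hit is just a candidate and still needs the detailed analysis to confirm the TYPO3 version.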

The detailed TYPO3 analysis using the t3versions API results in several GET requests to a given website and falls back to fingerprinting techniques if the TYPO3 major version cannot be determined by the most common checks. Fingerprinting, however, also leads to an unusual number of GET requests, many of which may result in a 404 response. This might trigger WAF (Web Application Firewall) systems to block the requesting IP address for a given time.
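One generic way to keep such probing below typical WAF thresholds is to track requests per host, space them out, and stop fingerprinting once too many 404 responses have accumulated. The following Python sketch shows the idea. It does not describe how the t3versions API actually works: the class `HostBudget`, its method names, and the default limits (5 responses with status 404, 1 second between requests) are hypothetical values chosen for illustration.

```python
import time
from collections import defaultdict
from typing import Callable, Dict


class HostBudget:
    """Per-host request pacing plus a cap on 404 responses (hypothetical helper)."""

    def __init__(self, max_not_found: int = 5, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_not_found = max_not_found   # assumed limit, tune per target
        self.min_interval = min_interval     # assumed seconds between requests
        self._clock = clock
        self._not_found: Dict[str, int] = defaultdict(int)
        self._last_request: Dict[str, float] = {}

    def wait_time(self, host: str) -> float:
        """Seconds to wait before the next request to this host."""
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self._clock() - last))

    def may_probe(self, host: str) -> bool:
        """False once the host has returned too many 404 responses."""
        return self._not_found[host] < self.max_not_found

    def record(self, host: str, status: int) -> None:
        """Remember when the request happened and count 404 responses."""
        self._last_request[host] = self._clock()
        if status == 404:
            self._not_found[host] += 1
```

A fingerprinting loop would call `may_probe()` before each request, sleep for `wait_time()` seconds, and pass the response status to `record()`. The clock is injectable so the pacing logic can be tested without real delays.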

Since both the t3versions crawling process and the detailed TYPO3 analysis are performed by the same servers, chances are high that the IP addresses of my analysis infrastructure get blocked. I noticed this happened quite often during the recent t3versions crawling and rescan tasks, so I had to find a solution.

In order to avoid being blocked by WAFs (too quickly), I optimized the TYPO3 analysis process as follows:

I am very satisfied with the results of my optimizations, as they have led to a significant decrease in the number of false positives (websites being removed from the database although they use TYPO3). Through constant effort and fine-tuning, I have achieved a substantial improvement in the accuracy of the system.