Late on a Sunday afternoon, a monitoring system flagged a CPU spike on one of our shared hosting servers. What started as a routine investigation turned into a multi-hour battle with OpenAI's GPTBot crawler, and ultimately required a formal complaint report and a firewall-level block to resolve. It was a blatant case of GPTBot abuse. Here's exactly what happened, how we diagnosed it, and what we did to stop it.
The Alert
At 5:36 PM PST, our automated monitoring triggered a CPU spike alert on rockstar.it-werx.net:
The top output showed an httpd process consuming 6.2% CPU, but the server-wide breakdown was more telling: 29.4% user time alongside 35.3% system time. That high sys% pointed to something beyond normal PHP processing — excessive process forking, I/O pressure, or socket activity were all on the table.
Isolating the Culprit
First step was to check how many php-cgi processes were actually running:
ps aux | grep php-cgi
domain+ 1172768 94.0 2.4 274188 90880 ? R 17:52 0:00 /opt/cpanel/ea-php83/root/usr/bin/php-cgi
macron 1172771 0.0 1.0 323896 38844 ? R 17:52 0:00 /opt/cpanel/ea-php83/root/usr/bin/php-cgi

Only three processes total — but one, belonging to the domain.com cPanel account, was running at 94% CPU. This wasn't a flood of processes; it was a single request so expensive it was saturating the server on its own. The investigation narrowed to one account's Drupal install.
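The triage step generalises to a one-liner. A minimal sketch, using the two ps lines captured above as inline sample data (on a live box you would pipe `ps aux` directly instead of the sample):

```shell
# The two captured ps lines stand in for live `ps aux` output.
sample='domain+ 1172768 94.0 2.4 274188 90880 ? R 17:52 0:00 /opt/cpanel/ea-php83/root/usr/bin/php-cgi
macron 1172771 0.0 1.0 323896 38844 ? R 17:52 0:00 /opt/cpanel/ea-php83/root/usr/bin/php-cgi'

# Field 3 of `ps aux` is %CPU; sort descending so the hottest process leads.
printf '%s\n' "$sample" | grep php-cgi | sort -k3,3 -rn | head -1
```

Sorting on the %CPU column surfaces the runaway process immediately, which matters when there are dozens of php-cgi workers rather than three.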
Finding the Access Logs
The logs weren't in the expected location, so we tracked them down:
find /usr/local/apache/domlogs -name "domain*" 2>/dev/null
/usr/local/apache/domlogs/domain.com-ssl_log
/usr/local/apache/domlogs/domain.com-bytes_log

Tailing the SSL log immediately revealed the pattern — every single request came from the same IP, with the same User-Agent string:
74.7.227.161 - - [22/Feb/2026:17:51:46] "GET /region/town/services/event-entertainment HTTP/2.0" 404 - "GPTBot/1.3"
74.7.227.161 - - [22/Feb/2026:17:52:04] "GET /region/town/valet-parking HTTP/2.0" 404 - "GPTBot/1.3"
74.7.227.161 - - [22/Feb/2026:17:52:13] "GET /region/town/pet-friendly HTTP/2.0" 404 - "GPTBot/1.3"

GPTBot/1.3 is OpenAI's web crawler. And every URL it was requesting returned a 404 — these pages don't exist on the site.
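Confirming that one crawler dominates the traffic reduces to a User-Agent tally. A hedged sketch, with three representative lines standing in for the real ssl_log (the field split assumes the trimmed log format shown above, where the User-Agent is the last quoted field):

```shell
# Sample lines in the same shape as the ssl_log excerpts above.
logsample='74.7.227.161 - - [22/Feb/2026:17:51:46] "GET /a HTTP/2.0" 404 - "GPTBot/1.3"
74.7.227.161 - - [22/Feb/2026:17:52:04] "GET /b HTTP/2.0" 404 - "GPTBot/1.3"
203.0.113.9 - - [22/Feb/2026:17:52:05] "GET / HTTP/2.0" 200 - "Mozilla/5.0"'

# With FS set to the double quote, the User-Agent is the second-to-last field.
printf '%s\n' "$logsample" | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -rn
```

On the server, replace the sample with `cat /usr/local/apache/domlogs/domain.com-ssl_log`; a healthy log shows a spread of browsers, not one bot at the top.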
The Scale of the Problem
# Total requests from this IP
grep "74.7.227.161" /usr/local/apache/domlogs/domain.com-ssl_log | wc -l
4475
# When did it start?
grep "74.7.227.161" /usr/local/apache/domlogs/domain.com-ssl_log | head -1
74.7.227.161 - - [22/Feb/2026:13:49:15] "GET /...//personal HTTP/2.0" 302 ...
# Total bandwidth consumed
grep "74.7.227.161" ... | awk '{sum += $10} END {print sum/1024/1024 " MB"}'
31.3459 MB

In summary:
- Total Requests: 4,475
- Sustained Rate: ~19/min
- Bandwidth Used: 31.3 MB
- 404 Responses: 4,362

The crawl had been running since 1:49 PM — nearly four hours — and was accelerating, from 71 requests in the first hour to over 1,500 per hour by the afternoon. Of the 4,475 total requests, 97.5% were to pages that don't exist.
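The acceleration figure comes from bucketing the IP's requests by hour. A sketch of that pipeline, with inline sample lines standing in for the real domlogs file (the cut fields assume the [DD/Mon/YYYY:HH:MM:SS] timestamp format shown above):

```shell
# Sample lines in place of the grep over the real ssl_log.
log='74.7.227.161 - - [22/Feb/2026:13:49:15] "GET /a HTTP/2.0" 302 -
74.7.227.161 - - [22/Feb/2026:17:51:46] "GET /b HTTP/2.0" 404 -
74.7.227.161 - - [22/Feb/2026:17:52:13] "GET /c HTTP/2.0" 404 -'

# Take the timestamp after '[', extract the HH component, and count per hour.
printf '%s\n' "$log" | cut -d'[' -f2 | cut -d: -f2 | sort | uniq -c
```

A steadily growing per-hour count, as seen here, is the signature of a crawler ramping up rather than a one-off burst.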
The Fabricated URL Problem
URLs like /region/town/room-service or /region/town/valet-parking were algorithmically constructed by appending plausible-sounding slugs to known pages. One request even contained a double slash, /services//personal, a telltale sign of programmatic URL construction rather than real link following.
Each of these non-existent URLs triggered a full Drupal PHP execution cycle to render the 404 page. With 19 requests per minute sustained, this created a continuous stream of expensive PHP processes with no legitimate purpose.
Attempting the .htaccess Block
The first remediation attempt was a User-Agent block inserted into the site's .htaccess immediately after RewriteEngine on:
# Block GPTBot by User-Agent
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

The rule inserted correctly and appeared syntactically valid, but GPTBot continued to get through, now receiving 500 errors instead of 404s, which was actually worse. Apache on cPanel was not processing the rewrite rule as expected for this account configuration.
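One untested alternative worth noting when mod_rewrite misbehaves under cPanel: tagging the User-Agent with mod_setenvif and denying it via Apache 2.4 authorization directives. This was not part of the actual remediation, which moved straight to the firewall; it is a hypothetical sketch only.

```apache
# Hypothetical alternative, not used in this incident: block GPTBot without
# touching mod_rewrite (requires Apache 2.4 and mod_setenvif).
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

Even if this had worked, every blocked request would still have reached Apache, so the firewall drop remains the cheaper fix.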
Escalating to the Firewall
With the .htaccess approach ineffective, I moved to a firewall-level block using CSF:
csf -d 74.7.227.161 "GPTBot abuse domain.com"

This drops the connection at the network level before it ever reaches Apache: zero PHP execution, zero CPU cost. The log went silent for that IP immediately, and CPU returned to normal within minutes.
Timeline of the Incident
- 13:49 PST First GPTBot request logged. 71 requests in the first hour, all probing fabricated URLs.
- 15:00 PST Crawl accelerates to 915 requests/hour. Server load begins to climb.
- 17:36 PST CPU spike alert triggered. Server at 81% CPU, php-cgi process at 94%.
- 17:52 PST Root cause identified as GPTBot crawling domain.com account.
- 18:11 PST .htaccess block inserted. GPTBot continues, now returning 500 errors.
- 18:29 PST Second CPU spike alert. .htaccess block confirmed ineffective.
- 18:33 PST Firewall block applied via CSF. Traffic from 74.7.227.161 drops immediately.
- 18:35 PST Server CPU normalises. Incident resolved.
Producing a Formal Complaint
Given the scale of the crawl, the sustained resource abuse, the fabricated URL probing, and the likely harvesting of proprietary content for AI training without consent, a formal complaint report was prepared on behalf of the site owner for submission to OpenAI, documenting the full evidence: timestamped log data, bandwidth figures, response code breakdowns, and the fabricated URL behaviour.
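The response-code breakdown for the report can be produced with a short awk pass. A sketch against inline sample lines (on the server, the input would be the IP grep over the ssl_log shown earlier):

```shell
# Sample lines in the trimmed log format used throughout this post.
log='74.7.227.161 - - [22/Feb/2026:13:49:15] "GET /a HTTP/2.0" 302 -
74.7.227.161 - - [22/Feb/2026:17:51:46] "GET /b HTTP/2.0" 404 -
74.7.227.161 - - [22/Feb/2026:17:52:13] "GET /c HTTP/2.0" 404 -'

# With FS='"', $3 is the text after the request (e.g. " 404 - ");
# splitting it on whitespace yields the status code as the first token.
printf '%s\n' "$log" | awk -F'"' '{split($3, f, " "); codes[f[1]]++}
  END {for (c in codes) print codes[c], c}' | sort -rn
```

Run against the full log, this produces the count-per-status table that backs the "97.5% were 404s" figure in the report.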
Final Resolution Checklist
- GPTBot blocked at firewall level via CSF — IP 74.7.227.161 dropped at network layer
- .htaccess rewrite rule left in place as a secondary defence layer
- User-agent: GPTBot / Disallow: / added to robots.txt as a formal opt-out on record
- Runaway php-cgi processes terminated to restore normal server operation
- Formal complaint report prepared and submitted to OpenAI
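For the record, the robots.txt opt-out from the checklist is two lines (GPTBot is the user-agent token OpenAI documents for its crawler):

```
User-agent: GPTBot
Disallow: /
```

This doesn't enforce anything by itself, but it establishes a documented opt-out that strengthens any subsequent complaint.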
Why Was This Site Targeted So Aggressively?
The content is training-data gold. Thousands of individual structured pages with unique content are highly valuable for AI training datasets. GPTBot prioritised the site accordingly.
A well-formed sitemap acted as an invitation. A clean, extensive sitemap told GPTBot exactly how many pages existed and gave it a roadmap. Sites without proper sitemaps tend to get crawled more superficially and quickly abandoned.
The fabricated URL probing is self-reinforcing. Once GPTBot determined the site was valuable, it began speculatively probing for content it predicted should exist. Even 500 error responses signal "this domain is active" and encourage continued crawling. A clean firewall drop is the correct signal to send.
Recommendations for Site Owners
- Add an explicit User-agent: GPTBot / Disallow: / entry to your robots.txt if you don't want OpenAI harvesting your content
- Enable Drupal's Page Cache and Dynamic Page Cache modules — cached responses are dramatically cheaper to serve
- Consider a static 404 page that bypasses PHP entirely for unknown URLs
- Set up rate-limiting in CSF or fail2ban to automatically flag IPs exceeding a threshold of requests per minute
- Monitor your access logs periodically for sustained single-IP activity — 19 requests per minute from one source should always trigger an alert
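One way to implement the rate-limiting recommendation is a fail2ban jail that counts raw request lines per IP. This is a hedged sketch; the filter name, thresholds, and log glob are illustrative, not taken from the incident:

```ini
# /etc/fail2ban/filter.d/req-flood.conf  (hypothetical custom filter:
# every access-log request line from a host counts as one "failure")
[Definition]
failregex = ^<HOST> - .*"(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.d/req-flood.local  (illustrative thresholds:
# ban any IP making more than 300 requests in 15 minutes for a day)
[req-flood]
enabled  = true
port     = http,https
filter   = req-flood
logpath  = /usr/local/apache/domlogs/*-ssl_log
maxretry = 300
findtime = 900
bantime  = 86400
```

Tune maxretry and findtime to your traffic: at the ~19/min rate seen in this incident, the example thresholds would have banned the crawler within the first 20 minutes rather than after four hours.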