In Pursuit of Available Short Domain Names

To nod to the X-Files: short domain names are out there. Believe it or not, abundant three- and four-letter domain names are available for purchase today on many country-code and private TLDs. They make for great URL hacks and can hint at the next scrappy and nonsensical startup brand name. The only problem is, you can't browse them. You have to go digging name-by-name, as registrars do not present domains by availability.

Registrars are a peculiar kind of commerce site where instead of allowing consumers to browse available inventory, they make the consumer query, one-by-one (or if they're advanced, by a short delimited list of multiple words), whether they have the specific desired items available. It's like going to Blockbuster in 1995 and thinking, "Tonight feels like a comedy night," but instead of browsing an appropriately organized and labeled shelf, you have to ask the clerk title-by-title whether they have a suitable film among those you can recall. This requires pretty tedious memory and patience on the consumer's part.

In light of this, I created a service (no longer active) to present only available short domain names drawn from popular TLDs (.ai, .app, .cc, .co, .com, .io, .xyz, etc.) as determined from a corpus of all combinations of two- and three-letter sequences, and four- and five-letter Scrabble words.

Sadly, due to severe whois rate-limiting the service was ultimately discontinued, but it points to better ways that registrars might serve domain availability data, i.e., by publishing a public or even premium stream of domain availability changes.

Data Context

URL: (no new crawling)
Project Date: 2018
Technologies: Python, Google Cloud Platform, Mustache


The only free way to determine domain availability is via whois request. Sadly, in the interests of server load protection and to prevent data misuse, the whois service providers (various registrars specific to each TLD) severely limit the number of queries that can be executed before a fairly significant time-out delay (usually two to ten minutes, varying by registrar). Accordingly, a script to check this information must handle these time-outs efficiently.

It should be noted that certain premium services offer volume whois access, but checking the desired corpus that way would cost around $180 per update cycle.

Finally, whois responses aren't neatly structured and the response messages vary across registrars, requiring some easy research for keywords and other regex patterns to structure the domain availability status information contained in each response.
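
The keyword matching described above can be sketched roughly as follows. The patterns are illustrative assumptions rather than the ones the crawler actually used, and real response text varies by TLD and changes over time, so each pattern would need checking against live responses. Anything that matches neither pattern (a rate-limit notice, for instance) stays unknown so that it can be retried later.

```python
import re
from enum import Enum


class Status(Enum):
    AVAILABLE = "available"
    REGISTERED = "registered"
    UNKNOWN = "unknown"


# Assumed per-TLD "available" markers; verify against live responses before relying on them.
AVAILABLE_PATTERNS = {
    "com": re.compile(r"^No match for", re.IGNORECASE | re.MULTILINE),
    "be": re.compile(r"^Status:\s*AVAILABLE", re.IGNORECASE | re.MULTILINE),
}

# Assumed generic marker for a registered name: a record line naming the domain or registrar.
REGISTERED_PATTERN = re.compile(
    r"^\s*(Domain Name|Registrar):", re.IGNORECASE | re.MULTILINE
)


def classify(tld: str, response: str) -> Status:
    """Map a raw whois response to a status using keyword patterns."""
    pattern = AVAILABLE_PATTERNS.get(tld)
    if pattern and pattern.search(response):
        return Status.AVAILABLE
    if REGISTERED_PATTERN.search(response):
        return Status.REGISTERED
    # Rate-limit notices and unrecognized formats land here.
    return Status.UNKNOWN
```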

Service Design

The following elements were deployed in Google Cloud Platform App Engine leveraging Cloud SQL.

Backend Crawler

A multi-threaded Python application populates letter-combinations and the word-set into MySQL as status-unknown entries, and then executes whois commands for each to determine availability status per TLD-specific response messages indicating availability or reserved status.
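
As a minimal sketch of the seeding step, the letter combinations can be generated with itertools and staged as status-unknown rows. The table layout and function names here are hypothetical, and sqlite3 stands in for the MySQL instance so the example is self-contained.

```python
import sqlite3
from itertools import product
from string import ascii_lowercase


def letter_combinations(lengths=(2, 3)):
    """Yield every lowercase letter sequence of the given lengths."""
    for n in lengths:
        for letters in product(ascii_lowercase, repeat=n):
            yield "".join(letters)


def seed_queue(conn, names, tlds):
    """Insert each name/TLD pair as a status-unknown row, skipping existing pairs."""
    # Hypothetical schema; the original MySQL layout is not documented.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS domains ("
        " name TEXT NOT NULL,"
        " tld TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'unknown',"
        " checked_at TEXT,"
        " PRIMARY KEY (name, tld))"
    )
    rows = ((name, tld) for name in names for tld in tlds)
    conn.executemany("INSERT OR IGNORE INTO domains (name, tld) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the Cloud SQL database
seed_queue(conn, letter_combinations(), ["io", "co"])
count = conn.execute(
    "SELECT COUNT(*) FROM domains WHERE status = 'unknown'"
).fetchone()[0]
print(count)  # 36504: (26**2 + 26**3) names x 2 TLDs
```

The Scrabble word list would be loaded the same way, passing the words as `names`.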

Multiple crawler instances operate on dedicated threads, and maintain their own state for end-point target (TLD-specific), time-out delay status, and waiting during whois response delays.

Given that, generally, each TLD had a single whois endpoint, a single crawler instance is generated for each TLD, though multiple instances of the script itself could be run to parallelize queries for each TLD (for query-limit purposes, this would only help if done from multiple distinct network origins). Accordingly, each crawler queries domains from the unknown-status or stale-status word queues sequentially, and maintains a single timer matched to its registrar's time-out behavior (e.g., some hasty testing indicated .be maintains a 5-minute time-out after query exhaustion, while .com has a longer 10-minute time-out).
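
The per-TLD time-out handling might look something like the sketch below. The class, the `lookup` callable, and the convention that `lookup` returns None when rate-limited are all assumptions made for illustration; the clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TldCrawler:
    tld: str
    cooldown_seconds: float  # registrar-specific time-out, e.g. 300 for .be
    clock: Callable[[], float] = time.monotonic
    blocked_until: float = 0.0

    def ready(self) -> bool:
        return self.clock() >= self.blocked_until

    def note_rate_limited(self) -> None:
        """Start the time-out timer after the endpoint refuses a query."""
        self.blocked_until = self.clock() + self.cooldown_seconds

    def seconds_until_ready(self) -> float:
        return max(0.0, self.blocked_until - self.clock())


def crawl(crawler, names, lookup, sleep=time.sleep):
    """Query names sequentially; when rate-limited, wait out the time-out and retry."""
    results = {}
    for name in names:
        while True:
            if not crawler.ready():
                sleep(crawler.seconds_until_ready())
            status = lookup(f"{name}.{crawler.tld}")
            if status is None:  # assumed convention: None means "rate limited"
                crawler.note_rate_limited()
                continue
            results[name] = status
            break
    return results
```

One such crawler per TLD can then run on its own thread, since each keeps its own timer and never blocks the others.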

Available Domain Discovery Web Frontend

A simple, single Bootstrap and Mustache page served paginated results and allowed filtering by TLD, length, and word substring. On-hover of an available name, users could click through to one of several registrars to complete a registration.

Fatal Flaw: Artificially Low-Velocity Data Pipes

The query rate-limiting by domain registrars is simply too severe to be practicable.

The corpus of words I designated to query comprises almost 13,000 Scrabble words and over 18,000 generated exhaustive letter combinations. Back of the envelope math indicated that for even the most high-velocity TLDs this would have taken weeks or even longer to determine initial status for each domain, assuming my IP address wasn't blacklisted in the meantime. Moreover, that would just capture a snapshot in time -- if the service were to be valuable at all, it would be important to capture reasonably up-to-date status (i.e., sub-day latency) that would not only avoid user frustration ("it says it's free but it's not") but also afford valuable additional features such as allowing users to flag reserved words and alerting those users when those words become available again.
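
The back-of-the-envelope estimate can be reproduced with a few lines of arithmetic. The queries-per-window figure below is a hypothetical placeholder, not a measured limit, and the model counts only time spent waiting out time-outs, ignoring the time the queries themselves take.

```python
import math


def corpus_size(scrabble_words: int = 13_000) -> int:
    """Two- and three-letter combinations plus the (approximate) Scrabble word count."""
    combos = 26**2 + 26**3  # 676 + 17,576 = 18,252
    return combos + scrabble_words


def days_to_sweep(names: int, queries_per_window: int, cooldown_minutes: float) -> float:
    """Days to check every name once if each query window is followed by a time-out."""
    windows = math.ceil(names / queries_per_window)
    return windows * cooldown_minutes / (60 * 24)


names = corpus_size()
# Hypothetical: 10 queries allowed before a 10-minute time-out.
print(names)                                   # 31252
print(round(days_to_sweep(names, 10, 10), 1))  # 21.7
```

Under those assumed numbers a single pass over one TLD takes about three weeks, which is far from the sub-day freshness the service would need.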

A Better Solution -- Streaming Availability Status Changes

It's obnoxious that registrars don't make it easier to zero in on only available inventory. I suspect it's simply because the actual domain vendors -- GoDaddy, Network Solutions, Google, etc. -- don't actually have that information in a timely manner. Verisign is the registry operator for .com (among several other TLDs) and they may simply not care about making the model slightly more user friendly. Motivated consumers will continue to buy up juicy inventory regardless of the pains of searching, and certainly for brand-related keywords a brute-force whois-checking approach (i.e., the above approach but with a very limited corpus) is eminently sufficient. So ultimately, I suspect GoDaddy and other vendors simply have a bulk whois agreement of sorts (certainly with a higher-volume structured data endpoint, not public whois).

One reason they may not want whois data to be more widely available is that it's somewhat sensitive. whois data may include registering individuals' addresses, phone numbers, etc., and could be misused for marketing or other purposes.

What would be better? A pub-sub model. Verisign should create a simple and lightweight Kafka stream of domain availability, even skipping registrant PII in favor of simple "it's been registered" or "it's available again" message content. I can imagine a lot of market participants might actually pay to subscribe to that stream.
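
To make the idea concrete, here is a sketch of what a PII-free message and a subscriber's handling of it could look like. No such stream exists; the field names and JSON encoding are invented for illustration, and the messages are passed in as a plain iterable rather than read from a real Kafka consumer.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityEvent:
    domain: str
    event: str      # "registered" or "available"
    timestamp: str  # ISO 8601, UTC


def parse_event(raw: bytes) -> AvailabilityEvent:
    """Decode one hypothetical JSON message; no registrant details are carried."""
    data = json.loads(raw)
    if data["event"] not in ("registered", "available"):
        raise ValueError(f"unexpected event type: {data['event']!r}")
    return AvailabilityEvent(data["domain"].lower(), data["event"], data["timestamp"])


def apply_events(available: set, messages) -> set:
    """Keep a local set of available domains in sync with the stream."""
    for raw in messages:
        ev = parse_event(raw)
        if ev.event == "available":
            available.add(ev.domain)
        else:
            available.discard(ev.domain)
    return available


stream = [
    b'{"domain": "qzx.com", "event": "available", "timestamp": "2018-06-01T12:00:00Z"}',
    b'{"domain": "abc.io", "event": "registered", "timestamp": "2018-06-01T12:00:05Z"}',
]
print(apply_events({"abc.io"}, stream))  # {'qzx.com'}
```

A subscriber built this way never issues a whois query at all: it only replays changes, which would also make the "alert me when this name frees up" feature a simple filter on the stream.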