How do you actually compile a list of paying HubSpot customers?
It's pretty trivial to find companies that use HubSpot in some capacity, free or otherwise. But detecting paying customers across every Hub is trickier. I'll walk you through how I used 10 signals to come up with a dataset of paying HubSpot customers, broken down by hub and tier, for Bloomberry.
Note: Instead of showing code, I'm giving you the exact prompts to paste into Claude to generate the implementation in whatever language you use.
Step 0: Build your universe of websites
Before anything else, you need a list of domains to run against. Cloudflare Radar publishes a list of the top 1 million domains that works well as a starting point. For bigger datasets, dig around GitHub - there are much larger lists out there. Save it to a text file (e.g. /tmp/domains.txt), one domain per line.
The database
Before starting, I recommend building an actual database table to store all the domains: one row per domain, one column per signal.
Prompt: "Create a SQLite table called companies with a TEXT primary key column domain, a portal_id TEXT column, a linkedin_url TEXT column, and INTEGER columns defaulting to 0 for: on_hubspot, content_hub_pro, content_hub_enterprise, marketing_hub_pro, service_hub_pro, sales_hub_starter, marketing_hub_enterprise, sales_hub_enterprise. Include a scraped_at timestamp column."
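For reference, here is a minimal sketch of what that prompt might produce, using Python's built-in sqlite3 module. The column names come from the prompt; the in-memory default path, the init_db helper name, and storing scraped_at as a TEXT timestamp are my assumptions.

```python
import sqlite3

# One row per domain, one INTEGER flag per signal (0 = not detected).
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    domain                    TEXT PRIMARY KEY,
    portal_id                 TEXT,
    linkedin_url              TEXT,
    on_hubspot                INTEGER NOT NULL DEFAULT 0,
    content_hub_pro           INTEGER NOT NULL DEFAULT 0,
    content_hub_enterprise    INTEGER NOT NULL DEFAULT 0,
    marketing_hub_pro         INTEGER NOT NULL DEFAULT 0,
    service_hub_pro           INTEGER NOT NULL DEFAULT 0,
    sales_hub_starter         INTEGER NOT NULL DEFAULT 0,
    marketing_hub_enterprise  INTEGER NOT NULL DEFAULT 0,
    sales_hub_enterprise      INTEGER NOT NULL DEFAULT 0,
    scraped_at                TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists.

    The ":memory:" default is only for experimenting; pass a file path
    to persist results between runs.
    """
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

Calling init_db("companies.db") once up front gives every later step the same table to update.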
Signal 1 - Are they even on HubSpot? (free or paid)
As mentioned, it's rather trivial to find out if a company is using HubSpot in some capacity. Iterate every domain in your list, fetch the homepage, and run two checks:
Check 1: The tracking script. Every HubSpot account embeds a script at js.hs-scripts.com/<PORTALID>.js. So if you detect this script, it means they're using HubSpot.
Use a regex on the homepage HTML to find it and extract the Portal ID - HubSpot's unique identifier per customer. You'll need this portal ID in almost every subsequent signal.
Check 2: The SPF record. HubSpot's email onboarding walks customers through adding include:_spf.hubspotemail.net to their DNS. Check the TXT records for "hubspotemail" and you've confirmed HubSpot usage. The customer's SPF record also contains their portal ID (<PORTAL_ID>.SPF#.hubspotemail.net).
While you're scraping the homepage, also extract any LinkedIn company page URL - you'll need it for Signal 4.
⏱ Time estimate: ~4-6 hours per million domains at 50 threads. 3M domains = 12-18 hours.
Prompt: "Write a multithreaded Python script that reads domains from /tmp/domains.txt, fetches each homepage with a 10 second timeout, and extracts: (1) a HubSpot portal ID by scanning for js.hs-scripts.com/<ID>.js, (2) whether hubspotemail appears in the domain's DNS TXT records, (3) any LinkedIn company page URL in the HTML. Use 50 threads. Write results to a SQLite companies table, setting on_hubspot=1 when either HubSpot check matches. Skip unreachable domains silently."
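The core of that script is just pattern matching on text you've already fetched. A rough sketch of the three extractors, with fetching, DNS, and threading left out: the function names are hypothetical, and the regexes assume portal IDs are numeric and that LinkedIn company URLs look like linkedin.com/company/<slug>.

```python
import re

# Assumption: the portal ID in the script URL is purely numeric.
PORTAL_RE = re.compile(r"js\.hs-scripts\.com/(\d+)\.js", re.IGNORECASE)

# Assumption: company pages live under linkedin.com/company/<slug>,
# optionally behind a short subdomain such as www. or uk.
LINKEDIN_RE = re.compile(
    r"https?://(?:[a-z]{2,3}\.)?linkedin\.com/company/[A-Za-z0-9_\-%.]+",
    re.IGNORECASE,
)


def extract_portal_id(html):
    """Return the portal ID from the tracking script tag, or None."""
    match = PORTAL_RE.search(html)
    return match.group(1) if match else None


def extract_linkedin_url(html):
    """Return the first LinkedIn company page URL found, or None."""
    match = LINKEDIN_RE.search(html)
    return match.group(0) if match else None


def txt_mentions_hubspot(txt_records):
    """True if any DNS TXT record string mentions hubspotemail."""
    return any("hubspotemail" in record.lower() for record in txt_records)
```

The TXT records themselves have to come from a DNS library or a command-line tool; this sketch only covers the check you run on them.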
If all you need is a list of companies using HubSpot - free or paid - you can stop here. But if you want to detect which companies are actually paying, and for which specific hubs, read on.
Signal 2 - Content Hub Pro+
The free plan gives you a subdomain on hs-sites.com. Hosting your actual domain on HubSpot's CMS, however, requires Content Hub Starter at minimum - though anyone serious is on Pro.
So our detection method is to check whether the A records for domain.com, www.domain.com, or blog.domain.com resolve to HubSpot's IP ranges.
HubSpot doesn't publish these publicly, but their IP allocations are registered through ARIN under org handle HUBSP-8, which has a public REST API. Fetch the ranges once, cache them, check every domain against them.
⏱ Time estimate: Under 30 minutes for 113k domains at 50 threads - DNS lookups are fast. No proxies needed.
Prompt: "Write a Python script that: (1) fetches HubSpot's IP ranges from ARIN's REST API for org handle HUBSP-8 by querying https://whois.arin.net/rest/org/HUBSP-8/nets and parsing out the CIDR blocks, (2) loops through all domains in a SQLite companies table where on_hubspot=1, (3) for each domain checks if domain.com, www.domain.com, or blog.domain.com A records resolve to those IP ranges, (4) if yes, also fetches /_hcms/diagnostics and reads the X-Hs-Portal-Id response header, (5) updates the content_hub_pro and portal_id columns. Use 50 threads."
Caveat: companies behind Cloudflare, Akamai, or Fastly will be missed - their A records point to the CDN, not HubSpot. Acceptable false negatives.
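Once you have the CIDR blocks, the membership check is a few lines with Python's standard ipaddress module. A sketch under some assumptions: the ARIN fetch and parsing are not shown, the CIDR list is passed in by the caller, the function names are mine, and the range used in the usage note below is a reserved documentation range, not a real HubSpot allocation.

```python
import ipaddress
import socket


def parse_networks(cidrs):
    """Turn CIDR strings (e.g. from the ARIN lookup) into network objects."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def ip_in_networks(ip, networks):
    """True if the IP address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def resolve_a_records(hostname):
    """Return the IPv4 addresses for a hostname, or [] if it doesn't resolve."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return addresses
    except OSError:
        return []


def points_at_networks(domain, networks, resolver=resolve_a_records):
    """Check the bare domain plus the www. and blog. subdomains."""
    for host in (domain, f"www.{domain}", f"blog.{domain}"):
        if any(ip_in_networks(ip, networks) for ip in resolver(host)):
            return True
    return False
```

With networks = parse_networks(["203.0.113.0/24"]) as a stand-in, points_at_networks("example.com", networks) returns True only if one of the three hostnames resolves into that block. The resolver parameter is there so you can swap in a faster DNS client or a stub for testing.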
Signal 3 - Content Hub Enterprise
Content Hub Pro allows one brand domain per portal. Two or more root domains on the same portal ID requires Content Hub Enterprise. In other words, if you wish to use HubSpot on yourwebsite.com and yourwebsite2.com, you need to pay for the Enterprise Plan.
This means that if you can find two websites with the same Portal ID using HubSpot Content Hub, you automatically know they belong to a Content Hub Enterprise customer.
So the detection check is: group the domains by portal ID, extract root domains (strip subdomains), and any portal with 2+ distinct roots is Enterprise.
⏱ Time estimate: Seconds. It's just a SQL query and a loop. No proxies needed.
Prompt: "Write a Python script that queries all rows from a SQLite companies table where content_hub_pro=1 and portal_id is not null, groups them by portal_id, extracts the root domain for each (stripping subdomains like blog., www., lp.), and sets content_hub_enterprise=1 for all domains sharing a portal_id that has 2 or more distinct root domains."
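The grouping logic is small enough to sketch directly. This version assumes you've already pulled (domain, portal_id) pairs out of the table, and it takes a shortcut: root_domain keeps only the last two labels, which is wrong for suffixes like co.uk. A public-suffix-aware library would be the more careful choice. Both function names are hypothetical.

```python
from collections import defaultdict


def root_domain(hostname):
    """Naive root extraction: keep the last two labels.

    Assumption: this mishandles multi-part suffixes such as co.uk,
    so treat it as a placeholder for a public-suffix-aware parser.
    """
    labels = hostname.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def enterprise_portals(rows):
    """Return the portal IDs that serve two or more distinct root domains.

    rows is an iterable of (domain, portal_id) pairs.
    """
    roots_by_portal = defaultdict(set)
    for domain, portal_id in rows:
        roots_by_portal[portal_id].add(root_domain(domain))
    return {pid for pid, roots in roots_by_portal.items() if len(roots) >= 2}
```

For example, blog.acme.com and www.acme.com on one portal collapse to a single root and don't count, while acme.io and acme-labs.com on the same portal do.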
Signal 4 - Marketing Hub Pro: the LinkedIn post trick
This is the one I'm most proud of.
Marketing Hub Pro and Enterprise unlock the social media publishing tool, and when a customer publishes through it, HubSpot wraps every outbound link in one of three URL shorteners: hubs.ly, hubs.li, or hubs.la.
So our detection check is: for each company's LinkedIn page (URL collected in Signal 1), look for those shortened URLs in posts, resolve them, and confirm the destination is the company's own domain - to filter out reshared content.
⚠️ Use proxies here. LinkedIn aggressively blocks scrapers. A rotating proxy pool is essential at scale.
⏱ Time estimate: Depends on proxy throughput. Budget 2-4 hours for 113k companies with a good proxy pool.
Prompt: "Write a Python script that loops through all rows in a SQLite companies table where on_hubspot=1 and linkedin_url is not null. For each row, use Playwright to scrape the company's LinkedIn page and extract all post links. Check if any links contain hubs.ly, hubs.li, or hubs.la. For those that do, follow the redirect and check if the final URL contains the company's domain. If yes, set marketing_hub_pro=1. Use a rotating proxy list. Use 20 threads."
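Setting the scraping aside, the two checks at the heart of this signal can be sketched as pure functions. Assumptions: the short-link path is made of letters, digits, underscores, and hyphens; the function names are mine; and instead of the looser "final URL contains the company's domain" test from the prompt, this compares hostnames so that notacme.com doesn't match acme.com.

```python
import re
from urllib.parse import urlparse

# Assumption: short-link paths contain only these characters.
SHORTLINK_RE = re.compile(
    r"https?://hubs\.(?:ly|li|la)/[A-Za-z0-9_\-]+", re.IGNORECASE
)


def find_hubspot_shortlinks(text):
    """Return every hubs.ly / hubs.li / hubs.la URL found in the text."""
    return SHORTLINK_RE.findall(text)


def lands_on_company_domain(final_url, company_domain):
    """True if the resolved URL's host is the company domain or a subdomain."""
    host = (urlparse(final_url).hostname or "").lower()
    company_domain = company_domain.lower()
    return host == company_domain or host.endswith("." + company_domain)
```

Resolving each short link to its final URL still needs an HTTP client that follows redirects; only the before and after checks are shown here.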
Signal 5 - Marketing Hub Pro: smart fields in embedded forms
HubSpot embedded forms are exposed via a public JSON endpoint that requires zero authentication:
https://forms.hsforms.com/embed/v3/form/<portalId>/<formId>/json
You need a Portal ID (already collected) and a Form ID. The Form ID can easily be extracted from the HTML source of the company's site by crawling a few pages per domain: homepage, /contact, /demo, /resources.
Two fields in the JSON indicate Pro:
- isSmartField or isSmartGroup set to true - dynamic fields that change based on CRM properties. Pro+ only.
- dependentFieldFilters populated - conditional form logic. Pro+ only.
⚠️ Use proxies here. HubSpot's form endpoint rate-limits repeated requests from the same IP.
⏱ Time estimate: ~2-3 hours for 113k companies at 30 threads.
Prompt: "Write a Python script that loops through all rows in a SQLite companies table where portal_id is not null. For each domain, fetch the homepage, /contact, /demo, and /resources pages and extract HubSpot form IDs (look for UUIDs near 'formId' in the HTML). For each form ID, fetch https://forms.hsforms.com/embed/v3/form/<portalId>/<formId>/json using a rotating proxy. Check if any field in formFieldGroups has isSmartField, isSmartGroup, or dependentFieldFilters set. Also check if sfdcCampaignId is populated at the form level (this is Signal 10 - Salesforce sync). Update marketing_hub_pro=1 or sales_hub_enterprise=1 accordingly. Use 30 threads."
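Here's roughly what the parsing half of that script could look like, with fetching and proxies omitted. The key names come from the prompt above; everything else is an assumption, including the exact shape of the embed code the regex targets and the nesting of the JSON, which is why the check walks formFieldGroups recursively instead of relying on a fixed layout.

```python
import re

# Assumption: form IDs are UUIDs that appear right after a formId key
# in the embed code, e.g. formId: "0a1b2c3d-....".
FORM_ID_RE = re.compile(
    r"formId[\"']?\s*[:=]\s*[\"']"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[\"']",
    re.IGNORECASE,
)


def extract_form_ids(html):
    """Return the distinct form IDs found in a page, in order of appearance."""
    return list(dict.fromkeys(FORM_ID_RE.findall(html)))


def walk_dicts(node):
    """Yield every dict nested anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk_dicts(item)


def form_signals(form_json):
    """Map one parsed form JSON document to the two signals it can carry."""
    smart = any(
        d.get("isSmartField") is True
        or d.get("isSmartGroup") is True
        or bool(d.get("dependentFieldFilters"))
        for d in walk_dicts(form_json.get("formFieldGroups", []))
    )
    return {
        "marketing_hub_pro": smart,
        "salesforce_sync": bool(form_json.get("sfdcCampaignId")),
    }
```

An empty dependentFieldFilters list or an empty sfdcCampaignId string counts as "not set" here, which matches the "populated" wording above.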
Signal 6 - Marketing Hub Pro: campaign GUIDs in pop-up configs
HubSpot's web interactives (pop-ups, slide-ins, banners) are fetched from a public endpoint:
https://api.hubspot.com/web-interactives/v1/public/audiences/<portalId>
If sortedAudienceConfigs is populated and any config contains a campaignGuid, the Campaigns tool is active - Marketing Hub Pro+ only.
⚠️ Use proxies here.
⏱ Time estimate: 1-2 hours at 30 threads. ~98% of portals return empty - low recall, high precision. Use as enrichment, not primary discovery.
Prompt: "Write a Python script that loops through all rows in a SQLite companies table where portal_id is not null. For each portal ID, fetch https://api.hubspot.com/web-interactives/v1/public/audiences/<portalId> using a rotating proxy. Check if any item in sortedAudienceConfigs has a campaignGuid field. If yes, set marketing_hub_pro=1 for that domain. Use 30 threads."
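The check on the response is short. A sketch, assuming the response parses to a JSON object with a sortedAudienceConfigs key as described above; because I'm not assuming where inside each config campaignGuid sits, it searches nested values too. Function names are hypothetical.

```python
def contains_key(node, key):
    """True if a non-empty value for key exists anywhere in the structure."""
    if isinstance(node, dict):
        if node.get(key):
            return True
        return any(contains_key(value, key) for value in node.values())
    if isinstance(node, list):
        return any(contains_key(item, key) for item in node)
    return False


def campaigns_tool_active(audience_json):
    """True if any audience config carries a campaignGuid."""
    if not isinstance(audience_json, dict):
        return False
    configs = audience_json.get("sortedAudienceConfigs") or []
    return contains_key(configs, "campaignGuid")
```

Since most portals return an empty list, the function returns False for almost everything, which is the expected behavior for an enrichment signal.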
Signal 7 - Service Hub Pro: the knowledge base subdomain
Service Hub Pro unlocks the Knowledge Base feature. Most companies host it on predictable subdomains: help., support., docs., kb., knowledge.
Just probe those subdomains with DNS lookups and check if their A records resolve to HubSpot IP ranges (same ranges from Signal 2). False positive rate: essentially zero. You can't accidentally point a subdomain at HubSpot's IPs without an active Service Hub Pro account.
⏱ Time estimate: Under 1 hour for 113k companies at 50 threads. No proxies needed.
Prompt: "Write a Python script that loops through all domains in a SQLite companies table where on_hubspot=1. For each domain, do DNS A record lookups on support., help., docs., kb., and knowledge. subdomains. Check if any resolve to HubSpot IP ranges (reuse the ARIN lookup from Signal 2). If yes, set service_hub_pro=1. Use 50 threads."
Signal 8 - Sales Hub Starter+: the meetings widget
The free HubSpot Meetings widget shows a "Powered by HubSpot" badge. Removing it requires Sales Hub Starter or higher.
Find the widget by crawling pages with "sales", "demo", or "contact" in anchor text or URL slugs. Look for meetings.hubspot.com or meetings-na1.hubspot.com in iframes or scripts. If found and the badge is absent - confirmed paid.
⏱ Time estimate: Lowest-yield signal in the list. Test on a 10k sample first to assess hit rate. Full run at 30 threads: 2-3 hours. No proxies needed.
Prompt: "Write a Python script that loops through all domains in a SQLite companies table where on_hubspot=1. For each domain, fetch the homepage and find links with 'sales', 'demo', or 'contact' in the href or anchor text. Crawl up to 5 of those pages. If any page contains meetings.hubspot.com or meetings-na1.hubspot.com but does NOT contain the text 'Powered by HubSpot', set sales_hub_starter=1. Use 30 threads."
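The per-page decision boils down to two substring checks. A minimal sketch, assuming the badge text is present in whatever HTML you fetched for the page (if the badge is rendered inside the widget's iframe, you'd need to fetch that document as well). The function name and the three labels are mine.

```python
MEETING_HOSTS = ("meetings.hubspot.com", "meetings-na1.hubspot.com")
BADGE_TEXT = "powered by hubspot"


def classify_meetings_page(html):
    """Classify one page as 'no_widget', 'free_badge', or 'paid_no_badge'."""
    lowered = html.lower()
    if not any(host in lowered for host in MEETING_HOSTS):
        return "no_widget"
    if BADGE_TEXT in lowered:
        return "free_badge"
    return "paid_no_badge"
```

Only a paid_no_badge result should flip sales_hub_starter to 1; no_widget means the page says nothing about the plan either way.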
Signal 9 - Marketing Hub Enterprise: custom events in the tracking script
The HubSpot tracking script from Signal 1 can be fetched directly from HubSpot's CDN using any Portal ID:
https://js-na1.hs-analytics.net/analytics/0/<portalId>.js
Marketing Hub Enterprise unlocks Custom Events. When a customer configures one through HubSpot's Enterprise Event Visualizer, calls get injected into that script. The pattern to look for is a trackClick call that includes both a trackingConfigId number and an event name with the pe<portalId>_ prefix.
Important: the script always includes initEventVisualizerScript regardless of tier - that alone is NOT a signal. The tell is the trackClick + trackingConfigId combination.
⚠️ Use proxies here. You're hitting HubSpot's CDN once per portal ID at scale.
⏱ Time estimate: 1-2 hours for 113k portals at 30 threads.
Prompt: "Write a Python script that loops through all rows in a SQLite companies table where portal_id is not null. For each portal ID, fetch https://js-na1.hs-analytics.net/analytics/0/<portalId>.js using a rotating proxy. Check if the script contains both a trackingConfigId number value AND an event name matching the pattern pe<digits>_<word>. If both are present, set marketing_hub_enterprise=1. Use 30 threads."
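A sketch of the script check itself, with the download left out. The regexes are my guess at how the values are serialized (a trackingConfigId key followed by a colon and digits, and an event name starting with pe<portalId>_), so test them against a few real scripts before trusting them. Unlike the prompt's generic pe<digits>_ pattern, this version ties the prefix to the portal ID being checked.

```python
import re

# Assumption: the config ID appears as a key followed by a numeric value.
CONFIG_ID_RE = re.compile(r"trackingConfigId[\"']?\s*:\s*\d+")


def has_custom_events(script_text, portal_id):
    """True if the script shows a trackClick call with a config ID and event name.

    The presence of initEventVisualizerScript is deliberately ignored,
    since it appears regardless of tier.
    """
    event_re = re.compile(rf"pe{re.escape(str(portal_id))}_\w+")
    return (
        "trackClick" in script_text
        and CONFIG_ID_RE.search(script_text) is not None
        and event_re.search(script_text) is not None
    )
```

Requiring all three markers together keeps the check from firing on scripts that merely load the visualizer.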
Signal 10 - Sales Hub Enterprise: Salesforce sync in form configs
Already covered in Signal 5's prompt - fold it into that loop for zero extra cost. While fetching the form JSON endpoint, also check whether sfdcCampaignId is populated. If it is, Salesforce integration is configured - Sales Hub Pro at minimum, most commonly Enterprise.
Nobody configures Salesforce sync by accident. Base rate of "SFDC sync + free plan" is effectively zero.
Putting it all together
No single signal tells you a company is paying for HubSpot. Stack all ten and you get a complete picture:
- Tracking script + SPF record -> any HubSpot user
- A records pointing to HubSpot CMS IPs -> Content Hub Pro+
- Multiple root domains on one portal -> Content Hub Enterprise
- hubs.ly / hubs.li / hubs.la links in LinkedIn posts -> Marketing Hub Pro
- Smart fields or conditional logic in forms -> Marketing Hub Pro
- Campaign GUIDs in pop-up configs -> Marketing Hub Pro (enrichment)
- Knowledge base on custom subdomain -> Service Hub Pro
- Meetings widget without HubSpot branding -> Sales Hub Starter+
- trackClick + trackingConfigId in tracking script -> Marketing Hub Enterprise
- sfdcCampaignId in form JSON -> Sales Hub Pro/Enterprise
When I ran this across 3 million companies, I found 113,000+ HubSpot users. We've compiled the full dataset with company-level tier breakdowns at Bloomberry if you'd rather skip the scraping entirely.
General takeaways for detecting paid customers of any tool
These signals are HubSpot-specific, but the methodology applies to almost any SaaS product. Two principles worth keeping:
1. If a product has a visible config file or public endpoint, analyze it. HubSpot's form JSON endpoint, the tracking script, the pop-up audience config - none require authentication. Many SaaS products expose similar config blobs publicly without realizing it.
The playbook: collect config files from real live websites, dump them into Claude, and ask it to look for patterns - fields that correlate with paid features, settings that only appear on certain accounts. Claude has no idea what these configs look like until you feed it real data, but once you do, it's remarkably good at spotting structure and suggesting hypotheses to test. That's how several of these signals were found.
2. Parlay what you learn in one step into the next. The Portal ID collected in Signal 1 - originally just a way to confirm HubSpot usage - turned out to be the key to almost every subsequent signal. The form JSON endpoint needs it. The tracking script URL needs it. The pop-up config endpoint needs it. One piece of data compounded into six signals.
Whenever you find a unique identifier early in your pipeline, ask: what else can I do with this? What other endpoints accept it as a parameter?