Location Extractor. Crawl millions of business websites to find their Office Locations

5-minute read

Problem

Our client provides market research that helps its customers identify the supply of and demand for particular kinds of businesses in a given region. Their database already offered location data for 1 Billion+ businesses, and they wanted to expand the number of records in it.

Solution

We built an Internet crawler. Much like a typical search engine crawler, it visited millions of websites every day and used AI and NLP to identify the office locations each business operates from. These locations were extracted from specific pages of each website, such as Contact Us, About Us, and Office Locations.

Approach

Many websites today are built as rich internet applications. Because these websites rely heavily on JavaScript, typical HTTP crawlers, which fetch the raw HTML without executing scripts, cannot see the content that is rendered in the browser. To solve this, we built a Deep Web Crawler. It loaded each website in a JavaScript-enabled web browser and followed the network of links on the site, much as a human visitor would.
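The link-following part of such a crawler can be sketched in a few lines of Python. This is a minimal illustration, not the crawler we built: the `render` callable is a hypothetical stand-in for the JavaScript-enabled browser (it is assumed to return the HTML of a page after scripts have run, or None on failure), and `crawl`, `LinkParser`, and `max_pages` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags in rendered HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, render, max_pages=50):
    """Breadth-first crawl of a single site.

    `render` is an assumed callable that returns post-JavaScript HTML
    for a URL (for example, from a headless browser), or None on failure.
    """
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = render(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urljoin(url, href).split("#")[0]
            # Stay on the same host and never revisit a page.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the renderer in as a parameter keeps the traversal logic independent of whichever browser automation tool is used, and lets it be exercised against an in-memory dictionary of pages.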

To speed up the crawler and reduce crawling costs, we also implemented Focused Crawling. Instead of going through the entire content of a website, the crawler used AI to identify just the key pages where location data was most likely to be found. By identifying which pages were the Contact Us, About Us, and Office Locations pages, it navigated directly to them and accessed only the information we were interested in.
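As a rough illustration of the focused-crawling idea, the sketch below ranks the links found on a page by how likely they are to lead to location data. It swaps the AI model described above for a simple keyword heuristic: the `KEYWORDS` weights, `score_link`, and `pick_pages` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical keyword weights; a trained classifier would replace these.
KEYWORDS = {
    "contact": 3.0,
    "location": 3.0,
    "office": 2.5,
    "about": 2.0,
    "store": 1.5,
}


def score_link(url, anchor_text):
    """Sum the weights of keywords found in the URL or the anchor text."""
    tokens = set(re.findall(r"[a-z]+", f"{url} {anchor_text}".lower()))
    score = 0.0
    for keyword, weight in KEYWORDS.items():
        # Prefix match so that "locations" and "offices" also count.
        if any(token.startswith(keyword) for token in tokens):
            score += weight
    return score


def pick_pages(links, limit=3):
    """Return up to `limit` URLs with a positive score, best first.

    `links` is a list of (url, anchor_text) pairs.
    """
    ranked = sorted(links, key=lambda link: score_link(*link), reverse=True)
    return [url for url, text in ranked[:limit] if score_link(url, text) > 0]
```

With a ranking like this, the crawler only needs to render the few highest-scoring pages per site rather than every page, which is where the savings in time and cost come from.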

Finally, we used a combination of Natural Language Processing and Computer Vision to identify the specific regions of each page where location details were mentioned.
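To make the text side of this step concrete, here is a deliberately simplified sketch that scans the text blocks of a page for address-like strings. It uses a regular expression in place of the NLP models and does not cover the Computer Vision part at all. The pattern assumes US-style street addresses, and `ADDRESS_RE` and `find_addresses` are hypothetical names for this example.

```python
import re

# Assumed pattern for US-style street addresses; illustrative only.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\.?\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\b\.?"
    r"(?:,\s*(?:Suite|Ste)\.?\s*\w+)?"
    r"(?:,\s*[A-Z][A-Za-z ]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?)?"
)


def find_addresses(blocks):
    """Return (block_index, address) pairs for address-like text.

    `blocks` is a list of strings, one per visible region of the page
    (for example, one per paragraph or footer column).
    """
    hits = []
    for index, text in enumerate(blocks):
        for match in ADDRESS_RE.finditer(text):
            hits.append((index, match.group(0)))
    return hits
```

Returning the block index alongside the match keeps track of which region of the page the address came from, which mirrors the goal of locating the regions where location details appear.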

Result

Our client expanded their database from 1 Billion business locations to 1.5 Billion business locations in 6 months. This 50% increase was attributed to the Location Extractor Internet crawler we built for them.