What Is theHarvester: Installation and Basic Use

By Leonard Cucos •  Updated: 06/06/21 •  7 min read

theHarvester is an Open Source Intelligence [OSINT] utility used to collect publicly accessible email addresses, subdomains, IP addresses, and URLs from various Internet sources such as Google, Bing, Baidu, Yahoo, LinkedIn, and Twitter.

This tutorial is for Ethical Hacking and Penetration Testing purposes only. It covers theHarvester installation and basic use – enough for you to get a good taste of what this amazing tool can do in the intelligence-gathering process. 

What is theHarvester?

theHarvester is a widely used command-line tool, valued mainly for its ability to scrape publicly available information, such as company names and email addresses, from sources all over the Internet.

It is used mainly for passive reconnaissance, but it also supports active reconnaissance techniques such as DNS brute forcing and taking screenshots of discovered subdomains.

A common saying is that if you don't get any results, you are most likely not using it right.

To obtain information on its targets, theHarvester employs various approaches, including reverse DNS lookups, dictionary-based DNS brute forcing of hostnames, and search engine dorking [advanced queries that surface results not visible in a normal search].

How To Install theHarvester

theHarvester requires Python 3.7+ and can be downloaded from its official repository on GitHub.

Unfortunately, theHarvester is not supported on Windows because of dependency issues, though mixed success has been reported when running it through Docker.
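Before installing, it can help to confirm that the local interpreter meets the Python 3.7+ requirement mentioned above. The following is a minimal sketch, not part of theHarvester itself: the helper name version_ok is hypothetical, and it assumes the interpreter is available as python3 on your PATH.

```shell
# Hypothetical helper: succeeds when a MAJOR.MINOR[.PATCH] version string is 3.7 or newer.
version_ok() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second component
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 7 ]; }
}

# Check the local interpreter only if python3 is actually installed.
if command -v python3 >/dev/null 2>&1; then
    found=$(python3 -c 'import platform; print(platform.python_version())')
    if version_ok "$found"; then
        echo "Python $found meets the 3.7+ requirement"
    else
        echo "Python $found is too old; 3.7 or newer is required"
    fi
else
    echo "python3 was not found on PATH"
fi
```

Only the major and minor components are compared, which is enough for a "3.7 or newer" check.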

# git clone https://github.com/laramies/theHarvester.git
# sudo apt-get install python3-pip

NOTE: Run the following commands from the same directory in which you ran git clone.

# cd theHarvester
# sudo pip3 install -r requirements.txt
# sudo python3 ./theHarvester.py
theHarvester dependency installation. Source: nudesystems.com
On Fedora or Arch Linux, install pip with the distribution's package manager instead of apt-get:

# sudo dnf install python3-pip
# sudo pacman -S python-pip

Syntax and Options

The following command-line syntax is supported and can be displayed with the -h option:

theHarvester [-h] -d DOMAIN [-l LIMIT] [-S START] [-g] [-p] [-s] [--screenshot SCREENSHOT] [-v] [-e DNS_SERVER] [-t DNS_TLD] [-r] [-n] [-c] [-f FILENAME] [-b SOURCE]
theHarvester command-line options and parameters. Source: nudesystems.com

NOTE: The “theharvester” command is deprecated and was replaced by “theHarvester” in the latest releases.

The table below compiles all the optional arguments supported by theHarvester.

Argument | Description
-h, --help | Show the help page and exit.
-d DOMAIN, --domain DOMAIN | Specify the domain name to search.
-l LIMIT, --limit LIMIT | Limit the number of search results [default=500].
-S START, --start START | Start with result number X [default=0].
-g, --google-dork | Use Google Dorks for Google search.
-p PORT_SCAN, --port-scan PORT_SCAN | Scan the detected hosts and check for takeovers on ports 21, 22, 80, 443, and 8080 [default=False, params=True].
-s, --shodan | Use Shodan to query discovered hosts.
-v VIRTUAL_HOST, --virtual-host VIRTUAL_HOST | Verify hostname using DNS resolution and search for virtual hosts [params=basic, default=False].
-e DNS_SERVER, --dns-server DNS_SERVER | DNS server to use for lookup.
-t DNS_TLD, --dns-tld DNS_TLD | Perform a DNS TLD expansion discovery [default=False].
-n DNS_LOOKUP, --dns-lookup DNS_LOOKUP | Enable DNS server lookup [default=False, params=True].
-c, --dns-brute | Perform a DNS brute force on the specified domain.
-f FILENAME, --filename FILENAME | Save the results to an HTML and/or XML file on the local machine.
-b SOURCE, --source SOURCE | Source to query, one of: baidu, bing, bingapi, censys, crtsh, cymon, dnsdumpster, dogpile, duckduckgo, google, google-certificates, hunter, intelx, linkedin, netcraft, securityTrails, threatcrowd, trello, twitter, vhost, virustotal, yahoo.
-x EXCLUDE, --exclude EXCLUDE | Exclude options when using all sources above.
theHarvester optional arguments

Scanning Using theHarvester

Because scraping data is frowned upon by search engines, theHarvester introduces time delays between queries to avoid detection.

It is advisable to limit the number of queries by restricting the number of returned results. This also gives you a more manageable list to work with.

theHarvester supports HTML and XML formats for saving your scan data.
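As a rough illustration of the two points above, the sketch below builds a scan command that combines a small result limit [-l] with the save option [-f]. The domain, the limit of 50, the bing source, and the output name vulnweb_scan are illustrative assumptions, and the exact file name and extension written to disk may differ between theHarvester versions.

```shell
# Illustrative values only; replace them with a target you are authorized to test.
domain="vulnweb.com"
limit=50                 # small limit keeps the number of queries low
src="bing"               # one of the sources listed for -b
outfile="vulnweb_scan"   # hypothetical base name passed to -f

cmd="theHarvester -d $domain -l $limit -b $src -f $outfile"
echo "$cmd"

# Run the scan only when theHarvester is actually installed on this machine.
if command -v theHarvester >/dev/null 2>&1; then
    $cmd
fi
```

Keeping the values in variables makes it easy to repeat the same scan with a different source or limit.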

NOTE: theHarvester will produce zero [0] results if no parameters are input when performing a scan.

At a minimum, the domain [-d] parameter is mandatory; the source [-b] and limit [-l] parameters are normally supplied as well for theHarvester to produce a meaningful scan.

theHarvester example. Source: nudesystems.com

In the example above, the theHarvester scan returned “No IPs, emails, or hosts found” because the scan parameters were not specified correctly.

The -d option is used to define the domain or business name to be searched. 

The -l parameter specifies the maximum number of search results to process from the specified search engine(s).

The -b argument specifies the source search engine for your query such as google, bing, yahoo, etc.  

NOTE: The “all” parameter was removed from theHarvester's optional arguments list.

CAUTION: Increasing your search limit can result in a temporary block by Google or other search engines. The default search LIMIT value is 500, meaning that theHarvester will go through the first 500 search results.

Since Google displays ten search results on each page, 50 pages will be scanned using the default parameter. 
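The page arithmetic above is easy to repeat for other limits. This is a small sketch with a hypothetical helper, pages_for_limit, which assumes ten results per page as stated in the text and rounds up when the limit is not a multiple of ten.

```shell
# Hypothetical helper: number of result pages needed for a given LIMIT,
# assuming 10 results per page (rounded up).
pages_for_limit() {
    per_page=10
    echo $(( ($1 + per_page - 1) / per_page ))
}

for limit in 500 100 25; do
    echo "limit=$limit pages=$(pages_for_limit "$limit")"
done
```

A lower limit translates directly into fewer page requests, which is why reducing -l lowers the chance of being flagged as a scraper.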

That is more than enough for Google [or any other search engine] to identify you as a scraper and either display a Captcha warning or block your results. 

A proxy service [e.g., Storm Proxies] that routes all requests via a separate IP address is highly recommended to circumvent this limitation.

Let’s scan the vulnweb.com domain with the limit set to 100 and LinkedIn as the source, using the following command:

# theHarvester -d www.vulnweb.com -l 100 -b linkedin

The query identified 13 references on LinkedIn related to vulnweb.com, as seen in the capture below.

theHarvester example LinkedIn. Source: nudesystems.com

You can perform the same search against a much larger target, such as microsoft.com, using the same source and parameters.
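Reusing the earlier LinkedIn scan for the larger target would look roughly like the sketch below. Only the domain changes; the limit of 100 and the linkedin source are carried over from the vulnweb.com example, and the variable names are purely illustrative.

```shell
# Same source and limit as the vulnweb.com scan, with a larger target domain.
domain="microsoft.com"
limit=100
src="linkedin"

cmd="theHarvester -d $domain -l $limit -b $src"
echo "$cmd"

# Run the scan only when theHarvester is actually installed on this machine.
if command -v theHarvester >/dev/null 2>&1; then
    $cmd
fi
```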

theHarvester example Microsoft. Source: nudesystems.com

Paste Websites

Hackers and other threat actors often use paste websites, such as the one shown below, to publish samples of what they have found with tools such as theHarvester.

Depending on your objectives, these sites may frequently provide a large quantity of information. Paste websites often provide instances of data breaches, threat actor communications, and numerous dox on people.

Like forums, Paste websites are popular with threat actors and may frequently reveal crucial information during an investigation. 

The difficulty with these platforms is that individual pastes are often removed if they are found to include Personally Identifiable Information [PII] or other types of confidential material, which violates the site’s terms of service.

theHarvester psbdmp. Source: nudesystems.com

As a consequence, many organizations search these Paste sites for new information daily.

Psbdmp is one of the hidden gems on the Internet since it is the only site that preserves a complete historical record of every paste dating back to 2015. 

Not only that, but its API is very user-friendly and can be called through a URL, an external program, or a self-made script.

Conclusion

theHarvester is one of the best OSINT tools of its kind: it is very easy to use and a must-have in your arsenal for active or passive reconnaissance.


Stay safe. 

Leonard Cucos

Leonard Cucos is an engineer with over 20 years of IT/Telco experience managing large UNIX/Linux-based server infrastructures, IP and Optics core networks, Information Security [red/blue], Data Science, and FinTech.
