theHarvester is an Open Source Intelligence [OSINT] utility used to collect publicly accessible email addresses, subdomains, IP addresses, and URLs from various Internet sources such as Google, Bing, Baidu, Yahoo, LinkedIn, Twitter, etc.
This tutorial is for Ethical Hacking and Penetration Testing purposes only. It covers theHarvester installation and basic use – enough for you to get a good taste of what this amazing tool can do in the intelligence-gathering process.
What is theHarvester?
theHarvester is one of the most capable command-line tools of its kind, mainly due to its ability to broadly scrape publicly available information such as company names and email addresses from all over the Internet.
It is used mainly for passive reconnaissance, but DNS brute forcing and subdomain screenshots are also possible as part of active reconnaissance.
A common saying among its users is that if you don’t get any results, you’re most likely not using it right.
To obtain information on its targets, theHarvester employs several approaches, including DNS reverse lookups, dictionary-based DNS brute force attacks, hostname enumeration, and search engine dorking [surfacing results that are not visible through a normal search].
How To Install theHarvester
theHarvester requires Python 3.7+ and can be downloaded from its official repository on GitHub.
Unfortunately, theHarvester is not available natively on Windows due to dependency issues, though mixed success has been reported when running it through Docker.
- On Kali Linux, theHarvester comes preinstalled in the latest 2021.x releases.
- On Debian and Debian-based operating systems [Ubuntu, Linux Mint, Pop!_OS, etc.], type the following commands in the terminal one at a time [without “#”].
# git clone https://github.com/laramies/theHarvester.git
# sudo apt-get install python3-pip
NOTE: Run the following commands in the same directory where you cloned theHarvester GitHub.
# cd theHarvester
# sudo pip3 install -r requirements.txt
# sudo python3 ./theHarvester.py
- On RPM-based systems, the procedure is similar to the above installation, but instead of using apt-get, we will use dnf or yum, as follows:
# sudo dnf install python3-pip
- On Arch Linux and derivatives, use the above procedure, but install pip using pacman [the package is named python-pip on Arch] by typing the following command in the terminal:
# sudo pacman -S python-pip
Syntax and Options
The following command-line syntax is supported and can be triggered with the -h option:
theHarvester [-h] -d DOMAIN [-l LIMIT] [-S START] [-g] [-p] [-s] [--screenshot SCREENSHOT] [-v] [-e DNS_SERVER] [-t DNS_TLD] [-r] [-n] [-c] [-f FILENAME] [-b SOURCE]
NOTE: “theharvester” command-line is deprecated and was replaced by “theHarvester” in the latest releases.
The table below compiles all the optional arguments supported by theHarvester.
| Argument | Description |
|---|---|
| -h, --help | Show the help page and exit |
| -d DOMAIN, --domain DOMAIN | Specify the domain name to search |
| -l LIMIT, --limit LIMIT | Limit the number of search results [default=500] |
| -S START, --start START | Start with result number X [default=0] |
| -g, --google-dork | Use Google Dorks for Google search |
| -p PORT_SCAN, --port-scan PORT_SCAN | Scan the detected hosts and check for takeovers on ports 21, 22, 80, 443, 8080 [default=False, params=True] |
| -s, --shodan | Use Shodan to query discovered hosts |
| -v VIRTUAL_HOST, --virtual-host VIRTUAL_HOST | Verify the hostname using DNS resolution and search for virtual hosts [params=basic, default=False] |
| -e DNS_SERVER, --dns-server DNS_SERVER | DNS server to use for lookup |
| -t DNS_TLD, --dns-tld DNS_TLD | Perform a DNS TLD expansion discovery [default=False] |
| -n DNS_LOOKUP, --dns-lookup DNS_LOOKUP | Enable DNS server lookup [default=False, params=True] |
| -c, --dns-brute | Perform a DNS brute force on the specified domain |
| -f FILENAME, --filename FILENAME | Save the results to an HTML and/or XML file on the local machine |
| -b SOURCE, --source SOURCE | Specify the data source: baidu, bing, bingapi, censys, crtsh, cymon, dnsdumpster, dogpile, duckduckgo, google, google-certificates, hunter, intelx, linkedin, netcraft, securityTrails, threatcrowd, trello, twitter, vhost, virustotal, yahoo |
| -x EXCLUDE, --exclude EXCLUDE | Exclude sources when using all of the above |
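As an example of combining the flags above, the -c option can be added to a normal scan to brute force subdomains as well. The sketch below uses vulnweb.com, the deliberately vulnerable demo domain scanned later in this article:

```shell
DOMAIN=vulnweb.com    # demo target, as used elsewhere in this article
# Query Bing for the domain and additionally brute force its subdomains
theHarvester -d "$DOMAIN" -b bing -c
```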
Scanning Using theHarvester
Because data scraping is frowned upon by search engines, theHarvester introduces time delays between queries to evade detection.
It is advisable to limit the number of queries by restricting the number of returned results. This also leaves you with a more manageable list to work with.
theHarvester supports HTML and XML formats for saving your scan data.
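For instance, here is a sketch of a scan that saves its output with the -f flag; the base name scan_results is just an illustration, and theHarvester appends the file extension itself:

```shell
OUTFILE=scan_results   # hypothetical report base name
# Scan vulnweb.com via Bing, cap at 100 results, write scan_results.html/.xml
theHarvester -d vulnweb.com -b bing -l 100 -f "$OUTFILE"
```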
NOTE: theHarvester will produce zero [0] results if no parameters are input when performing a scan.
The domain [-d] and source [-b] parameters are mandatory for theHarvester to produce a scan, while the limit [-l] parameter is optional.
In the example above, the theHarvester scan reported that no IPs, emails, or hosts were found. The reason is that the scan parameters were not entered correctly.
The -d option is used to define the domain or business name to be searched.
The -l parameter specifies the maximum number of searches to be performed against the specified search engine(s).
The -b argument specifies the source search engine for your query such as google, bing, yahoo, etc.
NOTE: the “all” parameter was removed from theHarvester’s list of optional arguments.
CAUTION: Increasing your search limit may result in a temporary suspension from Google or other search engines. The default LIMIT value is 500, meaning theHarvester will go through the first 500 search results.
Since Google displays ten search results on each page, 50 pages will be scanned using the default parameter.
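The page arithmetic above can be checked with a quick shell calculation:

```shell
LIMIT=500       # default theHarvester search limit
PER_PAGE=10     # results shown on each Google page
PAGES=$((LIMIT / PER_PAGE))
echo "$PAGES pages will be scanned"   # 50 pages will be scanned
```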
That is more than enough for Google [or any other search engine] to identify you as a scraper and either display a Captcha warning or block your results.
A proxy service [e.g., Storm Proxies] that routes all requests via a separate IP address is highly recommended to circumvent this limitation.
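One generic way to route the traffic through a separate IP, sketched below, is to wrap theHarvester in proxychains. This assumes proxychains is installed and that /etc/proxychains.conf already lists the proxy service you subscribe to:

```shell
DOMAIN=vulnweb.com     # example target
# All of theHarvester's traffic is forced through the configured proxies
proxychains theHarvester -d "$DOMAIN" -b google -l 100
```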
Let’s perform a scan for the vulnweb.com domain with a limit set to 100 using LinkedIn as a source using the following command:
# theHarvester -d www.vulnweb.com -l 100 -b linkedin
The query identified 13 references to vulnweb.com on LinkedIn, as seen in the capture below.
You can perform the same search against a much larger target, e.g., microsoft.com, using the same source and parameters.
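For example, the larger-target scan is a one-flag change:

```shell
TARGET=microsoft.com   # larger example target
theHarvester -d "$TARGET" -l 100 -b linkedin
```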
Paste Websites
Hackers and other threat actors often use paste websites, such as those listed below, to publish samples of findings gathered with tools like theHarvester.
- Justpaste.it
- Pastebin.com
- Doxbin.org
- 0bin.net
Depending on your objectives, these sites can provide a large quantity of information. Paste websites often host samples of data breaches, threat actor communications, and numerous doxes on individuals.
Like forums, Paste websites are popular with threat actors and may frequently reveal crucial information during an investigation.
The difficulty with these platforms is that individual pastes are often removed once they are found to include Personally Identifiable Information [PII] or other confidential material that violates the site’s terms of service.
As a consequence, many organizations search these Paste sites for new information daily.
Psbdmp is one of the hidden gems of the Internet, as it is among the few sites that preserve a historical record of pastes dating back to 2015.
Not only that, but its API is very user-friendly and can be used through a URL or an external program, or a self-made script.
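As an illustration only, a search can be scripted with curl. The exact endpoint below is an assumption and should be checked against Psbdmp’s current API documentation before use:

```shell
QUERY="example.com"    # hypothetical search term
# Hypothetical Psbdmp API search; verify the endpoint in the official docs
curl -s "https://psbdmp.ws/api/search/${QUERY}"
```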
Conclusion
theHarvester is one of the best OSINT tools of its kind: very easy to use, and a must-have in your arsenal for active or passive reconnaissance.
If you found this article useful, please share it with your friends and colleagues – it really helps this website grow faster.
Stay safe.
