Email Spider for Websites and PDFs

A Python script that crawls a given website for email addresses and saves the results in a NoSQL file. The search also includes PDF files. This script should be used for educational purposes only. I am not responsible for any misuse of this script. My intention is to help find email addresses on websites and to explain how to protect them from spam (see the prevention section below).

Advantages:

  • crawling PDF files
  • finding email addresses obfuscated with (at)
  • decoding Cloudflare's email obfuscation
  • threading for faster results
  • sending every request with a valid browser user agent
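
To illustrate the Cloudflare item, here is a minimal sketch of how such an obfuscated address can be decoded. It assumes the commonly described scheme in which the hex string stores an XOR key in its first byte and the XOR-ed address in the remaining bytes. The function names `decode_cfemail` and `encode_cfemail` are hypothetical and not taken from this script.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string, assuming byte 0 is an XOR key for the rest."""
    data = bytes.fromhex(encoded)
    key = data[0]  # assumption: the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def encode_cfemail(address: str, key: int = 0x10) -> str:
    """Inverse helper, included only so the decoder can be checked round-trip."""
    return bytes([key] + [b ^ key for b in address.encode("utf-8")]).hex()


if __name__ == "__main__":
    token = encode_cfemail("user@example.com")
    print(token, "->", decode_cfemail(token))
```

Because the key travels with the data, this kind of encoding only stops crawlers that do not bother to decode it.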

Usage

First run the script with the crawl command and the domain you want to crawl.

./main.py crawl -d https://example.com -m 2 -v

After that, you can run the script with the read command to display the results.

./main.py read

Flags

  • -d, --domain: domain or URL to crawl

  • -m, --maxdepth: maximum depth to crawl (default: 2)

  • -v, --verbose: increase verbosity
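
For readers who want to see how a command line like this can be wired up, the following is a rough sketch using `argparse` from the standard library. It only mirrors the commands and flags listed above. The function name `build_parser`, the help texts, and the choice to treat `-v` as a simple on/off switch are assumptions, and the actual main.py may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two subcommands described in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="crawl a website for email addresses")
    crawl.add_argument("-d", "--domain", required=True,
                       help="domain or URL to crawl")
    crawl.add_argument("-m", "--maxdepth", type=int, default=2,
                       help="maximum depth to crawl (default: 2)")
    # Assumption: verbosity is modelled as a boolean flag here.
    crawl.add_argument("-v", "--verbose", action="store_true",
                       help="increase verbosity")

    sub.add_parser("read", help="display the stored result")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```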

Installation

  1. Install Python 3.10 or newer
  2. Install all requirements with pip: pip install -r requirements.txt
  3. Run the script with ./main.py or python main.py

Possible false positives

The email regex pattern can still match strings on a website that are not email addresses, as long as they contain an @. This can happen, for example, when the @ appears in an image title, an alt text, or the src path itself. After collecting possible email addresses, the script checks each candidate against popular media suffixes and excludes matching entries from the result.
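
The idea can be sketched as follows. The regex, the suffix list, and the function name `extract_emails` are illustrative assumptions and not the exact ones used in this script.

```python
import re

# Deliberately simple pattern; real-world address syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of media suffixes that mark a match as a false positive.
MEDIA_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(text: str) -> list[str]:
    """Return sorted unique matches, dropping those that look like media files."""
    candidates = set(EMAIL_RE.findall(text))
    return sorted(c for c in candidates
                  if not c.lower().endswith(MEDIA_SUFFIXES))


if __name__ == "__main__":
    html = '<img src="/img/logo@2x.png"> Contact: info@example.com'
    print(extract_emails(html))
```

In this sample, `logo@2x.png` matches the pattern but is dropped by the suffix check, so only `info@example.com` remains.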

Prevention and solutions

So how can you prevent bots from harvesting email addresses from your own website, and what are the pitfalls? Replacing the @ in your address with (at) is not a real solution. As this script shows, it is trivial to reverse.
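
To show how little protection an (at) gives, here is a minimal sketch that turns such placeholders back into @. The pattern and the name `normalize_at` are assumptions for illustration. The script itself may handle more or fewer variants.

```python
import re

# Matches (at), [at] or {at} in any letter case, with optional surrounding spaces.
AT_RE = re.compile(r"\s*[\(\[\{]\s*at\s*[\)\]\}]\s*", re.IGNORECASE)


def normalize_at(text: str) -> str:
    """Replace bracketed 'at' placeholders with a plain @ sign."""
    return AT_RE.sub("@", text)


if __name__ == "__main__":
    print(normalize_at("info (at) example.com"))
    print(normalize_at("sales[AT]example.com"))
```

A sketch like this also rewrites ordinary prose such as "meet me (at) noon", so a crawler would still run the result through an email pattern afterwards.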

Solution 1

Set a rate limit in your web server so that a bot can't crawl your whole website in a short amount of time. For example, you can use the limit_req module in nginx to limit each IP address to 1 request per second.

limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;

server {
    #...

    location / {
        limit_req zone=one;
    }
}
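
The effect of such a limit can be modelled in a few lines of Python. This is a simplified sketch of per-IP limiting without a burst allowance, not nginx's actual implementation. The class name `PerIpRateLimiter` is hypothetical and the IP addresses are documentation examples.

```python
class PerIpRateLimiter:
    """Accept at most `rate_per_second` requests per client IP, with no burst."""

    def __init__(self, rate_per_second: float = 1.0) -> None:
        self.min_interval = 1.0 / rate_per_second
        self.last_accepted: dict[str, float] = {}

    def allow(self, ip: str, now: float) -> bool:
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        last = self.last_accepted.get(ip)
        if last is not None and now - last < self.min_interval:
            return False  # too soon after the last accepted request
        self.last_accepted[ip] = now
        return True


if __name__ == "__main__":
    limiter = PerIpRateLimiter(rate_per_second=1.0)
    for t in (0.0, 0.2, 0.5, 1.0, 1.1):
        print(t, limiter.allow("203.0.113.5", t))
```

A crawler that fires several requests per second from one address would see most of them rejected, while a normal visitor is rarely affected.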

Solution 2

Hide the email address by encoding it and restoring it with JavaScript. This is a common solution, but it has a big disadvantage: because the correct address only exists after the script has run, screen readers aren't able to read the correct mailto link. Especially with the European Accessibility Act (EAA) in mind, it is not the best solution for everyone.

<a href="mailto:user@domain@@com"
   onmouseover="this.href=this.href.replace('@@','.')">
   Send email
</a>

License

This project is licensed under the GNU General Public License. Please have a look at the LICENSE file for more details.