Patch to skip blacklist entries in hosts file #38

Open
forthrin opened this issue May 28, 2023 · 9 comments

@forthrin

Initialization takes forever with a large hosts blacklist. Proposing the following patch:

index 47c4ef6..4ae0dee 100644
--- a/lib/resolv.rb
+++ b/lib/resolv.rb
@@ -198 +198 @@ class Resolv
-              next unless addr
+              next if !addr || addr.start_with?('0')
@forthrin
Author

Bump

@hanazuki
Contributor

hanazuki commented Apr 25, 2024

Blocklisting via /etc/hosts aims to inhibit resolution of a certain set of domain names system-wide. As Resolv is an alternative to the system resolver, I think it would defeat the purpose of blocklisting for Resolv to ignore 0.0.0.0 entries in /etc/hosts. If it did, end users would face a very confusing situation in which applications written in C, Go, or any other language respect blocklists while Ruby apps don't.

For the specific use case where blocklist entries should be ignored, an instance of Resolv::Hosts with the patched behavior can be passed as an argument to Resolv.new().
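A minimal sketch of that opt-in approach, under stated assumptions: rather than patching Resolv::Hosts itself, it copies the hosts file to a temporary file without the 0.0.0.0 lines and hands that copy to Resolv::Hosts.new. The helper name filtered_hosts_file and the sample entries are hypothetical; only Resolv.new, Resolv::Hosts.new, getaddress and getaddresses come from the library.

```ruby
require 'resolv'
require 'tempfile'

# Hypothetical helper: copy a hosts file, dropping every line whose
# address field is exactly 0.0.0.0 (a typical blocklist entry).
def filtered_hosts_file(source_path)
  filtered = Tempfile.new('hosts-filtered')
  File.foreach(source_path) do |line|
    # Strip comments, then take the first whitespace-separated field.
    addr = line.sub(/#.*/, '').split.first
    next if addr == '0.0.0.0'
    filtered.write(line)
  end
  filtered.flush
  filtered
end

# Stand-in for /etc/hosts so the sketch is self-contained.
sample = Tempfile.new('hosts-sample')
sample.write("127.0.0.1 localhost\n0.0.0.0 ads.test\n")
sample.flush

filtered = filtered_hosts_file(sample.path)
resolver = Resolv.new([Resolv::Hosts.new(filtered.path)])

local_addr = resolver.getaddress('localhost')   # served from the filtered copy
blocked    = resolver.getaddresses('ads.test')  # the blocklist entry is gone
```

Only code that builds its resolver this way skips the blocklist; Resolv's default behavior and every other program on the system still see the full /etc/hosts.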

Generally, I'd not recommend putting such an enormous number of records into /etc/hosts, because the file must be read by every single process that performs hostname resolution (unless it is cached by something like nscd, which is not compatible with some programming languages, including Ruby apps using this library). Instead, you can set up a DNS server that caches hostname-to-address mappings in memory, such as dnsmasq.
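For illustration only, a Ruby script could then send its lookups to such a local caching server instead of parsing /etc/hosts. The sketch assumes a DNS server (dnsmasq or similar) is already listening on 127.0.0.1 port 53; that address and the method name lookup_via_local_dns are assumptions, not something stated in this thread.

```ruby
require 'resolv'

# Assumption: a caching DNS server is reachable at this address.
LOCAL_DNS = '127.0.0.1'

# Build a resolver that consults only the local DNS server.
# Resolv::Hosts is left out on purpose, so /etc/hosts is never parsed here.
dns_only = Resolv.new([Resolv::DNS.new({ nameserver: [LOCAL_DNS] })])

# Hypothetical wrapper; calling it performs a real DNS query.
def lookup_via_local_dns(resolver, name)
  resolver.getaddresses(name)
end

# Example call (needs the local server to be running):
#   lookup_via_local_dns(dns_only, 'example.com')
```

Whether blocklisted names are still blocked then depends entirely on how the DNS server is configured.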

@hanazuki
Contributor

To discuss performance we'd be happy to have some numbers. What is the environment? How poor is the current performance? How is it improved with this patch?

A reproducible benchmark would help us spot which part of the code is slow and optimize it.

@forthrin
Author

I'll get back to you with numbers.

About dnsmasq:

  1. It says it reads from /etc/hosts. Does that mean it takes over the job from the OS in handling this file, and does it much faster?
  2. What other vital, really noticeable benefits does running dnsmasq have? Is it really worth running?
  3. Is it possible to install it on the home router? Or does the home router have to support it from the factory? How does one manage the blacklist on a home router? Do you simply upload a flat file?
  4. Does dnsmasq log/count which blacklist entries are actually used, so you can throw out the unused ones after a month or so?

@forthrin
Author

$ wc -l /etc/hosts
  228858 /etc/hosts
$ git diff -U0
diff --git a/lib/resolv.rb b/lib/resolv.rb
index e36dbce..0356591 100644
--- a/lib/resolv.rb
+++ b/lib/resolv.rb
@@ -190,0 +191 @@ class Resolv
+        time = Time.now.to_f
@@ -205,0 +207 @@ class Resolv
+        printf "Took %.1fs\n", Time.now.to_f - time

Took 0.6s

-              next unless addr
+              next if !addr || addr.start_with?('0')

Took 0.3s

$ grep -v \0.\0.\0.\0 < /etc/hosts > /etc/hosts # pretend this works :D
$ wc -l /etc/hosts
      69 /etc/hosts

Took 0.0s

@forthrin
Author

PS! 0.0.0.0 entries are only useful for browsers etc., which connect to a plethora of unwanted servers. resolv is part of HTTPX, which is used for dev projects with full control over what is contacted, so blacklisting is unnecessary there.

@hanazuki
Contributor

That looks quite a bit faster than "forever" :)

Self-contained benchmark:

require 'benchmark/ips'
require 'resolv'
require 'tempfile'

hosts = {
  small: 20,
  medium: 2000,
  large: 200000,
}.transform_values do |size|
  f = Tempfile.open('hosts')
  f.write("127.0.0.1 localhost\n")
  size.times do |i|
    f.printf("0.0.0.0 %x.test\n", i)
  end
  f.tap(&:flush)
end

Benchmark.ips do |x|
  x.warmup = 1
  x.time = 5

  hosts.each do |name, f|
    x.report(name) do
      Resolv.new([Resolv::Hosts.new(f.path)]).getaddress('localhost')
    end
  end

  x.compare!
end
% bundle exec ruby ./benchmark/hosts.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
Warming up --------------------------------------
               small     1.417k i/100ms
              medium    14.000 i/100ms
               large     1.000 i/100ms
Calculating -------------------------------------
               small     13.765k (± 5.6%) i/s -     69.433k in   5.061995s
              medium    147.119 (± 4.8%) i/s -    742.000 in   5.054971s
               large      0.515 (± 0.0%) i/s -      3.000 in   5.836935s

Comparison:
               small:    13765.1 i/s
              medium:      147.1 i/s - 93.56x  slower
               large:        0.5 i/s - 26739.79x  slower
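Note that the benchmark above builds a fresh Resolv::Hosts on every iteration, so each report measures a full parse of the file. As a rough sketch, the following times two lookups on one reused instance, on the understanding that Resolv::Hosts loads the file lazily, once per instance. The entry count of 50,000 is an arbitrary choice, and timings will vary by machine.

```ruby
require 'benchmark'
require 'resolv'
require 'tempfile'

# Build a throwaway hosts file: one real entry plus many blocklist entries.
hosts_file = Tempfile.new('hosts')
hosts_file.write("127.0.0.1 localhost\n")
50_000.times { |i| hosts_file.printf("0.0.0.0 %x.test\n", i) }
hosts_file.flush

resolver = Resolv.new([Resolv::Hosts.new(hosts_file.path)])

first_addr = nil
second_addr = nil

# The first lookup triggers the parse of the whole file.
first = Benchmark.realtime { first_addr = resolver.getaddress('localhost') }
# A later lookup on the same instance reuses the in-memory tables.
second = Benchmark.realtime { second_addr = resolver.getaddress('localhost') }

printf("first lookup:  %.4fs\nsecond lookup: %.4fs\n", first, second)
```

This only amortizes the cost within one process; a short-lived script still pays for the parse on its first lookup.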

@hanazuki
Contributor

PS! 0.0.0.0 entries are only useful for browsers etc. which connect to a plethora of unwanted servers. resolv is part of HTTPX used for dev projects with full control over what is contacted, thus blacklisting is unnecessary.

Resolv is a generic library that implements hostname resolution (it's not just a part of the httpx library; any application written in Ruby can use it), so, IMO, it should be neutral about use cases. Also, because /etc/hosts is a system-wide setting, any application running on the system is expected to respect it by default.

So I think optimizing Resolv for a large /etc/hosts database is good, but changing its behavior in the suggested way is not desirable.

@forthrin
Author

forthrin commented Apr 26, 2024

Agreed, but a consistent half-second delay in every script that uses the library is unacceptably sluggish. (I also think "forever" may have been longer, but under other circumstances that I can no longer recall.)

See questions about dnsmasq. If that (or any other approach) can alleviate the need for swamping /etc/hosts with 200k+ entries, there is no problem here.
