dns_cache #63
Comments
If it is possible to let the user choose, then we can do that. On the other hand, if your solution supports caching of negative responses and the current one does not, then I would go with your dns_cache. Could you improve the logging as you are suggesting, please? It might be helpful for users when they are debugging the code. |
Ok, I'll get a PR underway today. |
#65 is a first cut of showing that negative hits are cached. |
btw, dns-cache was built for https://github.com/jayvdb/pypidb , where I am also using urlextract ; the test suite is processing a huge dataset, and exposes quite a lot of potential improvements with urlextract. |
Yeah, I've already checked that. Interesting project. I am glad that this small library could be part of it :) |
I've started that with #68, but those are less about the DNS aspects.
To see DNS issues, it would be helpful to add some optional mechanism for URLExtract to keep a list of rejected URLs/domains, so that I can easily review those in my test suite, highlighting any which might be solvable earlier in URLExtract to reduce the DNS hits.
Currently the best way to do that is to cause a test class to fail all packages and review the logs. The test runner will stop after 50 failures - edit tox.ini to see more.
e.g. https://pypi.org/project/Genshi/
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.7.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.6.1.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.6.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.5.1.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.5.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.4.4.zip) gaierror(-2)
blurb:
kaitaistruct
libpysal
pyxdg
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://freedesktop.org/wiki/Software/pyxdg
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(inifile.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mimetype.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mime.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(applications.menu) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(icontheme.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(inifile.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(basedirectory.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(icontheme.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(locale.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mime.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(recentfiles.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
This is extremely common when processing webpages. (mwlib.ext is where I am seeing it now)
mwlib.ext
py-trello
msgpack-python: See #69 (comment) |
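To make the "keep a list of rejected URLs/domains" idea from the comment above concrete, here is a minimal sketch of what such a collector could look like. It is not part of URLExtract: the class name `RejectedHostLog`, its `check` method and the `resolve` parameter are all hypothetical, and it only assumes that host checking goes through something like `socket.gethostbyname`, as the log lines suggest.

```python
import socket
from collections import Counter


class RejectedHostLog:
    """Hypothetical helper: remembers every host whose DNS lookup failed."""

    def __init__(self, resolve=socket.gethostbyname):
        # The resolver is injectable so the helper can be tested offline.
        self._resolve = resolve
        self.rejected = Counter()

    def check(self, host):
        """Return True if the host resolves; otherwise record it and return False."""
        try:
            self._resolve(host)
        except (OSError, UnicodeError):
            # socket.gaierror is a subclass of OSError; UnicodeError can be
            # raised when a label cannot be IDNA-encoded.
            self.rejected[host] += 1
            return False
        return True
```

After a test run, `log.rejected.most_common()` would list the hosts that caused the most failed lookups, which is the kind of review the log dumps in this thread are currently being used for.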
config:
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://docs.red-dove.com/cfg/python.html
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
...
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(settings.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
...
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: https://play.google.com/store/apps/details?id=com.google.android.apps.authenticator2&
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.call) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(p.click) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(co.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(co.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(m.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(k.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.re) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.re) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.call) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(array.prototype.map) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.name) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.data) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(window.google&&window.google.sn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(window.google.sn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.tc) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.tc) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(g.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.next.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(b.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(b.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.bottom-this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.right-a.left,a.bottom-a.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(g.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.mls) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.bv) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.kn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.sb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(cp.me) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(cp.ml) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(up.sl) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.bottom-a.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.o.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.id) gaierror(-2)
INFO urlextract:urlextract_core.py:518 Invalid host 'http://.o.style.width=b.items[c]'. If the host is valid report a bug.
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.gb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.qa) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.lb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(person.photo) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.si) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(silk.s.sis.ca) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.play.games) gaierror(-2)
geoip
|
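Some of the candidates in the logs above, such as `window.google&&window.google.sn` or `a.right-a.left,a.bottom-a.top`, contain characters that cannot appear in a hostname, so they could in principle be rejected before any DNS query is made. The following is only a sketch of such a syntactic pre-check, not URLExtract's actual logic. The function name `looks_like_hostname` is made up for this example, and it assumes internationalized names have already been converted to their ASCII (punycode) form.

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def looks_like_hostname(candidate):
    """Return True if candidate is syntactically plausible as a hostname."""
    host = candidate.rstrip(".")  # tolerate a trailing root dot
    if not host or len(host) > 253:
        return False
    # Every dot-separated label must match; an empty label (leading or
    # doubled dot) fails the match and rejects the whole candidate.
    return all(_LABEL.fullmatch(label) for label in host.split("."))
```

This only removes part of the noise. Names like `menu.py` or `this.bottom-this.top` are syntactically valid hostnames, so they would still reach the resolver and would need a different heuristic, or negative caching, to keep the DNS cost down.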
I've created https://github.com/jayvdb/dns-cache which caches negative responses, which is quite helpful when using the recently added DNS checking in URLExtract.
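For readers unfamiliar with negative caching, the idea can be sketched with the standard library alone. This is not the dns-cache API. The class `NegativeCachingResolver`, its `ttl` default and the injectable `resolve` and `clock` parameters are assumptions made for this illustration.

```python
import socket
import time


class NegativeCachingResolver:
    """Illustrative wrapper that remembers failed lookups for a while."""

    def __init__(self, resolve=socket.gethostbyname, ttl=300.0, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl
        self._clock = clock
        self._failures = {}  # host -> (expiry time, cached exception)

    def gethostbyname(self, host):
        cached = self._failures.get(host)
        if cached is not None:
            expiry, exc = cached
            if self._clock() < expiry:
                # Negative hit: fail immediately without touching the network.
                raise exc
            del self._failures[host]  # entry expired, try the lookup again
        try:
            return self._resolve(host)
        except socket.gaierror as exc:
            # Remember the failure so repeated lookups of the same bad name are cheap.
            self._failures[host] = (self._clock() + self._ttl, exc)
            raise
```

With a wrapper along these lines, a name such as `menu.py` that shows up many times on one page would cost a single real lookup per `ttl` window instead of one per occurrence.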
Should I add dns_cache to dns_cache_install? Or just mention it in the README for users who want more control?
Also, there is a fairly serious problem with the dnspython "socket" resolver on Windows during negative responses: rthalley/dnspython#416
However, the AttributeError caused there should be caught at https://github.com/lipoja/URLExtract/blob/1eb9ad5/urlextract/urlextract_core.py#L564 , so the logging there is the only bit which can be improved.
We can also improve the logging by catching socket.gaierror and giving it a better log entry.
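As a rough idea of what a better log entry for socket.gaierror could look like, the sketch below catches it separately and logs the symbolic `EAI_*` name alongside the message. The logger name and the helpers `describe_gaierror` and `host_resolves` are hypothetical and not taken from URLExtract; the numeric `EAI_*` values differ between platforms, so the symbolic name is looked up at runtime rather than hard-coded.

```python
import logging
import socket

logger = logging.getLogger("urlextract_example")  # hypothetical logger name


def describe_gaierror(exc):
    """Map a gaierror's errno to a socket.EAI_* constant name, if one matches."""
    for name in dir(socket):
        if name.startswith("EAI_") and getattr(socket, name) == exc.errno:
            return name
    return "unknown"


def host_resolves(host, resolve=socket.gethostbyname):
    """Return True if host resolves; log a descriptive entry and return False if not."""
    try:
        resolve(host)
    except socket.gaierror as exc:
        # A failed lookup is an expected outcome, so it gets its own clear
        # message instead of falling through to an "Unknown exception" entry.
        logger.info(
            "Host %r did not resolve: %s (%s, errno %s)",
            host, exc.strerror, describe_gaierror(exc), exc.errno,
        )
        return False
    return True
```

Other exceptions are deliberately left uncaught here, so whatever generic handler already exists would still see them.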