Continuing from #63, these are the known issues (list will grow).
As a general rule, the higher-priority issues are where urlextract doesn't extract valuable urls, or extracts truncated urls. Returning extra junk around urls, or extra urls, is problematic, but I can trim/remove junk. I can't fix data I don't have.
Others I think are harder and may not be in urlextract's scope:
- Lots of annoying invalid `.py` domains filtered out by DNS checking, such as `setup.py`, which is assumed to be https://setup.py, https://manifest.py, etc. This is a significant performance problem for the first few requests, as they are DNS negatives which need to get cached, and they also slow down urlextract. Lots of other country-code TLDs occasionally coincide with file extensions, such as https://manifest.in/ and http://readme.md/. This could be handled in dns_cache by seeding the DNS cache with known invalid entries; urlextract could help with domain name filtering (see the first sketch after this list).
- Relative urls (jayvdb/pypidb#38). This would be a huge enhancement to URLExtract, but it requires adding a completely different extraction algorithm (see the second sketch after this list).
- `e.target` is really common, appearing in `<script>` blocks, but I am not sure it would be useful to exclude urls found in `script` tags (via https://pypi.org/project/config).
- `{{` in url; pydevd-pycharm (see the third sketch after this list).
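A minimal sketch of the domain-filtering idea, assuming the candidates come from `URLExtract.find_urls()`; the `FILENAME_DOMAINS` set and the helper name are illustrative, not part of any existing API:

```python
from urllib.parse import urlsplit

from urlextract import URLExtract

# Repository filenames whose "extensions" are also valid country-code TLDs
# (.py Paraguay, .in India, .md Moldova), so they parse as hostnames.
# Illustrative subset, not an exhaustive list.
FILENAME_DOMAINS = {"setup.py", "manifest.py", "manifest.in", "readme.md"}

def find_urls_without_filenames(text):
    extractor = URLExtract()
    urls = []
    for candidate in extractor.find_urls(text):
        # Candidates may lack a scheme, so prefix "//" before splitting
        # out the host part.
        host = urlsplit(candidate if "//" in candidate else "//" + candidate).hostname
        if host and host.lower() in FILENAME_DOMAINS:
            continue  # skip setup.py & friends: no DNS query, no negative-cache miss
        urls.append(candidate)
    return urls

print(find_urls_without_filenames("Run setup.py and see https://pypi.org"))
# expected: ['https://pypi.org']
```

Seeding dns_cache with the same names as permanent negative entries would achieve the same effect one layer down, without urlextract needing to know about it.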
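For the relative-url case, plain-text scanning isn't enough. A hedged sketch of what a different algorithm could look like, assuming HTML input and a known base url (the class name and example urls are hypothetical):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class RelativeLinkExtractor(HTMLParser):
    """Collect anchor hrefs and resolve them against a base url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin turns "../docs/" into an absolute url and
                    # leaves already-absolute hrefs untouched.
                    self.urls.append(urljoin(self.base_url, value))

parser = RelativeLinkExtractor("https://example.org/pkg/")
parser.feed('<a href="../docs/index.html">docs</a>')
print(parser.urls)
# expected: ['https://example.org/docs/index.html']
```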
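The `{{` case falls into the "junk I can trim" category from the intro. A trivial sketch, assuming template braces never belong in a real url (the helper name and example url are hypothetical):

```python
def strip_template_junk(url):
    # Anything from the first "{{" onward is template syntax, not url.
    return url.split("{{", 1)[0]

print(strip_template_junk("https://example.org/page{{ var }}"))
# expected: https://example.org/page
```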