[SITES] https://calarasipress.ro/ #628

Open
5 of 10 tasks
TudorAndrei opened this issue Mar 28, 2024 · 0 comments
TudorAndrei commented Mar 28, 2024

First, please check that this is really an issue with the library and not a special case of the website:

  • There is no paywall
  • You do not have to be logged in to see the articles
  • You tried using a common browser user agent in your configuration / call
  • The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://calarasipress.ro/au-sunat-alarmele-la-calarasi-oamenii-nu-au-stiut-ce-se-intampla/img_3495/

The others work as intended

www.example.com/article1
www.example.com/article2

The exact code I used to test this article/website:

from newspaper import Article

# load html manually; `html` holds the page source as a string
at = Article(url=None)
at.download(html, title="")
at.parse()
at.text

What parts of the article are missing / not parsed correctly:

  • Title
  • Text Content
  • Publication Date
  • Authors
  • Images
  • Movies

Other information, remarks, messages, etc:

The extractor returns the lines from the "Breaking News" section as the article text. This is easy to miss in a browser: that content is present in the HTML, but it only becomes visible when the user hovers over the "Breaking News" tab.
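
Until the extractor handles this case, one possible workaround is to drop the hidden block from the HTML before handing it to the library. The following is a minimal sketch using only the Python standard library. The class name `breaking-news` is an assumption, since the real markup of calarasipress.ro was not inspected here, and `BlockStripper` and `strip_block` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockStripper(HTMLParser):
    """Re-emit HTML while skipping every element that carries a given class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def _has_class(self, attrs):
        return self.css_class in (dict(attrs).get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._has_class(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close at once.
        if not self.skip_depth and not self._has_class(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_block(html, css_class):
    """Return `html` without any element whose class list contains `css_class`."""
    parser = BlockStripper(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this in place, the call from the report would become `at.download(strip_block(html, "breaking-news"), title="")`. The sketch assumes the skipped block has balanced open and close tags; markup that omits closing tags (for example an unclosed `<li>`) would throw off the depth counter, and comments and doctype declarations are not re-emitted.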
