Releases: gocolly/colly
v2.2.0
Tons of bug fixes, small improvements, and updated dependencies
What's Changed
- Added new project that uses Colly by @twiny in #525
- adding docker-slim to the list of open source projects using Colly by @kcq in #526
- update coursera example by @ysung6 in #537
- Update openedx_courses.go by @lingt-xyz in #539
- feat: support Stop() in queue by @PalanQu in #531
- Fix invalid base URL assignment by @dasshield in #549
- Add context.Context support by @WGH- in #548
- Fix handling of redirected URLs in OnResponseHeaders by @WGH- in #561
- Fix race on count fields of Collector type by @shubham14bajpai in #563
- SetProxyFunc: fix proxies rotation by @wowinter13 in #567
- Update xkcd_store.go by @BigManing in #570
- Updated xmlquery to close a known security vulnerability by @ijomeli in #582
- Fix default User-Agent when using custom headers by @WGH- in #592
- Update openedx_courses.go by @BigManing in #576
- Add httpBackend cache response callback by @slashapp in #574
- Update README.md by @seversky in #595
- Change the type assertion for Attr by @moritamori in #588
- Update User-Agent strings for Desktop by @guessi in #605
- add quote crawler using colly by @eval-exec in #603
- Fix golint issues by @waltton in #620
- Replace Travis with GitHub Actions by @alexhraber in #669
- Use github.com/nlnwa/whatwg-url for URL parsing by @WGH- in #673
- Fix Async ignoring its arguments by @ljbink in #652
- fix: generated code has compile error by @darjun in #629
- Move filtering to one function (checkFilters) and use it for redirects too by @joelazar in #638
- Fix Contributing.md link by @SaladinoBelisario in #641
- Work around invalid URL encoding in path by @WGH- in #675
- Host & Custom Header Support by @ErikOwen in #691
- Code cleanup by @WGH- in #678
- Add GitHub issue templates and nudge questions to Stack Overflow by @WGH- in #702
- Fix redirects ignoring AllowURLRevisit=false by @WGH- in #681
- Bump whatwg-url version by @WGH- in #713
- fix some typos by @cuishuang in #736
- feat: Add MaxRequests parameter by @guilhem in #722
- Update CI configuration by @WGH- in #749
- upgrade Cascadia version by @kinoute in #650
- Bump golang.org/x/net from 0.0.0-20220114011407-0dd24b26b47d to 0.7.0 by @dependabot in #756
- Support websites redirecting to the same page when AllowURLRevisit is disabled by @WGH- in #763
- Update User-Agent strings for 2023 by @guessi in #765
- Fix data races by @WGH- in #771
- Replaced deprecated io/ioutil package with proper implementations of … by @zachary-walters in #785
- Bump golang.org/x/net from 0.7.0 to 0.17.0 by @dependabot in #789
- put accept header before OnRequest by @k4lizen in #786
- Fix bug: retrying a scrape loses POST requestData by @Shinku-Chen in #794
- scrapper -> scraper Update README.md by @ideabrian in #809
- Test on Go 1.22, drop old versions by @WGH- in #810
- Bump google.golang.org/protobuf from 1.24.0 to 1.33.0 by @dependabot in #807
- Implement content sniffing for HTML parsing by @WGH- in #808
- Add headers/ua to robots.txt request by @makew0rld in #838
- doc: mention that OnScraped executes after OnHTML and OnXML by @aladmit in #832
New Contributors
- @twiny made their first contribution in #525
- @kcq made their first contribution in #526
- @ysung6 made their first contribution in #537
- @lingt-xyz made their first contribution in #539
- @PalanQu made their first contribution in #531
- @dasshield made their first contribution in #549
- @shubham14bajpai made their first contribution in #563
- @wowinter13 made their first contribution in #567
- @BigManing made their first contribution in #570
- @ijomeli made their first contribution in #582
- @slashapp made their first contribution in #574
- @seversky made their first contribution in #595
- @moritamori made their first contribution in #588
- @guessi made their first contribution in #605
- @eval-exec made their first contribution in #603
- @waltton made their first contribution in #620
- @alexhraber made their first contribution in #669
- @ljbink made their first contribution in #652
- @darjun made their first contribution in #629
- @joelazar made their first contribution in #638
- @SaladinoBelisario made their first contribution in #641
- @ErikOwen made their first contribution in #691
- @cuishuang made their first contribution in #736
- @guilhem made their first contribution in #722
- @kinoute made their first contribution in #650
- @dependabot made their first contribution in #756
- @zachary-walters made their first contribution in #785
- @k4lizen made their first contribution in #786
- @Shinku-Chen made their first contribution in #794
- @ideabrian made their first contribution in #809
- @makew0rld made their first contribution in #838
- @aladmit made their first contribution in #832
Full Changelog: v2.1.0...v2.2.0
v2.1.0
v2.0.0
- Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
- Go module support
- Collector.HasVisited method added to check whether a URL has already been visited
- Collector.SetClient method introduced
- HTMLElement.ChildTexts method added
- New user agents
- Multiple bugfixes
v1.2.0
v1.1.0
- Appengine integration takes context.Context instead of http.Request (API change)
- Added "Accept" http header by default to every request
- Support slices of pointers and structs in unmarshal
- Fixed a race condition in queues
- ForEachWithBreak method added to HTMLElement
- Added a local file example
- Support gzip decompression of response bodies
- Don't share waitgroup when cloning a collector
- Fixed Instagram example
v1.0.0
We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.
Facts about the development
Development started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's look at the numbers. In the last seven months, 30 contributors have created 338 commits. Users have opened 78 issues, 74 of which were resolved within a few days. Contributors have opened 59 pull requests, and all but one of them have been merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped development in other ways. We would not be here without you.
You might ask why we are releasing it now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers, both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features?
- Rate limiting: Controlling the number of requests sent to the scraped site can be crucial. We would not want to disrupt the service by overloading it with too many requests; that is bad for the site's operators and also for us, because the data we want to collect becomes inaccessible. The collector provided by Colly can therefore be configured to send only a limited number of requests in parallel.
- Request caching: Response caching is supported to relieve load on external services and decrease the number of outgoing requests.
- Configurable via environment variables: To avoid rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables, so you can modify its settings without a Go development environment.
- Proxies/proxy switchers: If the address of the scraper has to be hidden, proxies can make requests instead of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, collectors support proxy switchers, which distribute requests among multiple servers. Processing the collected pages still happens on the machine running the scrapers, but the network traffic is moved to different hosts.
- Storage backend and storage interface: During scraping, various data needs to be stored and sometimes shared. Colly provides a storage interface for accessing these objects. You can create your own storage and use it in your scraper by implementing the required interface. By default Colly keeps everything in memory; additional backend implementations are available for Redis and SQLite3.
- Request queue: Scraping pages asynchronously in parallel is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected; your collector's worker threads take these URLs and create requests.
- Goodies: The package named `extensions` provides multiple helpers for collectors. These are common functions implemented in advance, so you don't have to bloat your scraper code with general implementations. An example extension is `RandomUserAgent`, which generates a random User-Agent for every request. You can find the full list of goodies at https://godoc.org/github.com/gocolly/colly/extensions
- Debuggers: Debugging can be painful. Colly tries to ease the pain by providing debuggers to inspect your scraper. You can simply write debug messages to the console using `LogDebugger`. If you prefer web interfaces, we've got you covered: Colly comes with a web debugger, which you can use by initializing a `WebDebugger`. See how debuggers can be used at https://godoc.org/github.com/gocolly/colly/debug
We, the team behind Colly, believe that it has become a stable and mature scraping framework capable of supporting complex use cases, and we are hoping for an even more productive future. Last but not least, thank you for your support and contributions.