site scraping for a server behind cloudflare

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-16, 16:11

The other day I was looking to extract a multi-page table from a site protected by Cloudflare.
No amount of juggling cookies and user-agent strings convinced wget to do the job; all I got back were 403 Forbidden responses.

Pale Moon has no extensions for navigating through pages and extracting their content, but other browsers do.
So I tried to solve the task with various Chrome extensions: some could extract the information I was after but couldn't advance to the next page,
some jumped to the last page instead of the next, and some were paid extensions that still didn't work properly in demo mode.

All I wanted was a way to instruct a browser (any browser with a UI) to open a URL and save that page automatically, and I couldn't find an easy way.
Apparently you can do this if you launch Chrome in headless mode, install chromium-driver, and then run a Selenium server to script it.
Selenium depends on Node.js, which means a few hundred Node.js packages to install. That was not the way to go for me.
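(For what it's worth, the lightest version of that headless route skips Selenium and Node.js entirely: recent Chromium can dump the rendered DOM by itself. A rough dry-run sketch, with a placeholder URL and page count; Cloudflare may well serve a headless client the same 403.)

```shell
#!/bin/sh
# Sketch: fetch each page of a paginated listing with headless Chromium's
# --dump-dom, which prints the rendered DOM to stdout. Printed here as a
# dry run; pipe the output to `sh` to actually execute the commands.
# The base URL and page count are placeholders.

base='https://example.com/search'   # hypothetical target
pages=3

n=1
while [ "$n" -le "$pages" ]; do
    printf 'chromium --headless --dump-dom "%s?page=%s" > page%s.html\n' \
        "$base" "$n" "$n"
    n=$((n + 1))
done
```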

In the end I wrote a Perl script which uses xdotool, a tool that can mimic key presses and mouse movements on an Xorg server (Linux only).
This worked surprisingly well, and I could also process the HTML in the same script using the HTML::TableExtract Perl module, then save the data as CSV.
The only drawback is that you have to leave the computer alone until it finishes the scraping. The Perl script is attached in case anyone wants to use it as a template.

If you know a better way to remotely control a browser (in particular, to open a URL and save the page), please let me know.
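Since the attachment may not be visible to everyone, here is a minimal shell sketch of the same idea: drive the focused browser window with xdotool and save each page in turn. The ?page=N URL pattern, the delays, and the save-dialog keystrokes are all assumptions to adjust for the real site; the attached script does the same in Perl and also parses the tables.

```shell
#!/bin/sh
# Minimal sketch (not the attached script): drive an already-focused browser
# window with xdotool and save each page of a paginated listing.
# Assumptions: X11 with xdotool installed, the browser window has focus,
# and pages follow a simple ?page=N pattern.

page_url() {                        # URL of page $2 under base $1 (site-specific)
    printf '%s?page=%s' "$1" "$2"
}

save_page() {                       # open $1 in the focused browser, save as $2
    xdotool key ctrl+l              # focus the address bar
    xdotool type --delay 50 "$1"    # type the URL
    xdotool key Return
    sleep 8                         # let the page (and any Cloudflare check) settle
    xdotool key ctrl+s              # open the Save Page dialog
    sleep 1
    xdotool type --delay 50 "$2"    # file name for the saved page
    xdotool key Return
    sleep 2
}

if [ "${1:-}" = run ]; then         # only drive the browser when asked to
    base=$2 pages=$3 n=1
    while [ "$n" -le "$pages" ]; do
        save_page "$(page_url "$base" "$n")" "page$n.html"
        n=$((n + 1))
    done
fi
```

With the target browser window focused, something like `./scrape.sh run 'https://example.com/search' 40` would walk 40 pages; the saved HTML files can then be fed to HTML::TableExtract or any other parser.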

Kris_88
Keeps coming back
Posts: 940
Joined: 2021-01-26, 11:18

Re: site scraping for a server behind cloudflare

Post by Kris_88 » 2023-11-16, 17:42

It's probably possible to write a bookmarklet that loads many pages via XHR and dumps them all into a file.

deChat
Hobby Astronomer
Posts: 15
Joined: 2023-09-18, 04:13

Re: site scraping for a server behind cloudflare

Post by deChat » 2023-11-17, 11:52

zelea2 wrote:
2023-11-16, 16:11
In the end I wrote a Perl script which uses xdotool, a tool that can mimic key presses and mouse movements on an Xorg server (Linux only).
This worked surprisingly well, and I could also process the HTML in the same script using the HTML::TableExtract Perl module, then save the data as CSV.
The only drawback is that you have to leave the computer alone until it finishes the scraping. The Perl script is attached in case anyone wants to use it as a template.
Just curious, was there a reason you chose Perl over, say, Python, for this task?

I haven't written anything in Perl for over a decade (and frankly, I wasn't very good at it), but you now have me intrigued and considering trying it again if it has good tools for scraping sites.

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

Re: site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-17, 12:09

deChat wrote:
2023-11-17, 11:52
Just curious, was there a reason you chose Perl over, say, Python, for this task?
Personal preference; I learned Perl before Python, and I still like Perl's regexes and hashes better. I also think Perl's ecosystem is much more stable and mature.

Any scripting language will do, because its only job is to call xdotool and parse an HTML file.

andyprough
Keeps coming back
Posts: 752
Joined: 2020-05-31, 04:33

Re: site scraping for a server behind cloudflare

Post by andyprough » 2023-11-17, 15:08

zelea2 wrote:
2023-11-16, 16:11
All I wanted was a way to instruct a browser (any browser with a UI) to open a URL and save that page automatically, and I couldn't find an easy way.
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/

That's exactly the kind of task it's made for, and it works brilliantly. I'm not sure whether it would work on your Cloudflare-protected site, though. It hasn't failed me yet, but it's possible that Cloudflare has some extreme anti-scraping technology at work.

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

Re: site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-23, 16:48

andyprough wrote:
2023-11-17, 15:08
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/
That extension works on the current page only; you cannot browse through hundreds of search pages automatically.

andyprough
Keeps coming back
Posts: 752
Joined: 2020-05-31, 04:33

Re: site scraping for a server behind cloudflare

Post by andyprough » 2023-11-24, 04:46

zelea2 wrote:
2023-11-23, 16:48
andyprough wrote:
2023-11-17, 15:08
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/
That extension works on the current page only; you cannot browse through hundreds of search pages automatically.
Go to Addons → Scrapbook X → Preferences → Organize tab → Default Save Settings Customize → Depth To Follow Links. Set it to whatever depth you like, and Scrapbook X will copy the pages from that many links deep in your search page.