site scraping for a server behind cloudflare

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-16, 16:11

The other day I was looking to extract a multi-page table from a site protected by Cloudflare.
No amount of juggling cookies and user-agent strings convinced wget to do the job; all I got back were 403 Forbidden responses.

Pale Moon has no extensions for navigating through pages and extracting their content, but other browsers do.
So I tried to solve the task with various Chrome extensions: some could extract the information I was after but couldn't advance to the next page,
some jumped to the last page instead of the next, and some were paid extensions that still didn't work properly in demo mode.

All I wanted was a way to instruct a browser (any browser with a UI) to open a URL and save that page automatically, and I couldn't find an easy way.
Apparently you can do this if you launch Chrome in headless mode, install chromium-driver, and then run a Selenium server to script it.
Selenium depends on Node.js, which means a few hundred Node.js packages to install. That was not the way to go for me.
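(For what it's worth, the lightest version of that headless route skips Selenium and Node.js entirely: recent Chromium can dump the rendered DOM by itself. A rough dry-run sketch, with a placeholder URL and page count; Cloudflare may well serve a headless client the same 403.)

```shell
#!/bin/sh
# Sketch: fetch each page of a paginated listing with headless Chromium's
# --dump-dom, which prints the rendered DOM to stdout. Printed here as a
# dry run; pipe the output to `sh` to actually execute the commands.
# The base URL and page count are placeholders.

base='https://example.com/search'   # hypothetical target
pages=3

n=1
while [ "$n" -le "$pages" ]; do
    printf 'chromium --headless --dump-dom "%s?page=%s" > page%s.html\n' \
        "$base" "$n" "$n"
    n=$((n + 1))
done
```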

In the end I wrote a Perl script which uses xdotool, a tool that can mimic key presses and mouse movements on an Xorg server (Linux only).
This worked surprisingly well, and I could also process the HTML in the same script using the HTML::TableExtract Perl module, then save the data as CSV.
The only drawback is that you have to leave the computer alone until it finishes the scraping. The Perl script is attached in case anyone wants to use it as a template.

If you know a better way to remotely control a browser (in particular, to open a URL and save the page), please let me know.
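Since the attachment may not be visible to everyone, here is a minimal shell sketch of the same idea: drive the focused browser window with xdotool and save each page in turn. The ?page=N URL pattern, the delays, and the save-dialog keystrokes are all assumptions to adjust for the real site; the attached script does the same in Perl and also parses the tables.

```shell
#!/bin/sh
# Minimal sketch (not the attached script): drive an already-focused browser
# window with xdotool and save each page of a paginated listing.
# Assumptions: X11 with xdotool installed, the browser window has focus,
# and pages follow a simple ?page=N pattern.

page_url() {                        # URL of page $2 under base $1 (site-specific)
    printf '%s?page=%s' "$1" "$2"
}

save_page() {                       # open $1 in the focused browser, save as $2
    xdotool key ctrl+l              # focus the address bar
    xdotool type --delay 50 "$1"    # type the URL
    xdotool key Return
    sleep 8                         # let the page (and any Cloudflare check) settle
    xdotool key ctrl+s              # open the Save Page dialog
    sleep 1
    xdotool type --delay 50 "$2"    # file name for the saved page
    xdotool key Return
    sleep 2
}

if [ "${1:-}" = run ]; then         # only drive the browser when asked to
    base=$2 pages=$3 n=1
    while [ "$n" -le "$pages" ]; do
        save_page "$(page_url "$base" "$n")" "page$n.html"
        n=$((n + 1))
    done
fi
```

With the target browser window focused, something like `./scrape.sh run 'https://example.com/search' 40` would walk 40 pages; the saved HTML files can then be fed to HTML::TableExtract or any other parser.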

Kris_88
Keeps coming back
Posts: 940
Joined: 2021-01-26, 11:18

Re: site scraping for a server behind cloudflare

Post by Kris_88 » 2023-11-16, 17:42

It's probably possible to write a bookmarklet that loads many pages via XHR and dumps them all into a file.

deChat
Hobby Astronomer
Posts: 15
Joined: 2023-09-18, 04:13

Re: site scraping for a server behind cloudflare

Post by deChat » 2023-11-17, 11:52

zelea2 wrote:
2023-11-16, 16:11
In the end I wrote a Perl script which uses xdotool, a tool that can mimic key presses and mouse movements on an Xorg server (Linux only).
This worked surprisingly well, and I could also process the HTML in the same script using the HTML::TableExtract Perl module, then save the data as CSV.
The only drawback is that you have to leave the computer alone until it finishes the scraping. The Perl script is attached in case anyone wants to use it as a template.
Just curious, was there a reason you chose Perl over, say, Python, for this task?

I haven't written anything in Perl for over a decade (and frankly, I wasn't very good at it), but you now have me intrigued and considering trying it again if it has good tools for scraping sites.

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

Re: site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-17, 12:09

deChat wrote:
2023-11-17, 11:52
Just curious, was there a reason you chose Perl over, say, Python, for this task?
Personal preference; I learned Perl before Python, and I still like Perl's regexes and hashes better. I also think Perl's ecosystem is much more stable and mature.

Any scripting language will do, because its only job is to call xdotool and parse an HTML file.

andyprough
Keeps coming back
Posts: 752
Joined: 2020-05-31, 04:33

Re: site scraping for a server behind cloudflare

Post by andyprough » 2023-11-17, 15:08

zelea2 wrote:
2023-11-16, 16:11
All I wanted was a way to instruct a browser (any browser with a UI) to open a URL and save that page automatically, and I couldn't find an easy way.
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/

That's exactly the kind of task it's made for, and it works brilliantly. I'm not sure whether it would work on your Cloudflare-protected site, though. It hasn't failed me yet, but it's possible that Cloudflare has some extreme anti-scraping technology at work.

zelea2
Apollo supporter
Posts: 34
Joined: 2019-02-02, 00:56
Location: UK

Re: site scraping for a server behind cloudflare

Post by zelea2 » 2023-11-23, 16:48

andyprough wrote:
2023-11-17, 15:08
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/
That extension works on the current page only; you cannot browse through hundreds of search pages automatically.

andyprough
Keeps coming back
Posts: 752
Joined: 2020-05-31, 04:33

Re: site scraping for a server behind cloudflare

Post by andyprough » 2023-11-24, 04:46

zelea2 wrote:
2023-11-23, 16:48
andyprough wrote:
2023-11-17, 15:08
Did you try Pale Moon's 'Scrap Book X' extension? https://addons.palemoon.org/addon/scrapbook-x/
That extension works on the current page only; you cannot browse through hundreds of search pages automatically.
Go to Addons → Scrapbook X → Preferences → Organize tab → Default Save Settings Customize → Depth To Follow Links. Set it to whatever depth you like, and Scrapbook X will copy the pages from that many links deep in your search page.