Saturday, April 15, 2017

A bit of script for downloading images from a website

So I wanted to download only the images from one website, for archival purposes. After looking at the readily available tools, I couldn't find anything suitable for my purpose. So I decided to throw together a little shell script.
It went on as follows:
The pages have an incremental index, so I can use a while loop to fetch all the pages one by one. Done.
Need to fetch each page. Wget is fine. Done.
Then I need to find the image URL in the retrieved HTML. Hmm. A bit of grep with cut does that. Done.
Next, get the actual image. Again, build the image URL and wget it. Done.
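The grep-with-cut step can also be done in a single pass. Here's a minimal sketch; the HTML snippet below is made up for illustration, assuming the page marks the photo with a `class="main-photo"` attribute followed by a `src` attribute:

```shell
# Hypothetical page snippet; the real site's markup is an assumption here.
html='<img class="main-photo" src="/resized/img_0012345678.jpg">'

# Match the attribute pair, then split on quotes to pull out the src value,
# instead of doing byte-offset arithmetic on the flattened page.
img_path=$(echo "$html" | grep -o 'class="main-photo" src="[^"]*"' | cut -d '"' -f4)

echo "$img_path"
```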

Error handling? Gaah. None, since this is nothing critical anyway.

After testing it on a few pages, it was golden.

So I ran it with a counter loop of a hundred images at a time. That's because the script is kind of slow, since I don't know (nor care to learn) how to do multithreading in a shell script. Gaah.
Plus, as I later found out, the hundred-page batches were a good idea: wget hung when the network went down for a moment, and I had to restart the script from that page count.
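The batching idea can be sketched like this; the start index matches the script below, and the batch size is a placeholder:

```shell
# Process pages in fixed-size batches so a crash or hang only costs the
# current batch; to resume, restart with start set to the last printed count.
start=11000
batch=100
end=$((start - batch))

count=$start
pages=0
while [ "$count" -gt "$end" ]; do
    # ... fetch page $count and its image here ...
    pages=$((pages + 1))
    count=$((count - 1))
done
echo "Batch done: $pages pages; next batch starts at $count"
```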

And somehow an https proxy is set on my laptop. Why, God only knows. Below is the script.

###########################
#!/bin/bash

# An https proxy is somehow set on this laptop; clear it so wget
# connects directly.
export https_proxy=""

count=11000

while [ "$count" -gt 10800 ]
do
    echo "Downloading page $count..."

    # Fetch the page HTML to stdout (-O -) and capture it.
    content=$(wget "https://www.mygallery.com/photo/$count" -q -O -)
    #echo "$content"
    echo "Page downloaded!"

    # $content is left unquoted on purpose: echo flattens the HTML to a
    # single line, so grep -b reports one usable byte offset for the
    # main-photo marker.
    line=$(echo $content | grep -b -o "class=\"main-photo\"")
    offset=$(echo "$line" | cut -d : -f1)
    ((offset1 = offset + 50))

    # The image filename sits within ~50 bytes after the marker; keep the
    # last 18 characters of that window and build the image URL.
    img="https://www.mygallery.com/resized/"$(echo $content | cut -c$offset-$offset1 | tail -c18)

    echo "Getting Image: $img..."
    wget "$img" -q

    ((count = count - 1))
    #echo "count: $count"
done
exit 0
###########################
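Since wget hung once when the connection dropped, a small retry wrapper would make the loop more resilient. This is a sketch, not something the original run used; inside the loop, `wget -q "$img"` would replace the placeholder:

```shell
# Retry a command up to $1 times, sleeping a second between attempts.
retry() {
    tries="$1"; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge "$tries" ] && return 1
        sleep 1
    done
    return 0
}

# Usage inside the download loop would look like:
#   retry 3 wget -q "$img" || echo "Giving up on $img" >&2
```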

