Wednesday, September 10, 2014

Extract a document from an online reader

I'm glad i'm not going through first-year undergrad general studies again.
The horseshit and scams that surround these programs never cease to amaze me.

Someone showed me the online reader they had to use for their classes.  Though i'm sure there are others that are equally shitty, this Brytewave digibook reader from Follett was unusable.  It was extremely slow, and navigation was difficult because it tended to send the user to random pages instead of the ones specified.  It was cram time at the end of the semester, and the book she was trying to study kept closing and jumping to random pages.  That's how e-book technology has revolutionized education and brought us to a bright new era of virtual learning!  Fuck the rent-seeking charlatans and the government money they rode in on.

That's the extent of the rant.  How to get the fucking shitty book as a pdf:
As you can guess, i wrote a script to capture the document.  Actually, i wrote two scripts.  The first script uses xdotool and scrot (a fast and script-friendly screenshot tool) to click through the document and capture all the image data.  It helps to have a large monitor; otherwise, one can zoom in and take multiple shots per page.
#!/bin/bash
# click through drm'd digibooks in brytewave reader and copy them as screenshots

clickwait=1    # seconds to wait before clicking the next-page button
pagewait=20    # seconds to wait for next page to load
startpage=1    # useful in case of crash (it happens)
endpage=400    # the last reader page
dumppath="/home/personface/pileofshittybooks"
docname="governmentbook"

# focus the reader window so the clicks land on it
xdotool search --screen 0 --name "BryteWave" windowfocus
page=$startpage
while [ $page -le $endpage ]; do
    scrot "$dumppath/$docname-$page.jpg"

    # use a screenshot to find the button coordinates (or the xdotool one-liner after this script)
    sleep $clickwait; xdotool mousemove 1907 630
    sleep 0.1; xdotool click 1
    sleep $pagewait
    
    page=$((page+1))
done
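
If you'd rather not pull the button coordinates off a screenshot, you can also park the mouse over the next-page button and ask xdotool where the pointer is.  A rough one-liner (the sleep just buys a few seconds to get the mouse into position); the x and y it prints are what go into the mousemove line above:
sleep 5; xdotool getmouselocation    # prints something like: x:1907 y:630 screen:0 window:12345678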

The second script cuts the two pages out of each screenshot and compiles a minimal pdf from the images.
#!/bin/bash
# disassemble the screenshots made by bryteclicker (the script above) and compile them into a pdf

docprefix="governmentbook"
pdfname="Fascist_Propaganda_by_CKSucker_10ed"

inext="jpg"
outext="jpg"
outqual=50 
endonduplicate=1

page=1
lastsize=0
for infile in $(ls -1tr $docprefix*.$inext); do
    # check for consecutive duplicates since screengrabber cannot verify page loads
    # if flag is set, assume duplicates indicate screengrabber is stalled on last page of document
    # useful when screengrabber doesn't know exact document size and is set with excess pagecount
    thissize=$(stat -c %s "$infile")
    if [ "$thissize" -eq "$lastsize" ]; then
        echo "$infile may be a duplicate of the previous file!"
        if [ $endonduplicate == 1 ]; then
            echo "i'm going to assume this is the end of the document"
            break;
        fi
    fi
    lastsize=$thissize

    # crop pages from screenshot
    # use GIMP to get coordinates
    convert -crop 670x1005+282+122 -quality $outqual $infile "output-$(printf "%04d" $page)-A.$outext"
    convert -crop 670x1005+968+122 -quality $outqual $infile "output-$(printf "%04d" $page)-B.$outext"
    page=$((page+1))
done

convert output*.$outext $pdfname.pdf
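
Before letting the script chew through a few hundred screenshots, it's worth checking the crop boxes against a single file.  A quick sketch, assuming the first screenshot from the capture script (governmentbook-1.jpg) is sitting in the current directory:
convert -crop 670x1005+282+122 governmentbook-1.jpg /tmp/croptest-A.jpg
convert -crop 670x1005+968+122 governmentbook-1.jpg /tmp/croptest-B.jpg
display /tmp/croptest-A.jpg /tmp/croptest-B.jpg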

As i mentioned, one can get better image quality by making each page span more than one screenshot, but this complicates post-processing a tad.  The quality settings in the second script can also be tweaked for better output.  My settings were a tradeoff between readability and file size.
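
A quick sketch for finding your own sweet spot: crop the same page at a few different jpeg qualities and compare the resulting file sizes (again assuming that first screenshot is handy):
for q in 30 50 75; do
    convert -crop 670x1005+282+122 -quality $q governmentbook-1.jpg "qualtest-$q.jpg"
done
ls -lh qualtest-*.jpg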
