facebook crawler and youtube api: part 1

in programming on 07 October 2012

Summary

I created a small app, implemented in python and PHP, that crawls a Facebook page requiring authentication for Youtube videos and adds them to a specified user's Youtube playlist. It is useful for groups that post a lot of youtube videos and want a centralized playlist to share with others. I'll go over some of the implementation details in this post.

The application discussed here can be found at the FacebookToYoutube github repository. Special thanks to the Google API docs and others who helped provide skeleton code in some areas.

We have a group called Club Coat Check on Facebook where we post links to music we've found around the web. Most of the time that takes the form of a youtube video. I wanted the ability to scan the group periodically, find youtube links and add them automatically to a playlist. This was also a great opportunity to delve further into the world of APIs and frameworks. I'll outline the project and provide links to useful downloads to get things working.

The first part of the equation is logging into Facebook automatically and navigating to a webpage that needs an authenticated user to access. I chose to use python and mechanize. Mechanize functions as a GUI-less web browser and has no javascript support. The lack of javascript is troublesome on AJAX-heavy Facebook, but I realized that using the mobile site would work: same content, without AJAX (to reduce load on phones). Below is a sample of the code. The general flow is: set up the browser, find the form on the page, fill out the form, and submit. Then navigate to your chosen Facebook page and download its HTML source.

Python
import mechanize

#LOGIN_URL, CRAWL_URL, USERNAME and PASSWORD are settings defined elsewhere
def getFacebookPage():
        #Opens secure connection to facebook and downloads page
 
        #Setup mechanize browser
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US)')]
 
        #Connect to login URL
        browser.open(LOGIN_URL)
        print 'Login page: connected!'
 
        #Select the form, nr=0 means to use the first form you find
        #Else enter the name='form ID name'
        browser.select_form(nr=0)# name="login_form"
 
        #Fill the username and password in the form
        #Might have to check your particular webpage for actual name of input id
        browser['email'] = USERNAME
        browser['pass'] = PASSWORD
           
        #Clicks the submit button for a particular form
        result = browser.submit()
        print 'Login page: accessing...'
 
        #Open URL to download
        result = browser.open(CRAWL_URL)
        print 'Downloaded source:',CRAWL_URL
 
        #Logout of our session to prevent problems...
        for link in browser.links(url_regex="/logout"):
                print 'Following:',link.url
                #follow_link finds the link and opens it, so a separate click_link call is not needed
                req = browser.follow_link(url=link.url)
                if 'Facebook' in req.get_data():
                        print 'Logged out!'
                        sepline()
 
        return result.get_data()

Then crawl through the webpage using HTMLParser. See the git repository for details, but the essential part is the regexp to pull out the youtube video ID from the hrefs.

Python
import re
from HTMLParser import HTMLParser #Python 2 module; in Python 3 it is html.parser

class FacebookHTMLParser(HTMLParser):
        #Subclass that overrides the HTMLParser handler methods
        def handle_starttag(self, tag, attrs):
                for item in attrs:
                        #Attribute values can be None, so check before searching
                        if item[1] and '%2Fwatch%3Fv%3' in item[1]:
                                #Regular expression for youtube URLs in raw html
                                go = re.search('v%3D([0-9A-Za-z_-]*)(&|%26)',item[1])
                                if go is None:
                                        continue
                                
                                #The first group holds only the base video ID
                                youtubeURL = go.group(1)
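
To make the regexp step concrete, here is a minimal standalone sketch in Python 3 that applies the same pattern to a single href. The function name `extract_video_id` and the sample href (including its `l.php?u=` redirect shape and the video ID) are made up for illustration and are not taken from the repository.

```python
import re

# Same character class as the parser above, with the ID in a capture group.
# The ID must be followed by '&' or its percent-encoded form '%26'.
VIDEO_ID_PATTERN = re.compile(r'v%3D([0-9A-Za-z_-]+)(?:&|%26)')

def extract_video_id(href):
    """Return the video ID found in a percent-encoded href, or None."""
    if not href:
        return None
    match = VIDEO_ID_PATTERN.search(href)
    return match.group(1) if match else None

# Hypothetical href, shaped like a percent-encoded redirect to a youtube link
sample = ('http://m.facebook.com/l.php?u=http%3A%2F%2Fwww.youtube.com'
          '%2Fwatch%3Fv%3DdQw4w9WgXcQ%26feature%3Dshare&h=abc')
print(extract_video_id(sample))  # prints dQw4w9WgXcQ
```

Like the original pattern, this only matches when something follows the ID, so a link that ends right after the ID would be skipped.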

You can then download the webpage for each video and grab the correct title.

Python
browser = mechanize.Browser()
browser.open('http://www.youtube.com/watch?v='+youtubeURL)
youtubeTitle = browser.title()

The last part is sending this list to a PHP script. Thanks to TheBestJohn for a good starting point with this.

Python
def sendDataToURL(inputData):
        #Import urllib libraries
        import urllib2, urllib
 
        #List of POST variables to pass, structure: [(varname,value)]
        dataToSend=[('youtubeURL',inputData[0]), ('youtubeTitles',inputData[1]),('youtubeID',inputData[2])]  
 
        #Convert data
        dataToSend=urllib.urlencode(dataToSend)
 
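        #Build the request; path is the URL of the PHP script, defined elsewhere
        req=urllib2.Request(path, dataToSend)
 
        #Send the data (supplying data makes this a POST) and return the response body
        return urllib2.urlopen(req).read()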
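
The function above targets Python 2, where `urllib2` exists. As a rough sketch of the same idea in Python 3, where those pieces live in `urllib.parse` and `urllib.request`, the request can be built and inspected without sending anything. The helper name `build_post_request`, the localhost URL, and the sample values are placeholders, not part of the original app.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_post_request(url, youtube_urls, youtube_titles, youtube_ids):
    """Build (but do not send) a POST request carrying the three lists."""
    fields = [
        ('youtubeURL', youtube_urls),
        ('youtubeTitles', youtube_titles),
        ('youtubeID', youtube_ids),
    ]
    # Without doseq=True, urlencode calls str() on each list, so the PHP side
    # receives the printed form of the list rather than separate values.
    body = urlencode(fields).encode('ascii')
    # Passing data makes urllib treat this as a POST request
    return Request(url, data=body)

# Placeholder URL and values, for illustration only
req = build_post_request('http://localhost/youtube.php', ['a'], ['t'], ['a'])
print(req.get_method())
print(req.data)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`.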

To connect to Youtube and add videos to a user's playlist, I borrowed heavily from the Google PHP developer webpage and modified from there. First you need to get a developer key and download Zend.

From the python code we have passed several POST variables to PHP. Because the python list is passed as a string, we need to parse it.

PHP
public static function pythonStringToPHPArray($input){
        $inputClean = str_replace(' u', '', $input);
        $inputClean = str_replace('[u', '', $inputClean);
        $inputClean = str_replace(']', '', $inputClean);
        // $facebookDataClean = str_replace('[', '', $facebookDataClean);
        // Split the cleaned string on commas, one array entry per list element
        $inputArray = explode(',', $inputClean);
        return $inputArray;
}
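
To see what this clean-up does to the string, here is a small Python sketch that mirrors the same replacement steps and then splits on commas. The function name and the sample IDs are invented for illustration. It assumes the incoming string is the printed form of a Python 2 list of unicode strings, which is where the `u` prefixes come from.

```python
def python_list_string_to_ids(raw):
    """Mirror the PHP clean-up on a stringified Python 2 list of unicode strings."""
    cleaned = raw.replace(" u", "")      # drop the u prefix on later elements
    cleaned = cleaned.replace("[u", "")  # drop the opening bracket and first u prefix
    cleaned = cleaned.replace("]", "")   # drop the closing bracket
    # Split on commas, then strip the quotes that still wrap each element
    return [part.strip("'") for part in cleaned.split(",") if part]

# Made-up sample, shaped like str([u'...', u'...']) in Python 2
sample = "[u'dQw4w9WgXcQ', u'abc_DEF-123']"
print(python_list_string_to_ids(sample))  # prints ['dQw4w9WgXcQ', 'abc_DEF-123']
```

In the PHP code the quotes are stripped later, inside the loop that adds videos. This approach is fine for video IDs, which contain no spaces or commas, but a title containing a comma or the sequence " u" would be mangled. Encoding the lists as JSON on the python side would be one way around that.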
        

Now we can finally process the data. See youtube.php for the actual functions, but the code below provides a nice overview. We get the python lists, convert them to arrays, set up our authenticated Youtube connection, make a playlist if it doesn't already exist, check for duplicates between the playlist and the recently posted videos, and finally insert each video into the playlist.

PHP
#Retrieve data from facebook_parser.py, no filtering at the moment
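$youtubeURL = $_POST['youtubeURL'];
$youtubeID = $_POST['youtubeID'];
 
#Convert python strings into PHP arrays
$youtubeURLArray = model::pythonStringToPHPArray($youtubeURL);
$youtubeIDArray = model::pythonStringToPHPArray($youtubeID);
 
#Remove duplicates from the arrays
$youtubeURLArray = array_unique($youtubeURLArray);
$youtubeIDArray = array_unique($youtubeIDArray);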
 
#Setup youtube API connection
$yt = youtube::setYoutubeConnection();
 
#Name of playlist to modify
$playlistNameTitle = 'Club Coat Check 1';
$playlistNameDescription = 'A dance sensation.';
 
#Create new playlist, does nothing if playlist already exists
youtube::createNewYoutubePlaylist($yt, $playlistNameTitle, $playlistNameDescription);
$playlistToModify = youtube::getPlaylistObj($yt,$playlistNameTitle);
 
#Get List of video IDs
$playlistVideoIDs = youtube::getVideosInPlaylist($yt,$playlistToModify);
print_r($playlistVideoIDs);
 
#Cycle through each ID and add to playlist
foreach ($youtubeIDArray as $ID) {
        #Remove quotations from string to make proper URI
        $ID = str_replace('\'','',$ID);
        echo $ID."\n";
 
        #Skip video if in array
        if(in_array($ID, $playlistVideoIDs)){
                echo 'Skipped: '.$ID."<br>\n";
                continue;
        }
 
        #Add each video to the specified playlist
        youtube::addVideoToPlaylist($yt,$playlistToModify,$ID);
}       
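
The duplicate handling in that loop can be sketched on its own, without any Youtube connection. The Python below is only an illustration of the logic (dedupe the posted IDs, then skip anything already in the playlist). The function name `videos_to_add` and the sample IDs are hypothetical.

```python
def videos_to_add(posted_ids, playlist_ids):
    """Return posted IDs that are new to the playlist, in order, without repeats."""
    already_present = set(playlist_ids)
    seen = set()
    to_add = []
    for raw_id in posted_ids:
        # Remove surrounding whitespace and quotes, as the PHP loop does for quotes
        video_id = raw_id.strip().strip("'")
        if not video_id or video_id in already_present or video_id in seen:
            continue
        seen.add(video_id)
        to_add.append(video_id)
    return to_add

# Made-up IDs: 'aaa' is posted twice and 'bbb' is already in the playlist
posted = ["'aaa'", "'bbb'", "'aaa'", "'ccc'"]
in_playlist = ['bbb']
print(videos_to_add(posted, in_playlist))  # prints ['aaa', 'ccc']
```

Using sets keeps each membership check cheap, while the list preserves the order in which videos were posted.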

That is the basic outline of how to download HTML from Facebook after authentication and then add videos to a user's Youtube playlist. There are several features that I still need to add, namely a full OOP implementation of the python and PHP functions along with the ability to store the Youtube IDs, etc. in a MySQL database. The youtube class I made in PHP should be a useful wrapper for others looking for a start on interfacing with the youtube api.

Note: Uncomment extension=php_openssl.dll in your php.ini file to remove the following error: Unable to find the socket transport "ssl" – did you forget to enable it when you configured PHP?

-biafra

bahanonu [at] alum.mit.edu

Biafra Ahanonu, PhD
