facebook crawler and youtube api: part 1

in programming on 07 October 2012

Summary

I created a small app, implemented in Python and PHP, that crawls an authenticated Facebook page for YouTube videos and adds them to a specified user's YouTube playlist. It is useful for groups that post a lot of YouTube videos and want a centralized playlist to share with others. I'll go over some of the implementation details in this post.

The application discussed here can be found at the FacebookToYoutube github repository. Special thanks to the Google API docs and others who helped provide skeleton code in some areas.

We have a group called Club Coat Check on Facebook where we post links to music we've found around the web. Most of the time that takes the form of a youtube video. I wanted the ability to scan the group periodically, find youtube links and add them automatically to a playlist. This was also a great opportunity to delve further into the world of APIs and frameworks. I'll outline the project and provide links to useful downloads to get things working.

The first part of the equation is logging into Facebook automatically and navigating to a webpage that needs an authenticated user to access. I chose to use Python and mechanize. Mechanize functions as a GUI-less web browser and has no JavaScript support. The lack of JavaScript is troublesome on AJAX-heavy Facebook, but I realized that using the mobile site would work: same content, without AJAX (to reduce load on phones). Below is a sample of the code. The general flow is thus: set up the browser, find the form on the page, fill out the form, and submit. Then navigate to your chosen Facebook page and download its HTML source.

Python
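import mechanize

#LOGIN_URL, CRAWL_URL, USERNAME, PASSWORD and sepline() are defined elsewhere in the script
def getFacebookPage():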
        #Opens secure connection to facebook and downloads page
 
        #Setup mechanize browser
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US)')]
 
        #Connect to login URL
        browser.open(LOGIN_URL)
        print 'Login page: connected!'
 
        #Select the form, nr=0 means to use the first form you find
        #Else enter the name='form ID name'
        browser.select_form(nr=0)# name="login_form"
 
        #Fill the username and password in the form
        #Might have to check your particular webpage for actual name of input id
        browser['email'] = USERNAME
        browser['pass'] = PASSWORD
           
        #Clicks the submit button for a particular form
        result = browser.submit()
        print 'Login page: accessing...'
 
        #Open URL to download
        result = browser.open(CRAWL_URL)
        print 'Downloaded source:',CRAWL_URL
 
        #Logout of our session to prevent problems...
        for link in browser.links(url_regex="/logout"):
                print 'Following:',link.url
                #follow_link() both builds and sends the request, so click_link() is not needed
                req = browser.follow_link(url=link.url)
                if 'Facebook' in req.get_data():
                        print 'Logged out!'
                        sepline()
 
        return result.get_data()

Then crawl through the webpage using HTMLParser. See the git repository for details, but the essential part is the regexp to pull out the youtube video ID from the hrefs.

Python
import re
from HTMLParser import HTMLParser  #Python 2 module name (html.parser in Python 3)

class FacebookHTMLParser(HTMLParser):
        #Subclass that overrides the HTMLParser handler methods
        def handle_starttag(self, tag, attrs):
                for item in attrs:
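                        #Attribute values can be None (bare attributes), so check before searching
                        if item[1] and '%2Fwatch%3Fv%3' in item[1]:
                                #Regular expression for youtube URLs in raw html
                                go = re.search('v%3D[0-9A-Za-z_-]*(&|%26)',item[1])
                                #Skip links where the ID is not followed by & or %26
                                if go is None:
                                        continue
                                youtubeURL = go.group()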
                                
                                #Remove characters, get only base video ID
                                youtubeURL = youtubeURL.replace('v%3D','')
                                youtubeURL = youtubeURL.replace('&','')
                                youtubeURL = youtubeURL.replace('%26','')
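
The same extraction can be done by decoding the percent-encoded href first and then capturing the ID with a single regex group, which avoids the chain of replace() calls. What follows is a minimal Python 3 sketch, not the code from the repository: the function name extract_video_id and the sample href are hypothetical, and it assumes video IDs contain only letters, digits, underscores and hyphens, as the regex above does.

```python
import re
from urllib.parse import unquote

# Matches the v= parameter of a decoded YouTube watch URL.
# Assumption: IDs use only letters, digits, '_' and '-'.
VIDEO_ID_RE = re.compile(r'youtube\.com/watch\?v=([0-9A-Za-z_-]+)')


def extract_video_id(href):
    """Return the YouTube video ID embedded in href, or None if there is none."""
    if not href:
        return None
    # Turn %2F, %3F, %3D, %26 ... back into /, ?, =, & before matching.
    decoded = unquote(href)
    match = VIDEO_ID_RE.search(decoded)
    return match.group(1) if match else None


# Hypothetical redirect-style link, shaped like the encoded hrefs described above.
sample_href = ('http://m.facebook.com/l.php?u=http%3A%2F%2Fwww.youtube.com'
               '%2Fwatch%3Fv%3Dabc123XYZ_-%26feature%3Dshare&h=xyz')
video_id = extract_video_id(sample_href)
```

Because the ID is taken from a capture group, it does not matter whether it is followed by an ampersand, an encoded ampersand, or the end of the string.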

You can then download the webpage for each video and grab the correct title.

Python
browser = mechanize.Browser()
browser.open('http://www.youtube.com/watch?v='+youtubeURL)
youtubeTitle = browser.title()
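
If the page HTML is already in hand as a string, the title can also be pulled out without a browser object. This is a small Python 3 sketch using the standard library's HTMLParser; the class name TitleParser, the helper get_title and the sample HTML are made up for illustration and are not part of the original app.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def get_title(html):
    """Return the stripped <title> text of an HTML string ('' if missing)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical page source standing in for a downloaded video page.
sample_html = ('<html><head><title> Some Song - YouTube </title></head>'
               '<body><p>player</p></body></html>')
page_title = get_title(sample_html)
```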

The last part is sending this list to a PHP script. Thanks to TheBestJohn for a good starting point with this.

Python
def sendDataToURL(inputData):
        #Import urllib libraries
        import urllib2, urllib
 
        #List of POST variables to pass, structure: [(varname,value)]
        dataToSend=[('youtubeURL',inputData[0]), ('youtubeTitles',inputData[1]),('youtubeID',inputData[2])]  
 
        #Convert data
        dataToSend=urllib.urlencode(dataToSend)
 
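        #Build the POST request; path (the URL of the PHP script) is assumed to be defined elsewhere
        req=urllib2.Request(path, dataToSend)

        #Send the data; Request() alone only constructs the request
        urllib2.urlopen(req)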

To connect to Youtube and add videos to a user's playlist, I borrowed heavily from the Google PHP developer webpage and modified from there. First you need to get a developer key and download Zend.

From the python code we have passed several POST variables to PHP. Because the python list is passed as a string, we need to parse it.

PHP
public static function pythonStringToPHPArray($input){
        $inputClean = str_replace(' u', '', $input);
        $inputClean = str_replace('[u', '', $inputClean);
        $inputClean = str_replace(']', '', $inputClean);
        // $facebookDataClean = str_replace('[', '', $facebookDataClean);
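        // After the replacements above the items are comma-separated, so split on commas
        // (the original pattern was garbled; the comma delimiter is an assumption)
        $inputArray = preg_split('/,/', $inputClean);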
        return $inputArray;
}
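
To see why this cleanup is needed, it helps to look at what the PHP side actually receives. The Python 3 sketch below (with made-up IDs) shows that urlencode turns a list value into its printed form; under Python 2 with unicode strings, each item would presumably also carry the u prefix that the function above strips. It then shows one possible alternative that the original app does not use: JSON-encoding the list before sending it.

```python
import json
from urllib.parse import urlencode, parse_qs

video_ids = ['abc123', 'def456']  # hypothetical IDs for illustration

# Without doseq=True, urlencode() calls str() on the list, so the receiver
# gets the printed form of the list rather than separate values.
body = urlencode([('youtubeID', video_ids)])
received = parse_qs(body)['youtubeID'][0]

# Alternative (an assumption, not the original approach): send JSON instead.
json_body = urlencode([('youtubeID', json.dumps(video_ids))])
json_received = json.loads(parse_qs(json_body)['youtubeID'][0])
```

With the JSON variant, the PHP script could call json_decode on the POST value instead of stripping brackets and quotes by hand.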
        

Now we can finally process the data. See youtube.php for the actual functions, but the code below provides a nice overview. We get the Python lists, convert them to arrays, set up our authenticated YouTube connection, make a playlist if it doesn't already exist, check for duplicates between the playlist and the recently posted videos, and finally insert each new video into the playlist.

PHP
#Retrieve data from facebook_parser.py, no filtering at the moment
$youtubeURL = $_POST['youtubeURL'];
 
#Convert python strings into PHP arrays
$youtubeURLArray = model::pythonStringToPHPArray($youtubeURL);
 
#Remove duplicates from the arrays
$youtubeURLArray = array_unique($youtubeURLArray);
 
#Setup youtube API connection
$yt = youtube::setYoutubeConnection();
 
#Name of playlist to modify
$playlistNameTitle = 'Club Coat Check 1';
$playlistNameDescription = 'A dance sensation.';
 
#Create new playlist, does nothing if playlist already exists
youtube::createNewYoutubePlaylist($yt, $playlistNameTitle, $playlistNameDescription);
$playlistToModify = youtube::getPlaylistObj($yt,$playlistNameTitle);
 
#Get List of video IDs
$playlistVideoIDs = youtube::getVideosInPlaylist($yt,$playlistToModify);
print_r($playlistVideoIDs);
 
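#Cycle through each ID and add to playlist
#$youtubeIDArray is assumed to be built from $_POST['youtubeID'] the same way as $youtubeURLArray above
foreach ($youtubeIDArray as $ID) {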
        #Remove quotations from string to make proper URI
        $ID = str_replace('\'','',$ID);
        echo $ID.'<br>';
 
        #Skip video if in array
        if(in_array($ID, $playlistVideoIDs)){
                echo 'Skipped: '.$ID.'<br>';
                continue;
        }
 
        #Add each video to the specified playlist
        youtube::addVideoToPlaylist($yt,$playlistToModify,$ID);
}       
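
The duplicate handling in the loop above boils down to a small filtering step, which can be sketched on its own in Python 3. The function name videos_to_add and the sample IDs are hypothetical. Unlike the PHP overview, this sketch also drops repeats within the posted list itself, which is an added assumption about the desired behavior.

```python
def videos_to_add(posted_ids, playlist_ids):
    """Return posted IDs that are not yet in the playlist, in first-seen order."""
    seen = set(playlist_ids)
    result = []
    for raw in posted_ids:
        # Remove stray whitespace and the quotes left over from the Python string.
        video_id = raw.strip().strip("'")
        if not video_id or video_id in seen:
            continue  # empty, already in the playlist, or repeated in the post list
        seen.add(video_id)
        result.append(video_id)
    return result


# Hypothetical input: quoted IDs from the crawler and one ID already in the playlist.
new_ids = videos_to_add(["'a1'", " 'b2'", "'a1'", "c3"], ['b2'])
```

Using a set for the membership check keeps each lookup cheap even when the playlist grows large.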

That is the basic outline of how to download HTML from Facebook after authentication and then add videos to a user's YouTube playlist. There are several features that I still need to add, namely a full OOP implementation of the Python and PHP functions along with the ability to store the YouTube IDs, etc. in a MySQL database. The youtube class I made in PHP should be a useful wrapper for others looking for a start to interfacing with the YouTube API.

Note: Uncomment extension=php_openssl.dll in your php.ini file to remove the following error: Unable to find the socket transport "ssl" – did you forget to enable it when you configured PHP?

-biafra

bahanonu [at] alum.mit.edu


Biafra Ahanonu, PhD
