facebook crawler and youtube api: part 1


I created a small app, implemented in Python and PHP, that crawls an authenticated Facebook page for YouTube videos and adds them to a specified user's YouTube playlist. It is useful for groups that post a lot of YouTube videos and want a centralized playlist to share with others. I'll go over some of the implementation details in this post.

The application discussed here can be found at the FacebookToYoutube github repository. Special thanks to the Google API docs and others who helped provide skeleton code in some areas.

We have a group called Club Coat Check on Facebook where we post links to music we've found around the web. Most of the time that takes the form of a youtube video. I wanted the ability to scan the group periodically, find youtube links and add them automatically to a playlist. This was also a great opportunity to delve further into the world of APIs and frameworks. I'll outline the project and provide links to useful downloads to get things working.

The first part of the equation is logging into Facebook automatically and navigating to a webpage that needs an authenticated user to access. I chose to use Python and mechanize. Mechanize functions as a GUI-less web browser and has no JavaScript support. The lack of JavaScript is troublesome on AJAX-heavy Facebook, but I realized that using the mobile site would work: same content, without AJAX (to reduce load on phones). Below is a sample of the code. The general flow is: set up the browser, find the form on the page, fill out the form, and submit. Then navigate to your chosen Facebook page and download its HTML source.

    import mechanize

    def getFacebookPage():
        # Opens secure connection to Facebook and downloads page (Python 2)
        # LOGIN_URL, CRAWL_URL, USERNAME, PASSWORD and sepline() are defined elsewhere

        # Set up mechanize browser
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        browser.addheaders = [
            ('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US)')]

        # Connect to login URL
        browser.open(LOGIN_URL)
        print 'Login page: connected!'

        # Select the form, nr=0 means to use the first form you find
        # Else enter the name='form ID name'
        browser.select_form(nr=0)  # name="login_form"

        # Fill the username and password in the form
        # Might have to check your particular webpage for actual name of input id
        browser['email'] = USERNAME
        browser['pass'] = PASSWORD

        # Clicks the submit button for a particular form
        result = browser.submit()
        print 'Login page: accessing...'

        # Open URL to download
        result = browser.open(CRAWL_URL)
        print 'Downloaded source:', CRAWL_URL

        # Log out of our session to prevent problems...
        for link in browser.links(url_regex="/logout"):
            print 'Following:', link.url
            # follow_link() both builds and opens the request
            req = browser.follow_link(url=link.url)
            if 'Facebook' in req.get_data():
                print 'Logged out!'
                sepline()

        return result.get_data()

Then crawl through the webpage using HTMLParser. See the git repository for details, but the essential part is the regexp to pull out the youtube video ID from the hrefs.

    import re
    from HTMLParser import HTMLParser  # Python 2 module name

    class FacebookHTMLParser(HTMLParser):
        # Subclass that overrides the HTMLParser handler methods
        def handle_starttag(self, tag, attrs):
            for item in attrs:
                # Skip attributes that have no value (item[1] is None)
                if item[1] and '%2Fwatch%3Fv%3D' in item[1]:
                    # Regular expression for YouTube URLs in raw HTML
                    go = re.search('v%3D[0-9A-Za-z_-]*(&|%26)', item[1])
                    if go is None:
                        continue
                    youtubeURL = go.group()

                    # Remove characters, get only base video ID
                    youtubeURL = youtubeURL.replace('v%3D', '')
                    youtubeURL = youtubeURL.replace('&', '')
                    youtubeURL = youtubeURL.replace('%26', '')
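To try the ID extraction on its own, here is a minimal Python 3 sketch (the snippet above is Python 2). The class name `YoutubeLinkParser`, the `video_ids` attribute, and the sample markup are all made up for illustration, and the pattern also accepts an ID at the very end of a string, which the original regexp does not.

```python
import re
from html.parser import HTMLParser

# Percent-encoded "v=<id>" parameter; the ID ends at "&", "%26", or end of string.
VIDEO_ID_RE = re.compile(r'v%3D([0-9A-Za-z_-]+)(?:&|%26|$)')


class YoutubeLinkParser(HTMLParser):
    """Collects YouTube video IDs from percent-encoded URLs in tag attributes."""

    def __init__(self):
        super().__init__()
        self.video_ids = []

    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            # Bare attributes (e.g. <input disabled>) have a value of None.
            if not value or '%2Fwatch%3Fv%3D' not in value:
                continue
            match = VIDEO_ID_RE.search(value)
            if match and match.group(1) not in self.video_ids:
                self.video_ids.append(match.group(1))


# Hypothetical sample markup, not taken from a real Facebook page.
SAMPLE_HTML = (
    '<a href="/l.php?u=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DaB3dE5gH7jK%26feature%3Dshare">one</a>'
    '<a href="/l.php?u=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dabc_DEF-123&amp;h=xyz">two</a>'
    '<a href="/profile.php">no video here</a>'
)

parser = YoutubeLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.video_ids)
```

Using a capture group returns the bare ID directly, so the chain of `replace` calls is not needed in this version.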

You can then download the webpage for each video and grab the correct title.

    browser = mechanize.Browser()
    browser.open('http://www.youtube.com/watch?v=' + youtubeURL)
    youtubeTitle = browser.title()
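If you already have a page's HTML in hand, the title can also be pulled out without mechanize. This is a rough Python 3 sketch using only the standard library; `TitleParser`, `extract_title`, and the sample page are hypothetical and not part of the repository.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == 'title' and self.title is None:
            self._in_title = True
            self.title = ''

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False


def extract_title(html_source):
    """Return the stripped page title, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html_source)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


# Hypothetical page source standing in for a downloaded watch page.
sample = '<html><head><title>Some Song - YouTube</title></head><body></body></html>'
print(extract_title(sample))
```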

The last part is sending this list to a PHP script. Thanks to TheBestJohn for a good starting point with this.

    import urllib
    import urllib2

    def sendDataToURL(inputData):
        # path is the URL of the receiving PHP script, defined elsewhere
        # List of POST variables to pass, structure: [(varname, value)]
        dataToSend = [('youtubeURL', inputData[0]),
                      ('youtubeTitles', inputData[1]),
                      ('youtubeID', inputData[2])]

        # Convert data
        dataToSend = urllib.urlencode(dataToSend)

        # Send the data: Request() only builds the POST request,
        # urlopen() actually sends it
        req = urllib2.Request(path, dataToSend)
        urllib2.urlopen(req)
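It helps to see what actually goes over the wire. The following is a small Python 3 sketch with made-up sample data in the same shape as `inputData`; it encodes the fields and then decodes them again the way a server would, without making any network request.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical sample data in the same shape as inputData:
# [list of URLs, list of titles, list of IDs]
input_data = [
    ['http://www.youtube.com/watch?v=abc_DEF-123'],
    ['Some Song'],
    ['abc_DEF-123', 'aB3dE5gH7jK'],
]

fields = [
    ('youtubeURL', input_data[0]),
    ('youtubeTitles', input_data[1]),
    ('youtubeID', input_data[2]),
]

# urlencode calls str() on each value, so every list is sent as the
# text of its Python representation, brackets and quotes included.
body = urlencode(fields)

# Decode the body the way a server would, to see what arrives.
received = parse_qs(body)
print(received['youtubeID'][0])
```

Each list arrives as a single string. Under Python 2 with unicode strings, every item in that string additionally carries a `u` prefix, as in `[u'abc_DEF-123', u'aB3dE5gH7jK']`.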

To connect to Youtube and add videos to a user's playlist, I borrowed heavily from the Google PHP developer webpage and modified from there. First you need to get a developer key and download Zend.

From the python code we have passed several POST variables to PHP. Because the python list is passed as a string, we need to parse it.

    public static function pythonStringToPHPArray($input){
        // Strip the Python list syntax: " u" and "[u" prefixes, then the closing "]"
        $inputClean = str_replace(' u', '', $input);
        $inputClean = str_replace('[u', '', $inputClean);
        $inputClean = str_replace(']', '', $inputClean);
        // $facebookDataClean = str_replace('[', '', $facebookDataClean);
        // Split on the commas left between items
        $inputArray = explode(',', $inputClean);
        return $inputArray;
    }
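To follow what the cleanup does to a Python 2 list string, here is the same sequence of steps mirrored in Python 3. This is only an illustration: `python_string_to_list` is a hypothetical name, the sample string is made up, and it assumes the final step splits on commas.

```python
def python_string_to_list(raw):
    """Mirror the PHP cleanup: strip Python 2 list syntax, then split on commas."""
    cleaned = raw.replace(' u', '')      # drop the " u" prefix of every later item
    cleaned = cleaned.replace('[u', '')  # drop the opening bracket and first prefix
    cleaned = cleaned.replace(']', '')   # drop the closing bracket
    return cleaned.split(',')


# Hypothetical string, as Python 2 would render a list of unicode IDs.
sample = "[u'abc_DEF-123', u'aB3dE5gH7jK']"
print(python_string_to_list(sample))
```

The items keep their single quotes, which is why they have to be stripped again before the IDs are used. The replacements are also global, so this approach suits IDs better than titles: a title containing " u" or a comma would be altered or split.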

Now we can finally process the data. See youtube.php for the actual functions, but the code below provides a nice overview. We get the Python lists, convert them to arrays, set up our authenticated YouTube connection, make a playlist if it doesn't already exist, check for duplicates between the playlist and recently posted videos, and finally insert each video into the playlist.

    #Retrieve data from facebook_parser.py, no filtering at the moment
    $youtubeURL = $_POST['youtubeURL'];
    $youtubeID = $_POST['youtubeID'];

    #Convert python strings into PHP arrays
    $youtubeURLArray = model::pythonStringToPHPArray($youtubeURL);
    $youtubeIDArray = model::pythonStringToPHPArray($youtubeID);

    #Remove duplicates from the arrays
    $youtubeURLArray = array_unique($youtubeURLArray);
    $youtubeIDArray = array_unique($youtubeIDArray);

    #Setup youtube API connection
    $yt = youtube::setYoutubeConnection();

    #Name of playlist to modify
    $playlistNameTitle = 'Club Coat Check 1';
    $playlistNameDescription = 'A dance sensation.';

    #Create new playlist, does nothing if playlist already exists
    youtube::createNewYoutubePlaylist($yt, $playlistNameTitle, $playlistNameDescription);
    $playlistToModify = youtube::getPlaylistObj($yt, $playlistNameTitle);

    #Get list of video IDs
    $playlistVideoIDs = youtube::getVideosInPlaylist($yt, $playlistToModify);
    print_r($playlistVideoIDs);

    #Cycle through each ID and add to playlist
    foreach ($youtubeIDArray as $ID) {
        #Remove quotations from string to make proper URI
        $ID = str_replace('\'', '', $ID);
        echo $ID."\n";

        #Skip video if in array
        if (in_array($ID, $playlistVideoIDs)) {
            echo 'Skipped: '.$ID."<br>\n";
            continue;
        }

        #Add each video to the specified playlist
        youtube::addVideoToPlaylist($yt, $playlistToModify, $ID);
    }
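The skip-if-already-present logic in that loop can be checked in isolation. Below is a minimal Python 3 sketch; `videos_to_add` and the sample IDs are hypothetical, and no YouTube API calls are made.

```python
def videos_to_add(posted_ids, playlist_ids):
    """Return posted IDs not yet in the playlist, de-duplicated, in posting order."""
    existing = set(playlist_ids)
    seen = set()
    result = []
    for raw_id in posted_ids:
        # Strip the single quotes left over from parsing the Python list string.
        video_id = raw_id.replace("'", '')
        if video_id in existing or video_id in seen:
            continue
        seen.add(video_id)
        result.append(video_id)
    return result


# Hypothetical IDs: one already in the playlist, one posted twice.
posted = ["'abc_DEF-123'", "'aB3dE5gH7jK'", "'abc_DEF-123'", "'zZ9yY8xX7wW'"]
in_playlist = ['aB3dE5gH7jK']
print(videos_to_add(posted, in_playlist))
```

Looking IDs up in a set keeps each check constant-time, which only starts to matter once the playlist grows large.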

That is the basic outline of how to download HTML from Facebook after authentication and then add videos to a user's YouTube playlist. There are several features that I still need to add, namely a full OOP implementation of the Python and PHP functions, along with the ability to store the YouTube IDs, etc. in a MySQL database. The youtube class I made in PHP should be a useful wrapper for others looking for a starting point for interfacing with the YouTube API.

Note: Uncomment extension=php_openssl.dll in your php.ini file to remove the following error: Unable to find the socket transport "ssl" – did you forget to enable it when you configured PHP?

bahanonu [at] alum.mit.edu

©2006-2018 | biafra ahanonu | updated 31 january 2018