facebook crawler and youtube api: part 1


I created a small app, implemented in Python and PHP, that crawls an authenticated Facebook page for YouTube videos and adds them to a specified user's YouTube playlist. It is useful for groups that post a lot of YouTube videos and want a centralized playlist to share with others. I'll go over some of the implementation details in this post.

The application discussed here can be found at the FacebookToYoutube github repository. Special thanks to the Google API docs and others who helped provide skeleton code in some areas.

We have a group called Club Coat Check on Facebook where we post links to music we've found around the web. Most of the time that takes the form of a YouTube video. I wanted the ability to scan the group periodically, find YouTube links, and add them automatically to a playlist. This was also a great opportunity to delve further into the world of APIs and frameworks. I'll outline the project and provide links to useful downloads to get things working.

The first part of the equation is logging into Facebook automatically and navigating to a page that needs an authenticated user to access. I chose to use Python and mechanize. Mechanize functions as a GUI-less web browser and has no JavaScript support. The lack of JavaScript is troublesome on AJAX-heavy Facebook, but I realized that the mobile site would work: it serves the same content without AJAX (to reduce load on phones). Below is a sample of the code. The general flow is: set up the browser, find the form on the page, fill out the form, and submit. Then navigate to your chosen Facebook page and download its HTML source.

    import mechanize

    #LOGIN_URL, CRAWL_URL, USERNAME and PASSWORD are settings
    #defined elsewhere in the script

    def getFacebookPage():
        #Opens secure connection to facebook and downloads page

        #Setup mechanize browser
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US)')]

        #Connect to login URL
        browser.open(LOGIN_URL)
        print('Login page: connected!')

        #Select the form, nr=0 means to use the first form you find
        #Else enter the name='form ID name'
        browser.select_form(nr=0)# name="login_form"

        #Fill the username and password in the form
        #Might have to check your particular webpage for actual name of input id
        browser['email'] = USERNAME
        browser['pass'] = PASSWORD

        #Clicks the submit button for a particular form
        result = browser.submit()
        print('Login page: accessing...')

        #Open URL to download and keep its source before navigating away
        result = browser.open(CRAWL_URL)
        pageSource = result.get_data()
        print('Downloaded source: ' + CRAWL_URL)

        #Logout of our session to prevent problems...
        for link in browser.links(url_regex="/logout"):
            print('Following: ' + link.url)
            req = browser.follow_link(url=link.url)
            if 'Facebook' in req.get_data():
                print('Logged out!')
                #sepline() is a helper defined elsewhere in the script
                sepline()

        return pageSource

Then crawl through the webpage using HTMLParser. See the git repository for details, but the essential part is the regular expression that pulls the YouTube video ID out of the hrefs. The hrefs are percent-encoded, which is why the pattern matches v%3D rather than v=.

    import re
    from HTMLParser import HTMLParser  #Python 2 module name

    class FacebookHTMLParser(HTMLParser):
        #Subclass that overrides the HTMLParser handler methods
        def handle_starttag(self, tag, attrs):
            for item in attrs:
                #item is a (name, value) pair; value is None for bare attributes
                if item[1] and '%2Fwatch%3Fv%3' in item[1]:
                    #Regular expression for youtube URLs in raw html
                    go = re.search('v%3D[0-9A-Za-z_-]*(&|%26)',item[1])
                    if go is None:
                        #No trailing & or %26 after the ID, nothing to extract
                        continue
                    youtubeURL = go.group()

                    #Remove characters, get only base video ID
                    youtubeURL = youtubeURL.replace('v%3D','')
                    youtubeURL = youtubeURL.replace('&','')
                    youtubeURL = youtubeURL.replace('%26','')
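As a rough alternative to stripping the prefix and suffix by hand, the percent-encoded href can be decoded first and the `v` parameter read from the query string. This also picks up links where the ID is the last thing in the URL. The sketch below is written for Python 3 using only the standard library; the class name `YoutubeLinkParser`, the helper `extract_video_id`, and the sample HTML are made up for illustration and are not taken from the repository. It assumes the link wraps the YouTube URL inside an encoded query parameter, as the check above implies.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse, parse_qs


def extract_video_id(href):
    """Return the YouTube video ID embedded in href, or None if absent."""
    decoded = unquote(href)
    # Locate the embedded YouTube URL inside the decoded redirect link
    start = decoded.find('youtube.com/watch?')
    if start == -1:
        return None
    query = urlparse('http://' + decoded[start:]).query
    values = parse_qs(query).get('v')
    return values[0] if values else None


class YoutubeLinkParser(HTMLParser):
    """Collect unique video IDs from every href on the page, in page order."""

    def __init__(self):
        super().__init__()
        self.video_ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != 'href' or value is None:
                continue
            video_id = extract_video_id(value)
            if video_id and video_id not in self.video_ids:
                self.video_ids.append(video_id)


# Hypothetical snippet shaped like a redirect link on the mobile site
sample_html = (
    '<a href="/l.php?u=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dabc123XYZ_-'
    '%26feature%3Dshare&amp;h=xyz">song</a>'
    '<a href="/profile.php">not a video</a>'
)
parser = YoutubeLinkParser()
parser.feed(sample_html)
print(parser.video_ids)
```

Letting `parse_qs` do the splitting means the code does not care whether the ID is followed by `&`, `%26`, or nothing at all.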

You can then download the webpage for each video and grab the correct title.

    browser = mechanize.Browser()
    browser.open('http://www.youtube.com/watch?v='+youtubeURL)
    youtubeTitle = browser.title()

The last part is sending the collected lists (URLs, titles, and IDs) to a PHP script. Thanks to TheBestJohn for a good starting point with this.

    def sendDataToURL(inputData):
        #Import urllib libraries
        import urllib2, urllib

        #List of POST variables to pass, structure: [(varname,value)]
        #Each value is a python list; urlencode sends its string form
        dataToSend=[('youtubeURL',inputData[0]), ('youtubeTitles',inputData[1]),('youtubeID',inputData[2])]

        #Convert data
        dataToSend=urllib.urlencode(dataToSend)

        #Send the data; path is the URL of the PHP script, defined elsewhere
        req=urllib2.Request(path, dataToSend)
        response=urllib2.urlopen(req)
        return response.read()

To connect to YouTube and add videos to a user's playlist, I borrowed heavily from the Google PHP developer webpage and modified the code from there. First you need to get a developer key and download the Zend Framework.

From the python code we have passed several POST variables to PHP. Because the python list is passed as a string, we need to parse it.

    public static function pythonStringToPHPArray($input){
        // Input looks like "[u'abc', u'def']"; strip the python list syntax
        $inputClean = str_replace(' u', '', $input);
        $inputClean = str_replace('[u', '', $inputClean);
        $inputClean = str_replace(']', '', $inputClean);
        // $facebookDataClean = str_replace('[', '', $facebookDataClean);
        // Split on the commas that separate the remaining items
        $inputArray = preg_split('/,/', $inputClean);
        return $inputArray;
    }
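One way to sidestep the string cleanup entirely is to serialize the lists as JSON before posting them, so the PHP side can decode them directly instead of stripping Python's list formatting. This is a swapped-in technique, not what the repository does. The sketch is Python 3, the helper name `build_post_body` and the sample values are hypothetical, and nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_post_body(urls, titles, ids):
    """Encode the three lists as JSON strings inside a form-encoded POST body."""
    fields = [
        ('youtubeURL', json.dumps(urls)),
        ('youtubeTitles', json.dumps(titles)),
        ('youtubeID', json.dumps(ids)),
    ]
    # urlencode returns str; urllib.request expects bytes for POST data
    return urlencode(fields).encode('ascii')


body = build_post_body(
    ['http://www.youtube.com/watch?v=abc123XYZ_-'],
    ['A title, with a comma and u in it'],
    ['abc123XYZ_-'],
)

# Decode the body the way a form handler would, to confirm the round trip
received = parse_qs(body.decode('ascii'))
print(json.loads(received['youtubeTitles'][0]))
```

On the PHP side, `json_decode($_POST['youtubeID'], true)` would then return a plain array, and titles that contain commas or the sequence " u" would survive intact.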

Now we can finally process the data. See youtube.php for the actual functions, but the code below provides a nice overview. We get the Python lists, convert them to arrays, set up our authenticated YouTube connection, make a playlist if it doesn't already exist, check for duplicates between the playlist and recently posted videos, and finally insert each new video into the playlist.

    #Retrieve data from facebook_parser.py, no filtering at the moment
    $youtubeURL = $_POST['youtubeURL'];
    $youtubeID = $_POST['youtubeID'];

    #Convert python strings into PHP arrays
    $youtubeURLArray = model::pythonStringToPHPArray($youtubeURL);
    $youtubeIDArray = model::pythonStringToPHPArray($youtubeID);

    #Remove duplicates from the arrays
    $youtubeURLArray = array_unique($youtubeURLArray);
    $youtubeIDArray = array_unique($youtubeIDArray);

    #Setup youtube API connection
    $yt = youtube::setYoutubeConnection();

    #Name of playlist to modify
    $playlistNameTitle = 'Club Coat Check 1';
    $playlistNameDescription = 'A dance sensation.';

    #Create new playlist, does nothing if playlist already exists
    youtube::createNewYoutubePlaylist($yt, $playlistNameTitle, $playlistNameDescription);
    $playlistToModify = youtube::getPlaylistObj($yt,$playlistNameTitle);

    #Get List of video IDs
    $playlistVideoIDs = youtube::getVideosInPlaylist($yt,$playlistToModify);
    print_r($playlistVideoIDs);

    #Cycle through each ID and add to playlist
    foreach ($youtubeIDArray as $ID) {
        #Remove quotations from string to make proper URI
        $ID = str_replace('\'','',$ID);
        echo $ID."\n";

        #Skip video if in array
        if(in_array($ID, $playlistVideoIDs)){
            echo 'Skipped: '.$ID."<br>\n";
            continue;
        }

        #Add each video to the specified playlist
        youtube::addVideoToPlaylist($yt,$playlistToModify,$ID);
    }
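The duplicate handling in the loop above boils down to two checks: drop repeats within the newly posted IDs, then skip anything already in the playlist. Here is a minimal Python sketch of just that logic, with no YouTube calls; the function name `ids_to_add` and the sample IDs are invented for illustration.

```python
def ids_to_add(posted_ids, playlist_ids):
    """Return posted IDs not yet in the playlist, without repeats, in order."""
    already_present = set(playlist_ids)
    seen = set()
    result = []
    for raw_id in posted_ids:
        # Mirror the PHP cleanup: strip leftover quotes (and stray whitespace)
        video_id = raw_id.replace("'", "").strip()
        if not video_id or video_id in seen or video_id in already_present:
            continue
        seen.add(video_id)
        result.append(video_id)
    return result


posted = ["'aaa111'", "'bbb222'", "'aaa111'", "'ccc333'"]
in_playlist = ["bbb222"]
print(ids_to_add(posted, in_playlist))  # ['aaa111', 'ccc333']
```

Using sets for the membership checks keeps each lookup cheap even when the playlist grows to a few hundred videos.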

That is the basic outline of how to download HTML from Facebook after authentication and then add videos to a user's YouTube playlist. There are several features that I still need to add, namely a full OOP implementation of the Python and PHP functions, along with the ability to store the YouTube IDs, etc. in a MySQL database. The YouTube class I made in PHP should be a useful wrapper for others looking for a start on interfacing with the YouTube API.

Note: Uncomment extension=php_openssl.dll in your php.ini file to remove the following error: Unable to find the socket transport "ssl" – did you forget to enable it when you configured PHP?

bahanonu [at] alum.mit.edu

©2006-2017 | biafra ahanonu | updated 12 december 2017