facebook crawler and youtube api: part 1

Summary

I created a small app, implemented in python and PHP, that crawls an authenticated Facebook page for Youtube videos and adds them to a specified user's Youtube playlist. It is useful for groups that post a lot of youtube videos and want a centralized playlist to share with others. I'll go over some of the implementation details in this post.

The application discussed here can be found at the FacebookToYoutube github repository. Special thanks to the Google API docs and others who helped provide skeleton code in some areas.

We have a group called Club Coat Check on Facebook where we post links to music we've found around the web. Most of the time that takes the form of a youtube video. I wanted the ability to scan the group periodically, find youtube links and add them automatically to a playlist. This was also a great opportunity to delve further into the world of APIs and frameworks. I'll outline the project and provide links to useful downloads to get things working.

The first part of the equation is logging into Facebook automatically and navigating to a webpage that needs an authenticated user to access. I chose to use python and mechanize. Mechanize functions as a GUI-less web browser and has no javascript support. The missing javascript support is troublesome on AJAX-heavy Facebook, but I realized that using the mobile site would work: same content, without AJAX (to reduce load on phones). Below is a sample of the code. The general flow is thus: set up the browser, find the form on the page, fill out the form, and submit. Then navigate to your chosen Facebook page and download its HTML source.

Python
import mechanize

#LOGIN_URL, CRAWL_URL, USERNAME and PASSWORD are constants defined elsewhere in the script

def getFacebookPage():
    #Opens secure connection to facebook and downloads page

    #Setup mechanize browser
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US)')]

    #Connect to login URL
    browser.open(LOGIN_URL)
    print('Login page: connected!')

    #Select the form, nr=0 means to use the first form you find
    #Else enter the name='form ID name'
    browser.select_form(nr=0)# name="login_form"

    #Fill the username and password in the form
    #Might have to check your particular webpage for actual name of input id
    browser['email'] = USERNAME
    browser['pass'] = PASSWORD

    #Clicks the submit button for a particular form
    result = browser.submit()
    print('Login page: accessing...')

    #Open URL to download
    result = browser.open(CRAWL_URL)
    print('Downloaded source: ' + CRAWL_URL)

    #Logout of our session to prevent problems...
    for link in browser.links(url_regex="/logout"):
        print('Following: ' + link.url)
        #follow_link both builds and opens the request, so click_link is not needed
        req = browser.follow_link(url=link.url)
        if 'Facebook' in req.get_data():
            print('Logged out!')
            sepline() #helper defined elsewhere in the script
        #Stop after the first logout link
        break

    return result.get_data()

Then crawl through the webpage using HTMLParser. See the git repository for details, but the essential part is the regexp to pull out the youtube video ID from the hrefs.

Python
import re
from HTMLParser import HTMLParser #Python 2 module name; html.parser in Python 3

class FacebookHTMLParser(HTMLParser):
    #Subclass that overrides the HTMLParser handler methods
    def handle_starttag(self, tag, attrs):
        for item in attrs:
            #Attributes without a value come through as None, so skip those
            if item[1] and '%2Fwatch%3Fv%3D' in item[1]:
                #Regular expression for youtube URLs in raw html
                go = re.search('v%3D([0-9A-Za-z_-]*)(&|%26)',item[1])
                if go is None:
                    continue

                #Get only the base video ID: the first group leaves out
                #the leading 'v%3D' and the trailing '&' or '%26'
                youtubeURL = go.group(1)
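
As an illustration of the same idea, here is a minimal Python 3 sketch that decodes the href first and then matches the plain URL, instead of matching the percent-encoded text. The function name `extract_video_id` and the sample href are made up for this example; the actual Facebook markup may look different.

```python
import re
from urllib.parse import unquote

# An ID is a run of letters, digits, underscores, or hyphens after "watch?v="
VIDEO_ID_PATTERN = re.compile(r"watch\?v=([0-9A-Za-z_-]+)")


def extract_video_id(href):
    """Return the YouTube video ID embedded in href, or None if absent."""
    if not href:
        return None
    # In the crawled HTML the YouTube URL appears percent-encoded, so decode it
    decoded = unquote(href)
    match = VIDEO_ID_PATTERN.search(decoded)
    return match.group(1) if match else None


# Hypothetical href, shaped like the encoded links the regexp above targets
sample = "/l.php?u=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dabc123DEF_-%26feature%3Dshare"
print(extract_video_id(sample))
```

Unlike the regexp above, this version does not require a trailing `&` after the ID, so it also handles links where the video ID is the last parameter.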

You can then download the webpage for each video and grab the correct title.

Python
browser = mechanize.Browser()
browser.open('http://www.youtube.com/watch?v='+youtubeURL)
youtubeTitle = browser.title()
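
`browser.title()` returns the contents of the page's `<title>` element. To show what that involves without a network call, here is a small Python 3 sketch that uses only the standard library. `TitleParser`, `page_title`, and the sample HTML are invented for illustration and are not part of the app.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html):
    """Return the stripped <title> text, or None if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if not parser.done:
        return None
    return "".join(parser.parts).strip()


# Made-up page source standing in for a downloaded video page
sample_html = "<html><head><title> Example Song - YouTube </title></head><body></body></html>"
print(page_title(sample_html))
```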

The last part is sending this list to a PHP script. Thanks TheBestJohn for a good starting point with this.

Python
def sendDataToURL(inputData):
    #Import urllib libraries (Python 2 module names)
    import urllib2, urllib

    #List of POST variables to pass, structure: [(varname,value)]
    dataToSend=[('youtubeURL',inputData[0]), ('youtubeTitles',inputData[1]),('youtubeID',inputData[2])]

    #Convert data
    dataToSend=urllib.urlencode(dataToSend)

    #Build the request; path is the URL of the PHP script, defined elsewhere
    req=urllib2.Request(path, dataToSend)

    #Send the data: the Request object alone sends nothing, urlopen performs the POST
    response=urllib2.urlopen(req)
    return response.read()
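
To make clear what the receiving script gets, the following Python 3 sketch encodes a list the way `urlencode` does by default (as the list's string form) and, as an alternative that the original app does not use, as JSON. The function names and sample IDs are assumptions for this example. Under Python 2 with unicode strings, the string form also carries `u` prefixes, as in `[u'abc', u'def']`.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_as_repr(ids):
    """Encode the list as urlencode does by default: via str(list)."""
    return urlencode([("youtubeID", ids)])


def encode_as_json(ids):
    """Alternative (not in the original app): serialize the list as JSON."""
    return urlencode([("youtubeID", json.dumps(ids))])


ids = ["abc123DEF_-", "xyz789GHI-_"]  # made-up video IDs

# parse_qs stands in for the receiving script reading the POST variable
received_repr = parse_qs(encode_as_repr(ids))["youtubeID"][0]
received_json = parse_qs(encode_as_json(ids))["youtubeID"][0]

print(received_repr)              # the Python list's string form
print(json.loads(received_json))  # round-trips back to a real list
```

With the JSON form, a PHP script could call `json_decode` on the POST variable instead of cleaning up the string by hand.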

To connect to Youtube and add videos to a user's playlist, I borrowed heavily from the Google PHP developer webpage and modified from there. First you need to get a developer key and download Zend.

From the python code we have passed several POST variables to PHP. Because the python list is passed as a string, we need to parse it.

PHP
public static function pythonStringToPHPArray($input){
    //Strip the python list syntax: the u prefixes and the brackets
    $inputClean = str_replace(' u', '', $input);
    $inputClean = str_replace('[u', '', $inputClean);
    $inputClean = str_replace(']', '', $inputClean);
    //Split the remaining comma-separated items into an array
    $inputArray = explode(',', $inputClean);
    return $inputArray;
}
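
To see what these replacements do, here is the same cleanup traced in Python 3 on a sample string. It is a sketch for illustration only, not part of the app: `python_string_to_list` and the sample IDs are made up, and splitting on commas is an assumption about the intended delimiter.

```python
def python_string_to_list(text):
    """Apply the same three replacements as the PHP helper, then split on commas."""
    cleaned = text.replace(" u", "")
    cleaned = cleaned.replace("[u", "")
    cleaned = cleaned.replace("]", "")
    return cleaned.split(",")


# String form of a Python 2 list of unicode strings (made-up video IDs)
sample = "[u'abc123DEF_-', u'xyz789GHI-_']"
items = python_string_to_list(sample)

# The surrounding quotes are still present at this point
print(items)

# Stripping the quotes leaves the bare IDs
ids = [item.replace("'", "") for item in items]
print(ids)
```

Because the replacements are plain substring matches, a value that itself contains ` u`, `]`, or a comma (a video title, say) would be altered or split. That is worth keeping in mind if titles are passed the same way.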

Now we can finally process the data. See youtube.php for the actual functions, but the code below provides a nice overview. We get the python lists, convert them to arrays, set up our authenticated Youtube connection, make a playlist if it doesn't already exist, check for video duplicates between the playlist and recently posted videos, and finally insert each video into the playlist.

PHP
#Retrieve data from facebook_parser.py, no filtering at the moment
$youtubeURL = $_POST['youtubeURL'];
$youtubeID = $_POST['youtubeID'];

#Convert python strings into PHP arrays
$youtubeURLArray = model::pythonStringToPHPArray($youtubeURL);
$youtubeIDArray = model::pythonStringToPHPArray($youtubeID);

#Remove duplicates from the arrays
$youtubeURLArray = array_unique($youtubeURLArray);
$youtubeIDArray = array_unique($youtubeIDArray);

#Setup youtube API connection
$yt = youtube::setYoutubeConnection();

#Name of playlist to modify
$playlistNameTitle = 'Club Coat Check 1';
$playlistNameDescription = 'A dance sensation.';

#Create new playlist, does nothing if playlist already exists
youtube::createNewYoutubePlaylist($yt, $playlistNameTitle, $playlistNameDescription);
$playlistToModify = youtube::getPlaylistObj($yt,$playlistNameTitle);

#Get List of video IDs
$playlistVideoIDs = youtube::getVideosInPlaylist($yt,$playlistToModify);
print_r($playlistVideoIDs);

#Cycle through each ID and add to playlist
foreach ($youtubeIDArray as $ID) {
    #Remove quotations from string to make proper URI
    $ID = str_replace('\'','',$ID);
    echo $ID.'<br>';

    #Skip video if in array
    if(in_array($ID, $playlistVideoIDs)){
        echo 'Skipped: '.$ID.'<br>';
        continue;
    }

    #Add each video to the specified playlist
    youtube::addVideoToPlaylist($yt,$playlistToModify,$ID);
}
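
The skip-if-present logic in the loop can be tried out on its own. Below is a minimal Python 3 sketch with plain lists standing in for the YouTube playlist. `plan_playlist_additions` and the IDs are hypothetical, and no API calls are made.

```python
def plan_playlist_additions(candidate_ids, playlist_ids):
    """Split candidates into IDs to add and IDs to skip.

    Repeated candidates are collapsed first (keeping first occurrences),
    then anything already in the playlist is skipped.
    """
    unique_candidates = list(dict.fromkeys(candidate_ids))
    existing = set(playlist_ids)

    to_add = [vid for vid in unique_candidates if vid not in existing]
    skipped = [vid for vid in unique_candidates if vid in existing]
    return to_add, skipped


# Made-up IDs: "bbb" is posted twice, "ccc" is already in the playlist
posted = ["aaa", "bbb", "bbb", "ccc"]
in_playlist = ["ccc", "ddd"]

to_add, skipped = plan_playlist_additions(posted, in_playlist)
print("Add:", to_add)
print("Skipped:", skipped)
```

Collapsing repeats before checking the playlist means a video posted twice in the group is only considered once per run.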

That is the basic outline of how to download HTML from Facebook after authentication and then add videos to a user's Youtube playlist. There are several features that I need to add, namely a full OOP implementation of the python and PHP functions along with the ability to store the Youtube IDs, etc. in a MySQL database. The youtube class I made in PHP should be a useful wrapper for others looking for a start to interfacing with the youtube api.

Note: Uncomment extension=php_openssl.dll in your php.ini file to remove the following error: Unable to find the socket transport "ssl" – did you forget to enable it when you configured PHP?

-biafra
bahanonu [at] alum.mit.edu

©2006-2024 | Site created & coded by Biafra Ahanonu | Updated 17 April 2024