
Marvel Web Scraper & Twitter Bot in Under 50 Lines of Code

Ekene A.

@marvelbeings is a simple Twitter bot I wrote in early 2019 during a vacation trip. The bot scrapes Marvel’s official characters site and tweets a link to a Marvel character’s page and the character’s photo from the site, along with a boilerplate string and a few hashtags. A user tapping or clicking on the link is taken to the character’s individual page, where they can read more about the character.

This post will serve as the documentation for @marvelbeings, as I have made the decision to deprecate it (at least for the time being). This post can also be used as a guide for anyone interested in building their own Twitter bot. My code is available on my GitHub.

Dependencies: requests, tweepy, beautifulsoup4, time

I wanted to keep this project as short and modular as possible. My code consists of three files, and this documentation is split into four sections. The first section briefly discusses how to go about registering a bot’s Twitter account and then obtaining tweepy API credentials. I won’t provide step-by-step details regarding setting up a Developer account, but I will provide links to documentation about the necessary actions. I should mention that I employed the official means to create this bot, which required me to enter a valid email address, register the account as a bot, and answer a few basic questions about the bot’s use case. I encourage anyone developing their own Twitter bot to use the official means as well. The second section recounts how simple it was to scrape the Marvel website and retrieve the url path to each character’s page. The third section covers the bot’s basic functionality, with a good amount of room for customization. The fourth section concludes this documentation and highlights a few ideas for further development.

In this documentation, I’ll be treating the three files found on my GitHub as one script.

Section 1: Bot Account Setup and Connecting to tweepy

Twitter allows developers to create and manage automated accounts through its Twitter Dev program.

Before connecting to tweepy, you’ll need to create your bot’s Twitter account (if you haven’t yet done so) and progress through the application process. Also, be sure to review Restricted Use Cases to ensure that how you intend to use your bot is actually permitted. Follow these steps on the Twitter Dev website to generate access tokens.

Now that we have our authentication credentials, we can start writing some code.

Importing tweepy:

import tweepy

# creds: replace with your Twitter Developer credentials
con_key = "your consumer key"
con_sec = "your consumer secret"
acc_tok = "your access token"
acc_sec = "your access secret"

auth = tweepy.OAuthHandler(con_key, con_sec) # authenticate
auth.set_access_token(acc_tok, acc_sec) # grant access

api = tweepy.API(auth) # connect to API

If you’re able to run the above code without error, you’ve successfully established your connection to tweepy and can now begin writing your bot’s functionality.

Test tweet:

Try calling update_status(). The update_status() method is how your bot will update its status, AKA send out a tweet.

api.update_status("Bot's 1st tweet. Thanks tweepy!🐥")

After running update_status(), open your bot’s Twitter page in a browser window and make sure the tweet shows up. To combat spam, Twitter does not allow an automated account to post a tweet that the account has already tweeted before. So, running the above line of code more than once will yield the duplicate status TweepError in the screenshot below.
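One lightweight safeguard against that error (a sketch of my own, not part of the original bot) is to track what has been posted during a run and skip repeats before ever calling the API:

```python
posted = set()  # statuses already sent this run

def tweet_once(send, text):
    """Send text via send(text) only if it hasn't been posted this run.

    Returns True if sent, False if skipped as a duplicate.
    In the real bot, `send` would be api.update_status.
    """
    if text in posted:
        return False
    send(text)
    posted.add(text)
    return True

# Demo with a plain list standing in for the API call:
sent = []
tweet_once(sent.append, "Bot's 1st tweet. Thanks tweepy!")
tweet_once(sent.append, "Bot's 1st tweet. Thanks tweepy!")  # duplicate, skipped
print(sent)  # the duplicate appears only once
```

Note this only guards within a single run; Twitter’s duplicate check also applies across runs, so a persistent record (or wrapping the call in a try/except) is still worth considering.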

Section 2: Scraping

Beautiful Soup 4 (bs4) is a popular Python library for exploring and collecting web page data. We’ll also use the requests library to fetch the url we want our code to scrape.

from bs4 import BeautifulSoup
import requests

The Marvel characters url is “https://www.marvel.com/characters”.

Make some Soup:

url = "https://www.marvel.com/characters"
web_stuff = requests.get(url) # fetch the page
html = web_stuff.text # the page's content as a string
soup = BeautifulSoup(html, "html.parser") # parse the page's html

Now, in order to find the relevant data, we’ll have to do some digging through the site’s html. I’ve laid these out as numbered steps.

  1. Using Safari or Chrome, navigate to the url declared in the above code.
  2. Right click on one of the character panes and click “Inspect Element”.
  3. This allows you to view the page’s html structure and access its tags. I’ve used my favorite Avenger as an example in the screenshot below.

It can be a bit tricky determining which html tag(s) to grab. Luckily, we’re only looking for the character’s name and the path to the character’s individual page. The <div> tag (div) I’ve highlighted in the above screenshot gives us both the name and the path. We can grab the entire div and then parse out the information we need.

4. We’ll grab all divs of class “mvl-card mvl-card--explore” since we want to be able to get every character’s info, and not just Spiderman’s. To do this we’ll call the find_all() method and pass it the tag type, which in this case is “div”, and the class name, “mvl-card mvl-card--explore”.

char_content = soup.find_all("div", class_="mvl-card mvl-card--explore")
print(type(char_content), "\n", char_content)
# type() should return "<class 'bs4.element.ResultSet'>"

Printing char_content returns the jungle of tags, text, and links you see in the below screenshot. This is because char_content contains all of the page’s divs of class “mvl-card mvl-card--explore”, and there are 48 such divs per page at the time of writing. I’ve highlighted part of Spiderman’s block as an example of one of the divs.

We’ll be looping through this collection of divs and sending off one tweet per div.

5. To extract the name and url path from each div, we’ll simply split() the content of the div: we grab the value of “data-click-text” for the character’s name, and split() around “data-click-url” for the path.

Character name:

tweeted_chars = [] # Keep track of tweeted characters

def get_name(content, i): # i is the index of the div to parse
    html_block = str(content[i])
    temp_name = html_block.split('data-click-text="', 1)
    character = temp_name[1].split('" data-click-type')[0]
    tweeted_chars.append(character) # Track tweeted characters
    return character

Character path:

We’ll use the path contained in the div to create the full path to the character’s page by appending that path to the top level https://www.marvel.com. The full path for Spiderman, for example, would be https://www.marvel.com + “/characters/spider-man-peter-parker” which when clicked/tapped will open Spiderman’s individual page.

def get_url(content, i): # i is the index of the div to parse
    html_block = str(content[i]) # Single div -> str
    temp_path = html_block.split('url="', 1) # Single out the url
    path = temp_path[1].split('" data-further-details', 1)[0]
    char_url = "https://www.marvel.com" + path # Full path
    return char_url

That’s it for our scraper.
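To see the two split() calls in action without hitting the live site, here’s a dry run on a simplified, made-up div string (the real markup carries many more attributes; only the ones the splits rely on are shown):

```python
# A simplified stand-in for one character card's div.
sample_div = (
    '<div class="mvl-card mvl-card--explore" '
    'data-click-text="Spider-Man" data-click-type="card" '
    'data-click-url="/characters/spider-man-peter-parker" '
    'data-further-details="...">...</div>'
)

# Same split logic as get_name():
name = sample_div.split('data-click-text="', 1)[1].split('" data-click-type')[0]

# Same split logic as get_url():
path = sample_div.split('url="', 1)[1].split('" data-further-details', 1)[0]
char_url = "https://www.marvel.com" + path

print(name)      # Spider-Man
print(char_url)  # https://www.marvel.com/characters/spider-man-peter-parker
```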

Section 3: Start tweeting

This is the easier part. We loop through char_content and pass each div to our two functions, which extract the character’s name and create the character’s full path. We can then pass both values to api.update_status() which, if you remember, is how our bot sends out tweets. We’ll pass a boilerplate f-string containing our values with some hashtags to help our bot gain some attention.

i = 6 # index start to be passed to get_name() and get_url()

At the time of writing, the Marvel web page is structured in such a way that the first six divs of class “mvl-card mvl-card--explore” do not contain any Marvel character data. For this reason, we begin our loop at index 6.

import time

for content in char_content[i:]: # skip the first six non-character divs
    api.update_status(
        f"Today's Marvel character is {get_name(char_content, i).upper()}" +
        f" 🤖 #Marvel #MarvelComics #Heroes #Villains {get_url(char_content, i)}"
    )
    i += 1
    print("Posting character " + str(i) + ".")
    time.sleep(86400) # sleep for 86400 sec -> 1 day

Change the time.sleep() integer to 1 and let it run for 5 seconds to make sure the bot is grabbing and tweeting the characters in the order they appear on the site. Don’t run it for too long, because you’ll need to manually delete those tweets if you re-run the loop to avoid the duplicate status TweepError.
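You can also dry-run the loop logic entirely offline (a sketch with mocked character names; nothing is tweeted, so there’s no cleanup afterwards):

```python
# Mocked stand-ins for scraped character names, in page order.
mock_names = ["Spider-Man", "Iron Man", "Thor"]

statuses = []
for i, name in enumerate(mock_names, start=6):  # the real loop starts at div index 6
    status = f"Today's Marvel character is {name.upper()} 🤖 #Marvel"
    statuses.append(status)
    print(f"Posting character {i + 1}: {status}")
```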

Section 4: Conclusion & Further Development

This concludes the documentation for the @marvelbeings Twitter bot. Below is the final script along with two ideas for further development.

Be sure to keep your keys and tokens secure. Misuse of bots can get your account reported, deactivated, or worse.

import tweepy
from bs4 import BeautifulSoup
import requests
import time

# creds: replace with your Twitter Developer credentials
con_key = "your consumer key"
con_sec = "your consumer secret"
acc_tok = "your access token"
acc_sec = "your access secret"

auth = tweepy.OAuthHandler(con_key, con_sec)
auth.set_access_token(acc_tok, acc_sec)
api = tweepy.API(auth) # connect to API

url = "https://www.marvel.com/characters"
web_stuff = requests.get(url) # fetch the page
html = web_stuff.text # the page's content as a string
soup = BeautifulSoup(html, "html.parser") # parse the page's html

char_content = soup.find_all("div", class_="mvl-card mvl-card--explore")
print(type(char_content), "\n", char_content)

tweeted_chars = [] # Keep track of tweeted characters

def get_name(content, i): # i is the index of the div to parse
    html_block = str(content[i])
    temp_name = html_block.split('data-click-text="', 1)
    character = temp_name[1].split('" data-click-type')[0]
    tweeted_chars.append(character) # Record tweeted characters
    return character # name

def get_url(content, i): # i is the index of the div to parse
    html_block = str(content[i]) # Single div -> str
    temp_path = html_block.split('url="', 1) # Single out the url
    path = temp_path[1].split('" data-further-details', 1)[0]
    char_url = "https://www.marvel.com" + path # Full path
    return char_url

i = 6 # first six divs hold no character data

for content in char_content[i:]: # skip the first six non-character divs
    api.update_status(
        f"Today's Marvel character is {get_name(char_content, i).upper()}" +
        f" 🤖 #Marvel #MarvelComics #Heroes #Villains {get_url(char_content, i)}"
    )
    i += 1
    print("Posting character " + str(i) + ".")
    time.sleep(86400) # sleep for 86400 sec -> 1 day

Scrape Multiple Pages: There are 72 pages worth of Marvel characters on the Marvel website.

@marvelbeings currently only scrapes page 1. The first difficulty here is that the url path will not change as you jump from one page to another; it remains https://www.marvel.com/characters. Using Selenium WebDriver to navigate through the pages may be a good approach. Also, the first twelve Marvel characters on each page are the key characters in movies like Avengers: Endgame and Captain Marvel, and do not change regardless of the page you navigate to. For all pages after page one, you will need to begin your loop at index 18 instead of index 6 in order to avoid posting the same character more than once and to avoid receiving the duplicate status TweepError (which will also break the loop).
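The varying start index can be captured in a small helper (my own sketch, based purely on the counts above):

```python
def start_index(page):
    """Index of the first div to tweet on a given results page.

    Page 1: skip the 6 non-character cards at the top.
    Pages 2+: also skip the 12 featured movie characters that
    repeat on every page, for a start index of 18.
    """
    return 6 if page == 1 else 18

print(start_index(1))  # 6
print(start_index(5))  # 18
```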

Randomize Characters: Instead of following the order in which the characters appear on the site, you can make the bot grab and tweet the characters in a random order by using Lib/random.py and setting your index (i) to a random integer within range(6, len(char_content)). You’d need to pass get_name() and get_url() the random integer, record that random integer, and ensure it isn’t passed again or you’ll end up getting the duplicate status TweepError (which will also break the loop) for attempting to post an already-tweeted tweet.
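Here’s a sketch of that idea using the standard library’s random module; shuffling the valid indices up front guarantees no index is ever drawn twice, so no duplicate tracking is needed:

```python
import random

def tweet_order(num_divs, start=6, seed=None):
    """Return the div indices [start, num_divs) in a random, non-repeating order.

    num_divs would be len(char_content) in the real bot; seed is
    optional and only useful for reproducible testing.
    """
    indices = list(range(start, num_divs))
    random.Random(seed).shuffle(indices)
    return indices

order = tweet_order(48)  # e.g. with 48 divs on the page
print(len(order))        # 42 indices, each appearing exactly once
```

The bot’s loop would then iterate over `order` and pass each index to get_name() and get_url() instead of counting up from 6.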

Thanks a bunch for allowing me to share, more to come. 😃

Source: https://chatbotslife.com/marvel-web-scraper-twitter-bot-in-under-50-lines-of-code-453456c917c?source=rss—-a49517e4c30b—4
