Information Finder
In this activity, create a program that uses Wikipedia to find information. The goal: the user enters a search term, and the program shows the first paragraph of the Wikipedia article on that subject.
Setting Up
Set up by creating a new Replit project.
- Go to Replit
- Log in
- Create a new Python Repl project named "Info Finder"
Basic Loop
Before figuring out how to use Wikipedia, create the basic loop for the program.
- At the top of the main.py file, create a `while True` loop
- In the indented body of the `while` loop, use `print` to show a welcome message
- Under that, create a variable named `search`
- Set the `search` variable to the result of an `input` asking for a search term
- Under that, create a variable named `keep_going`
- Set the `keep_going` variable to the result of an `input` asking if the user would like to continue (y/n)
- Under that, add an empty `print` statement to print a new line
- Under that, create an `if` statement checking if `keep_going` is NOT equal to `"y"`
- In the body of the `if`, print a "Goodbye" message and `break`
Run the program, and verify that the loop can continue as long as the user wants!
while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
Info Function
Next, build out the structure for a `get_information` function. For now, the function will simply return a placeholder message; it will be filled in over the rest of the activity.
Defining the Function
First, define the function.
- Above the `while` loop, define a new function named `get_information`
- Give one parameter to the `get_information` function: `search_term`
- In the body of the `get_information` function, simply return `"No information found"` for now
def get_information(search_term):
    return "No information found"
Calling the Function
Next, call the function in the code.
- Find the line between the `search` and `keep_going` variable initializations in the `while` loop body
- At that spot, create a variable named `information`
- Set the variable to the result of a call to the `get_information` function
- Pass in `search` as the argument to the function
- Under that, use `print` to show a message saying "Here's what I found:"
- Under that, print out the `information` variable
information = get_information(search)
print("\nHere's what I found:")
print(information)
Run the program, and verify that the "No information found" message properly appears - that means the function is being called!
Requesting the HTML
Now it's time to go grab some information from the web. At the very top of the main.py file, add the following code to import the requests library:
import requests
More information about this library is available in the official Requests documentation.
Retrieving Information Manually
As a human being, it's not too hard to go to Wikipedia and find some information. Giving that ability to a computer program is a little less straightforward, but it is certainly possible!
As a human, it would make sense to go to Wikipedia and type something into the search bar. For example, typing in "Apple" leads to the Wikipedia article about apples.
Behind the scenes, that actually sends the web browser to this url: https://en.wikipedia.org/w/index.php?search=apple
In fact, anything can be appended to the end of that URL in order to find the corresponding Wikipedia page! This will be extremely helpful for the Info Finder.
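Note that a raw search term containing spaces or punctuation is not always a valid URL. The activity below simply concatenates the term, which works for single words; a minimal sketch of one way to handle trickier terms, using Python's standard `urllib.parse.quote` (the helper name `build_search_url` is made up for illustration, and this step is optional):

```python
from urllib.parse import quote

def build_search_url(search_term):
    # Percent-encode the term so spaces and punctuation survive in the URL
    base = "https://en.wikipedia.org/w/index.php?search="
    return base + quote(search_term)

print(build_search_url("red panda"))
# https://en.wikipedia.org/w/index.php?search=red%20panda
```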
Retrieving Information in Python
Using the requests library, it is possible to pull down all of the HTML code from a URL. Update the get_information function so that it prints out some raw HTML (for now).
- Find the `get_information` function, and make a new line at the top of the body
- Create a variable named `url`
- Set `url` to this string: `"https://en.wikipedia.org/w/index.php?search="`
- At the end of the line, add `+ search_term` to add on the search term from the user
  - This way, the URL will always reflect what the user hopes to find
- Under that, create a variable named `response`
- Set the `response` variable to `requests.get(url)`
  - Here, the `requests.get` function retrieves the info from the web
- Under that, create a variable named `html_text`
- Set the `html_text` variable to equal `response.text`
  - This gets the raw HTML from the response
- Under that, use `print` to print the `html_text` variable
- Run the program, enter a search term, and verify that a mess of HTML appears!
url = "https://en.wikipedia.org/w/index.php?search=" + search_term
response = requests.get(url)
html_text = response.text
print(html_text)
Beautiful Soup
Now the program has a bunch of HTML, but it's not very readable. It's time to parse it using BeautifulSoup, a Python library designed to help extract information from HTML text!
Right under the first import statement in the code, add the following line:
from bs4 import BeautifulSoup
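As a quick preview of how BeautifulSoup works, here is a small self-contained sketch. The HTML string and variable names are invented for illustration, but the `find` calls mirror what the activity will do with real Wikipedia HTML:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a full Wikipedia page
sample_html = """
<div id="bodyContent">
  <div class="mw-parser-output">
    <p>   </p>
    <p>Apples are a kind of fruit.</p>
  </div>
</div>
"""

document = BeautifulSoup(sample_html, "html.parser")
outer = document.find("div", {"id": "bodyContent"})       # find by id
inner = outer.find("div", {"class": "mw-parser-output"})  # find by class
print(inner.find_all("p")[1].get_text())  # Apples are a kind of fruit.
```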
Before jumping into the code, think about what is necessary to find in the HTML.
Looking at the HTML
Every Wikipedia page is a little different, so it's important to make sure the HTML parsing solution works for all (or at least the majority) of them. Take a look at some of the HTML for a page.
- Open a separate browser window
- Go to the Wikipedia page for Apple
- Right click somewhere on the first paragraph
- Select "Inspect" to open Developer Tools
- Try to find the first paragraph text in the HTML
- It may be a little jumbled, but the text should be there
- Notice that the text is within a `p` element
- Notice that that `p` element is within a `div` element
- Notice that the `div` element has a `class` attribute of `mw-parser-output`
- Notice that the important `p` is not the first `p` within that `div`
- Also notice that the `div` is within another `div` with an `id` of `bodyContent`
All of this investigation should help when writing the code.
Parsing the HTML in Python
Now it's time to get into the code!
- Find the spot in the `get_information` function where `html_text` is printed
- Remove that line, as it is no longer necessary
- In its place, create a new variable named `html_document`
- Set the `html_document` variable to the result of a call to the `BeautifulSoup` function
  - Pass in `html_text` as the first argument
  - Pass in `"html.parser"` (don't forget the quotes) as the second argument
Now the HTML is parsed into a structured BeautifulSoup object. The code should look something like this:
html_document = BeautifulSoup(html_text, "html.parser")
Getting the Container <div>
The next step is to pull the container <div id="bodyContent"> element out of the entire document.
- Under the `html_document` variable, create a new variable named `content_criteria`
  - Consider what the program needs to find first
- Set the `content_criteria` variable to a new dictionary: `{}`
- Add a key of `"id"` to the dictionary, with a value of `"bodyContent"`
  - This will be able to find the appropriate `div` in the HTML
- Under that, create a variable named `content_div`
- Set the `content_div` variable to `html_document.find("div", content_criteria)`
  - This searches the `html_document` for `div` elements with an `id` of `bodyContent`
Now, the container <div> is stored in a variable! The code should look like this:
content_criteria = {
    "id": "bodyContent"
}
content_div = html_document.find("div", content_criteria)
Getting the Inner <div>
Next, pull the inner <div class="mw-parser-output"> out of the content_div.
- Under the `content_div` variable, create a new variable named `inner_div`
- Set the `inner_div` variable to a call to `content_div.find`
- For the first argument, pass in `"div"`
- For the second argument, pass in a new dictionary with `{` and `}`
  - Set a `"class"` key in the new dictionary to be `"mw-parser-output"`
Now, inner_div should be the <div class="mw-parser-output"> that contains the information. The code looks something like this:
inner_div = content_div.find("div", {
    "class": "mw-parser-output"
})
Getting the <p> Elements
Finally, it's time to get down to the paragraphs!
- Under the `inner_div` variable, create a new variable named `paragraphs`
- Set the `paragraphs` variable to `inner_div.find_all("p")`
  - This will be all of the `p` elements within the `div`
The paragraphs variable should contain all of the content <p> elements in a list. The code looks like this:
paragraphs = inner_div.find_all("p")
The Right Paragraph
Now all the possible paragraphs have been found, but only the first non-empty one is needed. Loop through the paragraphs, and return the text for the first non-empty one!
- Under the `paragraphs` variable, create a `for` loop
- Loop through each `paragraph` in `paragraphs`
- In the indented body of the `for` loop, create a variable named `p_text`
- Set the `p_text` variable to `paragraph.get_text()`
  - This will hold the raw text for the paragraph
- Under that, create a variable named `clean_text`
- Set the `clean_text` variable to `p_text.strip()`
  - This will remove any leading and trailing whitespace from the text
- Under that, still within the `for` body, create an `if` statement
- For the condition of the `if`, simply put `clean_text`
  - This will be `True` for all non-empty string values
- In the indented body of the `if`: `return clean_text`
- Run the code, enter a search term, and verify that the first paragraph appears!
for paragraph in paragraphs:
    p_text = paragraph.get_text()
    clean_text = p_text.strip()
    if clean_text:
        return clean_text
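The `if clean_text` check works because Python treats an empty string as falsy in a condition. A quick demonstration:

```python
# Empty strings are falsy; non-empty strings are truthy
print(bool(""))              # False
print(bool("   ".strip()))   # False -- stripping whitespace leaves ""
print(bool("Apple"))         # True
```

This is why paragraphs that contain only whitespace get skipped by the loop.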
Nice Printing
There is one last thing to make the program a little nicer to use. Currently, a long paragraph prints as one continuous line, and the console may break it in the middle of a word. This is not ideal! Luckily, there is another Python library to handle it. At the top of the file, along with the other imports, add the following:
from textwrap import fill
The `fill` function adds newline characters within a long string so that it prints out a little nicer. More information can be found in the `textwrap` module documentation.
Find the part of the code where the information variable is printed, and replace it with this:
print(fill(information))
Run the program again, and verify that all information printed looks nicer!
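By default, `fill` wraps lines at 70 characters; an optional `width` argument changes that. A short sketch (the sample sentence is just for illustration):

```python
from textwrap import fill

text = "The quick brown fox jumps over the lazy dog near the river bank."
wrapped = fill(text, width=30)
# Every line in the result is at most 30 characters long
print(wrapped)
```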
Final Code
import requests
from bs4 import BeautifulSoup
from textwrap import fill

def get_information(search_term):
    url = "https://en.wikipedia.org/w/index.php?search=" + search_term
    response = requests.get(url)
    html_text = response.text
    html_document = BeautifulSoup(html_text, "html.parser")
    content_criteria = {
        "id": "bodyContent"
    }
    content_div = html_document.find("div", content_criteria)
    inner_div = content_div.find("div", {
        "class": "mw-parser-output"
    })
    paragraphs = inner_div.find_all("p")
    for paragraph in paragraphs:
        p_text = paragraph.get_text()
        clean_text = p_text.strip()
        if clean_text:
            return clean_text
    return "No information found"

while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    information = get_information(search)
    print("\nHere's what I found:")
    print(fill(information))
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
Challenges
After the activity, start working on the Info Finder Challenges.