Information Finder
In this activity, you will create a program that uses Wikipedia to find information. The goal is for the user to enter a search term and then see the first paragraph of the Wikipedia article on that subject.
Setting Up
Set up by creating a new Replit project.
- Go to Replit
- Log in
- Create a new Python Repl project named "Info Finder"
Basic Loop
Before figuring out how to use Wikipedia, create the basic loop for the program.
- At the top of the `main.py` file, create a `while True` loop
- In the indented body of the `while` loop, use `print` to show a welcome message
- Under that, create a variable named `search`
- Set the `search` variable to the result of an `input` asking for a search term
- Under that, create a variable named `keep_going`
- Set the `keep_going` variable to the result of an `input` asking if the user would like to continue (y/n)
- Under that, add an empty `print` statement to print a new line
- Under that, create an `if` statement checking if `keep_going` is NOT equal to `"y"`
- In the body of the `if`, print a "Goodbye" message and `break`
Run the program, and verify that the loop can continue as long as the user wants!
```python
while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
```
Info Function
Next, build out the structure for a `get_information` function. This function will not do anything yet, but eventually it will.
Defining the Function
First, define the function.
- Above the `while` loop, define a new function named `get_information`
- Give one parameter to the `get_information` function: `search_term`
- In the body of the `get_information` function, simply return `"No information found"` for now
```python
def get_information(search_term):
    return "No information found"
```
Calling the Function
Next, call the function in the code.
- Find the line between the `search` and `keep_going` variable initializations in the `while` loop body
- At that spot, create a variable named `information`
- Set the variable to the result of a call to the `get_information` function
- Pass in `search` as the argument to the function
- Under that, use `print` to show a message saying "Here's what I found:"
- Under that, print out the `information` variable
```python
information = get_information(search)
print("\nHere's what I found:")
print(information)
```
Run the program, and verify that the "No information found" message properly appears - that means the function is being called!
Requesting the HTML
Now it's time to go grab some information from the web. At the very top of the `main.py` file, add the following code to import the `requests` library:

```python
import requests
```
More information about this library is available here.
Retrieving Information Manually
As a human being, it's not too hard to go to Wikipedia and find some information. Giving that ability to a computer program is a little less straightforward, but it is certainly possible!
As a human, it would make sense to go to Wikipedia and type something into the search bar. For example, typing in "Apple" would lead to this page:
Behind the scenes, that actually sends the web browser to this url: https://en.wikipedia.org/w/index.php?search=apple
In fact, anything can be appended to the end of that URL in order to find the corresponding Wikipedia page! This will be extremely helpful for the Info Finder.
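As a sketch of how that URL can be built in Python: simple string concatenation is all the activity needs, though terms with spaces or special characters are safer to URL-encode first (the `quote_plus` step below is an extra safety measure, not part of the activity):

```python
from urllib.parse import quote_plus

# Wikipedia's search URL; anything appended after "search=" becomes the query
base_url = "https://en.wikipedia.org/w/index.php?search="

# Simple concatenation works for single-word terms
print(base_url + "apple")

# Multi-word terms are safer to URL-encode first;
# quote_plus turns "golden delicious" into "golden+delicious"
print(base_url + quote_plus("golden delicious"))
```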
Retrieving Information in Python
Using the `requests` library, it is possible to pull down all of the HTML code from a URL. Update the `get_information` function so that it prints out some raw HTML (for now).
- Find the `get_information` function, and make a new line at the top of the body
- Create a variable named `url`
- Set `url` to this string: `"https://en.wikipedia.org/w/index.php?search="`
- At the end of the line, add `+ search_term` to add on the search term from the user
  - This way, the URL will always reflect what the user hopes to find
- Under that, create a variable named `response`
- Set the `response` variable to `requests.get(url)`
  - Here, the `requests.get` function is able to retrieve the info from the web
- Under that, create a variable named `html_text`
- Set the `html_text` variable to equal `response.text`
  - This gets the raw HTML from the response
- Under that, use `print` to print the `html_text` variable
- Run the program, enter a search term, and verify that a mess of HTML appears!
```python
url = "https://en.wikipedia.org/w/index.php?search=" + search_term
response = requests.get(url)
html_text = response.text
print(html_text)
```
Beautiful Soup
Now the program has a bunch of HTML, but it's not very readable. It's time to parse it using BeautifulSoup, a Python library designed to help extract information from HTML text!
Right under the first `import` statement in the code, add the following line:

```python
from bs4 import BeautifulSoup
```
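To get a feel for how BeautifulSoup works before pointing it at real Wikipedia pages, here is a minimal sketch using a made-up HTML string:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to experiment with
sample_html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

document = BeautifulSoup(sample_html, "html.parser")

# find returns the first matching element; get_text strips out the tags
paragraph = document.find("p")
print(paragraph.get_text())  # Hello, world!
```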
Before jumping into the code, think about what is necessary to find in the HTML.
Looking at the HTML
Every Wikipedia page is a little different, so it's important to make sure the HTML parsing solution works for all (or at least the majority) of them. Take a look at some of the HTML for a page.
- Open a separate browser window
- Go to the Wikipedia page for Apple
- Right click somewhere on the first paragraph
- Select "Inspect" to open Developer Tools
- Try to find the first paragraph text in the HTML
  - It may be a little jumbled, but the text should be there
- Notice that the text is within a `p` element
- Notice that that `p` element is within a `div` element
- Notice that the `div` element has a `class` attribute of `mw-parser-output`
- Notice that the important `p` is not the first `p` within that `div`
- Also notice that the `div` is within another `div` with an `id` of `bodyContent`
All of this investigation should help when writing the code.
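As a preview, that nesting can be reproduced with a small hand-written HTML snippet (made up for illustration; real Wikipedia pages contain much more):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the nesting found on a Wikipedia page:
# a div with id "bodyContent" contains a div with class "mw-parser-output",
# which contains the p elements (the first of which may be empty)
sample_html = """
<div id="bodyContent">
  <div class="mw-parser-output">
    <p></p>
    <p>First real paragraph.</p>
  </div>
</div>
"""

document = BeautifulSoup(sample_html, "html.parser")
outer_div = document.find("div", {"id": "bodyContent"})
inner_div = outer_div.find("div", {"class": "mw-parser-output"})
paragraphs = inner_div.find_all("p")
print(len(paragraphs))  # 2 - the empty paragraph and the real one
```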
Parsing the HTML in Python
Now it's time to get into the code!
- Find the spot in the `get_information` function where `html_text` is printed
- Remove that line as it is no longer necessary
- In its place, create a new variable named `html_document`
- Set the `html_document` variable to the result of a call to the `BeautifulSoup` function
  - Pass in `html_text` as the first argument
  - Pass in `"html.parser"` (don't forget the quotes) as the second argument
Now the HTML document should be parsed into a searchable object. The code should look something like this:

```python
html_document = BeautifulSoup(html_text, "html.parser")
```
Getting the Container <div>
The next step is to pull the container `<div id="bodyContent">` element out of the entire document.
- Under the `html_document` variable, create a new variable named `content_criteria`
  - Consider what the program needs to find first
- Set the `content_criteria` variable to a new dictionary: `{}`
- Add a key of `"id"` to the dictionary, with a value of `"bodyContent"`
  - This will be able to find the appropriate `div` in the HTML
- Under that, create a variable named `content_div`
- Set the `content_div` variable to `html_document.find("div", content_criteria)`
  - This searches the `html_document` for `div` elements with an `id` of `bodyContent`
Now, the container `<div>` is stored in a variable! The code should look like this:

```python
content_criteria = {
    "id": "bodyContent"
}
content_div = html_document.find("div", content_criteria)
```
Getting the Inner <div>
Next, pull the inner `<div class="mw-parser-output">` out of the `content_div`.
- Under the `content_div` variable, create a new variable named `inner_div`
- Set the `inner_div` variable to a call to `content_div.find`
- For the first argument, pass in `"div"`
- For the second argument, pass in a new dictionary with `{` and `}`
  - Set a `"class"` key in the new dictionary to be `"mw-parser-output"`
Now, `inner_div` should be the `<div class="mw-parser-output">` that contains the information. The code looks something like this:

```python
inner_div = content_div.find("div", {
    "class": "mw-parser-output"
})
```
Getting the <p> Elements
Finally, it's time to get down to the paragraphs!
- Under the `inner_div` variable, create a new variable named `paragraphs`
- Set the `paragraphs` variable to `inner_div.find_all("p")`
  - This will be all of the `p` elements within the `div`
The `paragraphs` variable should contain all of the content `<p>` elements in a list. The code looks like this:

```python
paragraphs = inner_div.find_all("p")
```
The Right Paragraph
Now all the possible paragraphs have been found, but only the first non-empty one is needed. Loop through the paragraphs, and `return` the text for the first non-empty one!
- Under the `paragraphs` variable, create a `for` loop
- Loop through each `paragraph` in `paragraphs`
- In the indented body of the `for` loop, create a variable named `p_text`
- Set the `p_text` variable to `paragraph.get_text()`
  - This will hold the raw text for the paragraph
- Under that, create a variable named `clean_text`
- Set the `clean_text` variable to `p_text.strip()`
  - This will remove any leading and trailing whitespace from the text
- Under that, still within the `for` body, create an `if` statement
- For the condition of the `if`, simply put `clean_text`
  - This will be truthy for all non-empty string values
- In the indented body of the `if`: `return clean_text`
- Run the code, enter a search term, and verify that the first paragraph appears!
```python
for paragraph in paragraphs:
    p_text = paragraph.get_text()
    clean_text = p_text.strip()
    if clean_text:
        return clean_text
```
Nice Printing
There is one last thing to make the program a little nicer to use. Currently, the text can break to a new line in the middle of a word. This is not ideal! Luckily, there is another Python library to handle it. At the top of the file, along with the other `import`s, add the following:

```python
from textwrap import fill
```
The `fill` function will add newline characters between words so that long strings print out a little nicer. More information can be found here.
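As a quick illustration of what `fill` does (the sample sentence below is just placeholder text):

```python
from textwrap import fill

# A deliberately long, single-line sample string
long_text = "The quick brown fox jumps over the lazy dog near the riverbank. " * 3

# fill wraps at word boundaries, defaulting to a width of 70 characters
wrapped = fill(long_text)
print(wrapped)
```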
Find the part of the code where the `information` variable is printed, and replace it with this:

```python
print(fill(information))
```
Run the program again, and verify that all information printed looks nicer!
Final Code
```python
import requests
from bs4 import BeautifulSoup
from textwrap import fill


def get_information(search_term):
    url = "https://en.wikipedia.org/w/index.php?search=" + search_term
    response = requests.get(url)
    html_text = response.text
    html_document = BeautifulSoup(html_text, "html.parser")
    content_criteria = {
        "id": "bodyContent"
    }
    content_div = html_document.find("div", content_criteria)
    inner_div = content_div.find("div", {
        "class": "mw-parser-output"
    })
    paragraphs = inner_div.find_all("p")
    for paragraph in paragraphs:
        p_text = paragraph.get_text()
        clean_text = p_text.strip()
        if clean_text:
            return clean_text
    return "No information found"


while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    information = get_information(search)
    print("\nHere's what I found:")
    print(fill(information))
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
```
Challenges
After the activity, start working on the Info Finder Challenges.