Information Finder
In this activity, create a program that uses Wikipedia to find information. The goal is for the user to enter a search term, and for the program to then show the first paragraph of the Wikipedia article on that subject.
Setting Up
Set up by creating a new Replit project.
- Go to Replit
- Log in
- Create a new Python Repl project named "Info Finder"
Basic Loop
Before figuring out how to use Wikipedia, create the basic loop for the program.
- At the top of the main.py file, create a while True loop
- In the indented body of the while loop, use print to show a welcome message
- Under that, create a variable named search
- Set the search variable to the result of an input asking for a search term
- Under that, create a variable named keep_going
- Set the keep_going variable to the result of an input asking if the user would like to continue (y/n)
- Under that, add an empty print statement to print a new line
- Under that, create an if statement checking if keep_going is NOT equal to "y"
- In the body of the if, print a "Goodbye" message and break
Run the program, and verify that the loop can continue as long as the user wants!
while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
Info Function
Next, build out the structure for a get_information function. This function will not do anything useful yet, but it will be filled in over the next several steps.
Defining the Function
First, define the function.
- Above the while loop, define a new function named get_information
- Give one parameter to the get_information function: search_term
- In the body of the get_information function, simply return "No information found" for now
def get_information(search_term):
    return "No information found"
Calling the Function
Next, call the function in the code.
- Find the line between the search and keep_going variable initializations in the while loop body
- At that spot, create a variable named information
- Set the variable to the result of a call to the get_information function
- Pass in search as the argument to the function
- Under that, use print to show a message saying "Here's what I found:"
- Under that, print out the information variable
information = get_information(search)
print("\nHere's what I found:")
print(information)
Run the program, and verify that the "No information found" message properly appears - that means the function is being called!
Requesting the HTML
Now it's time to go grab some information from the web. At the very top of the main.py file, add the following code to import the requests library:
import requests
More information about the requests library is available in its official documentation.
Retrieving Information Manually
As a human being, it's not too hard to go to Wikipedia and find some information. Giving that ability to a computer program is a little less straightforward, but it is certainly possible!
As a human, it would make sense to go to Wikipedia and type something into the search bar. For example, typing in "Apple" leads to the Wikipedia article about apples.
Behind the scenes, that actually sends the web browser to this url: https://en.wikipedia.org/w/index.php?search=apple
In fact, anything can be appended to the end of that URL in order to find the corresponding Wikipedia page! This will be extremely helpful for the Info Finder.
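One caveat worth noting (a side sketch, not part of the activity's required code): search terms that contain spaces or punctuation should really be percent-encoded before being appended to the URL. Python's standard library can do this:

```python
from urllib.parse import quote

base_url = "https://en.wikipedia.org/w/index.php?search="

# Spaces and special characters are not valid in a URL,
# so quote() percent-encodes them
term = "golden retriever"
url = base_url + quote(term)
print(url)  # → https://en.wikipedia.org/w/index.php?search=golden%20retriever
```

The activity's simple string concatenation still works for single-word terms, which is why the steps below leave it out.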
Retrieving Information in Python
Using the requests library, it is possible to pull down all of the HTML code from a URL. Update the get_information function so that it prints out some raw HTML (for now).
- Find the get_information function, and make a new line at the top of the body
- Create a variable named url
- Set url to this string: "https://en.wikipedia.org/w/index.php?search="
- At the end of the line, add + search_term to add on the search term from the user
  - This way, the URL will always reflect what the user hopes to find
- Under that, create a variable named response
- Set the response variable to requests.get(url)
  - Here, the requests.get function is able to retrieve the info from the web
- Under that, create a variable named html_text
- Set the html_text variable to equal response.text
  - This gets the raw HTML from the response
- Under that, use print to print the html_text variable
- Run the program, enter a search term, and verify that a mess of HTML appears!
url = "https://en.wikipedia.org/w/index.php?search=" + search_term
response = requests.get(url)
html_text = response.text
print(html_text)
Beautiful Soup
Now the program has a bunch of HTML, but it's not very readable. It's time to parse it using BeautifulSoup, a Python library designed to help extract information from HTML text!
Right under the first import statement in the code, add the following line:
from bs4 import BeautifulSoup
Before jumping into the code, think about what is necessary to find in the HTML.
Looking at the HTML
Every Wikipedia page is a little different, so it's important to make sure the HTML parsing solution works for all (or at least the majority) of them. Take a look at some of the HTML for a page.
- Open a separate browser window
- Go to the Wikipedia page for Apple
- Right click somewhere on the first paragraph
- Select "Inspect" to open Developer Tools
- Try to find the first paragraph text in the HTML
  - It may be a little jumbled, but the text should be there
- Notice that the text is within a p element
- Notice that that p element is within a div element
- Notice that the div element has a class attribute of mw-parser-output
- Notice that the important p is not the first p within that div
- Also notice that the div is within another div with an id of bodyContent
All of this investigation should help when writing the code.
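The structure found above can be sketched as a tiny HTML string (a simplified stand-in for a real Wikipedia page, not its actual markup) and parsed with BeautifulSoup the same way the program eventually will:

```python
from bs4 import BeautifulSoup

# A simplified stand-in for Wikipedia's page layout
sample_html = """
<div id="bodyContent">
  <div class="mw-parser-output">
    <p>  </p>
    <p>Apple is a fruit of the apple tree.</p>
  </div>
</div>
"""

document = BeautifulSoup(sample_html, "html.parser")

# Drill down: outer div by id, inner div by class, then all paragraphs
content_div = document.find("div", {"id": "bodyContent"})
inner_div = content_div.find("div", {"class": "mw-parser-output"})
paragraphs = inner_div.find_all("p")

print(len(paragraphs))                   # → 2
print(paragraphs[1].get_text().strip())  # → Apple is a fruit of the apple tree.
```

Note that the first p is empty, just like on many real Wikipedia pages, which is why the program will need to skip past empty paragraphs later on.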
Parsing the HTML in Python
Now it's time to get into the code!
- Find the spot in the get_information function where html_text is printed
- Remove that line, as it is no longer necessary
- In its place, create a new variable named html_document
- Set the html_document variable to the result of a call to the BeautifulSoup function
  - Pass in html_text as the first argument
  - Pass in "html.parser" (don't forget the quotes) as the second argument
Now the HTML document should be parsed into a searchable object. The code should look something like this:
html_document = BeautifulSoup(html_text, "html.parser")
Getting the Container <div>
The next step is to pull the container <div id="bodyContent"> element out of the entire document.
- Under the html_document variable, create a new variable named content_criteria
  - Consider what the program needs to find first
- Set the content_criteria variable to a new dictionary: {}
- Add a key of "id" to the dictionary, with a value of "bodyContent"
  - This will be able to find the appropriate div in the HTML
- Under that, create a variable named content_div
- Set the content_div variable to html_document.find("div", content_criteria)
  - This searches the html_document for div elements with an id of bodyContent
Now, the container <div> is stored in a variable! The code should look like this:
content_criteria = {
    "id": "bodyContent"
}
content_div = html_document.find("div", content_criteria)
Getting the Inner <div>
Next, pull the inner <div class="mw-parser-output"> out of the content_div.
- Under the content_div variable, create a new variable named inner_div
- Set the inner_div variable to a call to content_div.find
- For the first argument, pass in "div"
- For the second argument, pass in a new dictionary with { and }
  - Set a "class" key in the new dictionary to be "mw-parser-output"
Now, inner_div should be the <div class="mw-parser-output"> that contains the information. The code looks something like this:
inner_div = content_div.find("div", {
    "class": "mw-parser-output"
})
Getting the <p> Elements
Finally, it's time to get down to the paragraphs!
- Under the inner_div variable, create a new variable named paragraphs
- Set the paragraphs variable to inner_div.find_all("p")
  - This will be all of the p elements within the div
The paragraphs variable should contain all of the content <p> elements in a list. The code looks like this:
paragraphs = inner_div.find_all("p")
The Right Paragraph
Now all the possible paragraphs have been found, but only the first non-empty one is needed. Loop through the paragraphs, and return the text for the first non-empty one!
- Under the paragraphs variable, create a for loop
- Loop through each paragraph in paragraphs
- In the indented body of the for loop, create a variable named p_text
- Set the p_text variable to paragraph.get_text()
  - This will hold the raw text for the paragraph
- Under that, create a variable named clean_text
- Set the clean_text variable to p_text.strip()
  - This will remove any leading and trailing whitespace from the text
- Under that, still within the for body, create an if statement
- For the condition of the if, simply put clean_text
  - An empty string is falsy in Python, so this condition passes only for non-empty text
- In the indented body of the if: return clean_text
- Run the code, enter a search term, and verify that the first paragraph appears!
for paragraph in paragraphs:
    p_text = paragraph.get_text()
    clean_text = p_text.strip()
    if clean_text:
        return clean_text
Nice Printing
There is one last thing to make the program a little nicer to use. Currently, a long paragraph prints as one giant line, and the console may wrap the text in the middle of a word. This is not ideal! Luckily, there is another Python library to handle it. At the top of the file, along with the other imports, add the following:
from textwrap import fill
The fill function adds newline characters at word boundaries so that long text wraps neatly, at 70 characters per line by default. More information can be found in the textwrap module documentation.
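As a quick illustration (the sample sentence here is made up for the demo), fill breaks lines between words rather than mid-word, and an optional width argument controls the line length:

```python
from textwrap import fill

# Repeat a sentence to simulate a long paragraph of text
text = "An apple is a round, edible fruit produced by an apple tree. " * 3
wrapped = fill(text, width=40)  # wrap lines at 40 characters
print(wrapped)
```

Every line of the output is at most 40 characters, and no word is split across lines.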
Find the part of the code where the information variable is printed, and replace it with this:
print(fill(information))
Run the program again, and verify that all information printed looks nicer!
Final Code
import requests
from bs4 import BeautifulSoup
from textwrap import fill
def get_information(search_term):
    url = "https://en.wikipedia.org/w/index.php?search=" + search_term
    response = requests.get(url)
    html_text = response.text
    html_document = BeautifulSoup(html_text, "html.parser")
    content_criteria = {
        "id": "bodyContent"
    }
    content_div = html_document.find("div", content_criteria)
    inner_div = content_div.find("div", {
        "class": "mw-parser-output"
    })
    paragraphs = inner_div.find_all("p")
    for paragraph in paragraphs:
        p_text = paragraph.get_text()
        clean_text = p_text.strip()
        if clean_text:
            return clean_text
    return "No information found"
while True:
    print("Welcome to text-based Wikipedia!\n")
    search = input("Enter a search term: ")
    information = get_information(search)
    print("\nHere's what I found:")
    print(fill(information))
    keep_going = input("\nWould you like to enter another term (y/n)? ")
    print("")
    if keep_going != "y":
        print("Goodbye!")
        break
Challenges
After the activity, start working on the Info Finder Challenges.