GitHub Topics Scraper | Python Web Scraping

13 min read

May 31, 2023

Web scraping is a technique used to extract data from websites. It allows us to gather information from web pages and use it for various purposes, such as data analysis, research, or building applications.

In this article, we will explore a Python project called “GitHub Topics Scraper,” which leverages web scraping to extract information from the GitHub topics page and retrieve repository names and details for each topic.

GitHub is a widely popular platform for hosting and collaborating on code repositories. It offers a feature called “topics” that allows users to categorize repositories based on specific subjects or themes. The GitHub Topics Scraper project automates the process of scraping these topics and retrieving relevant repository information.

The GitHub Topics Scraper is implemented in Python and uses the following libraries:

requests: Used for making HTTP requests to retrieve the HTML content of web pages.
BeautifulSoup: A powerful library for parsing HTML and extracting data from it.
pandas: A versatile library for data manipulation and analysis, used for organizing the scraped data into a structured format.

Let’s dive into the code and understand how each component of the project works.

import requests
from bs4 import BeautifulSoup
import pandas as pd

The above code snippet imports three libraries: requests, BeautifulSoup, and pandas.

def topic_page_authentication(url):
    topics_url = url
    # Fetch the page and parse its HTML
    response = requests.get(topics_url)
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

Defines a function called topic_page_authentication that takes a URL as an argument.

Here’s a breakdown of what the code does:

1. topics_url = url: This line assigns the provided URL to the variable topics_url. This URL identifies the web page whose content we want to retrieve.

2. response = requests.get(topics_url): This line uses the requests.get() function to send an HTTP GET request to topics_url and stores the response in the response variable. This request fetches the HTML content of the web page.

3. page_content = response.text: This line extracts the HTML content from the response object and assigns it to the page_content variable. The response.text attribute holds the body of the response as text.

4. doc = BeautifulSoup(page_content, 'html.parser'): This line creates a BeautifulSoup object called doc by parsing page_content with the 'html.parser' parser. This allows us to navigate and extract information from the HTML structure of the web page.

5. return doc: This line returns the BeautifulSoup object doc from the function. This means that when the topic_page_authentication function is called, it returns the parsed HTML content as a BeautifulSoup object.

Despite its name, this function performs no authentication: its purpose is to retrieve the HTML content of the web page at the provided URL. It uses the requests library to send an HTTP GET request, retrieves the response content, and then parses it using BeautifulSoup to create a navigable object representing the HTML structure.

Note that this code snippet handles only the initial steps of fetching and parsing the web page; it does not perform any specific scraping or data extraction tasks.
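The function above uses the response without checking whether the request succeeded. A minimal sketch of such a check follows, assuming a response object that exposes status_code and text the way the objects returned by requests.get() do. The helper name page_text_or_raise is hypothetical and not part of the original project.

```python
def page_text_or_raise(response):
    """Return the body text of a successful response, or raise an error.

    `response` is assumed to expose `status_code` and `text`, as the
    objects returned by requests.get() do. The helper name is hypothetical.
    """
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load page: HTTP {response.status_code}")
    return response.text
```

Inside topic_page_authentication, the response could be passed through this helper before parsing, so that an error page is never handed to BeautifulSoup as if it were the topics page.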

def topicSraper(doc):
    # Extract title
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})
    # Extract description
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': description_class})
    # Extract link
    link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_class})
    # Extract all the topic names
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    # Extract the description text of each topic
    topic_description = []
    for tag in topic_desc_tags:
        topic_description.append(tag.text.strip())
    # Extract the URLs of the topics
    topic_urls = []
    base_url = "https://github.com"
    for tags in topic_link_tags:
        topic_urls.append(base_url + tags['href'])
    topics_dict = {
        'Title': topic_titles,
        'Description': topic_description,
        'URL': topic_urls
    }
    topics_df = pd.DataFrame(topics_dict)
    return topics_df

Defines a function called topicSraper that takes a BeautifulSoup object (doc) as an argument.

Here is a breakdown of what the code does:

1. title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary': This line defines the CSS class name (title_class) for the HTML element that contains the topic titles on the web page.

2. topic_title_tags = doc.find_all('p', {'class':title_class}): This line uses the find_all() method of the BeautifulSoup object to find all HTML elements (<p>) with the specified CSS class (title_class). It retrieves a list of BeautifulSoup Tag objects representing the topic title tags.

3. description_class = 'f5 color-fg-muted mb-0 mt-1': This line defines the CSS class name (description_class) for the HTML element that contains the topic descriptions on the web page.

4. topic_desc_tags = doc.find_all('p', {'class':description_class}): This line uses the find_all() method to find all HTML elements (<p>) with the specified CSS class (description_class). It retrieves a list of BeautifulSoup Tag objects representing the topic description tags.

5. link_class = 'no-underline flex-1 d-flex flex-column': This line defines the CSS class name (link_class) for the HTML element that contains the topic links on the web page.

6. topic_link_tags = doc.find_all('a',{'class':link_class}): This line uses the find_all() method to find all HTML elements (<a>) with the specified CSS class (link_class). It retrieves a list of BeautifulSoup Tag objects representing the topic link tags.

7. topic_titles = []: This line initializes an empty list to store the extracted topic titles.

8. for tag in topic_title_tags: ...: This loop iterates over the topic_title_tags list and appends the text content of each tag to the topic_titles list.

9. topic_description = []: This line initializes an empty list to store the extracted topic descriptions.

10. for tag in topic_desc_tags: ...: This loop iterates over the topic_desc_tags list and appends the stripped text content of each tag to the topic_description list.

11. topic_urls = []: This line initializes an empty list to store the extracted topic URLs.

12. base_url = "https://github.com": This line defines the base URL of the website.

13. for tags in topic_link_tags: ...: This loop iterates over the topic_link_tags list and appends the concatenated URL (base URL + href attribute) of each tag to the topic_urls list.

14. topics_dict = {...}: This block creates a dictionary (topics_dict) that contains the extracted data: topic titles, descriptions, and URLs.

15. topics_df = pd.DataFrame(topics_dict): This line converts the topics_dict dictionary into a pandas DataFrame, where each key becomes a column in the DataFrame.

16. return topics_df: This line returns the pandas DataFrame containing the extracted data.

The purpose of this function is to scrape and extract information from the provided BeautifulSoup object (doc). It retrieves the topic titles, descriptions, and URLs from specific HTML elements on the web page and stores them in a pandas data frame for further analysis or processing.
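One thing to watch for in this step: pd.DataFrame raises a ValueError when the lists in the dictionary have different lengths, which could happen if one of the three find_all() calls matches fewer elements than the others. The sketch below shows a check that could run just before the DataFrame is built. The helper name check_equal_lengths is hypothetical, and the sample values in the comment are made up for illustration.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    `columns` maps column names to lists, like topics_dict above.
    Returns the common length (0 for an empty dict).
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every column's length so the mismatch is easy to locate.
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Illustrative call with made-up values:
# check_equal_lengths({'Title': ['3D'], 'Description': ['...'], 'URL': ['...']})
```

Failing early with the per-column lengths makes it clearer which selector stopped matching than the generic error raised by pandas.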

def topic_url_extractor(dataframe):
    url_lst = []
    for i in range(len(dataframe)):
        topic_url = dataframe['URL'][i]
        url_lst.append(topic_url)
    return url_lst

Defines a function called topic_url_extractor that takes a pandas DataFrame (dataframe) as an argument.

Here is a breakdown of what the code does:

1. url_lst = []: This line initializes an empty list (url_lst) to store the extracted URLs.

2. for i in range(len(dataframe)): ...: This loop iterates over the indices of the DataFrame rows.

3. topic_url = dataframe['URL'][i]: This line retrieves the value of the ‘URL’ column for the current row index (i) in the data frame.

4. url_lst.append(topic_url): This line appends the retrieved URL to the url_lst list.

5. return url_lst: This line returns the url_lst list containing the extracted URLs.

The purpose of this function is to extract the URLs from the ‘URL’ column of the provided DataFrame.

It iterates over each row of the DataFrame, retrieves the URL value for each row, and adds it to a list. Finally, the function returns the list of extracted URLs.

This function can be useful when you want to extract the URLs from a DataFrame for further processing or analysis, such as visiting each URL or performing additional web scraping on the individual web pages.

def parse_star_count(stars_str):
    # Drop surrounding whitespace and the first six characters
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)

Defines a function called parse_star_count that takes a string (stars_str) as an argument.

Here is a breakdown of what the code does:

1. stars_str = stars_str.strip()[6:]: This line removes leading and trailing whitespace from the stars_str string using the strip() method. It then slices the string starting at index 6, which drops the first six characters, and assigns the result back to stars_str. The purpose of this line is to remove unwanted leading characters and spaces from the string.

2. if stars_str[-1] == 'k': ...: This line checks whether the last character of stars_str is ‘k’, indicating that the star count is in thousands.

3. stars_str = float(stars_str[:-1]) * 1000: This line converts the numeric part of the string (excluding the ‘k’) to a float and then multiplies it by 1000 to convert it to the actual star count.

4. return int(stars_str): This line converts stars_str to an integer and returns it.

The purpose of this function is to parse and convert the star count from a string representation to an integer value. It handles cases where the star count is in thousands (‘k’) by multiplying the numeric part of the string by 1000. The function returns the parsed star count as an integer.

This function can be useful when you have star counts represented as strings, such as ‘1.2k’ for 1,200 stars, and you need to convert them to numerical values for further analysis or processing.
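As a variation on parse_star_count, the sketch below handles a few more input shapes. It assumes the label prefix has already been removed, so it receives only the number itself, and it also accepts an uppercase suffix and thousands separators. The name parse_count is hypothetical and not part of the original project.

```python
def parse_count(text):
    """Convert strings such as '1.2k', '987' or '1,234' to an int.

    Unlike parse_star_count above, this sketch expects only the number
    itself, with any label prefix already removed (an assumption).
    """
    text = text.strip().lower().replace(',', '')
    if text.endswith('k'):
        # round() guards against float results that land just off a whole number
        return int(round(float(text[:-1]) * 1000))
    return int(text)
```

Using endswith() instead of indexing the last character also means an empty string fails with a clear ValueError from int() rather than an IndexError.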

def get_repo_info(h3_tags, star_tag):
    base_url = 'https://github.com'
    # The first link holds the username, the second the repository
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

Defines a function called get_repo_info that takes two arguments: h3_tags and star_tag.

Here is a breakdown of what the code does:

1. base_url = 'https://github.com': This line defines the base URL of the GitHub website.

2. a_tags = h3_tags.find_all('a'): This line uses the find_all() method of the h3_tags object to find all HTML elements (<a>) within it. It retrieves a list of BeautifulSoup Tag objects representing the anchor tags.

3. username = a_tags[0].text.strip(): This line extracts the text content of the first anchor tag (a_tags[0]) and assigns it to the username variable. It also removes any leading or trailing whitespace using the strip() method.

4. repo_name = a_tags[1].text.strip(): This line extracts the text content of the second anchor tag (a_tags[1]) and assigns it to the repo_name variable. It also removes any leading or trailing whitespace using the strip() method.

5. repo_url = base_url + a_tags[1]['href']: This line retrieves the value of the ‘href’ attribute from the second anchor tag (a_tags[1]) and concatenates it with base_url to form the complete URL of the repository. The resulting URL is assigned to the repo_url variable.

6. stars = parse_star_count(star_tag.text.strip()): This line extracts the text content of the star_tag object, removes any leading or trailing whitespace, and passes it as an argument to the parse_star_count function. The function returns the parsed star count as an integer, which is assigned to the stars variable.

7. return username, repo_name, stars, repo_url: This line returns a tuple containing the extracted information: username, repo_name, stars, and repo_url.

The purpose of this function is to extract information about a GitHub repository from the provided h3_tags and star_tag objects. It retrieves the username, repository name, star count, and repository URL by navigating and extracting specific elements from the HTML structure. The function then returns this information as a tuple.

This function can be useful when you want to extract repository information from a web page that contains a list of repositories, such as when scraping GitHub topics.
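Both get_repo_info and the topic scraper build full URLs by concatenating the base URL with an href. As an alternative, the standard library's urljoin can resolve the href against the base, which also leaves an already absolute href untouched. This is a sketch of a swapped-in technique, not the original project's code. The names BASE_URL and absolute_url are hypothetical, and the paths in the comment are illustrative values.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def absolute_url(href, base_url=BASE_URL):
    """Resolve a link's href against the site's base URL."""
    return urljoin(base_url, href)


# Illustrative values:
# absolute_url('/topics/3d')           -> 'https://github.com/topics/3d'
# absolute_url('https://example.com/') -> 'https://example.com/'
```

With plain concatenation, an absolute href would produce a malformed URL, so urljoin is the more forgiving choice if the page markup ever changes.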

def topic_information_scraper(topic_url):
    # fetch and parse the topic page
    topic_doc = topic_page_authentication(topic_url)
    # extract name
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})
    # get star tag
    star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_tags = topic_doc.find_all('a', {'class': star_class})
    # get information about the topic
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

Defines a function called topic_information_scraper that takes a topic_url as an argument.

Here is a breakdown of what the code does:

1. topic_doc = topic_page_authentication(topic_url): This line calls the topic_page_authentication function to fetch and parse the HTML content of the topic_url. The parsed HTML content is assigned to the topic_doc variable.

2. h3_class = 'f3 color-fg-muted text-normal lh-condensed': This line defines the CSS class name (h3_class) for the HTML element that contains the repository names within the topic page.

3. repo_tags = topic_doc.find_all('h3', {'class':h3_class}): This line uses the find_all() method of the topic_doc object to find all HTML elements (<h3>) with the specified CSS class (h3_class). It retrieves a list of BeautifulSoup Tag objects representing the repository name tags.

4. star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default': This line defines the CSS class name (star_class) for the HTML element that contains the star counts within the topic page.

5. star_tags = topic_doc.find_all('a',{'class':star_class}): This line uses the find_all() method to find all HTML elements (<a>) with the specified CSS class (star_class). It retrieves a list of BeautifulSoup Tag objects representing the star count tags.

6. topic_repos_dict = {...}: This block creates a dictionary (topic_repos_dict) that will store the extracted repository information: username, repository name, star count, and repository URL.

7. for i in range(len(repo_tags)): ...: This loop iterates over the indices of the repo_tags list, assuming that it has the same length as the star_tags list.

8. repo_info = get_repo_info(repo_tags[i], star_tags[i]): This line calls the get_repo_info function to extract information about a specific repository. It passes the current repository name tag (repo_tags[i]) and star count tag (star_tags[i]) as arguments. The returned information is assigned to the repo_info variable.

9. topic_repos_dict['username'].append(repo_info[0]): This line appends the extracted username from repo_info to the ‘username’ list in topic_repos_dict.

10. topic_repos_dict['repo_name'].append(repo_info[1]): This line appends the extracted repository name from repo_info to the ‘repo_name’ list in topic_repos_dict.

11. topic_repos_dict['stars'].append(repo_info[2]): This line appends the extracted star count from repo_info to the ‘stars’ list in topic_repos_dict.

12. topic_repos_dict['repo_url'].append(repo_info[3]): This line appends the extracted repository URL from repo_info to the ‘repo_url’ list in topic_repos_dict.

13. return pd.DataFrame(topic_repos_dict): This line converts the topic_repos_dict dictionary into a pandas DataFrame, where each key becomes a column in the DataFrame. The resulting data frame contains the extracted repository information.

The purpose of this function is to scrape and extract information about the repositories within a specific topic on GitHub. It fetches and parses the HTML content of the topic page, then locates the repository name and star count tags using specific CSS class names.

It calls the get_repo_info function for each repository to retrieve the username, repository name, star count, and repository URL.

The extracted information is stored in a dictionary and then converted into a pandas DataFrame, which is returned by the function.
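The loop in this function indexes star_tags with positions taken from repo_tags, so if the two lists ever differ in length it would either raise an IndexError or quietly ignore the extra star tags. The sketch below makes that assumption explicit before pairing the tags. The helper name pair_tags is hypothetical and not part of the original project.

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, refusing mismatched lists."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            f"Found {len(repo_tags)} repository tags but {len(star_tags)} star tags"
        )
    # zip() walks both lists in step, so no index arithmetic is needed
    return list(zip(repo_tags, star_tags))
```

The loop body could then iterate over the returned pairs and pass each pair to get_repo_info, instead of indexing both lists with i.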

if __name__ == "__main__":
    url = 'https://github.com/topics'
    topic_dataframe = topicSraper(topic_page_authentication(url))
    topic_dataframe.to_csv('GitHubtopics.csv', index=None)
    # Make other CSV files according to the topics
    url = topic_url_extractor(topic_dataframe)
    name = topic_dataframe['Title']
    for i in range(len(topic_dataframe)):
        new_df = topic_information_scraper(url[i])
        new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)

The code snippet demonstrates the main execution flow of the script.

Here’s a breakdown of what the code does:

1. if __name__ == "__main__":: This conditional statement checks if the script is being run directly (not imported as a module).

2. url = 'https://github.com/topics': This line defines the URL of the GitHub topics page.

3. topic_dataframe = topicSraper(topic_page_authentication(url)): This line retrieves the topic page’s HTML content using topic_page_authentication, and then passes the parsed HTML (doc) to the topicSraper function. It assigns the resulting data frame (topic_dataframe) to a variable.

4. topic_dataframe.to_csv('GitHubtopics.csv', index=None): This line exports the topic_dataframe DataFrame to a CSV file named ‘GitHubtopics.csv’. The index=None argument ensures that the row indices are not included in the CSV file.

5. url = topic_url_extractor(topic_dataframe): This line calls the topic_url_extractor function, passing the topic_dataframe as an argument. It retrieves a list of URLs (url) extracted from the data frame.

6. name = topic_dataframe['Title']: This line retrieves the ‘Title’ column from the topic_dataframe and assigns it to the name variable.

7. for i in range(len(topic_dataframe)): ...: This loop iterates over the indices of the topic_dataframe DataFrame.

8. new_df = topic_information_scraper(url[i]): This line calls the topic_information_scraper function, passing the URL (url[i]) as an argument. It retrieves repository information for the specific topic URL and assigns it to the new_df DataFrame.

9. new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None): This line exports the new_df DataFrame to a CSV file. The file name is dynamically generated using an f-string, incorporating the topic name (name[i]). The index=None argument ensures that the row indices are not included in the CSV file.

The purpose of this script is to scrape and extract information from the GitHub topics page and create CSV files containing the extracted data. It first scrapes the main topics page, saves the extracted information in ‘GitHubtopics.csv’, and then proceeds to scrape individual topic pages using the extracted URLs.

For each topic, it creates a new CSV file named after the topic and saves the repository information in it.

This script can be executed directly to perform the scraping and generate the desired CSV files.
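One practical detail: to_csv does not create missing directories, so the per-topic export fails unless the output directory already exists. Topic names may also contain characters that are awkward in file names. The sketch below creates the directory on demand and builds a safe file path. The helper name topic_csv_path and the choice of allowed characters are assumptions, not part of the original project.

```python
import re
from pathlib import Path


def topic_csv_path(topic_name, out_dir):
    """Create `out_dir` if needed and return a CSV path for `topic_name`."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Replace runs of characters outside a conservative allow-list with '_'
    safe_name = re.sub(r'[^A-Za-z0-9._+#-]+', '_', topic_name.strip())
    return directory / f'{safe_name}.csv'
```

The returned path can be passed directly to to_csv in place of the f-string used in the loop.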

url = 'https://github.com/topics'
topic_dataframe = topicSraper(topic_page_authentication(url))
topic_dataframe.to_csv('GitHubtopics.csv', index=None)

Once this code runs, it generates a CSV file named ‘GitHubtopics.csv’, which contains all the topic names, their descriptions, and their URLs.

url = topic_url_extractor(topic_dataframe)
name = topic_dataframe['Title']
for i in range(len(topic_dataframe)):
    new_df = topic_information_scraper(url[i])
    new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)

This code then creates one CSV file per topic, based on the topics saved in the earlier ‘GitHubtopics.csv’ file. These CSV files are saved in a directory called ‘GitHubTopic_CSV-Files’, each named after its topic.

Each topic CSV file stores information about the repositories in that topic: the username, repository name, star count, and repository URL.