Web Scraper.
Description.
This program collects scientific articles' abstracts from journal sites, specifically SINTA and GARUDA, through a technique called web scraping. Web scraping is the act of collecting information from websites by extracting the text embedded in the pages' HTML. It does this by creating an 'agent' that reads and saves the text inside the HTML tags of a page, then travels to different pages by accessing the URLs available on that page.
In the case of this program, a single agent traverses the table at http://sinta.ristekbrin.go.id/journals and checks each row for an <img> tag with the class stat-garuda-small. If the tag exists, the agent goes deeper by accessing the URL listed in the href attribute anchored to that tag in the row. The agent then traverses the table at that URL, scraping text data from the <xmp> tag with the class abstract-article. The script appends "?page=2" to the URL and increments the page number to continue traversing the following pages. Only after the pages have run out does the agent exit the nested traversal and resume the main traversal.
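Below is a rough sketch of how this two-level traversal could be implemented with requests and BeautifulSoup; the actual repository may use different libraries, and the function names and selector logic here are illustrative assumptions.

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://sinta.ristekbrin.go.id/journals"

def scrape_journal_articles(journal_url):
    # Illustrative sketch: walk one journal's paginated article table and
    # collect the text inside every <xmp class="abstract-article"> tag.
    abstracts = []
    page = 1
    while True:
        url = journal_url if page == 1 else journal_url + "?page=" + str(page)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        tags = soup.find_all("xmp", class_="abstract-article")
        if not tags:                     # pages have run out
            break
        abstracts.extend(tag.get_text(strip=True) for tag in tags)
        page += 1
    return abstracts

def scrape_all():
    # Main traversal: check each row of the SINTA journal table for the
    # GARUDA indicator image before descending into that journal's pages.
    soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")
    for img in soup.find_all("img", class_="stat-garuda-small"):
        anchor = img.find_parent("a")    # the <a> the image is anchored to
        if anchor and anchor.get("href"):
            yield from scrape_journal_articles(anchor["href"])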
Since the goal is to collect Indonesian scientific journals and articles, the langdetect library is used to make sure that the scraped text is Indonesian. This is done by extracting the first two sentences of the paragraph and checking the language of each. If either of the two sentences is not detected as Indonesian, the paragraph is not scraped.
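A minimal sketch of that language filter using langdetect is below; the sentence splitting on periods and the helper name is_indonesian are assumptions, not necessarily the repository's exact code.

from langdetect import detect

def is_indonesian(abstract):
    # Split the paragraph into sentences, keep the first two, and require
    # both to be detected as Indonesian ("id") before the row is saved.
    sentences = [s.strip() for s in abstract.split(".") if s.strip()][:2]
    return bool(sentences) and all(detect(s) == "id" for s in sentences)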
Here is the link to access the GitHub repository.
Background.
This is a sub-project for my Bachelor's thesis. My main thesis project was to build a Journal Recommender Application using a Softmax Regression model as the classifier. To create a machine learning model, I needed to train it on some sort of dataset. I tried searching online for an existing dataset relevant to my model, but none existed at the time. So, with the help of thenewboston, I decided to create my own dataset.
Features.
- Traverses the tables in SINTA - Science and Technology Index and GARUDA - Garda Rujukan Digital sequentially.
- Scrapes journal data from SINTA - Science and Technology Index.
- Scrapes article abstract data from GARUDA - Garda Rujukan Digital.
- Only scrapes Indonesian abstracts by detecting the language of the abstract.
Data Gathered.
The newly scraped data is saved to ./output/output.csv with the headers JOURNAL_TITLE, ARTICLE_TITLE, and ARTICLE_ABSTRACT. The data was last scraped on April 1st, 2020. In total, 157,687 rows were scraped, covering 2,527 journals, and the data is aggregated in the ./data/master/ directory.
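As a sketch of the output format (not the repository's exact code), the rows could be written with Python's standard csv module using those three headers:

import csv

HEADERS = ["JOURNAL_TITLE", "ARTICLE_TITLE", "ARTICLE_ABSTRACT"]

with open("./output/output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=HEADERS)
    writer.writeheader()
    # Example row only; real values come from the scraper.
    writer.writerow({
        "JOURNAL_TITLE": "Jurnal Contoh",
        "ARTICLE_TITLE": "Judul Artikel",
        "ARTICLE_ABSTRACT": "Abstrak artikel dalam bahasa Indonesia ...",
    })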
Tools.
How to Run in Local Environment.
$ python3 scrape_web