Creating text data for RAG from Wikipedia dump data

Motivation

I am experimenting with RAG using LangChain and was wondering what data to use for testing, so I decided to use Wikipedia dump data. Since the full dump is large, I limited it to the astronomy-related categories that I am interested in.

This article summarizes the steps for extracting only specific categories of data from the Wikipedia dump.

Sources

  1. Index of /jawiki/ The top page of the Japanese Wikipedia dumps. I used the data under the “20240720” directory.
  2. Get only articles of specific categories in Wikipedia A page describing exactly what I was planning to do. It was a great help, thanks.
  3. Retrieving only articles under a specific category of Wikipedia (retrieving subcategories) At first I also tried storing the Wikipedia category and page information in MySQL (MariaDB) and searching it: I started a mariadb container with docker-compose.yml, created a database named jawiki, and created the category, categorylinks, and page tables, but I gave up while writing the SQL for the search.
  4. attardi/wikiextractor A tool for formatting the Wikipedia dump data.
  5. ImportError in WikiExtractor3.0.4 with Python3.7 A page describing how to deal with the error raised by the dump-formatting tool.
  6. Wikipedia:PetScan A tool that searches Wikipedia categories (and their subcategories) and returns information (title and page ID) on the articles matching your criteria.

Procedure

Download jawiki dump data

Download “jawiki-20240720-pages-articles.xml.bz2” from the jawiki/20240720/ directory shown in Source 1 into a working directory.

$ ls -l
-rw-rw-r--  1 kenji kenji 4189767963  7月 29 22:03 jawiki-20240720-pages-articles.xml.bz2
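
If you prefer to script the download rather than use a browser, a minimal sketch with Python's standard library could look like this (the URL is an assumption based on the usual dumps.wikimedia.org layout for the 20240720 dump):

import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/jawiki/20240720/"
            "jawiki-20240720-pages-articles.xml.bz2")

# The archive is roughly 4 GB, so stream it to disk in chunks
# instead of reading the whole response into memory.
with urllib.request.urlopen(DUMP_URL) as resp, \
        open("jawiki-20240720-pages-articles.xml.bz2", "wb") as out:
    while True:
        chunk = resp.read(1024 * 1024)
        if not chunk:
            break
        out.write(chunk)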

Formatting the dump data

Format the downloaded bz2 file using “WikiExtractor.py” obtained from Source 4.

$ python3 WikiExtractor.py -b 500K -o jawiki jawiki-20240720-pages-articles.xml.bz2

I ran into the following error.

Traceback (most recent call last):
  File "/ext/nfs/workspace/wikipedia/WikiExtractor.py", line 66, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

Searching the web led me to Source 5, and I installed the necessary packages as follows.

$ sudo apt install python3-pip
$ pip3 install wikiextractor

Running the following command again creates the “AA” to “DB” directories under the jawiki directory, each containing 100 text files named wiki_00 to wiki_99; only the last directory, “DB”, goes up to wiki_28.

$ python3 -m wikiextractor.WikiExtractor -b 500K -o jawiki jawiki-20240720-pages-articles.xml.bz2

In my environment, it took about 28 minutes.
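
A quick way to sanity-check the extractor output is to list the generated directories and count the wiki_* files (a small sketch, assuming the same “jawiki” output directory as above):

import os
import glob

# List the output directories ("AA" ... "DB" in my run) and count the wiki_* files.
dirs = sorted(os.listdir("jawiki"))
print("directories: {} ... {} ({} in total)".format(dirs[0], dirs[-1], len(dirs)))
print("wiki_* files:", len(glob.glob(os.path.join("jawiki", "*", "wiki_*"))))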

Get the page IDs of the categories you are interested in 〜 PetScan

Open Source 6 and launch the tool via “Open PetScan” in the upper left corner. Only the “Categories” and “Output” tabs are used here.

Concepts

For the categories, I chose “astrophysics” and “astronomy”, and decided the category depth (how many levels of subcategories to follow) for each by looking at the titles returned: astrophysics was set to a category depth of 4 and astronomy to a depth of 2.

Search criteria

Based on the above, I used the following settings.

In the “Categories” tab: language: ja, categories: astrophysics|4 and astronomy|2, combination: union.

When specifying multiple categories, set the depth per category with the “category|depth” syntax rather than with the common depth field. In the “Output” tab, set the output format to CSV.

Click “Run” at the bottom left of the “Categories” tab to download “Download.csv”.

I renamed it to “astronomy.csv”.

Program to create the text file

At this point, the page IDs to be extracted are stored in the “pageid” column of astronomy.csv, and the formatted text data is stored under the “jawiki” directory.

The program shown below is made under the following two assumptions.

  • The doc id of each article in the wiki_?? files stored in the “AA” to “DB” directories under the “jawiki” directory is the same as the Wikipedia page ID (“pageid”). Within each file, the doc ids are in ascending order.
  • The “pageid” column in “astronomy.csv” is also in ascending order. (A quick check of both assumptions is sketched below.)
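
Both assumptions can be checked roughly with a few lines of Python. The sketch below assumes the PetScan CSV has “pageid” in the third column and “namespace” in the fourth (the layout used by the program that follows), and checks one sample dump file:

import csv
import re

# 1) The "pageid" column of astronomy.csv is in ascending order
#    (rows with a non-empty namespace are skipped, as in the program below).
with open("astronomy.csv", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    pageids = [int(row[2]) for row in reader if row[3] == ""]
print("pageid ascending:", pageids == sorted(pageids))

# 2) The doc ids inside one dump file are in ascending order (checked on one sample file).
with open("jawiki/AA/wiki_00", encoding="utf-8") as f:
    doc_ids = [int(m) for m in re.findall(r'<doc id="(\d+)"', f.read())]
print("doc id ascending in AA/wiki_00:", doc_ids == sorted(doc_ids))
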
# Extract pages belonging to the astronomy-related categories (astrophysics and astronomy)
# from the jawiki dump data and convert them to text.
#
# This program assumes the following have been prepared in advance:
# - The formatted jawiki dump data is stored as files "wiki_00" to "wiki_99" in the
#   "AA" to "DB" directories under the directory given by "WIKI_PATH".
# - The page IDs of the articles belonging to the categories to be extracted have been
#   obtained with PetScan and are stored in the "pageid" column of "INTEREST_PAGES".
Create file list and page ID list
import os
import glob
import csv

WIKI_PATH = "jawiki"
INTEREST_PAGES = "astronomy.csv"
EXTRACT = "textdb"

# Create a list of files under the target directory
# Sort the file list so that the page numbers are in ascending order.
file_list = []
dir_list = sorted(os.listdir(WIKI_PATH))
for d in dir_list:
    file_list += sorted(glob.glob(os.path.join(WIKI_PATH, d) + "/*"))

# Create a list of pages to be extracted
pageid_list = []
with open(INTEREST_PAGES, encoding='utf-8') as f:
    csvreader = csv.reader(f)
    header = next(csvreader)  # skip header
    count = 0
    for line in csvreader:
        # Use only rows whose namespace is empty (ordinary articles); skip the rest.
        # The other rows are subcategory pages etc., and their "pageid" values
        # are not guaranteed to be in ascending order.
        if line[3] != "":
            continue
        pageid_list.append(int(line[2]))
        count += 1

print("="*80)
print("抽出する記事数:{}".format(count))
print("="*80)

The results of the execution are as follows.

================================================================================
Number of articles to extract: 13842
================================================================================
Function to extract articles
import re

# Prepare a regular expression to extract id/url/title/body from the following structure.
"""
<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">
アンパサンド

アンパサンド(&amp;, )は、並立助詞「…と…」を意味する記号である。・・・・

</doc>
"""

doc_re = re.compile(r'<doc id=.+?</doc>')
head_re = re.compile(r'<doc id=.+?">')
id_re = re.compile(r'id="\d+"')
url_re = re.compile(r'url=".+?"')
title_re = re.compile(r'title=".+"')

# Read the file given by the path, remove line breaks,
# and return a list of articles (the <doc> ... </doc> blocks).
def get_article_list(file):
    f = open(file, 'r', encoding='utf-8')
    block_data = f.read()
    block_data = block_data.replace('\n', '')
    return doc_re.findall(block_data)

# From one article (ARTICLE) given in "doc",
# extract id/url/title/text (body) and return id/url/text.
# The article title is repeated at the beginning of the body, so skip it ([len(title):]).
def get_article(doc):
    head = head_re.search(doc)
    id_tag = id_re.search(head.group())
    doc_id = re.search(r'\d+', id_tag.group())
    doc_id = int(doc_id.group())
    url_tag = url_re.search(head.group())
    url = url_tag.group()[len("url="):].replace('"', '')
    title_tag = title_re.search(head.group())
    doc_title = re.search(r'".+"',title_tag.group())
    title = doc_title.group().replace('"', '')
#    text = doc.replace(head.group(), '').rstrip('</doc>')[len(title):]
    text = doc.replace(head.group(), '').rstrip('</doc>')
    # In some articles the title is not repeated at the beginning of the body.
    # In that case, do not skip the leading characters.
    if text[len(title):(len(title)+len(title))] == title:
        text = text[len(title):]
    return doc_id, url, text
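
As a quick check, get_article() can be applied to the single-line form of the sample structure shown above (a small usage sketch; the newline removal that get_article_list() would perform is done by hand here):

sample = ('<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">'
          'アンパサンドアンパサンド(&amp;, )は、並立助詞「…と…」を意味する記号である。</doc>')
doc_id, url, text = get_article(sample)
print(doc_id, url)  # 5 https://ja.wikipedia.org/wiki?curid=5
print(text)         # アンパサンド(&amp;, )は、並立助詞「…と…」を意味する記号である。 (leading title removed)
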
Main loop
import time

# Output the articles whose "docid" in the Wikipedia dump data matches one of the
# "pageid"s belonging to the categories prepared above.
# print() is used for output instead of write(); for plain text output, print() is easier to handle.

start_time = time.time()

ff = open(EXTRACT, 'w', encoding='utf-8')

no = 0
pageid = pageid_list[no]

id_list = []
no_page_list = []

for file in file_list:
    for page in get_article_list(file):
        docid, url, text = get_article(page)
        id_list.append(docid)
        if docid < pageid:
            continue
        if docid == pageid:
            print(text, file=ff)  # write text into file
            no += 1
            pageid = pageid_list[no]
        if docid > pageid:
            # Always keep docid <= pageid.
            # In this case, pageid was not found.
            no_page_list.append(pageid)
            no +=1
            pageid = pageid_list[no]

ff.close()

end_time = time.time()
processing_time = end_time - start_time

print("processing_time(sec): ", processing_time)
print("="*80)
print("全記事数:{}\t抽出した記事数:{}\t未検出記事数:{}".format(len(id_list), no, len(no_page_list)))
print("="*80)

The results of the execution are as follows.

processing_time(sec):  70.98843002319336
================================================================================
Total articles: 2304095	Extracted articles: 13837	Not-found articles: 1
================================================================================
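
The page ID that was not found remains in no_page_list. It can be looked up in astronomy.csv again to get its title (a sketch; it assumes the PetScan CSV keeps the title in the second column, next to the “pageid” column already used above):

import csv

# Map pageid -> title from the PetScan CSV (assumed layout: title in column 2,
# pageid in column 3, namespace in column 4).
with open(INTEREST_PAGES, encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    title_by_id = {int(row[2]): row[1] for row in reader if row[3] == ""}

for missing in no_page_list:
    print(missing, title_by_id.get(missing, "(title not found)"))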

Summary

The text file “textdb” created here will later be used to load the data into a vector DB for RAG.
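
As a rough sketch of that next step, “textdb” could be loaded and split with LangChain before being embedded into the vector DB (the package and class names below follow recent langchain-community / langchain-text-splitters releases and may differ in your environment; the chunk sizes are placeholders):

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the extracted text and split it into chunks for embedding.
docs = TextLoader("textdb", encoding="utf-8").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print("number of chunks:", len(chunks))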