Scraping AAAI 2018 accepted paper titles with BeautifulSoup - Blog of a data scientist working at the World Bank

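AAAI, one of the top conferences in machine learning and deep learning, accepts close to 1,000 papers and presentations every year. Checking which words show up in the titles of the AAAI 2018 accepted papers should hint at the hot topics in AI research in 2018.

In this post I use Python's BeautifulSoup to extract the paper titles from the AAAI site and look at the words that appear in them.

The accepted paper titles are all listed on a single page, so the information should be easy to extract.

From here on, I'll go through the analysis in the following steps.

Step 1: Extract the paper titles with BeautifulSoup.

There is plenty of material online on how to use BeautifulSoup, so I'll skip that here. I pull the HTML from the URL and extract only the a tags.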

# Step1
from urllib.request import urlopen
from bs4 import BeautifulSoup
# URL to access
url = "https://aaai.org/Library/AAAI/aaai18contents.php"
html = urlopen(url)
# extract html with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# get a-tag
tag_a = soup.find_all('a')
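
A bare urlopen call has no timeout and sends the default Python User-Agent. As a minimal sketch of a slightly more defensive fetch, the helpers below set both explicitly. The function names and the User-Agent string are my own placeholders, not something the AAAI site requires.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="title-scraper-example/0.1"):
    """Return a Request carrying an explicit User-Agent header (placeholder value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page body as bytes; urlopen raises URLError/HTTPError on failure."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_html(url) can be passed to BeautifulSoup(..., "html.parser") in the same way as the html object above.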

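Let's check the number of elements in tag_a and take a look at one of them.

Within each a tag, the only part I want is the title, for example "Transferring Decomposed Tensors for Scalable Energy Breakdown Across Regions". The 1153 elements also include unrelated entries such as the link text of other URLs. The paper titles actually sit between roughly the 10th and the 1111th element, so I keep only that part with the slice [10:1111]. The extracted titles go into a list.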

paper_title = []
for link in tag_a:
    paper_title.append(link.string)
paper_title = paper_title[10:1111]

You can confirm that paper_title now holds only the paper titles.
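
Hard-coded slice positions stop working as soon as the page layout changes. As a rough alternative sketch, the snippet below filters anchors by their href instead. The sample HTML and the "/paper/view/" fragment are made up for illustration and are not taken from the AAAI page, so the real filter condition would have to be checked against the actual markup. It also shows why Step 2 skips None entries: .string returns None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a proceedings page; the real markup may differ.
sample_html = """
<a href="/index.php">Home</a>
<a href="/paper/view/1">Transferring Decomposed Tensors</a>
<a href="/paper/view/2"><i>Deep</i> Learning for Games</a>
<a href="/contact.php">Contact</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors whose href contains the assumed paper-link fragment.
paper_links = [a for a in soup.find_all("a", href=True)
               if "/paper/view/" in a["href"]]

# .string is None for the second link because it has two children (<i> and text).
strings = [a.string for a in paper_links]

# get_text() concatenates all nested text, so no title is lost.
titles = [a.get_text().strip() for a in paper_links]
```

If titles with nested markup matter, get_text() looks like the safer choice; with link.string those titles silently drop out at the None check.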

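Step 2: Split the titles into words.

Next, I break the paper titles in paper_title into words. I remove commas, colons, and similar symbols from each title, convert everything to lowercase, and then call split. Splitting each title and appending the result gives a list of lists, so I flatten it into a single list.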

# Step2
import re
stop_symbol = [':', ',', '\\?']
words_list_list = []
for i in range(len(paper_title)):
    if paper_title[i] is not None:
        title = paper_title[i]
        title = re.sub("|".join(stop_symbol), "", title)
        title = title.lower() # make it lower case
        words_list_list.append(title.split())
        
# list of list to list
words_list = [item for sublist in words_list_list for item in sublist]

The total comes to 9159 words, and the list now holds the individual lowercased words.
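
To make the result concrete, here is a small sketch that applies the same cleaning to the title quoted in Step 1. The tokenize helper and the second sample title are my own additions; the character class [:,?] matches the same three symbols as stop_symbol.

```python
import re

# Same symbols as stop_symbol in Step 2, written as one character class.
PUNCT = re.compile(r"[:,?]")


def tokenize(title):
    """Strip the listed symbols, lowercase, and split on whitespace."""
    return PUNCT.sub("", title).lower().split()


sample = "Transferring Decomposed Tensors for Scalable Energy Breakdown Across Regions"
tokens = tokenize(sample)
print(tokens)

# A made-up title containing all three symbols.
print(tokenize("Why So Deep? Networks, Revisited: A Study"))
```

Note that other punctuation, such as hyphens or parentheses, stays attached to the words with this approach.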

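Step 3: Normalize the word forms.

At this stage the word list mixes singular and plural forms of the same word, so they need to be unified. I use the singularize function from the pattern library to convert everything to singular.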

# Step 3
# plural to singular: https://stackoverflow.com/questions/31387905/converting-plural-to-singular-in-a-text-file-with-python
from pattern.text.en import singularize
words_list = [singularize(plural) for plural in words_list]

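In most English text analysis, participles such as -ing and -ed forms are reduced to the base form, but this time I leave them as they are.

Next, the list contains many unnecessary words (stop words) such as prepositions and conjunctions, so I remove them. nltk, one of the most widely used modules in natural language processing, ships a list of stop words, so I use that.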

# stop words
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords') # uncomment on first use if the corpus is not installed
stop_words = list(stopwords.words('english'))

# remove words in stop_words from words_list
words_list = [x for x in words_list if x not in stop_words]

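Removing the stop words brought the number of words down to 7386.

Step 4: Build a ranking of word frequencies.

Next, I count how often each word appears in the list using Counter from the collections module. I store each word and its count in a pandas DataFrame and sort by count in descending order to build the ranking.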

# Step 4
import pandas as pd
from collections import Counter
counts = Counter(words_list) # word -> number of occurrences
df = pd.DataFrame()
df["word"] = list(counts.keys()) # unique words
df["count"] = list(counts.values()) # frequency of each word, in the same order
df = df.sort_values('count', ascending=False)

The result is a table of words sorted by count, most frequent first.
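
For a quick look at the ranking without pandas, Counter can also sort by frequency on its own. A minimal sketch with a toy list standing in for words_list (the sample words are made up):

```python
from collections import Counter

# Hypothetical toy input standing in for words_list.
words = ["learning", "network", "learning", "deep", "network", "learning"]

counts = Counter(words)

# most_common(n) returns (word, count) pairs, highest count first.
ranking = counts.most_common(2)
print(ranking)  # [('learning', 3), ('network', 2)]
```

The same list of pairs can be handed to pd.DataFrame(counts.most_common(), columns=["word", "count"]) if a DataFrame is needed for plotting.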

Step 5: Visualize the ranking with Seaborn.

Finally, I use Seaborn to show the top 20 words as a bar chart.

# step 5
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(7, 7))

# Plot the top-n words by count
n = 20
sns.barplot(x = "count", y = "word", data = df.head(n), palette="GnBu_d")

[Figure: bar chart of the top 20 words by count]

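And that's it.

Discussion

First, the ranking shows that there are many more words that should be removed as stop words. Words such as using, datum (the singular form of data), and learning could safely be dropped.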
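
One way to act on this would be to extend the stop-word set with those extra words before counting. A minimal sketch, where the tiny base set stands in for the nltk list and the sample words are made up:

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
base_stop_words = {"for", "the", "of"}

# Extra words named in the discussion above; extend as needed.
extra_stop_words = {"using", "datum", "learning"}

# A set makes each membership test a constant-time lookup.
all_stop_words = base_stop_words | extra_stop_words

# Hypothetical toy data standing in for words_list after singularization.
words = ["learning", "using", "datum", "adversarial", "for", "embedding", "adversarial"]

filtered = [w for w in words if w not in all_stop_words]
print(filtered)  # ['adversarial', 'embedding', 'adversarial']
```

Which words count as uninformative is a judgment call; learning, for instance, carries little signal at an AI conference but might matter elsewhere.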

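As for the words themselves, Neural, Network, and Deep rank near the top, which is to be expected for an AI conference. The presence of "Embedding" and "Adversarial" also suggests that word embeddings (distributed representations) and Generative Adversarial Networks were popular.

This time I extracted paper titles from AAAI, but it might be interesting to try the same thing with other conferences.

Full code for this post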

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from pattern.text.en import singularize
import nltk
from nltk.corpus import stopwords
import pandas as pd
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
##########
# Step1
##########
url = "https://aaai.org/Library/AAAI/aaai18contents.php" # URL to access
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser") # extract html with BeautifulSoup
tag_a = soup.find_all('a') # get a-tag
paper_title = []
for link in tag_a:
    paper_title.append(link.string)
paper_title = paper_title[10:1111]
##########
# Step2
##########
stop_symbol = [':', ',', '\\?']
words_list_list = []
for i in range(len(paper_title)):
    if paper_title[i] is not None:
        title = paper_title[i]
        title = re.sub("|".join(stop_symbol), "", title)
        title = title.lower() # make it lower case
        words_list_list.append(title.split())
words_list = [item for sublist in words_list_list for item in sublist] # list of list to list
##########
# Step 3
##########
words_list = [singularize(plural) for plural in words_list] # plural to singular
stop_words = list(stopwords.words('english')) # stop words
words_list = [x for x in words_list if x not in stop_words] # remove words in stop_words from words_list
##########
# Step 4
##########
counts = Counter(words_list) # word -> number of occurrences
df = pd.DataFrame()
df["word"] = list(counts.keys()) # unique words
df["count"] = list(counts.values()) # frequency of each word, in the same order
df = df.sort_values('count', ascending=False)
##########
# step 5
##########
f, ax = plt.subplots(figsize=(7, 7)) # Initialize the matplotlib figure
n = 20
sns.barplot(x = "count", y = "word", data = df.head(n), palette="GnBu_d") # Plot the top-n words by count