Adding a Table of Contents: Automatic Conversion from Live Script to Markdown

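Introduction

This is a memo on how I added a Table of Contents (ToC) when converting a live script to Markdown. I am writing it up in the hope that it serves as one more example of string processing with regular expressions.

Usage

To add a Table of Contents to the generated markdown, set the ToC option to true.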

Code(Display)

livescript2markdown('README_JP_sup.mlx',TOC = true);

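For how to install the livescript2markdown function, see 【MATLAB】ライブスクリプトの Markdown 変換で楽して Qiita 投稿. The part that handles the Table of Contents can be found in latex2markdown.m in GitHub: livescript2markdown.

What is a Table of Contents?

In a live script, a table of contents can be added with the "Table of Contents" button (see image). With the feature added here, the ToC option decides whether one is added on the markdown side, regardless of whether the live script itself displays one.

How it was done

livescript2markdown first converts the live script to a tex file and then converts that tex file to markdown. The conversion to tex is provided by MATLAB.

In the generated tex, the title and section headings are output as follows.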

Code(Display)

\matlabtitle{タイトル}
\matlabheading{セクション１}
\matlabheadingtwo{サブセクション１}
\matlabheadingthree{サブサブセクション}

Just four commands. Simple. On top of that, the title is ignored this time. The form we want in markdown is

Code(Display)

# Table of contents
- [セクション](#セクション)
  - [サブセクション](#サブセクション)
    - [サブサブセクション](#サブサブセクション)

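something like this. Writing each entry as [section](#section) turns it into a hyperlink within the markdown document. Handy! However, the ID given in (#section) needs care on the following three points.

Spaces are converted to - (hyphens)
Letters are converted to lower case
Duplicate IDs do not work properly

Extracting the list of section names

Let's try it with this sample string.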

Code

str = "\matlabtitle{タイトル}" + newline ...
+ "\matlabheading{セクション１}" + newline ...
+ "\matlabheadingtwo{サブセクション１}" + newline ...
+ "\matlabheading{Section 2}" + newline ...
+ "\matlabheadingtwo{Subsection 2-1}" + newline ...
+ "\matlabheadingthree{サブサブセクション}";

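In a real tex file, these headings are mixed in with code, output, comments, and much more.

To find the parts that match matlabheading and its variants, regexp and its regular expressions are the tool for the job. Specifying "match" returns the matched strings themselves.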

Code

toc_str = regexp(str,"\\matlabheading(?:|two|three){([^{}]+)}","match");
toc_str'

Output

ans = 5x1 string    
"\matlabheading{セクション１}"         
"\matlabheadingtwo{サブセクション１}"    
"\matlabheading{Section 2}"      
"\matlabheadingtwo{Subsection …  
"\matlabheadingthree{サブサブセクション}"

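All five headings were found, covering all three heading commands.

() groups the enclosed pattern and captures it as a token
(?:) groups without capturing a token; the ?: right after the opening parenthesis turns capturing off
(?:|two|three) covers the three commands matlabheading/matlabheadingtwo/matlabheadingthree
[^{}] is any single character other than { and }
+ repeats the preceding bracket expression [] one or more times

That is about it.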
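
To see why the pattern uses [^{}]+ rather than .+, here is a minimal sketch on a made-up one-line input (the string s is hypothetical and not taken from a real tex file). In MATLAB regular expressions .+ is greedy, so it can run past the first closing brace all the way to the last one.

```matlab
% Hypothetical input: two headings on the same line
s = "\matlabheading{A} text \matlabheading{B}";

% Greedy ".+" runs to the LAST "}" and merges both headings into one match
greedy = regexp(s, "\\matlabheading{(.+)}", "match");

% "[^{}]+" cannot cross a brace, so each heading is matched separately
strict = regexp(s, "\\matlabheading{([^{}]+)}", "match");

disp(numel(greedy))  % number of matches with ".+"
disp(numel(strict))  % number of matches with "[^{}]+"
```

Excluding the braces keeps every match inside a single pair of { }, which is what makes the pattern safe on a tex file where headings sit among other content.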

Building the table of contents

Now convert everything to the markdown ToC format in one go.

Code

% generate ToC with hyperlink for markdown
toc_md = regexprep(toc_str,"\\matlabheading{([^{}]+)}","- [$1](#$1)");
toc_md = regexprep(toc_md,"\\matlabheadingtwo{([^{}]+)}","  - [$1](#$1)");
toc_md = regexprep(toc_md,"\\matlabheadingthree{([^{}]+)}","    - [$1](#$1)");
toc_md'

Output

ans = 5x1 string    
"- [セクション１](#セクション１)"                
"  - [サブセクション１](#サブセクション１)"          
"- [Section 2](#Section 2)"          
"  - [Subsection 2-1](#Subsection …  
"    - [サブサブセクション](#サブサブセクション)"

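() groups the enclosed pattern and captures it as a token
[^{}] is any single character other than { and }
+ repeats the preceding bracket expression [] one or more times
$1 is the captured token

That is about it. In Markdown, the nesting level of a list item is expressed by indentation (spaces), so leading spaces are added during the conversion.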
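
As a variation on the three regexprep calls, the heading level can also be read from the command suffix. The sketch below is not what latex2markdown.m does; it assumes that a capturing group (|two|three) returns the suffix as the first token, and the sample titles are made up.

```matlab
% Made-up sample input with one heading per level
str = "\matlabheading{Intro}" + newline ...
    + "\matlabheadingtwo{Setup}" + newline ...
    + "\matlabheadingthree{Details}";

% Token 1 is the suffix ("", "two" or "three"), token 2 is the title
tok = regexp(str, "\\matlabheading(|two|three){([^{}]+)}", "tokens");
tok = vertcat(tok{:});                 % N-by-2 string array: [suffix, title]

% Heading level 1..3 derived from the suffix
level = 1 + (tok(:,1) == "two") + 2*(tok(:,1) == "three");

% Two spaces of indent per nesting level below the first
pad = ["", "  ", "    "]';
toc_md = pad(level) + "- [" + tok(:,2) + "](#" + tok(:,2) + ")";
disp(toc_md)
```

This keeps the level and the title side by side, which may be convenient if the indent width ever needs to change.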

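That completes the form! But as mentioned above, the ID set in (#section) takes a little more work. The following three points are handled next.

Spaces are converted to - (hyphens)
Letters are converted to lower case
Duplicate IDs do not work properly

Duplicate IDs do not work properly

When section titles collide, I simply give up and settle for issuing a warning.

Above, "match" returned the whole matched string. Using "tokens" instead returns only the part that corresponds to ([^{}]+), as shown below.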

Code

toc_id = regexp(str,"\\matlabheading(?:|two|three){([^{}]+)}","tokens");
toc_id'

	1
1	"セクション１"
2	"サブセクション１"
3	"Section 2"
4	"Subsection 2-1"
5	"サブサブセクション"

Like this. These strings are used as the IDs, and a warning is issued if there are duplicates.

Code

% check if any duplicate id for toc
toc_id = string(toc_id);
if length(toc_id) ~= length(unique(toc_id))
    warning("latex2markdown:ToCdupID","Duplication in section title is found. Some hyperlinks in ToC may not work properly.")
end
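
If it helps to know which titles collide, the check can be extended to list them. This is a minimal sketch with a hypothetical title list and a made-up warning identifier (example:ToCdupID); it is not part of latex2markdown.m.

```matlab
% Hypothetical list of section titles with two repeated entries
toc_id = ["Intro", "Usage", "Intro", "Notes", "Usage"];

% unique returns the distinct titles; idx maps every entry to its distinct title
[u, ~, idx] = unique(toc_id(:));
counts = accumarray(idx, 1);     % how many times each distinct title occurs
dup = u(counts > 1);             % titles that occur more than once

if ~isempty(dup)
    warning("example:ToCdupID", "Duplicate section titles: %s", join(dup, ", "));
end
```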

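String operations on the IDs

It would be nice to do this in one go, but nothing came to mind right away, so I go with a plain loop. Each step has a function that does the job.

Spaces are converted to - (hyphens): replace function
Letters are converted to lower case: lower function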

Code

ids = regexp(toc_md,"\(#.*\)","match"); % extract the ID part of each entry
ids'

	1
1	"(#セクション１)"
2	"(#サブセクション１)"
3	"(#Section 2)"
4	"(#Subsection 2-1)"
5	"(#サブサブセクション)"

Each ID is processed one at a time and replaced inside the markdown-formatted strings.

Code

for ii=1:length(ids) % for each ID
    tmp1 = ids{ii}; 
    tmp2 = replace(tmp1," ","-"); % replace spaces with -
    tmp2 = lower(tmp2); % lower case
    % replace ID string with a new string
    toc_md = replace(toc_md,tmp1,tmp2);
end
toc_md'

Output

ans = 5x1 string    
"- [セクション１](#セクション１)"                
"  - [サブセクション１](#サブセクション１)"          
"- [Section 2](#section-2)"          
"  - [Subsection 2-1](#subsection-…  
"    - [サブサブセクション](#サブサブセクション)"

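As an alternative to the loop, the same clean-up can be written without one by splitting each entry at the first "](#". This is only a sketch under the assumption that every entry contains that separator exactly as generated above; the sample entries are copied from the earlier output.

```matlab
% toc_md as it looks before the ID clean-up
toc_md = ["- [セクション１](#セクション１)", ...
          "- [Section 2](#Section 2)", ...
          "  - [Subsection 2-1](#Subsection 2-1)"];

% Split every entry at the first "](#" into the visible label and the ID part
label = extractBefore(toc_md, "](#");
id    = extractAfter(toc_md, "](#");

% Normalize only the ID part: spaces become hyphens, letters become lower case
id = lower(replace(id, " ", "-"));

% Reassemble; the label keeps its original spaces and capitalization
toc_md = label + "](#" + id;
disp(toc_md')
```

Because replace and lower work element-wise on string arrays, all entries are handled in a single pass.
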
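Finishing up!

All that remains is to join the entries with the ToC title. When livescript2markdown is run with the ToC option set to true, the result is inserted at the top of the generated markdown.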

Code

toc_md = ["# Table of contents", toc_md]; % add title
% join the strings
toc_md = join(toc_md,newline)

Output

toc_md = 
    "# Table of contents
     - [セクション１](#セクション１)
       - [サブセクション１](#サブセクション１)
     - [Section 2](#section-2)
       - [Subsection 2-1](#subsection-2-1)
         - [サブサブセクション](#サブサブセクション)"

Summary

So in the end this was a post about regular expressions. If anything catches your attention, feel free to leave a comment.

Author And Source

Source: Table of Contents の追加：Live Script から Markdown への自動変換, https://qiita.com/eigs/items/ad27da605753cdce0a3e

Attribution: information about the original author is available at the source URL. Copyright belongs to the original author.
