Commit c69dc59

Site updated: 2025-02-06 11:57:34
1 parent 8742cde commit c69dc59

2 files changed

+8
-7
lines changed

2025/02/06/kernel-memory-textchunker/index.html

+5-4
Original file line numberDiff line numberDiff line change
@@ -10,15 +10,15 @@
1010
<link rel="dns-prefetch" href="https://rainmakerho.github.io">
1111
<title>使用 Kernel Memory 的 TextChunker 來幫我們切 Chunk | 亂馬客</title>
1212
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
13-
<meta name="description" content="前言使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,而 Kernel Memory 的 TextChunker 正可以幫我們來做這件事。 實作1.加入Microsoft.KernelMemory Nuget 套件 2.讀取文件內容後,交給TextChunker 來處理 1234567891011121314151617181920212223">
13+
<meta name="description" content="前言使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,而 Kernel Memory 的 TextChunker 正可以幫我們來做這件事。 實作1.加入Microsoft.KernelMemory Nuget 套件 2.讀取文件內容後,交給TextChunker 來處理 123456789101112131415161718192021222">
1414
<meta property="og:type" content="article">
1515
<meta property="og:title" content="使用 Kernel Memory 的 TextChunker 來幫我們切 Chunk">
1616
<meta property="og:url" content="https://rainmakerho.github.io/2025/02/06/kernel-memory-textchunker/index.html">
1717
<meta property="og:site_name" content="亂馬客">
18-
<meta property="og:description" content="前言使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,而 Kernel Memory 的 TextChunker 正可以幫我們來做這件事。 實作1.加入Microsoft.KernelMemory Nuget 套件 2.讀取文件內容後,交給TextChunker 來處理 1234567891011121314151617181920212223">
18+
<meta property="og:description" content="前言使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,而 Kernel Memory 的 TextChunker 正可以幫我們來做這件事。 實作1.加入Microsoft.KernelMemory Nuget 套件 2.讀取文件內容後,交給TextChunker 來處理 123456789101112131415161718192021222">
1919
<meta property="og:locale" content="en_US">
2020
<meta property="article:published_time" content="2025-02-06T02:40:50.000Z">
21-
<meta property="article:modified_time" content="2025-02-06T02:54:22.210Z">
21+
<meta property="article:modified_time" content="2025-02-06T03:37:16.883Z">
2222
<meta property="article:author" content="亂馬客">
2323
<meta property="article:tag" content="Kernel Memory">
2424
<meta property="article:tag" content="RAG">
@@ -177,7 +177,8 @@ <h1 class="article-title" itemprop="name">
177177

178178
<div class="article-entry" itemprop="articleBody">
179179
<h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,<br><a target="_blank" rel="noopener" href="https://github.com/microsoft/kernel-memory">Kernel Memory</a> 的 TextChunker 正可以幫我們來做這件事。</p>
180-
<h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>實作</h3><p>1.加入<code>Microsoft.KernelMemory</code> Nuget 套件 2.讀取文件內容後,交給<code>TextChunker</code> 來處理</p>
180+
<h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>實作</h3><p>1.加入<code>Microsoft.KernelMemory</code> Nuget 套件 </p>
181+
<p>2.讀取文件內容後,交給<code>TextChunker</code> 來處理</p>
181182
<figure class="highlight csharp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.AI;</span><br><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.DataFormats.Text;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">pragma</span> <span class="keyword">warning</span> disable KMEXP00</span></span><br><span class="line"><span class="built_in">string</span> contents = System.IO.File.ReadAllText(<span class="string">@&quot;你的文字檔&quot;</span>);</span><br><span class="line">TextChunker.TokenCounter tokenCounter = <span class="keyword">new</span> CL100KTokenizer().CountTokens;</span><br><span class="line"></span><br><span class="line"><span class="comment">//計算字串的Token數</span></span><br><span class="line"><span 
class="built_in">int</span> tokenCount = tokenCounter(contents);</span><br><span class="line">Console.WriteLine(<span class="string">$&quot;The text contains <span class="subst">&#123;tokenCount&#125;</span> tokens.&quot;</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">//每行最多100個Token,超過就換到下一行</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerLine = <span class="number">100</span>;</span><br><span class="line"><span class="keyword">var</span> sentences = TextChunker.SplitPlainTextLines(contents, maxTokensPerLine: maxTokensPerLine, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextLines =======&quot;</span>);</span><br><span class="line"><span class="built_in">int</span> i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> sentence <span class="keyword">in</span> sentences) &#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;sentence&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">//每個段落最多400個Token,超過就換到下一個段落</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerParagraph = <span class="number">400</span>;</span><br><span class="line"><span class="built_in">int</span> overlapTokens = <span class="number">10</span>;</span><br><span class="line"><span class="keyword">var</span> partitions = TextChunker.SplitPlainTextParagraphs(sentences,</span><br><span class="line"> maxTokensPerParagraph: maxTokensPerParagraph, overlapTokens: overlapTokens, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextParagraphs 
=======&quot;</span>);</span><br><span class="line">i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> partition <span class="keyword">in</span> partitions)</span><br><span class="line">&#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;partition&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">//最後將這些 partition 呼叫 Embedding Model ,轉成 Vector 一併儲起來</span></span><br><span class="line"></span><br></pre></td></tr></table></figure>
182183

183184
<ul>

atom.xml

+3-3
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@
66
<link href="https://rainmakerho.github.io/atom.xml" rel="self"/>
77

88
<link href="https://rainmakerho.github.io/"/>
9-
<updated>2025-02-06T02:54:22.210Z</updated>
9+
<updated>2025-02-06T03:37:16.883Z</updated>
1010
<id>https://rainmakerho.github.io/</id>
1111

1212
<author>
@@ -21,9 +21,9 @@
2121
<link href="https://rainmakerho.github.io/2025/02/06/kernel-memory-textchunker/"/>
2222
<id>https://rainmakerho.github.io/2025/02/06/kernel-memory-textchunker/</id>
2323
<published>2025-02-06T02:40:50.000Z</published>
24-
<updated>2025-02-06T02:54:22.210Z</updated>
24+
<updated>2025-02-06T03:37:16.883Z</updated>
2525

26-
<content type="html"><![CDATA[<h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,<br>而 <a href="https://github.com/microsoft/kernel-memory">Kernel Memory</a> 的 TextChunker 正可以幫我們來做這件事。</p><h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>實作</h3><p>1.加入<code>Microsoft.KernelMemory</code> Nuget 套件 2.讀取文件內容後,交給<code>TextChunker</code> 來處理</p><figure class="highlight csharp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.AI;</span><br><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.DataFormats.Text;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">pragma</span> <span class="keyword">warning</span> disable 
KMEXP00</span></span><br><span class="line"><span class="built_in">string</span> contents = System.IO.File.ReadAllText(<span class="string">@&quot;你的文字檔&quot;</span>);</span><br><span class="line">TextChunker.TokenCounter tokenCounter = <span class="keyword">new</span> CL100KTokenizer().CountTokens;</span><br><span class="line"></span><br><span class="line"><span class="comment">//計算字串的Token數</span></span><br><span class="line"><span class="built_in">int</span> tokenCount = tokenCounter(contents);</span><br><span class="line">Console.WriteLine(<span class="string">$&quot;The text contains <span class="subst">&#123;tokenCount&#125;</span> tokens.&quot;</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">//每行最多100個Token,超過就換到下一行</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerLine = <span class="number">100</span>;</span><br><span class="line"><span class="keyword">var</span> sentences = TextChunker.SplitPlainTextLines(contents, maxTokensPerLine: maxTokensPerLine, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextLines =======&quot;</span>);</span><br><span class="line"><span class="built_in">int</span> i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> sentence <span class="keyword">in</span> sentences) &#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;sentence&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">//每個段落最多400個Token,超過就換到下一個段落</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerParagraph = <span class="number">400</span>;</span><br><span class="line"><span 
class="built_in">int</span> overlapTokens = <span class="number">10</span>;</span><br><span class="line"><span class="keyword">var</span> partitions = TextChunker.SplitPlainTextParagraphs(sentences,</span><br><span class="line"> maxTokensPerParagraph: maxTokensPerParagraph, overlapTokens: overlapTokens, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextParagraphs =======&quot;</span>);</span><br><span class="line">i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> partition <span class="keyword">in</span> partitions)</span><br><span class="line">&#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;partition&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">//最後將這些 partition 呼叫 Embedding Model ,轉成 Vector 一併儲起來</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><ul><li>註: maxTokensPerLine, maxTokensPerParagraph, overlapTokens 請依文件內容來進行調整。</li></ul><h3 id="參考資源"><a href="#參考資源" class="headerlink" title="參考資源"></a>參考資源</h3><p><a href="https://github.com/microsoft/kernel-memory">Kernel Memory</a></p>]]></content>
26+
<content type="html"><![CDATA[<h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>使用 RAG(Retrieval-Augmented Generation)需要將內容切成 Chunk,<br>而 <a href="https://github.com/microsoft/kernel-memory">Kernel Memory</a> 的 TextChunker 正可以幫我們來做這件事。</p><h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>實作</h3><p>1.加入<code>Microsoft.KernelMemory</code> Nuget 套件 </p><p>2.讀取文件內容後,交給<code>TextChunker</code> 來處理</p><figure class="highlight csharp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.AI;</span><br><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.DataFormats.Text;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">pragma</span> <span class="keyword">warning</span> disable 
KMEXP00</span></span><br><span class="line"><span class="built_in">string</span> contents = System.IO.File.ReadAllText(<span class="string">@&quot;你的文字檔&quot;</span>);</span><br><span class="line">TextChunker.TokenCounter tokenCounter = <span class="keyword">new</span> CL100KTokenizer().CountTokens;</span><br><span class="line"></span><br><span class="line"><span class="comment">//計算字串的Token數</span></span><br><span class="line"><span class="built_in">int</span> tokenCount = tokenCounter(contents);</span><br><span class="line">Console.WriteLine(<span class="string">$&quot;The text contains <span class="subst">&#123;tokenCount&#125;</span> tokens.&quot;</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">//每行最多100個Token,超過就換到下一行</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerLine = <span class="number">100</span>;</span><br><span class="line"><span class="keyword">var</span> sentences = TextChunker.SplitPlainTextLines(contents, maxTokensPerLine: maxTokensPerLine, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextLines =======&quot;</span>);</span><br><span class="line"><span class="built_in">int</span> i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> sentence <span class="keyword">in</span> sentences) &#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;sentence&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">//每個段落最多400個Token,超過就換到下一個段落</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerParagraph = <span class="number">400</span>;</span><br><span class="line"><span 
class="built_in">int</span> overlapTokens = <span class="number">10</span>;</span><br><span class="line"><span class="keyword">var</span> partitions = TextChunker.SplitPlainTextParagraphs(sentences,</span><br><span class="line"> maxTokensPerParagraph: maxTokensPerParagraph, overlapTokens: overlapTokens, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">&quot;======= SplitPlainTextParagraphs =======&quot;</span>);</span><br><span class="line">i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> partition <span class="keyword">in</span> partitions)</span><br><span class="line">&#123;</span><br><span class="line"> Console.WriteLine(<span class="string">$&quot;<span class="subst">&#123;++i&#125;</span>=&gt;<span class="subst">&#123;partition&#125;</span>&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">//最後將這些 partition 呼叫 Embedding Model ,轉成 Vector 一併儲起來</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><ul><li>註: maxTokensPerLine, maxTokensPerParagraph, overlapTokens 請依文件內容來進行調整。</li></ul><h3 id="參考資源"><a href="#參考資源" class="headerlink" title="參考資源"></a>參考資源</h3><p><a href="https://github.com/microsoft/kernel-memory">Kernel Memory</a></p>]]></content>
2727

2828

2929
