|
10 | 10 | <link rel="dns-prefetch" href="https://rainmakerho.github.io">
|
11 | 11 | <title>Using Kernel Memory's TextChunker to Split Content into Chunks | 亂馬客</title>
|
12 | 12 | <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
|
13 |
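| - <meta name="description" content="Introduction: Using RAG (Retrieval-Augmented Generation) requires splitting content into chunks, and the TextChunker in Kernel Memory can do exactly that for us. Implementation: 1. Add the Microsoft.KernelMemory NuGet package 2. Read the document content, then hand it to TextChunker for processing 1234567891011121314151617181920212223"> |
| 13 | + <meta name="description" content="Introduction: Using RAG (Retrieval-Augmented Generation) requires splitting content into chunks, and the TextChunker in Kernel Memory can do exactly that for us. Implementation: 1. Add the Microsoft.KernelMemory NuGet package 2. Read the document content, then hand it to TextChunker for processing 123456789101112131415161718192021222"> |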
14 | 14 | <meta property="og:type" content="article">
|
15 | 15 | <meta property="og:title" content="Using Kernel Memory's TextChunker to Split Content into Chunks">
|
16 | 16 | <meta property="og:url" content="https://rainmakerho.github.io/2025/02/06/kernel-memory-textchunker/index.html">
|
17 | 17 | <meta property="og:site_name" content="亂馬客">
|
18 |
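| -<meta property="og:description" content="Introduction: Using RAG (Retrieval-Augmented Generation) requires splitting content into chunks, and the TextChunker in Kernel Memory can do exactly that for us. Implementation: 1. Add the Microsoft.KernelMemory NuGet package 2. Read the document content, then hand it to TextChunker for processing 1234567891011121314151617181920212223"> |
| 18 | +<meta property="og:description" content="Introduction: Using RAG (Retrieval-Augmented Generation) requires splitting content into chunks, and the TextChunker in Kernel Memory can do exactly that for us. Implementation: 1. Add the Microsoft.KernelMemory NuGet package 2. Read the document content, then hand it to TextChunker for processing 123456789101112131415161718192021222"> |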
19 | 19 | <meta property="og:locale" content="en_US">
|
20 | 20 | <meta property="article:published_time" content="2025-02-06T02:40:50.000Z">
|
21 |
| -<meta property="article:modified_time" content="2025-02-06T02:54:22.210Z"> |
| 21 | +<meta property="article:modified_time" content="2025-02-06T03:37:16.883Z"> |
22 | 22 | <meta property="article:author" content="亂馬客">
|
23 | 23 | <meta property="article:tag" content="Kernel Memory">
|
24 | 24 | <meta property="article:tag" content="RAG">
|
@@ -177,7 +177,8 @@ <h1 class="article-title" itemprop="name">
|
177 | 177 |
|
178 | 178 | <div class="article-entry" itemprop="articleBody">
|
179 | 179 | <h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Introduction</h3><p>Using RAG (Retrieval-Augmented Generation) requires splitting content into chunks,<br>and the TextChunker in <a target="_blank" rel="noopener" href="https://github.com/microsoft/kernel-memory">Kernel Memory</a> can do exactly that for us.</p>
|
180 |
| -<h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>Implementation</h3><p>1. Add the <code>Microsoft.KernelMemory</code> NuGet package 2. Read the document content, then hand it to <code>TextChunker</code> for processing</p> |
| 180 | +<h3 id="實作"><a href="#實作" class="headerlink" title="實作"></a>Implementation</h3><p>1. Add the <code>Microsoft.KernelMemory</code> NuGet package </p> |
| 181 | +<p>2. Read the document content, then hand it to <code>TextChunker</code> for processing</p> |
181 | 182 | <figure class="highlight csharp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.AI;</span><br><span class="line"><span class="keyword">using</span> Microsoft.KernelMemory.DataFormats.Text;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">pragma</span> <span class="keyword">warning</span> disable KMEXP00</span></span><br><span class="line"><span class="built_in">string</span> contents = System.IO.File.ReadAllText(<span class="string">@"path to your text file"</span>);</span><br><span class="line">TextChunker.TokenCounter tokenCounter = <span class="keyword">new</span> CL100KTokenizer().CountTokens;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Count the number of tokens in the string</span></span><br><span class="line"><span 
class="built_in">int</span> tokenCount = tokenCounter(contents);</span><br><span class="line">Console.WriteLine(<span class="string">$"The text contains <span class="subst">{tokenCount}</span> tokens."</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">// Each line holds at most 100 tokens; anything beyond that wraps to the next line</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerLine = <span class="number">100</span>;</span><br><span class="line"><span class="keyword">var</span> sentences = TextChunker.SplitPlainTextLines(contents, maxTokensPerLine: maxTokensPerLine, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">"======= SplitPlainTextLines ======="</span>);</span><br><span class="line"><span class="built_in">int</span> i = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> sentence <span class="keyword">in</span> sentences) {</span><br><span class="line"> Console.WriteLine(<span class="string">$"<span class="subst">{++i}</span>=><span class="subst">{sentence}</span>"</span>);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">// Each paragraph holds at most 400 tokens; anything beyond that moves to the next paragraph</span></span><br><span class="line"><span class="built_in">int</span> maxTokensPerParagraph = <span class="number">400</span>;</span><br><span class="line"><span class="built_in">int</span> overlapTokens = <span class="number">10</span>;</span><br><span class="line"><span class="keyword">var</span> partitions = TextChunker.SplitPlainTextParagraphs(sentences,</span><br><span class="line"> maxTokensPerParagraph: maxTokensPerParagraph, overlapTokens: overlapTokens, tokenCounter: tokenCounter);</span><br><span class="line">Console.WriteLine(<span class="string">"======= SplitPlainTextParagraphs ======="</span>);</span><br><span class="line">i = <span 
class="number">0</span>;</span><br><span class="line"><span class="keyword">foreach</span> (<span class="keyword">var</span> partition <span class="keyword">in</span> partitions)</span><br><span class="line">{</span><br><span class="line"> Console.WriteLine(<span class="string">$"<span class="subst">{++i}</span>=><span class="subst">{partition}</span>"</span>);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// Finally, pass these partitions to an embedding model to convert them into vectors, and store them together</span></span><br><span class="line"></span><br></pre></td></tr></table></figure>
|
182 | 183 |
|
183 | 184 | <ul>
|
|