Language models have a token limit. You should not exceed the token limit. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text, you should use the same tokenizer as used in the language model.
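As a minimal sketch of such counting (assuming OpenAI's tiktoken package, introduced below, is installed; the sample string is arbitrary):

import tiktoken

# Look up the encoding used by a given OpenAI model and count tokens with it.
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode("Language models have a token limit.")))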

tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.
We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for OpenAI models.
  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by the tiktoken tokenizer.
CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.
pip install --upgrade --quiet langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
To split with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer. The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base) or model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate CharacterTextSplitter:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
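To see this in practice, you can re-measure each chunk with the same encoding (an illustrative check, not part of the original example):

import tiktoken

# CharacterTextSplitter merges on separators, so individual chunks can come
# out larger than chunk_size when measured in tokens.
enc = tiktoken.get_encoding("cl100k_base")
print([len(enc.encode(chunk)) for chunk in texts[:5]])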
To enforce a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively split again if it is too large:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
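Splitting then works the same way; each resulting chunk respects the token limit (a short usage sketch):

texts = text_splitter.split_text(state_of_the_union)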
We can also load a TokenTextSplitter splitter, which works with tiktoken directly and will ensure each split is smaller than the chunk size.
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our
Some written languages (e.g. Chinese and Japanese) have characters which encode to two or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
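A sketch of the safe approach on Chinese text (the sample string is made up for illustration):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Characters like these can encode to two or more tokens each under cl100k_base.
cjk_text = "语言模型有Token限制。" * 100
safe_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=50, chunk_overlap=0
)
# Each chunk is a valid Unicode string; no character's tokens are split apart.
print(safe_splitter.split_text(cjk_text)[0])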

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
LangChain implements splitters based on the spaCy tokenizer.
  1. How the text is split: by the spaCy tokenizer.
  2. How the chunk size is measured: by number of characters.
pip install --upgrade --quiet spacy
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.



Last year COVID-19 kept us apart.

This year we are finally together again.



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.



With a duty to one another to the American people to the Constitution.



And with an unwavering resolve that freedom will always triumph over tyranny.



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.



He met the Ukrainian people.



From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
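The default behavior loads a full spaCy model. When only sentence boundaries are needed, SpacyTextSplitter also accepts a pipeline argument; passing "sentencizer" uses spaCy's lightweight rule-based segmenter, which is typically much faster (a sketch under that assumption):

from langchain_text_splitters import SpacyTextSplitter

# The rule-based sentencizer skips tagging and parsing, trading analysis
# depth for speed; chunks are still merged up to chunk_size characters.
fast_splitter = SpacyTextSplitter(pipeline="sentencizer", chunk_size=1000)
texts = fast_splitter.split_text(state_of_the_union)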

SentenceTransformers

SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model that you would like to use. To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:
  • chunk_overlap: integer count of token overlap;
  • model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
  • tokens_per_chunk: desired token count per chunk.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

# The tokenizer adds start and stop special tokens to each encoding, so we
# subtract them to get the token count of the text itself.
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])
lorem

NLTK

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
  1. How the text is split: by the NLTK tokenizer.
  2. How the chunk size is measured: by number of characters.
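NLTKTextSplitter relies on NLTK's sentence tokenizer, which needs the punkt data package; if it is not already present, a one-time download is required (a sketch; recent NLTK releases name the package punkt_tab):

import nltk

# One-time download of the pretrained sentence-boundary models.
nltk.download("punkt")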
# pip install nltk
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.

KoNLPy

KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.
Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for English cannot understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.

Token splitting for Korean with KoNLPy's Kkma analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying parts of speech for each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.
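To see what Kkma produces on its own, konlpy exposes the analyzer directly (a sketch; konlpy also requires a Java runtime, and the sample sentence is made up):

from konlpy.tag import Kkma

kkma = Kkma()
# Morphological analysis: morphemes paired with their part-of-speech tags.
print(kkma.pos("안녕하세요, 반갑습니다."))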

Usage Considerations

While Kkma is renowned for its detailed analysis, it is important to note that this precision may come at the cost of processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
# pip install konlpy
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
    korean_document = f.read()
from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])
춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.

그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.

한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.

춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.

어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.

두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.

하지만 좋은 날들은 오래가지 않았다.

도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.

이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.

그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.

춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.

이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.

이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.

두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.

- 춘향전 (The Tale of Chunhyang)

Hugging Face tokenizer

Hugging Face has many tokenizers. Here we use one of them, GPT2TokenizerFast, to count the text length in tokens.
  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
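As with tiktoken, you can sanity-check the measured chunk sizes by re-encoding with the same tokenizer (an illustrative check):

# Count tokens in each chunk with the GPT-2 tokenizer used for splitting.
print([len(tokenizer.encode(chunk)) for chunk in texts[:5]])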
