跳到主要内容
在摄取 HTML 文档以供后续检索时,我们通常只对网页的实际内容感兴趣,而不是其语义。使用 MozillaReadabilityTransformer 从文档中剥离 HTML 标签可以生成更具内容丰富度的块,从而使检索更有效。

设置

您需要安装 @mozilla/readabilityjsdom npm 包。
npm
npm install @mozilla/readability jsdom
虽然转换器本身不需要,但下面的使用示例需要 cheerio 进行抓取。
npm
npm install cheerio
有关安装 LangChain 软件包的一般说明,请参阅此部分
npm
npm install @langchain/community @langchain/core

用法

下面的示例抓取了一个 Hacker News 帖子,根据 HTML 标签进行分割,以根据标签的语义信息对块进行分组,然后从各个块中提取内容。
import { HTMLWebBaseLoader } from "@langchain/community/document_loaders/web/html";
import { MozillaReadabilityTransformer } from "@langchain/community/document_transformers/mozilla_readability";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const loader = new HTMLWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881"
);

const docs = await loader.load();

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
const transformer = new MozillaReadabilityTransformer();

const sequence = transformer.pipe(splitter);

const newDocuments = await sequence.invoke(docs);

console.log(newDocuments);

/*
  [
    Document {
      pageContent: 'Hacker News new | past | comments | ask | show | jobs | submit login What Lights\n' +
        'the Universe’s Standard Candles? (quantamagazine.org) 75 points by Amorymeltzer\n' +
        '5 months ago | hide | past | favorite | 6 comments delta_p_delta_x 5 months ago\n' +
        '| next [–] Astrophysical and cosmological simulations are often insightful.\n' +
        "They're also very cross-disciplinary; besides the obvious astrophysics, there's\n" +
        'networking and sysadmin, parallel computing and algorithm theory (so that the\n' +
        'simulation programs are actually fast but still accurate), systems design, and\n' +
        'even a bit of graphic design for the visualisations.Some of my favourite\n' +
        'simulation projects:- IllustrisTNG:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'that the simulation programs are actually fast but still accurate), systems\n' +
        'design, and even a bit of graphic design for the visualisations.Some of my\n' +
        'favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/-\n' +
        'SWIFT: https://swift.dur.ac.uk/- CO5BOLD:\n' +
        'https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations\n' +
        'of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)-\n' +
        'AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the\n' +
        'simulations in the article, too. froeb 5 months ago | parent | next [–]\n' +
        'Supernova simulations are especially interesting too. I have heard them\n' +
        'described as the only time in physics when all 4 of the fundamental forces are\n' +
        'important. The explosion can be quite finicky too. If I remember right, you\n' +
        "can't get supernova to explode",
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'heard them described as the only time in physics when all 4 of the fundamental\n' +
        'forces are important. The explosion can be quite finicky too. If I remember\n' +
        "right, you can't get supernova to explode properly in 1D simulations, only in\n" +
        'higher dimensions. This was a mystery until the realization that turbulence is\n' +
        'necessary for supernova to trigger--there is no turbulent flow in 1D. andrewflnr\n' +
        "5 months ago | prev | next [–] Whoa. I didn't know the accretion theory of Ia\n" +
        'supernovae was dead, much less that it had been since 2011. andreareina 5 months\n' +
        'ago | prev | next [–] This seems to be the paper',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'andreareina 5 months ago | prev | next [–] This seems to be the paper\n' +
        'https://academic.oup.com/mnras/article/517/4/5260/6779709 andreareina 5 months\n' +
        "ago | prev [–] Wouldn't double detonation show up as variance in the brightness?\n" +
        'yencabulator 5 months ago | parent [–] Or widening of the peak. If one type Ia\n' +
        'supernova goes 1,2,3,2,1, the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3\n' +
        '0+1=1 Guidelines | FAQ | Lists |',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3 0+1=1 Guidelines | FAQ |\n' +
        'Lists | API | Security | Legal | Apply to YC | Contact Search:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    }
  ]
*/

自定义

您可以向转换器传递 @mozilla/readability 包接受的任何参数,以自定义其工作方式。
以编程方式连接这些文档到 Claude、VSCode 等,通过 MCP 获取实时答案。
© . This site is unofficial and not affiliated with LangChain, Inc.