ScrapeGraph

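This guide provides a quick overview for getting started with the ScrapeGraph tools. For detailed documentation of all ScrapeGraph features and configurations, see the API reference. For more information about ScrapeGraph AI, visit scrapegraphai.com.

Overview

Integration details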

Class	Package	Serializable	JS support
SmartScraperTool	langchain-scrapegraph	✅	❌
SmartCrawlerTool	langchain-scrapegraph	✅	❌
MarkdownifyTool	langchain-scrapegraph	✅	❌
AgenticScraperTool	langchain-scrapegraph	✅	❌
GetCreditsTool	langchain-scrapegraph	✅	❌

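Tool features

Tool	Purpose	Input	Output
SmartScraperTool	Extract structured data from a website	URL + prompt	JSON
SmartCrawlerTool	Extract data from multiple pages by crawling	URL + prompt + crawl options	JSON
MarkdownifyTool	Convert a web page to Markdown	URL	Markdown text
GetCreditsTool	Check API credits	None	Credit info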

Setup

This integration requires the following package:

pip install --quiet -U langchain-scrapegraph

Note: you may need to restart the kernel to use updated packages.

Credentials

You need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.

import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
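The `getpass` prompt above suits notebooks, but it blocks in non-interactive runs such as CI jobs. As a minimal sketch, assuming you would rather fail fast with a clear message there, the check below reads the same `SGAI_API_KEY` variable. The helper name `require_api_key` is hypothetical and not part of `langchain-scrapegraph`.

```python
import os


def require_api_key(env=os.environ, name="SGAI_API_KEY"):
    """Return the API key from env, or raise a clear error if it is missing or blank.

    Hypothetical helper for illustration; not part of langchain-scrapegraph.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the tools.")
    return key
```

Calling `require_api_key()` before instantiating the tools turns a missing key into an immediate, readable error instead of a stalled prompt.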

Setting up LangSmith for best-in-class observability is also helpful, but not required:

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

Instantiation

Here we show how to instantiate the ScrapeGraph tools:

from scrapegraph_py.logger import sgai_logger
import json

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    MarkdownifyTool,
    SmartCrawlerTool,
    SmartScraperTool,
)

sgai_logger.set_logging(level="INFO")

smartscraper = SmartScraperTool()
smartcrawler = SmartCrawlerTool()
markdownify = MarkdownifyTool()
credits = GetCreditsTool()

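Invocation

Invoke directly with args

Let's try each tool individually:

SmartCrawler Tool

SmartCrawlerTool lets you crawl multiple pages of a website and extract structured data, with advanced crawling options such as depth control, page limits, and domain restrictions.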

# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

# SmartCrawler
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result_crawler = smartcrawler.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print("\nSmartCrawler Result:")
print(json.dumps(result_crawler, indent=2))

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)

SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)

PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up

Op
LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
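Each tool in the calls above takes a differently shaped input dictionary, so a typo in a key name is easy to make. The sketch below checks a payload against the keys used in those examples before it is sent. The table `REQUIRED_KEYS` and the helper `missing_keys` are hypothetical, and which keys the library strictly requires is an assumption inferred from the examples, not from its schema.

```python
# Input keys per tool, copied from the invocation examples above.
# Treating them as "required" is an assumption made for this sketch.
REQUIRED_KEYS = {
    "SmartScraperTool": {"user_prompt", "website_url"},
    "MarkdownifyTool": {"website_url"},
    "SmartCrawlerTool": {"url", "prompt"},
    "GetCreditsTool": set(),
}


def missing_keys(tool_class_name, payload):
    """Return the sorted list of expected keys that are absent from payload."""
    return sorted(REQUIRED_KEYS[tool_class_name] - set(payload))


print(missing_keys("SmartScraperTool", {"website_url": "https://scrapegraphai.com"}))
```

An empty list means the payload carries every key the examples use for that tool.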

# SmartCrawler example
from scrapegraph_py.logger import sgai_logger
import json

from langchain_scrapegraph.tools import SmartCrawlerTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = SmartCrawlerTool()

# Target site and the extraction prompt applied to the crawled pages
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result = tool.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print(json.dumps(result, indent=2))
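To picture how options like `depth`, `max_pages`, and `same_domain_only` interact, here is a toy breadth-first crawl over an in-memory link graph. It is only an illustration of the general idea, not ScrapeGraph's implementation: the `LINKS` graph and the `crawl_order` function are hypothetical, no network requests are made, and counting the start page as level 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages; nothing is fetched.
LINKS = {
    "https://example.com/": [
        "https://example.com/privacy",
        "https://example.com/terms",
        "https://other.org/",
    ],
    "https://example.com/privacy": ["https://example.com/cookies"],
    "https://example.com/terms": [],
    "https://example.com/cookies": [],
    "https://other.org/": [],
}


def crawl_order(start, depth, max_pages, same_domain_only):
    """Return the breadth-first visit order under depth, page-count, and domain limits."""
    domain = urlparse(start).netloc
    seen = {start}
    order = []
    queue = deque([(start, 0)])  # the start page is level 0 (an assumption)
    while queue and len(order) < max_pages:
        url, level = queue.popleft()
        order.append(url)
        if level >= depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).netloc != domain:
                continue  # skip links that leave the starting domain
            seen.add(link)
            queue.append((link, level + 1))
    return order


print(crawl_order("https://example.com/", depth=2, max_pages=2, same_domain_only=True))
```

With `max_pages=2` the crawl stops after the start page and the first linked page, even though `depth=2` would allow it to go further. Raising `max_pages` lets the depth and domain limits become the binding constraints.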

Invoke with ToolCall

We can also invoke the tool with a model-generated ToolCall:

model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)

ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
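In the output above, the `content` of the ToolMessage is a string that holds JSON, so it needs to be decoded before individual fields can be read. The sketch below does that with the standard `json` module, using the content string copied from that output. With a live result you would pass the message's `content` attribute instead. That the content is always valid JSON is an assumption, so real code may want to catch `json.JSONDecodeError`.

```python
import json

# Content string copied from the ToolMessage output above.
content = (
    '{"main_heading": "Get the data you need from any website", '
    '"description": "Easily extract and gather information with just a few lines '
    'of code with a simple api. Turn websites into clean and usable structured data."}'
)

data = json.loads(content)  # decode the JSON string into a dict
print(data["main_heading"])
```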

Chaining

Let's use our tools with an LLM to analyze a website:

# pip install -qU langchain langchain-openai
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-4o", model_provider="openai")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

model_with_tools = model.bind_tools([smartscraper], tool_choice=smartscraper.name)
model_chain = prompt | model_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = model_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return model_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)


tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)

AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
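The chain above makes two passes through the model: the first produces tool calls, the tool runs them, and the second pass sees the tool results alongside the original request. The following sketch traces that message flow with plain dictionaries and stub functions so it runs without an API key. `fake_model`, `fake_tool`, and `run_two_pass` are hypothetical stand-ins, not LangChain or ScrapeGraph APIs, and the stub's behavior is simplified.

```python
def fake_model(messages):
    """Stand-in for the chat model: request a tool first, then answer from its result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {
            "role": "ai",
            "content": "",
            "tool_calls": [
                {
                    "id": "1",
                    "name": "SmartScraper",
                    "args": {"website_url": "https://scrapegraphai.com"},
                }
            ],
        }
    return {
        "role": "ai",
        "content": "Summary: " + tool_results[0]["content"],
        "tool_calls": [],
    }


def fake_tool(call):
    """Stand-in for the tool: report which URL it was asked to scrape."""
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": "scraped " + call["args"]["website_url"],
    }


def run_two_pass(user_input):
    messages = [{"role": "human", "content": user_input}]
    ai_msg = fake_model(messages)  # pass 1: the model emits tool calls
    tool_msgs = [fake_tool(call) for call in ai_msg["tool_calls"]]  # run each call
    return fake_model(messages + [ai_msg, *tool_msgs])  # pass 2: the model sees the results


print(run_two_pass("What does ScrapeGraph AI do?")["content"])
```

Each tool result carries the `id` of the call that produced it, which is how results are matched back to requests when a model issues several calls at once.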

API reference

For detailed documentation of all ScrapeGraph features and configurations, see the LangChain API reference, or visit the official SDK repository.
