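Apify Actors are cloud programs designed for a wide range of web scraping, crawling, and data extraction tasks. They automate the collection of web data so that users can extract, process, and store information efficiently. Typical uses include scraping e-commerce sites for product details, monitoring price changes, and collecting search engine results. Actors integrate with Apify Datasets, so the structured data an Actor collects can be stored, managed, and exported in formats such as JSON, CSV, or Excel for further analysis or use.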

Overview

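This notebook walks you through using Apify Actors with LangChain to automate web scraping and data extraction. The langchain-apify package integrates Apify's cloud-based tools with LangChain agents, enabling efficient data collection and processing for AI applications.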

Setup

This integration lives in the langchain-apify package, which can be installed with pip.
pip install langchain-apify

Prerequisites

  • Apify account: register for a free Apify account.
  • Apify API token: learn how to get your API token in the Apify documentation.
import os

os.environ["APIFY_API_TOKEN"] = "your-apify-api-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
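Hard-coding secrets in a notebook makes them easy to leak. As a minimal sketch of an alternative, assuming the two variables are already exported in your shell, the helper below fails early with a clear message when one is missing. `require_env` is a hypothetical helper for illustration and is not part of langchain-apify.

```python
import os


def require_env(name: str) -> str:
    """Return the stripped value of an environment variable, or raise if unset/empty.

    Hypothetical helper for illustration; not part of langchain-apify.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (uncomment once the variables are exported in your shell):
# apify_token = require_env("APIFY_API_TOKEN")
# openai_key = require_env("OPENAI_API_KEY")
```

Checking both variables before instantiating the tool surfaces a missing token immediately rather than at the first Actor call.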

Instantiation

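Here we instantiate ApifyActorsTool so that we can call the RAG Web Browser Apify Actor. This Actor provides web browsing functionality for AI and LLM applications, similar to the web browsing feature in ChatGPT. Any Actor from the Apify Store can be used in this way.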
from langchain_apify import ApifyActorsTool

tool = ApifyActorsTool("apify/rag-web-browser")

Invocation

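ApifyActorsTool takes a single argument, run_input: a dictionary that is passed to the Actor as its run input. The run input schema is documented in the Input section of the Actor's details page. See the RAG Web Browser input schema.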
tool.invoke({"run_input": {"query": "what is apify?", "maxResults": 2}})
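The call above nests the Actor input under a single `run_input` key. The following sketch builds that payload from plain arguments. `build_payload` is a hypothetical helper, and the only field names it relies on, `query` and `maxResults`, are taken from the example above. Any other field passed through `extra` is an assumption that should be checked against the Actor's input schema.

```python
from typing import Any


def build_payload(query: str, max_results: int = 2, **extra: Any) -> dict[str, Any]:
    """Wrap Actor input fields under the `run_input` key expected by the tool.

    Hypothetical helper for illustration; not part of langchain-apify.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    # Extra keyword arguments are forwarded unchanged as additional input fields.
    run_input = {"query": query, "maxResults": max_results, **extra}
    return {"run_input": run_input}


payload = build_payload("what is apify?", max_results=2)
print(payload)
# tool.invoke(payload)  # requires the `tool` instantiated above and a valid API token
```

Validating the input locally avoids spending an Actor run on a request that is obviously malformed.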

Chaining

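We can provide the created tool to an agent. When asked to search for information, the agent calls the Apify Actor, which searches the web and retrieves the search results.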
pip install langgraph langchain-openai
from langchain.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent


model = ChatOpenAI(model="gpt-4o")
tools = [tool]
graph = create_agent(model, tools=tools)
inputs = {"messages": [("user", "search for what is Apify")]}
for s in graph.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    # skip tool messages
    if isinstance(message, ToolMessage):
        continue
    message.pretty_print()
================================ Human Message =================================

search for what is Apify
================================== Ai Message ==================================
Tool Calls:
  apify_actor_apify_rag-web-browser (call_27mjHLzDzwa5ZaHWCMH510lm)
 Call ID: call_27mjHLzDzwa5ZaHWCMH510lm
  Args:
    run_input: {"run_input":{"query":"Apify","maxResults":3,"outputFormats":["markdown"]}}
================================== Ai Message ==================================

Apify is a comprehensive platform for web scraping, browser automation, and data extraction. It offers a wide array of tools and services that cater to developers and businesses looking to extract data from websites efficiently and effectively. Here's an overview of Apify:

1. **Ecosystem and Tools**:
   - Apify provides an ecosystem where developers can build, deploy, and publish data extraction and web automation tools called Actors.
   - The platform supports various use cases such as extracting data from social media platforms, conducting automated browser-based tasks, and more.

2. **Offerings**:
   - Apify offers over 3,000 ready-made scraping tools and code templates.
   - Users can also build custom solutions or hire Apify's professional services for more tailored data extraction needs.

3. **Technology and Integration**:
   - The platform supports integration with popular tools and services like Zapier, GitHub, Google Sheets, Pinecone, and more.
   - Apify supports open-source tools and technologies such as JavaScript, Python, Puppeteer, Playwright, Selenium, and its own Crawlee library for web crawling and browser automation.

4. **Community and Learning**:
   - Apify hosts a community on Discord where developers can get help and share expertise.
   - It offers educational resources through the Web Scraping Academy to help users become proficient in data scraping and automation.

5. **Enterprise Solutions**:
   - Apify provides enterprise-grade web data extraction solutions with high reliability, 99.95% uptime, and compliance with SOC2, GDPR, and CCPA standards.

For more information, you can visit [Apify's official website](https://apify.com/) or their [GitHub page](https://github.com/apify) which contains their code repositories and further details about their projects.

API reference

For more information on how to use this integration, see the Git repository or the Apify integration documentation.
