Description
Overview
This News Extraction automation workflow enables systematic retrieval and processing of recent news posts from a website without an RSS feed, employing a no-code integration pipeline. Designed for data engineers and content managers, it automates the extraction of URLs, publication dates, and full content from news listings, producing summarized insights and technical keywords using AI language models.
Key Benefits
- Automates weekly extraction of news posts using CSS selectors from HTML content.
- Filters news articles by publication date, ensuring only recent posts are processed.
- Generates concise content summaries with AI, optimizing information consumption.
- Extracts key technical keywords from articles via AI-driven natural language processing.
- Stores enriched news data reliably in a structured NocoDB SQL database for further use.
Product Overview
This News Extraction orchestration pipeline triggers weekly based on a scheduled cron event. It initiates by sending an HTTP request to retrieve the HTML of the news listing page. Using HTML extraction nodes configured with precise CSS selectors, it pulls arrays of individual news post links (href attributes) and their corresponding publication dates from specified DOM elements. These arrays are split into individual items and merged by position to associate each link with its date.
A JavaScript code node then filters the combined set to retain only posts published within the last seven days. For each filtered link, the workflow fetches the full news article HTML, extracting the title and main content using targeted CSS selectors. The content is sent to an AI text generation node that produces a summary capped at 70 words, alongside a second AI call that identifies the top three technical keywords and returns only the keywords, with no explanatory text.
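The seven-day filter can be sketched as below. The field names `link` and `date` are assumptions about the extraction output, and the date parsing may need adjusting to match the target site's actual date format:

```javascript
// Sketch of the filtering logic inside the n8n Code node.
// Each incoming item is assumed to look like { json: { link, date } };
// the field names depend on how the extraction nodes are configured.
function filterRecent(items, now = Date.now()) {
  const cutoff = now - 7 * 24 * 60 * 60 * 1000; // seven days in milliseconds

  return items.filter((item) => {
    const published = new Date(item.json.date).getTime();
    // Drop items whose date failed to parse or is older than the cutoff.
    return !Number.isNaN(published) && published >= cutoff;
  });
}

// Inside n8n, the Code node body would simply be: return filterRecent(items);
```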
Summary and keyword outputs are renamed and merged for clarity, then combined with the original metadata (title, date, link). The final enriched JSON objects are pushed into a NocoDB SQL database table configured with appropriate fields, enabling structured storage and downstream querying. The workflow operates synchronously within each step, with default platform error handling applied. Authentication for AI and database access is via API keys securely managed by n8n credentials.
Features and Outcomes
Core Automation
This news extraction pipeline accepts raw HTML from a news listing page as input, applies CSS selector-based extraction to isolate links and publication dates, and uses a date-based filter for recent content. AI-powered summarization and keyword extraction nodes generate concise and relevant metadata for each article.
- Single-pass evaluation merges and filters items by publication date deterministically.
- Seamless combination of structured metadata with AI-generated text enrichments.
- Synchronous node execution ensures consistent data flow and output integrity.
Integrations and Intake
The workflow integrates with external HTTP endpoints for HTML retrieval and uses OpenAI’s GPT model for natural language processing, authenticated via API keys. The intake consists of HTML pages containing news listings, expected to have consistent CSS structure for reliable extraction.
- HTTP Request nodes fetch listing and detail pages for content scraping.
- OpenAI API calls leverage a GPT model for content summarization and keyword extraction.
- NocoDB API authenticates via token to store processed news data securely.
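For reference, the listing retrieval performed by the HTTP Request node is equivalent to a simple fetch, sketched below. The URL is a placeholder, and `fetchImpl` is injectable only for testing (Node 18+ provides a global `fetch`):

```javascript
// Minimal sketch of the listing-page fetch; in the workflow this is an
// HTTP Request node, not custom code. The URL is a placeholder.
async function fetchListingHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    // Surface non-2xx responses rather than parsing an error page.
    throw new Error(`HTTP ${res.status} for ${url}`);
  }
  return res.text();
}
```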
Outputs and Consumption
Outputs include JSON objects combining news titles, publication dates, URLs, AI-generated summaries, and keyword lists. Data is written synchronously to a structured SQL database, enabling efficient querying and integration into downstream systems.
- Structured JSON objects with keys: Title, Date, Link, Summary, Keywords.
- Data is stored in a NocoDB SQL table optimized for news metadata management.
- All outputs maintain consistent formatting for automated consumption workflows.
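A single enriched record might look like the following; all values are illustrative placeholders, and the key names mirror the fields listed above:

```javascript
// Illustrative example of one enriched output record as stored in NocoDB.
// Every value here is a placeholder, not real extracted data.
const sampleRecord = {
  Title: "Example: New Platform Release Announced",
  Date: "2024-05-08",
  Link: "https://example.com/news/new-platform-release",
  Summary: "A concise AI-generated summary of the post, capped at 70 words.",
  Keywords: ["release", "automation", "integration"], // top three keywords
};
```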
Workflow — End-to-End Execution
Step 1: Trigger
The workflow is initiated by a scheduled trigger node configured to run weekly on a specific day and time. This time-based event ensures periodic retrieval of news updates without manual intervention.
Step 2: Processing
Upon trigger, an HTTP Request node retrieves the HTML of the news listing page. Two HTML extraction nodes then apply CSS selectors to extract arrays of post links and publication dates. These arrays are split into individual JSON items for further processing and merged by their index position. Basic presence checks ensure extracted data is valid before filtering.
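The merge-by-position step pairs each extracted link with the date at the same index; a sketch of that logic, with illustrative array names, is shown below:

```javascript
// Sketch of the "merge by position" step: pair each link with the date
// extracted at the same index. Array names are illustrative.
function mergeByPosition(links, dates) {
  const count = Math.min(links.length, dates.length);
  const merged = [];
  for (let i = 0; i < count; i++) {
    // Basic presence check: skip pairs where either value is missing.
    if (links[i] && dates[i]) {
      merged.push({ link: links[i], date: dates[i] });
    }
  }
  return merged;
}
```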
Step 3: Analysis
A JavaScript code node filters posts to include only those published within the last seven days. For each filtered post, an HTTP Request node fetches the full article HTML, from which title and content are extracted. AI-powered nodes then generate a succinct summary (under 70 words) and identify three key technical keywords from the content.
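The two AI calls can be thought of as templated prompts over the extracted article content; the wording below is illustrative only and may differ from the prompts configured in the workflow:

```javascript
// Illustrative prompt builders for the two OpenAI calls.
// The exact prompt wording in the workflow may differ.
const summaryPrompt = (content) =>
  `Summarize the following news article in no more than 70 words:\n\n${content}`;

const keywordPrompt = (content) =>
  `List the top 3 technical keywords from this article. ` +
  `Return only the keywords, comma-separated, with no explanation:\n\n${content}`;
```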
Step 4: Delivery
The enriched news data—combining original metadata with AI-generated summaries and keywords—is merged and sent to a NocoDB SQL database node. Records are written into predefined fields, enabling persistent archival and downstream analysis.
Use Cases
Scenario 1
Content managers need timely updates from news sites lacking RSS feeds. This workflow scrapes the latest posts, summarizes content, and extracts keywords, delivering structured data weekly. The result is a consistent feed of relevant news summaries ready for editorial review.
Scenario 2
Data analysts require consolidated, searchable news metadata. By filtering posts by date and enriching them with AI-generated summaries and keywords, this automation pipeline reduces manual curation effort and produces structured insights for trend analysis.
Scenario 3
IT teams managing content ingestion pipelines benefit from automated extraction and storage of news articles in SQL databases. This workflow ensures deterministic extraction steps and reliable data enrichment for integration with BI tools or content management systems.
How to use
Import the workflow into your n8n instance and configure credentials for OpenAI API and NocoDB database access. Adjust the schedule trigger to the preferred weekly interval. Verify and update CSS selectors if the news site’s HTML structure changes. Run the workflow manually once to validate extraction and data flow. Once active, the workflow will run automatically, producing weekly batches of summarized news posts with technical keywords, stored in the configured database.
Comparison — Manual Process vs. Automation Workflow
| Attribute | Manual/Alternative | This Workflow |
|---|---|---|
| Steps required | Multiple manual steps: browsing, copying links, summarizing, keyword extraction, and data entry. | Automates all steps from scraping to data storage in a single pipeline. |
| Consistency | Subject to human error and inconsistent summarization quality. | Deterministic extraction and AI-generated summaries ensure uniform output. |
| Scalability | Limited by manual effort; impractical for large volumes or frequent updates. | Scales automatically with scheduled triggers and batch processing. |
| Maintenance | High effort to maintain selectors and manual processes. | Requires occasional updates to CSS selectors and credential management only. |
Technical Specifications
| Environment | n8n automation platform with internet access |
|---|---|
| Tools / APIs | HTTP Request, HTML Extract, JavaScript, OpenAI GPT API, NocoDB API |
| Execution Model | Synchronous node execution with scheduled trigger |
| Input Formats | HTML pages of news listings and individual articles |
| Output Formats | Structured JSON objects stored in NocoDB SQL database |
| Data Handling | Transient processing with no persistence outside configured database |
| Known Constraints | Relies on stable CSS selectors and external API availability |
| Credentials | OpenAI API key, NocoDB API token |
Implementation Requirements
- Valid OpenAI API credentials with access to GPT model for summarization and keyword extraction.
- NocoDB API token configured with write permissions to the target SQL database table.
- Network access to the news website and external APIs (OpenAI, NocoDB) without firewall restrictions.
Configuration & Validation
- Set up API credentials for OpenAI and NocoDB within n8n credentials manager.
- Verify CSS selectors for link and date extraction match the current news site HTML structure.
- Run the workflow manually to confirm correct extraction, AI summarization, keyword generation, and database insertion.
Data Provenance
- Trigger: Scheduled cron node initiating weekly execution.
- Extraction Nodes: HTML Extract nodes using CSS selectors for links and dates.
- AI Processing: OpenAI GPT nodes for generating summaries and technical keywords.
FAQ
How is the News Extraction automation workflow triggered?
The workflow is triggered by a scheduled trigger node configured to run once per week at a specified day and time, initiating the entire extraction and processing pipeline.
Which tools or models does the orchestration pipeline use?
The pipeline integrates HTTP Request nodes for web scraping, HTML extraction nodes with CSS selectors, and OpenAI’s GPT model for AI-powered summarization and keyword extraction, all orchestrated within n8n.
What does the response look like for client consumption?
The output is a structured JSON object containing each news post’s title, publication date, link, AI-generated summary, and a list of three technical keywords, all stored in a NocoDB SQL database.
Is any data persisted by the workflow?
No data is persisted within the workflow itself. The final enriched news data is saved only in the configured NocoDB SQL database for persistent storage and downstream use.
How are errors handled in this integration flow?
The workflow relies on n8n’s default error handling mechanisms. No explicit retry or backoff strategies are configured within the nodes, so failures are logged and require manual intervention.
Conclusion
This News Extraction workflow provides a dependable, repeatable process to scrape, summarize, and keyword-extract recent news posts from sites lacking RSS feeds. By automating content ingestion and enrichment on a weekly schedule, it reduces manual overhead and delivers structured metadata suitable for databases and analytical systems. The workflow depends on stable web page structure and external API availability, which requires periodic validation to maintain accuracy over time. Overall, it offers a precise, scalable method to transform raw HTML news content into actionable insights.