Unified HTML/CSS/JS extractor that inlines styles and resolves script functions into a single JSON structure.

osmn-byhn/htmlparser

@osmn-byhn/htmlparser 🚀

Unify HTML, CSS, and JS into a single, element-centric JSON structure.

This library is designed for developers who need to extract data from HTML while preserving its visual and functional context. Unlike traditional parsers, @osmn-byhn/htmlparser inlines styles from <style> tags and resolves JavaScript event handlers (like onclick) into their actual function bodies.

🌟 Why use this?

  • Deep Extraction: Get more than the raw markup. Rules declared in <style> tags in the <head> are automatically mapped onto the elements they target in the <body>, so each element carries its own styles.
  • Function Intelligence: If an element has an onclick="doSomething()", this library searches the <script> tags, finds doSomething, and includes its full source code in the JSON entry for that element.
  • AI Friendly: The unified, self-contained JSON output is perfect for feeding into LLMs (Large Language Models) for UI analysis, code generation, or automated testing.
  • Zero Heavy Dependencies: Built with performance and simplicity in mind.
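
To make the "Function Intelligence" idea concrete, here is a minimal sketch of how an inline handler could be matched to a function declaration in script text. It is an illustration only, not the library's actual implementation: resolveHandler is a hypothetical helper, and the naive brace counting ignores braces inside strings and comments.

```typescript
// Hypothetical helper, NOT part of the @osmn-byhn/htmlparser API.
// Given an inline handler such as "sayHi()" and the text of a <script>,
// return the source of the matching function declaration, if any.
function resolveHandler(handler: string, scriptSource: string): string | undefined {
  // Treat the identifier before the first "(" as the function name.
  const nameMatch = /^\s*([A-Za-z_$][\w$]*)\s*\(/.exec(handler);
  if (!nameMatch) return undefined;
  const name = nameMatch[1] ?? "";

  // Locate "function <name>(" in the script text ("$" must be escaped in a RegExp).
  const escaped = name.replace(/\$/g, "\\$");
  const declMatch = new RegExp(`function\\s+${escaped}\\s*\\(`).exec(scriptSource);
  if (!declMatch) return undefined;

  // Starting at the first "{", count braces to find the end of the body.
  const start = declMatch.index;
  const open = scriptSource.indexOf("{", start);
  if (open === -1) return undefined;
  let depth = 0;
  for (let i = open; i < scriptSource.length; i++) {
    const ch = scriptSource[i];
    if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return scriptSource.slice(start, i + 1);
    }
  }
  return undefined; // unbalanced braces
}

const script = "function sayHi() { console.log('Hi!'); }";
const resolved = resolveHandler("sayHi()", script);
console.log(resolved);
```

A handler that names a function absent from the script yields undefined here. How the library itself reports unresolved handlers is not documented in this README.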

📦 Installation

npm install @osmn-byhn/htmlparser
# or
pnpm add @osmn-byhn/htmlparser
# or
yarn add @osmn-byhn/htmlparser

🚀 Quick Start

TypeScript

import { extractUnifiedFromHTML } from "@osmn-byhn/htmlparser";

const html = `
  <html>
    <head>
      <style>.btn { color: red; }</style>
    </head>
    <body>
      <button class="btn" onclick="sayHi()">Click Me</button>
      <script>function sayHi() { console.log('Hi!'); }</script>
    </body>
  </html>
`;

async function main() {
  const result = await extractUnifiedFromHTML(html);
  
  const button = result.body.children[0];
  console.log(button.inlineStyle); // { color: 'red' }
  console.log(button.events.click.function); // "function sayHi() { ... }"
}

main();

JavaScript (ES Modules)

import { extractUnifiedFromHTML } from "@osmn-byhn/htmlparser";

const result = await extractUnifiedFromHTML('<div>Hello</div>');
console.log(result.body);

JavaScript (CommonJS)

const { extractUnifiedFromHTML } = require("@osmn-byhn/htmlparser");

extractUnifiedFromHTML('<div>Hello</div>').then(result => {
  console.log(result.body);
});

🛠️ Output Structure

The output is a UnifiedExtraction object:

Field      Description
metadata   Stats: totalElements, maxDepth, totalTextNodes, etc.
body       The root UnifiedElement (usually the <body> tag).

The UnifiedElement object:

Every element in the tree has this structure:

{
  "tag": "div",
  "id": "main-container",
  "class": "active primitive",
  "attrs": { "data-custom": "value" },
  "inlineStyle": { 
    "color": "red", 
    "font-size": "16px" 
  },
  "events": {
    "click": {
      "handler": "myFunc()",
      "function": "function myFunc() { ... }"
    }
  },
  "children": [ ... ],
  "textContent": "Hello World"
}
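
For TypeScript users, the shape above can be sketched as an interface together with a small traversal helper. The type names mirror the JSON example, but which fields are optional is an assumption, and flatten is a hypothetical helper rather than something the package is known to export.

```typescript
// Types inferred from the JSON example; optionality is an assumption.
interface UnifiedEvent {
  handler: string;
  function?: string;
}

interface UnifiedElement {
  tag: string;
  id?: string;
  class?: string;
  attrs?: Record<string, string>;
  inlineStyle?: Record<string, string>;
  events?: Record<string, UnifiedEvent>;
  children?: UnifiedElement[];
  textContent?: string;
}

// Hypothetical helper: depth-first, pre-order list of every element in the tree.
function flatten(el: UnifiedElement, out: UnifiedElement[] = []): UnifiedElement[] {
  out.push(el);
  for (const child of el.children ?? []) flatten(child, out);
  return out;
}

// Hand-written sample data in the documented shape (not real library output).
const sample: UnifiedElement = {
  tag: "body",
  children: [
    {
      tag: "button",
      class: "btn",
      inlineStyle: { color: "red" },
      events: {
        click: { handler: "sayHi()", function: "function sayHi() { console.log('Hi!'); }" },
      },
      textContent: "Click Me",
    },
    { tag: "div", children: [{ tag: "span", textContent: "Hello" }] },
  ],
};

const tags = flatten(sample).map((el) => el.tag);
console.log(tags); // [ 'body', 'button', 'div', 'span' ]
```

Flattening the tree first keeps later filtering simple, for example selecting every element that has a click entry in events.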

🎯 Use Cases

  1. Web Scraping: Extract data from modern web pages while keeping the styling info associated with the data points.
  2. LLM / AI Processing: Convert messy HTML into a structured JSON format that AI can easily understand and reason about.
  3. UI-to-Code: Build tools that convert existing websites into React/Vue/Tailwind components by having all styles and logic per-element.
  4. Automated Audits: Programmatically check if elements have specific styles or correctly mapped event handlers.
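
As an illustration of the automated-audit use case, the sketch below walks a tree in the documented shape and reports every event whose handler has no resolved function body. AuditNode and findUnresolvedHandlers are hypothetical names, and the assumption that an unresolved handler simply lacks the function field is not confirmed by this README.

```typescript
// Only the fields this audit reads; names follow the JSON example in "Output Structure".
interface AuditNode {
  tag: string;
  events?: Record<string, { handler: string; function?: string }>;
  children?: AuditNode[];
}

// Hypothetical audit: one message per event that has a handler but no function source.
function findUnresolvedHandlers(node: AuditNode, path: string = node.tag): string[] {
  const issues: string[] = [];
  const events = node.events ?? {};
  for (const eventName of Object.keys(events)) {
    const ev = events[eventName];
    if (ev && !ev.function) {
      issues.push(`${path}: on${eventName}="${ev.handler}" has no resolved function`);
    }
  }
  // Recurse into children, extending the path so each issue can be located.
  (node.children ?? []).forEach((child, i) => {
    issues.push(...findUnresolvedHandlers(child, `${path} > ${child.tag}[${i}]`));
  });
  return issues;
}

// Hand-written sample data (not real library output).
const page: AuditNode = {
  tag: "body",
  children: [
    { tag: "button", events: { click: { handler: "sayHi()", function: "function sayHi() {}" } } },
    { tag: "a", events: { click: { handler: "track()" } } },
  ],
};

const issues = findUnresolvedHandlers(page);
console.log(issues);
```

The same traversal pattern could check inlineStyle values instead, for example flagging buttons that lack a required color.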

📜 License

MIT © osmn-byhn
