Python Web Extractor, Scraper & Crawler
Before you start, make sure the browser is configured properly.
Arkalos includes a powerful built-in WebExtractor tool that allows you to automate the process of gathering data from websites, whether it's downloading full articles, scraping specific information, or crawling multiple pages linked from a start page.
- Crawling means starting from a base URL and automatically following internal links to extract data from every page it finds.
- Scraping means extracting specific data (like a title or price) from a specific page or part of a page.
🕸️ Crawling a Website: Saving Web Pages in Markdown
Want to download a website’s content for offline reading or AI training? Here's how to crawl the Arkalos documentation and save each page in Markdown format:
```python
from arkalos.data.extractors import WebExtractor

arkalos = WebExtractor('https://arkalos.com', True, 'article')
arkalos.crawl()
```
This creates a folder at `data/drive/crawl/arkalos.com/` containing one Markdown file per page.
`WebExtractor` parameters:

- Base URL – Starting page.
- Trailing slash flag – Set to `True` if URLs on the site end with a slash (e.g. `/docs/`).
- CSS selector for content – For example `main`, or a specific `.class` or `#id` where the content is located.
Note: Arkalos waits for the page and main content area to fully load before scraping.

Calling `crawl()` starts from the base page, discovers internal links, and saves every visited page as Markdown.
🎯 Scraping Specific Pages Only
You don’t always need the whole website. If you want to scrape specific pages, pass a list of relative URLs. Arkalos will visit each one and extract the content using the same content selector.
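A minimal sketch of what that call can look like (the `crawlSpecific()` method name here is an assumption; check your Arkalos version for the exact API):

```python
from arkalos.data.extractors import WebExtractor

arkalos = WebExtractor('https://arkalos.com', True, 'article')

# Hypothetical method name – the actual API may differ.
arkalos.crawlSpecific([
    '/docs/installation/',
    '/docs/data-extractors/',
])
```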
🔍 Extracting Specific Details From Pages
Want to extract things like titles, prices, tags, or ratings from multiple articles on a page? You can create your own structured extractor.
Step 1: Define what data to extract
Here’s an example website that has this HTML:
```html
<article data-id="6IF3WGP8V">
    <a href="...">Title of the article</a>
    <div data-item="description">Article text</div>
    <div class="rating">
        <img ...>
        5
    </div>
    <div class="tags">
        <span data-item="tag">Category A</span>
        <span data-item="tag">Category B</span>
    </div>
</article>
```
Step 2: Create a `WebDetails` class
```python
from arkalos.data.extractors import WebExtractor, WebDetails, _
from dataclasses import dataclass
import polars as pl

@dataclass
class ArticleDetails(WebDetails):
    CONTAINER = 'article[data-id]'

    id: _[str, None, 'data-id']  # Attribute from container
    url: _[str, 'a', 'href']  # Link
    title: _[str, 'a']  # Text from <a>
    description: _[str, '[data-item="description"]']
    tags: _[list[str], '[data-item="tag"]']
    rating: _[int, '.rating', 1]  # Second child (after image)
```
The `_` is a special alias for `Annotated` that provides typing and selector info in one line.
Step 3: Create your custom extractor
```python
class MyWebsiteWebExtractor(WebExtractor):
    BASE_URL = 'https://mywebsite.com'
    PAGE_CONTENT_SELECTOR = 'main'
    SCROLL = True
    DETAILS = ArticleDetails

    async def crawlTechArticles(self):
        return await self.crawlSpecificDetails(['/category/tech'])
```
Step 4: Run your crawler
```python
import polars as pl

from app.data.extractors.my_website_web_extractor import MyWebsiteWebExtractor

mywebsite = MyWebsiteWebExtractor()
data = await mywebsite.crawlTechArticles()  # top-level await works in a notebook

df = pl.DataFrame(data)
df
```
You can view the results directly in VS Code using the Data Wrangler extension.
🧩 WebDetails and Annotations
To extract structured data from pages, you define a subclass of `WebDetails` and annotate each property using a custom annotation format. These annotations tell the scraper what to extract and how.
The annotation format is `_[data_type, 'CSS selector', optional_extraction_instruction]`:

- `data_type`: The expected Python type of the result. This can be `str`, `int`, `float`, or `list[str]` – to extract multiple values as a list using `querySelectorAll()`.
- `'CSS selector'`: A standard CSS selector string for locating the desired element inside the container. If set to `None`, the container itself will be used.
- `optional_extraction_instruction`: Tells the scraper how to extract the data. It could be:
    - An attribute name (e.g. `'href'`)
    - A `slice`, like `1:` or `:3`
    - A regular expression string with one capture group
    - An integer – to get a specific child node (e.g. the second text node)
Default Behavior:

If you only provide the data type and a CSS selector – without any extras – the text content of the element is extracted using `.textContent`.

If the CSS selector doesn't match anything inside the container, the default value for the data type is returned (`""`, `0`, etc.).
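For instance, a minimal sketch based on the `ArticleDetails` example above (the `summary` field and its `.summary` selector are hypothetical):

```python
from dataclasses import dataclass
from arkalos.data.extractors import WebDetails, _

@dataclass
class DefaultsExample(WebDetails):  # illustrative class
    CONTAINER = 'article[data-id]'

    title: _[str, 'a']           # no 3rd argument: the <a> element's .textContent
    summary: _[str, '.summary']  # hypothetical selector; no match -> "" (default for str)
```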
Supported Extraction Instructions (the 3rd argument)

- Attribute Name: Extracts the value of an attribute, e.g. the `href` attribute from the `<a>` element.
- Slice: If `.price` contains text like `"$99"`, the slice will extract `"99"` and convert it to `int`.
- Regular Expression: Extracts the matched capture group from a string like `"Product ID-ABC12345"`.
- Child Node Index: Useful when the target number is inside a sibling text node, and you want to skip over images or other tags; see the combined sketch after this list.
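All four instruction kinds in one sketch, reusing the sample markup above (the `.price` and `.product` selectors and their values are hypothetical):

```python
from dataclasses import dataclass
from arkalos.data.extractors import WebDetails, _

@dataclass
class InstructionExamples(WebDetails):  # illustrative class
    CONTAINER = 'article[data-id]'

    url: _[str, 'a', 'href']                      # attribute name: the <a> element's href value
    price: _[int, '.price', 1:]                   # slice: "$99" -> "99" -> 99 (hypothetical selector)
    product_id: _[str, '.product', r'ID-(\w+)']   # regex capture group: "Product ID-ABC12345" -> "ABC12345" (hypothetical)
    rating: _[int, '.rating', 1]                  # child node index: second child, skipping the <img>
```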
Lists with querySelectorAll()

When the type is `list[str]`, `querySelectorAll()` is used instead of `querySelector()`, and all matched elements’ text contents are extracted into a list:
If You Want to Extract from the Container Itself

Set the selector to `None` and use an attribute name or other instruction:
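For example, reading the `data-id` attribute from the `<article>` container in the sample above:

```python
from dataclasses import dataclass
from arkalos.data.extractors import WebDetails, _

@dataclass
class ContainerExample(WebDetails):  # illustrative class
    CONTAINER = 'article[data-id]'

    id: _[str, None, 'data-id']  # read data-id from the container element itself
```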
🧱 WebExtractor Constants
Your custom `WebExtractor` can set these config values:

Required:

- `BASE_URL`: Starting point for crawling
- `PAGE_CONTENT_SELECTOR`: The main content selector

Optional:

- `BROWSER_TYPE`: Choose browser type (default is `EDGE_HEADLESS`)
- `TRAIL_SLASH`: Add a trailing slash to URLs (default `False`)
- `WAIT_FOR_SELECTOR`: Wait for this selector to appear before scraping
- `CLICK_TEXT`: Auto-click this button (e.g. cookie popup) before scraping
- `SCROLL`: Scroll the page to load dynamic content (default `False`)

For structured details:

- `DETAILS`: Your custom `WebDetails` class; or
- `JSON_DETAILS`: JS code to extract data from JSON blobs on the page
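A sketch combining several of the optional constants; the site URL, selector, and button text are invented for illustration:

```python
from arkalos.data.extractors import WebExtractor

class NewsWebExtractor(WebExtractor):  # hypothetical extractor
    BASE_URL = 'https://news.example.com'
    PAGE_CONTENT_SELECTOR = 'main'
    TRAIL_SLASH = True                   # site URLs end with "/"
    WAIT_FOR_SELECTOR = '.article-body'  # wait for this element before scraping
    CLICK_TEXT = 'Accept all'            # dismiss the cookie popup first
    SCROLL = True                        # scroll to load lazy content
```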
📦 Scraping JSON Data From Web Pages
Some websites store content as embedded JSON inside `<script>` tags. You can extract it using JavaScript instead of HTML selectors:
```python
class MyWebsiteWebExtractor(WebExtractor):
    BASE_URL = 'https://mywebsite.com'
    PAGE_CONTENT_SELECTOR = 'main'
    JSON_DETAILS = 'JSON.stringify(JSON.parse(document.querySelector("script#__DATA__").textContent).props.articles)'

    async def crawlTechArticles(self):
        return await self.crawlSpecificDetailsJSON(['/category/tech'])
```
Use `crawlSpecificDetailsJSON()` instead of `crawlSpecificDetails()` when using JSON extraction.
🧾 Get Raw HTML or a Parsed HTML Document
Arkalos can fetch the raw HTML or a parsed document using `HTMLDocumentParser`.
```python
html = await extractor.getPageHTML(url)
doc = await extractor.getPageHTMLDoc(url)
```
The `doc` object has many useful properties:
- `doc.url`
- `doc.baseURL`
- `doc.domain`
- `doc.path`
- `doc.charset`
- `doc.title`
- `doc.meta`
- `doc.metaProps`
- `doc.links`
- `doc.pageLinks`
- `doc.internalPageLinks`
- `doc.externalPageLinks`
- `doc.cleanHTML`
- `doc.contentHTML`
- `doc.markdown`
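For instance, a quick sketch (assuming `extractor` is any configured `WebExtractor` instance, as above; the URL is illustrative):

```python
doc = await extractor.getPageHTMLDoc('https://arkalos.com/docs/')

print(doc.title)              # the page <title>
print(doc.internalPageLinks)  # links pointing to the same domain
markdown = doc.markdown       # cleaned page content converted to Markdown
```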
You can also import and use `HTMLDocumentParser` directly. It uses BeautifulSoup4 (with lxml) and markdownify for the conversion to clean Markdown.