Extraction Rules
Customize your response by adding extraction rules.
WebScrapingAPI allows you to extract specific sections of a webpage by using the extract_rules parameter.
This parameter's value can be a string (a CSS selector or an XPath expression) or a stringified object. In the second case, the object accepts the options selector, selector_type, output, all, and clean, whose types and descriptions are listed on this page.
A full example of how this parameter would look in production is:
or:
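As a minimal sketch of the two forms described above, the Python snippet below builds one extract_rules value where a field maps to a plain selector string and another where a field maps to an options object, then stringifies both. It assumes, as the examples on this page suggest, that each key of the object is a field name of your choosing. The field names (title, links) and the selectors are illustrative assumptions rather than production values.

```python
import json

# Shorthand form: the field maps directly to a selector string
# (CSS here; an XPath expression would start with "/").
shorthand_rules = {"title": ".title"}

# Object form: the field maps to an options object.
# Only "selector" is required; the other options are optional.
object_rules = {
    "links": {
        "selector": "a",
        "output": "@href",  # return the href attribute
        "all": 1,           # return every match, not just the first
    }
}

# The parameter is passed as a string, so serialize each object to JSON.
shorthand_param = json.dumps(shorthand_rules)
object_param = json.dumps(object_rules)

print(shorthand_param)
print(object_param)
```

Either string still has to be URL-encoded before it is placed in the query string.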
GET https://api.webscrapingapi.com/v2
The following examples show how the extract_rules parameter is used to extract specific elements from the targeted website.
The full GET request for the extract_rules parameter should be:
Important! The url and extract_rules parameters have to be URL-encoded (e.g. &url=https%3A%2F%2Fwww.webscrapingapi.com%2F&extract_rules=%7B%22title%22%3A%20%7B%22selector... ).
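The encoding requirement above can be sketched in Python with the standard library. This is one possible way to build the request URL, not an official client: the helper name build_request_url is hypothetical, and the rule object is a shortened illustration. The endpoint and parameter names (api_key, url, extract_rules) are the ones given on this page.

```python
import json
from urllib.parse import quote, urlencode

API_ENDPOINT = "https://api.webscrapingapi.com/v2"


def build_request_url(api_key: str, target_url: str, rules: dict) -> str:
    """Return a GET URL with url and extract_rules percent-encoded."""
    params = {
        "api_key": api_key,
        "url": target_url,
        # Stringify the rules object before encoding it.
        "extract_rules": json.dumps(rules),
    }
    # quote_via=quote encodes spaces as %20 (instead of +) and also
    # encodes "/" and ":", which matches the encoded sample above.
    return f"{API_ENDPOINT}?{urlencode(params, quote_via=quote)}"


rules = {"title": {"selector": "h1", "output": "html"}}
request_url = build_request_url(
    "<YOUR_API_KEY>", "https://www.webscrapingapi.com/", rules
)
print(request_url)
```

The resulting string can be passed to any HTTP client. A real API key has to replace the placeholder before the request is sent.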
extract_rules object options
Parameter | Type | Description |
---|---|---|
selector (required) | string | The CSS selector or the XPath. |
selector_type | string | The type of the selector option. Accepted values are css and xpath. The default value is xpath if the selector option starts with /, and css otherwise. |
output | string | The output format of the selected element. Accepted values are: html (returns HTML format); text (default, returns text format); @[attr] (returns the attribute of the element); table_json (returns the JSON format of a table); table_array (returns the array format of a table); another extract_rules object (used to parse nested elements). |
all | int | When set to 1, returns all matching elements; when set to 0, returns only the first match. The default value for this parameter is "1". |
clean | int | Removes leading and trailing white spaces, line terminator characters, and newlines from the result. The default value for this parameter is "1". |

Request parameters (GET https://api.webscrapingapi.com/v2)
Name | Type | Description |
---|---|---|
api_key* | String | <YOUR_API_KEY> |
url* | String | https://webscrapingapi.com |
extract_rules | Object | {"title": {"selector": "h1", "output": "html"}, "subtitle": {"selector": ".font-light.max-w-6x", "output": "text"}} |

extract_rules object examples
Here are more examples that should help you better understand what the object passed to the extract_rules parameter should look like. Each example lists an HTML sample, the extraction rule, a description of the rule, and the JSON output.

HTML Sample:
    <div class="title">
      This is my title
    </div>
Extraction Rule:
    {"title": ".title"}
Rule Description: Return the text content of the elements having the CSS class .title
JSON Output:
    {
      "title": [
        "This is my title"
      ]
    }

HTML Sample:
    <div>
      <a href="https://www.webscrapingapi.com/product/">
        Product
      </a>
      <a href="https://www.webscrapingapi.com/pricing/">
        Pricing
      </a>
    </div>
Extraction Rule:
    {
      "links": {
        "selector": "a",
        "output": "@href",
        "all": "1"
      }
    }
Rule Description: Return the href attribute of all links on the page
JSON Output:
    {
      "links": [
        "https://www.webscrapingapi.com/product/",
        "https://www.webscrapingapi.com/pricing/"
      ]
    }

HTML Sample:
    <div>
      <img src="https://www.webscrapingapi.com/assets/images/icons/full.svg?v=41d081a6f0">
    </div>
Extraction Rule:
    {
      "image": {
        "selector": "img",
        "output": "@src",
        "all": 0
      }
    }
Rule Description: Return the src attribute of the first image available on the page
JSON Output:
    {
      "image": [
        "https://www.webscrapingapi.com/assets/images/icons/full.svg?v=41d081a6f0"
      ]
    }

HTML Sample:
    <table class="ants">
      <thead>
        <tr>
          <th>Region</th>
          <th>No. species</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Europe</td>
          <td>180</td>
        </tr>
      </tbody>
    </table>
Extraction Rule:
    {
      "table": {
        "selector": ".ants",
        "output": "table_json",
        "all": 0
      }
    }
Rule Description: Return the JSON format of the first table having the CSS class .ants
JSON Output:
    {
      "table": [
        {
          "Region": "Europe",
          "No. species": "180"
        }
      ]
    }

HTML Sample:
    <table class="ants">
      <thead>
        <tr>
          <th>Region</th>
          <th>No. species</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Europe</td>
          <td>180</td>
        </tr>
      </tbody>
    </table>
Extraction Rule:
    {
      "table": {
        "selector": ".ants",
        "output": "table_array",
        "all": 0
      }
    }
Rule Description: Return the array format of the first table having the CSS class .ants
JSON Output:
    {
      "table": [
        ["Europe", "180"]
      ]
    }
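
To make the difference between table_json and table_array concrete, the sketch below rebuilds both output shapes locally from the header and body rows of the sample table. It only mirrors the outputs shown in the two table examples above. It is an assumption about the shapes, not a description of how the API computes them, and the function names are hypothetical.

```python
def to_table_array(body_rows):
    """table_array shape: one list of cell texts per body row."""
    return [list(row) for row in body_rows]


def to_table_json(header, body_rows):
    """table_json shape: one object per body row, keyed by header cell."""
    return [dict(zip(header, row)) for row in body_rows]


header = ["Region", "No. species"]
body_rows = [["Europe", "180"]]

print(to_table_array(body_rows))
print(to_table_json(header, body_rows))
```

In this sketch the array form carries no column names, so table_json is the more convenient choice when the consumer needs to look cells up by header.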

HTML Sample:
    <ul>
      <li>
        <p class="name">Item1</p>
        <p class="price">100</p>
      </li>
      <li>
        <p class="name">Item2</p>
        <p class="price">1000</p>
      </li>
    </ul>
Extraction Rule:
    {
      "items": {
        "selector": "li",
        "output": {
          "name": {
            "selector": ".name",
            "all": 0
          },
          "price": {
            "selector": ".price",
            "all": 0
          }
        },
        "all": 1
      }
    }
Rule Description: Return the name and the price of each list item.
JSON Output:
    {
      "items": [
        {
          "name": "Item1",
          "price": "100"
        },
        {
          "name": "Item2",
          "price": "1000"
        }
      ]
    }
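
Because a single missing brace or stray comma makes a rule object invalid, it can help to check the object locally before sending it. The following is a minimal client-side sketch that assumes only the options described on this page: selector is required, selector_type is css or xpath with the documented default, and output may hold a nested rules object. The helper names infer_selector_type, validate_rules, and parse_and_validate are hypothetical and not part of WebScrapingAPI.

```python
import json

ALLOWED_SELECTOR_TYPES = {"css", "xpath"}


def infer_selector_type(selector: str) -> str:
    """Documented default: xpath if the selector starts with "/", else css."""
    return "xpath" if selector.startswith("/") else "css"


def validate_rules(rules: dict, path: str = "extract_rules") -> None:
    """Raise ValueError if a rule does not match the documented options."""
    for name, rule in rules.items():
        where = f"{path}.{name}"
        if isinstance(rule, str):
            continue  # shorthand: the value is the selector itself
        if not isinstance(rule, dict):
            raise ValueError(f"{where}: expected a selector string or an object")
        selector = rule.get("selector")
        if not isinstance(selector, str) or not selector:
            raise ValueError(f"{where}: 'selector' is required")
        selector_type = rule.get("selector_type", infer_selector_type(selector))
        if selector_type not in ALLOWED_SELECTOR_TYPES:
            raise ValueError(f"{where}: selector_type must be css or xpath")
        output = rule.get("output")
        if isinstance(output, dict):
            validate_rules(output, f"{where}.output")  # nested rules


def parse_and_validate(raw: str) -> dict:
    """Parse a stringified extract_rules object, then validate it."""
    rules = json.loads(raw)  # malformed JSON raises json.JSONDecodeError
    if not isinstance(rules, dict):
        raise ValueError("extract_rules: expected a JSON object")
    validate_rules(rules)
    return rules


nested = (
    '{"items": {"selector": "li", "output": {'
    '"name": {"selector": ".name", "all": 0}, '
    '"price": {"selector": ".price", "all": 0}}, "all": 1}}'
)
print(list(parse_and_validate(nested)["items"]["output"]))
```

json.JSONDecodeError is a subclass of ValueError, so a caller can catch ValueError once to handle both malformed JSON and rules that miss a required option.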