ada-url

The urllib.parse module in Python follows neither the legacy RFC 3986 standard nor the newer WHATWG URL specification. It is also relatively slow.

This is ada_url, a fast standard-compliant Python library for working with URLs based on the Ada URL parser.

Installation

Install from PyPI:

pip install ada_url

Usage examples

Parsing URLs

The URL class is intended to match the one described in the WHATWG URL spec.

>>> from ada_url import URL
>>> urlobj = URL('https://example.org/path/../file.txt')
>>> urlobj.href
'https://example.org/path/file.txt'

The parse_url function returns a dictionary of all URL elements:

>>> from ada_url import parse_url
>>> parse_url('https://user:pass@example.org:80/api?q=1#2')
{
    'href': 'https://user:pass@example.org:80/api?q=1#2',
    'username': 'user',
    'password': 'pass',
    'protocol': 'https:',
    'port': '80',
    'hostname': 'example.org',
    'host': 'example.org:80',
    'pathname': '/api',
    'search': '?q=1',
    'hash': '#2',
    'origin': 'https://example.org:80',
    'host_type': <HostType.DEFAULT: 0>,
    'scheme_type': <SchemeType.HTTPS: 2>
}

Altering URLs

Replacing URL components with the URL class:

>>> from ada_url import URL
>>> urlobj = URL('https://example.org/path/../file.txt')
>>> urlobj.host = 'example.com'
>>> urlobj.href
'https://example.com/file.txt'

Replacing URL components with the replace_url function:

>>> from ada_url import replace_url
>>> replace_url('https://example.org/path/../file.txt', host='example.com')
'https://example.com/file.txt'

Search parameters

The URLSearchParams class is intended to match the one described in the WHATWG URL spec.

>>> from ada_url import URLSearchParams
>>> obj = URLSearchParams('key1=value1&key2=value2')
>>> list(obj.items())
[('key1', 'value1'), ('key2', 'value2')]

The parse_search_params function returns a dictionary of search keys mapped to value lists:

>>> from ada_url import parse_search_params
>>> parse_search_params('key1=value1&key2=value2')
{'key1': ['value1'], 'key2': ['value2']}

Internationalized domain names

The idna class can encode and decode IDNs:

>>> from ada_url import idna
>>> idna.encode('Bücher.example')
b'xn--bcher-kva.example'
>>> idna.decode(b'xn--bcher-kva.example')
'bücher.example'

WHATWG URL compliance

This library is compliant with the WHATWG URL spec. This means, among other things, that it properly encodes IDNs and resolves paths:

>>> from ada_url import URL
>>> parsed_url = URL('https://www.GOoglé.com/./path/../path2/')
>>> parsed_url.hostname
'www.xn--googl-fsa.com'
>>> parsed_url.pathname
'/path2/'

Contrast that with the Python standard library’s urllib.parse module:

>>> from urllib.parse import urlparse
>>> parsed_url = urlparse('https://www.GOoglé.com/./path/../path2/')
>>> parsed_url.hostname
'www.googlé.com'
>>> parsed_url.path
'/./path/../path2/'
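
To make the gap concrete, the sketch below uses only the standard library and shows the extra steps needed to approximate the normalized result. It is an illustration, not an equivalent of ada_url: the built-in idna codec implements IDNA 2003 rather than UTS #46, and posixpath.normpath is only a rough stand-in for WHATWG path resolution.

```python
import posixpath
from urllib.parse import urlsplit

raw = 'https://www.GOoglé.com/./path/../path2/'
parts = urlsplit(raw)

# urlsplit lowercases the hostname but keeps the non-ASCII label as-is.
stdlib_hostname = parts.hostname

# The built-in idna codec (IDNA 2003) has to be applied by hand.
ascii_hostname = stdlib_hostname.encode('idna').decode('ascii')

# Dot segments are kept verbatim in the path.
stdlib_path = parts.path

# normpath collapses the dot segments but drops the trailing slash,
# so the slash is restored here.
resolved_path = posixpath.normpath(stdlib_path)
if stdlib_path.endswith('/') and not resolved_path.endswith('/'):
    resolved_path += '/'

print(stdlib_hostname)
print(ascii_hostname)
print(stdlib_path)
print(resolved_path)
```

Each of these manual steps is a place where behavior can drift from what browsers do, which is the motivation for using a parser that implements the WHATWG algorithm directly.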

Alternative Python bindings

This package uses CFFI to call the Ada library’s functions, which has a performance cost. The alternative can_ada (Canadian Ada) package uses pybind11 to generate a Python extension module, which is more performant.

Building from source

You will need to have Python 3 development files installed. On macOS, you will have these if you installed Python with brew. On Linux, you may need to install some packages (e.g., python3-dev and python3-venv).

You will also need a C++ toolchain. On macOS, Xcode will provide this for you. On Linux, you may need to install some more packages (e.g., build-essential).

Clone the git repository to a directory for development:

git clone https://github.com/ada-url/ada-python.git ada_url_python
cd ada_url_python

Create a virtual environment to use for building:

python3 -m venv env
source ./env/bin/activate

After that, you’re ready to build the package:

python -m pip install -r requirements/development.txt
c++ -c "ada_url/ada.cpp" -fPIC -std="c++17" -O2 -o "ada_url/ada.o"
python -m build --no-isolation

This will create a .whl file in the dist directory. You can install it in other virtual environments on the same machine.

To run tests, first build a package. Then:

python -m pip install -e .
python -m unittest

Leave the virtual environment with the deactivate command.

API Documentation

class ada_url.URL(url, base=None)

Parses a url (with an optional base) according to the WHATWG URL parsing standard.

>>> from ada_url import URL
>>> old_url = 'https://example.org:443/file.txt?q=1'
>>> urlobj = URL(old_url)
>>> urlobj.host
'example.org'
>>> urlobj.host = 'example.com'
>>> new_url = urlobj.href
>>> new_url
'https://example.com/file.txt?q=1'

You can read and write the following attributes:

  • href

  • protocol

  • username

  • password

  • host

  • hostname

  • port

  • pathname

  • search

  • hash

You can additionally read these attributes:

  • origin, which will be a str

  • host_type, which will be a HostType enum

  • scheme_type, which will be a SchemeType enum

The class also exposes a static method that checks whether the input url (and optional base) can be parsed:

>>> url = 'file_2.txt'
>>> base = 'https://example.org:443/file_1.txt'
>>> URL.can_parse(url, base)
True
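
One way to use can_parse is as a guard before constructing a URL object. The helper below is a minimal sketch: resolve_href is a hypothetical name, not part of ada_url, and the import is wrapped so that the snippet still loads when ada_url is not installed (in which case every call returns None).

```python
try:
    from ada_url import URL
except ImportError:
    # Assumption for this sketch only: tolerate a missing ada_url install.
    URL = None


def resolve_href(url, base=None):
    """Return the normalized href, or None if the input cannot be parsed.

    Hypothetical helper; it relies only on URL.can_parse and URL.href.
    """
    if URL is None:
        return None
    # Pass base only when it was given, mirroring URL(url, base=None).
    args = (url,) if base is None else (url, base)
    if not URL.can_parse(*args):
        return None
    return URL(*args).href


# A relative reference needs a base to be parseable.
print(resolve_href('file_2.txt', 'https://example.org:443/file_1.txt'))
print(resolve_href('file_2.txt'))
```

Checking first keeps the error handling out of the calling code. Catching the exception raised by the constructor would be an alternative design.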

See the WHATWG docs for more details on the URL class.

class ada_url.HostType

Enum for URL host types:

  • DEFAULT hosts like https://example.org are 0.

  • IPV4 hosts like https://192.0.2.1 are 1.

  • IPV6 hosts like https://[2001:db8::] are 2.

>>> from ada_url import HostType
>>> HostType.DEFAULT
<HostType.DEFAULT: 0>

class ada_url.SchemeType

Enum for URL scheme types.

  • HTTP URLs like http://example.org are 0.

  • NOT_SPECIAL URLs like git://example.org are 1.

  • HTTPS URLs like https://example.org are 2.

  • WS URLs like ws://example.org are 3.

  • FTP URLs like ftp://example.org are 4.

  • WSS URLs like wss://example.org are 5.

  • FILE URLs like file://example are 6.

>>> from ada_url import SchemeType
>>> SchemeType.HTTPS
<SchemeType.HTTPS: 2>

ada_url.check_url(s)

Returns True if s represents a valid URL, and False otherwise.

>>> from ada_url import check_url
>>> check_url('bogus')
False
>>> check_url('http://a/b/c/d;p?q')
True

ada_url.join_url(base_url, s)

Return the URL that results from joining base_url to s. Raises ValueError if no valid URL can be constructed.

>>> from ada_url import join_url
>>> base_url = 'http://a/b/c/d;p?q'
>>> join_url(base_url, '../g')
'http://a/b/g'

ada_url.normalize_url(s)

Returns a “normalized” URL with all '.' and '..' path segments resolved.

>>> from ada_url import normalize_url
>>> normalize_url('http://a/b/c/../g')
'http://a/b/g'

ada_url.parse_url(s[, attributes])

Returns a dictionary with the parsed components of the URL represented by s.

>>> from ada_url import parse_url
>>> url = 'https://user_1:password_1@example.org:8080/dir/../api?q=1#frag'
>>> parse_url(url)
{
    'href': 'https://user_1:password_1@example.org:8080/api?q=1#frag',
    'username': 'user_1',
    'password': 'password_1',
    'protocol': 'https:',
    'host': 'example.org:8080',
    'port': '8080',
    'hostname': 'example.org',
    'pathname': '/api',
    'search': '?q=1',
    'hash': '#frag',
    'origin': 'https://example.org:8080',
    'host_type': <HostType.DEFAULT: 0>,
    'scheme_type': <SchemeType.HTTPS: 2>
}

The names of the dictionary keys correspond to the components of the “URL class” in the WHATWG URL spec. host_type is a HostType enum. scheme_type is a SchemeType enum.

Pass in a sequence of attributes to limit which keys are returned.

>>> from ada_url import parse_url
>>> url = 'https://user_1:password_1@example.org:8080/dir/../api?q=1#frag'
>>> parse_url(url, attributes=('protocol',))
{'protocol': 'https:'}

Unrecognized attributes are ignored.

ada_url.replace_url(s, **kwargs)

Start with the URL represented by s, replace the attributes given in the kwargs mapping, and return a normalized URL with the result.

Provide an empty string to unset an attribute.

>>> from ada_url import replace_url
>>> base_url = 'https://user_1:password_1@example.org/resource'
>>> replace_url(base_url, username='user_2', password='', protocol='http:')
'http://user_2@example.org/resource'

Unrecognized attributes are ignored. href is replaced first if it is given. hostname is replaced before host if both are given.

ValueError is raised if the input URL or one of the components is not valid.
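
Because replace_url raises ValueError for invalid input, callers that handle untrusted URLs may want a wrapper that turns the exception into a sentinel value. The following is a minimal sketch: try_replace is a hypothetical name, and the guarded import is only there so the snippet loads without ada_url installed.

```python
try:
    from ada_url import replace_url
except ImportError:
    # Assumption for this sketch only: tolerate a missing ada_url install.
    replace_url = None


def try_replace(url, **components):
    """Return the rewritten URL, or None if the URL or a component is invalid.

    Hypothetical helper built on the documented ValueError behavior.
    """
    if replace_url is None:
        return None
    try:
        return replace_url(url, **components)
    except ValueError:
        return None


print(try_replace('https://example.org/path/../file.txt', host='example.com'))
print(try_replace('bogus', host='example.com'))
```

Returning None loses the reason for the failure. Code that needs to report errors to users should let the ValueError propagate instead.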


class ada_url.URLSearchParams(params)

Parses the given params string according to the WHATWG URL parsing standard.

The attribute and methods from the standard are implemented:

>>> from ada_url import URLSearchParams
>>> obj = URLSearchParams('key1=value1&key2=value2&key2=value3')
>>> obj.size
3
>>> obj.append('key2', 'value4')
>>> str(obj)
'key1=value1&key2=value2&key2=value3&key2=value4'
>>> obj.delete('key1')
>>> str(obj)
'key2=value2&key2=value3&key2=value4'
>>> obj.delete('key2', 'value2')
>>> str(obj)
'key2=value3&key2=value4'
>>> obj.get('key2')
'value3'
>>> obj.get_all('key2')
['value3', 'value4']
>>> obj.has('key2')
True
>>> obj.has('key2', 'value5')
False
>>> obj.set('key1', 'value6')
>>> str(obj)
'key2=value3&key2=value4&key1=value6'
>>> obj.sort()
>>> str(obj)
'key1=value6&key2=value3&key2=value4'

Iterators for the keys, values, and items are also implemented:

>>> obj = URLSearchParams('key1=value1&key2=value2&key2=value3')
>>> list(obj.keys())
['key1', 'key2', 'key2']
>>> list(obj.values())
['value1', 'value2', 'value3']
>>> list(obj.items())
[('key1', 'value1'), ('key2', 'value2'), ('key2', 'value3')]

See the WHATWG docs for more details on the URLSearchParams class.

ada_url.parse_search_params(s)

Returns a dictionary representing the parsed URL Parameters specified by s. The returned dictionary maps each key to a list of values associated with it.

>>> from ada_url import parse_search_params
>>> parse_search_params('key1=value1&key1=value2&key2=value3')
{'key1': ['value1', 'value2'], 'key2': ['value3']}

ada_url.replace_search_params(s, *args)

Returns a string representing the URL parameters specified by s, modified by the (key, value) pairs passed in as args.

>>> from ada_url import replace_search_params
>>> replace_search_params(
...     'key1=value1&key1=value2',
...     ('key1', 'value3'),
...     ('key2', 'value4')
... )
'key1=value3&key2=value4'
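
The example above is consistent with each (key, value) pair being applied like URLSearchParams.set: the first occurrence of the key is overwritten, later occurrences are dropped, and unknown keys are appended. The standard-library model below sketches that reading. It is an assumption based on the single example shown, not the library's implementation, and replace_params_model is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode


def replace_params_model(s, *pairs):
    """Stdlib model of set-style replacement (assumed semantics)."""
    params = parse_qsl(s, keep_blank_values=True)
    for key, value in pairs:
        kept = []
        placed = False
        for k, v in params:
            if k != key:
                kept.append((k, v))
            elif not placed:
                # Overwrite the first occurrence of the key in place.
                kept.append((key, value))
                placed = True
            # Later occurrences of the key are dropped.
        if not placed:
            # The key was not present, so append it at the end.
            kept.append((key, value))
        params = kept
    return urlencode(params)


result = replace_params_model(
    'key1=value1&key1=value2',
    ('key1', 'value3'),
    ('key2', 'value4'),
)
print(result)
```

For the input shown, the model produces the same string as the documented output. Percent-encoding details may differ between urlencode and the WHATWG serializer for other inputs.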

class ada_url.idna

Process international domains according to the UTS #46 standard.

idna.encode() implements the UTS #46 ToASCII operation. Its output is a Python bytes object. It is also available as idna_to_ascii().

>>> from ada_url import idna
>>> idna.encode('meßagefactory.ca')
b'xn--meagefactory-m9a.ca'

idna.decode() implements the UTS #46 ToUnicode operation. Its output is a Python str object. It is also available as idna_to_unicode().

>>> from ada_url import idna
>>> idna.decode('xn--meagefactory-m9a.ca')
'meßagefactory.ca'

Both functions accept either str or bytes objects as input.
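
For comparison, Python's built-in idna codec implements the older IDNA 2003 rules rather than UTS #46, so the two can disagree. The sketch below uses only the standard library and shows one name where the results match the examples in this document and one where they differ.

```python
# Python's built-in codec follows IDNA 2003, which maps 'ß' to 'ss'
# before encoding, so no Punycode label is produced for this name.
legacy_sharp_s = 'meßagefactory.ca'.encode('idna')

# For a name without such mapped characters, the codec produces the
# same bytes as the idna.encode() example shown earlier.
legacy_umlaut = 'Bücher.example'.encode('idna')

print(legacy_sharp_s)
print(legacy_umlaut)
```

The first result is a different domain name from b'xn--meagefactory-m9a.ca', which is why the choice of IDNA standard matters when hostnames are compared or resolved.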