Fonction read_html - module pandas

Signature de la fonction read_html

def read_html(io: 'FilePath | ReadBuffer[str]', *, match: 'str | Pattern' = '.+', flavor: 'HTMLFlavors | Sequence[HTMLFlavors] | None' = None, header: 'int | Sequence[int] | None' = None, index_col: 'int | Sequence[int] | None' = None, skiprows: 'int | Sequence[int] | slice | None' = None, attrs: 'dict[str, str] | None' = None, parse_dates: 'bool' = False, thousands: 'str | None' = ',', encoding: 'str | None' = None, decimal: 'str' = '.', converters: 'dict | None' = None, na_values: 'Iterable[object] | None' = None, keep_default_na: 'bool' = True, displayed_only: 'bool' = True, extract_links: "Literal[None, 'header', 'footer', 'body', 'all']" = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, storage_options: 'StorageOptions' = None) -> 'list[DataFrame]'

Description

help(pandas.read_html)

Read HTML tables into a ``list`` of ``DataFrame`` objects.

Parameters
----------
io : str, path object, or file-like object
    String, path object (implementing ``os.PathLike[str]``), or file-like
    object implementing a string ``read()`` function.
    The string can represent a URL or the HTML itself. Note that
    lxml only accepts the http, ftp and file url protocols. If you have a
    URL that starts with ``'https'`` you might try removing the ``'s'``.

    .. deprecated:: 2.1.0
        Passing html literal strings is deprecated.
        Wrap literal string/bytes input in ``io.StringIO``/``io.BytesIO`` instead.

match : str or compiled regular expression, optional
    The set of tables containing text matching this regex or string will be
    returned. Unless the HTML is extremely simple you will probably need to
    pass a non-empty string here. Defaults to '.+' (match any non-empty
    string). The default value will return all tables contained on a page.
    This value is converted to a regular expression so that there is
    consistent behavior between Beautiful Soup and lxml.

flavor : {"lxml", "html5lib", "bs4"} or list-like, optional
    The parsing engine (or list of parsing engines) to use. 'bs4' and
    'html5lib' are synonymous with each other, they are both there for
    backwards compatibility. The default of ``None`` tries to use ``lxml``
    to parse and if that fails it falls back on ``bs4`` + ``html5lib``.

header : int or list-like, optional
    The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
    make the columns headers.

index_col : int or list-like, optional
    The column (or list of columns) to use to create the index.

skiprows : int, list-like or slice, optional
    Number of rows to skip after parsing the column integer. 0-based. If a
    sequence of integers or a slice is given, will skip the rows indexed by
    that sequence.  Note that a single element sequence means 'skip the nth
    row' whereas an integer means 'skip n rows'.

attrs : dict, optional
    This is a dictionary of attributes that you can pass to use to identify
    the table in the HTML. These are not checked for validity before being
    passed to lxml or Beautiful Soup. However, these attributes must be
    valid HTML table attributes to work correctly. For example, ::

        attrs = {'id': 'table'}

    is a valid attribute dictionary because the 'id' HTML tag attribute is
    a valid HTML attribute for *any* HTML tag as per `this document
    <https://html.spec.whatwg.org/multipage/dom.html#global-attributes>`__. ::

        attrs = {'asdf': 'table'}

    is *not* a valid attribute dictionary because 'asdf' is not a valid
    HTML attribute even if it is a valid XML attribute.  Valid HTML 4.01
    table attributes can be found `here
    <http://www.w3.org/TR/REC-html40/struct/tables.html#h-11.2>`__. A
    working draft of the HTML 5 spec can be found `here
    <https://html.spec.whatwg.org/multipage/tables.html>`__. It contains the
    latest information on table attributes for the modern web.

parse_dates : bool, optional
    See :func:`~read_csv` for more details.

thousands : str, optional
    Separator to use to parse thousands. Defaults to ``','``.

encoding : str, optional
    The encoding used to decode the web page. Defaults to ``None``.``None``
    preserves the previous encoding behavior, which depends on the
    underlying parser library (e.g., the parser library will try to use
    the encoding provided by the document).

decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European
    data).

converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can
    either be integers or column labels, values are functions that take one
    input argument, the cell (not column) content, and return the
    transformed content.

na_values : iterable, default None
    Custom NA values.

keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN
    values are overridden, otherwise they're appended to.

displayed_only : bool, default True
    Whether elements with "display: none" should be parsed.

extract_links : {None, "all", "header", "body", "footer"}
    Table elements in the specified section(s) with <a> tags will have their
    href extracted.

    .. versionadded:: 1.5.0

dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable'
    Back-end data type applied to the resultant :class:`DataFrame`
    (still experimental). Behaviour is as follows:

    * ``"numpy_nullable"``: returns nullable-dtype-backed :class:`DataFrame`
      (default).
    * ``"pyarrow"``: returns pyarrow-backed nullable :class:`ArrowDtype`
      DataFrame.

    .. versionadded:: 2.0

storage_options : dict, optional
    Extra options that make sense for a particular storage connection, e.g.
    host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
    are forwarded to ``urllib.request.Request`` as header options. For other
    URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are
    forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more
    details, and for more examples on storage options refer `here
    <https://pandas.pydata.org/docs/user_guide/io.html?
    highlight=storage_options#reading-writing-remote-files>`_.

    .. versionadded:: 2.1.0

Returns
-------
dfs
    A list of DataFrames.

See Also
--------
read_csv : Read a comma-separated values (csv) file into DataFrame.

Notes
-----
Before using this function you should read the :ref:`gotchas about the
HTML parsing libraries <io.html.gotchas>`.

Expect to do some cleanup after you call this function. For example, you
might need to manually assign column names if the column names are
converted to NaN when you pass the `header=0` argument. We try to assume as
little as possible about the structure of the table and push the
idiosyncrasies of the HTML contained in the table to the user.

This function searches for ``<table>`` elements and only for ``<tr>``
and ``<th>`` rows and ``<td>`` elements within each ``<tr>`` or ``<th>``
element in the table. ``<td>`` stands for "table data". This function
attempts to properly handle ``colspan`` and ``rowspan`` attributes.
If the function has a ``<thead>`` argument, it is used to construct
the header, otherwise the function attempts to find the header within
the body (by putting rows with only ``<th>`` elements into the header).

Similar to :func:`~read_csv` the `header` argument is applied
**after** `skiprows` is applied.

This function will *always* return a list of :class:`DataFrame` *or*
it will fail, e.g., it will *not* return an empty list.

Examples
--------
See the :ref:`read_html documentation in the IO section of the docs
<io.read_html>` for some examples of reading in HTML tables.

Vous êtes un professionnel et vous avez besoin d'une formation ? Machine Learning
avec Scikit-Learn Voir le programme détaillé

Le tutoriel Python complet (Text+Vidéos)

Le tutoriel Python en vidéos

Evaluez vos compétences en Python

Améliorations / Corrections

Fonction read_html - module pandas

Signature de la fonction read_html

Description

help(pandas.read_html)