retriever.lib package


retriever.lib.cleanup module

class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)

Bases: future.types.newobject.newobject

This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.

retriever.lib.cleanup.correct_invalid_value(value, args)

This cleanup function replaces missing value indicators with None.


Check if a value can be converted to a float

retriever.lib.cleanup.no_cleanup(value, args)

Default cleanup function, returns the unchanged value.
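The two cleanup functions above share the same (value, args) calling convention. The standalone sketch below illustrates that convention only; it is not the library's code, and the "missing_values" key inside args is an assumption made for the example.

```python
def no_cleanup_sketch(value, args):
    """Return the value unchanged, like the default cleanup described above."""
    return value


def correct_invalid_value_sketch(value, args):
    """Replace missing-value indicators with None.

    `args` is assumed to be a dict holding a hypothetical "missing_values" list.
    """
    missing = args.get("missing_values", [])
    if value in missing:
        return None
    return value


settings = {"missing_values": ["-999", "NA"]}
print(correct_invalid_value_sketch("-999", settings))
print(correct_invalid_value_sketch("12.5", settings))
```

A Cleanup object, as described, would pair one such function with the dictionary of arguments passed to it.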

retriever.lib.datapackage module

retriever.lib.datapackage.clean_input(prompt='', split_char='', ignore_empty=False, dtype=None)

Clean the user input from the CLI before adding it.


Creates datapackage.JSON script. Takes input from user via command line.

Usage: retriever new_json


Delete the json file from the script write path’s directories.

retriever.lib.datapackage.edit_dict(obj, tabwidth=0)

Recursive helper function for edit_json() to edit a datapackage.JSON script file.


Edit existing datapackage.JSON script.

Usage: retriever edit_json <script_name>

Note: Name of script is the dataset name.


Set contains_pk property.


Get the string delimiter for the dataset file(s).


Set do_not_bulk_insert property.


Set fixed_width property.


Get number of rows considered as the header.


Get list of strings that denote missing value in the dataset.


Get the replace values for columns from the user.


Return the file name of a script.

File names have ‘_’ while the script variable names have ‘-’.
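As a rough illustration of the naming rule above, the conversion can be sketched as a simple character swap. The helper names below are hypothetical and are not part of the documented API.

```python
def script_to_file_name(script_name):
    """Turn a script (dataset) name into its file-name form: '-' becomes '_'."""
    return script_name.replace("-", "_")


def file_to_script_name(file_name):
    """Turn a file-name form back into the script name: '_' becomes '-'."""
    return file_name.replace("_", "-")


print(script_to_file_name("breed-bird-survey"))
print(file_to_script_name("breed_bird_survey"))
```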


Check if a variable is an empty string or an empty list.

retriever.lib.datasets module


Return set with all available licenses.


Return list of all available dataset names.

retriever.lib.datasets.datasets(keywords=None, licenses=None)

Search all datasets by keywords and licenses.


Get the license for a dataset.

retriever.lib.defaults module

…, path='./', quiet=False, subdir=False, debug=False)

Download scripts for retriever.

retriever.lib.dummy module

Dummy connection classes for connectionless engine instances

This module contains dummy classes required for non-db based children of the Engine class.

class retriever.lib.dummy.DummyConnection

Bases: object

class retriever.lib.dummy.DummyCursor

Bases: retriever.lib.dummy.DummyConnection

retriever.lib.engine module

class retriever.lib.engine.Engine

Bases: future.types.newobject.newobject

A generic database system. Specific database platforms will inherit from this class.


Adds data to a table from one or more lines specified in engine.table.source.

auto_create_table(table, url=None, filename=None, pk=None)

Create table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.

auto_get_datatypes(pk, source, columns)

Determine data types for each column.

For string columns, adds an additional 100 characters to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
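The padding rule for string columns can be sketched as follows. This is a minimal illustration assuming the column values are already available as a list; the function name is hypothetical.

```python
def padded_string_length(values, padding=100):
    """Return the longest observed string length plus a safety margin."""
    longest = max((len(str(value)) for value in values), default=0)
    return longest + padding


print(padded_string_length(["oak", "sugar maple"]))
```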


Determine the delimiter.

Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
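The counting approach described above can be sketched in a few lines. The candidate delimiters listed here are an assumption; the set actually used by the engine is not given in this document.

```python
def guess_delimiter(header_line, candidates=("\t", ",", ";", "|")):
    """Pick the candidate that occurs most often in the header line.

    Ties go to the candidate listed first.
    """
    return max(candidates, key=header_line.count)


print(repr(guess_delimiter("site,year,species,count")))
print(repr(guess_delimiter("site\tyear\tspecies, common\tcount")))
```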


Create a connection.


Create a connection.


Convert Retriever generic data types to database platform specific data types.


Create a new database based on settings supplied in Database object engine.db.


Return SQL statement to create a database.


Check to see if the archive directory exists and create it if necessary.


Create new database table based on settings supplied in Table object engine.table.


Return SQL statement to create a table.


Get db cursor.


Return name of the database.

datatypes = []
db = None
debug = False

Disconnect a connection.


File systems should override this method.

Enables commit per file object.

download_file(url, filename)

Download file to the raw data directory.

download_files_from_archive(url, file_names=None, archive_type='zip', keep_in_dir=False, archive_name=None)

Download files from an archive into the raw data directory.

drop_statement(object_type, object_name)

Return drop table or database SQL statement.

execute(statement, commit=True)

Execute given statement.

executemany(statement, values, commit=True)

Execute given statement with multiple values.


Split line based on the fixed width, returns list of the values.
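Fixed-width splitting can be sketched as slicing the line at cumulative offsets. The version below is illustrative only: the function name is hypothetical, and stripping the padding whitespace from each value is an assumption.

```python
def split_fixed_width(line, widths):
    """Split a line into values using a list of column widths."""
    values = []
    position = 0
    for width in widths:
        # Slice out one column and drop its padding whitespace.
        values.append(line[position:position + width].strip())
        position += width
    return values


print(split_fixed_width("AB  12 xyz", [4, 3, 3]))
```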

extract_gz(archive_path, archivedir_write_path, file_name=None, open_archive_file=None, archive=None)

Extract gz files.

Extracts a given file name or all the files in the gz.

extract_tar(archive_path, archivedir_write_path, archive_type, file_name=None)

Extract tar or tar.gz files.

Extracts a given file name or all the files in the tar or tar.gz archive.

extract_zip(archive_path, archivedir_write_path, file_name=None)

Extract zip files.

Extracts a given file name or all the files in the archive.


This can be overridden to return the tables of a sqlite db as pandas data frames. Returns False by default.


Close the database connection.


Check for an existing datafile.


Return correctly formatted raw data directory location.


Return full path of a file in the archive directory.

format_insert_value(value, datatype)

Format a value for an insert statement based on data type.

Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:

  1. Removing extra enclosing quotes
  2. Harmonizing null indicators
  3. Cleaning up badly formatted integers
  4. Obtaining consistent float representations of decimals

This method should be overridden by specific implementations of Engine.
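The four steps listed above can be sketched in a standalone function. This is not the Engine method itself: the datatype names ("int", "double", "char") and the set of null indicators are assumptions chosen for the example.

```python
def format_value_sketch(value, datatype, nulls=("", "NA", "NULL", "null")):
    """Normalise one raw value before it is placed in an insert statement."""
    text = str(value).strip()
    # 1. Remove one pair of extra enclosing quotes.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    # 2. Harmonise null indicators to None.
    if text in nulls:
        return None
    # 3. Clean up badly formatted integers such as "1,204" or "42.0".
    if datatype == "int":
        return int(float(text.replace(",", "")))
    # 4. Obtain a consistent float representation.
    if datatype == "double":
        return float(text)
    return text


print(format_value_sketch('"42.0"', "int"))
print(format_value_sketch("NA", "double"))
print(format_value_sketch("'abc'", "char"))
```

A database-specific engine would typically add its own quoting and escaping on top of this kind of normalisation.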


Create cross tab data.


Returns the number of real lines for cross-tab data


Get db cursor.


Manually get user input for connection information when script is run from terminal.

insert_data_from_archive(url, filenames)

Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.


The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.


Insert data from a web resource, such as a text file.

insert_raster(path=None, srid=None)

Base function for installing raster data from path


Return SQL statement to insert a set of values.

insert_vector(path=None, srid=None)

Base function for installing vector data from path

instructions = 'Enter your database connection information:'

Generator returning lists of values from lines in a data file.

  1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)
  2. Identifies the delimiter if not known
  3. Removes extra line endings

name = ''
pkformat = '%s PRIMARY KEY %s '
placeholder = None
required_opts = []
script = None
script_table_registry = {}

Set up the encoding to be used.


Get the delimiter from the data file and set it.

spatial_support = False
supported_raster(path, ext=None)

Spatial data is not currently supported for this database type or file format. PostgreSQL is currently the only supported output for spatial data.

table = None
table_name(name=None, dbname=None)

Return full table name.

to_csv(sort=True, path=None)
use_cache = True

Create a warning message using the current script and table.

warnings = []
write_fileobject(archivedir_write_path, file_name, file_obj=None, archive=None, open_object=False)

Write a file object from an archive object to a given path.

The open_object flag helps with zip files: it opens both the zip archive and the file inside it.


Return true if a file exists and its size is greater than 0.


Extract and return the filename from the url.
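One way to pull a file name out of a URL is sketched below with the standard library. It is an illustrative approach and not necessarily how the library does it; the function name is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url_sketch(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


print(filename_from_url_sketch("https://example.org/data/archive.zip?download=1"))
```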


Return generator from a source tuple.

Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
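The source-tuple idea can be sketched as a small loop that keeps calling until something other than a tuple comes back. The function and generator names below are hypothetical; the point is that the same tuple can be resolved repeatedly to get a fresh generator each time.

```python
def gen_from_source_sketch(source):
    """Resolve a (callable, args) source tuple into a generator."""
    while isinstance(source, tuple):
        func, args = source
        # callable(*args) may return a generator or another source tuple.
        source = func(*args)
    return source


def count_up(limit):
    """Example data source: yields 0 .. limit - 1."""
    for number in range(limit):
        yield number


source = (count_up, (3,))
first_pass = list(gen_from_source_sketch(source))
second_pass = list(gen_from_source_sketch(source))
print(first_pass, second_pass)
```

Because the tuple stores how to build the generator rather than the generator itself, an exhausted data source can always be regenerated.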

retriever.lib.engine.reporthook(tqdm_inst, filename=None)

tqdm wrapper to generate a progress bar for urlretrieve.

retriever.lib.engine.skip_rows(rows, source)

Skip over the header lines by reading them before processing.
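Skipping header lines by reading them first can be sketched as follows. For simplicity this version takes any iterable of lines rather than a source tuple, which is an assumption made for the example.

```python
def skip_rows_sketch(rows, lines):
    """Consume the first `rows` lines and return an iterator over the rest."""
    iterator = iter(lines)
    for _ in range(rows):
        # The default of None avoids StopIteration on short inputs.
        next(iterator, None)
    return iterator


data = ["site,count", "A,1", "B,2"]
print(list(skip_rows_sketch(1, data)))
```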

retriever.lib.engine_tools module

Data Retriever Tools

This module contains miscellaneous classes and functions used in Retriever scripts.

retriever.lib.engine_tools.create_file(data, output='output_file')

Write lines to file from a list.


Create Directory for retriever.


Read in a csv file and return lines as a list.


Perform final cleanup operations after all scripts have run.


This function gets the version number of the scripts and returns them in array form.

retriever.lib.engine_tools.getmd5(data, data_type='lines')

Get MD5 of a data source.
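Computing an MD5 checksum over a sequence of lines can be sketched with hashlib. This only covers the 'lines' case; the function name and the UTF-8 encoding are assumptions.

```python
import hashlib


def md5_of_lines(lines, encoding="utf-8"):
    """Return the hex MD5 digest of an iterable of text lines."""
    checksum = hashlib.md5()
    for line in lines:
        # Hash functions work on bytes, so encode each line first.
        checksum.update(line.encode(encoding))
    return checksum.hexdigest()


print(md5_of_lines(["a,b\n", "1,2\n"]))
```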

retriever.lib.engine_tools.json2csv(input_file, output_file=None, header_values=None)

Convert JSON file to CSV.

This function is used only for testing and is suited to files of the size used in tests.

retriever.lib.engine_tools.name_matches(scripts, arg)

Check for a match of the script in available scripts

  - If all, return the entire script list.
  - If the exact script is available, return that script.
  - If no exact script name is detected, match the argument with the keywords, title and name of all scripts and return the closest matches.
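The matching rules above can be sketched with difflib from the standard library. This simplified version compares against script names only, not keywords or titles, and the cutoff and number of suggestions are assumptions.

```python
import difflib


def name_matches_sketch(names, arg):
    """Return all names, the exact match, or the closest fuzzy matches."""
    if arg == "all":
        return list(names)
    if arg in names:
        return [arg]
    # Fall back to approximate matching on the names.
    return difflib.get_close_matches(arg, names, n=3, cutoff=0.6)


available = ["iris", "portal", "wine-quality"]
print(name_matches_sketch(available, "all"))
print(name_matches_sketch(available, "portal"))
print(name_matches_sketch(available, "irsi"))
```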

retriever.lib.engine_tools.reset_retriever(scope='all', ask_permission=True)

Remove stored information on scripts, data, and connections.


Check for proxies and make them available to urllib.


Sort CSV rows minus the header and return the file.

This function is used only for testing and is suited to files of the size used in tests.


Sort file by line and return the file.

This function is used only for testing and is suited to files of the size used in tests.

retriever.lib.engine_tools.xml2csv(input_file, outputfile=None, header_values=None, row_tag='row')

Convert xml to csv.

This function is used only for testing and is suited to files of the size used in tests.

retriever.lib.excel module

Data Retriever Excel Functions

This module contains optional functions for importing data from Excel.

class retriever.lib.excel.Excel

Bases: future.types.newobject.newobject

static cell_value(cell)

Return string value of an excel spreadsheet cell.

static empty_cell(cell)

Test if excel cell is empty or contains only whitespace.

retriever.lib.get_opts module

retriever.lib.install module

retriever.lib.install.install_csv(dataset, table_name='./{db}_{table}.csv', debug=False, use_cache=True)

Install datasets into csv.

retriever.lib.install.install_json(dataset, table_name='./{db}_{table}.json', debug=False, use_cache=True)

Install datasets into json.

retriever.lib.install.install_msaccess(dataset, file='./access.mdb', table_name='[{db} {table}]', debug=False, use_cache=True)

Install datasets into msaccess.

retriever.lib.install.install_mysql(dataset, user='root', password='', host='localhost', port=3306, database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True)

Install datasets into mysql.

retriever.lib.install.install_postgres(dataset, user='postgres', password='', host='localhost', port=5432, database='postgres', database_name='{db}', table_name='{db}.{table}', bbox=[], debug=False, use_cache=True)

Install datasets into postgres.

retriever.lib.install.install_sqlite(dataset, file='./sqlite.db', table_name='{db}_{table}', debug=False, use_cache=True)

Install datasets into sqlite.

retriever.lib.install.install_xml(dataset, table_name='./{db}_{table}.xml', debug=False, use_cache=True)

Install datasets into xml.
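The table_name defaults in the install functions above contain {db} and {table} placeholders. How the library fills them is not shown here; the sketch below simply assumes ordinary str.format substitution, with a hypothetical helper name.

```python
def fill_table_name(template, db, table):
    """Substitute the {db} and {table} placeholders in a table-name template."""
    return template.format(db=db, table=table)


print(fill_table_name("./{db}_{table}.csv", "iris", "Iris"))
print(fill_table_name("{db}.{table}", "iris", "Iris"))
```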

retriever.lib.load_json module

retriever.lib.load_json.read_json(json_file, debug=False)

Read Json dataset package files

Load each json and get the appropriate encoding for the dataset. Reload the json using the encoding to ensure correct character sets.

retriever.lib.models module

Data Retriever Data Model

This module contains basic class definitions for the Retriever platform.

retriever.lib.repository module

Checks the repository for updates.


Check for updates to datasets.

This updates the HOME_DIR scripts directory with the latest script versions

retriever.lib.scripts module


Return Loaded scripts.

Ensure that only one instance of SCRIPTS is created.

class retriever.lib.scripts.StoredScripts

Return true if a script’s version number is greater than the retriever’s version.


Return the script for a named dataset.

retriever.lib.scripts.open_csvw(csv_file, encode=True)

Open a csv writer forcing the use of Linux line endings on Windows.

Also sets dialect to ‘excel’ and the escape character to ‘\’.
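A csv writer that always emits Linux line endings can be sketched with the standard csv module. The in-memory buffer is used only to keep the example self-contained, and the backslash escape character is an assumption.

```python
import csv
import io

# newline="" stops the buffer from translating line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, dialect="excel", escapechar="\\", lineterminator="\n")
writer.writerow(["id", "name"])
writer.writerow([1, "a,b"])
output = buffer.getvalue()
print(repr(output))
```

Setting lineterminator explicitly overrides the '\r\n' that the 'excel' dialect would otherwise write on every platform.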

retriever.lib.scripts.open_fr(file_name, encoding='ISO-8859-1', encode=True)

Open file for reading respecting Python version and OS differences.

Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.

retriever.lib.scripts.open_fw(file_name, encoding='ISO-8859-1', encode=True)

Open file for writing respecting Python version and OS differences.

Sets newline to Linux line endings on Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.


Load scripts from scripts directory and return list of modules.

retriever.lib.scripts.to_str(object, object_encoding=<open file '<stdout>', mode 'w'>)

Convert a Python3 object to a string as in Python2.

Strings in Python 3 are Unicode text, whereas strings in Python 2 are bytes.

retriever.lib.table module

class retriever.lib.table.Dataset(name=None, url=None)

Bases: future.types.newobject.newobject

Dataset generic properties

class retriever.lib.table.RasterDataset(name=None, url=None, dataset_type='RasterDataset', **kwargs)

Bases: retriever.lib.table.Dataset

Raster table implementation

class retriever.lib.table.TabularDataset(name=None, url=None, pk=True, contains_pk=False, delimiter=None, header_rows=1, column_names_row=1, fixed_width=False, cleanup=<retriever.lib.cleanup.Cleanup object>, record_id=0, columns=[], replace_columns=[], missingValues=None, cleaned_columns=False, **kwargs)

Bases: retriever.lib.table.Dataset

Tabular database table.


Initialize dialect table properties.

These include a table’s null or missing values, the delimiter, the function to perform on missing values and any values in the dialect’s dict.


Add a schema to the table object.

Define the data type for the columns in the table.


Get column names from the header row.

Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.


Clean column names following the expected SQL guidelines: remove leading whitespace, replace SQL keywords, etc.
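Column-name cleaning of the kind described above can be sketched as follows. The replacement table for reserved words and the exact character rules are hypothetical; the library's own lists are not given in this document.

```python
import re

# Hypothetical replacements for names that clash with SQL keywords.
RESERVED = {"order": "order_col", "group": "group_col"}


def clean_column_name_sketch(name):
    """Lower-case a header, replace special characters, avoid SQL keywords."""
    cleaned = name.strip().lower()
    # Collapse runs of spaces and special characters into one underscore.
    cleaned = re.sub(r"[^0-9a-z_]+", "_", cleaned).strip("_")
    return RESERVED.get(cleaned, cleaned)


print(clean_column_name_sketch(" Plot ID "))
print(clean_column_name_sketch("Wt. (g)"))
print(clean_column_name_sketch("Order"))
```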


Combine a list of values into a line of csv data.
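Joining values into one line of csv data can be sketched with the csv module, which takes care of quoting values that contain the delimiter. The function name is hypothetical.

```python
import csv
import io


def values_to_csv_line(values):
    """Return the values as a single csv-formatted line without a line ending."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, lineterminator="\n").writerow(values)
    # Drop the trailing line terminator added by the writer.
    return buffer.getvalue().rstrip("\n")


print(values_to_csv_line([1, "a,b", "c"]))
```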


Get set of column names for insert statements.

get_insert_columns(join=True, create=False)

Get column names for insert statements.

create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.

class retriever.lib.table.VectorDataset(name=None, url=None, dataset_type='VectorDataset', **kwargs)

Bases: retriever.lib.table.Dataset

Vector table implementation.

retriever.lib.templates module

Datasets are defined as scripts and have unique properties. This module defines generic dataset properties and models the functions available for inheritance by the scripts or datasets.

class retriever.lib.templates.BasicTextTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Defines the preprocessing required for scripts.

Scripts that need preprocessing should use the download function from this class. Scripts that require extra tuning should override this class.

download(engine=None, debug=False)

Defines the download processes for scripts that utilize the default preprocessing steps provided by the retriever.

process_archived_data(table_obj, url)

Pre-process archived files.

Extract the files from the archived source based on the specifications. Extract either a single file or all the files.

process_tables(table_obj, url)
process_tabular_insert(table_obj, url)
class retriever.lib.templates.HtmlTableTemplate(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: retriever.lib.templates.Script

Script template for parsing data in HTML tables.

class retriever.lib.templates.Script(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: future.types.newobject.newobject

This class defines the properties of a generic dataset.

Each dataset inherits attributes from this class to define its unique functionality.


Returns the required engine instance

download(engine=None, debug=False)

Generic function to prepare for installation or download.

reference_url()

Return relative paths of files in the directory

retriever.lib.warning module

class retriever.lib.warning.Warning(location, warning)

Bases: future.types.newobject.newobject