retriever.lib package¶
Submodules¶
retriever.lib.cleanup module¶
-
class
retriever.lib.cleanup.
Cleanup
(function=<function no_cleanup>, **kwargs)¶ Bases:
future.types.newobject.newobject
This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.
-
retriever.lib.cleanup.
correct_invalid_value
(value, args)¶ This cleanup function replaces missing value indicators with None.
-
retriever.lib.cleanup.
floatable
(value)¶ Check if a value can be converted to a float
-
retriever.lib.cleanup.
no_cleanup
(value, args)¶ Default cleanup function, returns the unchanged value.
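The three cleanup helpers above can be pictured with a minimal sketch. This is not the library's implementation: the function bodies are illustrative, and the `missing_values` key read from `args` is an assumption.

```python
def floatable(value):
    """Return True if value can be converted to a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def no_cleanup(value, args):
    """Default cleanup: hand the value back unchanged."""
    return value


def correct_invalid_value(value, args):
    """Return None when value matches a missing-value indicator.

    Assumption: args is a dict whose hypothetical "missing_values" key
    lists the indicators, for example "-999" or "NA".
    """
    for indicator in args.get("missing_values", []):
        if value == indicator:
            return None
        # Compare numerically as well, so "-999.0" matches "-999".
        if floatable(value) and floatable(indicator) and float(value) == float(indicator):
            return None
    return value
```

A Cleanup object would then pair one of these functions with the dictionary of arguments passed to it.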
retriever.lib.datapackage module¶
-
retriever.lib.datapackage.
clean_input
(prompt='', split_char='', ignore_empty=False, dtype=None)¶ Clean the user-input from the CLI before adding it.
-
retriever.lib.datapackage.
create_json
()¶ Creates datapackage.JSON script. http://specs.frictionlessdata.io/data-packages/#descriptor-datapackagejson Takes input from user via command line.
Usage: retriever new_json
-
retriever.lib.datapackage.
delete_json
(json_file)¶ Delete the json file from the script write path’s directories.
-
retriever.lib.datapackage.
edit_dict
(obj, tabwidth=0)¶ Recursive helper function for edit_json() to edit a datapackage.JSON script file.
-
retriever.lib.datapackage.
edit_json
(json_file)¶ Edit existing datapackage.JSON script.
Usage: retriever edit_json <script_name> Note: Name of script is the dataset name.
-
retriever.lib.datapackage.
get_contains_pk
(dialect)¶ Set contains_pk property.
-
retriever.lib.datapackage.
get_delimiter
(dialect)¶ Get the string delimiter for the dataset file(s).
-
retriever.lib.datapackage.
get_do_not_bulk_insert
(dialect)¶ Set do_not_bulk_insert property.
-
retriever.lib.datapackage.
get_fixed_width
(dialect)¶ Set fixed_width property.
-
retriever.lib.datapackage.
get_header_rows
(dialect)¶ Get number of rows considered as the header.
-
retriever.lib.datapackage.
get_nulls
(dialect)¶ Get list of strings that denote missing values in the dataset.
-
retriever.lib.datapackage.
get_replace_columns
(dialect)¶ Get the replace values for columns from the user.
-
retriever.lib.datapackage.
get_script_filename
(shortname)¶ Return the file name of a script.
File names use ‘_’ while script variable names use ‘-’.
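As a rough illustration of the naming rule, the sketch below maps a script name to a file name. The helper name and the `.json` extension are assumptions made for the example, not confirmed details of the library.

```python
def script_filename_sketch(shortname):
    """Map a script name such as "breed-bird-survey" to a file name."""
    # Assumption: script files are stored with a ".json" extension.
    return shortname.replace("-", "_") + ".json"
```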
-
retriever.lib.datapackage.
is_empty
(val)¶ Check if a variable is an empty string or an empty list.
retriever.lib.datasets module¶
-
retriever.lib.datasets.
dataset_licenses
()¶ Return set with all available licenses.
-
retriever.lib.datasets.
dataset_names
()¶ Return list of all available dataset names.
-
retriever.lib.datasets.
datasets
(keywords=None, licenses=None)¶ Search all datasets by keywords and licenses.
-
retriever.lib.datasets.
license
(dataset)¶ Get the license for a dataset.
retriever.lib.defaults module¶
retriever.lib.download module¶
-
retriever.lib.download.
download
(dataset, path='./', quiet=False, subdir=False, debug=False)¶ Download scripts for retriever.
retriever.lib.dummy module¶
Dummy connection classes for connectionless engine instances
This module contains dummy classes required for non-db based children of the Engine class.
-
class
retriever.lib.dummy.
DummyCursor
¶
retriever.lib.engine module¶
-
class
retriever.lib.engine.
Engine
¶ Bases:
future.types.newobject.newobject
A generic database system. Specific database platforms will inherit from this class.
-
add_to_table
(data_source)¶ Adds data to a table from one or more lines specified in engine.table.source.
-
auto_create_table
(table, url=None, filename=None, pk=None)¶ Create table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.
-
auto_get_datatypes
(pk, source, columns)¶ Determine data types for each column.
For string columns, adds an additional 100 characters to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
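One way to picture this kind of type detection is sketched below. It is a simplified stand-in, not the Engine method: the function name, the tuple-style type labels, and the assumption that values arrive as strings are all illustrative.

```python
def infer_column_type(values, padding=100):
    """Infer a generic type for one column from its observed string values."""
    non_empty = [v for v in values if v not in ("", None)]
    # Try the narrowest type first: every value must parse as an integer.
    try:
        for v in non_empty:
            int(v)
        return ("int",)
    except ValueError:
        pass
    # Next, check whether every value parses as a float.
    try:
        for v in non_empty:
            float(v)
        return ("double",)
    except ValueError:
        pass
    # Otherwise treat the column as text, padding the longest observed length.
    longest = max((len(v) for v in non_empty), default=0)
    return ("char", longest + padding)
```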
-
auto_get_delimiter
(header)¶ Determine the delimiter.
Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
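The counting approach described above can be sketched in a few lines. The candidate set here is an assumption; the library's actual list of common delimiters may differ.

```python
def guess_delimiter(header, candidates=("\t", ",", ";", "|")):
    """Return the candidate delimiter that occurs most often in the header."""
    # On a tie, max() keeps the first candidate in the tuple.
    return max(candidates, key=header.count)
```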
-
connect
(force_reconnect=False)¶ Create a connection.
-
connection
¶ Create a connection.
-
convert_data_type
(datatype)¶ Convert Retriever generic data types to database platform specific data types.
-
create_db
()¶ Create a new database based on settings supplied in Database object engine.db.
-
create_db_statement
()¶ Return SQL statement to create a database.
-
create_raw_data_dir
(path=None)¶ Check to see if the archive directory exists and creates it if necessary.
-
create_table
()¶ Create new database table based on settings supplied in Table object engine.table.
-
create_table_statement
()¶ Return SQL statement to create a table.
-
cursor
¶ Get db cursor.
-
database_name
(name=None)¶ Return name of the database.
-
datatypes
= []¶
-
db
= None¶
-
debug
= False¶
-
disconnect
()¶ Disconnect a connection.
-
disconnect_files
()¶ File systems should override this method.
Enables commit per file object.
-
download_file
(url, filename)¶ Download file to the raw data directory.
-
download_files_from_archive
(url, file_names=None, archive_type='zip', keep_in_dir=False, archive_name=None)¶ Download files from an archive into the raw data directory.
-
drop_statement
(object_type, object_name)¶ Return drop table or database SQL statement.
-
execute
(statement, commit=True)¶ Execute given statement.
-
executemany
(statement, values, commit=True)¶ Execute given statement with multiple values.
-
extract_fixed_width
(line)¶ Split line based on the fixed width, returns list of the values.
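A minimal sketch of fixed-width splitting follows. The `widths` parameter is hypothetical (the method above takes only the line, so the widths presumably come from the table settings), and stripping the padding is an assumption.

```python
def split_fixed_width(line, widths):
    """Cut a line into fields of the given widths and return them as a list."""
    values = []
    position = 0
    for width in widths:
        # Slice out the field and drop the padding around it.
        values.append(line[position:position + width].strip())
        position += width
    return values
```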
-
extract_gz
(archive_path, archivedir_write_path, file_name=None, open_archive_file=None, archive=None)¶ Extract gz files.
Extracts a given file name or all the files in the gz.
-
extract_tar
(archive_path, archivedir_write_path, archive_type, file_name=None)¶ Extract tar or tar.gz files.
Extracts a given file name or the file in the tar or tar.gz. Gzip archives can only contain a single file.
-
extract_zip
(archive_path, archivedir_write_path, file_name=None)¶ Extract zip files.
Extracts a given file name or all the files in the archive.
-
fetch_tables
(table_names)¶ This can be overridden to return the tables of a SQLite database as pandas data frames. Returns False by default.
-
final_cleanup
()¶ Close the database connection.
-
find_file
(filename)¶ Check for an existing datafile.
-
format_data_dir
()¶ Return correctly formatted raw data directory location.
-
format_filename
(filename)¶ Return full path of a file in the archive directory.
-
format_insert_value
(value, datatype)¶ Format a value for an insert statement based on data type.
Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:
- Removing extra enclosing quotes
- Harmonizing null indicators
- Cleaning up badly formatted integers
- Obtaining consistent float representations of decimals
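The four steps listed above might look roughly like the sketch below. It is an illustration under stated assumptions, not the Engine method: the function name, the `null_indicators` default, and the "int" and "double" type labels are hypothetical.

```python
def format_value(value, datatype, null_indicators=("", "NULL", "null", "NA")):
    """Normalize one raw value for insertion; returns None for nulls."""
    text = str(value).strip()
    # Remove one pair of enclosing quotes, if present.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    # Harmonize null indicators into a single representation.
    if text in null_indicators:
        return None
    if datatype == "int":
        # Clean up integers written as "1,200" or "42.0".
        return int(float(text.replace(",", "")))
    if datatype == "double":
        # repr() gives a consistent, round-trippable float representation.
        return repr(float(text))
    return text
```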
-
get_connection
()¶ This method should be overridden by specific implementations of Engine.
-
get_ct_data
(lines)¶ Create cross tab data.
-
get_ct_line_length
(lines)¶ Returns the number of real lines for cross-tab data
-
get_cursor
()¶ Get db cursor.
-
get_input
()¶ Manually get user input for connection information when script is run from terminal.
-
insert_data_from_archive
(url, filenames)¶ Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.
-
insert_data_from_file
(filename)¶ The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.
-
insert_data_from_url
(url)¶ Insert data from a web resource, such as a text file.
-
insert_raster
(path=None, srid=None)¶ Base function for installing raster data from path
-
insert_statement
(values)¶ Return SQL statement to insert a set of values.
-
insert_vector
(path=None, srid=None)¶ Base function for installing vector data from path
-
instructions
= 'Enter your database connection information:'¶
-
load_data
(filename)¶ Generator returning lists of values from lines in a data file.
1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)
2. Identifies the delimiter if not known
3. Removes extra line endings
-
name
= ''¶
-
pkformat
= '%s PRIMARY KEY %s '¶
-
placeholder
= None¶
-
required_opts
= []¶
-
script
= None¶
-
script_table_registry
= {}¶
-
set_engine_encoding
()¶ Set up the encoding to be used.
-
set_table_delimiter
(file_path)¶ Get the delimiter from the data file and set it.
-
spatial_support
= False¶
-
supported_raster
(path, ext=None)¶ Spatial data is not currently supported for this database type or file format. PostgreSQL is currently the only supported output for spatial data.
-
table
= None¶
-
table_name
(name=None, dbname=None)¶ Return full table name.
-
to_csv
(sort=True, path=None)¶
-
use_cache
= True¶
-
warning
(warning)¶ Create a warning message using the current script and table.
-
warnings
= []¶
-
write_fileobject
(archivedir_write_path, file_name, file_obj=None, archive=None, open_object=False)¶ Write a file object from an archive object to a given path.
The open_object flag helps with zip files: when set, both the zip and the file are opened.
-
-
retriever.lib.engine.
file_exists
(path)¶ Return true if a file exists and its size is greater than 0.
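A check of this kind can be written with the standard library alone. The sketch below is an assumed equivalent, not the library's code, and the temporary files exist only to exercise it.

```python
import os
import tempfile


def file_exists_sketch(path):
    """Return True if path is a file and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Small demonstration with an empty file, a non-empty file, and a missing one.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.csv")
    full = os.path.join(tmp, "full.csv")
    open(empty, "w").close()
    with open(full, "w") as handle:
        handle.write("a,b\n")
    empty_result = file_exists_sketch(empty)
    full_result = file_exists_sketch(full)
    missing_result = file_exists_sketch(os.path.join(tmp, "missing.csv"))
```

Treating an empty file as missing means a failed or interrupted download is retried rather than reused.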
-
retriever.lib.engine.
filename_from_url
(url)¶ Extract and return the filename from the URL.
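A simple way to do this is to take the last path segment and drop any query string. The sketch below assumes that behavior; the example URL is made up.

```python
def filename_from_url_sketch(url):
    """Return the last path segment of a URL, without any query string."""
    return url.split("/")[-1].split("?")[0]
```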
-
retriever.lib.engine.
gen_from_source
(source)¶ Return generator from a source tuple.
Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
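The source-tuple idea can be sketched as follows. The helper names are hypothetical; the point is that keeping the (callable, args) tuple around lets the same data be generated again after a generator is exhausted.

```python
def resolve_source(source):
    """Follow (callable, args) tuples until something that is not a tuple is returned."""
    while isinstance(source, tuple):
        func, args = source
        source = func(*args)
    return source


def count_up(limit):
    """Example generator function used as a data source."""
    for i in range(limit):
        yield i


source = (count_up, (3,))
first_pass = list(resolve_source(source))
# The tuple itself is never consumed, so a fresh generator is available again.
second_pass = list(resolve_source(source))
# A callable may also return another source tuple, which is resolved in turn.
nested = list(resolve_source((lambda: (count_up, (2,)), ())))
```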
-
retriever.lib.engine.
reporthook
(tqdm_inst, filename=None)¶ tqdm wrapper to generate a progress bar for urlretrieve.
-
retriever.lib.engine.
skip_rows
(rows, source)¶ Skip over the header lines by reading them before processing.
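Skipping header lines amounts to consuming them from the iterator before handing it on. The sketch below assumes `source` is a (callable, args) tuple as described for gen_from_source; the function name and sample data are illustrative.

```python
def skip_rows_sketch(rows, source):
    """Consume the first `rows` items of a source and return the rest as an iterator."""
    func, args = source
    lines = iter(func(*args))
    for _ in range(rows):
        # The default of None avoids StopIteration on short inputs.
        next(lines, None)
    return lines


data_lines = ["title line", "name,count", "ant,9", "wolf,2"]
remaining = list(skip_rows_sketch(2, (iter, (data_lines,))))
```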
retriever.lib.engine_tools module¶
Data Retriever Tools
This module contains miscellaneous classes and functions used in Retriever scripts.
-
retriever.lib.engine_tools.
create_file
(data, output='output_file')¶ Write lines to file from a list.
-
retriever.lib.engine_tools.
create_home_dir
()¶ Create Directory for retriever.
-
retriever.lib.engine_tools.
file_2list
(input_file)¶ Read in a csv file and return lines as a list.
-
retriever.lib.engine_tools.
final_cleanup
(engine)¶ Perform final cleanup operations after all scripts have run.
-
retriever.lib.engine_tools.
get_script_version
()¶ This function gets the version number of the scripts and returns them in array form.
-
retriever.lib.engine_tools.
getmd5
(data, data_type='lines')¶ Get MD5 of a data source.
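For the default `data_type='lines'` case, a checksum can be built incrementally with `hashlib`. This sketch covers only that case, and the `encoding` parameter is an assumption added for the example.

```python
import hashlib


def md5_of_lines(lines, encoding="utf-8"):
    """Return the hexadecimal MD5 digest of an iterable of text lines."""
    checksum = hashlib.md5()
    for line in lines:
        # Feed each line in turn so the whole input never sits in memory.
        checksum.update(line.encode(encoding))
    return checksum.hexdigest()
```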
-
retriever.lib.engine_tools.
json2csv
(input_file, output_file=None, header_values=None)¶ Convert Json file to CSV.
This function is used only for testing and can handle files of the size used there.
-
retriever.lib.engine_tools.
name_matches
(scripts, arg)¶ Check for a match of the script in available scripts.
- If the argument is all, return the entire script list.
- If the exact script is available, return that script.
- If no exact script name is detected, match the argument against the keywords, title, and name of all scripts and return the closest matches.
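The matching rules above could be sketched as below. This is an illustration only: scripts are modeled as plain dictionaries with hypothetical `name`, `title`, and `keywords` keys, and the use of `difflib.get_close_matches` with a 0.6 cutoff is an assumption about how closeness might be judged.

```python
import difflib


def match_scripts(scripts, arg):
    """Return all scripts, the exact match, or the closest fuzzy matches."""
    if arg == "all":
        return list(scripts)
    exact = [script for script in scripts if script["name"] == arg]
    if exact:
        return exact
    matches = []
    for script in scripts:
        terms = [script["name"], script["title"]] + list(script["keywords"])
        terms = [term.lower() for term in terms]
        # Keep the script if any of its terms is close to the argument.
        if difflib.get_close_matches(arg.lower(), terms, n=1, cutoff=0.6):
            matches.append(script)
    return matches


example_scripts = [
    {"name": "iris", "title": "Iris flower data", "keywords": ["plants"]},
    {"name": "portal", "title": "Portal Project", "keywords": ["mammals", "desert"]},
]
```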
-
retriever.lib.engine_tools.
reset_retriever
(scope='all', ask_permission=True)¶ Remove stored information on scripts, data, and connections.
-
retriever.lib.engine_tools.
set_proxy
()¶ Check for proxies and make them available to urllib.
-
retriever.lib.engine_tools.
sort_csv
(filename)¶ Sort CSV rows minus the header and return the file.
This function is used only for testing and can handle files of the size used there.
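Sorting a CSV file while keeping its header in place can be sketched as follows. The function name is hypothetical and the whole file is read into memory, which fits the testing use case described above.

```python
import csv
import os
import tempfile


def sort_csv_sketch(filename):
    """Sort the data rows of a CSV file in place, leaving the header first."""
    with open(filename, newline="") as handle:
        rows = list(csv.reader(handle))
    header, body = rows[0], sorted(rows[1:])
    with open(filename, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(body)
    return filename


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as handle:
        handle.write("name,count\nwolf,2\nant,9\n")
    sort_csv_sketch(path)
    with open(path, newline="") as handle:
        sorted_text = handle.read()
```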
-
retriever.lib.engine_tools.
sort_file
(file_path)¶ Sort file by line and return the file.
This function is used only for testing and can handle files of the size used there.
-
retriever.lib.engine_tools.
xml2csv
(input_file, outputfile=None, header_values=None, row_tag='row')¶ Convert xml to csv.
This function is used only for testing and can handle files of the size used there.
retriever.lib.excel module¶
Data Retriever Excel Functions
This module contains optional functions for importing data from Excel.
retriever.lib.get_opts module¶
retriever.lib.install module¶
-
retriever.lib.install.
install_csv
(dataset, table_name='./{db}_{table}.csv', debug=False, use_cache=True)¶ Install datasets into csv.
-
retriever.lib.install.
install_json
(dataset, table_name='./{db}_{table}.json', debug=False, use_cache=True)¶ Install datasets into json.
-
retriever.lib.install.
install_msaccess
(dataset, file='./access.mdb', table_name='[{db} {table}]', debug=False, use_cache=True)¶ Install datasets into msaccess.
-
retriever.lib.install.
install_mysql
(dataset, user='root', password='', host='localhost', port=3306, database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True)¶ Install datasets into mysql.
-
retriever.lib.install.
install_postgres
(dataset, user='postgres', password='', host='localhost', port=5432, database='postgres', database_name='{db}', table_name='{db}.{table}', bbox=[], debug=False, use_cache=True)¶ Install datasets into postgres.
-
retriever.lib.install.
install_sqlite
(dataset, file='./sqlite.db', table_name='{db}_{table}', debug=False, use_cache=True)¶ Install datasets into sqlite.
-
retriever.lib.install.
install_xml
(dataset, table_name='./{db}_{table}.xml', debug=False, use_cache=True)¶ Install datasets into xml.
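The `table_name` defaults above contain `{db}` and `{table}` placeholders. The sketch below shows how such templates expand with `str.format`; that the library fills them in this way is an assumption, and the dataset and table names are made up.

```python
def expand_table_name(template, db, table):
    """Fill the {db} and {table} placeholders of a table name template."""
    return template.format(db=db, table=table)


csv_name = expand_table_name("./{db}_{table}.csv", "iris", "Iris")
mysql_name = expand_table_name("{db}.{table}", "iris", "Iris")
access_name = expand_table_name("[{db} {table}]", "iris", "Iris")
```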
retriever.lib.load_json module¶
-
retriever.lib.load_json.
read_json
(json_file, debug=False)¶ Read Json dataset package files
Load each JSON file and get the appropriate encoding for the dataset. Reload the JSON using that encoding to ensure correct character sets.
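The two-pass load described above can be sketched like this. It is an assumed reading of the docstring, not the library's code: the `encoding` key, the UTF-8 default, and the lenient first pass are illustrative choices.

```python
import json
import os
import tempfile


def read_json_sketch(json_file, default_encoding="utf-8"):
    """Load a JSON file twice: once to find its encoding, once to decode it properly."""
    # First pass: tolerate undecodable bytes, only the metadata is needed.
    with open(json_file, encoding=default_encoding, errors="replace") as handle:
        first_pass = json.load(handle)
    encoding = first_pass.get("encoding") or default_encoding
    # Second pass: decode with the encoding the package declares.
    with open(json_file, encoding=encoding) as handle:
        return json.load(handle)


# Demonstration with a Latin-1 file that declares its own encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    package = {"name": "example", "encoding": "latin-1", "title": "Café"}
    with open(path, "wb") as handle:
        handle.write(json.dumps(package, ensure_ascii=False).encode("latin-1"))
    loaded = read_json_sketch(path)
```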
retriever.lib.models module¶
Data Retriever Data Model
This module contains basic class definitions for the Retriever platform.
retriever.lib.repository module¶
Checks the repository for updates.
-
retriever.lib.repository.
check_for_updates
(quiet=False)¶ Check for updates to datasets.
This updates the HOME_DIR scripts directory with the latest script versions
retriever.lib.scripts module¶
-
retriever.lib.scripts.
SCRIPT_LIST
()¶ Return Loaded scripts.
Ensure that only one instance of SCRIPTS is created.
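Creating the script list once and reusing it is a caching pattern that can be sketched as below. All names here are hypothetical, and the loader is a stand-in that merely records how often it runs.

```python
_cache = {}
load_calls = []


def load_scripts():
    """Stand-in for an expensive loader; records each call."""
    load_calls.append(1)
    return ["iris", "portal"]


def script_list():
    """Return the cached script list, loading it on first use only."""
    if "scripts" not in _cache:
        _cache["scripts"] = load_scripts()
    return _cache["scripts"]


first = script_list()
second = script_list()
```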
-
retriever.lib.scripts.
check_retriever_minimum_version
(module)¶ Return true if a script’s version number is greater than the retriever’s version.
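A version comparison like this is safer on integer tuples than on raw strings, since "2.10.0" sorts before "2.9.1" as text. The sketch below takes two version strings directly, which is a simplification: the documented function receives a script module, and both helper names are hypothetical.

```python
def version_tuple(version):
    """Turn "2.10.0" into (2, 10, 0) so parts compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_newer_retriever(script_minimum_version, retriever_version):
    """Return True if the script asks for a newer retriever than the one running."""
    return version_tuple(script_minimum_version) > version_tuple(retriever_version)
```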
-
retriever.lib.scripts.
get_script
(dataset)¶ Return the script for a named dataset.
-
retriever.lib.scripts.
open_csvw
(csv_file, encode=True)¶ Open a csv writer forcing the use of Linux line endings on Windows.
Also sets dialect to ‘excel’ and escape characters to ‘\’
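With the standard `csv` module, forcing Linux line endings comes down to the `lineterminator` argument. The sketch below is an assumed equivalent rather than the library's code; in particular, the backslash escape character is an assumption.

```python
import csv
import io


def open_csvw_sketch(csv_file):
    """Return a csv writer that always ends rows with a newline character only."""
    return csv.writer(csv_file, dialect="excel", escapechar="\\", lineterminator="\n")


buffer = io.StringIO()
writer = open_csvw_sketch(buffer)
writer.writerow(["a", "b"])
writer.writerow(["1", "2"])
csv_text = buffer.getvalue()
```

Without the explicit `lineterminator`, the ‘excel’ dialect would end each row with a carriage return and a newline.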
-
retriever.lib.scripts.
open_fr
(file_name, encoding='ISO-8859-1', encode=True)¶ Open file for reading respecting Python version and OS differences.
Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.
-
retriever.lib.scripts.
open_fw
(file_name, encoding='ISO-8859-1', encode=True)¶ Open file for writing respecting Python version and OS differences.
Sets newline to Linux line endings on Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.
-
retriever.lib.scripts.
reload_scripts
()¶ Load scripts from scripts directory and return list of modules.
-
retriever.lib.scripts.
to_str
(object, object_encoding=<open file '<stdout>', mode 'w'>)¶ Convert a Python3 object to a string as in Python2.
Strings in Python 2 are bytes, whereas Python 3 strings are Unicode text.
retriever.lib.table module¶
-
class
retriever.lib.table.
Dataset
(name=None, url=None)¶ Bases:
future.types.newobject.newobject
Dataset generic properties
-
class
retriever.lib.table.
RasterDataset
(name=None, url=None, dataset_type='RasterDataset', **kwargs)¶ Bases:
retriever.lib.table.Dataset
Raster table implementation
-
class
retriever.lib.table.
TabularDataset
(name=None, url=None, pk=True, contains_pk=False, delimiter=None, header_rows=1, column_names_row=1, fixed_width=False, cleanup=<retriever.lib.cleanup.Cleanup object>, record_id=0, columns=[], replace_columns=[], missingValues=None, cleaned_columns=False, **kwargs)¶ Bases:
retriever.lib.table.Dataset
Tabular database table.
-
add_dialect
()¶ Initialize dialect table properties.
These include a table’s null or missing values, the delimiter, the function to perform on missing values and any values in the dialect’s dict.
-
add_schema
()¶ Add a schema to the table object.
Define the data type for the columns in the table.
-
auto_get_columns
(header)¶ Get column names from the header row.
Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.
-
clean_column_name
(column_name)¶ Clean column names using the expected SQL guidelines: remove leading whitespace, replace SQL keywords, etc.
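Column-name cleaning of this kind might look like the sketch below. The keyword replacements and the exact character rules are assumptions chosen for illustration; the library's own mapping may differ.

```python
import re

# Hypothetical keyword replacements, for illustration only.
KEYWORD_REPLACEMENTS = {"order": "sporder", "group": "grp", "references": "refs"}


def clean_column_name_sketch(column_name):
    """Return a column name that is safe to use in SQL statements."""
    name = column_name.strip().lower()
    # Collapse runs of spaces and special characters into single underscores.
    name = re.sub(r"[^0-9a-z_]+", "_", name).strip("_")
    # Swap out names that collide with SQL keywords.
    return KEYWORD_REPLACEMENTS.get(name, name)
```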
-
combine_on_delimiter
(line_as_list)¶ Combine a list of values into a line of csv data.
-
get_column_datatypes
()¶ Get set of column names for insert statements.
-
get_insert_columns
(join=True, create=False)¶ Get column names for insert statements.
create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.
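The effect of the join and create flags can be sketched as follows. The column layout, a list of (name, type tuple) pairs with a "pk-auto" marker for the automatic key, is an assumption made for the example.

```python
def insert_columns(columns, join=True, create=False):
    """Return column names, skipping the auto key unless a table is being created."""
    names = [name for name, spec in columns if create or spec[0] != "pk-auto"]
    return ", ".join(names) if join else names


example_columns = [
    ("record_id", ("pk-auto",)),
    ("species", ("char", 50)),
    ("count", ("int",)),
]
```

Here `insert_columns(example_columns)` leaves out `record_id`, while passing `create=True` keeps it for the table definition.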
-
values_from_line
(line)¶
-
-
class
retriever.lib.table.
VectorDataset
(name=None, url=None, dataset_type='VectorDataset', **kwargs)¶ Bases:
retriever.lib.table.Dataset
Vector table implementation.
retriever.lib.templates module¶
Datasets are defined as scripts and have unique properties. This module defines generic dataset properties and models the functions available for inheritance by the scripts or datasets.
-
class
retriever.lib.templates.
BasicTextTemplate
(**kwargs)¶ Bases:
retriever.lib.templates.Script
Defines the pre-processing required for scripts.
Scripts that need pre-processing should use the download function from this class. Scripts that require extra tuning should override this class.
-
download
(engine=None, debug=False)¶ Defines the download process for scripts that use the default pre-processing steps provided by the retriever.
-
process_archived_data
(table_obj, url)¶ Pre-process archived files.
Extract the files from the archived source based on the specifications. Either extract a single file or all the files.
-
process_spatial_insert
(table_obj)¶
-
process_tables
(table_obj, url)¶
-
process_tabular_insert
(table_obj, url)¶
-
-
class
retriever.lib.templates.
HtmlTableTemplate
(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)¶ Bases:
retriever.lib.templates.Script
Script template for parsing data in HTML tables.
-
class
retriever.lib.templates.
Script
(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)¶ Bases:
future.types.newobject.newobject
This class defines the properties of a generic dataset.
Each Dataset inherits attributes from this class to define its unique functionality.
-
checkengine
(engine=None)¶ Returns the required engine instance
-
download
(engine=None, debug=False)¶ Generic function to prepare for installation or download.
-
matches_terms
(terms)¶
-
reference_url
()¶
-
retriever.lib.tools module¶
-
retriever.lib.tools.
open_csvw
(csv_file, encode=True)¶ Open a csv writer forcing the use of Linux line endings on Windows.
Also sets dialect to ‘excel’ and escape characters to ‘\’
-
retriever.lib.tools.
open_fr
(file_name, encoding='ISO-8859-1', encode=True)¶ Open file for reading respecting Python version and OS differences.
Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.
-
retriever.lib.tools.
open_fw
(file_name, encoding='ISO-8859-1', encode=True)¶ Open file for writing respecting Python version and OS differences.
Sets newline to Linux line endings on Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the data is kept as bytes.
-
retriever.lib.tools.
to_str
(object, object_encoding=<open file '<stdout>', mode 'w'>)¶
-
retriever.lib.tools.
walk_relative_path
(dir_name)¶ Return relative paths of files in the directory
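Collecting relative paths can be done with `os.walk` and `os.path.relpath`. The sketch below returns a list, which is an assumption about the return type, and the temporary directory exists only for the demonstration.

```python
import os
import tempfile


def walk_relative_path_sketch(dir_name):
    """Return the paths of all files under dir_name, relative to dir_name."""
    relative_paths = []
    for root, _dirs, files in os.walk(dir_name):
        for file_name in files:
            full_path = os.path.join(root, file_name)
            relative_paths.append(os.path.relpath(full_path, dir_name))
    return relative_paths


# Demonstration on a small temporary tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, relative), "w") as handle:
            handle.write("x")
    found = sorted(walk_relative_path_sketch(tmp))
```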