All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Hugging Face Transformers TensorFlow based NER models.
- PNG ConfigLoader for reading images as arrays to predict using MNIST trained models
- Docstrings and doctestable examples to `record.py`
- Inputs can be validated using operations
  - `validate` parameter in `Input` takes `Operation.instance_name`
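As an illustrative, stdlib-only sketch of the validation pattern (the `Input` class and validator here are minimal stand-ins, not the DFFML API):

```python
# Hypothetical sketch: an Input that runs a validator callable on its value.
class Input:
    def __init__(self, value, validate=None):
        if validate is not None:
            # The validator may normalize the value or raise on bad input
            value = validate(value)
        self.value = value

def strip_and_lower(value):
    # Example validator: trim whitespace and lowercase the string
    return value.strip().lower()

inp = Input("  Hello ", validate=strip_and_lower)
```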
- Logistic Regression with SAG optimizer
- Test tensorflow DNNEstimator documentation examples in CI
- Add python code for tensorflow DNNEstimator
- Ability to run a subflow as if it were an operation using the `dffml.dataflow.run` operation.
- New model tutorial mentions file paths that should be edited.
- DataFlow is no longer a dataclass to prevent it from being exported incorrectly.
- `operations_parameter_set_pairs` moved to `MemoryOrchestratorContext`
- Ignore generated files in `docs/plugins/`
- Treat `"~"` as the home directory rather than a literal
- Windows support by selecting `asyncio.ProactorEventLoop` and not using `asyncio.FastChildWatcher`.
- Parent flows can now forward inputs to active contexts of subflows.
  - `forward` parameter in `DataFlow`
  - `subflow` in `OperationImplementationContext`
- Documentation on writing examples and running doctests
- Doctestable Examples to high-level API.
- shouldi got an operation to run npm-audit on JavaScript code
- Docstrings and doctestable examples for `record.py` (features and evaluated)
- Simplified model API with SimpleModel
- Documentation on how DataFlows work conceptually.
- Style guide now contains information on class, variable, and function naming.
- Restructured contributing documentation
- Use randomly generated data for scikit tests
- Change Core to Official to clarify who maintains each plugin
- Renamed the output of unsupervised models from "Prediction" to "cluster"
- Test scikit LR documentation examples in CI
- Create a fresh archive of the git repo for release instead of cleaning the existing repo with `git clean` for the development service release command.
- Simplified SLR tests for scratch model
- Test tensorflow DNNClassifier documentation examples in CI
- config directories and files associated with ConfigLoaders have been renamed to configloader.
- Model config directory parameters are now `pathlib.Path` objects
- New model tutorial and `skel/model` use simplified model API.
- TensorFlow Hub NLP models.
- Notes on development dependencies in `setup.py` files to codebase notes.
- Test for `cached_download`
- `dffml.util.net.cached_download_unpack_archive` to run a cached download and unpack the archive, very useful for testing. Documented on the Networking Helpers API docs page.
- Directions on how to read the CI under the Git and GitHub page of the contributing documentation.
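A minimal stdlib sketch of the cached-download-with-hash-validation idea (hypothetical helper, not the `dffml.util.net` signature):

```python
import hashlib
import os
import pathlib
import tempfile
import urllib.request

def cached_download(url, target, expected_sha256):
    """Skip the download when target already exists; always verify the hash."""
    target = pathlib.Path(target)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"hash mismatch for {target}: {digest}")
    return target

# Demo: a pre-existing file with a matching hash is never re-downloaded,
# so the bogus URL below is never touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "data.bin")
with open(path, "wb") as handle:
    handle.write(b"hello")
result = cached_download(
    "https://example.com/never-fetched", path, hashlib.sha256(b"hello").hexdigest()
)
```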
- HTTP API
  - Static file serving from a directory with `-static`
  - `api.js` file serving with the `-js` flag
  - Docs page for JavaScript example
- shouldi got an operation to run golangci-lint on Golang code
- Note about using black via VSCode
- Port assignment for the HTTP API via the `-port` flag
- Renamed `repo`/`Repo` to `record`/`Record`
- Definitions with a `spec` can use the `subspec` parameter to declare that they are a list or a dict where the values are of the `spec` type, rather than the list or dict itself being of the `spec` type.
- Fixed the URL mentioned in example to configure a model.
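A stdlib sketch of the dict-of-spec idea described above (the `Person` class and converter are hypothetical illustrations, not DFFML code):

```python
from dataclasses import dataclass

# Hypothetical spec: each *value* of the dict must match this dataclass,
# while the dict itself is plain.
@dataclass
class Person:
    name: str
    age: int

def convert_dict_of_spec(spec, data):
    # Instantiate the spec for every value; the container stays a dict
    return {key: spec(**value) for key, value in data.items()}

people = convert_dict_of_spec(Person, {"a": {"name": "Ada", "age": 36}})
```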
- Sphinx doctests are now run in the CI in the DOCS task.
- Lint JavaScript files with js-beautify and enforce with CI
- Unused imports
- Moved from TensorFlow 1 to TensorFlow 2.
- IDX Sources to read binary data files and train models on MNIST Dataset
- scikit models
  - Clusterers
    - KMeans
    - Birch
    - MiniBatchKMeans
    - AffinityPropagation
    - MeanShift
    - SpectralClustering
    - AgglomerativeClustering
    - OPTICS
- `allowempty` added to source config parameters.
- Quickstart document to show how to use models from Python.
- The latest release of the documentation now includes a link to the documentation for the master branch (on GitHub pages).
- Virtual environment, GitPod, and Docker development environment setup notes to the CONTRIBUTING.md file.
- Changelog now included in documentation website.
- Database abstraction `dffml.db`
  - SQLite connector
  - MySQL connector
- Documented style for imports.
- Documented use of numpy docstrings.
- `Inputs` can now be sanitized using a function passed in the `validate` parameter
- Helper utilities to take callables with numpy style docstrings and create config classes out of them using `make_config`.
- File listing endpoint to HTTP service.
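DFFML's `make_config` works from numpy style docstrings; as a simplified, stdlib-only sketch of the same idea, a config class can be derived from a callable's signature instead (helper name and classes here are hypothetical):

```python
import dataclasses
import inspect

def make_config_from_callable(func):
    """Build a config dataclass from a callable's signature, a stand-in
    for parsing its numpy-style docstring."""
    fields = []
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            fields.append((name, param.annotation))
        else:
            fields.append(
                (name, param.annotation, dataclasses.field(default=param.default))
            )
    return dataclasses.make_dataclass(func.__name__.title() + "Config", fields)

def fit(epochs: int, learning_rate: float = 0.01):
    ...

FitConfig = make_config_from_callable(fit)
config = FitConfig(epochs=10)
```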
- When an operation throws an exception the name of the instance and the parameters it was executed with will be thrown via an `OperationException`.
- Network utilities to perform cached downloads with hash validation.
- Development service got a new command, which can retrieve an argument passed to the setuptools `setup` function within a `setup.py` file.
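A sketch of the exception-wrapping behavior described above (a simplified stand-in, not DFFML's `OperationException` implementation):

```python
class OperationException(Exception):
    """Carries the operation instance name and the parameters it ran with."""
    def __init__(self, instance_name, parameters):
        super().__init__(f"{instance_name}({parameters!r})")
        self.instance_name = instance_name
        self.parameters = parameters

def run_operation(instance_name, func, parameters):
    # Re-raise any failure with the context needed to debug the dataflow
    try:
        return func(**parameters)
    except Exception as error:
        raise OperationException(instance_name, parameters) from error
```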
- All instances of `src_url` changed to `key`.
- `readonly` parameter in source config is now changed to `readwrite`.
- `predict` parameter of all model config classes has been changed from `str` to `Feature`.
- Defining features on the command line no longer requires that defined features be prefixed with `def:`
- The model predict operation will now raise an exception if the model it is passed via its config is a class rather than an instance.
- `entry_point` and friends have been renamed to `entrypoint`.
- Use `FastChildWatcher` when run via the CLI to prevent `BlockingIOError`s.
- TensorFlow based neural network classifier had the `classification` parameter in its config changed to `predict`.
- SciKit models use `make_config_numpy`.
- Predictions in `repos` are now dictionaries.
- All instances of `label` changed to `tag`
- Subclasses of `BaseConfigurable` will now auto instantiate their respective config classes using `kwargs` if the config argument isn't given and keyword arguments are.
- The quickstart documentation was improved as well as the structure of docs.
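A stdlib sketch of the auto-instantiation behavior (simplified stand-ins for `BaseConfigurable` and its `CONFIG` property, not the DFFML classes):

```python
from dataclasses import dataclass

class BaseConfigurable:
    CONFIG = None  # subclasses point this at their config dataclass

    def __init__(self, config=None, **kwargs):
        # When no config object is given, build one from keyword arguments
        if config is None and kwargs:
            config = self.CONFIG(**kwargs)
        self.config = config

@dataclass
class ModelConfig:
    directory: str
    epochs: int = 1

class Model(BaseConfigurable):
    CONFIG = ModelConfig

# Keyword arguments are routed into a ModelConfig automatically
model = Model(directory="/tmp/model", epochs=5)
```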
- CONTRIBUTING.md had `-e` in the wrong place in the getting setup section.
- Since moving to auto `args()` and `config()`, BaseConfigurable no longer produces odd typenames in conjunction with docs.py.
- Autoconvert Definitions with spec into their spec
- The model predict operation erroneously had a `msg` parameter in its config.
- Unused imports identified by deepsource.io
- Evaluation code from feature.py file as well as tests for those evaluations.
- scikit models
  - Classifiers
    - LogisticRegression
    - GradientBoostingClassifier
    - BernoulliNB
    - ExtraTreesClassifier
    - BaggingClassifier
    - LinearDiscriminantAnalysis
    - MultinomialNB
  - Regressors
    - ElasticNet
    - BayesianRidge
    - Lasso
    - ARDRegression
    - RANSACRegressor
    - DecisionTreeRegressor
    - GaussianProcessRegressor
    - OrthogonalMatchingPursuit
    - Lars
    - Ridge
- `AsyncExitStackTestCase` which instantiates and enters async and non-async `contextlib` exit stacks. Provides temporary file creation.
- Automatic releases to PyPi via GitHub Actions
- Automatic documentation deployment to GitHub Pages
- Function to create a config class dynamically, analogous to `make_dataclass`
- `ConfigLoaders` class which loads config files from a file or directory to a dictionary.
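A stdlib sketch of the directory-to-dictionary loading idea (hypothetical helper restricted to JSON, not the `ConfigLoaders` API):

```python
import json
import pathlib
import tempfile

def load_configs(path):
    """Load a .json file, or every .json file under a directory,
    into one dictionary keyed by file stem."""
    path = pathlib.Path(path)
    if path.is_file():
        return {path.stem: json.loads(path.read_text())}
    return {p.stem: json.loads(p.read_text()) for p in sorted(path.glob("*.json"))}

# Demo: two config files in a throwaway directory
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "model.json").write_text(json.dumps({"epochs": 10}))
(workdir / "source.json").write_text(json.dumps({"filename": "data.csv"}))
configs = load_configs(workdir)
```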
- CLI tests and integration tests derive from `AsyncExitStackTestCase`
- SciKit models now use the auto args and config methods.
- Correctly identify when functions decorated with `op` use `self` to reference the `OperationImplementationContext`.
- shouldi safety operation uses the subprocess communicate method instead of stdin pipe writes.
- Negative values are correctly parsed when input via the command line.
- Do not lowercase development mode install location when reporting version.
- Integration tests using the command line interface.
- Operation `run_dataflow` to run a dataflow and test for the same.
- Features were moved from ModelContext to ModelConfig
- CI is now run via GitHub Actions
- CI testing script is now verbose
- args and config methods of all classes no longer require implementation. BaseConfigurable handles exporting of arguments and creation of config objects for each class based off of the CONFIG property of that class. The CONFIG property is a class which has been decorated with dffml.base.config to make it a dataclass.
- Speed up development service install of all plugins in development mode
- Speed up named plugin load times
- DataFlows with multiple possibilities for a source for an input, now correctly look through all possible sources instead of just the first one.
- DataFlow MemoryRedundancyCheckerContext was using all inputs in an input set and all their ancestors to check redundancy (a hold over from pre uid days). It now correctly only uses the inputs in the parameter set. This fixes a major performance issue.
- MySQL packaging issue.
- Develop service running one off operations correctly json-loads dict types.
- Operations with configs can be run via the development service
- JSON dumping numpy int* and float* caused crash on dump.
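The numpy-scalar crash above can be avoided with a `default=` hook; this sketch uses a stand-in class so it stays stdlib-only (numpy scalars really do expose `.item()`, but the fix shown here is illustrative, not DFFML's actual patch):

```python
import json

def default(obj):
    # numpy int*/float* scalars expose .item(), which returns the
    # matching plain Python int or float
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"{obj!r} is not JSON serializable")

class FakeNumpyInt:
    """Stand-in for numpy.int64 so the sketch needs no third-party deps."""
    def __init__(self, value):
        self.value = value
    def item(self):
        return self.value

text = json.dumps({"count": FakeNumpyInt(3)}, default=default)
```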
- CSV source always loads `src_urls` as strings.
- CLI command `operations` removed in favor of `dataflow run`
- Duplicate dataflow diagram code from development service
- Real DataFlows, see operations tutorial and usage examples
- Async helper concurrently nocancel optional keyword argument which, if set is a set of tasks not to cancel when the concurrently execution loop completes.
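A sketch of the `nocancel` idea (a simplified stand-in, not DFFML's `concurrently` helper): tasks left pending are cancelled on completion unless they appear in the `nocancel` set.

```python
import asyncio

async def concurrently(tasks, nocancel=None):
    """Wait for the first task to finish, then cancel the leftovers,
    sparing any task in the optional nocancel set."""
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        if nocancel is None or task not in nocancel:
            task.cancel()
    return [task.result() for task in done]

async def main():
    async def fast():
        return "fast"
    async def slow():
        await asyncio.sleep(60)
        return "slow"
    async def keeper():
        await asyncio.sleep(0.01)
        return "kept"
    keep = asyncio.ensure_future(keeper())
    results = await concurrently(
        {asyncio.ensure_future(fast()), asyncio.ensure_future(slow()), keep},
        nocancel={keep},
    )
    # keep was spared cancellation and can still run to completion
    return results, await keep

results, kept = asyncio.run(main())
```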
- FileSourceTest has a `test_label` method which checks that a FileSource knows how to properly load and save repos under a given label.
- Test case for Merge CLI command
- Repo.feature method to select a single piece of feature data within a repo.
- Dev service to help with hacking on DFFML and to create models from templates in the skel/ directory.
- Classification type parameter to DNNClassifierModelConfig to specify data type of given classification options.
- util.cli CMD classes have their argparse description set to their docstring.
- util.cli CMD classes can specify the formatter class used in `argparse.ArgumentParser` via the `CLI_FORMATTER_CLASS` property.
- Skeleton for service creation was added
- Simple Linear Regression model from scratch
- Scikit Linear Regression model
- Community link in CONTRIBUTING.md.
- Explained three main parts of DFFML on docs homepage
- Documentation on how to use ML models on docs Models plugin page.
- Mailing list info
- Issue template for questions
- Multiple Scikit Models with dynamic config
- Entrypoint listing command to development service to aid in debugging issues with entrypoints.
- HTTP API service to enable interacting with DFFML over HTTP. Currently includes APIs for configuring and using Sources and Models.
- MySQL protocol source to work with data from a MySQL protocol compatible db
- shouldi example got a bandit operation which tells users not to install if there are more than 5 issues of high severity and confidence.
- dev service got the ability to run a single operation in a standalone fashion.
- About page to docs.
- Tensorflow DNNEstimator based regression model.
- feature/codesec became its own branch, binsec
- BaseOrchestratorContext `run_operations` strict now defaults to true. With strict as true, errors will be raised and not just logged.
- MemoryInputNetworkContext got an `sadd` method which is shorthand for creating a MemoryInputSet with a StringInputSetContext.
- MemoryOrchestrator `basic_config` method takes a list of operations and optional config for them.
- shouldi example uses the updated `MemoryOrchestrator.basic_config` method and includes more explanation in comments.
- CSVSource allows for setting the Repo's `src_url` from a csv column
- util Entrypoint defines a new class for each loaded class and sets the `ENTRY_POINT_LABEL` parameter within the newly defined class.
- Tensorflow model removed usages of repo.classifications methods.
- Entrypoint prints traceback of loaded classes to standard error if they fail to load.
- Updated Tensorflow model README.md to match functionality of DNNClassifierModel.
- DNNClassifierModel no longer splits data for the user.
- Update `pip` in Dockerfile.
- Restructured documentation
- Ran `black` on the whole codebase, including all submodules
- CI style check now checks the whole codebase
- Merged HACKING.md into CONTRIBUTING.md
- shouldi example runs bandit now in addition to safety
- The way safety gets called
- Switched documentation to Read The Docs theme
- Models yield only a repo object instead of the value and confidence of the prediction as well. Models are not responsible for calling the predicted method on the repo. This will ease the process of making predict feature specific.
- Updated Tensorflow model README.md to include usage of regression model
- Docs get version from dffml.version.VERSION.
- FileSource zipfiles are wrapped with TextIOWrapper because CSVSource expects the underlying file object to return str instances rather than bytes.
- FileSourceTest inherits from SourceTest and is used to test json and csv sources.
- A temporary directory is used to replicate `mktemp -u` functionality so as to provide tests using a FileSource with a valid tempfile name.
- Labels for JSON sources
- Labels for CSV sources
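The `mktemp -u` replication above can be sketched with the stdlib: create a throwaway directory, then hand out a path inside it that does not exist yet (the filename here is illustrative):

```python
import os
import tempfile

# A valid path that does not exist yet, created safely inside a
# directory that will be cleaned up afterwards.
with tempfile.TemporaryDirectory() as tempdir:
    unused_path = os.path.join(tempdir, "testfile.csv")
    existed_before = os.path.exists(unused_path)
    # a FileSource-style consumer can now create the file itself
    with open(unused_path, "w") as handle:
        handle.write("key,value\n")
    exists_after = os.path.exists(unused_path)
```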
- util.cli CMD's correctly set the description of subparsers instead of their help; they also accept the `CLI_FORMATTER_CLASS` property.
- CSV source now has `entrypoint` decoration
- JSON source now has `entrypoint` decoration
- Strict flag in df.memory is now on by default
- Dynamically created scikit models get config args correctly
- Renamed `DNNClassifierModelContext` first init arg from `config` to `features`
- BaseSource now has `base_entry_point` decoration
- Repo objects are no longer classification specific. Their `classify`, `classified`, and `classification` methods were removed.
- Definition spec field to specify a class representative of key value pairs for definitions with primitives which are dictionaries
- Auto generation of documentation for operation implementations, models, and sources. Generated docs include information on configuration options and inputs and outputs for operation implementations.
- Async helpers got an `aenter_stack` method which creates and returns a `contextlib.AsyncExitStack` after entering all the contexts passed to it.
- Example of how to use Data Flow Facilitator / Orchestrator / Operations by writing a Python meta static analysis tool, shouldi
- OperationImplementation `add_label` and `add_orig_label` methods now use op.name instead of `ENTRY_POINT_ORIG_LABEL` and `ENTRY_POINT_NAME`.
- Make output specs and remap arguments optional for Operations CLI commands.
- Feature skeleton project is now operations skeleton project
- MemoryOperationImplementationNetwork instantiates OperationImplementations using their `withconfig()` method.
- MemorySource now decorated with `entrypoint`
- MemorySource takes arguments correctly via `config_set` and `config_get`
- skel modules have `long_description_content_type` set to "text/markdown"
- Base Orchestrator `__aenter__` and `__aexit__` methods were moved to the Memory Orchestrator because they are specific to that config.
- Async helper `aenter_stack` uses `inspect.isfunction` so it will bind lambdas
- Support for zip file source
- Async helper for running tasks concurrently
- Gitter badge to README
- Documentation on the Data Flow Facilitator subsystem
- codesec plugin containing operations which gather security related metrics on code and binaries.
- auth plugin containing an scrypt operation as an example of thread pool usage.
- Standardized the API for most classes in DFFML via inheritance from dffml.base
- Configuration of classes is now done via the args() and config() methods
- Documentation is now generated using Sphinx
- Corrected maxsplit in util.cli.parser
- Check that dtype is a class in Tensorflow DNN
- CI script no longer always exits 0 for plugin tests
- Corrected render type in setup.py to markdown
- Contribution guidelines
- Logging documentation
- Example usage of Git features
- New Model and Feature creation script
- New Feature skeleton directory
- New Model skeleton directory
- New Feature creation tutorial
- New Model creation tutorial
- Update functionality to the CSV source
- Support for Gzip file source
- Support for bz2 file source
- Travis checks for additions to CHANGELOG.md
- Travis checks for trailing whitespace
- Support for lzma file source
- Support for xz file source
- Data Flow Facilitator
- Restructured documentation to docs folder and moved from rST to markdown
- Git feature cloc logs if no binaries are in path
- Enable source.file to read from /dev/fd/XX
- Corrected formatting in README for PyPi
- Feature class to collect a feature in a dataset
- Git features to collect feature data from Git repos
- Model class to wrap implementations of machine learning models
- Tensorflow DNN model for generic usage of the DNN estimator
- CLI interface and framework
- Source class to manage dataset storage