Porter is a data import abstraction library to import any data from anywhere. To achieve this she must be able to generalize about the structure of data. Porter believes all data sets are either a single record or a repeating collection of records with consistent structure, where a record is either a list or a tree of name and value pairs.
Porter must be able to abstract data importing requirements so that she can import any type of data, in much the same way that a database must abstract data storage requirements so that it can store any type of data. To be clear, Porter is only interested in data import, not storage. To facilitate this, Porter's interfaces use arrays, also known as records, and array iterators, also known as record collections. Arrays are useful because they can store any data type as values, and iterators are useful because they can iterate over an unlimited number of records, thus allowing Porter to theoretically import any data format of any size.
The Provider organization hosts projects that use Porter to provide useful data. These repositories are ready-to-use Porter providers that grant access to popular third-party APIs and data services. Check it out before writing a new provider to see if it has already been written! Anyone writing new providers is encouraged to contribute them to the organization.
- Usage
- Import specifications
- Record collections
- Filtering
- Mapping
- Caching
- Architecture
- Providers
- Resources
- Connectors
- Requirements
- Limitations
- Testing
- License
Porter's import method accepts an ImportSpecification that describes which data should be imported and how the data should be transformed. To import MyResource we might write the following.
$records = $porter->import(new ImportSpecification(new MyResource));
Provider resources, such as MyResource, specify the Provider class name they work with. Imports will only work when a resource's provider has been added to Porter, otherwise ProviderNotFoundException is thrown. To find which provider MyResource requires we examine its getProviderClassName method, which returns MyProvider::class, in this case. In the following example we register MyProvider with Porter.
$porter = (new Porter)->registerProvider(new MyProvider);
Calling import() returns an instance of PorterRecords, which implements Iterator, allowing us to enumerate each record in the collection using foreach as in the following example.
foreach ($records as $record) {
    // Insert breakpoint or var_dump() here to examine each $record.
}
Import specifications specify what to import and, optionally, how it should be transformed thereafter and whether to use caching. The only mandatory parameter, passed to the constructor, is a ProviderResource that specifies the data we want to import.
Options may be configured by the setters listed below.
- setFilter(callable) – Specifies a predicate that may remove records; see filtering for more.
- setMapping(Mapping) – Specifies a mapping to transform each record; see mapping for more.
- setContext(mixed) – Specifies user-defined data to be passed to Mapper and filters.
- setCacheAdvice(CacheAdvice) – Specifies a caching strategy; see caching for more.
The order of operations is fixed and is as follows.
- Fetch records from ProviderResource.
- Filtering.
- Mapping.
Since the order is fixed, it is not currently possible to exclude records based on data that only exists after mapping.
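The fixed order can be pictured as a chain of generators in which each stage only sees what the previous stage let through. The following is a minimal, self-contained sketch of that idea and not Porter's implementation; fetchRecords, filterRecords and mapRecords are hypothetical helper names invented for illustration.

```php
<?php
declare(strict_types=1);

// Step 1 (hypothetical stand-in for a resource): yields raw records as arrays.
function fetchRecords(): Generator
{
    yield ['id' => 1, 'name' => 'foo'];
    yield ['name' => 'bar']; // No id, so the filter below removes it.
    yield ['id' => 3, 'name' => 'baz'];
}

// Step 2: filtering sees the raw, unmapped record.
function filterRecords(Iterator $records, callable $predicate): Generator
{
    foreach ($records as $record) {
        if ($predicate($record)) {
            yield $record;
        }
    }
}

// Step 3: mapping only sees records that survived filtering.
function mapRecords(Iterator $records, callable $mapper): Generator
{
    foreach ($records as $record) {
        yield $mapper($record);
    }
}

$records = mapRecords(
    filterRecords(fetchRecords(), function (array $record) {
        return isset($record['id']);
    }),
    function (array $record) {
        return ['label' => strtoupper($record['name'])];
    }
);

$result = iterator_to_array($records, false);
```

Because the mapper runs last, the filter predicate never sees the label key, which is why records cannot be excluded based on mapped data.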
The result of a successful Porter::import call is an instance of PorterRecords or one of its specialisations. All collection types returned by Porter extend RecordCollection, which implements Iterator, and guarantees the collection is enumerable using foreach.
Record collections are composed by Porter using the decorator pattern. If provider data is not modified, PorterRecords will decorate the ProviderRecords returned from a ProviderResource. That is, PorterRecords has a pointer back to the previous collection, which could be written as: PorterRecords → ProviderRecords. If a mapping was applied, the collection stack would be PorterRecords → MappedRecords → ProviderRecords. In general this is an unimportant detail for most users but it can be useful for debugging. The stack of record collection types informs us of the transformations a collection has undergone and each type holds a pointer to relevant objects that participated in the transformation, for example, PorterRecords holds a reference to the ImportSpecification that was used to create it and can be accessed using PorterRecords::getSpecification.
A collection may be Countable, depending on whether the imported data set was countable and whether any destructive operations were performed after import. Filtering is a destructive operation since it may remove records and therefore the count reported by a ProviderResource may no longer be accurate. It is the responsibility of the resource to supply the number of records in its collection by returning an iterator that implements Countable, such as CountableProviderRecords. When a countable iterator is detected, Porter returns CountablePorterRecords as long as no destructive operations were performed, which is only possible because the collection types of all non-destructive operations have a countable analogue.
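To make the decorator stack and its countable analogues more concrete, here is a small self-contained sketch. It does not use Porter's classes: SketchRecords, CountableSketchRecords and decorate are hypothetical names, and the real collection types may be organised differently.

```php
<?php
declare(strict_types=1);

// Hypothetical base collection: decorates another iterator and keeps a
// pointer back to the collection it was built from.
class SketchRecords extends IteratorIterator
{
    private $previous;

    public function __construct(Iterator $records, ?SketchRecords $previous = null)
    {
        parent::__construct($records);
        $this->previous = $previous;
    }

    public function getPreviousCollection(): ?SketchRecords
    {
        return $this->previous;
    }
}

// Countable analogue, used only while the count can still be trusted.
class CountableSketchRecords extends SketchRecords implements Countable
{
    private $count;

    public function __construct(Iterator $records, int $count, ?SketchRecords $previous = null)
    {
        parent::__construct($records, $previous);
        $this->count = $count;
    }

    public function count(): int
    {
        return $this->count;
    }
}

// Chooses the type of the next layer: the count is carried forward unless
// the operation that produced this layer was destructive.
function decorate(SketchRecords $records, bool $destructive): SketchRecords
{
    if ($records instanceof Countable && !$destructive) {
        return new CountableSketchRecords($records, count($records), $records);
    }

    return new SketchRecords($records, $records);
}

$provided = new CountableSketchRecords(new ArrayIterator([['a' => 1], ['a' => 2]]), 2);
$untouched = decorate($provided, false); // Still countable.
$filtered = decorate($provided, true);   // Count dropped after a destructive step.
```

Walking getPreviousCollection() from the outermost layer inwards reveals the same kind of stack described above, which is what makes it useful for debugging.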
Filtering provides a way to remove some of the records. For each record, if the specified predicate function returns false or another falsy value the record will be removed, otherwise the record will be kept. The predicate receives the current record as an array as its first parameter and context as its second parameter.
In general we would like to avoid filtering because it is inefficient to import data and then immediately remove some of it, but some immature APIs do not provide a way to reduce the data set on the server, so filtering on the client is our only option. Filtering also invalidates the record count reported by some resources, meaning we no longer know how many records are in the collection before iteration.
The following example filters out any records that do not have an id field present.
$records = $porter->import(
    (new ImportSpecification(new MyResource))
        ->setFilter(function (array $record) {
            return isset($record['id']);
        })
);
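The predicate's second parameter, the context, is not used in the example above. The following standalone sketch mimics the documented contract (record first, context second, falsy results remove the record) without depending on Porter; applyFilter is a hypothetical helper and the minimumId key is an invented piece of context.

```php
<?php
declare(strict_types=1);

// Hypothetical helper mirroring the documented predicate contract.
function applyFilter(iterable $records, callable $predicate, $context = null): Generator
{
    foreach ($records as $record) {
        // Any falsy return value (false, 0, null, '') removes the record.
        if ($predicate($record, $context)) {
            yield $record;
        }
    }
}

// User-defined context, comparable to what setContext() would carry.
$context = ['minimumId' => 2];

$kept = iterator_to_array(applyFilter(
    [['id' => 1], ['id' => 2], ['id' => 3]],
    function (array $record, array $context) {
        return $record['id'] >= $context['minimumId'];
    },
    $context
), false);
```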
Porter integrates Mapper to support data transformations using Mapping objects. A full discussion of Mapper is beyond the scope of this document but the linked repository contains comprehensive documentation. Porter builds on Mapper by providing a powerful mapping strategy called SubImport.
Porter's SubImport strategy provides a way to join data sets together. A mapping may contain any number of sub-imports, each of which may receive a different ImportSpecification. A sub-import causes Porter to begin a new import operation and thus supports all import options without limitation, including importing from different providers and applying a separate mapping to each sub-import.
SubImport(ImportSpecification|callable $specificationOrCallback)
- $specificationOrCallback – Either an ImportSpecification instance or a callable that returns such an instance.
The following example imports MyImportSpecification and copies the foo field from the input data into the output mapping. Next it performs a sub-import using MyDetailsSpecification and stores the result in the details key of the output mapping.
$records = $porter->import(
    (new MyImportSpecification)
        ->setMapping(new AnonymousMapping([
            'foo' => new Copy('foo'),
            // SubImport expects a specification instance (or a callable), not a bare class name.
            'details' => new SubImport(new MyDetailsSpecification),
        ]))
);
The following example is the same as the previous except MyDetailsSpecification now requires an identifier that is copied from details_id present in the input data. This is only possible using a callback since we cannot inject strategies inside specifications.
$records = $porter->import(
    (new MyImportSpecification)
        ->setMapping(new AnonymousMapping([
            'foo' => new Copy('foo'),
            'details' => new SubImport(
                function (array $record) {
                    return new MyDetailsSpecification($record['details_id']);
                }
            ),
        ]))
);
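Conceptually, a callback-driven sub-import behaves like a per-record join: for each parent record a second import is started using a value taken from that record, and its result is nested under a key. The sketch below imitates this with plain generators so it can run without Porter or Mapper; importItems, importDetails and joinDetails are hypothetical names and the data is invented.

```php
<?php
declare(strict_types=1);

// Hypothetical parent data set.
function importItems(): Generator
{
    yield ['foo' => 'a', 'details_id' => 10];
    yield ['foo' => 'b', 'details_id' => 20];
}

// Hypothetical child data set, looked up by identifier.
function importDetails(int $id): Generator
{
    $details = [10 => ['colour' => 'red'], 20 => ['colour' => 'blue']];

    yield $details[$id];
}

// For each parent record the callback decides what to import next. The child
// import runs to completion before the next parent record is processed.
function joinDetails(iterable $records, callable $subImport): Generator
{
    foreach ($records as $record) {
        yield [
            'foo' => $record['foo'],
            'details' => iterator_to_array($subImport($record), false),
        ];
    }
}

$joined = iterator_to_array(joinDetails(importItems(), function (array $record) {
    return importDetails($record['details_id']);
}), false);
```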
Caching is available at the connector level if the connector implements CacheToggle. Connectors typically extend CachingConnector which implements PSR-6-compatible caching. Porter ships with just one cache implementation, MemoryCache, which stores data in memory, but this can be substituted for any PSR-6 cache if the connector permits it.
When available, the connector caches raw responses for each unique cache key, which is composed of the source and options parameters. Options are sorted before the cache key is created so the order of options is unimportant.
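The effect of sorting options before building the key can be sketched as follows. This is an illustrative assumption about how such a key might be derived, not the algorithm Porter's connectors use; createCacheKey is a hypothetical function and the choice of JSON plus SHA-256 is arbitrary.

```php
<?php
declare(strict_types=1);

// Hypothetical key derivation: sort options by name, then hash them together
// with the source so that equal inputs always produce the same key.
function createCacheKey(string $source, array $options = []): string
{
    ksort($options); // Only sorts the top level, which is enough for this sketch.

    return hash('sha256', json_encode([$source, $options]));
}

$a = createCacheKey('https://example.com/items', ['page' => 2, 'limit' => 50]);
$b = createCacheKey('https://example.com/items', ['limit' => 50, 'page' => 2]);
$c = createCacheKey('https://example.com/items', ['limit' => 50, 'page' => 3]);
```

Here $a and $b are identical because only the option order differs, whereas $c differs because an option value changed.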
Caching behaviour is specified by one of the CacheAdvice enumeration constants listed below.
- SHOULD_CACHE – Response should be cached if a cache is available.
- SHOULD_NOT_CACHE – Response should not be cached even if a cache is available.
- MUST_CACHE – Response must be cached otherwise an exception may be thrown.
- MUST_NOT_CACHE – Response must not be cached otherwise an exception may be thrown.
The default cache advice is SHOULD_NOT_CACHE, meaning connectors supporting caching will not cache responses and connectors not supporting caching will not throw any exceptions.
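One plausible reading of the four constants is that the SHOULD variants are preferences that are silently skipped when caching cannot be controlled, while the MUST variants fail loudly. The sketch below encodes that reading only; it is an assumption rather than Porter's actual logic. resolveCacheAdvice is a hypothetical helper, plain strings stand in for the CacheAdvice enumeration, and $canToggleCache stands in for whether the connector implements CacheToggle.

```php
<?php
declare(strict_types=1);

// Returns true to enable caching, false to disable it, or null when the
// advice is only a preference and cannot be applied.
function resolveCacheAdvice(string $advice, bool $canToggleCache): ?bool
{
    $wantsCache = in_array($advice, ['SHOULD_CACHE', 'MUST_CACHE'], true);
    $mandatory = in_array($advice, ['MUST_CACHE', 'MUST_NOT_CACHE'], true);

    if (!$canToggleCache) {
        if ($mandatory) {
            throw new RuntimeException("Cannot honour $advice: caching cannot be toggled.");
        }

        return null;
    }

    return $wantsCache;
}

$preferred = resolveCacheAdvice('SHOULD_CACHE', true);
$ignored = resolveCacheAdvice('SHOULD_NOT_CACHE', false);

try {
    resolveCacheAdvice('MUST_CACHE', false);
    $threw = false;
} catch (RuntimeException $exception) {
    $threw = true;
}
```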
The following example enables connector-level response caching, if available.
$records = $porter->import(
    (new ImportSpecification(new MyResource))
        ->setCacheAdvice(CacheAdvice::SHOULD_CACHE())
);
Porter talks to providers to fetch data. Providers represent one or more resources from which data can be fetched. Providers pass a connector needed by their resources to fetch data. Resources define the provider they are compatible with and receive the provider's connector when fetching data. Resources must transform their data into one or more records, collectively known as record collections, which present data sets as an enumeration of array values.
The following UML class diagram shows a partial architectural overview illustrating Porter's main components. Note that Mapper is a separate project with optional integration into Porter but is included for completeness.
Providers fetch data from their ProviderResource objects by supplying them with a valid Connector. A provider implements Provider, which defines one method with the following signature.
public function fetch(ProviderResource $resource) : Iterator;
When fetch() is called it is passed the resource from which data must be fetched. The provider must supply the resource with its connector which it typically does by calling $resource->fetch($connector).
A provider knows whether a given resource belongs to it by calling ProviderResource::getProviderClassName() and checking for equality, but a provider does not know how many resources it has nor maintains a list of such resources and neither does any other part of the application. That is, a resource class can be created at any time and claim to belong to a given provider without any formal registration, and the provider must accept all such objects.
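The ownership check described above amounts to a string comparison against the provider's own class name. The following sketch illustrates this with hypothetical stand-ins (SketchResource, SketchProvider, OwnedResource and ForeignResource) rather than Porter's real interfaces; the accepts method is invented for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical, cut-down version of a resource interface.
interface SketchResource
{
    public function getProviderClassName(): string;
}

final class SketchProvider
{
    // No registry of resources exists: any resource naming this class is accepted.
    public function accepts(SketchResource $resource): bool
    {
        return $resource->getProviderClassName() === self::class;
    }
}

final class OwnedResource implements SketchResource
{
    public function getProviderClassName(): string
    {
        return SketchProvider::class;
    }
}

final class ForeignResource implements SketchResource
{
    public function getProviderClassName(): string
    {
        return 'SomeOtherProvider';
    }
}

$provider = new SketchProvider;
$acceptsOwned = $provider->accepts(new OwnedResource);
$acceptsForeign = $provider->accepts(new ForeignResource);
```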
Note: before writing a provider be sure to check out the Provider organization to see if it has already been written!
Providers must implement the Provider interface, however it is common to extend AbstractProvider instead. The abstract class provides a fetch() implementation, forwards options, stores a connector and proxies cache methods for the connector. A typical AbstractProvider implementation only needs to override the constructor with a specialized type hint for the connector it requires.
Providers may also store common state applicable to their resources, such as authentication data, that is passed to a resource's second fetch() parameter when the provider's fetch() method is called. The recommended way to pass state to resources is calling AbstractProvider::setOptions() in the provider's constructor, which causes the options to be forwarded automatically during fetch().
In the following example we create a provider that only accepts HttpConnector instances. We also create a default connector in case one is not supplied. Note it is not always possible to create a default connector and it is perfectly valid to insist the caller supplies a connector.
final class MyProvider extends AbstractProvider
{
    public function __construct(HttpConnector $connector = null)
    {
        parent::__construct($connector ?: new HttpConnector);
    }
}
Resources fetch data using the supplied connector and format it as a collection of arrays. A resource implements ProviderResource, which defines the following three methods.
public function getProviderClassName() : string;
public function getProviderTag() : string;
public function fetch(Connector $connector, EncapsulatedOptions $options = null) : Iterator;
A resource supplies the class name of the provider it expects a connector from when getProviderClassName() is called. A user-defined tag can be specified to identify a particular Provider instance when getProviderTag() is called.
When fetch() is called it is passed the connector from which data must be fetched. The resource must ensure data is formatted as an iterator of array values whilst remaining as true to the original format as possible; that is, we must avoid renaming or restructuring data because it is the caller's prerogative to perform data customization if desired.
Providers may also supply options to fetch(). Such options are typically used to convey API keys or other options common to all of a provider's resources. When specified, a resource must ensure the options are transmitted to the connector.
Resources must implement the ProviderResource interface, however it is common to extend AbstractResource instead because it provides a working implementation for provider tagging. A typical AbstractResource implementation implements getProviderClassName() with a hard-coded provider class name and a valid fetch() implementation.
It is important to understand fetch() must always return an iterator of array values. Suppose we want to return the numeric series one to three. The following implementation would be invalid because it returns an iterator of integer values instead of an iterator of array values.
public function fetch(Connector $connector, EncapsulatedOptions $options = null) : Iterator
{
    return new ArrayIterator(range(1, 3)); // Invalid: yields integers, not arrays.
}
Either of the following examples would be valid fetch() implementations.
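public function fetch(Connector $connector, EncapsulatedOptions $options = null) : Iterator
{
    foreach (range(1, 3) as $number) {
        // Wrap each number in an array so every record is an array value.
        yield [$number];
    }
}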
Since the total number of records is known, the iterator can be wrapped in CountableProviderRecords to enrich the caller with this information.
public function fetch(Connector $connector, EncapsulatedOptions $options = null) : Iterator
{
    $series = function ($limit) {
        foreach (range(1, $limit) as $number) {
            yield [$number];
        }
    }; // The closure assignment must end with a semicolon.

    return new CountableProviderRecords($series($count = 3), $count, $this);
}
In the following example we create a resource that receives a connector from MyProvider and uses it to retrieve data from a hard-coded URL. We expect the data to be JSON encoded so we decode it into an array and use yield to return it as a single-item iterator.
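class MyResource extends AbstractResource
{
    const URL = 'https://example.com';

    public function getProviderClassName() : string
    {
        return MyProvider::class;
    }

    public function fetch(Connector $connector, EncapsulatedOptions $options = null) : Iterator
    {
        // Forward any provider options to the connector.
        $data = $connector->fetch(self::URL, $options);

        // Decode the JSON response and yield it as a single record.
        yield json_decode($data, true);
    }
}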
Connectors fetch remote data from the specified source. A connector implements Connector, which defines one method with the following signature.
public function fetch(string $source, EncapsulatedOptions $options = null);
When fetch() is called the connector fetches data from the specified source whilst applying any options that may have been specified. If a connector accepts options it must define its own options class and ensure its type is passed. Connectors may return data in any format that's convenient for resources to consume, but in general, such data should be as raw as possible and without modification.
The following connectors are provided with Porter.
- HttpConnector – Fetches data from an HTTP server via the PHP wrapper.
- SoapConnector – Fetches data from a SOAP service.
- Filtering always occurs before mapping.
- Imports must complete synchronously. That is, calls to import() are blocking.
- Sub-imports must complete synchronously. That is, the previous sub-import must finish before the next starts.
Porter is almost fully unit tested. Run the tests with bin/test from a shell.
Porter is published under the open source GNU Lesser General Public License v3.0. However, the original Porter character and artwork is copyright © 2016 Bilge and may not be reproduced or modified without express written permission.