@InterfaceAudience.Public @InterfaceStability.Evolving public interface EnsemblePlacementPolicy
EnsemblePlacementPolicy encapsulates the algorithm that bookkeeper client uses to select a number of bookies
from the cluster as an ensemble for storing entries.
The algorithm is typically implemented based on the data input as well as the network topology properties.
This interface basically covers three parts:
The ensemble placement policy is constructed by jvm reflection during constructing bookkeeper client.
After the EnsemblePlacementPolicy is constructed, bookkeeper client will call
#initialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger) to initialize
the placement policy.
The #initialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger)
method takes a few resources from bookkeeper for instantiating itself. These resources include:
FeatureProvider that the policy could use for enabling or disabling its offered
features. For example, a RegionAwareEnsemblePlacementPolicy could offer features
to disable placing data to a specific region at runtime.
StatsLogger for exposing stats.
The ensemble placement policy is a single instance per bookkeeper client. The instance will
be uninitalize() when closing the bookkeeper client. The implementation of a placement policy should be
responsible for releasing all the resources that allocated during
#initialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger).
The bookkeeper client discovers list of bookies from zookeeper via BookieWatcher - whenever there are
bookie changes, the ensemble placement policy will be notified with new list of bookies via
onClusterChanged(Set, Set). The implementation of the ensemble placement policy will react on those
changes to build new network topology. Subsequent operations like newEnsemble(int, int, int, Map, Set) or
#replaceBookie(int, int, int, java.util.Map, java.util.Set,
org.apache.bookkeeper.net.BookieSocketAddress, java.util.Set)
hence can operate on the new
network topology.
Both RackawareEnsemblePlacementPolicy and RegionAwareEnsemblePlacementPolicy are
TopologyAwareEnsemblePlacementPolicys. They build a NetworkTopology on
bookie changes, use it for ensemble placement and ensure rack/region coverage for write quorums.
The network topology is presenting a cluster of bookies in a tree hierarchical structure. For example, a bookie cluster may be consists of many data centers (aka regions) filled with racks of machines. In this tree structure, leaves represent bookies and inner nodes represent switches/routes that manage traffic in/out of regions or racks.
For example, there are 3 bookies in region `A`. They are `bk1`, `bk2` and `bk3`. And their network locations are
/region-a/rack-1/bk1, /region-a/rack-1/bk2 and /region-a/rack-2/bk3. So the network topology
will look like below:
root
|
region-a
/ \
rack-1 rack-2
/ \ \
bk1 bk2 bk3
Another example, there are 4 bookies spanning in two regions `A` and `B`. They are `bk1`, `bk2`, `bk3` and `bk4`.
And their network locations are /region-a/rack-1/bk1, /region-a/rack-1/bk2,
/region-b/rack-2/bk3 and /region-b/rack-2/bk4. The network topology will look like below:
root
/ \
region-a region-b
| |
rack-1 rack-2
/ \ / \
bk1 bk2 bk3 bk4
The network location of each bookie is resolved by a DNSToSwitchMapping. The DNSToSwitchMapping
resolves a list of DNS-names or IP-addresses into a list of network locations. The network location that is returned
must be a network path of the form `/region/rack`, where `/` is the root, and `region` is the region id representing
the data center where `rack` is located. The network topology of the bookie cluster would determine the number of
RackawareEnsemblePlacementPolicy basically just chooses bookies from different racks in the built
network topology. It guarantees that a write quorum will cover at least two racks. It expects the network locations
resolved by DNSToSwitchMapping have at least 2 levels. For example, network location paths like
/dc1/rack0 and /dc1/row1/rack0 are okay, but /rack0 is not acceptable.
RegionAwareEnsemblePlacementPolicy is a hierarchical placement policy, which it chooses
equal-sized bookies from regions, and within each region it uses RackawareEnsemblePlacementPolicy to choose
bookies from racks. For example, if there is 3 regions - region-a, region-b and region-c,
an application want to allocate a 15-bookies ensemble. First, it would figure out there are 3 regions and
it should allocate 5 bookies from each region. Second, for each region, it would use
RackawareEnsemblePlacementPolicy to choose 5 bookies.
Since RegionAwareEnsemblePlacementPolicy is based on RackawareEnsemblePlacementPolicy, it expects
the network locations resolved by DNSToSwitchMapping have at least 3 levels.
#reorderReadSequence(List, BookiesHealthInfo, WriteSet) and
#reorderReadLACSequence(List, BookiesHealthInfo, WriteSet) are
two methods exposed by the placement policy, to help client determine a better read sequence according to the
network topology and the bookie failure history.
For example, in RackawareEnsemblePlacementPolicy, the reads will be attempted in following sequence:
In RegionAwareEnsemblePlacementPolicy, the reads will be tried in similar following sequence
as `RackAware` placement policy. There is a slight different on trying writable bookies: after trying every 2
bookies from local region, it would try a bookie from remote region. Hence it would achieve low latency even
there is network issues within local region.
Currently there are 3 implementations available by default. They are:
You can configure the ensemble policy by specifying the placement policy class in
ClientConfiguration.setEnsemblePlacementPolicy(Class).
DefaultEnsemblePlacementPolicy randomly pickups bookies from the cluster, while both
RackawareEnsemblePlacementPolicy and RegionAwareEnsemblePlacementPolicy choose bookies based on
network locations. So you might also consider configuring a proper DNSToSwitchMapping in
BookKeeper.Builder to resolve the correct network locations for your cluster.
| Modifier and Type | Interface and Description |
|---|---|
static class |
EnsemblePlacementPolicy.PlacementPolicyAdherence
enum for PlacementPolicyAdherence.
|
static class |
EnsemblePlacementPolicy.PlacementResult<T>
Result of a placement calculation against a placement policy.
|
| Modifier and Type | Method and Description |
|---|---|
default boolean |
areAckedBookiesAdheringToPlacementPolicy(java.util.Set<org.apache.bookkeeper.net.BookieId> ackedBookies,
int writeQuorumSize,
int ackQuorumSize)
Returns true if the bookies that have acknowledged a write adhere to the minimum fault domains as defined in the
placement policy in use.
|
default int |
getStickyReadBookieIndex(LedgerMetadata metadata,
java.util.Optional<java.lang.Integer> currentStickyBookieIndex)
Select one bookie to the "sticky" bookie where all reads for a particular
ledger will be directed to.
|
EnsemblePlacementPolicy |
initialize(ClientConfiguration conf,
java.util.Optional<org.apache.bookkeeper.net.DNSToSwitchMapping> optionalDnsResolver,
io.netty.util.HashedWheelTimer hashedWheelTimer,
FeatureProvider featureProvider,
org.apache.bookkeeper.stats.StatsLogger statsLogger,
org.apache.bookkeeper.proto.BookieAddressResolver bookieAddressResolver)
Initialize the policy.
|
default EnsemblePlacementPolicy.PlacementPolicyAdherence |
isEnsembleAdheringToPlacementPolicy(java.util.List<org.apache.bookkeeper.net.BookieId> ensembleList,
int writeQuorumSize,
int ackQuorumSize)
returns AdherenceLevel if the Ensemble is strictly/softly/fails adhering
to placement policy, like in the case of
RackawareEnsemblePlacementPolicy, bookies in the writeset are from
'minNumRacksPerWriteQuorum' number of racks.
|
EnsemblePlacementPolicy.PlacementResult<java.util.List<org.apache.bookkeeper.net.BookieId>> |
newEnsemble(int ensembleSize,
int writeQuorumSize,
int ackQuorumSize,
java.util.Map<java.lang.String,byte[]> customMetadata,
java.util.Set<org.apache.bookkeeper.net.BookieId> excludeBookies)
Choose numBookies bookies for ensemble.
|
java.util.Set<org.apache.bookkeeper.net.BookieId> |
onClusterChanged(java.util.Set<org.apache.bookkeeper.net.BookieId> writableBookies,
java.util.Set<org.apache.bookkeeper.net.BookieId> readOnlyBookies)
A consistent view of the cluster (what bookies are available as writable, what bookies are available as
readonly) is updated when any changes happen in the cluster.
|
void |
registerSlowBookie(org.apache.bookkeeper.net.BookieId bookieSocketAddress,
long entryId)
Register a bookie as slow so that it is tried after available and read-only bookies.
|
DistributionSchedule.WriteSet |
reorderReadLACSequence(java.util.List<org.apache.bookkeeper.net.BookieId> ensemble,
BookiesHealthInfo bookiesHealthInfo,
DistributionSchedule.WriteSet writeSet)
Reorder the read last add confirmed sequence of a given write quorum writeSet.
|
DistributionSchedule.WriteSet |
reorderReadSequence(java.util.List<org.apache.bookkeeper.net.BookieId> ensemble,
BookiesHealthInfo bookiesHealthInfo,
DistributionSchedule.WriteSet writeSet)
Reorder the read sequence of a given write quorum writeSet.
|
EnsemblePlacementPolicy.PlacementResult<org.apache.bookkeeper.net.BookieId> |
replaceBookie(int ensembleSize,
int writeQuorumSize,
int ackQuorumSize,
java.util.Map<java.lang.String,byte[]> customMetadata,
java.util.List<org.apache.bookkeeper.net.BookieId> currentEnsemble,
org.apache.bookkeeper.net.BookieId bookieToReplace,
java.util.Set<org.apache.bookkeeper.net.BookieId> excludeBookies)
Choose a new bookie to replace bookieToReplace.
|
void |
uninitalize()
Uninitialize the policy.
|
default void |
updateBookieInfo(java.util.Map<org.apache.bookkeeper.net.BookieId,BookieInfoReader.BookieInfo> bookieInfoMap)
Send the bookie info details.
|
EnsemblePlacementPolicy initialize(ClientConfiguration conf, java.util.Optional<org.apache.bookkeeper.net.DNSToSwitchMapping> optionalDnsResolver, io.netty.util.HashedWheelTimer hashedWheelTimer, FeatureProvider featureProvider, org.apache.bookkeeper.stats.StatsLogger statsLogger, org.apache.bookkeeper.proto.BookieAddressResolver bookieAddressResolver)
conf - client configurationoptionalDnsResolver - dns resolverhashedWheelTimer - timerfeatureProvider - feature providerstatsLogger - stats loggervoid uninitalize()
java.util.Set<org.apache.bookkeeper.net.BookieId> onClusterChanged(java.util.Set<org.apache.bookkeeper.net.BookieId> writableBookies,
java.util.Set<org.apache.bookkeeper.net.BookieId> readOnlyBookies)
The implementation should take actions when the cluster view is changed. So subsequent
newEnsemble(int, int, int, Map, Set) and
#replaceBookie(int, int, int, java.util.Map, java.util.Set,
org.apache.bookkeeper.net.BookieSocketAddress, java.util.Set)
can choose proper bookies.
writableBookies - All the bookies in the cluster available for write/read.readOnlyBookies - All the bookies in the cluster available for readonly.EnsemblePlacementPolicy.PlacementResult<java.util.List<org.apache.bookkeeper.net.BookieId>> newEnsemble(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.Set<org.apache.bookkeeper.net.BookieId> excludeBookies) throws BKException.BKNotEnoughBookiesException
BKException.BKNotEnoughBookiesException is thrown.
The implementation should respect to the replace settings. The size of the returned bookie list
should be equal to the provide ensembleSize.
customMetadata is the same user defined data that user provides
when BookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[], Map).
If 'enforceMinNumRacksPerWriteQuorum' config is enabled then the bookies belonging to default faultzone (rack) will be excluded while selecting bookies.
ensembleSize - Ensemble SizewriteQuorumSize - Write Quorum SizeackQuorumSize - the value of ackQuorumSize (added since 4.5)customMetadata - the value of customMetadata. it is the same user defined metadata that user
provides in BookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[])excludeBookies - Bookies that should not be considered as targets.BKException.BKNotEnoughBookiesException - if not enough bookies available.EnsemblePlacementPolicy.PlacementResult<org.apache.bookkeeper.net.BookieId> replaceBookie(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.List<org.apache.bookkeeper.net.BookieId> currentEnsemble, org.apache.bookkeeper.net.BookieId bookieToReplace, java.util.Set<org.apache.bookkeeper.net.BookieId> excludeBookies) throws BKException.BKNotEnoughBookiesException
BKException.BKNotEnoughBookiesException is thrown.
If 'enforceMinNumRacksPerWriteQuorum' config is enabled then the bookies belonging to default faultzone (rack) will be excluded while selecting bookies.
ensembleSize - the value of ensembleSizewriteQuorumSize - the value of writeQuorumSizeackQuorumSize - the value of ackQuorumSize (added since 4.5)customMetadata - the value of customMetadata. it is the same user defined metadata that user
provides in BookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[])currentEnsemble - the value of currentEnsemblebookieToReplace - bookie to replaceexcludeBookies - bookies that should not be considered as candidate.BKException.BKNotEnoughBookiesExceptionvoid registerSlowBookie(org.apache.bookkeeper.net.BookieId bookieSocketAddress,
long entryId)
bookieSocketAddress - Address of bookie hostentryId - Entry ID that caused a speculative timeout on the bookie.DistributionSchedule.WriteSet reorderReadSequence(java.util.List<org.apache.bookkeeper.net.BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
ensemble - Ensemble to read entries.bookiesHealthInfo - Health info for bookieswriteSet - Write quorum to read entries. This will be modified, rather than
allocating a new WriteSet.DistributionSchedule.WriteSet reorderReadLACSequence(java.util.List<org.apache.bookkeeper.net.BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
ensemble - Ensemble to read entries.bookiesHealthInfo - Health info for bookieswriteSet - Write quorum to read entries. This will be modified, rather than
allocating a new WriteSet.default void updateBookieInfo(java.util.Map<org.apache.bookkeeper.net.BookieId,BookieInfoReader.BookieInfo> bookieInfoMap)
bookieInfoMap - A map that has the bookie to BookieInfodefault int getStickyReadBookieIndex(LedgerMetadata metadata, java.util.Optional<java.lang.Integer> currentStickyBookieIndex)
The default implementation will pick a bookie randomly from the ensemble. Other placement policies will be able to do better decisions based on additional informations (eg: rack or region awareness).
metadata - the LedgerMetadata objectcurrentStickyBookieIndex - if we are changing the sticky bookie after a read failure, the
current sticky bookie is passed in so that we will avoid
choosing it againdefault EnsemblePlacementPolicy.PlacementPolicyAdherence isEnsembleAdheringToPlacementPolicy(java.util.List<org.apache.bookkeeper.net.BookieId> ensembleList, int writeQuorumSize, int ackQuorumSize)
ensembleList - list of BookieId of bookies in the ensemblewriteQuorumSize - writeQuorumSize of the ensembleackQuorumSize - ackQuorumSize of the ensembledefault boolean areAckedBookiesAdheringToPlacementPolicy(java.util.Set<org.apache.bookkeeper.net.BookieId> ackedBookies,
int writeQuorumSize,
int ackQuorumSize)
ackedBookies - list of BookieId of bookies that have acknowledged a write.writeQuorumSize - writeQuorumSize of the ensembleackQuorumSize - ackQuorumSize of the ensembleCopyright © 2011–2024 The Apache Software Foundation. All rights reserved.