Kevin Benton [Thu, 3 Sep 2015 17:01:40 +0000 (10:01 -0700)]
Add utility function for checking trusted port
Ports that have a device_owner starting with 'network:' are
trusted in several places throughout the codebase. Each of
these places did its own startswith check on the field, and
it's not immediately obvious why it's done.
This patch adds a utility function called 'is_port_trusted'
that performs the same check and makes it obvious what is
being done.
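
A minimal sketch of what such a helper could look like (the constant
name here is an assumption, not necessarily the one used in the tree):

    DEVICE_OWNER_NETWORK_PREFIX = 'network:'

    def is_port_trusted(port):
        # A port is trusted when its device_owner marks it as a network
        # device (router ports, DHCP ports, ...); policy restricts who
        # may set such a value.
        return port['device_owner'].startswith(DEVICE_OWNER_NETWORK_PREFIX)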
Clouds deployed at scale will most likely use these scheduler
drivers because they allow a fairer resource allocation compared
to chance schedulers (which randomly place resources on hosts).
Because of their importance, it's only wise to test them in
the gate on a continuous basis, so that we do not get surprised
by accidental regressions.
Rather than pushing this down through devstack-gate/project-config
patches, this change alters the default scheduler drivers, so
that users can also pick these up out of the box.
This means that after an upgrade they would observe a change in
the scheduling behavior, if they relied on the default config.
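
For illustration, the out-of-the-box defaults would end up looking
roughly like this in neutron.conf; the exact driver paths shown are an
assumption based on the schedulers Neutron ships, so verify them
against the release notes:

    [DEFAULT]
    # assumed new defaults after this change
    network_scheduler_driver = neutron.scheduler.dhcp_agent_scheduler.WeightScheduler
    router_scheduler_driver = neutron.scheduler.l3_agent_scheduler.LeastRoutersScheduler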
Fix BadRequest error on add_router_interface for DVR
This operation for DVR is made of multiple steps, some of
which are not within the same DB transaction. For this
reason, if a failure occurs, the rollback will be partial.
This inconsistent state leads the retry logic to fail with
BadRequest, because the router is believed to be already
connected to the subnet.
To fix this condition, it is necessary to delete the port
should the DB deadlock occur.
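
A rough sketch of the cleanup idea, with hypothetical helper names
standing in for the real add_router_interface steps:

    from oslo_db import exception as db_exc

    def add_interface_with_cleanup(plugin, context, router_id, port):
        # Hypothetical shape of the fix: the DVR add_router_interface
        # flow spans several transactions, so if a deadlock hits after
        # the interface port exists, delete it; otherwise the retry
        # finds the router "already connected" and fails with
        # BadRequest.
        try:
            plugin.finish_add_interface(context, router_id, port)  # hypothetical
        except db_exc.DBDeadlock:
            plugin._core_plugin.delete_port(context, port['id'])
            raise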
This test's initial design is problematic: it spawns keepalived,
it asserts the process is up, then it attempts to kill it.
However, this is when problems may arise:
a) it does so by using the disable method on the process - we
should be more rude than that if we want to simulate a crash!
b) keepalived may be forking while it is starting and it is
possible that for a moment the ppid changes and the process
owner invoking the kill has no rights to kill the spawned
process. This is the most plausible explanation I could find
as to why kill returns 1 with no standard error
c) it does not verify that the process has indeed disappeared
(what if the pm.disable didn't work?) - this means that the
test can pass, and yet the monitor may not work.
Bottom line: this test relied on the correctness of the very code
it was meant to validate...and that's not cool. To this aim, we
wait for the process to be active, kill it with a kill -9, and
verify that the process running after the kill is indeed a
different one.
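
The new test shape, sketched with a stand-in get_pid() accessor (the
real test reads the pid through the process monitor):

    import os
    import signal
    import time

    def wait_until(predicate, timeout=10, interval=0.5):
        # minimal wait helper for the sketch; the tree has its own variant
        deadline = time.time() + timeout
        while not predicate():
            if time.time() > deadline:
                raise RuntimeError('timed out')
            time.sleep(interval)

    def test_keepalived_respawns(get_pid):
        wait_until(lambda: get_pid() is not None)
        pid = get_pid()
        os.kill(pid, signal.SIGKILL)  # be rude: simulate a real crash
        # the monitor must respawn keepalived under a different pid
        wait_until(lambda: get_pid() not in (None, pid))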
Reservations: Don't count usage if resource is unlimited
If a resource is unlimited (i.e. limit < 0) then there is no need
to verify headroom for it. This also means that there is no need
to count it; therefore it is possible to save some DB operations
by skipping the count phase.
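
A sketch of the optimization (names are illustrative, not the actual
reservation code):

    class OverQuota(Exception):
        pass

    def check_headroom(requested, limits, count_usage):
        # limit < 0 means unlimited: skip both the headroom check and
        # the (costly) DB count for that resource.
        for resource, amount in requested.items():
            limit = limits[resource]
            if limit < 0:
                continue  # unlimited: nothing to verify, nothing to count
            if count_usage(resource) + amount > limit:
                raise OverQuota(resource)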
ARP does not support IPv6 addresses, so when we try to apply the flow, it
fails, taking down all the other flows deferred in the same transaction.
This results in random flow breakages, depending on where the bad flow
falls in the transaction.
Change-Id: I0ecf167653e5a7d0916e091e05050406a026a1e2
Co-Authored-By: Thomas Carroll <Thomas.Carroll@pnnl.gov>
Closes-Bug: #1477253
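
A sketch of the guard (netaddr is available in Neutron; the bridge
helper name is an assumption):

    import netaddr

    def set_arp_responder(bridge, ip_address, mac):
        # ARP is IPv4-only; skip IPv6 entries rather than submit a flow
        # that fails and drags down every other flow deferred in the
        # same transaction.
        if netaddr.IPAddress(ip_address).version != 4:
            return
        bridge.install_arp_responder(ip_address, mac)  # hypothetical helper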
Configure gw_iface for RAs only in Master HA Router
For an HA Router which does not have any IPv6 subnets in the external network
and when ipv6_gateway is not set, Neutron configures the gateway interface of
the router to receive Router Advts for default route. In an HA router, only
the Master instance has the IP addresses, while the Backup instance does
not have any (including the LLA). In kernel version 3.10, when the last
IPv6 address is removed from the interface, the IPv6 proc entries
corresponding to the iface are also deleted. This was, however, reverted
in later versions of the kernel.
This patch addresses the issue by configuring the proc entry only for the
Master HA router instance instead of doing it unconditionally.
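
A sketch of the conditional configuration; the helper shape is an
assumption, but accept_ra=2 ("accept RAs even with forwarding enabled")
is the setting Neutron uses for gateway interfaces:

    def configure_ipv6_ra(ip_wrapper, dev_name, is_master):
        # Only the master HA instance (the one holding addresses) should
        # accept RAs for the default route.
        if not is_master:
            return
        ip_wrapper.netns.execute(
            ['sysctl', '-w', 'net.ipv6.conf.%s.accept_ra=2' % dev_name])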
Ryan Moats [Fri, 11 Sep 2015 12:41:38 +0000 (07:41 -0500)]
Remove useless log from periodic_sync_routers_task
Logging that periodic_sync_routers_task is starting with fullsync
False just adds noise to devstack logs. Reposition the log
statement so that it indicates the task is starting only when it is
going to do real processing.
Change-Id: I73def1e20218b01c135769d0b8fbce449dad17ea
Signed-off-by: Ryan Moats <rmoats@us.ibm.com>
Assaf Muller [Thu, 11 Jun 2015 21:13:44 +0000 (17:13 -0400)]
Add l2pop support to full stack tests
Add the l2pop mechanism driver to the ML2 plugin configuration, and set
l2_population = True in the OVS agent configuration.
Each test class can enable or disable l2pop in its environment.
Assaf Muller [Tue, 16 Jun 2015 12:56:41 +0000 (08:56 -0400)]
Add tunneling support to full stack tests
* EnvironmentDescription class now accepts 'network_type'.
It sets the ML2 segmentation type, passes it to the OVS agents'
configuration files, and sets up the host configuration. If a
tunnelling network type is selected, it sets up a veth pair with
an IP address from the 240.0.0.1+ range. The addressed end of
this pair is configured as the local_ip for tunnelling purposes
in each of the OVS agents. If the network type is not tunnelled,
it sets up provider bridges instead and interconnects them.
* For now we run the basic L3 HA test with VLANs and tunneling just
so we have something to show for it.
* I started using scenarios in fullstack tests to run the same test
with VLANs or tunneling, and because test names are used for log
dirs, and testscenarios changes test names to include characters
that are not shell friendly (space, parentheses), I 'sanitized'
some of those characters (see the sketch below).
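
A sanitization along these lines (illustrative; the exact characters
handled in the patch may differ):

    def sanitize_log_dir_name(test_name):
        # e.g. "test_l3_ha (VLANs)" -> "test_l3_ha-VLANs"
        return test_name.replace(' ', '-').replace('(', '').replace(')', '')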
Handle ObjectDeletedError when deleting network ports/subnets
It appeared there is still a race on port deletion when deleting
networks. So commit a55e10cfd6369533f0cc22edd6611c9549b8f1b4
introduced a regression. It's a bit ironic that that commit's message
was "Avoid DB errors when deleting network's ports and subnets".
Shame on me!
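
The handling, sketched (delete_port stands in for the real deletion
call; ObjectDeletedError is SQLAlchemy's real exception for this case):

    from sqlalchemy.orm import exc as sa_exc

    def delete_network_ports(delete_port, ports):
        # Touching an expired row that a concurrent transaction already
        # removed raises ObjectDeletedError; treat it as "already
        # deleted" instead of failing the whole network delete.
        for port in ports:
            try:
                delete_port(port.id)
            except sa_exc.ObjectDeletedError:
                continue  # someone else beat us to it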
Stephen Ma [Fri, 28 Aug 2015 14:00:48 +0000 (14:00 +0000)]
Descheduling DVR routers when ports are unbound from VM
When a VM is deleted, the DVR port used by the VM could be unbound
from the compute node. When it is unbound, it is no longer
in use on the node. Currently the unbind doesn't trigger a check
to determine whether the DVR router can be unscheduled from the
L3 agent running on the compute node. This patch makes the check
and unschedules the router, if necessary.
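
A hypothetical shape of the new check (all helper names here are
assumptions for illustration):

    def check_dvr_deschedule(plugin, context, router_id, host):
        # If the host no longer has any DVR-serviceable ports for this
        # router, unschedule the router from the L3 agent running there.
        if plugin.dvr_ports_exist_on_host(context, router_id, host):
            return  # hypothetical query helper
        agent = plugin.get_l3_agent_on_host(context, host)  # hypothetical
        plugin.remove_router_from_l3_agent(context, agent['id'], router_id)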
Reduce the chance of random check/gate test failures
As previously implemented, the TestTrackedResource class effectively
injects random failures into the gate. It generates random numbers
within the range 0..10000, and will fail if it generates duplicate
random numbers during its run.
This patch creates UUIDs instead of random numbers, making the
chance of a collision vanishingly small.
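
The change in a nutshell:

    import uuid

    # Before (sketch): with values drawn from 0..10000, the birthday
    # paradox makes a duplicate likely after only ~120 draws.
    #     value = random.randint(0, 10000)
    # After: a UUID4 carries 122 random bits, so collisions are
    # vanishingly unlikely.
    value = uuid.uuid4().hex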
Carl Baldwin [Tue, 1 Sep 2015 16:58:22 +0000 (16:58 +0000)]
Make ip address optional to add_route and delete_route
The add_route and delete_route methods require that the ip (actually
"via" in ip route terms) be passed. Some routes don't require this.
This patch makes it optional while maintaining the position for those
callers who do pass it by position.
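
A sketch of the new signature (the low-level runner is a stand-in for
whatever executes the ip command):

    def add_route(self, cidr, via=None, table=None):
        # "via" (the next hop) is now optional since some routes (e.g.
        # on-link/device routes) don't have one; callers that pass it
        # positionally keep working because its position is unchanged.
        args = ['replace', cidr]
        if via:
            args += ['via', via]
        if table:
            args += ['table', table]
        self._as_root(args)  # hypothetical "ip route" runner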
Carl Baldwin [Fri, 28 Aug 2015 21:28:39 +0000 (21:28 +0000)]
Add list routes
This adds list_routes while refactoring list_onlink_routes to share
the implementation. It changes test_onlink_routes to be consistent
with the new list_routes in the type of data it returns.
lzklibj [Mon, 2 Mar 2015 10:13:41 +0000 (02:13 -0800)]
Fix dvr update for a subnet attached to multiple routers
Fix method dvr_update_router_addvm to notify every
router attached to the subnet where the VM will boot.
In the DVR case, when a subnet is attached to only one router,
the subnet will have only one distributed router interface,
whose device_owner is "network:router_interface_distributed".
So in this case, get_ports in this method will return only one
port, and breaking out of the for loop makes no difference.
But when a subnet is attached to multiple routers, get_ports
will return all the distributed router interfaces, and every
router holding one of those interfaces should be notified when
an instance boots on the subnet. So the loop must not break
early.
Change-Id: I3a5808e5b6e8b78abd1a5b924395844507da0764
Closes-Bug: #1427122
Co-Authored-By: Ryan Moats <rmoats@us.ibm.com>
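
The fix, sketched (notify_router_updated stands in for the real
notification call; the filter shape is an assumption):

    DEVICE_OWNER_DVR_INTERFACE = 'network:router_interface_distributed'

    def dvr_update_router_addvm(plugin, context, port, notify_router_updated):
        # Notify *every* router with a distributed interface on the VM's
        # network instead of breaking out after the first match.
        filters = {'network_id': [port['network_id']],
                   'device_owner': [DEVICE_OWNER_DVR_INTERFACE]}
        for intf in plugin.get_ports(context, filters=filters):
            notify_router_updated(context, intf['device_id'])  # no break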
Carl Baldwin [Fri, 28 Aug 2015 21:19:40 +0000 (21:19 +0000)]
Make ip rule comparison more robust
I found that ip rules would be added multiple times in new address
scopes code because the _exists method was unable to reliably
determine if the rule already existed. This commit improves this by
more robustly canonicalizing what it reads from the ip rule command so
that like rules always compare the same.
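
A sketch of the canonicalization idea (the exact normalizations in the
patch may differ):

    def canonicalize_ip_rule(rule):
        # Normalize fields read back from "ip rule" so that equivalent
        # spellings compare equal.
        if rule.get('from') in (None, 'all', 'any'):
            rule['from'] = '0.0.0.0/0'
        # priorities come back as strings; compare them numerically
        rule['priority'] = int(rule.get('priority', 0))
        return rule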
Add test to check that correct functions are used in expand/contract
This test checks that the expand branch does not contain drop SQLAlchemy
operations and that the contract branch does not contain create/add
SQLAlchemy operations.
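
An illustrative (not the actual) form such a check could take, scanning
a migration script's source for banned alembic op calls:

    import re

    def assert_branch_clean(script_source, banned_prefixes):
        for prefix in banned_prefixes:
            if re.search(r'\bop\.%s' % prefix, script_source):
                raise AssertionError('%s operations not allowed here' % prefix)

    # expand scripts must not remove things; contract must not add them:
    # assert_branch_clean(source, ['drop_'])
    # assert_branch_clean(source, ['create_', 'add_'])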
Mike Bayer [Fri, 14 Aug 2015 18:44:28 +0000 (14:44 -0400)]
Add non-model index names to autogen exclude filters
The SQLAlchemy MySQL dialect generates implicit indexes
in the less-common case of an integer column within a composite
primary key where autoincrement is not set to False.
Add a rule to ignore these indexes when performing
autogenerate against a target database.
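
An exclude filter of this sort is typically wired up through Alembic's
include_object hook; a sketch (the set of index names to skip is purely
illustrative):

    IMPLICIT_MYSQL_INDEXES = {'ix_some_table_some_column'}  # illustrative

    def include_object(obj, name, type_, reflected, compare_to):
        # Ignore indexes that only the MySQL dialect creates implicitly
        # (integer member of a composite PK with autoincrement left on),
        # so autogenerate stops proposing to drop them.
        if type_ == 'index' and reflected and name in IMPLICIT_MYSQL_INDEXES:
            return False
        return True

    # context.configure(..., include_object=include_object)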
Mike Bayer [Mon, 20 Jul 2015 22:34:15 +0000 (18:34 -0400)]
Implement expand/contract autogenerate extension
Makes use of new Alembic 0.8 features to allow
altering of the "alembic revision" stream such
that operations for expand and contract are
directed into separate branches.
Today the FloatingIP agent gateway port is deleted and
re-created by the plugin for DVR-based routers, based on
floating IP association and disassociation with VMs on
compute nodes.
This puts a lot more strain on the plugin, which has to
create and delete these ports as VMs associated with
floating IPs come up and get deleted.
This patch introduces an RPC call for the agent
to initiate an agent gateway port delete.
The agent will also look for the last floating IP that
it manages and, if the condition is satisfied, request
that the server remove the FloatingIP Agent
Gateway port.
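
A hypothetical agent-side hook illustrating the flow (the accounting
helper and RPC method name are assumptions):

    def floating_ip_removed(agent, context, ext_net_id):
        # When the last floating IP this agent manages on the external
        # network is gone, ask the server over RPC to delete the FIP
        # agent gateway port.
        if agent.get_floating_ip_count(ext_net_id):  # hypothetical counter
            return
        agent.plugin_rpc.delete_agent_gateway_port(
            context, agent.host, ext_net_id)  # RPC method name assumed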
Delete FIP agent gateway port with external gw port
FIP agent gateway ports are associated with external
networks and a specific host.
Today FIP agent gateway ports are deleted on every
floating IP associate and disassociate. This
introduces race conditions in the port delete and also
unnecessary access to the DB.
This patch will delete the FIP agent gateway port when
the last gateway port of the external network is deleted.
The child patch linked to this parent patch will clean
up the FIP agent gateway port deletion that happens on
floating IP associate, disassociate and delete.
This should also cover the case where an agent was for
some reason unable to request the agent gw port delete
(e.g. the agent died).
Previous changes[1] have been merged as enablers[2] to fix bug 1274034,
but an alternative solution has been chosen, so the introduced code can
now be considered dead code.
This change removes [2], the associated tests and the rootwrap filters.
Kevin Benton [Wed, 26 Aug 2015 05:03:27 +0000 (22:03 -0700)]
Stop device_owner from being set to 'network:*'
This patch adjusts the FieldCheck class in the policy engine to
allow a regex rule. It then leverages that to prevent users from
setting the device_owner field to anything that starts with
'network:' on networks which they do not own.
This policy adjustment is necessary because ports with a
device_owner that starts with 'network:' will not have any security
group rules applied, as they are assumed to be trusted network
devices (e.g. router ports, DHCP ports, etc). These security rules
include the anti-spoofing protection for DHCP, IPv6 ICMP messages,
and IP headers.
Without this policy adjustment, tenants can abuse this trust when
connected to a shared network with other tenants by setting their
VM port's device_owner field to 'network:<anything>' and hijack other
tenants' traffic via DHCP spoofing or MAC/IP spoofing.
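
A sketch of what regex support in the policy engine's FieldCheck could
look like, assuming a '~' prefix marks a regex rule (as in a rule such
as field:port:device_owner=~^network:):

    import re

    class FieldCheck(object):
        # A value beginning with '~' is treated as a regex to match
        # against the target field instead of an exact value.
        def __init__(self, field, value):
            self.field = field
            self.regex = re.compile(value[1:]) if value.startswith('~') else None
            self.value = value

        def __call__(self, target_value):
            if self.regex:
                return bool(self.regex.match(str(target_value)))
            return target_value == self.value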
Aman Kumar [Tue, 17 Mar 2015 10:41:54 +0000 (03:41 -0700)]
ovs agent resync may miss port remove event
In the OVS agent, the rpc_loop() resync mechanism clears the registered
ports and rescans them, which can result in missing a "port removed"
event, so treat_devices_removed is never called.
This fix rescans the newly updated ports when the resync mechanism is
invoked, without clearing the currently registered ports.
The registered ports are cleared only after too many consecutive
resyncs, to avoid resyncing forever because of the same faulty port.
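
The idea, sketched (attribute and helper names are illustrative):

    def handle_resync(agent):
        # Keep the registered ports across a resync so "removed" events
        # are still detected; wipe them only after too many consecutive
        # resyncs (likely a faulty port we keep tripping on).
        MAX_CONSECUTIVE_RESYNCS = 3  # illustrative threshold
        agent.consecutive_resyncs += 1
        if agent.consecutive_resyncs >= MAX_CONSECUTIVE_RESYNCS:
            agent.registered_ports = set()
            agent.consecutive_resyncs = 0
        agent.updated_ports |= agent.scan_updated_ports()  # hypothetical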
Retry metadata request on connection refused error
This testcase may fail intermittently with a 'Connection refused' error.
This could be because the metadata proxy setup is not quite complete
by the time the request is issued; in fact there is no synchronization
between the router being up and the metadata request being issued, and
this may well be the reason for the occasional but seldom failures.
In order to rule out this possibility and stabilize the test, let's retry
on connection refused only. If we continue to fail, the next step would
be to dump the content of iptables to figure out why the error occurs.
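
A sketch of the retry shape (fetch stands in for the actual metadata
request; retry counts are illustrative):

    import errno
    import socket
    import time

    def query_metadata_with_retry(fetch, retries=10, delay=1):
        # Retry only on ECONNREFUSED, which indicates the metadata proxy
        # isn't listening yet; any other error still fails fast.
        for _ in range(retries):
            try:
                return fetch()
            except socket.error as e:
                if e.errno != errno.ECONNREFUSED:
                    raise
                time.sleep(delay)
        return fetch()  # last attempt, let the error propagate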
This patch doesn't change the behaviour of the DHCP agent,
but adds the opportunity to use a user-defined config,
which makes the DHCP agent more flexible
and allows functional tests to run correctly
(without changing the global oslo.config CONF).
This patch deals with the lock wait timeout and the deadlock errors
observed under high concurrency (api_workers >= 4) with the pymysql
driver. It includes the following changes:
- Stop setting the dirty status for resource usage when creating a
  reservation, as usage of reserved resources is no longer tracked;
- Add a variable, increasing delay when retrying make_reservation
  upon a DBDeadlock error, in order to reduce the chance of further
  collisions (see the sketch after this list);
- Enable transaction retry upon DBDeadlock errors for set_quota_usage;
- Do not resync quota usage while making reservation. This puts a lot
of stress on the database and is also wasteful since resource usage
is very likely to change again once the transaction is committed;
- Use autonested_transaction to simplify logic around when the
nested flag should be used.
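
An illustrative retry shape for the increasing-delay item above (the
backoff constants are assumptions; DBDeadlock is oslo.db's real
exception):

    import random
    import time

    from oslo_db import exception as db_exc

    def make_reservation_with_retry(make_reservation, max_retries=5):
        # Back off a little longer (with jitter) on every DBDeadlock so
        # concurrent API workers stop colliding on the same rows.
        for attempt in range(max_retries):
            try:
                return make_reservation()
            except db_exc.DBDeadlock:
                if attempt == max_retries - 1:
                    raise
                time.sleep((attempt + 1) * 0.1 + random.random() * 0.1)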
Moshe Levi [Tue, 18 Aug 2015 05:48:24 +0000 (08:48 +0300)]
Qos SR-IOV: Refactor extension delete to get mac and pci slot
When calling delete we need the pci slot details to reset the VF rate. The
problem is that when the VM is deleted, libvirt returns the VF to the
hypervisor and the eswitch manager marks the pci_slot as unassigned, so we
can't tell from the mac which pci slot (VF) to reset. Also, newer libvirt
versions reset the mac when deleting a VM, so then it is not possible at
all.
The solution is to keep the pci slot details locally in the agent, since
upon a removal event you cannot get the pci_slot from the neutron server
(as you can for create/update), because the port has already been removed
from neutron.
This patch pairs the mac and pci_slot for a device (VF), so that when
calling the extension port delete API we have the pci_slot and can reset
the VF rate.
It also adds a mapping from mac to port_id so we can pass the port_id
when calling the extension port delete API.
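
The local bookkeeping, sketched (class and method names are
illustrative, not the agent's actual structures):

    class SriovDeviceCache(object):
        # Remember (pci_slot, port_id) per mac while the device is
        # present, so the delete path can still reset the VF rate after
        # libvirt has reclaimed the VF (and possibly wiped the mac).
        def __init__(self):
            self._by_mac = {}

        def remember(self, mac, pci_slot, port_id):
            self._by_mac[mac] = (pci_slot, port_id)

        def pop(self, mac):
            # returns (pci_slot, port_id), or (None, None) if unknown
            return self._by_mac.pop(mac, (None, None))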