It is easy to oversubscribe ourselves during a release
cycle. We approve specs only to find that we fail
to review consistently; code is posted at the last minute,
we get lost chasing race conditions, the odd dependency
throws the gate under the bus, and the ever-present cosmetic
refactoring patch steers our attention away from what matters,
which is *reviewing what we commit to deliver*.
In order to keep ourselves focused on the bigger picture,
this patch proposes/formalizes changes to the way we
register blueprints so that we can better foresee
what we'll deliver at the end of a
release.
We obviously can't predict the future or chain people
to a desk, but a regular pulse check of a well planned
task list should at least help mitigate the feeding frenzy
that usually happens closer to FF and inevitably
leads to half-baked and buggy solutions.
The patch broke notifications about FIP updates and triggered 100%
failures in the Ironic gate.
I believe that I0cbe8c51c3714e6cbdc48ca37135b783f8014905 is also
breaking notifications, but for FIP create, which probably was not
utilized in any gate before and hence not caught in time.
The change introduced by the reverted patch made update_floatingip
fetch the router based on the FIP router_id field on every call, which
was not the case before the patch. For some reason unknown at the
moment, we get NotFound from the database on this fetch.
The patch does not explain why we get NotFound from the
database when fetching by the FIP router_id, but that's another issue
that should be investigated while the Ironic gate is happy.
Fix AttributeError on port_bound for missing ports
If the port is concurrently deleted, db_get_val returns None, and
that causes an AttributeError to be raised. However, the exception is
just log noise if the port has been deleted concurrently, and it does
not lead to failures.
This can happen if port_update and port_delete operations occur
in short sequence and interleave. To prevent this trace from
occurring, this patch checks whether the port is being removed and
emits an error trace only if the port is expected to be
among the ports to be updated.
We do not raise an exception, to avoid disrupting the agent sync
process, and leave the investigation of the issue to the admin
(should it be chronic rather than transient).
Under normal circumstances, if the port is expected it should
be there, and if it isn't this should be treated as a bug to be
investigated further.
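The guard described above could look roughly like the following sketch. The names `bind_port`, `lookup_tag` and `expected_ports` are hypothetical and are not the agent's actual API; `lookup_tag` only stands in for db_get_val and returns None when the port is gone.

```python
import logging

LOG = logging.getLogger(__name__)


def bind_port(port_name, lookup_tag, expected_ports):
    """Return the port's tag, or None if the port has disappeared.

    lookup_tag is a hypothetical stand-in for db_get_val: it returns
    None when the port was deleted concurrently.
    """
    tag = lookup_tag(port_name)
    if tag is None:
        if port_name in expected_ports:
            # The port should exist: log an error, but do not raise,
            # so the agent sync loop keeps running.
            LOG.error("Expected port %s not found", port_name)
        else:
            # Concurrent delete: nothing to do, no trace needed.
            LOG.debug("Port %s was deleted concurrently", port_name)
        return None
    return tag
```

Returning None instead of raising keeps a transient race from aborting the whole sync pass.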
changzhi [Thu, 20 Aug 2015 13:40:42 +0000 (21:40 +0800)]
DVR: Notify specific agent when update floatingip
The L3 agent is already known when a floating IP is updated,
so notify that specific agent rather than all agents.
This saves some RPC resources. This applies only to DVR routers;
legacy and HA routers notify only the relevant agents.
Do not try to delete a veth from a nonexistent namespace
Currently, VethFixture cleanup tries to delete veths from their
namespaces even if the namespaces no longer exist (without crashing
the cleanup). This produces an extra trace if an error is raised, which
can be confusing.
This change avoids trying to delete a veth from a nonexistent namespace
in the VethFixture fixture.
Ihar Hrachyshka [Thu, 15 Oct 2015 15:57:56 +0000 (17:57 +0200)]
Lower the log level for the message about concurrent port delete
There is nothing wrong or interesting in that log message that would be
useful during general cloud operation. It may be useful when debugging
an issue, so leaving the message, but with lowered log level.
John Davidge [Wed, 14 Oct 2015 14:09:45 +0000 (15:09 +0100)]
Update RFE documentation to clarify when the tag is not appropriate
Adds a passage to clarify that the 'rfe' tag is not to be used for work
that is already well-defined, to avoid code being submitted before proper
discussion can take place.
There seems to be a timing issue between the
ARP entries that arrive from the server at
the agent and the internal qr-device getting
created by the agent.
As a result, ARP entries that arrive too early cannot be
applied and are dropped.
This patch makes sure that the early ARP entries
are cached in the agent and then utilized when
the internal device is up.
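A minimal sketch of the caching idea, assuming a hypothetical `ArpCache` class and `apply_entry` callback (neither is taken from the agent code):

```python
from collections import defaultdict


class ArpCache:
    """Hold ARP entries until the matching internal device exists."""

    def __init__(self, apply_entry):
        self._apply_entry = apply_entry    # hypothetical callback
        self._pending = defaultdict(list)  # device name -> entries
        self._ready = set()                # devices that are up

    def add_entry(self, device, ip, mac):
        if device in self._ready:
            self._apply_entry(device, ip, mac)
        else:
            # Device not created yet: cache instead of dropping.
            self._pending[device].append((ip, mac))

    def device_up(self, device):
        # Mark the device as usable and replay any cached entries.
        self._ready.add(device)
        for ip, mac in self._pending.pop(device, []):
            self._apply_entry(device, ip, mac)
```

Entries that arrive after `device_up` are applied immediately; earlier ones are replayed in arrival order.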
This is a null-merge of the 7.0.0 release tag back into the master
branch so that the 7.0.0 tag will appear in the git commit history of
the master branch. It contains no actual changes to the master branch,
regardless of how our code review system's UI represents it. Please
ask in #openstack-infra if you have any questions, and otherwise try
to merge this as quickly as possible to avoid later conflicts on the
master branch.
Brian Haley [Wed, 14 Oct 2015 02:56:24 +0000 (22:56 -0400)]
Create ipset set_name_exists() method
All the callers of set_exists() already know the set name,
so there's no reason to re-calculate it based on the id and
ethertype. Renamed to set_name_exists() to be clear on what
it is checking.
Sachi King [Mon, 28 Sep 2015 05:20:19 +0000 (15:20 +1000)]
Add -constraints sections for base CI jobs
Using factors with sections is not supported and likely will not be for
a while; as such, we have to duplicate sections to be able
to set the constraints-based install_command.
Cyril Roelandt [Tue, 13 Oct 2015 12:46:39 +0000 (14:46 +0200)]
Python 3: make post_test_hook work with more tox targets
This makes it possible to run the script with different "flavors" of
"dsvm-functional" and "dsvm-fullstack". In particular, this is needed to
have a dsvm-functional-py34 gate running (see
https://review.openstack.org/#/c/233177/).
Kevin Benton [Wed, 14 Oct 2015 05:33:53 +0000 (22:33 -0700)]
Move retries out of ML2 plugin
There are several DB retry decorators in the ML2 plugin.
We want to keep them in the API layer as much as possible, so
this patch removes all but one and adds the retryrequest catch
to the API-level decorator.
The remaining one is on update_port_status, which is a method
called internally via RPC, so it's not clear if there is a benefit to
moving it to the ml2 rpc file.
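As an illustration only, an API-level retry decorator could be sketched as below. `RetryRequest` and `retry_on_request` are hypothetical stand-ins written with the standard library; they are not the decorators actually used in the tree.

```python
import functools


class RetryRequest(Exception):
    """Hypothetical retry signal raised by lower layers."""


def retry_on_request(max_retries=3):
    """Re-run the wrapped call when a lower layer asks for a retry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except RetryRequest:
                    # Give up once the retry budget is exhausted.
                    if attempt == max_retries:
                        raise
        return wrapper
    return decorator
```

Keeping a single decorator at the outermost layer avoids nested retries, where an inner retry loop multiplies the attempts made by an outer one.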
Liberty is behind us. The RFE process has been in place for a cycle
and some refinements have been necessary to make it more streamlined.
The cut-over section has outlived its purpose so it can be removed.
In its place, this patch clarifies how we are going to track RFE in
Launchpad, once the RFE has been 'approved'.
The tl;dr version of this patch is: from Mitaka onwards, feature
submission must happen via RFE bug report submission. Launchpad
blueprints are going to become a tool to be used at discretion of
the project release manager for milestone tracking purposes.
John Kasperski [Thu, 24 Sep 2015 23:16:18 +0000 (18:16 -0500)]
Improve performance of ensure_namespace
The ensure_namespace method calls IpNetnsCommand.exists to
determine if the specified namespace exists or not. This is
accomplished by listing all namespaces with "ip netns list"
and then looping through the output to determine if the specified
namespace was included in the output.
Research of various Linux operating systems has indicated that
namespaces are represented as files in /var/run/netns and root
authority is "typically" not required in order to look at the
files in this subdirectory.
The existing configuration option "use_helper_for_ns_read"
will be used to determine if the root-helper should be used
to retrieve the list of namespaces. If this configuration option
is set to False, the native Python os.listdir('/var/run/netns')
will be used.
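The native lookup can be sketched as follows. The function name `namespace_exists` and the `netns_dir` parameter are assumptions made for illustration; only os.listdir and the /var/run/netns path come from the message above.

```python
import os

NETNS_DIR = "/var/run/netns"


def namespace_exists(name, netns_dir=NETNS_DIR):
    """Return True if a namespace file called `name` exists."""
    try:
        return name in os.listdir(netns_dir)
    except OSError:
        # The directory may be missing or unreadable; treat the
        # namespace as absent in that case.
        return False
```

This avoids spawning "ip netns list" through the root-helper for every check.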
Assaf Muller [Tue, 13 Oct 2015 14:19:36 +0000 (10:19 -0400)]
Kill conntrackd state on HA routers FIP disassociation
Legacy routers kill conntrackd states on FIP disassociation,
so that traffic to FIPs that have been disassociated is properly
dropped. This is not the case with HA routers, and this patch
changes that.
By mocking 'open' at the module level, we can avoid affecting
'open' calls from other modules.
2. Stop using LOG.exception in contexts with no sys.exc_info set
Python 3.4 logger fills in record.exc_info with sys.exc_info() result
[1], and then it uses it to determine the current exception [2] to
append to the log message. Since there is no exception, exc_info[1] is
None, and we get AttributeError inside traceback module.
It's actually a bug in the Python interpreter that it attempts to access
the attribute when there is no exception. It turns out that it's fixed in
the latest master of CPython [3] (the intent of the patch does not seem
relevant, but it removes the offending code while reshuffling the code).
Note that CPython now correctly checks the exception value before
accessing its attributes [4].
The patch in CPython that resulted in the failure is [5] and has been
present since the initial Python 3k releases.
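One way to follow this rule is sketched below. The helper `log_failure` is hypothetical; the actual patch may simply have replaced the affected LOG.exception calls with another log method.

```python
import logging
import sys

LOG = logging.getLogger(__name__)


def log_failure(msg, *args):
    """Use LOG.exception only while an exception is being handled."""
    if sys.exc_info()[0] is not None:
        # Inside an except block: the traceback is meaningful.
        LOG.exception(msg, *args)
    else:
        # No active exception: log a plain error without exc_info.
        LOG.error(msg, *args)
```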
Sergey Belous [Tue, 13 Oct 2015 14:50:05 +0000 (17:50 +0300)]
Avoid DuplicateOptError in functional tests
Some tests need to register options with oslo.config,
but some of them currently use the global cfg.CONF for it.
This can cause DuplicateOptError if two tests try to
register the same option with different values.
We should use the oslo_config fixture [1]
in functional tests to avoid it.
Ihar Hrachyshka [Tue, 13 Oct 2015 12:46:02 +0000 (14:46 +0200)]
Make test_server work with older versions of oslo.service
Change I18a11283925369bc918002477774f196010a1bc3 fixed the test for
oslo.service >= 0.10.0, but it also broke it for older versions of
oslo.service. Since the library has minimal version of >= 0.7.0 in
requirements.txt, test should pass for those versions too.
Now, instead of validating that either reset() or restart() of workers
is triggered on SIGHUP, just validate that .start() is triggered the
expected number of times (either way, no matter how oslo.service decides
to clean up the children, they exit and then are respawned).
Oleg Bondarev [Tue, 13 Oct 2015 08:55:02 +0000 (11:55 +0300)]
Always send status update for processed floating ips
Currently the L3 agent skips the status update for floating IPs if the
status didn't change: this might be wrong if the status has changed
on the server side while the agent was processing. See the bug for details.
The L3 agent skips floating IP processing if the IP address already
exists on the external device, so we can still skip the status update
for such floating IPs.
Fix inconsistency in DHCPv6 hosts and options generation
The DHCP agent is inconsistent in how it handles subnets whose
ipv6_address_mode is not slaac. While the DHCP agent writes out both
DHCPv6 host entries and DHCPv6 options for ports scoped by the subnet
for dnsmasq to use, subnet specific options are not written.
This patch addresses this inconsistency by generating subnet specific
options when the subnet's ipv6_address_mode is not slaac.
Assaf Muller [Mon, 12 Oct 2015 20:17:13 +0000 (16:17 -0400)]
Fix error returned when an HA router is updated to DVR
Before this patch, the code compares the 'ha' flag that
comes in from the user, and the current state of the 'distributed'
flag in the DB. This is wrong because if a router is currently
HA in the DB, and the update request contains only
{'distributed': True}, then the 'ha' flag from the request
is None and the error condition is never raised!
The reason the unit tests
(specifically test_migrate_ha_router_to_distributed)
did not catch this issue is
another bug: the _update_router helper method in the L3 HA
unit tests had an 'ha' default value of True, when it should
have had a default value of None. Setting it to None fails
the unit test (because it raises the wrong exception),
and the contents of the patch make the unit test pass.
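The corrected comparison can be illustrated with the sketch below. It assumes that a flag missing from the request falls back to the value stored in the DB; `validate_router_update` and `DistributedHANotSupported` are hypothetical names, not the ones used in the patch.

```python
class DistributedHANotSupported(Exception):
    """Hypothetical stand-in for the error returned to the user."""


def validate_router_update(request, router_db):
    """Reject updates that would make a router both HA and distributed.

    `request` holds only the fields the user sent; `router_db` holds
    the flags currently stored for the router.
    """
    def effective(flag):
        # A flag absent from the request keeps its stored value.
        requested = request.get(flag)
        return router_db[flag] if requested is None else requested

    if effective("ha") and effective("distributed"):
        raise DistributedHANotSupported()
```

With a router stored as HA, a request of {'distributed': True} now raises, while {'ha': False, 'distributed': True} passes.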
Assaf Muller [Mon, 12 Oct 2015 14:40:49 +0000 (10:40 -0400)]
Remove disable_service from DBs configuration
Remove disable_service from configure_for_func_testing.
A recent Devstack patch (linked in the bug report) checks
that a disabled service is not enabled later. This breaks
the code this patch touches. I believe the DBs were disabled
and enabled with the assumption that Devstack expects only
a single DB to be configured at a time, but that doesn't
seem to be the case. Simply removing the disable calls seems
to work fine.
Also exclude oslo.messaging==2.6.0 as per global-requirements.txt.
Oleg Bondarev [Mon, 5 Oct 2015 14:34:45 +0000 (17:34 +0300)]
DVR: notify specific agent when creating floating ip
Currently, when a floating IP is created, a lot of useless work
happens: the floating IP router is scheduled, all L3 agents where
the router is scheduled are notified about the router update, and all
agents request full router info from the server. All this becomes a big
performance problem at scale with lots of compute nodes.
In fact, on (associated) floating IP creation we only need
to notify the specific L3 agent on the compute node where the associated
VM port is located; we do not need to schedule the router or
bother other agents where the router is scheduled. This should
significantly decrease unneeded load on the Neutron server at scale.