Summary of some kernel issues #341

Open
philo-micas opened this issue Oct 9, 2023 · 4 comments

@philo-micas
Contributor

We recently encountered some kernel driver issues on M2-W6510-48V8C:

  1. Problems encountered by optoe drivers
    Code Path: drivers/misc/eeprom/optoe.c
    Issue 1: The write_timeout value is fixed at 25ms. In testing on our products, we found that 25ms is not long enough for some optical modules, so I2C reads and writes occasionally fail.
    Issue 2: When optoe's dev_class is switched from 2 to 1 and then from 1 to 3, a crash occurs.

  2. Problems encountered by tmp401 drivers
    Code Path: drivers/hwmon/tmp401.c
    Issue: We use the TMP411 temperature sensor chip, which occasionally returns an abnormal temperature value (255.2℃).

  3. Problems encountered by i2c-i801 drivers
    Code Path: drivers/i2c/busses/i2c-i801.c
    Issue 1: The error "SMBus is busy, can't use it!" occurs occasionally, after which the I2C controller cannot be used.
    Issue 2: An occasional I2C deadlock (SCL is high and SDA is low) cannot be recovered, making the I2C controller unusable.

  4. Problems encountered by i2c-ismt drivers
    Code Path: drivers/i2c/busses/i2c-ismt.c
    Issue: An occasional I2C deadlock (SCL is high and SDA is low) cannot be recovered, making the I2C controller unusable.

  5. Problems encountered by i2c-gpio drivers
    Code Path: drivers/i2c/algos/i2c-algo-bit.c
    Issue 1: An occasional I2C deadlock (SCL is high and SDA is low) cannot be recovered, making the I2C controller unusable.
    Issue 2: When a faulty optical module with SDA shorted to GND is inserted, the I2C GPIO controller does not detect the fault, and all data read back is 0 (at the I2C protocol level, the access appears successful).

  6. Problems encountered by i2c-mux-pca954x drivers
    Code Path: drivers/i2c/muxes/i2c-mux-pca954x.c
    Issue 1: After a peripheral behind a PCA9548 is accessed, the PCA9548 channel is not closed. On our product, one I2C controller is connected to multiple PCA9548s, and each PCA9548 channel is connected to an optical module (address 0x50). Because the channel stays open after an access, I2C address conflicts occur and accesses fail.
    Issue 2: When a faulty optical module with SCL shorted to GND is connected and its PCA9548 channel is opened, SCL is pulled low. The entire I2C controller then stops working, and every I2C peripheral behind that controller becomes inaccessible and cannot be recovered (the scope of the fault expands).

Below are our solutions to resolve the above issues:

  1. Problems encountered by optoe drivers
    Issue 1: This may be related to module characteristics that differ between manufacturers.
    We currently set write_timeout to 50ms in our product, and it runs well. Could this parameter be exposed so that users can set their own timeout (keeping 25ms as the default so the original behavior is unchanged)?
    Issue 2: The problematic code is below:
	if (optoe->client[1])
		i2c_unregister_device(optoe->client[1]);
	optoe->bin.size = ONE_ADDR_EEPROM_SIZE;
	optoe->num_addresses = 1;

After i2c_unregister_device(optoe->client[1]) runs, optoe->client[1] must be set to NULL. Otherwise the stale pointer passes the check and is unregistered again on the next switch, and a crash occurs. The repaired code is below.

	if (optoe->client[1]) {
		i2c_unregister_device(optoe->client[1]);
		optoe->client[1] = NULL;
	}
	optoe->bin.size = ONE_ADDR_EEPROM_SIZE;
	optoe->num_addresses = 1;
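
The effect of clearing the pointer can be illustrated with a small user-space sketch. Everything here is a hypothetical stand-in (`mock_optoe`, `mock_client`, `mock_unregister_device`), not the real optoe structures or the kernel's `i2c_unregister_device()`; it only shows that a second switch becomes a no-op once the pointer is reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real optoe structures. */
struct mock_client {
	int addr;
};

struct mock_optoe {
	struct mock_client *client[2];
	int num_addresses;
};

/* Counts calls so that a repeated unregister would be observable. */
static int unregister_calls;

/* Stand-in for i2c_unregister_device(): releases the client. */
static void mock_unregister_device(struct mock_client *c)
{
	unregister_calls++;
	free(c);
}

/* Mirrors the repaired logic: drop the second client and forget the pointer. */
static void switch_to_one_addr(struct mock_optoe *o)
{
	assert(o != NULL);
	if (o->client[1]) {
		mock_unregister_device(o->client[1]);
		o->client[1] = NULL; /* without this, the next switch frees it again */
	}
	o->num_addresses = 1;
}
```

Calling `switch_to_one_addr()` twice on the same object releases the second client only once, which corresponds to the 2 to 1 and then 1 to 3 switching sequence no longer touching a freed client.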
  2. Problems encountered by tmp401 drivers
    Cause: The abnormal TMP411 temperature readings are caused by the TMP411 timeout mechanism.
    The following is excerpted from the TMP411 data sheet:
TIMEOUT FUNCTION
When bit 7 of the Consecutive Alert Register is set high, the TMP411 timeout function is enabled. The TMP411 resets the serial interface if either SCL or SDA are held low for 30ms (typical) between a START and STOP condition. If the TMP411 is holding the bus low, it releases the bus and waits for a START condition. To avoid activating the timeout function, it is necessary to maintain a communication speed of at least 1kHz for the SCL operating frequency. The default state of the timeout function is enabled (bit 7 = high).

At present, we resolve this issue by modifying the driver to disable the TMP411 timeout mechanism during probe.
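
Disabling the timeout amounts to a read-modify-write that clears bit 7 of the Consecutive Alert Register. The sketch below runs against a mock register file in user space; the register pointer address `0x22` is an assumption that should be checked against the data sheet, and `mock_read_byte()` / `mock_write_byte()` are hypothetical stand-ins for the SMBus byte accessors a real driver would use.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pointer address of the Consecutive Alert Register; verify in the data sheet. */
#define CONSEC_ALERT_REG   0x22
/* Bit 7 enables the timeout function, per the data sheet excerpt. */
#define TIMEOUT_ENABLE_BIT 0x80

/* Mock register file standing in for the chip. */
static uint8_t regs[256];

/* Hypothetical stand-ins for SMBus byte reads and writes. */
static int mock_read_byte(uint8_t reg)
{
	return regs[reg];
}

static int mock_write_byte(uint8_t reg, uint8_t val)
{
	regs[reg] = val;
	return 0;
}

/* Clears only the timeout bit and leaves the other bits untouched. */
static int tmp411_disable_timeout(void)
{
	int val = mock_read_byte(CONSEC_ALERT_REG);

	if (val < 0)
		return val;            /* propagate a read failure */
	if (!(val & TIMEOUT_ENABLE_BIT))
		return 0;              /* already disabled, nothing to write */
	return mock_write_byte(CONSEC_ALERT_REG,
			       (uint8_t)(val & ~TIMEOUT_ENABLE_BIT));
}
```

Preserving the remaining bits matters because the same register also holds the consecutive-alert configuration.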

  3. Problems encountered by i2c-i801 drivers
    Issue 1: The cause of the SMBus busy condition is unknown. We have added controller reset logic: when SMBus busy occurs, the controller is reset so that it can recover from the fault.
    Issue 2: The cause of the I2C deadlock is unknown. We have added an I2C nine-clock recovery function: before each I2C transfer, the driver checks for a deadlock (SCL is high and SDA is low), and if one is detected, it sends nine clock pulses to unlock the device.
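
The nine-clock recovery described above can be sketched in user space against a mock bus. This is not the code from our PR: `mock_bus`, `set_scl()`, and `recover_bus()` are hypothetical names, and the mock slave is assumed to release SDA once it has been clocked through its remaining bits. The mainline I2C core also offers a generic helper along these lines (`i2c_generic_scl_recovery()` with `struct i2c_bus_recovery_info`), which may be worth comparing against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RECOVERY_CLOCKS 9

/* Mock bus: the slave holds SDA low until it has seen stuck_bits clock pulses. */
struct mock_bus {
	bool scl;
	int stuck_bits;
	int pulses_sent;
};

static bool get_scl(const struct mock_bus *b)
{
	return b->scl;
}

static bool get_sda(const struct mock_bus *b)
{
	return b->stuck_bits == 0;
}

static void set_scl(struct mock_bus *b, bool level)
{
	/* On each rising edge the stuck slave shifts out one more bit. */
	if (level && !b->scl) {
		b->pulses_sent++;
		if (b->stuck_bits > 0)
			b->stuck_bits--;
	}
	b->scl = level;
}

/* Returns 0 if both lines are high afterwards, -1 if the bus is still stuck. */
static int recover_bus(struct mock_bus *b)
{
	int i;

	assert(b != NULL);
	if (get_scl(b) && get_sda(b))
		return 0;   /* bus is idle, no recovery needed */
	if (!get_scl(b))
		return -1;  /* SCL itself is held low, clocking cannot help */

	for (i = 0; i < RECOVERY_CLOCKS; i++) {
		set_scl(b, false);
		set_scl(b, true);
		if (get_sda(b))
			break;  /* slave released SDA, stop early */
	}
	return get_sda(b) ? 0 : -1;
}
```

A real implementation would typically follow a successful recovery with a STOP condition and would insert delays matching the bus speed between edges; both are omitted here.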

  4. Problems encountered by i2c-ismt drivers
    Code Path: drivers/i2c/busses/i2c-ismt.c
    Issue: The solution is the same as for the i2c-i801 driver: we added the nine-clock unlock function.

  5. Problems encountered by i2c-gpio drivers
    Code Path: drivers/i2c/algos/i2c-algo-bit.c
    Issue 1: The solution is the same as for the i2c-i801 driver: we added the nine-clock unlock function.
    Issue 2: We added I2C-GPIO voltage level detection, which checks the SCL and SDA levels before an I2C transfer begins. If SCL and SDA are not both high, a failure is returned.
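
The level check for i2c-gpio can be illustrated with a minimal user-space sketch. The names `mock_lines` and `guarded_xfer()` are hypothetical, and returning `-EIO` is an assumption made for illustration; the description above only says that a failure is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock line state; a real driver would sample the GPIOs instead. */
struct mock_lines {
	bool scl;
	bool sda;
	int transfers; /* counts transfers that were actually started */
};

/* The bus is considered idle only when both lines read high. */
static bool bus_lines_idle(const struct mock_lines *l)
{
	return l->scl && l->sda;
}

/* Refuses to start a transfer unless the bus is idle. */
static int guarded_xfer(struct mock_lines *l)
{
	assert(l != NULL);
	if (!bus_lines_idle(l))
		return -EIO; /* assumed error code for a shorted or stuck line */
	l->transfers++;
	return 0;
}
```

With SDA shorted to GND, the transfer is rejected up front instead of returning all-zero data that looks like a successful read.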

  6. Problems encountered by i2c-mux-pca954x drivers
    Code Path: drivers/i2c/muxes/i2c-mux-pca954x.c
    Issue 1: By default, idle_state is MUX_IDLE_AS_IS (that is, after a device behind the PCA9548 is accessed, the mux keeps its current state and does not close the channel). The default idle_state can only be changed through the device tree, and x86 products do not support device trees. For now, we modify the driver code directly and change the default idle_state to MUX_IDLE_DISCONNECT.
    Issue 2: We added PCA9548 reset logic to the channel-close path. If closing a PCA9548 channel fails, the PCA9548 is reset (after a reset, all channels are closed) to isolate the faulty device.
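
The two PCA9548 changes, closing the channel after each access and falling back to a reset when the close fails, can be sketched together against a mock mux. All names here (`mock_mux`, `mux_select()`, `mux_deselect()`, `mux_hw_reset()`) are hypothetical, and the fault model is an assumption: while a channel in `faulty_mask` is open, SCL is held low and every control-register write fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_mux {
	uint8_t ctrl;        /* channel-enable bitmask, one bit per channel */
	uint8_t faulty_mask; /* channels whose downstream SCL is shorted to GND */
	int resets;          /* number of hardware resets performed */
};

/* Writes the control register; fails while a faulty channel is open. */
static int mux_write_ctrl(struct mock_mux *m, uint8_t val)
{
	if (m->ctrl & m->faulty_mask)
		return -1; /* bus is held low, the write cannot complete */
	m->ctrl = val;
	return 0;
}

/* Stand-in for asserting the reset line: all channels are closed afterwards. */
static void mux_hw_reset(struct mock_mux *m)
{
	m->ctrl = 0;
	m->resets++;
}

static int mux_select(struct mock_mux *m, unsigned int chan)
{
	assert(m != NULL && chan < 8);
	return mux_write_ctrl(m, (uint8_t)(1u << chan));
}

/* Disconnect-on-idle behavior with a reset fallback when the close fails. */
static void mux_deselect(struct mock_mux *m)
{
	assert(m != NULL);
	if (mux_write_ctrl(m, 0) != 0)
		mux_hw_reset(m);
}
```

After the reset, the faulty channel is closed again, so the other channels behind the same controller remain reachable.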

Are there any better solutions or suggestions from the kernel community regarding these issues? Thanks.

@philo-micas
Contributor Author

To avoid making unauthorized modifications to the native kernel code that could affect compatibility for everyone, we have made the changes under our own platform. You can review the code-level solutions in our PR (https://github.com/sonic-net/sonic-buildimage/pull/15888). If the kernel community has better solutions, we are happy to discuss them and welcome any suggestions.

@paulmenzel
Contributor

Thank you so much for reporting these issues and even supplying solutions. Could you please create one issue per problem?

@philo-micas
Contributor Author

philo-micas commented Oct 11, 2023

Thank you so much for reporting these issues and even supplying solutions. Could you please create one issue per problem?

Done.

@philo-micas
Contributor Author

These issues are now tracked separately in #343, #344, #345, #346, #347, and #348.
